Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-18 Thread Michel Lespinasse
On Thu, Apr 15, 2021 at 01:13:13AM -0600, Yu Zhao wrote:
> Page table scanning doesn't replace the existing rmap walk. It is
> complementary and only happens when it is likely that most of the
> pages on a system under pressure have been referenced, i.e., out of
> *inactive* pages, by definition of the existing implementation. Under
> such a condition, scanning *active* pages one by one with the rmap is
> likely to cost more than scanning them all at once via page tables.
> When we evict *inactive* pages, we still use the rmap and share a
> common path with the existing code.
> 
> Page table scanning falls back to the rmap walk if the page tables of
> a process are apparently sparse, i.e., rss < size of the page tables.
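
If I read that correctly, the check would amount to something like the
following sketch (one plausible reading only, using the generic mm
accessors; the exact threshold in your series may well differ):

	/* Sketch: is this mm dense enough to be worth a page table scan ? */
	static bool worth_page_table_scan(struct mm_struct *mm)
	{
		unsigned long rss = get_mm_rss(mm);
		unsigned long pgtable_pages = mm_pgtables_bytes(mm) >> PAGE_SHIFT;

		/* "rss < size of the page tables" means sparse: use rmap instead */
		return rss >= pgtable_pages;
	}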

Could you expand a bit more as to how page table scanning and rmap
scanning coexist ? Say, there is some memory pressure and you want to
identify good candidate pages to reclaim. You could scan processes with
the page table scanning method, or you could scan the lru list through
the rmap method. How do you mix the two - when you use the lru/rmap
method, won't you encounter both pages that are mapped in "dense"
processes where scanning page tables would have been better, and pages
that are mapped in "sparse" processes where you are happy to be using
rmap, and even pages that are mapped into both types of processes at
once ?  Or, can you change the lru/rmap scan so that it will efficiently
skip over all dense processes when you use it ?

Thanks,

--
Michel "walken" Lespinasse


Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

2021-04-08 Thread Michel Lespinasse
On Thu, Apr 08, 2021 at 08:13:43AM +0100, Matthew Wilcox wrote:
> On Thu, Apr 08, 2021 at 09:00:26AM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> > > Doing I/O without any lock held already works; it just uses the file
> > > refcount.  It would be better to use a vma refcount, as I already said.
> > 
> > The original workload that I developed SPF for (way back when) was
> > prefaulting a single huge vma. Using a vma refcount was a total loss
> > because it resulted in the same cacheline contention that down_read()
> > was having.
> > 
> > As such, I'm always incredibly sad to see mention of vma refcounts.
> > They're fundamentally not solving the problem :/
> 
> OK, let me outline my locking scheme because I think it's rather better
> than Michel's.  The vma refcount is the slow path.
> 
> 1. take the RCU read lock
> 2. walk the pgd/p4d/pud/pmd
> 3. allocate page tables if necessary.  *handwave GFP flags*.
> 4. walk the vma tree
> 5. call ->map_pages
> 6. take ptlock
> 7. insert page(s)
> 8. drop ptlock
> if this all worked out, we're done, drop the RCU read lock and return.
> 9. increment vma refcount
> 10. drop RCU read lock
> 11. call ->fault
> 12. decrement vma refcount

Note that most of your proposed steps seem similar in principle to mine.
Looking at the fast path (steps 1-8):
- step 2 sounds like the speculative part of __handle_mm_fault()
- (step 3 not included in my proposal)
- step 4 is basically the lookup I currently have in the arch fault handler
- step 6 sounds like the speculative part of map_pte_lock()

I have working implementations for each step, while your proposal
summarizes each as a point item. It's not clear to me what to make of it;
presumably you would be "filling in the blanks" in a different way
than I have but you are not explaining how. Are you suggesting that
the precautions taken in each step to avoid races with mmap writers
would not be necessary in your proposal ? If that is the case, what
alternative mechanism would you use to handle such races ?

Going back to the source of this: you suggested not copying the VMA, so
what is your proposed alternative ? Do you suggest that fault handlers
should deal with the vma potentially mutating under them ? Or should
mmap writers consider vmas as immutable and copy them whenever they
want to change them ? Or are you implying a locking mechanism that would
prevent mmap writers from executing while the fault is running ?

> Compared to today, where we bump the refcount on the file underlying the
> vma, this is _better_ scalability -- different mappings of the same file
> will not contend on the file's refcount.
> 
> I suspect your huge VMA was anon, and that wouldn't need a vma refcount
> as faulting in new pages doesn't need to do I/O, just drop the RCU
> lock, allocate and retry.


Re: [RFC PATCH 34/37] mm: rcu safe vma freeing only for multithreaded user space

2021-04-08 Thread Michel Lespinasse
On Wed, Apr 07, 2021 at 03:50:06AM +0100, Matthew Wilcox wrote:
> On Tue, Apr 06, 2021 at 06:44:59PM -0700, Michel Lespinasse wrote:
> > Performance tuning: as single threaded userspace does not use
> > speculative page faults, it does not require rcu safe vma freeing.
> > Turn this off to avoid the related (small) extra overheads.
> > 
> > For multi threaded userspace, we often see a performance benefit from
> > the rcu safe vma freeing - even in tests that do not have any frequent
> > concurrent page faults ! This is because rcu safe vma freeing prevents
> > recently released vmas from being immediately reused in a new thread.
> 
> Why does that provide a performance benefit?  Recently released
> VMAs are cache-hot, and NUMA-local.  I'd expect the RCU delay to be
> performance-negative.

I only have the observation and no full explanation for it.
Just try it on wis-mmap and wis-malloc threaded cases. Of course this
all washes away when dealing with more realistic macro benchmarks.


Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

2021-04-07 Thread Michel Lespinasse
On Wed, Apr 07, 2021 at 04:40:34PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> > In the speculative case, call the vm_ops->fault() method from within
> > an rcu read locked section, and verify the mmap sequence lock at the
> > start of the section. A match guarantees that the original vma is still
> > valid at that time, and that the associated vma->vm_file stays valid
> > while the vm_ops->fault() method is running.
> > 
> > Note that this implies that speculative faults can not sleep within
> > the vm_ops->fault method. We will only attempt to fetch existing pages
> > from the page cache during speculative faults; any miss (or prefetch)
> > will be handled by falling back to non-speculative fault handling.
> > 
> > The speculative handling case also does not preallocate page tables,
> > as it is always called with a pre-existing page table.
> 
> So what's wrong with SRCU ? Laurent mumbled something about frequent
> SRCU kthread activity being a problem; is that still so and is that
> fundamentally unfixable?
> 
> Because to me it seems a much more natural solution to the whole thing.

The short answer is that I did not try SRCU. My thought process was:
the page cache already uses an RCU read lock, so I just need to expand
its scope a little.

Using SRCU might allow us to hit disk during speculative faults; OTOH
we may need to switch to a more robust validation mechanism than the
global counter to reap any benefits.
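
For comparison, an SRCU read section would allow blocking inside it,
which a plain rcu_read_lock() section never can. A minimal sketch, with
mm_srcu being a purely hypothetical srcu_struct (not something defined
in this patchset):

	int idx;

	idx = srcu_read_lock(&mm_srcu);
	/* ->fault() could block here, e.g. waiting for page cache I/O */
	srcu_read_unlock(&mm_srcu, idx);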


Re: [RFC PATCH 09/37] mm: add per-mm mmap sequence counter for speculative page fault handling.

2021-04-07 Thread Michel Lespinasse
On Wed, Apr 07, 2021 at 04:47:34PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:34PM -0700, Michel Lespinasse wrote:
> > The counter's write side is hooked into the existing mmap locking API:
> > mmap_write_lock() increments the counter to the next (odd) value, and
> > mmap_write_unlock() increments it again to the next (even) value.
> > 
> > The counter's speculative read side is supposed to be used as follows:
> > 
> > seq = mmap_seq_read_start(mm);
> > if (seq & 1)
> > goto fail;
> >  speculative handling here 
> > if (!mmap_seq_read_check(mm, seq))
> > goto fail;
> > 
> > This API guarantees that, if none of the "fail" tests abort
> > speculative execution, the speculative code section did not run
> > concurrently with any mmap writer.
> 
> So this is obviously safe, but it's also super excessive. Any change,
> anywhere, will invalidate and abort a SPF.
> 
> Since you make a complete copy of the vma, you could memcmp it in its
> entirety instead of this.

Yeah, there is a deliberate choice here to start with the simplest
possible approach, but this could lead to more SPF aborts than
strictly necessary.

It's not clear to me that just comparing original vs current vma
attributes would always be sufficient - I think in some cases, we may
also need to worry about attributes being changed back and forth
concurrently with the fault. However, the comparison you suggest would
be safe at least in the case where the original pte was pte_none.

At some point, I did try implementing the optimization you suggest -
if the global counter has changed, re-fetch the current vma and
compare it against the old - but this was basically no help for the
(Android) workloads I looked at. There were just too few aborts that
were caused by the global counter in the first place. Note that my
patchset only handles anon and page cache faults speculatively, so
generally there wasn't very much time for the counter to change.

But yes, this may not hold across all workloads, and maybe we'd want
to do something smarter once/if a problem workload is identified.
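
For concreteness, such a re-check would be along the lines of the sketch
below - reusing the on-stack pvma copy from the x86 fault handler, and
ignoring the back-and-forth attribute concern mentioned above; this is
not what the posted patchset does:

	if (!mmap_seq_read_check(mm, seq)) {
		struct vm_area_struct *cur;
		bool unchanged;

		/* A writer ran; re-fetch the vma and compare to our copy. */
		rcu_read_lock();
		cur = find_vma(mm, address);
		unchanged = cur && !memcmp(cur, &pvma, sizeof(pvma));
		rcu_read_unlock();
		if (!unchanged)
			goto spf_abort;
	}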


Re: [RFC PATCH 11/37] x86/mm: attempt speculative mm faults first

2021-04-07 Thread Michel Lespinasse
On Wed, Apr 07, 2021 at 04:35:28PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 07, 2021 at 04:48:44PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> > >   struct mm_struct *mm;
> > >   vm_fault_t fault;
> > >   unsigned int flags = FAULT_FLAG_DEFAULT;
> > > + struct vm_area_struct pvma;
> > 
> > That's 200 bytes on-stack... I suppose that's just about acceptable, but
> > perhaps we need a comment in struct vm_area_struct to make people aware
> > this thing lives on-stack and size really is an issue now.
> 
> Michel's gone off and done his own thing here.

I don't think that is an entirely fair representation. First, we are
both aware of each other's work; there is no working in dark caves here.
But also, I don't even consider this patchset to be entirely my thing;
most of the main building blocks come from prior proposals before mine.

> The rest of us (Laurent, Liam & I) are working on top of the maple tree
> which shrinks vm_area_struct by five pointers, so just 160 bytes.

The idea of evaluating maple tree and speculative faults as a bundle
is actually worrying to me. I think both ideas are interesting and
worth looking into on their own, but I'm not convinced that they have
much to do with each other.

> Also, our approach doesn't involve copying VMAs in order to handle a fault.

See my other reply to Peter's message - copying VMAs is a convenient
way to reduce the size of the patchset, but it's not fundamental to
the approach at all.


Re: [RFC PATCH 11/37] x86/mm: attempt speculative mm faults first

2021-04-07 Thread Michel Lespinasse
On Wed, Apr 07, 2021 at 01:14:53PM -0700, Michel Lespinasse wrote:
> On Wed, Apr 07, 2021 at 04:48:44PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> > >   struct mm_struct *mm;
> > >   vm_fault_t fault;
> > >   unsigned int flags = FAULT_FLAG_DEFAULT;
> > > + struct vm_area_struct pvma;
> > 
> > That's 200 bytes on-stack... I suppose that's just about acceptable, but
> > perhaps we need a comment in struct vm_area_struct to make people aware
> > this thing lives on-stack and size really is an issue now.
> 
> Right, I agree that having the vma copy on-stack is not ideal.
> 
> I think what really should be done, is to copy just the attributes of
> the vma that will be needed during the page fault. Things like vm_mm,
> vm_page_prot, vm_flags, vm_ops, vm_pgoff, vm_file, vm_private_data,
> vm_policy. We definitely do not need rbtree and rmap fields such as
> vm_prev, vm_next, vm_rb, rb_subtree_gap, shared, anon_vma_chain etc...
> 
> The reason I did things this way, is because changing the entire fault
> handler to use attributes stored in struct vm_fault, rather than in
> the original vma, would be quite intrusive. I think it would be a
> reasonable step to consider once there is agreement on the rest of the
> speculative fault patch set, but it's practical doing it before then.

I meant it's NOT practical using attributes rather than a vma copy
until there is sufficient agreement to merge the patchset.


Re: [RFC PATCH 11/37] x86/mm: attempt speculative mm faults first

2021-04-07 Thread Michel Lespinasse
On Wed, Apr 07, 2021 at 04:48:44PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> > struct mm_struct *mm;
> > vm_fault_t fault;
> > unsigned int flags = FAULT_FLAG_DEFAULT;
> > +   struct vm_area_struct pvma;
> 
> That's 200 bytes on-stack... I suppose that's just about acceptable, but
> perhaps we need a comment in struct vm_area_struct to make people aware
> this thing lives on-stack and size really is an issue now.

Right, I agree that having the vma copy on-stack is not ideal.

I think what really should be done, is to copy just the attributes of
the vma that will be needed during the page fault. Things like vm_mm,
vm_page_prot, vm_flags, vm_ops, vm_pgoff, vm_file, vm_private_data,
vm_policy. We definitely do not need rbtree and rmap fields such as
vm_prev, vm_next, vm_rb, rb_subtree_gap, shared, anon_vma_chain etc...
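
Purely as an illustration - the struct below does not exist in the
patchset and the name is made up - such a reduced copy could look like:

	struct spf_vma_attrs {
		struct mm_struct *vm_mm;
		unsigned long vm_flags;
		pgprot_t vm_page_prot;
		const struct vm_operations_struct *vm_ops;
		unsigned long vm_pgoff;
		struct file *vm_file;
		void *vm_private_data;
	#ifdef CONFIG_NUMA
		struct mempolicy *vm_policy;
	#endif
	};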

The reason I did things this way, is because changing the entire fault
handler to use attributes stored in struct vm_fault, rather than in
the original vma, would be quite intrusive. I think it would be a
reasonable step to consider once there is agreement on the rest of the
speculative fault patch set, but it's practical doing it before then.


Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

2021-04-06 Thread Michel Lespinasse
On Wed, Apr 07, 2021 at 03:35:27AM +0100, Matthew Wilcox wrote:
> On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> > In the speculative case, call the vm_ops->fault() method from within
> > an rcu read locked section, and verify the mmap sequence lock at the
> > start of the section. A match guarantees that the original vma is still
> > valid at that time, and that the associated vma->vm_file stays valid
> > while the vm_ops->fault() method is running.
> > 
> > Note that this implies that speculative faults can not sleep within
> > the vm_ops->fault method. We will only attempt to fetch existing pages
> > from the page cache during speculative faults; any miss (or prefetch)
> > will be handled by falling back to non-speculative fault handling.
> > 
> > The speculative handling case also does not preallocate page tables,
> > as it is always called with a pre-existing page table.
> 
> I still don't understand why you want to do this.  The speculative
> fault that doesn't do I/O is already here, and it's called ->map_pages
> (which I see you also do later).  So what's the point of this patch?

I have to admit I did not give much thought to which path would be
generally most common here.

The speculative vm_ops->fault path would be used:
- for private mapping write faults,
- when fault-around is disabled (probably an uncommon case in general,
  but actually common at Google).

That said, I do think your point makes sense in general, especially if
this could help avoid the per-filesystem enable bit.


Re: [RFC PATCH 03/37] do_anonymous_page: use update_mmu_tlb()

2021-04-06 Thread Michel Lespinasse
Hi Bibo,

You introduced this code in commit 7df676974359f back in May.
Could you check that the change is correct ?

Thanks,

On Tue, Apr 06, 2021 at 06:44:28PM -0700, Michel Lespinasse wrote:
> update_mmu_tlb() can be used instead of update_mmu_cache() when the
> page fault handler detects that it lost the race to another page fault.
> 
> Signed-off-by: Michel Lespinasse 
> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5efa07fb6cdc..8ee4bd239303 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3567,7 +3567,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault 
> *vmf)
>   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>   &vmf->ptl);
>   if (!pte_none(*vmf->pte)) {
> - update_mmu_cache(vma, vmf->address, vmf->pte);
> + update_mmu_tlb(vma, vmf->address, vmf->pte);
>   goto release;
>   }
>  
> -- 
> 2.20.1
> 


[RFC PATCH 28/37] mm: implement speculative handling in filemap_map_pages()

2021-04-06 Thread Michel Lespinasse
In the speculative case, we know the page table already exists, and it
must be locked with pte_map_lock(). In the case where no page is found
for the given address, return VM_FAULT_RETRY which will abort the
fault before we get into the vm_ops->fault() callback. This is fine
because if filemap_map_pages does not find the page in page cache,
vm_ops->fault() will not either.

Initialize addr and last_pgoff to correspond to the pte at the original
fault address (which was mapped with pte_map_lock()), rather than the
pte at start_pgoff. The choice of initial values doesn't matter as
they will all be adjusted together before use, so they just need to be
consistent with each other, and using the original fault address and
pte allows us to reuse pte_map_lock() without any changes to it.

Signed-off-by: Michel Lespinasse 
---
 mm/filemap.c | 27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6e8505fe5df9..d496771749e6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3136,25 +3136,31 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
struct vm_area_struct *vma = vmf->vma;
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
-   pgoff_t last_pgoff = start_pgoff;
+   pgoff_t last_pgoff;
unsigned long addr;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct page *head, *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
-   vm_fault_t ret = 0;
+   vm_fault_t ret = (vmf->flags & FAULT_FLAG_SPECULATIVE) ?
+   VM_FAULT_RETRY : 0;
 
-   rcu_read_lock();
+   /* filemap_map_pages() is called within an rcu read lock already. */
head = first_map_page(mapping, &xas, end_pgoff);
if (!head)
-   goto out;
+   return ret;
 
-   if (filemap_map_pmd(vmf, head)) {
-   ret = VM_FAULT_NOPAGE;
-   goto out;
+   if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
+   filemap_map_pmd(vmf, head))
+   return VM_FAULT_NOPAGE;
+
+   if (!pte_map_lock(vmf)) {
+   unlock_page(head);
+   put_page(head);
+   return VM_FAULT_RETRY;
}
+   addr = vmf->address;
+   last_pgoff = vmf->pgoff;
 
-   addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
do {
page = find_subpage(head, xas.xa_index);
if (PageHWPoison(page))
@@ -3184,8 +3190,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
put_page(head);
} while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
pte_unmap_unlock(vmf->pte, vmf->ptl);
-out:
-   rcu_read_unlock();
+   vmf->pte = NULL;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
return ret;
 }
-- 
2.20.1



[RFC PATCH 11/37] x86/mm: attempt speculative mm faults first

2021-04-06 Thread Michel Lespinasse
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.

The speculative handling closely mirrors the non-speculative logic.
This includes some x86 specific bits such as the access_error() call.
This is why we chose to implement the speculative handling in arch/x86
rather than in common code.

The vma is first looked up and copied, under protection of the rcu
read lock. The mmap lock sequence count is used to verify the
integrity of the copied vma, and passed to do_handle_mm_fault() to
allow checking against races with mmap writers when finalizing the fault.

Signed-off-by: Michel Lespinasse 
---
 arch/x86/mm/fault.c   | 36 +++
 include/linux/vm_event_item.h |  4 
 mm/vmstat.c   |  4 
 3 files changed, 44 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a73347e2cdfc..f8c8e325af77 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
struct mm_struct *mm;
vm_fault_t fault;
unsigned int flags = FAULT_FLAG_DEFAULT;
+   struct vm_area_struct pvma;
+   unsigned long seq;
 
tsk = current;
mm = tsk->mm;
@@ -1316,6 +1318,39 @@ void do_user_addr_fault(struct pt_regs *regs,
}
 #endif
 
+   count_vm_event(SPF_ATTEMPT);
+   seq = mmap_seq_read_start(mm);
+   if (seq & 1)
+   goto spf_abort;
+   rcu_read_lock();
+   vma = find_vma(mm, address);
+   if (!vma || vma->vm_start > address) {
+   rcu_read_unlock();
+   goto spf_abort;
+   }
+   pvma = *vma;
+   rcu_read_unlock();
+   if (!mmap_seq_read_check(mm, seq))
+   goto spf_abort;
+   vma = &pvma;
+   if (unlikely(access_error(error_code, vma)))
+   goto spf_abort;
+   fault = do_handle_mm_fault(vma, address,
+  flags | FAULT_FLAG_SPECULATIVE, seq, regs);
+
+   /* Quick path to respond to signals */
+   if (fault_signal_pending(fault, regs)) {
+   if (!user_mode(regs))
+   kernelmode_fixup_or_oops(regs, error_code, address,
+SIGBUS, BUS_ADRERR);
+   return;
+   }
+   if (!(fault & VM_FAULT_RETRY))
+   goto done;
+
+spf_abort:
+   count_vm_event(SPF_ABORT);
+
/*
 * Kernel-mode access to the user address space should only occur
 * on well-defined single instructions listed in the exception
@@ -1412,6 +1447,7 @@ void do_user_addr_fault(struct pt_regs *regs,
}
 
mmap_read_unlock(mm);
+done:
if (likely(!(fault & VM_FAULT_ERROR)))
return;
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..cc4f8d14e43f 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
SWAP_RA,
SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   SPF_ATTEMPT,
+   SPF_ABORT,
 #endif
NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 74b2c374b86c..9ae1c27a549e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1365,6 +1365,10 @@ const char * const vmstat_text[] = {
"swap_ra",
"swap_ra_hit",
 #endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   "spf_attempt",
+   "spf_abort",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.20.1



[RFC PATCH 26/37] mm: implement speculative fault handling in finish_fault()

2021-04-06 Thread Michel Lespinasse
In the speculative case, we want to avoid direct pmd checks (which
would require some extra synchronization to be safe), and rely on
pte_map_lock which will both lock the page table and verify that the
pmd has not changed from its initial value.

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7139004c624d..13e2aaf900e5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3915,23 +3915,25 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return ret;
}
 
-   if (pmd_none(*vmf->pmd)) {
-   if (PageTransCompound(page)) {
-   ret = do_set_pmd(vmf, page);
-   if (ret != VM_FAULT_FALLBACK)
-   return ret;
+   if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+   if (pmd_none(*vmf->pmd)) {
+   if (PageTransCompound(page)) {
+   ret = do_set_pmd(vmf, page);
+   if (ret != VM_FAULT_FALLBACK)
+   return ret;
+   }
+
+   if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+   return VM_FAULT_OOM;
}
 
-   if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
-   return VM_FAULT_OOM;
+   /* See comment in __handle_mm_fault() */
+   if (pmd_devmap_trans_unstable(vmf->pmd))
+   return 0;
}
 
-   /* See comment in __handle_mm_fault() */
-   if (pmd_devmap_trans_unstable(vmf->pmd))
-   return 0;
-
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+   if (!pte_map_lock(vmf))
+   return VM_FAULT_RETRY;
ret = 0;
/* Re-check under ptl */
if (likely(pte_none(*vmf->pte)))
-- 
2.20.1



[RFC PATCH 09/37] mm: add per-mm mmap sequence counter for speculative page fault handling.

2021-04-06 Thread Michel Lespinasse
The counter's write side is hooked into the existing mmap locking API:
mmap_write_lock() increments the counter to the next (odd) value, and
mmap_write_unlock() increments it again to the next (even) value.

The counter's speculative read side is supposed to be used as follows:

seq = mmap_seq_read_start(mm);
if (seq & 1)
goto fail;
 speculative handling here 
if (!mmap_seq_read_check(mm, seq))
goto fail;

This API guarantees that, if none of the "fail" tests abort
speculative execution, the speculative code section did not run
concurrently with any mmap writer.

This is very similar to a seqlock, but both the writer and speculative
readers are allowed to block. In the fail case, the speculative reader
does not spin on the sequence counter; instead it should fall back to
a different mechanism such as grabbing the mmap lock read side.
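
For illustration, the fail/fallback pattern at a call site would look
roughly like the sketch below; in this series the fallback is simply the
existing fault path under mmap_read_lock():

	seq = mmap_seq_read_start(mm);
	if (seq & 1)
		goto fallback;
	/* ... speculative handling ... */
	if (!mmap_seq_read_check(mm, seq))
		goto fallback;
	return;			/* speculative path succeeded */

fallback:
	mmap_read_lock(mm);
	/* ... regular, non-speculative handling ... */
	mmap_read_unlock(mm);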

Signed-off-by: Michel Lespinasse 
---
 include/linux/mm_types.h  |  4 +++
 include/linux/mmap_lock.h | 58 +--
 2 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..70882e628908 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -461,6 +461,10 @@ struct mm_struct {
 * counters
 */
struct rw_semaphore mmap_lock;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   unsigned long mmap_seq;
+#endif
+
 
struct list_head mmlist; /* List of maybe swapped mm's. These
  * are globally strung together off
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 8ff276a7560e..8f4eca2d0f43 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -8,8 +8,16 @@
 #include 
 #include 
 
-#define MMAP_LOCK_INITIALIZER(name) \
-   .mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+#define MMAP_LOCK_SEQ_INITIALIZER(name) \
+   .mmap_seq = 0,
+#else
+#define MMAP_LOCK_SEQ_INITIALIZER(name)
+#endif
+
+#define MMAP_LOCK_INITIALIZER(name)\
+   .mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock), \
+   MMAP_LOCK_SEQ_INITIALIZER(name)
 
 DECLARE_TRACEPOINT(mmap_lock_start_locking);
 DECLARE_TRACEPOINT(mmap_lock_acquire_returned);
@@ -63,13 +71,52 @@ static inline void __mmap_lock_trace_released(struct 
mm_struct *mm, bool write)
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
init_rwsem(&mm->mmap_lock);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   mm->mmap_seq = 0;
+#endif
 }
 
+static inline void __mmap_seq_write_lock(struct mm_struct *mm)
+{
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   VM_BUG_ON_MM(mm->mmap_seq & 1, mm);
+   mm->mmap_seq++;
+   smp_wmb();
+#endif
+}
+
+static inline void __mmap_seq_write_unlock(struct mm_struct *mm)
+{
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   smp_wmb();
+   mm->mmap_seq++;
+   VM_BUG_ON_MM(mm->mmap_seq & 1, mm);
+#endif
+}
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline unsigned long mmap_seq_read_start(struct mm_struct *mm)
+{
+   unsigned long seq;
+
+   seq = READ_ONCE(mm->mmap_seq);
+   smp_rmb();
+   return seq;
+}
+
+static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned long seq)
+{
+   smp_rmb();
+   return seq == READ_ONCE(mm->mmap_seq);
+}
+#endif
+
 static inline void mmap_write_lock(struct mm_struct *mm)
 {
__mmap_lock_trace_start_locking(mm, true);
down_write(&mm->mmap_lock);
__mmap_lock_trace_acquire_returned(mm, true, true);
+   __mmap_seq_write_lock(mm);
 }
 
 static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
@@ -77,6 +124,7 @@ static inline void mmap_write_lock_nested(struct mm_struct 
*mm, int subclass)
__mmap_lock_trace_start_locking(mm, true);
down_write_nested(&mm->mmap_lock, subclass);
__mmap_lock_trace_acquire_returned(mm, true, true);
+   __mmap_seq_write_lock(mm);
 }
 
 static inline int mmap_write_lock_killable(struct mm_struct *mm)
@@ -86,6 +134,8 @@ static inline int mmap_write_lock_killable(struct mm_struct 
*mm)
__mmap_lock_trace_start_locking(mm, true);
error = down_write_killable(&mm->mmap_lock);
__mmap_lock_trace_acquire_returned(mm, true, !error);
+   if (likely(!error))
+   __mmap_seq_write_lock(mm);
return error;
 }
 
@@ -96,17 +146,21 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
__mmap_lock_trace_start_locking(mm, true);
ok = down_write_trylock(&mm->mmap_lock) != 0;
__mmap_lock_trace_acquire_returned(mm, true, ok);
+   if (likely(ok))
+   __mmap_seq_write_lock(mm);
return ok;
 }
 
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
+ 

[RFC PATCH 10/37] mm: rcu safe vma freeing

2021-04-06 Thread Michel Lespinasse
This prepares for speculative page faults looking up and copying vmas
under protection of an rcu read lock, instead of the usual mmap read lock.

Signed-off-by: Michel Lespinasse 
---
 include/linux/mm_types.h | 16 +++-
 kernel/fork.c| 11 ++-
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 70882e628908..024970635921 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -304,12 +304,18 @@ struct vm_userfaultfd_ctx {};
 struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */
 
-   unsigned long vm_start; /* Our start address within vm_mm. */
-   unsigned long vm_end;   /* The first byte after our end address
-  within vm_mm. */
+   union {
+   struct {
+   /* VMA covers [vm_start; vm_end) addresses within mm */
+   unsigned long vm_start, vm_end;
 
-   /* linked list of VM areas per task, sorted by address */
-   struct vm_area_struct *vm_next, *vm_prev;
+   /* linked list of VMAs per task, sorted by address */
+   struct vm_area_struct *vm_next, *vm_prev;
+   };
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   struct rcu_head vm_rcu; /* Used for deferred freeing. */
+#endif
+   };
 
struct rb_node vm_rb;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 426cd0c51f9e..b6078e546114 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -369,11 +369,20 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
*orig)
return new;
 }
 
-void vm_area_free(struct vm_area_struct *vma)
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static void __vm_area_free(struct rcu_head *head)
 {
+   struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+ vm_rcu);
kmem_cache_free(vm_area_cachep, vma);
 }
 
+void vm_area_free(struct vm_area_struct *vma)
+{
+   call_rcu(&vma->vm_rcu, __vm_area_free);
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
void *stack = task_stack_page(tsk);
-- 
2.20.1



[RFC PATCH 12/37] mm: refactor __handle_mm_fault() / handle_pte_fault()

2021-04-06 Thread Michel Lespinasse
Move the code that initializes vmf->pte and vmf->orig_pte from
handle_pte_fault() to its single call site in __handle_mm_fault().

This ensures vmf->pte is now initialized together with the higher levels
of the page table hierarchy. This also prepares for speculative page fault
handling, where the entire page table walk (higher levels down to ptes)
needs special care in the speculative case.

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 98 ++---
 1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3691be1f1319..66e7a4554c54 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3516,7 +3516,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;
 
-   /* See comment in handle_pte_fault() */
+   /* See comment in __handle_mm_fault() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;
 
@@ -3797,7 +3797,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}
 
-   /* See comment in handle_pte_fault() */
+   /* See comment in __handle_mm_fault() */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
 
@@ -4253,53 +4253,6 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 {
pte_t entry;
 
-   if (unlikely(pmd_none(*vmf->pmd))) {
-   /*
-* Leave __pte_alloc() until later: because vm_ops->fault may
-* want to allocate huge page, and if we expose page table
-* for an instant, it will be difficult to retract from
-* concurrent faults and from rmap lookups.
-*/
-   vmf->pte = NULL;
-   } else {
-   /*
-* If a huge pmd materialized under us just retry later.  Use
-* pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
-* of pmd_trans_huge() to ensure the pmd didn't become
-* pmd_trans_huge under us and then back to pmd_none, as a
-* result of MADV_DONTNEED running immediately after a huge pmd
-* fault in a different thread of this mm, in turn leading to a
-* misleading pmd_trans_huge() retval. All we have to ensure is
-* that it is a regular pmd that we can walk with
-* pte_offset_map() and we can do that through an atomic read
-* in C, which is what pmd_trans_unstable() provides.
-*/
-   if (pmd_devmap_trans_unstable(vmf->pmd))
-   return 0;
-   /*
-* A regular pmd is established and it can't morph into a huge
-* pmd from under us anymore at this point because we hold the
-* mmap_lock read mode and khugepaged takes it in write mode.
-* So now it's safe to run pte_offset_map().
-*/
-   vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
-   vmf->orig_pte = *vmf->pte;
-
-   /*
-* some architectures can have larger ptes than wordsize,
-* e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
-* CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
-* accesses.  The code below just needs a consistent view
-* for the ifs and we later double check anyway with the
-* ptl lock held. So here a barrier will do.
-*/
-   barrier();
-   if (pte_none(vmf->orig_pte)) {
-   pte_unmap(vmf->pte);
-   vmf->pte = NULL;
-   }
-   }
-
if (!vmf->pte) {
if (vma_is_anonymous(vmf->vma))
return do_anonymous_page(vmf);
@@ -4439,6 +4392,53 @@ static vm_fault_t __handle_mm_fault(struct 
vm_area_struct *vma,
}
}
 
+   if (unlikely(pmd_none(*vmf.pmd))) {
+   /*
+* Leave __pte_alloc() until later: because vm_ops->fault may
+* want to allocate huge page, and if we expose page table
+* for an instant, it will be difficult to retract from
+* concurrent faults and from rmap lookups.
+*/
+   vmf.pte = NULL;
+   } else {
+   /*
+* If a huge pmd materialized under us just retry later.  Use
+* pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+* of pmd_trans_huge() to ensure the pmd didn't become
+* pmd_trans_huge under us and then back to pmd_none, as a
+* result of MADV_DONTNEED running immediately after a huge pmd
+   

[RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

2021-04-06 Thread Michel Lespinasse
The page table tree is walked with local irqs disabled, which prevents
page table reclamation (similarly to what fast GUP does). The logic is
otherwise similar to the non-speculative path, but with additional
restrictions: in the speculative path, we do not handle huge pages or
wire in new page tables.

Signed-off-by: Michel Lespinasse 
---
 include/linux/mm.h |  4 +++
 mm/memory.c| 77 --
 2 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5988e78e6ab..dee8a4833779 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -525,6 +525,10 @@ struct vm_fault {
};
unsigned int flags; /* FAULT_FLAG_xxx flags
 * XXX: should really be 'const' */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   unsigned long seq;
+   pmd_t orig_pmd;
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
 * the 'address' */
pud_t *pud; /* Pointer to pud entry matching
diff --git a/mm/memory.c b/mm/memory.c
index 66e7a4554c54..a17704aac019 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4307,7 +4307,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
-   unsigned long address, unsigned int flags)
+   unsigned long address, unsigned int flags, unsigned long seq)
 {
struct vm_fault vmf = {
.vma = vma,
@@ -4322,6 +4322,79 @@ static vm_fault_t __handle_mm_fault(struct 
vm_area_struct *vma,
p4d_t *p4d;
vm_fault_t ret;
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (flags & FAULT_FLAG_SPECULATIVE) {
+   pgd_t pgdval;
+   p4d_t p4dval;
+   pud_t pudval;
+
+   vmf.seq = seq;
+
+   local_irq_disable();
+   pgd = pgd_offset(mm, address);
+   pgdval = READ_ONCE(*pgd);
+   if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
+   goto spf_fail;
+
+   p4d = p4d_offset(pgd, address);
+   p4dval = READ_ONCE(*p4d);
+   if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
+   goto spf_fail;
+
+   vmf.pud = pud_offset(p4d, address);
+   pudval = READ_ONCE(*vmf.pud);
+   if (pud_none(pudval) || unlikely(pud_bad(pudval)) ||
+   unlikely(pud_trans_huge(pudval)) ||
+   unlikely(pud_devmap(pudval)))
+   goto spf_fail;
+
+   vmf.pmd = pmd_offset(vmf.pud, address);
+   vmf.orig_pmd = READ_ONCE(*vmf.pmd);
+
+   /*
+* pmd_none could mean that a hugepage collapse is in
+* progress behind our back, as collapse_huge_page() marks
+* the pmd before invalidating the pte (which is done once
+* the IPI has been caught by all CPUs, with interrupts
+* disabled).  For this reason we cannot handle THP in a
+* speculative way, since we can't safely identify an
+* in-progress collapse operation done behind our back on
+* that PMD.
+*/
+   if (unlikely(pmd_none(vmf.orig_pmd) ||
+is_swap_pmd(vmf.orig_pmd) ||
+pmd_trans_huge(vmf.orig_pmd) ||
+pmd_devmap(vmf.orig_pmd)))
+   goto spf_fail;
+
+   /*
+* The above does not allocate/instantiate page-tables because
+* doing so would lead to the possibility of instantiating
+* page-tables after free_pgtables() -- and consequently
+* leaking them.
+*
+* The result is that we take at least one non-speculative
+* fault per PMD in order to instantiate it.
+*/
+
+   vmf.pte = pte_offset_map(vmf.pmd, address);
+   vmf.orig_pte = READ_ONCE(*vmf.pte);
+   barrier();
+   if (pte_none(vmf.orig_pte)) {
+   pte_unmap(vmf.pte);
+   vmf.pte = NULL;
+   }
+
+   local_irq_enable();
+
+   return handle_pte_fault(&vmf);
+
+spf_fail:
+   local_irq_enable();
+   return VM_FAULT_RETRY;
+   }
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
pgd = pgd_offset(mm, address);
p4d = p4d_alloc(mm, pgd, address);
if (!p4d)
@@ -4541,7 +4614,7 @@ vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
-   

[RFC PATCH 34/37] mm: rcu safe vma freeing only for multithreaded user space

2021-04-06 Thread Michel Lespinasse
Performance tuning: as single threaded userspace does not use
speculative page faults, it does not require rcu safe vma freeing.
Turn this off to avoid the related (small) extra overheads.

For multi threaded userspace, we often see a performance benefit from
the rcu safe vma freeing - even in tests that do not have any frequent
concurrent page faults ! This is because rcu safe vma freeing prevents
recently released vmas from being immediately reused in a new thread.

Signed-off-by: Michel Lespinasse 
---
 kernel/fork.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 2f20a5c5fed8..623875e8e742 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -389,10 +389,12 @@ static void __vm_area_free(struct rcu_head *head)
 void vm_area_free(struct vm_area_struct *vma)
 {
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
-   call_rcu(&vma->vm_rcu, __vm_area_free);
-#else
-   vm_area_free(vma);
+   if (atomic_read(&vma->vm_mm->mm_users) > 1) {
+   call_rcu(>vm_rcu, __vm_area_free);
+   return;
+   }
 #endif
+   vm_area_free(vma);
 }
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
-- 
2.20.1



[RFC PATCH 32/37] f2fs: implement speculative fault handling

2021-04-06 Thread Michel Lespinasse
We just need to make sure f2fs_filemap_fault() doesn't block in the
speculative case as it is called with an rcu read lock held.

Signed-off-by: Michel Lespinasse 
---
 fs/f2fs/file.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index d26ff2ae3f5e..c1cfdc3ec98e 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -37,7 +37,12 @@ static vm_fault_t f2fs_filemap_fault(struct vm_fault *vmf)
struct inode *inode = file_inode(vmf->vma->vm_file);
vm_fault_t ret;
 
-   down_read(&F2FS_I(inode)->i_mmap_sem);
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   if (!down_read_trylock(&F2FS_I(inode)->i_mmap_sem))
+   return VM_FAULT_RETRY;
+   } else {
+   down_read(&F2FS_I(inode)->i_mmap_sem);
+   }
ret = filemap_fault(vmf);
up_read(&F2FS_I(inode)->i_mmap_sem);
 
@@ -171,6 +176,7 @@ static const struct vm_operations_struct f2fs_file_vm_ops = 
{
.fault  = f2fs_filemap_fault,
.map_pages  = filemap_map_pages,
.page_mkwrite   = f2fs_vm_page_mkwrite,
+   .speculative= true,
 };
 
 static int get_parent_ino(struct inode *inode, nid_t *pino)
-- 
2.20.1



[RFC PATCH 36/37] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2021-04-06 Thread Michel Lespinasse
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT so that the speculative fault
handling code can be compiled on this architecture.

Signed-off-by: Michel Lespinasse 
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e4e1b6550115..6c2bbfe3fd38 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -208,6 +208,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
help
  ARM 64-bit (AArch64) Linux support.
 
-- 
2.20.1



[RFC PATCH 33/37] mm: enable speculative fault handling only for multithreaded user space

2021-04-06 Thread Michel Lespinasse
Performance tuning: single threaded userspace does not benefit from
speculative page faults, so we turn them off to avoid any related
(small) extra overheads.

Signed-off-by: Michel Lespinasse 
---
 arch/x86/mm/fault.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 48b86911a6df..b1a07ca82d59 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1318,6 +1318,10 @@ void do_user_addr_fault(struct pt_regs *regs,
}
 #endif
 
+   /* Only try spf for multithreaded user space faults. */
+   if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+   goto no_spf;
+
count_vm_event(SPF_ATTEMPT);
seq = mmap_seq_read_start(mm);
if (seq & 1)
@@ -1351,6 +1355,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 spf_abort:
count_vm_event(SPF_ABORT);
+no_spf:
 
/*
 * Kernel-mode access to the user address space should only occur
-- 
2.20.1



[RFC PATCH 01/37] mmap locking API: mmap_lock_is_contended returns a bool

2021-04-06 Thread Michel Lespinasse
Change mmap_lock_is_contended to return a bool value, rather than an
int which the callers are then supposed to interpret as a bool. This
is to ensure consistency with other mmap lock API functions (such as
the trylock functions).

Signed-off-by: Michel Lespinasse 
---
 include/linux/mmap_lock.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 0540f0156f58..4e27f755766b 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -171,9 +171,9 @@ static inline void mmap_assert_write_locked(struct 
mm_struct *mm)
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
 }
 
-static inline int mmap_lock_is_contended(struct mm_struct *mm)
+static inline bool mmap_lock_is_contended(struct mm_struct *mm)
 {
-   return rwsem_is_contended(&mm->mmap_lock);
+   return rwsem_is_contended(&mm->mmap_lock) != 0;
 }
 
 #endif /* _LINUX_MMAP_LOCK_H */
-- 
2.20.1



[RFC PATCH 35/37] mm: spf statistics

2021-04-06 Thread Michel Lespinasse
Add a new CONFIG_SPECULATIVE_PAGE_FAULT_STATS config option,
and dump extra statistics about executed spf cases and abort reasons
when the option is set.

Signed-off-by: Michel Lespinasse 
---
 arch/x86/mm/fault.c   | 19 +++---
 include/linux/mmap_lock.h | 19 +-
 include/linux/vm_event_item.h | 24 
 include/linux/vmstat.h|  6 +++
 mm/Kconfig.debug  |  7 
 mm/memory.c   | 71 ---
 mm/vmstat.c   | 24 
 7 files changed, 149 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b1a07ca82d59..e210bbcb8bc5 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1324,22 +1324,31 @@ void do_user_addr_fault(struct pt_regs *regs,
 
count_vm_event(SPF_ATTEMPT);
seq = mmap_seq_read_start(mm);
-   if (seq & 1)
+   if (seq & 1) {
+   count_vm_spf_event(SPF_ABORT_ODD);
goto spf_abort;
+   }
rcu_read_lock();
vma = find_vma(mm, address);
-   if (!vma || vma->vm_start > address ||
-   !vma_can_speculate(vma, flags)) {
+   if (!vma || vma->vm_start > address) {
rcu_read_unlock();
+   count_vm_spf_event(SPF_ABORT_UNMAPPED);
+   goto spf_abort;
+   }
+   if (!vma_can_speculate(vma, flags)) {
+   rcu_read_unlock();
+   count_vm_spf_event(SPF_ABORT_NO_SPECULATE);
goto spf_abort;
}
pvma = *vma;
rcu_read_unlock();
-   if (!mmap_seq_read_check(mm, seq))
+   if (!mmap_seq_read_check(mm, seq, SPF_ABORT_VMA_COPY))
goto spf_abort;
vma = &pvma;
-   if (unlikely(access_error(error_code, vma)))
+   if (unlikely(access_error(error_code, vma))) {
+   count_vm_spf_event(SPF_ABORT_ACCESS_ERROR);
goto spf_abort;
+   }
fault = do_handle_mm_fault(vma, address,
   flags | FAULT_FLAG_SPECULATIVE, seq, regs);
 
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 8f4eca2d0f43..98f24a9910a9 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
 #define MMAP_LOCK_SEQ_INITIALIZER(name) \
@@ -104,12 +105,26 @@ static inline unsigned long mmap_seq_read_start(struct 
mm_struct *mm)
return seq;
 }
 
-static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned long seq)
+static inline bool __mmap_seq_read_check(struct mm_struct *mm,
+unsigned long seq)
 {
smp_rmb();
return seq == READ_ONCE(mm->mmap_seq);
 }
-#endif
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned long seq,
+   enum vm_event_item fail_event)
+{
+   if (__mmap_seq_read_check(mm, seq))
+   return true;
+   count_vm_event(fail_event);
+   return false;
+}
+#else
+#define mmap_seq_read_check(mm, seq, fail) __mmap_seq_read_check(mm, seq)
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT_STATS */
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
 
 static inline void mmap_write_lock(struct mm_struct *mm)
 {
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index cc4f8d14e43f..6d25fd9ce4d1 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -124,6 +124,30 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
SPF_ATTEMPT,
SPF_ABORT,
+#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+   SPF_ABORT_ODD,
+   SPF_ABORT_UNMAPPED,
+   SPF_ABORT_NO_SPECULATE,
+   SPF_ABORT_VMA_COPY,
+   SPF_ABORT_ACCESS_ERROR,
+   SPF_ABORT_PUD,
+   SPF_ABORT_PMD,
+   SPF_ABORT_ANON_VMA,
+   SPF_ABORT_PTE_MAP_LOCK_SEQ1,
+   SPF_ABORT_PTE_MAP_LOCK_PMD,
+   SPF_ABORT_PTE_MAP_LOCK_PTL,
+   SPF_ABORT_PTE_MAP_LOCK_SEQ2,
+   SPF_ABORT_USERFAULTFD,
+   SPF_ABORT_FAULT,
+   SPF_ABORT_NON_SWAP_ENTRY,
+   SPF_ABORT_SWAP_NOPAGE,
+   SPF_ATTEMPT_ANON,
+   SPF_ATTEMPT_FILE,
+   SPF_ATTEMPT_SWAP,
+   SPF_ATTEMPT_NUMA,
+   SPF_ATTEMPT_PTE,
+   SPF_ATTEMPT_WP,
 #endif
NR_VM_EVENT_ITEMS
 };
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 506d625163a1..34e05604a93f 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -109,6 +109,12 @@ static inline void vm_events_fold_cpu(int cpu)
 
 #endif /* CONFIG_VM_EVENT_COUNTERS */
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+#define count_vm_spf_event(x) count_vm_e

[RFC PATCH 00/37] Speculative page faults

2021-04-06 Thread Michel Lespinasse
- Has the complexity issue in previous SPF proposals been sufficiently
  addressed ?
- Are the performance benefits compelling enough ? I think the answer is
  yes for the anon case, but maybe not as clear for the file cases.
- Is the Android use case compelling enough to merge the entire patchset ?
- Can we use this as a foundation for other mmap scalability work ?
  I hear several proposals involving the idea of RCU based fault handling,
  and hope this proposal might serve as a building block for these ?


Michel Lespinasse (37):
  mmap locking API: mmap_lock_is_contended returns a bool
  mmap locking API: name the return values
  do_anonymous_page: use update_mmu_tlb()
  do_anonymous_page: reduce code duplication
  mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
  x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  mm: add FAULT_FLAG_SPECULATIVE flag
  mm: add do_handle_mm_fault()
  mm: add per-mm mmap sequence counter for speculative page fault handling.
  mm: rcu safe vma freeing
  x86/mm: attempt speculative mm faults first
  mm: refactor __handle_mm_fault() / handle_pte_fault()
  mm: implement speculative handling in __handle_mm_fault().
  mm: add pte_map_lock() and pte_spinlock()
  mm: implement speculative handling in do_anonymous_page()
  mm: enable speculative fault handling through do_anonymous_page()
  mm: implement speculative handling in do_numa_page()
  mm: enable speculative fault handling in do_numa_page()
  mm: implement speculative handling in wp_page_copy()
  mm: implement and enable speculative fault handling in handle_pte_fault()
  mm: implement speculative handling in do_swap_page()
  mm: enable speculative fault handling through do_swap_page()
  mm: rcu safe vma->vm_file freeing
  mm: implement speculative handling in __do_fault()
  mm: implement speculative handling in filemap_fault()
  mm: implement speculative fault handling in finish_fault()
  mm: implement speculative handling in do_fault_around()
  mm: implement speculative handling in filemap_map_pages()
  fs: list file types that support speculative faults.
  mm: enable speculative fault handling for supported file types.
  ext4: implement speculative fault handling
  f2fs: implement speculative fault handling
  mm: enable speculative fault handling only for multithreaded user space
  mm: rcu safe vma freeing only for multithreaded user space
  mm: spf statistics
  arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  arm64/mm: attempt speculative mm faults first

 arch/arm64/Kconfig|   1 +
 arch/arm64/mm/fault.c |  52 +++
 arch/x86/Kconfig  |   1 +
 arch/x86/mm/fault.c   |  51 +++
 fs/btrfs/file.c   |   1 +
 fs/cifs/file.c|   1 +
 fs/exec.c |   1 +
 fs/ext4/file.c|   1 +
 fs/ext4/inode.c   |   7 +-
 fs/f2fs/file.c|   8 +-
 fs/fuse/file.c|   1 +
 fs/nfs/file.c |   1 +
 fs/ubifs/file.c   |   1 +
 fs/vboxsf/file.c  |   1 +
 fs/xfs/xfs_file.c |   3 +
 include/linux/mm.h|  76 +++-
 include/linux/mm_types.h  |  20 +-
 include/linux/mmap_lock.h | 109 --
 include/linux/vm_event_item.h |  28 ++
 include/linux/vmstat.h|   6 +
 kernel/fork.c |  26 +-
 mm/Kconfig|  22 ++
 mm/Kconfig.debug  |   7 +
 mm/filemap.c  |  73 +++-
 mm/memory.c   | 634 --
 mm/mmap.c |  11 +-
 mm/nommu.c|   6 +-
 mm/vmstat.c   |  28 ++
 28 files changed, 942 insertions(+), 235 deletions(-)

-- 
2.20.1



[RFC PATCH 31/37] ext4: implement speculative fault handling

2021-04-06 Thread Michel Lespinasse
We just need to make sure ext4_filemap_fault() doesn't block in the
speculative case as it is called with an rcu read lock held.

Signed-off-by: Michel Lespinasse 
---
 fs/ext4/file.c  | 1 +
 fs/ext4/inode.c | 7 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 194f5d00fa32..546b9d4aa9d7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -742,6 +742,7 @@ static const struct vm_operations_struct ext4_file_vm_ops = 
{
.fault  = ext4_filemap_fault,
.map_pages  = filemap_map_pages,
.page_mkwrite   = ext4_page_mkwrite,
+   .speculative= true,
 };
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0948a43f1b3d..370484403c71 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6192,7 +6192,12 @@ vm_fault_t ext4_filemap_fault(struct vm_fault *vmf)
struct inode *inode = file_inode(vmf->vma->vm_file);
vm_fault_t ret;
 
-   down_read(&EXT4_I(inode)->i_mmap_sem);
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   if (!down_read_trylock(&EXT4_I(inode)->i_mmap_sem))
+   return VM_FAULT_RETRY;
+   } else {
+   down_read(&EXT4_I(inode)->i_mmap_sem);
+   }
ret = filemap_fault(vmf);
up_read(&EXT4_I(inode)->i_mmap_sem);
 
-- 
2.20.1



[RFC PATCH 29/37] fs: list file types that support speculative faults.

2021-04-06 Thread Michel Lespinasse
Add a speculative field to the vm_operations_struct, which indicates if
the associated file type supports speculative faults.

Initially this is set for files that implement fault() with filemap_fault().

Signed-off-by: Michel Lespinasse 
---
 fs/btrfs/file.c| 1 +
 fs/cifs/file.c | 1 +
 fs/fuse/file.c | 1 +
 fs/nfs/file.c  | 1 +
 fs/ubifs/file.c| 1 +
 fs/vboxsf/file.c   | 1 +
 include/linux/mm.h | 7 +++
 mm/filemap.c   | 1 +
 8 files changed, 14 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0e155f013839..b31851271e51 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2293,6 +2293,7 @@ static const struct vm_operations_struct 
btrfs_file_vm_ops = {
.fault  = filemap_fault,
.map_pages  = filemap_map_pages,
.page_mkwrite   = btrfs_page_mkwrite,
+   .speculative= true,
 };
 
 static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma)
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 042e24aad410..a0d5fbb25c62 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4116,6 +4116,7 @@ static const struct vm_operations_struct cifs_file_vm_ops 
= {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = cifs_page_mkwrite,
+   .speculative = true,
 };
 
 int cifs_file_strict_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb55fb8..c4874240d157 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2354,6 +2354,7 @@ static const struct vm_operations_struct fuse_file_vm_ops 
= {
.fault  = filemap_fault,
.map_pages  = filemap_map_pages,
.page_mkwrite   = fuse_page_mkwrite,
+   .speculative= true,
 };
 
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 16ad5050e046..e653c6bc23ca 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -588,6 +588,7 @@ static const struct vm_operations_struct nfs_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = nfs_vm_page_mkwrite,
+   .speculative = true,
 };
 
 static int nfs_need_check_write(struct file *filp, struct inode *inode,
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 0e4b4be3aa26..3d97f1c3e9c7 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1598,6 +1598,7 @@ static const struct vm_operations_struct 
ubifs_file_vm_ops = {
.fault= filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ubifs_vm_page_mkwrite,
+   .speculative = true,
 };
 
 static int ubifs_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/vboxsf/file.c b/fs/vboxsf/file.c
index c4ab5996d97a..e0a6a3af9cb9 100644
--- a/fs/vboxsf/file.c
+++ b/fs/vboxsf/file.c
@@ -146,6 +146,7 @@ static const struct vm_operations_struct vboxsf_file_vm_ops 
= {
.close  = vboxsf_vma_close,
.fault  = filemap_fault,
.map_pages  = filemap_map_pages,
+   .speculative= true,
 };
 
 static int vboxsf_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f26490aff514..b4c0c10e434e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -641,6 +641,13 @@ struct vm_operations_struct {
 */
struct page *(*find_special_page)(struct vm_area_struct *vma,
  unsigned long addr);
+   /*
+* speculative indicates that the vm_operations support
+* speculative page faults. This allows ->fault and ->map_pages
+* to be called with FAULT_FLAG_SPECULATIVE set; such calls will
+* run within an rcu read locked section and with mmap lock not held.
+*/
+   bool speculative;
 };
 
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
diff --git a/mm/filemap.c b/mm/filemap.c
index d496771749e6..b83040c672d3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3226,6 +3226,7 @@ const struct vm_operations_struct generic_file_vm_ops = {
.fault  = filemap_fault,
.map_pages  = filemap_map_pages,
.page_mkwrite   = filemap_page_mkwrite,
+   .speculative= true,
 };
 
 /* This is used for a general mmap of a disk file */
-- 
2.20.1



[RFC PATCH 27/37] mm: implement speculative handling in do_fault_around()

2021-04-06 Thread Michel Lespinasse
Call the vm_ops->map_pages method within an rcu read locked section.
In the speculative case, verify the mmap sequence lock at the start of
the section. A match guarantees that the original vma is still valid
at that time, and that the associated vma->vm_file stays valid while
the vm_ops->map_pages() method is running.

Do not test vmf->pmd in the speculative case - we only speculate when
a page table already exists, and this saves us from having to handle
synchronization around the vmf->pmd read.

Change xfs_filemap_map_pages() to account for the fact that it cannot
block anymore, as it is now running within an rcu read lock.
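
The mmap_seq_read_start()/mmap_seq_read_check() helpers used here are
introduced by an earlier patch in the series (not shown in this
excerpt). Conceptually they form a seqcount-style read section over the
mmap lock; a simplified sketch, assuming a per-mm counter mmap_seq that
writers bump when acquiring and releasing the mmap lock for writing:

static inline unsigned long mmap_seq_read_start(struct mm_struct *mm)
{
	unsigned long seq = READ_ONCE(mm->mmap_seq);	/* odd => writer active */

	smp_rmb();
	return seq;
}

static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned long seq)
{
	smp_rmb();
	return READ_ONCE(mm->mmap_seq) == seq;	/* no writer ran in between */
}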

Signed-off-by: Michel Lespinasse 
---
 fs/xfs/xfs_file.c |  3 +++
 mm/memory.c   | 22 --
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a007ca0711d9..b360732b20ae 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1387,6 +1387,9 @@ xfs_filemap_map_pages(
struct inode*inode = file_inode(vmf->vma->vm_file);
vm_fault_t ret;
 
+   if (!xfs_ilock_nowait(XFS_I(inode), XFS_MMAPLOCK_SHARED))
+   return (vmf->flags & FAULT_FLAG_SPECULATIVE) ?
+   VM_FAULT_RETRY : 0;
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/mm/memory.c b/mm/memory.c
index 13e2aaf900e5..a20e13d84145 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4012,6 +4012,7 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
pgoff_t start_pgoff = vmf->pgoff;
pgoff_t end_pgoff;
int off;
+   vm_fault_t ret;
 
nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
@@ -4030,14 +4031,31 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 
1,
start_pgoff + nr_pages - 1);
 
-   if (pmd_none(*vmf->pmd)) {
+   if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
+   pmd_none(*vmf->pmd)) {
vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
if (!vmf->prealloc_pte)
return VM_FAULT_OOM;
smp_wmb(); /* See comment in __pte_alloc() */
}
 
-   return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
+   rcu_read_lock();
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
+   rcu_read_unlock();
+   return VM_FAULT_RETRY;
+   }
+   /*
+* the mmap sequence check verified that vmf->vma was still
+* current at that point in time.
+* The rcu read lock ensures vmf->vma->vm_file stays valid.
+*/
+   }
+#endif
+   ret = vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
+   rcu_read_unlock();
+   return ret;
 }
 
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
-- 
2.20.1



[RFC PATCH 08/37] mm: add do_handle_mm_fault()

2021-04-06 Thread Michel Lespinasse
Add a new do_handle_mm_fault function, which extends the existing
handle_mm_fault() API by adding an mmap sequence count, to be used
in the FAULT_FLAG_SPECULATIVE case.

In the initial implementation, FAULT_FLAG_SPECULATIVE always fails
(by returning VM_FAULT_RETRY).

The existing handle_mm_fault() API is kept as a wrapper around
do_handle_mm_fault() so that we do not have to immediately update
every handle_mm_fault() call site.
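
As a sketch of the intended call site, following the shape of the x86
and arm64 patches later in the series (vma lookup and error handling
omitted):

	seq = mmap_seq_read_start(mm);	/* sampled before the rcu vma lookup */
	fault = do_handle_mm_fault(vma, address & PAGE_MASK,
				   flags | FAULT_FLAG_SPECULATIVE, seq, regs);
	if (fault & VM_FAULT_RETRY) {
		/* Speculation failed; retry classically under the mmap lock. */
		mmap_read_lock(mm);
		vma = find_vma(mm, address);
		if (vma && vma->vm_start <= address)
			fault = handle_mm_fault(vma, address, flags, regs);
		mmap_read_unlock(mm);
	}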

Signed-off-by: Michel Lespinasse 
---
 include/linux/mm.h | 12 +---
 mm/memory.c| 10 +++---
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 021fdab5b721..d5988e78e6ab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1724,9 +1724,15 @@ int generic_error_remove_page(struct address_space 
*mapping, struct page *page);
 int invalidate_inode_page(struct page *page);
 
 #ifdef CONFIG_MMU
-extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags,
- struct pt_regs *regs);
+extern vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
+   unsigned long address, unsigned int flags,
+   unsigned long seq, struct pt_regs *regs);
+static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
+   unsigned long address, unsigned int flags,
+   struct pt_regs *regs)
+{
+   return do_handle_mm_fault(vma, address, flags, 0, regs);
+}
 extern int fixup_user_fault(struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index 477c98bfdd9d..3691be1f1319 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4509,11 +4509,15 @@ static inline void mm_account_fault(struct pt_regs 
*regs,
  * The mmap_lock may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
-  unsigned int flags, struct pt_regs *regs)
+vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
+   unsigned long address, unsigned int flags,
+   unsigned long seq, struct pt_regs *regs)
 {
vm_fault_t ret;
 
+   if (flags & FAULT_FLAG_SPECULATIVE)
+   return VM_FAULT_RETRY;
+
__set_current_state(TASK_RUNNING);
 
count_vm_event(PGFAULT);
@@ -4555,7 +4559,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
 
return ret;
 }
-EXPORT_SYMBOL_GPL(handle_mm_fault);
+EXPORT_SYMBOL_GPL(do_handle_mm_fault);
 
 #ifndef __PAGETABLE_P4D_FOLDED
 /*
-- 
2.20.1



[RFC PATCH 30/37] mm: enable speculative fault handling for supported file types.

2021-04-06 Thread Michel Lespinasse
Introduce vma_can_speculate(), which allows speculative handling for
VMAs mapping supported file types.

From do_handle_mm_fault(), speculative handling will follow through
__handle_mm_fault(), handle_pte_fault() and do_fault().

At this point, we expect speculative faults to continue through one of:
- do_read_fault(), fully implemented;
- do_cow_fault(), which might abort if missing anon vmas,
- do_shared_fault(), not implemented yet
  (would require ->page_mkwrite() changes).

vma_can_speculate() provides an early abort for the do_shared_fault() case,
limiting the time spent on trying that unimplemented case.
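
For illustration, assuming a vma backed by a filesystem that sets
.speculative (e.g. generic_file_vm_ops), vma_can_speculate() resolves
as follows:

	/*
	 *   anonymous vma, any fault          -> true
	 *   file-backed, read fault           -> true  (do_read_fault)
	 *   file-backed, write, !VM_SHARED    -> true  (do_cow_fault, COW)
	 *   file-backed, write, VM_SHARED     -> false (do_shared_fault would
	 *                                        need ->page_mkwrite support)
	 */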

Signed-off-by: Michel Lespinasse 
---
 arch/x86/mm/fault.c |  3 ++-
 include/linux/mm.h  | 14 ++
 mm/memory.c | 17 -
 3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fbf265f56a06..48b86911a6df 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1324,7 +1324,8 @@ void do_user_addr_fault(struct pt_regs *regs,
goto spf_abort;
rcu_read_lock();
vma = find_vma(mm, address);
-   if (!vma || vma->vm_start > address || !vma_is_anonymous(vma)) {
+   if (!vma || vma->vm_start > address ||
+   !vma_can_speculate(vma, flags)) {
rcu_read_unlock();
goto spf_abort;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b4c0c10e434e..edb809e9036b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -700,6 +700,20 @@ static inline bool vma_is_accessible(struct vm_area_struct 
*vma)
return vma->vm_flags & VM_ACCESS_FLAGS;
 }
 
+static inline bool vma_can_speculate(struct vm_area_struct *vma,
+   unsigned int flags)
+{
+   if (vma_is_anonymous(vma))
+   return true;
+   if (!vma->vm_ops->speculative)
+   return false;
+   if (!(flags & FAULT_FLAG_WRITE))
+   return true;
+   if (!(vma->vm_flags & VM_SHARED))
+   return true;
+   return false;
+}
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/memory.c b/mm/memory.c
index a20e13d84145..074945faf1ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4090,8 +4090,12 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret;
 
-   if (unlikely(anon_vma_prepare(vma)))
-   return VM_FAULT_OOM;
+   if (unlikely(!vma->anon_vma)) {
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+   return VM_FAULT_RETRY;
+   if (__anon_vma_prepare(vma))
+   return VM_FAULT_OOM;
+   }
 
vmf->cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
if (!vmf->cow_page)
@@ -4128,6 +4132,8 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret, tmp;
 
+   VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
+
ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
@@ -4172,12 +4178,12 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
struct mm_struct *vm_mm = vma->vm_mm;
vm_fault_t ret;
 
-   VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
-
/*
 * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
 */
if (!vma->vm_ops->fault) {
+   VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
+
/*
 * If we find a migration pmd entry or a none pmd entry, which
 * should never happen, return SIGBUS
@@ -4739,7 +4745,8 @@ vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
 {
vm_fault_t ret;
 
-   VM_BUG_ON((flags & FAULT_FLAG_SPECULATIVE) && !vma_is_anonymous(vma));
+   VM_BUG_ON((flags & FAULT_FLAG_SPECULATIVE) &&
+ !vma_can_speculate(vma, flags));
 
__set_current_state(TASK_RUNNING);
 
-- 
2.20.1



[RFC PATCH 25/37] mm: implement speculative handling in filemap_fault()

2021-04-06 Thread Michel Lespinasse
Extend filemap_fault() to handle speculative faults.

In the speculative case, we will only be fishing existing pages out of
the page cache. The logic we use mirrors what is done in the
non-speculative case, assuming that pages are found in the page cache,
are up to date and not already locked, and that readahead is not
necessary at this time. In all other cases, the fault is aborted to be
handled non-speculatively.

Signed-off-by: Michel Lespinasse 
---
 mm/filemap.c | 45 -
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 43700480d897..6e8505fe5df9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2851,7 +2851,9 @@ static struct file *do_async_mmap_readahead(struct 
vm_fault *vmf,
  * it in the page cache, and handles the special cases reasonably without
  * having a lot of duplicated code.
  *
- * vma->vm_mm->mmap_lock must be held on entry.
+ * If FAULT_FLAG_SPECULATIVE is set, this function runs within an rcu
+ * read locked section and with mmap lock not held.
+ * Otherwise, vma->vm_mm->mmap_lock must be held on entry.
  *
  * If our return value has VM_FAULT_RETRY set, it's because the mmap_lock
  * may be dropped before doing I/O or by lock_page_maybe_drop_mmap().
@@ -2876,6 +2878,47 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
struct page *page;
vm_fault_t ret = 0;
 
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   page = find_get_page(mapping, offset);
+   if (unlikely(!page) || unlikely(PageReadahead(page)))
+   return VM_FAULT_RETRY;
+
+   if (!trylock_page(page))
+   return VM_FAULT_RETRY;
+
+   if (unlikely(compound_head(page)->mapping != mapping))
+   goto page_unlock;
+   VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
+   if (unlikely(!PageUptodate(page)))
+   goto page_unlock;
+
+   max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+   if (unlikely(offset >= max_off))
+   goto page_unlock;
+
+   /*
+* Update readahead mmap_miss statistic.
+*
+* Note that we are not sure if finish_fault() will
+* manage to complete the transaction. If it fails,
+* we'll come back to filemap_fault() non-speculative
+* case which will update mmap_miss a second time.
+* This is not ideal, we would prefer to guarantee the
+* update will happen exactly once.
+*/
+   if (!(vmf->vma->vm_flags & VM_RAND_READ) && ra->ra_pages) {
+   unsigned int mmap_miss = READ_ONCE(ra->mmap_miss);
+   if (mmap_miss)
+   WRITE_ONCE(ra->mmap_miss, --mmap_miss);
+   }
+
+   vmf->page = page;
+   return VM_FAULT_LOCKED;
+page_unlock:
+   unlock_page(page);
+   return VM_FAULT_RETRY;
+   }
+
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(offset >= max_off))
return VM_FAULT_SIGBUS;
-- 
2.20.1



[RFC PATCH 19/37] mm: implement speculative handling in wp_page_copy()

2021-04-06 Thread Michel Lespinasse
Change wp_page_copy() to handle the speculative case.
This involves aborting speculative faults if they have to allocate an
anon_vma, and using pte_map_lock() instead of pte_offset_map_lock()
to complete the page fault.

Also change call sites to clear vmf->pte after unmapping the page table,
in order to satisfy pte_map_lock()'s preconditions.

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 31 ++-
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eea72bd78d06..547d9d0ee962 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2911,20 +2911,27 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
pte_t entry;
int page_copied = 0;
struct mmu_notifier_range range;
+   vm_fault_t ret = VM_FAULT_OOM;
 
-   if (unlikely(anon_vma_prepare(vma)))
-   goto oom;
+   if (unlikely(!vma->anon_vma)) {
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   ret = VM_FAULT_RETRY;
+   goto out;
+   }
+   if (__anon_vma_prepare(vma))
+   goto out;
+   }
 
if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma,
  vmf->address);
if (!new_page)
-   goto oom;
+   goto out;
} else {
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (!new_page)
-   goto oom;
+   goto out;
 
if (!cow_user_page(new_page, old_page, vmf)) {
/*
@@ -2941,7 +2948,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
}
 
if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
-   goto oom_free_new;
+   goto out_free_new;
cgroup_throttle_swaprate(new_page, GFP_KERNEL);
 
__SetPageUptodate(new_page);
@@ -2954,7 +2961,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
/*
 * Re-check the pte - we dropped the lock
 */
-   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+   if (!pte_map_lock(vmf)) {
+   ret = VM_FAULT_RETRY;
+   /* put_page() will uncharge the page */
+   goto out_free_new;
+   }
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -3042,12 +3053,12 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
put_page(old_page);
}
return page_copied ? VM_FAULT_WRITE : 0;
-oom_free_new:
+out_free_new:
put_page(new_page);
-oom:
+out:
if (old_page)
put_page(old_page);
-   return VM_FAULT_OOM;
+   return ret;
 }
 
 /**
@@ -3190,6 +3201,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
return wp_pfn_shared(vmf);
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
+   vmf->pte = NULL;
return wp_page_copy(vmf);
}
 
@@ -3228,6 +3240,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
get_page(vmf->page);
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
+   vmf->pte = NULL;
return wp_page_copy(vmf);
 }
 
-- 
2.20.1



[RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

2021-04-06 Thread Michel Lespinasse
In the speculative case, call the vm_ops->fault() method from within
an rcu read locked section, and verify the mmap sequence lock at the
start of the section. A match guarantees that the original vma is still
valid at that time, and that the associated vma->vm_file stays valid
while the vm_ops->fault() method is running.

Note that this implies that speculative faults can not sleep within
the vm_ops->fault method. We will only attempt to fetch existing pages
from the page cache during speculative faults; any miss (or prefetch)
will be handled by falling back to non-speculative fault handling.

The speculative handling case also does not preallocate page tables,
as it is always called with a pre-existing page table.
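
A speculation-aware ->fault implementation therefore has to stay
non-blocking end to end. As a hypothetical debug aid (not part of this
patch), the constraint could be asserted at the top of such a handler:

	if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
		/* Must run inside the caller's rcu read section. */
		RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
				 "speculative ->fault outside rcu read lock");
		/* From here on: no lock_page(), no readahead, no I/O. */
	}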

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 63 +++--
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6eddd7b4e89c..7139004c624d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3709,29 +3709,50 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret;
 
-   /*
-* Preallocate pte before we take page_lock because this might lead to
-* deadlocks for memcg reclaim which waits for pages under writeback:
-*  lock_page(A)
-*  SetPageWriteback(A)
-*  unlock_page(A)
-* lock_page(B)
-*  lock_page(B)
-* pte_alloc_one
-*   shrink_page_list
-* wait_on_page_writeback(A)
-*  SetPageWriteback(B)
-*  unlock_page(B)
-*  # flush A, B to clear the writeback
-*/
-   if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
-   vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
-   if (!vmf->prealloc_pte)
-   return VM_FAULT_OOM;
-   smp_wmb(); /* See comment in __pte_alloc() */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   rcu_read_lock();
+   if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
+   ret = VM_FAULT_RETRY;
+   } else {
+   /*
+* The mmap sequence count check guarantees that the
+* vma we fetched at the start of the fault was still
+* current at that point in time. The rcu read lock
+* ensures vmf->vma->vm_file stays valid.
+*/
+   ret = vma->vm_ops->fault(vmf);
+   }
+   rcu_read_unlock();
+   } else
+#endif
+   {
+   /*
+* Preallocate pte before we take page_lock because
+* this might lead to deadlocks for memcg reclaim
+* which waits for pages under writeback:
+*  lock_page(A)
+*  SetPageWriteback(A)
+*  unlock_page(A)
+* lock_page(B)
+*  lock_page(B)
+* pte_alloc_one
+*   shrink_page_list
+* wait_on_page_writeback(A)
+*  SetPageWriteback(B)
+*  unlock_page(B)
+*  # flush A, B to clear writeback
+*/
+   if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+   vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
+   if (!vmf->prealloc_pte)
+   return VM_FAULT_OOM;
+   smp_wmb(); /* See comment in __pte_alloc() */
+   }
+
+   ret = vma->vm_ops->fault(vmf);
}
 
-   ret = vma->vm_ops->fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
VM_FAULT_DONE_COW)))
return ret;
-- 
2.20.1



[RFC PATCH 37/37] arm64/mm: attempt speculative mm faults first

2021-04-06 Thread Michel Lespinasse
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.

This follows the lines of the x86 speculative fault handling code,
but with some minor arch differences such as the way that the
VM_FAULT_BADACCESS case is handled.

Signed-off-by: Michel Lespinasse 
---
 arch/arm64/mm/fault.c | 52 +++
 1 file changed, 52 insertions(+)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f37d4e3830b7..3757bfbb457a 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -530,6 +531,9 @@ static int __kprobes do_page_fault(unsigned long far, 
unsigned int esr,
unsigned long vm_flags = VM_ACCESS_FLAGS;
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
unsigned long addr = untagged_addr(far);
+   struct vm_area_struct *vma;
+   struct vm_area_struct pvma;
+   unsigned long seq;
 
if (kprobe_page_fault(regs, esr))
return 0;
@@ -564,6 +568,53 @@ static int __kprobes do_page_fault(unsigned long far, 
unsigned int esr,
 
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
+   /* Only try spf for multithreaded user space faults. */
+   if (!(mm_flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+   goto no_spf;
+
+   count_vm_event(SPF_ATTEMPT);
+   seq = mmap_seq_read_start(mm);
+   if (seq & 1) {
+   count_vm_spf_event(SPF_ABORT_ODD);
+   goto spf_abort;
+   }
+   rcu_read_lock();
+   vma = find_vma(mm, addr);
+   if (!vma || vma->vm_start > addr) {
+   rcu_read_unlock();
+   count_vm_spf_event(SPF_ABORT_UNMAPPED);
+   goto spf_abort;
+   }
+   if (!vma_can_speculate(vma, mm_flags)) {
+   rcu_read_unlock();
+   count_vm_spf_event(SPF_ABORT_NO_SPECULATE);
+   goto spf_abort;
+   }
+   pvma = *vma;
+   rcu_read_unlock();
+   if (!mmap_seq_read_check(mm, seq, SPF_ABORT_VMA_COPY))
+   goto spf_abort;
+   vma = &pvma;
+   if (!(vma->vm_flags & vm_flags)) {
+   count_vm_spf_event(SPF_ABORT_ACCESS_ERROR);
+   goto spf_abort;
+   }
+   fault = do_handle_mm_fault(vma, addr & PAGE_MASK,
+   mm_flags | FAULT_FLAG_SPECULATIVE, seq, regs);
+
+   /* Quick path to respond to signals */
+   if (fault_signal_pending(fault, regs)) {
+   if (!user_mode(regs))
+   goto no_context;
+   return 0;
+   }
+   if (!(fault & VM_FAULT_RETRY))
+   goto done;
+
+spf_abort:
+   count_vm_event(SPF_ABORT);
+no_spf:
+
/*
 * As per x86, we may deadlock here. However, since the kernel only
 * validly references user space from well defined areas of the code,
@@ -604,6 +655,7 @@ static int __kprobes do_page_fault(unsigned long far, 
unsigned int esr,
}
}
mmap_read_unlock(mm);
+done:
 
/*
 * Handle the "normal" (no error) case first.
-- 
2.20.1



[RFC PATCH 07/37] mm: add FAULT_FLAG_SPECULATIVE flag

2021-04-06 Thread Michel Lespinasse
Define the new FAULT_FLAG_SPECULATIVE flag, which indicates when we are
attempting speculative fault handling (without holding the mmap lock).

Signed-off-by: Michel Lespinasse 
---
 include/linux/mm.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..021fdab5b721 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -434,6 +434,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal 
signals.
+ * @FAULT_FLAG_SPECULATIVE: The fault is handled without holding the mmap_sem.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -464,6 +465,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE  0x80
 #define FAULT_FLAG_INSTRUCTION 0x100
 #define FAULT_FLAG_INTERRUPTIBLE   0x200
+#define FAULT_FLAG_SPECULATIVE 0x400
 
 /*
  * The default fault flags that should be used by most of the
@@ -501,7 +503,8 @@ static inline bool fault_flag_allow_retry_first(unsigned 
int flags)
{ FAULT_FLAG_USER,  "USER" }, \
{ FAULT_FLAG_REMOTE,"REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION,   "INSTRUCTION" }, \
-   { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+   { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+   { FAULT_FLAG_SPECULATIVE,   "SPECULATIVE" }
 
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
-- 
2.20.1



[RFC PATCH 18/37] mm: enable speculative fault handling in do_numa_page()

2021-04-06 Thread Michel Lespinasse
Change handle_pte_fault() to allow speculative fault execution to proceed
through do_numa_page().

do_swap_page() does not implement speculative execution yet, so it
needs to abort with VM_FAULT_RETRY in that case.

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 78cc36749754..eea72bd78d06 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3346,6 +3346,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;
 
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   pte_unmap(vmf->pte);
+   return VM_FAULT_RETRY;
+   }
+
if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
goto out;
 
@@ -4342,17 +4347,17 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
return do_fault(vmf);
}
 
-   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
-   pte_unmap(vmf->pte);
-   return VM_FAULT_RETRY;
-   }
-
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
 
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
 
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   pte_unmap(vmf->pte);
+   return VM_FAULT_RETRY;
+   }
+
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
-- 
2.20.1



[RFC PATCH 23/37] mm: rcu safe vma->vm_file freeing

2021-04-06 Thread Michel Lespinasse
Defer freeing of vma->vm_file when freeing vmas.
This is to allow speculative page faults in the mapped file case.
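
The reader this protects is the speculative fault path, which may
dereference vma->vm_file while holding only the rcu read lock. A rough
sketch of that window, mirroring the arch fault handler patches in this
series:

	rcu_read_lock();
	vma = find_vma(mm, address);
	if (vma && mmap_seq_read_check(mm, seq)) {
		/*
		 * A concurrent munmap() can only free the vma and drop its
		 * vm_file reference after an RCU grace period, so both stay
		 * valid for the duration of this section.
		 */
		ret = vma->vm_ops->fault(vmf);
	}
	rcu_read_unlock();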

Signed-off-by: Michel Lespinasse 
---
 fs/exec.c |  1 +
 kernel/fork.c | 17 +++--
 mm/mmap.c | 11 +++
 mm/nommu.c|  6 ++
 4 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..c9da73eb0f53 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -286,6 +286,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
mmap_write_unlock(mm);
 err_free:
bprm->vma = NULL;
+   VM_BUG_ON(vma->vm_file);
vm_area_free(vma);
return err;
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index b6078e546114..2f20a5c5fed8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -369,19 +369,31 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
*orig)
return new;
 }
 
+static inline void vm_area_free(struct vm_area_struct *vma)
+{
+   if (vma->vm_file)
+   fput(vma->vm_file);
+   kmem_cache_free(vm_area_cachep, vma);
+}
+
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
 static void __vm_area_free(struct rcu_head *head)
 {
struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
  vm_rcu);
-   kmem_cache_free(vm_area_cachep, vma);
+   vm_area_free(vma);
 }
 
+#endif
+
 void vm_area_free(struct vm_area_struct *vma)
 {
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
call_rcu(&vma->vm_rcu, __vm_area_free);
+#else
+   vm_area_free(vma);
+#endif
 }
-#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
@@ -621,6 +633,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 fail_nomem_anon_vma_fork:
mpol_put(vma_policy(tmp));
 fail_nomem_policy:
+   tmp->vm_file = NULL;/* prevents fput within vm_area_free() */
vm_area_free(tmp);
 fail_nomem:
retval = -ENOMEM;
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..cc2323e243bb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -178,9 +178,8 @@ static struct vm_area_struct *remove_vma(struct 
vm_area_struct *vma)
might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
-   if (vma->vm_file)
-   fput(vma->vm_file);
mpol_put(vma_policy(vma));
+   /* fput(vma->vm_file) happens in vm_area_free after an RCU delay. */
vm_area_free(vma);
return next;
 }
@@ -949,7 +948,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
if (remove_next) {
if (file) {
uprobe_munmap(next, next->vm_start, next->vm_end);
-   fput(file);
+   /* fput(file) happens within vm_area_free(next) */
+   VM_BUG_ON(file != next->vm_file);
}
if (next->anon_vma)
anon_vma_merge(vma, next);
@@ -1828,7 +1828,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
 * fput the vma->vm_file here or we would add 
an extra fput for file
 * and cause general protection fault 
ultimately.
 */
-   fput(vma->vm_file);
+   /* fput happens within vm_area_free */
vm_area_free(vma);
vma = merge;
/* Update vm_flags to pick up the change. */
@@ -1907,6 +1907,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
if (vm_flags & VM_DENYWRITE)
allow_write_access(file);
 free_vma:
+   VM_BUG_ON(vma->vm_file);
vm_area_free(vma);
 unacct_error:
if (charged)
@@ -2779,6 +2780,7 @@ int __split_vma(struct mm_struct *mm, struct 
vm_area_struct *vma,
  out_free_mpol:
mpol_put(vma_policy(new));
  out_free_vma:
+   new->vm_file = NULL;/* prevents fput within vm_area_free() */
vm_area_free(new);
return err;
 }
@@ -3343,6 +3345,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct 
**vmap,
 out_free_mempol:
mpol_put(vma_policy(new_vma));
 out_free_vma:
+   new_vma->vm_file = NULL;/* Prevent fput within vm_area_free */
vm_area_free(new_vma);
 out:
return NULL;
diff --git a/mm/nommu.c b/mm/nommu.c
index 5c9ab799c0e6..06a0dc0b913b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -664,9 +664,8 @@ static void delete_vma(struct mm_struct *mm, struct 
vm_area_struct *vma)
 {
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
-   if (vma->vm_file)
-   fput(vma->vm_file);
put_nommu_region(vma->vm_region

[RFC PATCH 16/37] mm: enable speculative fault handling through do_anonymous_page()

2021-04-06 Thread Michel Lespinasse
In the x86 fault handler, only attempt spf if the vma is anonymous.

In do_handle_mm_fault(), let speculative page faults proceed as long
as they fall into anonymous vmas. This enables the speculative
handling code in __handle_mm_fault() and do_anonymous_page().

In handle_pte_fault(), if vmf->pte is set (the original pte was not
pte_none), catch speculative faults and return VM_FAULT_RETRY as
those cases are not implemented yet. Also assert that do_fault()
is not reached in the speculative case.

Signed-off-by: Michel Lespinasse 
---
 arch/x86/mm/fault.c |  2 +-
 mm/memory.c | 16 
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f8c8e325af77..fbf265f56a06 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1324,7 +1324,7 @@ void do_user_addr_fault(struct pt_regs *regs,
goto spf_abort;
rcu_read_lock();
vma = find_vma(mm, address);
-   if (!vma || vma->vm_start > address) {
+   if (!vma || vma->vm_start > address || !vma_is_anonymous(vma)) {
rcu_read_unlock();
goto spf_abort;
}
diff --git a/mm/memory.c b/mm/memory.c
index fd84576f9c01..a2c5bf29f989 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4103,6 +4103,8 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
struct mm_struct *vm_mm = vma->vm_mm;
vm_fault_t ret;
 
+   VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
+
/*
 * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
 */
@@ -4340,6 +4342,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
return do_fault(vmf);
}
 
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   pte_unmap(vmf->pte);
+   return VM_FAULT_RETRY;
+   }
+
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
 
@@ -4668,8 +4675,7 @@ vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
 {
vm_fault_t ret;
 
-   if (flags & FAULT_FLAG_SPECULATIVE)
-   return VM_FAULT_RETRY;
+   VM_BUG_ON((flags & FAULT_FLAG_SPECULATIVE) && !vma_is_anonymous(vma));
 
__set_current_state(TASK_RUNNING);
 
@@ -4691,10 +4697,12 @@ vm_fault_t do_handle_mm_fault(struct vm_area_struct 
*vma,
if (flags & FAULT_FLAG_USER)
mem_cgroup_enter_user_fault();
 
-   if (unlikely(is_vm_hugetlb_page(vma)))
+   if (unlikely(is_vm_hugetlb_page(vma))) {
+   VM_BUG_ON(flags & FAULT_FLAG_SPECULATIVE);
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
-   else
+   } else {
ret = __handle_mm_fault(vma, address, flags, seq);
+   }
 
if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
-- 
2.20.1



[RFC PATCH 04/37] do_anonymous_page: reduce code duplication

2021-04-06 Thread Michel Lespinasse
In do_anonymous_page(), we have separate cases for the zero page vs
allocating new anonymous pages. However, once the pte entry has been
computed, the rest of the handling (mapping and locking the page table,
checking that we didn't lose a race with another page fault handler, etc)
is identical between the two cases.

This change reduces the code duplication between the two cases.

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 85 +++--
 1 file changed, 37 insertions(+), 48 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8ee4bd239303..477c98bfdd9d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3495,7 +3495,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct page *page;
+   struct page *page = NULL;
vm_fault_t ret = 0;
pte_t entry;
 
@@ -3525,77 +3525,66 @@ static vm_fault_t do_anonymous_page(struct vm_fault 
*vmf)
!mm_forbids_zeropage(vma->vm_mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
vma->vm_page_prot));
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
-   if (!pte_none(*vmf->pte)) {
-   update_mmu_tlb(vma, vmf->address, vmf->pte);
-   goto unlock;
-   }
-   ret = check_stable_address_space(vma->vm_mm);
-   if (ret)
-   goto unlock;
-   /* Deliver the page fault to userland, check inside PT lock */
-   if (userfaultfd_missing(vma)) {
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   return handle_userfault(vmf, VM_UFFD_MISSING);
-   }
-   goto setpte;
+   } else {
+   /* Allocate our own private page. */
+   if (unlikely(anon_vma_prepare(vma)))
+   goto oom;
+   page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
+   if (!page)
+   goto oom;
+
+   if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
+   goto oom_free_page;
+   cgroup_throttle_swaprate(page, GFP_KERNEL);
+
+   /*
+* The memory barrier inside __SetPageUptodate makes sure that
+* preceding stores to the page contents become visible before
+* the set_pte_at() write.
+*/
+   __SetPageUptodate(page);
+
+   entry = mk_pte(page, vma->vm_page_prot);
+   if (vma->vm_flags & VM_WRITE)
+   entry = pte_mkwrite(pte_mkdirty(entry));
}
 
-   /* Allocate our own private page. */
-   if (unlikely(anon_vma_prepare(vma)))
-   goto oom;
-   page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
-   if (!page)
-   goto oom;
-
-   if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
-   goto oom_free_page;
-   cgroup_throttle_swaprate(page, GFP_KERNEL);
-
-   /*
-* The memory barrier inside __SetPageUptodate makes sure that
-* preceding stores to the page contents become visible before
-* the set_pte_at() write.
-*/
-   __SetPageUptodate(page);
-
-   entry = mk_pte(page, vma->vm_page_prot);
-   if (vma->vm_flags & VM_WRITE)
-   entry = pte_mkwrite(pte_mkdirty(entry));
-
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
if (!pte_none(*vmf->pte)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
-   goto release;
+   goto unlock;
}
 
ret = check_stable_address_space(vma->vm_mm);
if (ret)
-   goto release;
+   goto unlock;
 
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   put_page(page);
+   if (page)
+   put_page(page);
return handle_userfault(vmf, VM_UFFD_MISSING);
}
 
-   inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
-   lru_cache_add_inactive_or_unevictable(page, vma);
-setpte:
+   if (page) {
+   inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   lru_cache_add_inactive_or_unevictable(page, vma);
+   }
+
set_pte_at(vma->vm_mm, vmf->address, vmf

[RFC PATCH 17/37] mm: implement speculative handling in do_numa_page()

2021-04-06 Thread Michel Lespinasse
Change do_numa_page() to use pte_spinlock() when locking the page table,
so that the mmap sequence counter will be validated in the speculative case.

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index a2c5bf29f989..78cc36749754 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4181,8 +4181,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 * validation through pte_unmap_same(). It's of NUMA type but
 * the pfn may be screwed if the read is non atomic.
 */
-   vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
-- 
2.20.1



[RFC PATCH 20/37] mm: implement and enable speculative fault handling in handle_pte_fault()

2021-04-06 Thread Michel Lespinasse
In handle_pte_fault(), allow speculative execution to proceed.

Use pte_spinlock() to validate the mmap sequence count when locking
the page table.

If speculative execution proceeds through do_wp_page(), ensure that we
end up in the wp_page_reuse() or wp_page_copy() paths, rather than
wp_pfn_shared() or wp_page_shared() (both unreachable as we only
handle anon vmas so far) or handle_userfault() (needs an explicit
abort to handle non-speculatively).

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 547d9d0ee962..fc555fae0844 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3103,6 +3103,7 @@ static vm_fault_t wp_pfn_shared(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
 
+   VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
vm_fault_t ret;
 
@@ -3123,6 +3124,8 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret = VM_FAULT_WRITE;
 
+   VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
+
get_page(vmf->page);
 
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
@@ -3176,6 +3179,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 
if (userfaultfd_pte_wp(vma, *vmf->pte)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+   return VM_FAULT_RETRY;
return handle_userfault(vmf, VM_UFFD_WP);
}
 
@@ -4366,13 +4371,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
 
-   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
-   pte_unmap(vmf->pte);
+   if (!pte_spinlock(vmf))
return VM_FAULT_RETRY;
-   }
-
-   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry))) {
update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
-- 
2.20.1



[RFC PATCH 02/37] mmap locking API: name the return values

2021-04-06 Thread Michel Lespinasse
In the mmap locking API, the *_killable() functions return an error
(or 0 on success), and the *_trylock() functions return a boolean
(true on success).

Rename the return values "int error" and "bool ok", respectively,
rather than using "ret" for both cases which I find less readable.

Signed-off-by: Michel Lespinasse 
---
 include/linux/mmap_lock.h | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 4e27f755766b..8ff276a7560e 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -81,22 +81,22 @@ static inline void mmap_write_lock_nested(struct mm_struct 
*mm, int subclass)
 
 static inline int mmap_write_lock_killable(struct mm_struct *mm)
 {
-   int ret;
+   int error;
 
__mmap_lock_trace_start_locking(mm, true);
-   ret = down_write_killable(&mm->mmap_lock);
-   __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
-   return ret;
+   error = down_write_killable(&mm->mmap_lock);
+   __mmap_lock_trace_acquire_returned(mm, true, !error);
+   return error;
 }
 
 static inline bool mmap_write_trylock(struct mm_struct *mm)
 {
-   bool ret;
+   bool ok;
 
__mmap_lock_trace_start_locking(mm, true);
-   ret = down_write_trylock(&mm->mmap_lock) != 0;
-   __mmap_lock_trace_acquire_returned(mm, true, ret);
-   return ret;
+   ok = down_write_trylock(&mm->mmap_lock) != 0;
+   __mmap_lock_trace_acquire_returned(mm, true, ok);
+   return ok;
 }
 
 static inline void mmap_write_unlock(struct mm_struct *mm)
@@ -120,22 +120,22 @@ static inline void mmap_read_lock(struct mm_struct *mm)
 
 static inline int mmap_read_lock_killable(struct mm_struct *mm)
 {
-   int ret;
+   int error;
 
__mmap_lock_trace_start_locking(mm, false);
-   ret = down_read_killable(&mm->mmap_lock);
-   __mmap_lock_trace_acquire_returned(mm, false, ret == 0);
-   return ret;
+   error = down_read_killable(&mm->mmap_lock);
+   __mmap_lock_trace_acquire_returned(mm, false, !error);
+   return error;
 }
 
 static inline bool mmap_read_trylock(struct mm_struct *mm)
 {
-   bool ret;
+   bool ok;
 
__mmap_lock_trace_start_locking(mm, false);
-   ret = down_read_trylock(&mm->mmap_lock) != 0;
-   __mmap_lock_trace_acquire_returned(mm, false, ret);
-   return ret;
+   ok = down_read_trylock(&mm->mmap_lock) != 0;
+   __mmap_lock_trace_acquire_returned(mm, false, ok);
+   return ok;
 }
 
 static inline void mmap_read_unlock(struct mm_struct *mm)
-- 
2.20.1



[RFC PATCH 05/37] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

2021-04-06 Thread Michel Lespinasse
This configuration variable will be used to build the code needed to
handle speculative page fault.

This is enabled by default on supported architectures with SMP and MMU set.

The architecture support is needed since the speculative page fault handler
is called from the architecture's page faulting code, and some code has to
be added there to try speculative fault handling first.

Signed-off-by: Michel Lespinasse 
---
 mm/Kconfig | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..322bda319dea 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,26 @@ config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
bool
 
+config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+   def_bool n
+
+config SPECULATIVE_PAGE_FAULT
+   bool "Speculative page faults"
+   default y
+   depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT && MMU && SMP
+   help
+ Try to handle user space page faults without holding the mmap lock.
+
+ Instead of blocking writers through the use of mmap lock,
+ the page fault handler merely verifies, at the end of the page
+ fault, that no writers have been running concurrently with it.
+
+ In high concurrency situations, the speculative fault handler
+ gains a throughput advantage by avoiding having to update the
+ mmap lock reader count.
+
+ If the check fails due to a concurrent writer, or due to hitting
+ an unsupported case, the fault handler falls back to classical
+ processing using the mmap read lock.
+
 endmenu
-- 
2.20.1



[RFC PATCH 21/37] mm: implement speculative handling in do_swap_page()

2021-04-06 Thread Michel Lespinasse
If the pte is larger than long, use pte_spinlock() to lock the page table
when verifying the pte - pte_spinlock() is necessary to ensure the page
table is still valid when we are locking it.

Abort speculative faults if the pte is not a swap entry, or if the desired
page is not found in swap cache, to keep things as simple as possible.

Only use trylock when locking the swapped page - again to keep things
simple, and also because the usual lock_page_or_retry() would otherwise
try to release the mmap lock, which is not held in the speculative case.

Use pte_map_lock() to ensure proper synchronization when finally committing
the faulted page to the mm address space.

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 74 ++---
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fc555fae0844..ab3160719bf3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2637,30 +2637,6 @@ bool __pte_map_lock(struct vm_fault *vmf)
 
 #endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
 
-/*
- * handle_pte_fault chooses page fault handler according to an entry which was
- * read non-atomically.  Before making any commitment, on those architectures
- * or configurations (e.g. i386 with PAE) which might give a mix of unmatched
- * parts, do_swap_page must check under lock before unmapping the pte and
- * proceeding (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-   pte_t *page_table, pte_t orig_pte)
-{
-   int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
-   if (sizeof(pte_t) > sizeof(unsigned long)) {
-   spinlock_t *ptl = pte_lockptr(mm, pmd);
-   spin_lock(ptl);
-   same = pte_same(*page_table, orig_pte);
-   spin_unlock(ptl);
-   }
-#endif
-   pte_unmap(page_table);
-   return same;
-}
-
 static inline bool cow_user_page(struct page *dst, struct page *src,
 struct vm_fault *vmf)
 {
@@ -3369,12 +3345,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
return VM_FAULT_RETRY;
}
 
-   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
-   goto out;
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
+   if (sizeof(pte_t) > sizeof(unsigned long)) {
+   /*
+* vmf->orig_pte was read non-atomically. Before making
+* any commitment, on those architectures or configurations
+* (e.g. i386 with PAE) which might give a mix of
+* unmatched parts, we must check under lock before
+* unmapping the pte and proceeding.
+*
+* (but do_wp_page is only called after already making
+* such a check; and do_anonymous_page can safely
+* check later on).
+*/
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
+   if (!pte_same(*vmf->pte, vmf->orig_pte))
+   goto unlock;
+   spin_unlock(vmf->ptl);
+   }
+#endif
+   pte_unmap(vmf->pte);
+   vmf->pte = NULL;
 
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
-   if (is_migration_entry(entry)) {
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   ret = VM_FAULT_RETRY;
+   } else if (is_migration_entry(entry)) {
migration_entry_wait(vma->vm_mm, vmf->pmd,
 vmf->address);
} else if (is_device_private_entry(entry)) {
@@ -3395,8 +3393,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swapcache = page;
 
if (!page) {
-   struct swap_info_struct *si = swp_swap_info(entry);
+   struct swap_info_struct *si;
 
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+   return VM_FAULT_RETRY;
+   }
+
+   si = swp_swap_info(entry);
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
/* skip swapcache */
@@ -3459,7 +3463,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_release;
}
 
-   locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+   locked = trylock_page(page);
+   else
+   locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
 
delayacct_clear_flag(DELAYACCT_PF_SWAP

[RFC PATCH 03/37] do_anonymous_page: use update_mmu_tlb()

2021-04-06 Thread Michel Lespinasse
update_mmu_tlb() can be used instead of update_mmu_cache() when the
page fault handler detects that it lost the race to another page fault.

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5efa07fb6cdc..8ee4bd239303 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3567,7 +3567,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
if (!pte_none(*vmf->pte)) {
-   update_mmu_cache(vma, vmf->address, vmf->pte);
+   update_mmu_tlb(vma, vmf->address, vmf->pte);
goto release;
}
 
-- 
2.20.1



[RFC PATCH 14/37] mm: add pte_map_lock() and pte_spinlock()

2021-04-06 Thread Michel Lespinasse
pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.

The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.

In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.

If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.
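
Typical use when committing a faulted page, as in do_anonymous_page()
elsewhere in the series (sketch; computation of the pte entry omitted):

	if (!pte_map_lock(vmf))
		return VM_FAULT_RETRY;	/* raced with an mmap writer */
	if (!pte_none(*vmf->pte)) {
		/* Lost the race with another fault on the same address. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return 0;
	}
	set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, entry);
	update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return 0;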

Signed-off-by: Michel Lespinasse 
---
 include/linux/mm.h | 34 ++
 mm/memory.c| 71 ++
 2 files changed, 105 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dee8a4833779..f26490aff514 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3183,5 +3183,39 @@ extern int sysctl_nr_trim_pages;
 
 void mem_dump_obj(void *object);
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+
+bool __pte_map_lock(struct vm_fault *vmf);
+
+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+   VM_BUG_ON(vmf->pte);
+   return __pte_map_lock(vmf);
+}
+
+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+   VM_BUG_ON(!vmf->pte);
+   return __pte_map_lock(vmf);
+}
+
+#else  /* !CONFIG_SPECULATIVE_PAGE_FAULT */
+
+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
+  &vmf->ptl);
+   return true;
+}
+
+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+   spin_lock(vmf->ptl);
+   return true;
+}
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index a17704aac019..3e192d5f89a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2566,6 +2566,77 @@ int apply_to_existing_page_range(struct mm_struct *mm, 
unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+
+bool __pte_map_lock(struct vm_fault *vmf)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   pmd_t pmdval;
+#endif
+   pte_t *pte = vmf->pte;
+   spinlock_t *ptl;
+
+   if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+   if (!pte)
+   vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+   spin_lock(vmf->ptl);
+   return true;
+   }
+
+   local_irq_disable();
+   if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+   goto fail;
+   /*
+* The mmap sequence count check guarantees that the page
+* tables are still valid at that point, and having IRQs
+* disabled ensures that they stay around (see Fast GUP
+* comment in mm/gup.c).
+*/
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   /*
+* We check if the pmd value is still the same to ensure that there
+* is not a huge collapse operation in progress in our back.
+*/
+   pmdval = READ_ONCE(*vmf->pmd);
+   if (!pmd_same(pmdval, vmf->orig_pmd))
+   goto fail;
+#endif
+   ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+   if (!pte)
+   pte = pte_offset_map(vmf->pmd, vmf->address);
+   /*
+* Try locking the page table.
+*
+* Note that we might race against zap_pte_range() which
+* invalidates TLBs while holding the page table lock.
+* We still have local IRQs disabled here to prevent the
+* page table from being reclaimed, and zap_pte_range() could
+* thus deadlock with us if we tried using spin_lock() here.
+*
+* We also don't want to retry until spin_trylock() succeeds,
+* because of the starvation potential against a stream of lockers.
+*/
+   if (unlikely(!spin_trylock(ptl)))
+   goto fail;
+   if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+   goto unlock_fail;
+   local_irq_enable();
+   vmf->pte = pte;
+   vmf->ptl = ptl;
+   return true;
+
+unlock_fail:
+   spin_unlock(ptl);
+fail:
+   if (pte)
+   pte_unmap(pte);
+   local_irq_enable();
+   return false;
+}
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry which was
  * read non-atomically.  Before making any commitment, on those architectures
-- 
2.20.1



[RFC PATCH 06/37] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2021-04-06 Thread Michel Lespinasse
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT so that the speculative fault
handling code can be compiled on this architecture.

Signed-off-by: Michel Lespinasse 
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..a93e4ed7040e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -33,6 +33,7 @@ config X86_64
select NEED_DMA_MAP_STATE
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
+   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
 
 config FORCE_DYNAMIC_FTRACE
def_bool y
-- 
2.20.1



[RFC PATCH 15/37] mm: implement speculative handling in do_anonymous_page()

2021-04-06 Thread Michel Lespinasse
Change do_anonymous_page() to handle the speculative case.
This involves aborting speculative faults if they have to allocate a new
anon_vma, and using pte_map_lock() instead of pte_offset_map_lock()
to complete the page fault.

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3e192d5f89a6..fd84576f9c01 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3598,8 +3598,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vma->vm_page_prot));
} else {
/* Allocate our own private page. */
-   if (unlikely(anon_vma_prepare(vma)))
-   goto oom;
+   if (unlikely(!vma->anon_vma)) {
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+   return VM_FAULT_RETRY;
+   if (__anon_vma_prepare(vma))
+   goto oom;
+   }
page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
if (!page)
goto oom;
@@ -3620,8 +3624,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
entry = pte_mkwrite(pte_mkdirty(entry));
}
 
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-   &vmf->ptl);
+   if (!pte_map_lock(vmf)) {
+   ret = VM_FAULT_RETRY;
+   goto release;
+   }
if (!pte_none(*vmf->pte)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
goto unlock;
@@ -3636,6 +3642,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
if (page)
put_page(page);
+   if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+   return VM_FAULT_RETRY;
return handle_userfault(vmf, VM_UFFD_MISSING);
}
 
@@ -3653,6 +3661,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return 0;
 unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
+release:
if (page)
put_page(page);
return ret;
-- 
2.20.1



[RFC PATCH 22/37] mm: enable speculative fault handling through do_swap_page()

2021-04-06 Thread Michel Lespinasse
Change do_swap_page() to allow speculative fault execution to proceed.

Signed-off-by: Michel Lespinasse 
---
 mm/memory.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ab3160719bf3..6eddd7b4e89c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3340,11 +3340,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;
 
-   if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
-   pte_unmap(vmf->pte);
-   return VM_FAULT_RETRY;
-   }
-
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
/*
-- 
2.20.1



Re: [PATCH] alpha/kernel/traps: Use find_vma_intersection() in traps for setting si_code

2021-04-01 Thread Michel Lespinasse
You are correct that find_vma is insufficient for what's intended
here, and that find_vma_intersection fixes it.
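
For reference: find_vma() returns the lowest vma with vm_end > addr,
even when addr sits in an unmapped hole below vma->vm_start, whereas
find_vma_intersection() additionally requires the given range to
overlap the vma. Roughly, as defined in include/linux/mm.h:

static inline struct vm_area_struct *
find_vma_intersection(struct mm_struct *mm, unsigned long start_addr,
		      unsigned long end_addr)
{
	struct vm_area_struct *vma = find_vma(mm, start_addr);

	if (vma && end_addr <= vma->vm_start)
		vma = NULL;	/* addr lies in a hole before the next vma */
	return vma;
}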

I'll let the arch maintainers speak of what the consequences of the
changed si_code would be - the bug has been here so long that I would
worry some userspace might have come to depend on it (the old "common
law feature" issue). Just a concern I have, with zero evidence behind
it, so I hope it turns out not to be an actual issue.

Acked-by: Michel Lespinasse 

On Thu, Apr 1, 2021 at 12:51 PM Liam Howlett  wrote:
>
> find_vma() will continue to search upwards until the end of the virtual
> memory space.  This means the si_code would almost never be set to
> SEGV_MAPERR even when the address falls outside of any VMA.
>
> Using find_vma_intersection() allows for what is intended by only
> returning a VMA if it falls within the range provided, in this case a
> window of 1.
>
> Signed-off-by: Liam R. Howlett 
> ---
>  arch/alpha/kernel/traps.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/alpha/kernel/traps.c b/arch/alpha/kernel/traps.c
> index 921d4b6e4d95..7f51386c06d0 100644
> --- a/arch/alpha/kernel/traps.c
> +++ b/arch/alpha/kernel/traps.c
> @@ -957,8 +957,10 @@ do_entUnaUser(void __user * va, unsigned long opcode,
> si_code = SEGV_ACCERR;
> else {
> struct mm_struct *mm = current->mm;
> +   unsigned long addr = (unsigned long)va;
> +
> mmap_read_lock(mm);
> -   if (find_vma(mm, (unsigned long)va))
> +   if (find_vma_intersection(mm, addr, addr + 1))
> si_code = SEGV_ACCERR;
> else
> si_code = SEGV_MAPERR;
> --
> 2.30.0



-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition

2020-10-09 Thread Michel Lespinasse
On Fri, Oct 9, 2020 at 3:05 PM Axel Rasmussen  wrote:
> The goal of these tracepoints is to be able to debug lock contention
> issues. This lock is acquired on most (all?) mmap / munmap / page fault
> operations, so a multi-threaded process which does a lot of these can
> experience significant contention.
>
> We trace just before we start acquisition, when the acquisition returns
> (whether it succeeded or not), and when the lock is released (or
> downgraded). The events are broken out by lock type (read / write).
>
> The events are also broken out by memcg path. For container-based
> workloads, users often think of several processes in a memcg as a single
> logical "task", so collecting statistics at this level is useful.
>
> The end goal is to get latency information. This isn't directly included
> in the trace events. Instead, users are expected to compute the time
> between "start locking" and "acquire returned", using e.g. synthetic
> events or BPF. The benefit we get from this is simpler code.
>
> Because we use tracepoint_enabled() to decide whether or not to trace,
> this patch has effectively no overhead unless tracepoints are enabled at
> runtime. If tracepoints are enabled, there is a performance impact, but
> how much depends on exactly what e.g. the BPF program does.
>
> Signed-off-by: Axel Rasmussen 

Reviewed-by: Michel Lespinasse 

Looks good to me, thanks!


Re: [PATCH v2 2/2] mmap_lock: add tracepoints around lock acquisition

2020-10-08 Thread Michel Lespinasse
On Wed, Oct 7, 2020 at 11:44 AM Axel Rasmussen  wrote:
> The goal of these tracepoints is to be able to debug lock contention
> issues. This lock is acquired on most (all?) mmap / munmap / page fault
> operations, so a multi-threaded process which does a lot of these can
> experience significant contention.
>
> We trace just before we start acquisition, when the acquisition returns
> (whether it succeeded or not), and when the lock is released (or
> downgraded). The events are broken out by lock type (read / write).
>
> The events are also broken out by memcg path. For container-based
> workloads, users often think of several processes in a memcg as a single
> logical "task", so collecting statistics at this level is useful.
>
> The end goal is to get latency information. This isn't directly included
> in the trace events. Instead, users are expected to compute the time
> between "start locking" and "acquire returned", using e.g. synthetic
> events or BPF. The benefit we get from this is simpler code.
>
> Because we use tracepoint_enabled() to decide whether or not to trace,
> this patch has effectively no overhead unless tracepoints are enabled at
> runtime. If tracepoints are enabled, there is a performance impact, but
> how much depends on exactly what e.g. the BPF program does.
>
> Signed-off-by: Axel Rasmussen 

Thanks for working on this.

I like that there is no overhead unless CONFIG_TRACING is set.
However, I think the __mmap_lock_traced_lock and similar functions are
the wrong level of abstraction, especially considering that we are
considering to switch away from using rwsem as the underlying lock
implementation. Would you consider something along the following lines
instead for include/linux/mmap_lock.h ?

#ifdef CONFIG_TRACING

DECLARE_TRACEPOINT(...);

void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write);

static inline void mmap_lock_trace_start_locking(struct mm_struct *mm,
bool write)
{
  if (tracepoint_enabled(mmap_lock_start_locking))
__mmap_lock_do_trace_start_locking(mm, write);
}

#else

static inline void mmap_lock_trace_start_locking(struct mm_struct *mm,
bool write) {}

#endif

static inline void mmap_write_lock(struct mm_struct *mm)
{
  mmap_lock_trace_start_locking(mm, true);
  down_write(&mm->mmap_lock);
  mmap_lock_trace_acquire_returned(mm, true, true);
}
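
(For completeness, I imagine the "acquire returned" side looking much the
same -- sketch only, with names assumed to mirror the tracepoints in your
patch:)

void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm,
                                           bool write, bool success);

static inline void mmap_lock_trace_acquire_returned(struct mm_struct *mm,
                                                    bool write, bool success)
{
        if (tracepoint_enabled(mmap_lock_acquire_returned))
                __mmap_lock_do_trace_acquire_returned(mm, write, success);
}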

I think this is more straightforward, and also the
mmap_lock_trace_start_locking and similar functions don't depend on
the underlying lock implementation.

The changes to the other files look fine to me.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v2 1/2] tracing: support "bool" type in synthetic trace events

2020-10-08 Thread Michel Lespinasse
On Wed, Oct 7, 2020 at 11:44 AM Axel Rasmussen  wrote:
> It's common [1] to define tracepoint fields as "bool" when they contain
> a true / false value. Currently, defining a synthetic event with a
> "bool" field yields EINVAL. It's possible to work around this by using
> e.g. u8 (assuming sizeof(bool) is 1, and bool is unsigned; if either of
> these properties don't match, you get EINVAL [2]).
>
> Supporting "bool" explicitly makes hooking this up easier and more
> portable for userspace.

Acked-by: Michel Lespinasse 

Looks fine to me, but you really want to get Steven's opinion over mine here :)


Re: [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm

2020-10-03 Thread Michel Lespinasse
On Fri, Oct 2, 2020 at 9:33 AM Jann Horn  wrote:
> On Fri, Oct 2, 2020 at 11:18 AM Michel Lespinasse  wrote:
> > On Thu, Oct 1, 2020 at 6:25 PM Jann Horn  wrote:
> > > Until now, the mmap lock of the nascent mm was ordered inside the mmap 
> > > lock
> > > of the old mm (in dup_mmap() and in UML's activate_mm()).
> > > A following patch will change the exec path to very broadly lock the
> > > nascent mm, but fine-grained locking should still work at the same time 
> > > for
> > > the new mm.
> > > To do this in a way that lockdep is happy about, let's turn around the 
> > > lock
> > > ordering in both places that currently nest the locks.
> > > Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
> > > make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
> > > instead.
> > >
> > > The added locking calls in exec_mmap() are temporary; the following patch
> > > will move the locking out of exec_mmap().
> >
> > Thanks for doing this.
> >
> > This is probably a silly question, but I am not sure exactly where we
> > lock the old MM while bprm is creating the new MM ? I am guessing this
> > would be only in setup_arg_pages(), copying the args and environment
> > from the old to the new MM ? If that is correct, then wouldn't it be
> > sufficient to use mmap_write_lock_nested in setup_arg_pages() ? Or, is
> > the issue that we'd prefer to have a killable version of it there ?
>
> We're also implicitly locking the old MM anytime we take page faults
> before exec_mmap(), which basically means the various userspace memory
> accesses in do_execveat_common(). This happens after bprm_mm_init(),
> so we've already set bprm->vma at that point.

Ah yes, I see the issue now. It would be much nicer if copy_strings
could coax copy_from_user into taking a nested lock, but of course
there is no way to do that.

I'm not sure if it'd be reasonable to kmap the source pages like we do
for the destination pages ?

Adding a nascent lock instead of a nested lock, as you propose, seems
to work, but it also looks quite unusual. Not that I have anything
better to propose at this point though...


Unrelated to the above: copy_from_user and copy_to_user should not be
called with mmap_lock held; it may be worth adding these assertions
too (probably in separate patches) ?
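
Something along these lines, maybe (rough sketch, the name is made up, and
it would obviously only do anything with lockdep enabled):

static inline void mmap_assert_not_held(struct mm_struct *mm)
{
#ifdef CONFIG_LOCKDEP
        if (debug_locks)
                WARN_ON_ONCE(lockdep_is_held(&mm->mmap_lock));
#endif
}

to be called with current->mm (when non-NULL) near the top of
copy_from_user() / copy_to_user().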


> Uuugh, dammit, I see what happened. Sorry about the trouble. Thanks
> for telling me, guess I'll go back to sending patches the way I did it
> before. :/

Yeah, I've hit such issues with gmail before too :/

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm

2020-10-02 Thread Michel Lespinasse
On Thu, Oct 1, 2020 at 6:25 PM Jann Horn  wrote:
> Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
> of the old mm (in dup_mmap() and in UML's activate_mm()).
> A following patch will change the exec path to very broadly lock the
> nascent mm, but fine-grained locking should still work at the same time for
> the new mm.
> To do this in a way that lockdep is happy about, let's turn around the lock
> ordering in both places that currently nest the locks.
> Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
> make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
> instead.
>
> The added locking calls in exec_mmap() are temporary; the following patch
> will move the locking out of exec_mmap().

Thanks for doing this.

This is probably a silly question, but I am not sure exactly where we
lock the old MM while bprm is creating the new MM ? I am guessing this
would be only in setup_arg_pages(), copying the args and environment
from the old to the new MM ? If that is correct, then wouldn't it be
sufficient to use mmap_write_lock_nested in setup_arg_pages() ? Or, is
the issue that we'd prefer to have a killable version of it there ?

Also FYI I was going to play with these patches a bit to help answer
these questions on my own, but wasn't able to easily apply them as
they came lightly mangled (whitespace issues) when I saved them.


Re: [PATCH 3/4] mmap locking API: Don't check locking if the mm isn't live yet

2020-09-30 Thread Michel Lespinasse
On Wed, Sep 30, 2020 at 1:15 PM Jann Horn  wrote:
> On Wed, Sep 30, 2020 at 2:50 PM Jann Horn  wrote:
> > On Wed, Sep 30, 2020 at 2:30 PM Jason Gunthorpe  wrote:
> > > On Tue, Sep 29, 2020 at 06:20:00PM -0700, Jann Horn wrote:
> > > > In preparation for adding a mmap_assert_locked() check in
> > > > __get_user_pages(), teach the mmap_assert_*locked() helpers that it's 
> > > > fine
> > > > to operate on an mm without locking in the middle of execve() as long as
> > > > it hasn't been installed on a process yet.
> > >
> > > I'm happy to see lockdep being added here, but can you elaborate on
> > > why add this mmap_locked_required instead of obtaining the lock in the
> > > execv path?
> >
> > My thinking was: At that point, we're logically still in the
> > single-owner initialization phase of the mm_struct. Almost any object
> > has initialization and teardown steps that occur in a context where
> > the object only has a single owner, and therefore no locking is
> > required. It seems to me that adding locking in places like
> > get_arg_page() would be confusing because it would suggest the
> > existence of concurrency where there is no actual concurrency, and it
> > might be annoying in terms of lockdep if someone tries to use
> > something like get_arg_page() while holding the mmap_sem of the
> > calling process. It would also mean that we'd be doing extra locking
> > in normal kernel builds that isn't actually logically required.
> >
> > Hmm, on the other hand, dup_mmap() already locks the child mm (with
> > mmap_write_lock_nested()), so I guess it wouldn't be too bad to also
> > do it in get_arg_page() and tomoyo_dump_page(), with comments that
> > note that we're doing this for lockdep consistency... I guess I can go
> > change this in v2.
>
> Actually, I'm taking that back. There's an extra problem:
> get_arg_page() accesses bprm->vma, which is set all the way back in
> __bprm_mm_init(). We really shouldn't be pretending that we're
> properly taking the mmap_sem when actually, we keep reusing a
> vm_area_struct pointer.
>
> So for that reason I prefer the approach in the existing patch, where
> we make it clear that mm_struct has two different lifetime phases in
> which GUP works, and that those lifetime phases have very different
> locking requirements.
>
> Does that sound reasonable?

I'm really not a fan of adding such exceptions; I think it's both
unusual and adds complexity that is not strictly confined to the
init paths.

I don't really understand the concern with the bprm vma in
get_arg_page(); I'm not super familiar with this code but isn't it a
normal vma within the process that __do_execve_file() is creating ? I
received Jason's last email while I was composing this one, but I
think I have the same concern/approach as him, i.e. I think it would
be simplest to keep the new MM locked through the __do_execve_file()
call and avoid adding the mmap_lock_required exception to the
mmap_assert_locked rule.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 4/4] mm/gup: Assert that the mmap lock is held in __get_user_pages()

2020-09-30 Thread Michel Lespinasse
On Wed, Sep 30, 2020 at 5:32 AM Jason Gunthorpe  wrote:
> On Tue, Sep 29, 2020 at 06:20:01PM -0700, Jann Horn wrote:
> > After having cleaned up all GUP callers (except for the atomisp staging
> > driver, which currently gets mmap locking completely wrong [1]) to always
> > ensure that they hold the mmap lock when calling into GUP (unless the mm is
> > not yet globally visible), add an assertion to make sure it stays that way
> > going forward.

Thanks for doing this, there is a lot of value in ensuring that a
function's callers follows the prerequisites.

Acked-by: Michel Lespinasse 


Re: [PATCH 2/4] binfmt_elf: Take the mmap lock around find_extend_vma()

2020-09-30 Thread Michel Lespinasse
On Tue, Sep 29, 2020 at 6:20 PM Jann Horn  wrote:
> create_elf_tables() runs after setup_new_exec(), so other tasks can
> already access our new mm and do things like process_madvise() on it.
> (At the time I'm writing this commit, process_madvise() is not in mainline
> yet, but has been in akpm's tree for some time.)
>
> While I believe that there are currently no APIs that would actually allow
> another process to mess up our VMA tree (process_madvise() is limited to
> MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
> under which no syscalls have been executed yet), this seems like an
> accident waiting to happen.
>
> Let's make sure that we always take the mmap lock around GUP paths as long
> as another process might be able to see the mm.
>
> (Yes, this diff looks suspicious because we drop the lock before doing
> anything with `vma`, but that's because we actually don't do anything with
> it apart from the NULL check.)
>
> Signed-off-by: Jann Horn 

Thanks for these cleanups :)
Acked-by: Michel Lespinasse 


Re: [PATCH 1/4] mm/gup_benchmark: Take the mmap lock around GUP

2020-09-30 Thread Michel Lespinasse
On Tue, Sep 29, 2020 at 6:20 PM Jann Horn  wrote:
> To be safe against concurrent changes to the VMA tree, we must take the
> mmap lock around GUP operations (excluding the GUP-fast family of
> operations, which will take the mmap lock by themselves if necessary).
>
> This code is only for testing, and it's only reachable by root through
> debugfs, so this doesn't really have any impact; however, if we want to add
> lockdep asserts into the GUP path, we need to have clean locking here.
>
> Signed-off-by: Jann Horn 

Acked-by: Michel Lespinasse 

Thanks for these cleanups :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: Lockdep question regarding two-level locks

2020-08-22 Thread Michel Lespinasse
On Sat, Aug 22, 2020 at 9:04 AM Michel Lespinasse  wrote:
> - B's implementation could, when lockdep is enabled, always release
> lock A before acquiring lock B. This is not ideal though, since this
> would hinder testing of the not-blocked code path in the acquire
> sequence.

Actually, this may be an acceptable way to handle my issue. In the
non-blocking case, B's implementation does not have to actually
release A, but it could tell lockdep that it's released A, acquired B
and acquired A again. Kinda ugly but should work...
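
In code, the idea would be something like this (rough sketch only; lock A
is assumed to be a mutex here, and "struct B_lock" / b_dep_map are made-up
stand-ins for the range lock and its lockdep map):

#ifdef CONFIG_DEBUG_LOCK_ALLOC
static void B_lockdep_reorder(struct mutex *a, struct B_lock *b)
{
        /*
         * We never actually dropped A, but make lockdep record the
         * sequence "release A, acquire B, re-acquire A", so that the
         * dependency it learns is B -> A rather than A -> B.
         */
        mutex_release(&a->dep_map, _RET_IP_);
        lock_acquire_exclusive(&b->b_dep_map, 0, 0, NULL, _RET_IP_);
        mutex_acquire(&a->dep_map, 0, 0, _RET_IP_);
}
#endif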

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: Lockdep question regarding two-level locks

2020-08-22 Thread Michel Lespinasse
On Sat, Aug 22, 2020 at 9:39 AM  wrote:
> On Sat, Aug 22, 2020 at 09:04:09AM -0700, Michel Lespinasse wrote:
> > Hi,
> >
> > I am wondering about how to describe the following situation to lockdep:
> >
> > - lock A would be something that's already implemented (a mutex or
> > possibly a spinlock).
> > - lock B is a range lock, which I would be writing the code for
> > (including lockdep hooks). I do not expect lockdep to know about range
> > locking, but I want it to treat lock B like any other and detect lock
> > ordering issues related to it.
> > - lock A protects a number of structures, including lock B's list of
> > locked ranges, but other structures as well.
> > - lock A is intended to be held for only short periods of time, lock
> > B's ranges might be held for longer.
>
> That's the 'normal' state for blocking locks, no?
>
> See how both struct mutex and struct rw_semaphore have internal locks.

Right (note that I already have an implementation of my range lock
('B') along these lines, with a low-level lock ('A') tucked into it
and all the expected lockdep support).

> > Usage would be along the following lines:
> >
> > acquire:
> > A_lock();
> > // might access data protected by A here
> > bool blocked = B_lock(range); // must be called under lock A; will
> > release lock A if blocking on B.
> > // might access data protected by A here (especially to re-validate in
> > case A was released while blocking on B...)
> > A_unlock()
> >
> > release:
> > A_lock()
> > // might access data protected by A here
> > A_B_unlock(range); // must be called under lock A; releases locks A and B.
>
> up_{read,write}() / mutex_unlock() release 'B', the actual lock, early,
> and then take 'A', the internal lock, to actually implement the release.
>
> That way lockdep doesn't see the B-A order :-)
>
> > There might also be other places that need to lock A for a short time,
> > either inside and outside of lock B.
>
> Any cases that aren't covered by the current implementation of rwsems ?

My issue is that I have other data, unrelated to B, which frequently
needs to be accessed or updated at just the same points where we are
acquiring or releasing B.

(to go into use cases: that data would be the vma rbtree and per-MM
statistics; one may want to find a free gap between vmas before range
locking it, or take vmas in and out of the rbtree as we acquire or
release a lock on the corresponding address range, etc...)

As the accesses to that data tend to naturally align with the places
where we take or release the B lock, it is tempting to expose A to the
caller so that A can protect additional data not directly related to
B. It seems like changing B's internal lock into a public one which
the caller would take and release explicitly around B calls would be
straighforward, but it causes issues as lockdep doesn't understand
that construct...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Lockdep question regarding two-level locks

2020-08-22 Thread Michel Lespinasse
Hi,

I am wondering about how to describe the following situation to lockdep:

- lock A would be something that's already implemented (a mutex or
possibly a spinlock).
- lock B is a range lock, which I would be writing the code for
(including lockdep hooks). I do not expect lockdep to know about range
locking, but I want it to treat lock B like any other and detect lock
ordering issues related to it.
- lock A protects a number of structures, including lock B's list of
locked ranges, but other structures as well.
- lock A is intended to be held for only short periods of time, lock
B's ranges might be held for longer.

Usage would be along the following lines:

acquire:
A_lock();
// might access data protected by A here
bool blocked = B_lock(range); // must be called under lock A; will
release lock A if blocking on B.
// might access data protected by A here (especially to re-validate in
case A was released while blocking on B...)
A_unlock()

release:
A_lock()
// might access data protected by A here
A_B_unlock(range); // must be called under lock A; releases locks A and B.

There might also be other places that need to lock A for a short time,
either inside and outside of lock B.

The correct lock ordering here is that lock A can be acquired while
holding lock B. However, the acquire sequence here seems to violate
that, as A must be locked before B there. In reality, the usage
pattern does not create circular dependencies, because lock A would be
released if blocking on lock B. However, I am not sure how to convey
that usage pattern to lockdep.


A few options I am considering:

- Is there a way to indicate to lockdep, in B's locking function
definition, that I am acquiring B after A but really want the lock
order to be registered as A after B, since I know how to avoid the
circular dependency issue by releasing A if blocking on B ?

- B's locking function definition could tell lockdep that B was
acquired with a trylock. This avoids lockdep reporting a lock ordering
issue between A and B, but also will make lockdep ignore lock ordering
issues between any other lock and B. So this is not a proper solution,
as we may just as well not implement lockdep support in lock B in that
case.

- B's implementation could, when lockdep is enabled, always release
lock A before acquiring lock B. This is not ideal though, since this
would hinder testing of the not-blocked code path in the acquire
sequence.

Would the lockdep maintainers have any guidance as to how to handle
this locking case ?

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v2 2/2] mm: proc: smaps_rollup: do not stall write attempts on mmap_lock

2020-08-14 Thread Michel Lespinasse
On Wed, Aug 12, 2020 at 7:13 PM Chinwen Chang
 wrote:
> smaps_rollup will try to grab mmap_lock and go through the whole vma
> list until it finishes the iterating. When encountering large processes,
> the mmap_lock will be held for a longer time, which may block other
> write requests like mmap and munmap from progressing smoothly.
>
> There are upcoming mmap_lock optimizations like range-based locks, but
> the lock applied to smaps_rollup would be the coarse type, which doesn't
> avoid the occurrence of unpleasant contention.
>
> To solve aforementioned issue, we add a check which detects whether
> anyone wants to grab mmap_lock for write attempts.

I think your retry mechanism still doesn't handle all cases. When you
get back the mmap lock, the address where you stopped last time could
now be in the middle of a vma. I think the consistent thing to do in
that case would be to retry scanning from the address you stopped at,
even if it's not on a vma boundary anymore. You may have to change
smap_gather_stats to support that, though.
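
Roughly something like this (sketch only, untested; smap_gather_stats_from()
is a made-up variant of smap_gather_stats() that starts from a given
address inside the vma):

static void gather_stats_dropping_lock(struct mm_struct *mm,
                                       struct mem_size_stats *mss)
{
        struct vm_area_struct *vma;
        unsigned long restart = 0;

        mmap_read_lock(mm);
        while ((vma = find_vma(mm, restart)) != NULL) {
                /* restart may now fall in the middle of this vma */
                smap_gather_stats_from(vma, mss, restart);
                restart = vma->vm_end;
                if (mmap_lock_is_contended(mm)) {
                        mmap_read_unlock(mm);
                        cond_resched();
                        mmap_read_lock(mm);
                }
        }
        mmap_read_unlock(mm);
}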


Re: [PATCH v2 1/2] mmap locking API: add mmap_lock_is_contended()

2020-08-14 Thread Michel Lespinasse
On Wed, Aug 12, 2020 at 7:14 PM Chinwen Chang
 wrote:
>
> Add new API to query if someone wants to acquire mmap_lock
> for write attempts.
>
> Using this instead of rwsem_is_contended makes it more tolerant
> of future changes to the lock type.
>
> Signed-off-by: Chinwen Chang 

Acked-by: Michel Lespinasse 
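
(For context, I am assuming this ends up being just a thin wrapper along
the following lines, so that callers never poke at the rwsem directly:)

static inline bool mmap_lock_is_contended(struct mm_struct *mm)
{
        return rwsem_is_contended(&mm->mmap_lock);
}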


Re: [PATCH v2 0/2] Try to release mmap_lock temporarily in smaps_rollup

2020-08-14 Thread Michel Lespinasse
On Thu, Aug 13, 2020 at 9:11 AM Chinwen Chang
 wrote:
> On Thu, 2020-08-13 at 02:53 -0700, Michel Lespinasse wrote:
> > On Wed, Aug 12, 2020 at 7:14 PM Chinwen Chang
> >  wrote:
> > > Recently, we have observed some janky issues caused by unpleasantly long
> > > contention on mmap_lock which is held by smaps_rollup when probing large
> > > processes. To address the problem, we let smaps_rollup detect if anyone
> > > wants to acquire mmap_lock for write attempts. If yes, just release the
> > > lock temporarily to ease the contention.
> > >
> > > smaps_rollup is a procfs interface which allows users to summarize the
> > > process's memory usage without the overhead of seq_* calls. Android uses 
> > > it
> > > to sample the memory usage of various processes to balance its memory pool
> > > sizes. If no one wants to take the lock for write requests, smaps_rollup
> > > with this patch will behave like the original one.
> > >
> > > Although there are on-going mmap_lock optimizations like range-based 
> > > locks,
> > > the lock applied to smaps_rollup would be the coarse one, which is hard to
> > > avoid the occurrence of aforementioned issues. So the detection and
> > > temporary release for write attempts on mmap_lock in smaps_rollup is still
> > > necessary.
> >
> > I do not mind extending the mmap lock API as needed. However, in the
> > past I have tried adding rwsem_is_contended to mlock(), and later to
> > mm_populate() paths, and IIRC gotten pushback on it both times. I
> > don't feel strongly on this, but would prefer if someone else approved
> > the rwsem_is_contended() use case.
> >
> Hi Michel,
>
> Thank you for your kind feedback.
>
> In my opinion, the difference between the case in smaps_rollup and the
> one in your example is that, for the former, the interference comes from
> the outside of the affected process, for the latter, it doesn't.
>
> In other words, anyone may use smaps_rollup to probe the affected
> process without the information about what step the affected one is
> executing.

Thanks, that is a good enough point for me :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v2 0/2] Try to release mmap_lock temporarily in smaps_rollup

2020-08-13 Thread Michel Lespinasse
On Wed, Aug 12, 2020 at 7:14 PM Chinwen Chang
 wrote:
> Recently, we have observed some janky issues caused by unpleasantly long
> contention on mmap_lock which is held by smaps_rollup when probing large
> processes. To address the problem, we let smaps_rollup detect if anyone
> wants to acquire mmap_lock for write attempts. If yes, just release the
> lock temporarily to ease the contention.
>
> smaps_rollup is a procfs interface which allows users to summarize the
> process's memory usage without the overhead of seq_* calls. Android uses it
> to sample the memory usage of various processes to balance its memory pool
> sizes. If no one wants to take the lock for write requests, smaps_rollup
> with this patch will behave like the original one.
>
> Although there are on-going mmap_lock optimizations like range-based locks,
> the lock applied to smaps_rollup would be the coarse one, which is hard to
> avoid the occurrence of aforementioned issues. So the detection and
> temporary release for write attempts on mmap_lock in smaps_rollup is still
> necessary.

I do not mind extending the mmap lock API as needed. However, in the
past I have tried adding rwsem_is_contended to mlock(), and later to
mm_populate() paths, and IIRC gotten pushback on it both times. I
don't feel strongly on this, but would prefer if someone else approved
the rwsem_is_contended() use case.

Couple related questions, how many VMAs are we looking at here ? Would
need_resched() be workable too ?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH] RISC-V: Acquire mmap lock before invoking walk_page_range

2020-06-17 Thread Michel Lespinasse
On Wed, Jun 17, 2020 at 1:38 PM Atish Patra  wrote:
> As per walk_page_range documentation, mmap lock should be acquired by the
> caller before invoking walk_page_range. mmap_assert_locked gets triggered
> without that. The details can be found here.
>
> http://lists.infradead.org/pipermail/linux-riscv/2020-June/010335.html
>
> Fixes: 395a21ff859c(riscv: add ARCH_HAS_SET_DIRECT_MAP support)
> Signed-off-by: Atish Patra 

Thanks for the fix.

Reviewed-by: Michel Lespinasse 


Re: [PATCH] openrisc: Fix boot oops when DEBUG_VM is enabled

2020-06-17 Thread Michel Lespinasse
On Wed, Jun 17, 2020 at 2:03 AM Stafford Horne  wrote:
> Since v5.8-rc1 OpenRISC Linux fails to boot when DEBUG_VM is enabled.
> This has been bisected to commit 42fc541404f2 ("mmap locking API: add
> mmap_assert_locked() and mmap_assert_write_locked()").
>
> The added locking checks exposed the issue that OpenRISC was not taking
> this mmap lock when during page walks for DMA operations.  This patch
> locks and unlocks the mmap lock for page walking.
>
> Fixes: 42fc541404f2 ("mmap locking API: add mmap_assert_locked() and 
> mmap_assert_write_locked()"
> Cc: Michel Lespinasse 
> Signed-off-by: Stafford Horne 

Thanks for the fix.

Reviewed-by: Michel Lespinasse 


Re: mm lock issue while booting Linux on 5.8-rc1 for RISC-V

2020-06-17 Thread Michel Lespinasse
On Tue, Jun 16, 2020 at 11:07 PM Stafford Horne  wrote:
> On Wed, Jun 17, 2020 at 02:35:39PM +0900, Stafford Horne wrote:
> > On Tue, Jun 16, 2020 at 01:47:24PM -0700, Michel Lespinasse wrote:
> > > This makes me wonder actually - maybe there is a latent bug that got
> > > exposed after my change added the rwsem_is_locked assertion to the
> > > lockdep_assert_held one. If that is the case, it may be helpful to
> > > bisect when that issue first appeared, by testing before my patchset
> > > with VM_BUG_ON(!rwsem_is_locked(&mm->mmap_lock)) added to
> > > walk_page_range() / walk_page_range_novma() / walk_page_vma() ...
> >
> > Hello,
> >
> > I tried to bisect it, but I think this issue goes much further back.
> >
> > Just with the below patch booting fails all the way back to v5.7.
> >
> > What does this mean by the way, why would mmap_assert_locked() want to 
> > assert
> > that the rwsem_is_locked() is not true?

It's the opposite - VM_BUG_ON(cond) triggers if cond is true, so in
other words it asserts that cond is false. Yeah, I agree it is kinda
confusing. But in our case, it asserts that the rwsem is locked, which
is what we want.

> The openrisc code that was walking the page ranges was not locking mm. I have
> added the  below patch to v5.8-rc1 and it seems to work fine.  I will send a
> better patch in a bit.
>
> iff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
> index c152a68811dd..bd5f05dd9174 100644
> --- a/arch/openrisc/kernel/dma.c
> +++ b/arch/openrisc/kernel/dma.c
> @@ -74,8 +74,10 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
>  * We need to iterate through the pages, clearing the dcache for
>  * them and setting the cache-inhibit bit.
>  */
> +   mmap_read_lock(&init_mm);
> error = walk_page_range(&init_mm, va, va + size,
> &set_nocache_walk_ops, NULL);
> +   mmap_read_unlock(&init_mm);
> if (error)
> return ERR_PTR(error);
> return cpu_addr;
> @@ -85,9 +87,11 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
>  {
> unsigned long va = (unsigned long)cpu_addr;
>
> +   mmap_read_lock(&init_mm);
> /* walk_page_range shouldn't be able to fail here */
> WARN_ON(walk_page_range(&init_mm, va, va + size,
> &set_nocache_walk_ops, NULL));
> +   mmap_read_unlock(&init_mm);
>  }

Thanks a lot for getting to the bottom of this. I think this is the proper fix.


Re: mm lock issue while booting Linux on 5.8-rc1 for RISC-V

2020-06-16 Thread Michel Lespinasse
This makes me wonder actually - maybe there is a latent bug that got
exposed after my change added the rwsem_is_locked assertion to the
lockdep_assert_held one. If that is the case, it may be helpful to
bisect when that issue first appeared, by testing before my patchset
with VM_BUG_ON(!rwsem_is_locked(&mm->mmap_lock)) added to
walk_page_range() / walk_page_range_novma() / walk_page_vma() ...

On Tue, Jun 16, 2020 at 12:41 PM Atish Patra  wrote:
>
> On Tue, Jun 16, 2020 at 12:19 PM Stafford Horne  wrote:
> >
> > On Tue, Jun 16, 2020 at 03:44:49AM -0700, Michel Lespinasse wrote:
> > > I am also unable to reproduce the issue so far.
> > >
> > > I wanted to point to a few things in case this helps:
> > > - Commit 42fc541404f2 was bisected as the cause. This commit changes
> > > walk_page_range_novma() to use mmap_assert_locked() instead of
> > > lockdep_assert_held()
> > > - mmap_assert_locked() checks lockdep_assert_held(), but also checks
> > > that the rwsem itself is locked.
> > >
> > > Now how could lockdep think the lock is held, but the lock itself is
> > > not marked as locked ???
> > >
> > > I'm not sure if it helps at all, but a few commits earlier,
> > > 0cc55a0213a0 introduces mmap_read_trylock_non_owner(), which is used
> > > exclusively by stackmap, and does the opposite: it acquires the mmap
> > > lock without telling lockdep about it. I can't see any smoking gun
> > > linking this to our bug, but I thought it may be worth mentioning as
> > > it involves the same suspects (stackmap and the difference between
> > > owning the lock vs lockdep thinking we own the lock).
> > >
> > > I'm sorry, that's only how far I was able to go on this bug - I'm not
> > > sure how to investigate it further as I can not reproduce the issue...
> >
> > Hello Michel,
> >
> > Thanks for your notes, it may be related.
> >
> > On my OpenRISC build with DEBUG_LOCKDEP and DEBUG_VM I am seeing the below
> > trace.  Without DEBUG_VM enabled the system does seem to work normally.
> >
> > Some notes on the trace:
> >  - The mmap_assert_locked failure is not in the same place as RISC-V, but
> >similarly during walk_page_range not walk_page_range_novma.
> >  - This build does not use BPF.
> >  - I do see a LOCKDEP warning early on, which may be the cause.
> >
>
> I don't have LOCKDEP enabled in RISC-V. I guess we are seeing the same
> bug caused in different paths.
> Here is the full boot log for RISC-V if it helps.
>
> https://paste.centos.org/view/bb5ee1e7
>
> > I will still do some more investination, you input is very helpful.
> >
> > -Stafford
> >
> > exec: /home/shorne/work/openrisc/qemu/build/or1k-softmmu/qemu-system-or1k 
> > -cpu or1200 -M or1k-sim -kernel /home/shorne/work/linux/vmlinux -net nic 
> > -net tap,ifname=tap0,script=no,downscript=no -serial mon:stdio -nographic 
> > -gdb tcp::10001 -m 512
> > [0.00] Compiled-in FDT at (ptrval)
> > [0.00] Linux version 5.8.0-rc1-simple-smp-2-g2335724b6a99-dirty 
> > (sho...@lianli.shorne-pla.net) (or1k-elf-gcc (GCC) 9.0.1 20190409 
> > (experimental), GNU ld (GNU Binutils) 2.32.51.20190425) #140 Wed Jun 17 
> > 04:01:43 JST 2020
> > [0.00] CPU: OpenRISC-13 (revision 8) @20 MHz
> > [0.00] -- dcache disabled
> > [0.00] -- icache disabled
> > [0.00] -- dmmu:  128 entries, 1 way(s)
> > [0.00] -- immu:  128 entries, 1 way(s)
> > [0.00] -- additional features:
> > [0.00] -- power management
> > [0.00] -- PIC
> > [0.00] -- timer
> > [0.00] setup_memory: Memory: 0x0-0x200
> > [0.00] Setting up paging and PTEs.
> > [0.00] map_ram: Memory: 0x0-0x200
> > [0.00] Zone ranges:
> > [0.00]   Normal   [mem 0x-0x01ff]
> > [0.00] Movable zone start for each node
> > [0.00] Early memory node ranges
> > [0.00]   node   0: [mem 0x-0x01ff]
> > [0.00] Initmem setup node 0 [mem 
> > 0x-0x01ff]
> > [0.00] On node 0 totalpages: 4096
> > [0.00]   Normal zone: 16 pages used for memmap
> > [0.00]   Normal zone: 0 pages reserved
> > [0.00]   Normal zone: 4096 pages, LIFO batch:0
> > [0.00] itlb_miss_handler (ptrval)
> > [0.00] dtlb_miss_handler (ptrval)
> > [0.00] OpenRISC Linux -- http://openrisc.io
> > [0.00] pcpu-alloc: s0 r0 d3276

Re: mm lock issue while booting Linux on 5.8-rc1 for RISC-V

2020-06-16 Thread Michel Lespinasse
1 : 0064 t2 :  s0 :
> >> > ffe3eaea7c90
> >> > [   10.385119]  s1 : ffe001018600 a0 : 0289 a1 :
> >> > 0020
> >> > [   10.386099]  a2 : 0005 a3 :  a4 :
> >> > ffe001012758
> >> > [   10.387294]  a5 :  a6 : 0102 a7 :
> >> > 0006
> >> > [   10.388265]  s2 : ffe3f00674c0 s3 : ffe00106e108 s4 :
> >> > ffe00106e100
> >> > [   10.389250]  s5 : ffe00106e908 s6 :  s7 :
> >> > 6db6db6db6db6db7
> >> > [   10.390272]  s8 : 0001 s9 : ffe00021a4f8 s10:
> >> > 
> >> > [   10.391293]  s11: ffe3f0066600 t3 : 0001a7a8 t4 :
> >> > 0001a7a8
> >> > [   10.392314]  t5 :  t6 : ffe00107b76b
> >> > [   10.393096] status: 00000120 badaddr: 
> >> > cause: 0003
> >> > [   10.397755] ---[ end trace 861659596ac28841 ]---
> >> > ---
> >> >
> >> > I haven't had the chance to bisect to figure out which commit caused
> >> > the issue. Just wanted
> >> > to check if it is a known issue already.
> >>
> >> Hi Atish,
> >>
> >> Note, I am getting the same (just now) when booting v5.8-rc1 on OpenRISC.  
> >> If
> >> you have any updates please post back.  I will try to look into this today 
> >> or
> >> tomorrow.
> >
> > I have bisected this to, 42fc541404f249778e752ab39c8bc25fcb2dbe1e:
> >
> >   mmap locking API: add mmap_assert_locked() and mmap_assert_write_locked()
> >
> > This should have just changed the existing lockdep api's but something has
> > changed.  I haven't had time to look at it yet.
> >
> > Ccing: Michel Lespinasse 
>
> This isn't manifesting on boot for me, on either rc1 or that commit.  I'm
> running a simple buildroot-based userspace, so I doubt anything is triggering
> BPF.  I don't run the BPF selftests, as they're a bit of a pain (IIRC they
> don't cross compile and need LLVM) -- does anyone have a userspace I can use 
> to
> trigger the bug?



-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH] Documentation: fix filesystems/locking.rst malformed table warnings

2020-06-14 Thread Michel Lespinasse
On Sun, Jun 14, 2020 at 8:22 PM Randy Dunlap  wrote:
> From: Randy Dunlap 
>
> Fix Sphinx malformed table warnings in filesystems/locking.rst:
>
> lnx-58-rc1/Documentation/filesystems/locking.rst:443: WARNING: Malformed 
> table.
> Text in column margin in table line 8.
>
> lnx-58-rc1/Documentation/filesystems/locking.rst:620: WARNING: Malformed 
> table.
> Text in column margin in table line 2.
>
> Fixes: ec23eb54fbc7 ("docs: fs: convert docs without extension to ReST")
> Fixes: c1e8d7c6a7a6 ("mmap locking API: convert mmap_sem comments")
> Signed-off-by: Randy Dunlap 
> Cc: Mauro Carvalho Chehab 
> Cc: Michel Lespinasse 
> Cc: Andrew Morton 

Acked-by: Michel Lespinasse 
Thanks for the fixes !

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v6 10/12] mmap locking API: rename mmap_sem to mmap_lock

2020-06-04 Thread Michel Lespinasse
On Thu, Jun 4, 2020 at 1:16 AM youling 257  wrote:
> 2020-06-04 13:57 GMT+08:00, Michel Lespinasse :
> > However I would like more information about your report. Did you apply
> > the series yourself ? If so, what base tree did you apply it onto ? If
> > not, what tree did you use that already included the series ?
>
> I build linux next-20200603 tree have the kernel/sys.c error.

Thanks for the info.

The proper fix is to replace down_write(&mm->mmap_sem) with
mmap_write_lock(mm) and up_write(&mm->mmap_sem) with
mmap_write_unlock(mm)
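
In other words, in that prctl_set_vma() hunk, roughly (sketch, leaving
whatever the existing body does untouched):

        mmap_write_lock(mm);
        /* ... existing prctl_set_vma() body unchanged ... */
        mmap_write_unlock(mm);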

If you just needed a quick fix to get the tree to build, what you did
(replacing mmap_sem with mmap_lock) is also fine.

Thanks again !

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v6 10/12] mmap locking API: rename mmap_sem to mmap_lock

2020-06-03 Thread Michel Lespinasse
On Wed, Jun 3, 2020 at 9:35 PM youling 257  wrote:
> I have build error about kernel/sys.c,
>
> kernel/sys.c: In function ‘prctl_set_vma’:
> kernel/sys.c:2392:18: error:
> ‘struct mm_struct’ has no member named ‘mmap_sem’; did you mean
> ‘mmap_base’?
>   2392 |  down_write(&mm->mmap_sem);
>|  ^~~~
>|  mmap_base
> kernel/sys.c:2402:16: error:
> ‘struct mm_struct’ has no member named ‘mmap_sem’; did you mean
> ‘mmap_base’?
>   2402 |  up_write(&mm->mmap_sem);
>|^~~~
>|mmap_base
>
> why not rename kernel/sys.c mmap_sem to mmap_lock?

The proper fix would be to use the mmap locking apis defined in
include/linux/mmap_lock.h instead.

However I would like more information about your report. Did you apply
the series yourself ? If so, what base tree did you apply it onto ? If
not, what tree did you use that already included the series ?

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v6 12/12] mmap locking API: convert mmap_sem comments

2020-05-21 Thread Michel Lespinasse
On Thu, May 21, 2020 at 12:42 AM Vlastimil Babka  wrote:
> On 5/20/20 7:29 AM, Michel Lespinasse wrote:
> > Convert comments that reference mmap_sem to reference mmap_lock instead.
> >
> > Signed-off-by: Michel Lespinasse 
>
> Reviewed-by: Vlastimil Babka 
>
> But:
>
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index a1247d3553da..1bf46e2e0cec 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -224,7 +224,7 @@ static int mpol_new_bind(struct mempolicy *pol, const 
> > nodemask_t *nodes)
> >   * handle an empty nodemask with MPOL_PREFERRED here.
> >   *
> >   * Must be called holding task's alloc_lock to protect task's mems_allowed
> > - * and mempolicy.  May also be called holding the mmap_semaphore for write.
> > + * and mempolicy.  May also be called holding the mmap_lockaphore for 
> > write.
> >   */
> >  static int mpol_set_nodemask(struct mempolicy *pol,
> >const nodemask_t *nodes, struct nodemask_scratch *nsc)
>
> :)

Haha, good catch !

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v6 12/12] mmap locking API: convert mmap_sem comments

2020-05-20 Thread Michel Lespinasse
Looks good, thanks !

On Wed, May 20, 2020 at 8:22 PM Andrew Morton  wrote:
> On Tue, 19 May 2020 22:29:08 -0700 Michel Lespinasse  
> wrote:
> > Convert comments that reference mmap_sem to reference mmap_lock instead.
>
> This may not be complete..
>
> From: Andrew Morton 
> Subject: mmap-locking-api-convert-mmap_sem-comments-fix
>
> fix up linux-next leftovers
>
> Cc: Daniel Jordan 
> Cc: Davidlohr Bueso 
> Cc: David Rientjes 
> Cc: Hugh Dickins 
> Cc: Jason Gunthorpe 
> Cc: Jerome Glisse 
> Cc: John Hubbard 
> Cc: Laurent Dufour 
> Cc: Liam Howlett 
> Cc: Matthew Wilcox 
> Cc: Michel Lespinasse 
> Cc: Peter Zijlstra 
> Cc: Vlastimil Babka 
> Cc: Ying Han 
> Signed-off-by: Andrew Morton 

Reviewed-by: Michel Lespinasse 


Re: [PATCH v6 05/12] mmap locking API: convert mmap_sem call sites missed by coccinelle

2020-05-20 Thread Michel Lespinasse
Looks good. I'm not sure if you need a review, but just in case:

On Wed, May 20, 2020 at 8:23 PM Andrew Morton  wrote:
> On Tue, 19 May 2020 22:29:01 -0700 Michel Lespinasse  
> wrote:
>
> > Convert the last few remaining mmap_sem rwsem calls to use the new
> > mmap locking API. These were missed by coccinelle for some reason
> > (I think coccinelle does not support some of the preprocessor
> > constructs in these files ?)
>
> From: Andrew Morton 
> Subject: mmap-locking-api-convert-mmap_sem-call-sites-missed-by-coccinelle-fix
>
> convert linux-next leftovers
>
> Cc: Michel Lespinasse 
> Cc: Daniel Jordan 
> Cc: Laurent Dufour 
> Cc: Vlastimil Babka 
> Cc: Davidlohr Bueso 
> Cc: David Rientjes 
> Cc: Hugh Dickins 
> Cc: Jason Gunthorpe 
> Cc: Jerome Glisse 
> Cc: John Hubbard 
> Cc: Liam Howlett 
> Cc: Matthew Wilcox 
> Cc: Peter Zijlstra 
> Cc: Ying Han 
> Signed-off-by: Andrew Morton 

Reviewed-by: Michel Lespinasse 


Re: [PATCH v5.5 10/10] mmap locking API: rename mmap_sem to mmap_lock

2020-05-20 Thread Michel Lespinasse
On Wed, May 20, 2020 at 12:32 AM John Hubbard  wrote:
> On 2020-05-19 19:39, Michel Lespinasse wrote:
> >> That gives you additional options inside internal_get_user_pages_fast(), 
> >> such
> >> as, approximately:
> >>
> >> if (!(gup_flags & FOLL_FAST_ONLY))
> >>  might_lock_read(>mm->mmap_lock);
> >>
> >> ...not that that is necessarily a great idea, seeing as how it merely 
> >> changes
> >> "might lock" into "maybe might lock".  :)
> >
> > I think that is completely fine, makes sure everyone not using
> > FOLL_FAST_ONLY realizes that the call could block.
> >
> > Can I ask you to add that assertion in your patchset ? Based on
> > Matthew's feedback, I would do it in my patchset, but it doesn't seem
> > worth doing if we know this will conflict with your changes.
>
> Sure, that's no problem. Although it looks like my changes may land
> in mmotm first, and then your patchset, so maybe the right move is to
> make this change *after* both of those things happen, yes?

I don't have a strong opinion on this. I would suggest you add the
might_lock_read() assertion and whoever comes in second will deal with
the conflict by changing mmap_sem to mmap_lock in that assertion.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


[PATCH v6 11/12] mmap locking API: convert mmap_sem API comments

2020-05-19 Thread Michel Lespinasse
Convert comments that reference old mmap_sem APIs to reference
corresponding new mmap locking APIs instead.

Signed-off-by: Michel Lespinasse 
---
 Documentation/vm/hmm.rst   |  6 +++---
 arch/alpha/mm/fault.c  |  2 +-
 arch/ia64/mm/fault.c   |  2 +-
 arch/m68k/mm/fault.c   |  2 +-
 arch/microblaze/mm/fault.c |  2 +-
 arch/mips/mm/fault.c   |  2 +-
 arch/nds32/mm/fault.c  |  2 +-
 arch/nios2/mm/fault.c  |  2 +-
 arch/openrisc/mm/fault.c   |  2 +-
 arch/parisc/mm/fault.c |  2 +-
 arch/riscv/mm/fault.c  |  2 +-
 arch/sh/mm/fault.c |  2 +-
 arch/sparc/mm/fault_32.c   |  2 +-
 arch/sparc/mm/fault_64.c   |  2 +-
 arch/xtensa/mm/fault.c |  2 +-
 drivers/android/binder_alloc.c |  4 ++--
 fs/hugetlbfs/inode.c   |  2 +-
 fs/userfaultfd.c   |  2 +-
 mm/filemap.c   |  2 +-
 mm/gup.c   | 12 ++--
 mm/huge_memory.c   |  4 ++--
 mm/khugepaged.c|  2 +-
 mm/ksm.c   |  2 +-
 mm/memory.c|  4 ++--
 mm/mempolicy.c |  2 +-
 mm/migrate.c   |  4 ++--
 mm/mmap.c  |  2 +-
 mm/oom_kill.c  |  8 
 net/ipv4/tcp.c |  2 +-
 29 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 4e3e9362afeb..046817505033 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -194,15 +194,15 @@ The usage pattern is::
 
  again:
   range.notifier_seq = mmu_interval_read_begin(&interval_sub);
-  down_read(&mm->mmap_sem);
+  mmap_read_lock(mm);
   ret = hmm_range_fault(&range);
   if (ret) {
-  up_read(&mm->mmap_sem);
+  mmap_read_unlock(mm);
   if (ret == -EBUSY)
  goto again;
   return ret;
   }
-  up_read(&mm->mmap_sem);
+  mmap_read_unlock(mm);
 
   take_lock(driver->update);
   if (mmu_interval_read_retry(&ni, range.notifier_seq) {
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 36efa778ee1a..c2303a8c2b9f 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -171,7 +171,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
if (fault & VM_FAULT_RETRY) {
flags |= FAULT_FLAG_TRIED;
 
-/* No need to up_read(&mm->mmap_sem) as we would
+/* No need to mmap_read_unlock(mm) as we would
 * have already released it in __lock_page_or_retry
 * in mm/filemap.c.
 */
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 9b95050c2048..0f788992608a 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -169,7 +169,7 @@ ia64_do_page_fault (unsigned long address, unsigned long 
isr, struct pt_regs *re
if (fault & VM_FAULT_RETRY) {
flags |= FAULT_FLAG_TRIED;
 
-/* No need to up_read(&mm->mmap_sem) as we would
+/* No need to mmap_read_unlock(mm) as we would
 * have already released it in __lock_page_or_retry
 * in mm/filemap.c.
 */
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 650acab0d77d..a94a814ad6ad 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -165,7 +165,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
flags |= FAULT_FLAG_TRIED;
 
/*
-* No need to up_read(&mm->mmap_sem) as we would
+* No need to mmap_read_unlock(mm) as we would
 * have already released it in __lock_page_or_retry
 * in mm/filemap.c.
 */
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 9d7c423dea1d..ebf1ac50b291 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -239,7 +239,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long 
address,
flags |= FAULT_FLAG_TRIED;
 
/*
-* No need to up_read(&mm->mmap_sem) as we would
+* No need to mmap_read_unlock(mm) as we would
 * have already released it in __lock_page_or_retry
 * in mm/filemap.c.
 */
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 9ef2dd39111e..01b168a90434 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -181,7 +181,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, 
unsigned long write,
flags |= FAULT_FLAG_TRIED;
 
/*
-* No need t

[PATCH v6 09/12] mmap locking API: add mmap_assert_locked() and mmap_assert_write_locked()

2020-05-19 Thread Michel Lespinasse
Add new APIs to assert that mmap_sem is held.

Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
makes the assertions more tolerant of future changes to the lock type.

Signed-off-by: Michel Lespinasse 
---
 arch/x86/events/core.c|  2 +-
 fs/userfaultfd.c  |  6 +++---
 include/linux/mmap_lock.h | 14 ++
 mm/gup.c  |  2 +-
 mm/hmm.c  |  2 +-
 mm/memory.c   |  2 +-
 mm/mmu_notifier.c |  6 +++---
 mm/pagewalk.c |  6 +++---
 mm/util.c |  2 +-
 9 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index a619763e96e1..66559ac4f89e 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2182,7 +2182,7 @@ static void x86_pmu_event_mapped(struct perf_event 
*event, struct mm_struct *mm)
 * For now, this can't happen because all callers hold mmap_sem
 * for write.  If this changes, we'll need a different solution.
 */
-   lockdep_assert_held_write(&mm->mmap_sem);
+   mmap_assert_write_locked(mm);
 
if (atomic_inc_return(>context.perf_rdpmc_allowed) == 1)
on_each_cpu_mask(mm_cpumask(mm), refresh_pce, NULL, 1);
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 9c645eee1a59..12b492409040 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -234,7 +234,7 @@ static inline bool userfaultfd_huge_must_wait(struct 
userfaultfd_ctx *ctx,
pte_t *ptep, pte;
bool ret = true;
 
-   VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+   mmap_assert_locked(mm);
 
ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
 
@@ -286,7 +286,7 @@ static inline bool userfaultfd_must_wait(struct 
userfaultfd_ctx *ctx,
pte_t *pte;
bool ret = true;
 
-   VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+   mmap_assert_locked(mm);
 
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
@@ -405,7 +405,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned 
long reason)
 * Coredumping runs without mmap_sem so we can only check that
 * the mmap_sem is held, if PF_DUMPCORE was not set.
 */
-   WARN_ON_ONCE(!rwsem_is_locked(&mm->mmap_sem));
+   mmap_assert_locked(mm);
 
ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index acac1bf5ecd2..43ef914e6468 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_MMAP_LOCK_H
 #define _LINUX_MMAP_LOCK_H
 
+#include <linux/mmdebug.h>
+
 #define MMAP_LOCK_INITIALIZER(name) \
.mmap_sem = __RWSEM_INITIALIZER((name).mmap_sem),
 
@@ -73,4 +75,16 @@ static inline void mmap_read_unlock_non_owner(struct 
mm_struct *mm)
up_read_non_owner(&mm->mmap_sem);
 }
 
+static inline void mmap_assert_locked(struct mm_struct *mm)
+{
+   lockdep_assert_held(&mm->mmap_sem);
+   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
+}
+
+static inline void mmap_assert_write_locked(struct mm_struct *mm)
+{
+   lockdep_assert_held_write(&mm->mmap_sem);
+   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
+}
+
 #endif /* _LINUX_MMAP_LOCK_H */
diff --git a/mm/gup.c b/mm/gup.c
index 631285295950..c1c0b37d0e8f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1405,7 +1405,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
VM_BUG_ON(end   & ~PAGE_MASK);
VM_BUG_ON_VMA(start < vma->vm_start, vma);
VM_BUG_ON_VMA(end   > vma->vm_end, vma);
-   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
+   mmap_assert_locked(mm);
 
gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
if (vma->vm_flags & VM_LOCKONFAULT)
diff --git a/mm/hmm.c b/mm/hmm.c
index 280585833adf..660a4bcf932a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -581,7 +581,7 @@ long hmm_range_fault(struct hmm_range *range)
struct mm_struct *mm = range->notifier->mm;
int ret;
 
-   lockdep_assert_held(&mm->mmap_sem);
+   mmap_assert_locked(mm);
 
do {
/* If range is no longer valid force retry. */
diff --git a/mm/memory.c b/mm/memory.c
index e6dd3309c5a3..20f98ea8968e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1214,7 +1214,7 @@ static inline unsigned long zap_pud_range(struct 
mmu_gather *tlb,
next = pud_addr_end(addr, end);
if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
if (next - addr != HPAGE_PUD_SIZE) {
-   VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
+   mmap_assert_locked(tlb->mm);
split_huge_pud(vma, pud, addr);
} else if (zap_huge_pud(tlb, vma, pud, addr))
goto next;
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index cfd0a03bf5c

[PATCH v6 02/12] MMU notifier: use the new mmap locking API

2020-05-19 Thread Michel Lespinasse
This use is converted manually ahead of the next patch in the series,
as it requires including a new header which the automated conversion
would miss.

Signed-off-by: Michel Lespinasse 
Reviewed-by: Daniel Jordan 
Reviewed-by: Davidlohr Bueso 
Reviewed-by: Laurent Dufour 
Reviewed-by: Vlastimil Babka 
---
 include/linux/mmu_notifier.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 736f6918335e..2f462710a1a4 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -5,6 +5,7 @@
 #include <linux/list.h>
 #include <linux/spinlock.h>
 #include <linux/mm_types.h>
+#include <linux/mmap_lock.h>
 #include <linux/srcu.h>
 #include <linux/interval_tree.h>
 
@@ -277,9 +278,9 @@ mmu_notifier_get(const struct mmu_notifier_ops *ops, struct 
mm_struct *mm)
 {
struct mmu_notifier *ret;
 
-   down_write(&mm->mmap_sem);
+   mmap_write_lock(mm);
ret = mmu_notifier_get_locked(ops, mm);
-   up_write(&mm->mmap_sem);
+   mmap_write_unlock(mm);
return ret;
 }
 void mmu_notifier_put(struct mmu_notifier *subscription);
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH v6 10/12] mmap locking API: rename mmap_sem to mmap_lock

2020-05-19 Thread Michel Lespinasse
Rename the mmap_sem field to mmap_lock. Any new uses of this lock
should now go through the new mmap locking api. The mmap_lock is
still implemented as a rwsem, though this could change in the future.

Signed-off-by: Michel Lespinasse 
Reviewed-by: Vlastimil Babka 
---
 arch/ia64/mm/fault.c  |  4 +--
 arch/x86/mm/fault.c   |  2 +-
 drivers/gpu/drm/etnaviv/etnaviv_gem.c |  2 +-
 include/linux/mm_types.h  |  2 +-
 include/linux/mmap_lock.h | 38 +--
 mm/memory.c   |  2 +-
 mm/mmap.c |  4 +--
 mm/mmu_notifier.c |  2 +-
 8 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 693f00b117e1..9b95050c2048 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -70,8 +70,8 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, 
struct pt_regs *re
mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT)
| (((isr >> IA64_ISR_W_BIT) & 1UL) << VM_WRITE_BIT));
 
-   /* mmap_sem is performance critical */
-   prefetchw(&mm->mmap_sem);
+   /* mmap_lock is performance critical */
+   prefetchw(&mm->mmap_lock);
 
/*
 * If we're in an interrupt or have no user context, we must not take 
the fault..
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 181f66b9049f..35f530f9dfc0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1522,7 +1522,7 @@ dotraplinkage void
 do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
unsigned long address)
 {
-   prefetchw(&current->mm->mmap_sem);
+   prefetchw(&current->mm->mmap_lock);
trace_page_fault_entries(regs, hw_error_code, address);
 
if (unlikely(kmmio_fault(regs, address)))
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c 
b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
index dc9ef302f517..701f3995f621 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
@@ -661,7 +661,7 @@ static int etnaviv_gem_userptr_get_pages(struct 
etnaviv_gem_object *etnaviv_obj)
struct etnaviv_gem_userptr *userptr = &etnaviv_obj->userptr;
int ret, pinned = 0, npages = etnaviv_obj->base.size >> PAGE_SHIFT;
 
-   might_lock_read(&current->mm->mmap_sem);
+   might_lock_read(&current->mm->mmap_lock);
 
if (userptr->mm != current->mm)
return -EPERM;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4aba6c0c2ba8..d13b90399c16 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -436,7 +436,7 @@ struct mm_struct {
spinlock_t page_table_lock; /* Protects page tables and some
 * counters
 */
-   struct rw_semaphore mmap_sem;
+   struct rw_semaphore mmap_lock;
 
struct list_head mmlist; /* List of maybe swapped mm's. These
  * are globally strung together off
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 43ef914e6468..b5bd86778cca 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -4,67 +4,67 @@
 #include <linux/mmdebug.h>
 
 #define MMAP_LOCK_INITIALIZER(name) \
-   .mmap_sem = __RWSEM_INITIALIZER((name).mmap_sem),
+   .mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
 
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
-   init_rwsem(&mm->mmap_sem);
+   init_rwsem(&mm->mmap_lock);
 }
 
 static inline void mmap_write_lock(struct mm_struct *mm)
 {
-   down_write(&mm->mmap_sem);
+   down_write(&mm->mmap_lock);
 }
 
 static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
 {
-   down_write_nested(&mm->mmap_sem, subclass);
+   down_write_nested(&mm->mmap_lock, subclass);
 }
 
 static inline int mmap_write_lock_killable(struct mm_struct *mm)
 {
-   return down_write_killable(&mm->mmap_sem);
+   return down_write_killable(&mm->mmap_lock);
 }
 
 static inline bool mmap_write_trylock(struct mm_struct *mm)
 {
-   return down_write_trylock(&mm->mmap_sem) != 0;
+   return down_write_trylock(&mm->mmap_lock) != 0;
 }
 
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
-   up_write(&mm->mmap_sem);
+   up_write(&mm->mmap_lock);
 }
 
 static inline void mmap_write_downgrade(struct mm_struct *mm)
 {
-   downgrade_write(&mm->mmap_sem);
+   downgrade_write(&mm->mmap_lock);
 }
 
 static inline void mmap_read_lock(struct mm_struct *mm)
 {
-   down_read(&mm->mmap_sem);
+   down_read(&mm->mmap_lock);
 }
 
 static inline int mmap_read_lock_killable(struct mm_struct *mm)
 {
-   return down_read_killable(&mm->mmap_sem);
+   return down_read_killable(&mm->mmap_lock);
 }
 
 static inline bool mmap_read_trylock(stru

[PATCH v6 08/12] mmap locking API: add MMAP_LOCK_INITIALIZER

2020-05-19 Thread Michel Lespinasse
Define a new initializer for the mmap locking api.
Initially this just evaluates to __RWSEM_INITIALIZER as the API
is defined as wrappers around rwsem.

Signed-off-by: Michel Lespinasse 
Reviewed-by: Laurent Dufour 
Reviewed-by: Vlastimil Babka 
---
 arch/x86/kernel/tboot.c| 2 +-
 drivers/firmware/efi/efi.c | 2 +-
 include/linux/mmap_lock.h  | 3 +++
 mm/init-mm.c   | 2 +-
 4 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index b89f6ac6a0c0..885058325c20 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -90,7 +90,7 @@ static struct mm_struct tboot_mm = {
.pgd= swapper_pg_dir,
.mm_users   = ATOMIC_INIT(2),
.mm_count   = ATOMIC_INIT(1),
-   .mmap_sem   = __RWSEM_INITIALIZER(init_mm.mmap_sem),
+   MMAP_LOCK_INITIALIZER(init_mm)
.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
 };
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 911a2bd0f6b7..916313ec8acb 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -54,7 +54,7 @@ struct mm_struct efi_mm = {
.mm_rb  = RB_ROOT,
.mm_users   = ATOMIC_INIT(2),
.mm_count   = ATOMIC_INIT(1),
-   .mmap_sem   = __RWSEM_INITIALIZER(efi_mm.mmap_sem),
+   MMAP_LOCK_INITIALIZER(efi_mm)
.page_table_lock= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(efi_mm.mmlist),
.cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index d1826ce42f00..acac1bf5ecd2 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -1,6 +1,9 @@
 #ifndef _LINUX_MMAP_LOCK_H
 #define _LINUX_MMAP_LOCK_H
 
+#define MMAP_LOCK_INITIALIZER(name) \
+   .mmap_sem = __RWSEM_INITIALIZER((name).mmap_sem),
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 19603302a77f..fe9c03d8e07b 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -31,7 +31,7 @@ struct mm_struct init_mm = {
.pgd= swapper_pg_dir,
.mm_users   = ATOMIC_INIT(2),
.mm_count   = ATOMIC_INIT(1),
-   .mmap_sem   = __RWSEM_INITIALIZER(init_mm.mmap_sem),
+   MMAP_LOCK_INITIALIZER(init_mm)
.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.arg_lock   =  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH v6 07/12] mmap locking API: add mmap_read_trylock_non_owner()

2020-05-19 Thread Michel Lespinasse
Add a couple APIs used by kernel/bpf/stackmap.c only:
- mmap_read_trylock_non_owner()
- mmap_read_unlock_non_owner() (may be called from a work queue).

It's still not ideal that bpf/stackmap subverts the lock ownership
in this way. Thanks to Peter Zijlstra for suggesting this API as the
least-ugly way of addressing this in the short term.

Signed-off-by: Michel Lespinasse 
Reviewed-by: Daniel Jordan 
Reviewed-by: Vlastimil Babka 
---
 include/linux/mmap_lock.h | 14 ++
 kernel/bpf/stackmap.c | 17 +
 2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index a757cb30ae77..d1826ce42f00 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -56,4 +56,18 @@ static inline void mmap_read_unlock(struct mm_struct *mm)
up_read(&mm->mmap_sem);
 }
 
+static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm)
+{
+   if (down_read_trylock(&mm->mmap_sem)) {
+   rwsem_release(&mm->mmap_sem.dep_map, _RET_IP_);
+   return true;
+   }
+   return false;
+}
+
+static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
+{
+   up_read_non_owner(&mm->mmap_sem);
+}
+
 #endif /* _LINUX_MMAP_LOCK_H */
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 11d41f0c7005..998968659892 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -33,7 +33,7 @@ struct bpf_stack_map {
 /* irq_work to run up_read() for build_id lookup in nmi context */
 struct stack_map_irq_work {
struct irq_work irq_work;
-   struct rw_semaphore *sem;
+   struct mm_struct *mm;
 };
 
 static void do_up_read(struct irq_work *entry)
@@ -44,8 +44,7 @@ static void do_up_read(struct irq_work *entry)
return;
 
work = container_of(entry, struct stack_map_irq_work, irq_work);
-   up_read_non_owner(work->sem);
-   work->sem = NULL;
+   mmap_read_unlock_non_owner(work->mm);
 }
 
 static DEFINE_PER_CPU(struct stack_map_irq_work, up_read_work);
@@ -317,7 +316,7 @@ static void stack_map_get_build_id_offset(struct 
bpf_stack_build_id *id_offs,
 * with build_id.
 */
if (!user || !current || !current->mm || irq_work_busy ||
-   mmap_read_trylock(current->mm) == 0) {
+   !mmap_read_trylock_non_owner(current->mm)) {
/* cannot access current->mm, fall back to ips */
for (i = 0; i < trace_nr; i++) {
id_offs[i].status = BPF_STACK_BUILD_ID_IP;
@@ -342,16 +341,10 @@ static void stack_map_get_build_id_offset(struct 
bpf_stack_build_id *id_offs,
}
 
if (!work) {
-   mmap_read_unlock(current->mm);
+   mmap_read_unlock_non_owner(current->mm);
} else {
-   work->sem = &current->mm->mmap_sem;
+   work->mm = current->mm;
irq_work_queue(&work->irq_work);
-   /*
-* The irq_work will release the mmap_sem with
-* up_read_non_owner(). The rwsem_release() is called
-* here to release the lock from lockdep's perspective.
-*/
-   rwsem_release(&current->mm->mmap_sem.dep_map, _RET_IP_);
}
 }
 
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH v6 05/12] mmap locking API: convert mmap_sem call sites missed by coccinelle

2020-05-19 Thread Michel Lespinasse
Convert the last few remaining mmap_sem rwsem calls to use the new
mmap locking API. These were missed by coccinelle for some reason
(I think coccinelle does not support some of the preprocessor
constructs in these files ?)

Signed-off-by: Michel Lespinasse 
Reviewed-by: Daniel Jordan 
Reviewed-by: Laurent Dufour 
Reviewed-by: Vlastimil Babka 
---
 arch/mips/mm/fault.c   | 10 +-
 arch/riscv/mm/pageattr.c   |  4 ++--
 arch/x86/kvm/mmu/paging_tmpl.h |  8 
 fs/proc/base.c |  6 +++---
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index f8d62cd83b36..9ef2dd39111e 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -97,7 +97,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, 
unsigned long write,
if (user_mode(regs))
flags |= FAULT_FLAG_USER;
 retry:
-   down_read(&mm->mmap_sem);
+   mmap_read_lock(mm);
vma = find_vma(mm, address);
if (!vma)
goto bad_area;
@@ -190,7 +190,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, 
unsigned long write,
}
}
 
-   up_read(&mm->mmap_sem);
+   mmap_read_unlock(mm);
return;
 
 /*
@@ -198,7 +198,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, 
unsigned long write,
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-   up_read(&mm->mmap_sem);
+   mmap_read_unlock(mm);
 
 bad_area_nosemaphore:
/* User mode accesses just cause a SIGSEGV */
@@ -250,14 +250,14 @@ static void __kprobes __do_page_fault(struct pt_regs 
*regs, unsigned long write,
 * We ran out of memory, call the OOM killer, and return the userspace
 * (which will retry the fault, or kill us if we got oom-killed).
 */
-   up_read(&mm->mmap_sem);
+   mmap_read_unlock(mm);
if (!user_mode(regs))
goto no_context;
pagefault_out_of_memory();
return;
 
 do_sigbus:
-   up_read(&mm->mmap_sem);
+   mmap_read_unlock(mm);
 
/* Kernel mode? Handle exceptions or die */
if (!user_mode(regs))
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 728759eb530a..b9072c043222 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -117,10 +117,10 @@ static int __set_memory(unsigned long addr, int numpages, 
pgprot_t set_mask,
if (!numpages)
return 0;
 
-   down_read(&init_mm.mmap_sem);
+   mmap_read_lock(&init_mm);
ret =  walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL,
 &masks);
-   up_read(&init_mm.mmap_sem);
+   mmap_read_unlock(&init_mm);
 
flush_tlb_kernel_range(start, end);
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 9bdf9b7d9a96..40e5bb67cc09 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -165,22 +165,22 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
unsigned long pfn;
unsigned long paddr;
 
-   down_read(&current->mm->mmap_sem);
+   mmap_read_lock(current->mm);
vma = find_vma_intersection(current->mm, vaddr, vaddr + 
PAGE_SIZE);
if (!vma || !(vma->vm_flags & VM_PFNMAP)) {
-   up_read(&current->mm->mmap_sem);
+   mmap_read_unlock(current->mm);
return -EFAULT;
}
pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
paddr = pfn << PAGE_SHIFT;
table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);
if (!table) {
-   up_read(&current->mm->mmap_sem);
+   mmap_read_unlock(current->mm);
return -EFAULT;
}
ret = CMPXCHG(&table[index], orig_pte, new_pte);
memunmap(table);
-   up_read(&current->mm->mmap_sem);
+   mmap_read_unlock(current->mm);
}
 
return (ret != orig_pte);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9a68032d8d73..a96377557db7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2314,7 +2314,7 @@ proc_map_files_readdir(struct file *file, struct 
dir_context *ctx)
if (!mm)
goto out_put_task;
 
-   ret = down_read_killable(&mm->mmap_sem);
+   ret = mmap_read_lock_killable(mm);
if (ret) {
mmput(mm);
goto out_put_task;
@@ -2341,7 +2341,7 @@ proc_map_files_readdir(struct file *file, struct 
dir_context *ctx)
p = genradix_ptr_alloc(&fa, nr_files++, GFP_KERNEL);
if (!p) {
ret = -ENOMEM;
-   up_read(&mm->mmap_sem);
+   mmap_read_unlock(mm);
mmput

[PATCH v6 00/12] Add a new mmap locking API wrapping mmap_sem calls

2020-05-19 Thread Michel Lespinasse
Reposting this patch series on top of v5.7-rc6. I think this is ready
for inclusion into the -mm tree; however there were some minor points
of feedback to address and also it was easier to regenerate a full
version after the v5.5 posting (which only updated patches 09/10 and
10/10) caused some confusion.


This patch series adds a new mmap locking API replacing the existing
mmap_sem lock and unlocks. Initially the API is just implemented in terms
of inlined rwsem calls, so it doesn't provide any new functionality.

There are two justifications for the new API:

- At first, it provides an easy hooking point to instrument mmap_sem
  locking latencies independently of any other rwsems.

- In the future, it may be a starting point for replacing the rwsem
  implementation with a different one, such as range locks. This is
  something that is being explored, even though there is no wide consensus
  about this possible direction yet.
  (see https://patchwork.kernel.org/cover/11401483/)
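
As an illustration of the first point, instrumentation can be hooked into
the wrappers without touching any call site. A minimal sketch, assuming a
hypothetical tracepoint (trace_mmap_lock_acquired is not part of this
series):

static inline void mmap_write_lock(struct mm_struct *mm)
{
        u64 start = sched_clock();

        down_write(&mm->mmap_lock);
        /* hypothetical hook, shown for illustration only */
        trace_mmap_lock_acquired(mm, true /* write */, sched_clock() - start);
}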


Changes since v5.5 of the patchset:

- Applied the changes on top of v5.7-rc6. This was a straight rebase
  except for the changes noted here.

- Re-generated the coccinelle changes (patch 04/12).

- Patch 08/12: use (name) in the MMAP_LOCK_INITIALIZER macro.

- Patch 09/12: use lockdep_assert_held() / lockdep_assert_held_write()
  so that mmap_assert_locked() and mmap_assert_write_locked() get better
  coverage when lockdep is enabled but CONFIG_DEBUG_VM is not.

- Added patches 11 and 12, converting comments that referenced mmap_sem
  rwsem calls or the mmap_sem lock itself, to reference the corresponding
  mmap locking APIs or the mmap_lock itself.


Changes since v5 of the patchset:

- Patch 09/10: Add both mmap_assert_locked() and mmap_assert_write_locked();
  convert some call sites that were using lockdep assertions to use these
  new APIs instead.


Changes since v4 of the patchset:

- Applied the changes on top of v5.7-rc2. This was a straight rebase
  except for changes noted here.

- Patch 01/10: renamed the mmap_write_downgrade API
  (as suggested by Davidlohr Bueso).

- Patch 05/10: added arch/riscv/mm/pageattr.c changes that had been
  previously missed, as found by the kbuild bot.

- Patch 06/10: use SINGLE_DEPTH_NESTING as suggested by Matthew Wilcox.

- Patch 08/10: change MMAP_LOCK_INITIALIZER definition
  as suggested by Matthew Wilcox.

- Patch 09/10: add mm_assert_locked API as suggested by Matthew Wilcox.


Changes since v3 of the patchset:

- The changes now apply on top of v5.7-rc1. This was a straight rebase
  except for changes noted here.

- Re-generated the coccinelle changes (patch 04/10).

- Patch 06/10: removed the mmap_write_unlock_nested API;
  mmap_write_lock_nested() calls now pair with the regular mmap_write_unlock()
  as was suggested by many people.

- Patch 07/10: removed the mmap_read_release API; this is replaced with
  mmap_read_trylock_non_owner() which pairs with mmap_read_unlock_non_owner()
  Thanks to Peter Zijlstra for the suggestion.


Changes since v2 of the patchset:

- Removed the mmap_is_locked API - v2 had removed all uses of it,
  but the actual function definition was still there unused.
  Thanks to Jason Gunthorpe for noticing the unused mmap_is_locked function.


Changes since v1 of the patchset:

- Manually convert drivers/dma-buf/dma-resv.c ahead of the automated
  coccinelle conversion as this file requires a new include statement.
  Thanks to Intel's kbuild test bot for finding the issue.

- In coccinelle automated conversion, apply a single coccinelle rule
  as suggested by Markus Elfring.

- In manual conversion of sites missed by coccinelle, fix an issue where
  I had used mm_read_unlock (from an older version of my patchset) instead
  of mmap_read_unlock in some arch/mips code.
  This was also identified by Intel's kbuild test bot.

- Do not add a new mmap_is_locked API, and use lockdep_assert_held instead.
  Thanks to Jason Gunthorpe and Matthew Wilcox for the suggestion.


The changes apply on top of v5.7-rc6.

I think these changes are ready for integration into the -mm tree now
(for integration into v5.8). The coccinelle part of the change is
relatively invasive, but can be skipped over on a file by file basis
if it causes any conflicts with other pending changes. The new mmap
locking API can interoperate with new code that is still using direct
rwsem calls, until the last patch in the series which renames mmap_sem
to enforce using the new API. Maybe that last patch could be delayed for
a bit, so that we'd get a chance to convert any new code that locks
mmap_sem in the -rc1 release before applying that last patch.


Michel Lespinasse (12):
  mmap locking API: initial implementation as rwsem wrappers
  MMU notifier: use the new mmap locking API
  DMA  reservations: use the new mmap locking API
  mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
  mmap locking API: convert mmap_sem call sites missed by coccinelle
  mmap locking API: convert nested write

[PATCH v6 06/12] mmap locking API: convert nested write lock sites

2020-05-19 Thread Michel Lespinasse
Add API for nested write locks and convert the few call sites doing that.

Signed-off-by: Michel Lespinasse 
Reviewed-by: Daniel Jordan 
Reviewed-by: Laurent Dufour 
Reviewed-by: Vlastimil Babka 
---
 arch/um/include/asm/mmu_context.h | 3 ++-
 include/linux/mmap_lock.h | 5 +
 kernel/fork.c | 2 +-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/um/include/asm/mmu_context.h 
b/arch/um/include/asm/mmu_context.h
index 62262c5c7785..17ddd4edf875 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -8,6 +8,7 @@
 
 #include 
 #include 
+#include <linux/mmap_lock.h>
 
 #include 
 
@@ -47,7 +48,7 @@ static inline void activate_mm(struct mm_struct *old, struct 
mm_struct *new)
 * when the new ->mm is used for the first time.
 */
__switch_mm(&new->context.id);
-   down_write_nested(&new->mmap_sem, 1);
+   mmap_write_lock_nested(new, SINGLE_DEPTH_NESTING);
uml_setup_stubs(new);
mmap_write_unlock(new);
 }
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 97ac53b66052..a757cb30ae77 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -11,6 +11,11 @@ static inline void mmap_write_lock(struct mm_struct *mm)
down_write(&mm->mmap_sem);
 }
 
+static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
+{
+   down_write_nested(&mm->mmap_sem, subclass);
+}
+
 static inline int mmap_write_lock_killable(struct mm_struct *mm)
 {
return down_write_killable(&mm->mmap_sem);
diff --git a/kernel/fork.c b/kernel/fork.c
index a4db6bc952c6..e702e84897fa 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -499,7 +499,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
/*
 * Not linked in yet - no deadlock potential:
 */
-   down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);
+   mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
 
/* No ordering required: file already has been exposed. */
RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH v6 03/12] DMA reservations: use the new mmap locking API

2020-05-19 Thread Michel Lespinasse
This use is converted manually ahead of the next patch in the series,
as it requires including a new header which the automated conversion
would miss.

Signed-off-by: Michel Lespinasse 
Reviewed-by: Daniel Jordan 
Reviewed-by: Laurent Dufour 
Reviewed-by: Vlastimil Babka 
---
 drivers/dma-buf/dma-resv.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 4264e64788c4..b45f8514dc82 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -34,6 +34,7 @@
 
 #include 
 #include 
+#include <linux/mmap_lock.h>
 #include 
 
 /**
@@ -109,7 +110,7 @@ static int __init dma_resv_lockdep(void)
 
dma_resv_init(&obj);
 
-   down_read(&mm->mmap_sem);
+   mmap_read_lock(mm);
ww_acquire_init(&ctx, &reservation_ww_class);
ret = dma_resv_lock(&obj, &ctx);
if (ret == -EDEADLK)
@@ -118,7 +119,7 @@ static int __init dma_resv_lockdep(void)
fs_reclaim_release(GFP_KERNEL);
ww_mutex_unlock(&obj.lock);
ww_acquire_fini(&ctx);
-   up_read(&mm->mmap_sem);
+   mmap_read_unlock(mm);

mmput(mm);
 
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH v6 12/12] mmap locking API: convert mmap_sem comments

2020-05-19 Thread Michel Lespinasse
Convert comments that reference mmap_sem to reference mmap_lock instead.

Signed-off-by: Michel Lespinasse 
---
 .../admin-guide/mm/numa_memory_policy.rst | 10 ++---
 Documentation/admin-guide/mm/userfaultfd.rst  |  2 +-
 Documentation/filesystems/locking.rst |  2 +-
 Documentation/vm/transhuge.rst|  4 +-
 arch/arc/mm/fault.c   |  2 +-
 arch/arm/kernel/vdso.c|  2 +-
 arch/arm/mm/fault.c   |  2 +-
 arch/ia64/mm/fault.c  |  2 +-
 arch/microblaze/mm/fault.c|  2 +-
 arch/nds32/mm/fault.c |  2 +-
 arch/powerpc/include/asm/pkeys.h  |  2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c|  6 +--
 arch/powerpc/mm/book3s32/tlb.c|  2 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c   |  4 +-
 arch/powerpc/mm/book3s64/subpage_prot.c   |  2 +-
 arch/powerpc/mm/fault.c   |  8 ++--
 arch/powerpc/mm/pgtable.c |  2 +-
 arch/powerpc/platforms/cell/spufs/file.c  |  6 +--
 arch/riscv/mm/fault.c |  2 +-
 arch/s390/kvm/priv.c  |  2 +-
 arch/s390/mm/fault.c  |  2 +-
 arch/s390/mm/gmap.c   | 32 +++
 arch/s390/mm/pgalloc.c|  2 +-
 arch/sh/mm/cache-sh4.c|  2 +-
 arch/sh/mm/fault.c|  2 +-
 arch/sparc/mm/fault_64.c  |  2 +-
 arch/um/kernel/skas/mmu.c |  2 +-
 arch/um/kernel/tlb.c  |  2 +-
 arch/unicore32/mm/fault.c |  2 +-
 arch/x86/events/core.c|  2 +-
 arch/x86/include/asm/mmu.h|  2 +-
 arch/x86/include/asm/pgtable-3level.h |  8 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c|  6 +--
 arch/x86/kernel/ldt.c |  2 +-
 arch/x86/mm/fault.c   | 12 +++---
 drivers/char/mspec.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  2 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c |  2 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |  6 +--
 drivers/gpu/drm/i915/i915_perf.c  |  2 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c   |  6 +--
 drivers/infiniband/core/uverbs_main.c |  2 +-
 drivers/infiniband/hw/hfi1/mmu_rb.c   |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c |  2 +-
 drivers/misc/cxl/cxllib.c |  2 +-
 drivers/misc/sgi-gru/grufault.c   |  8 ++--
 drivers/oprofile/buffer_sync.c|  2 +-
 drivers/staging/android/ashmem.c  |  4 +-
 drivers/staging/comedi/comedi_fops.c  |  2 +-
 drivers/tty/vt/consolemap.c   |  2 +-
 drivers/xen/gntdev.c  |  2 +-
 fs/coredump.c |  4 +-
 fs/exec.c |  2 +-
 fs/ext2/file.c|  2 +-
 fs/ext4/super.c   |  6 +--
 fs/kernfs/file.c  |  4 +-
 fs/proc/base.c|  6 +--
 fs/proc/task_mmu.c|  6 +--
 fs/userfaultfd.c  | 18 -
 fs/xfs/xfs_file.c |  2 +-
 fs/xfs/xfs_inode.c| 14 +++
 fs/xfs/xfs_iops.c |  4 +-
 include/asm-generic/pgtable.h |  6 +--
 include/linux/fs.h|  4 +-
 include/linux/huge_mm.h   |  2 +-
 include/linux/mempolicy.h |  2 +-
 include/linux/mm.h| 10 ++---
 include/linux/mm_types.h  |  2 +-
 include/linux/mmu_notifier.h  |  8 ++--
 include/linux/pagemap.h   |  2 +-
 include/linux/rmap.h  |  2 +-
 include/linux/sched/mm.h  | 10 ++---
 kernel/acct.c |  2 +-
 kernel/cgroup/cpuset.c|  4 +-
 kernel/events/core.c  |  6 +--
 kernel/events/uprobes.c   |  4 +-
 kernel/exit.c |  2 +-
 kernel/relay.c|  2 +-
 kernel/sys.c  |  4 +-
 lib/test_lockup.c |  8 ++--
 mm/filemap.c  | 38 +-
 mm/frame_vector.c |  2 +-
 mm/gup.c  | 38 +-
 mm/huge_memory.c  |  4 +-
 mm/hugetlb.c  |  2 +-
 mm

[PATCH v6 01/12] mmap locking API: initial implementation as rwsem wrappers

2020-05-19 Thread Michel Lespinasse
This change wraps the existing mmap_sem related rwsem calls into a new
mmap locking API. There are two justifications for the new API:

- At first, it provides an easy hooking point to instrument mmap_sem
  locking latencies independently of any other rwsems.

- In the future, it may be a starting point for replacing the rwsem
  implementation with a different one, such as range locks.

Signed-off-by: Michel Lespinasse 
Reviewed-by: Daniel Jordan 
Reviewed-by: Davidlohr Bueso 
Reviewed-by: Laurent Dufour 
Reviewed-by: Vlastimil Babka 
---
 include/linux/mm.h|  1 +
 include/linux/mmap_lock.h | 54 +++
 2 files changed, 55 insertions(+)
 create mode 100644 include/linux/mmap_lock.h

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..051ec782bdbb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <linux/mmap_lock.h>
 #include 
 #include 
 #include 
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
new file mode 100644
index ..97ac53b66052
--- /dev/null
+++ b/include/linux/mmap_lock.h
@@ -0,0 +1,54 @@
+#ifndef _LINUX_MMAP_LOCK_H
+#define _LINUX_MMAP_LOCK_H
+
+static inline void mmap_init_lock(struct mm_struct *mm)
+{
+   init_rwsem(&mm->mmap_sem);
+}
+
+static inline void mmap_write_lock(struct mm_struct *mm)
+{
+   down_write(&mm->mmap_sem);
+}
+
+static inline int mmap_write_lock_killable(struct mm_struct *mm)
+{
+   return down_write_killable(&mm->mmap_sem);
+}
+
+static inline bool mmap_write_trylock(struct mm_struct *mm)
+{
+   return down_write_trylock(&mm->mmap_sem) != 0;
+}
+
+static inline void mmap_write_unlock(struct mm_struct *mm)
+{
+   up_write(&mm->mmap_sem);
+}
+
+static inline void mmap_write_downgrade(struct mm_struct *mm)
+{
+   downgrade_write(&mm->mmap_sem);
+}
+
+static inline void mmap_read_lock(struct mm_struct *mm)
+{
+   down_read(&mm->mmap_sem);
+}
+
+static inline int mmap_read_lock_killable(struct mm_struct *mm)
+{
+   return down_read_killable(&mm->mmap_sem);
+}
+
+static inline bool mmap_read_trylock(struct mm_struct *mm)
+{
+   return down_read_trylock(&mm->mmap_sem) != 0;
+}
+
+static inline void mmap_read_unlock(struct mm_struct *mm)
+{
+   up_read(&mm->mmap_sem);
+}
+
+#endif /* _LINUX_MMAP_LOCK_H */
-- 
2.26.2.761.g0e0b3e54be-goog



Re: [PATCH v5.5 10/10] mmap locking API: rename mmap_sem to mmap_lock

2020-05-19 Thread Michel Lespinasse
On Tue, May 19, 2020 at 11:15 AM John Hubbard  wrote:
> On 2020-05-19 08:32, Matthew Wilcox wrote:
> > On Tue, May 19, 2020 at 03:20:40PM +0200, Laurent Dufour wrote:
> >> On 19/05/2020 at 15:10, Michel Lespinasse wrote:
> >>> On Mon, May 18, 2020 at 03:45:22PM +0200, Laurent Dufour wrote:
> >>>> On 24/04/2020 at 03:39, Michel Lespinasse wrote:
> >>>>> Rename the mmap_sem field to mmap_lock. Any new uses of this lock
> >>>>> should now go through the new mmap locking api. The mmap_lock is
> >>>>> still implemented as a rwsem, though this could change in the future.
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c 
> >>>>> b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> >>>>> index dc9ef302f517..701f3995f621 100644
> >>>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> >>>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> >>>>> @@ -661,7 +661,7 @@ static int etnaviv_gem_userptr_get_pages(struct 
> >>>>> etnaviv_gem_object *etnaviv_obj)
> >>>>>   struct etnaviv_gem_userptr *userptr = &etnaviv_obj->userptr;
> >>>>>   int ret, pinned = 0, npages = etnaviv_obj->base.size >> 
> >>>>> PAGE_SHIFT;
> >>>>> - might_lock_read(&current->mm->mmap_sem);
> >>>>> + might_lock_read(&current->mm->mmap_lock);
> >>>>
> >>>> Why not a mm_might_lock_read() new API to hide the mmap_lock, and add it 
> >>>> to
> >>>> the previous patch?
> >>>
> >>> I'm not sure why this is needed - we may rework the lock to be
> >>> something else than rwsem, but might_lock_read should still apply to
> >>> it and make sense ? I'm not sure what the extra API would bring...
> >>
> >> I guess at one time the API would become might_lock_read_a_range(), isn't 
> >> it?

I don't think this should be necessary - from lockdep perspective,
there should not be a difference between locking a range vs the entire
address space.

> >> Furthermore this would hiding the lock's name which the goal of this 
> >> series.

Actually to me the name is very secondary - the goal of the series is
to add an api abstracting the mmap locking. The lock name is secondary
to that, it only gets renamed because mmap_sem was too specific (if we
are to change the mmap locking) and to ensure we convert all direct
uses to use the api instead.

> > I think this assertion should be deleted from this driver.  It's there
> > in case get_user_pages_fast() takes the mmap sem.  It would make sense to
> > have this assertion in get_user_pages_fast() in case we take the fast path
> > which doesn't acquire the mmap_sem.  Something like this:

I like this idea a lot - having might_lock assertions in
get_user_pages_fast makes a lot more sense than doing the same at the
call sites.

> There are a couple of recent developments in this code to keep in mind. I 
> don't
> *think* either one is a problem here, but just in case:
>
> a) The latest version of the above routine [1] is on its way to mmotm as of
> yesterday, and that version more firmly divides the fast and slow parts,
> via a new FOLL_FAST_ONLY flag. The fall-back to slow/regular gup only occurs
> if the caller does not set FOLL_FAST_ONLY. (Note that it's a gup.c internal
> flag, btw.)
>
> That gives you additional options inside internal_get_user_pages_fast(), such
> as, approximately:
>
> if (!(gup_flags & FOLL_FAST_ONLY))
> might_lock_read(&current->mm->mmap_lock);
>
> ...not that that is necessarily a great idea, seeing as how it merely changes
> "might lock" into "maybe might lock".  :)

I think that is completely fine, makes sure everyone not using
FOLL_FAST_ONLY realizes that the call could block.

Can I ask you to add that assertion in your patchset ? Based on
Matthew's feedback, I would do it in my patchset, but it doesn't seem
worth doing if we know this will conflict with your changes.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v5.5 10/10] mmap locking API: rename mmap_sem to mmap_lock

2020-05-19 Thread Michel Lespinasse
On Mon, May 18, 2020 at 01:07:26PM +0200, Vlastimil Babka wrote:
> Any plan about all the code comments mentioning mmap_sem? :) Not urgent.

It's mostly a sed job, I'll add it in the next version as it seems
the patchset is getting ready for inclusion.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v5.5 10/10] mmap locking API: rename mmap_sem to mmap_lock

2020-05-19 Thread Michel Lespinasse
On Mon, May 18, 2020 at 03:45:22PM +0200, Laurent Dufour wrote:
> On 24/04/2020 at 03:39, Michel Lespinasse wrote:
> > Rename the mmap_sem field to mmap_lock. Any new uses of this lock
> > should now go through the new mmap locking api. The mmap_lock is
> > still implemented as a rwsem, though this could change in the future.
> > 
> > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c 
> > b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> > index dc9ef302f517..701f3995f621 100644
> > --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> > +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> > @@ -661,7 +661,7 @@ static int etnaviv_gem_userptr_get_pages(struct 
> > etnaviv_gem_object *etnaviv_obj)
> > struct etnaviv_gem_userptr *userptr = &etnaviv_obj->userptr;
> > int ret, pinned = 0, npages = etnaviv_obj->base.size >> PAGE_SHIFT;
> > -   might_lock_read(&current->mm->mmap_sem);
> > +   might_lock_read(&current->mm->mmap_lock);
> 
> Why not a mm_might_lock_read() new API to hide the mmap_lock, and add it to
> the previous patch?

I'm not sure why this is needed - we may rework the lock to be
something else than rwsem, but might_lock_read should still apply to
it and make sense ? I'm not sure what the extra API would bring...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v5.5 09/10] mmap locking API: add mmap_assert_locked() and mmap_assert_write_locked()

2020-05-19 Thread Michel Lespinasse
On Mon, May 18, 2020 at 01:01:33PM +0200, Vlastimil Babka wrote:
> On 4/24/20 3:38 AM, Michel Lespinasse wrote:
> > +static inline void mmap_assert_locked(struct mm_struct *mm)
> > +{
> > +   VM_BUG_ON_MM(!lockdep_is_held_type(&mm->mmap_sem, -1), mm);
> > +   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
> > +}
> > +
> > +static inline void mmap_assert_write_locked(struct mm_struct *mm)
> > +{
> > +   VM_BUG_ON_MM(!lockdep_is_held_type(&mm->mmap_sem, 0), mm);
> > +   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
> > +}
> 
> I would remove VM_BUG_ON_MM() from the lockdep part. If kernel has lockdep
> enabled, it's already in heavy debugging mode enough so let's just use it and
> not depend on DEBUG_VM. Many sites you convert don't require DEBUG_VM for the
> lockdep checks.
> 
> With that you can also use the standard lockdep_assert_held() and
> lockdep_assert_held_write() wrappers.
> 
> If user has both lockdep and DEBUG_VM enabled, should we run both variants?
> Perhaps lockdep is enough as it's more comprehensive? Your initial v5 version
> was doing that.

Thanks, changed these to lockdep_assert_held() /
lockdep_assert_held_write() as suggested. This misses dumping out the
mm, but I think there is only limited value in that. I did keep the
rwsem_is_locked fallback as people had commented earlier about
getting assertions in the non-lockdep case. If both are enabled...
then we'll get somewhat redundant assertions, but I think that is fine
(better to keep the code simple than try to work around that).
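
For reference, a minimal sketch of what the combined assertions could look
like after that change (written from the description above, not quoted from
the final patch):

static inline void mmap_assert_locked(struct mm_struct *mm)
{
        lockdep_assert_held(&mm->mmap_sem);
        VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
}

static inline void mmap_assert_write_locked(struct mm_struct *mm)
{
        lockdep_assert_held_write(&mm->mmap_sem);
        VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
}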

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v5 08/10] mmap locking API: add MMAP_LOCK_INITIALIZER

2020-05-19 Thread Michel Lespinasse
On Mon, May 18, 2020 at 12:45:06PM +0200, Vlastimil Babka wrote:
> On 4/22/20 2:14 AM, Michel Lespinasse wrote:
> > Define a new initializer for the mmap locking api.
> > Initially this just evaluates to __RWSEM_INITIALIZER as the API
> > is defined as wrappers around rwsem.
> > 
> > Signed-off-by: Michel Lespinasse 
> 
> 
> Reviewed-by: Vlastimil Babka 
> 
> 
> Nit below.
> 
> > +#define MMAP_LOCK_INITIALIZER(name) \
> > +   .mmap_sem = __RWSEM_INITIALIZER(name.mmap_sem),
> 
> "(name).mmap_sem" is safer in a macro?

Thanks, applied in next version.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH v5 06/10] mmap locking API: convert nested write lock sites

2020-05-19 Thread Michel Lespinasse
On Mon, May 18, 2020 at 12:32:03PM +0200, Vlastimil Babka wrote:
> On 4/22/20 2:14 AM, Michel Lespinasse wrote:
> > Add API for nested write locks and convert the few call sites doing that.
> > 
> > Signed-off-by: Michel Lespinasse 
> > Reviewed-by: Daniel Jordan 
> 
> Reviewed-by: Vlastimil Babka 
> 
> Perhaps we could even move SINGLE_DEPTH_NESTING into the wrapper? It's 
> unlikely
> there will be a new user with a different subclass?

I think I'll leave it in the API for now. I don't foresee new uses
being added as long as we stick to coarse mmap locking, but if
extending the api to support range locking it'd become more likely
that we'd want to lock multiple ranges for mremap...
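
For clarity, folding the subclass into the wrapper as suggested would look
roughly like this (sketch only, not what this series does):

static inline void mmap_write_lock_nested(struct mm_struct *mm)
{
        /* hard-code the only subclass in use today */
        down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);
}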

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH V2 2/2] rbtree_latch: don't need to check seq when it found a node

2020-05-15 Thread Michel Lespinasse
On Fri, May 15, 2020 at 9:52 PM Lai Jiangshan
 wrote:
>
> On Sat, May 16, 2020 at 12:28 PM Michel Lespinasse  wrote:
> >
> > On Fri, May 15, 2020 at 03:59:09PM +, Lai Jiangshan wrote:
> > > latch_tree_find() should be protected by caller via RCU or so.
> > > When it find a node in an attempt, the node must be a valid one
> > > in RCU's point's of view even the tree is (being) updated with a
> > > new node with the same key which is entirely subject to timing
> > > anyway.
> >
> > I'm not sure I buy this. Even if we get a valid node, is it the one we
> > were searching for ? I don't see how this could be guaranteed if the
> > read raced with a tree rebalancing.
>
> It is valid because ops->comp() returns 0 and it should be
> the one we were searching for unless ops->comp() is wrong.
> The searched one could be possible just deleted, but it is still
> a legitimate searched result in RCU's point's of view.
>
> A tree rebalancing can cause a searching fails to find
> an existing target. This is the job of read_seqcount_retry()
> to tell you to retry.

Ah, yes, this is correct. It wouldn't work if we wanted to return the
next higher key for example, but it does work for exact matches. Nice!
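
For context, the latch read side being discussed looks roughly like this
(paraphrased from memory of include/linux/rbtree_latch.h, so treat the
helper names as approximate); the patch effectively adds the "!node &&"
short-circuit to the retry condition:

        do {
                seq  = raw_read_seqcount_latch(&root->seq);
                node = __lt_find(key, root, seq & 1, ops->comp);
        } while (!node && read_seqcount_retry(&root->seq, seq));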

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH V2 2/2] rbtree_latch: don't need to check seq when it found a node

2020-05-15 Thread Michel Lespinasse
On Fri, May 15, 2020 at 03:59:09PM +, Lai Jiangshan wrote:
> latch_tree_find() should be protected by caller via RCU or so.
> When it find a node in an attempt, the node must be a valid one
> in RCU's point's of view even the tree is (being) updated with a
> new node with the same key which is entirely subject to timing
> anyway.

I'm not sure I buy this. Even if we get a valid node, is it the one we
were searching for ? I don't see how this could be guaranteed if the
read raced with a tree rebalancing.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [RFC][PATCH 1/7] rbtree: Add generic add and find helpers

2020-04-30 Thread Michel Lespinasse
On Thu, Apr 30, 2020 at 12:28 AM Juri Lelli  wrote:
> > --- a/include/linux/rbtree.h
> > +++ b/include/linux/rbtree.h
> > @@ -141,12 +141,18 @@ static inline void rb_insert_color_cache
> >   rb_insert_color(node, &root->rb_root);
> >  }
> >
> > -static inline void rb_erase_cached(struct rb_node *node,
> > +static inline bool rb_erase_cached(struct rb_node *node,
> >  struct rb_root_cached *root)
> >  {
> > - if (root->rb_leftmost == node)
> > + bool leftmost = false;
> > +
> > + if (root->rb_leftmost == node) {
> >   root->rb_leftmost = rb_next(node);
>
> Think we need
>
>  if (root->rb_leftmost)
>
> > + leftmost = true;
>
> DEADLINE crashes w/o that.

I think Peter's code is correct; after removing the only node in an
rbtree rb_leftmost should be NULL.

The issue appears to be in dequeue_pushable_dl_task unconditionally
dereferencing the pointer returned by rb_first_cached(), which may be
NULL. I'm not sure what the correct behavior is though, i.e. what
dl_rq->earliest_dl.next should be set to if the rbtree ends up empty.
Current code (before Peter's changes) preserves the existing
dl_rq->earliest_dl.next value in that case, which seems very weird to
me (and worthy of a comment if it's correct).
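
To make the failure mode concrete, the lookup in question is roughly of the
following shape and needs a NULL check once rb_leftmost can legitimately be
NULL (sketch only; the field and helper names are from memory):

        struct rb_node *leftmost =
                rb_first_cached(&dl_rq->pushable_dl_tasks_root);

        if (leftmost)   /* tree may be empty after rb_erase_cached() */
                dl_rq->earliest_dl.next = rb_entry(leftmost,
                        struct task_struct, pushable_dl_tasks)->dl.deadline;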


Re: [RFC][PATCH 1/7] rbtree: Add generic add and find helpers

2020-04-29 Thread Michel Lespinasse
Hi Peter,

On Wed, Apr 29, 2020 at 05:32:59PM +0200, Peter Zijlstra wrote:
> I've always been bothered by the endless (fragile) boilerplate for
> rbtree, and I recently wrote some rbtree helpers for objtool and
> figured I should lift them into the kernel and use them more widely.
> 
> Provide:
> 
> partial-order; less() based:
>  - rb_add(): add a new entry to the rbtree
>  - rb_add_cached(): like rb_add(), but for a rb_root_cached
> 
> total-order; cmp() based:
>  - rb_find(): find an entry in an rbtree
>  - rb_find_first(): find the first (leftmost) matching entry
>  - rb_next_match(): continue from rb_find_first()
>  - rb_for_each(): for loop with rb_find_first() / rb_next_match()
> 
> Also make rb_add_cached() / rb_erase_cached() return true when
> leftmost.
> 
> Inlining and constant propagation should see the compiler inline the
> whole thing, including the various compare functions.

I really like the idea of this change. Also, I think it opens avenues
for converting some users which had previously been avoiding raw rbtrees
seemingly only because they didn't want to write this boilerplate.

Few questions:

- Adding the rb_add_cached() / rb_erase_cached() return value looks
  like it almost belongs to a separate patch. Is this only used in
  patch 3/7 (sched/deadline) or did I miss other uses ? Not objecting
  to it, but it wasn't obvious to me when reading the patch what the
  return value was for.

- Have you considered passing a cmp() function to rb_add() and
  rb_add_cached(), and having these test cmp() < 0 rather than less() ?
  I figure every user will need to have a cmp() function, so it'd be
  nicer if they didn't also need a less() function, if the generated
  code is similar (if you checked and rejected it because of bad code,
  please just say so).
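
  Concretely, the cmp()-based variant I have in mind would look something
  like this (untested sketch):

static inline void rb_add(struct rb_root *tree, struct rb_node *node,
                          int (*cmp)(struct rb_node *, const struct rb_node *))
{
        struct rb_node **link = &tree->rb_node;
        struct rb_node *parent = NULL;

        while (*link) {
                parent = *link;
                if (cmp(node, parent) < 0)
                        link = &parent->rb_left;
                else
                        link = &parent->rb_right;
        }

        rb_link_node(node, parent, link);
        rb_insert_color(node, tree);
}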

Reviewed-by: Michel Lespinasse 

I also looked at the other commits in the series, making use of the
helpers, and they seem very reasonable but I did not give them as
thorough a look as this one.

> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/rbtree.h   |  127 
> +-
>  tools/include/linux/rbtree.h |  129 
> ++-
>  tools/objtool/elf.c  |   63 ++---
>  3 files changed, 257 insertions(+), 62 deletions(-)
> 
> --- a/include/linux/rbtree.h
> +++ b/include/linux/rbtree.h
> @@ -141,12 +141,18 @@ static inline void rb_insert_color_cache
>   rb_insert_color(node, &root->rb_root);
>  }
>  
> -static inline void rb_erase_cached(struct rb_node *node,
> +static inline bool rb_erase_cached(struct rb_node *node,
>  struct rb_root_cached *root)
>  {
> - if (root->rb_leftmost == node)
> + bool leftmost = false;
> +
> + if (root->rb_leftmost == node) {
>   root->rb_leftmost = rb_next(node);
> + leftmost = true;
> + }
>   rb_erase(node, &root->rb_root);
> +
> + return leftmost;
>  }
>  
>  static inline void rb_replace_node_cached(struct rb_node *victim,
> @@ -158,4 +164,121 @@ static inline void rb_replace_node_cache
>   rb_replace_node(victim, new, &root->rb_root);
>  }
>  
> +static inline bool rb_add_cached(struct rb_root_cached *tree, struct rb_node 
> *node,
> +  bool (*less)(struct rb_node *, const struct 
> rb_node *))
> +{
> + struct rb_node **link = &tree->rb_root.rb_node;
> + struct rb_node *parent = NULL;
> + bool leftmost = true;
> +
> + while (*link) {
> + parent = *link;
> + if (less(node, parent)) {
> + link = &parent->rb_left;
> + } else {
> + link = &parent->rb_right;
> + leftmost = false;
> + }
> + }
> +
> + rb_link_node(node, parent, link);
> + rb_insert_color_cached(node, tree, leftmost);
> +
> + return leftmost;
> +}
> +
> +static inline void rb_add(struct rb_root *tree, struct rb_node *node,
> +   bool (*less)(struct rb_node *, const struct rb_node 
> *))
> +{
> + struct rb_node **link = &tree->rb_node;
> + struct rb_node *parent = NULL;
> +
> + while (*link) {
> + parent = *link;
> + if (less(node, parent))
> + link = &parent->rb_left;
> + else
> + link = &parent->rb_right;
> + }
> +
> + rb_link_node(node, parent, link);
> + rb_insert_color(node, tree);
> +}
> +
> +static inline struct rb_node *rb_find_add(struct rb_root *tree, struct 
> rb_node *node,
> +   int (*cmp)(struct rb_node *, const struct rb_node *))
> +{
>

Re: [PATCH 08/11] mm: convert vma_interval_tree to half closed intervals

2019-10-04 Thread Michel Lespinasse
On Thu, Oct 03, 2019 at 01:18:55PM -0700, Davidlohr Bueso wrote:
> The vma and anon vma interval tree really wants [a, b) intervals,
> not fully closed. As such convert it to use the new
> interval_tree_gen.h. Because of vma_last_pgoff(), the conversion
> is quite straightforward.

I am not certain if we need to worry about integer overflow here.
The problem case would be accessing the last block of a file that is
exactly 16TB long, on an arch where long (and thus pgoff_t) is 32-bit.
Maybe FS folks can tell us whether that case is currently supported,
or if we can just not worry about it ?
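
To spell out the overflow case (my own arithmetic, assuming 4KB pages):
a 16TB file has 1UL << 32 pages, so with a 32-bit pgoff_t the half-open
end index of its last page wraps around:

        pgoff_t last = 0xffffffffUL;    /* index of the file's last page */
        pgoff_t end  = last + 1;        /* half-open end wraps to 0 */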

I would also want to rename the fields in struct zap_details into
start_index and end_index so we can verify we don't leave any
off-by-one uses.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


  1   2   3   4   5   6   7   8   9   10   >