Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-27 Thread Peter Xu
et>, Chris Zankel , Michal Simek , Thomas 
Bogendoerfer , linux-par...@vger.kernel.org, Max 
Filippov , linux-ker...@vger.kernel.org, Dinh Nguyen 
, Palmer Dabbelt , Sven Schnelle 
, Guo Ren , Borislav Petkov 
, Johannes Berg , 
linuxppc-dev@lists.ozlabs.org, "David S . Miller" 
Errors-To: linuxppc-dev-bounces+archive=mail-archive@lists.ozlabs.org
Sender: "Linuxppc-dev" 


On Fri, May 27, 2022 at 12:46:31PM +0200, Ingo Molnar wrote:
> 
> * Peter Xu  wrote:
> 
> > This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> > program sequentially dirtying 400MB shmem file being mmap()ed and these are
> > the time it needs:
> >
> >   Before: 650.980 ms (+-1.94%)
> >   After:  569.396 ms (+-1.38%)
> 
> Nice!
> 
> >  arch/x86/mm/fault.c   |  4 
> 
> Reviewed-by: Ingo Molnar 
> 
> Minor comment typo:
> 
> > +   /*
> > +* We should do the same as VM_FAULT_RETRY, but let's not
> > +* return -EBUSY since that's not reflecting the reality on
> > +* what has happened - we've just fully completed a page
> > +* fault, with the mmap lock released.  Use -EAGAIN to show
> > +* that we want to take the mmap lock _again_.
> > +*/
> 
> s/reflecting the reality on what has happened
>  /reflecting the reality of what has happened

Will fix.

> 
> > ret = handle_mm_fault(vma, address, fault_flags, NULL);
> > +
> > +   if (ret & VM_FAULT_COMPLETED) {
> > +   /*
> > +* NOTE: it's a pity that we need to retake the lock here
> > +* to pair with the unlock() in the callers. Ideally we
> > +* could tell the callers so they do not need to unlock.
> > +*/
> > +   mmap_read_lock(mm);
> > +   *unlocked = true;
> > +   return 0;
> 
> Indeed that's a pity - I guess more performance could be gained here, 
> especially in highly parallel threaded workloads?

Yes I think so.

The patch avoids the page fault retry, including the mmap lock/unlock side.
Now if we retake the lock for fixup_user_fault() we still safe time for
pgtable walks but the lock overhead will be somehow kept, just with smaller
critical sections.

Some fixup_user_fault() callers won't be affected as long as unlocked==NULL
is passed - e.g. the futex code path (fault_in_user_writeable).  After all
they never needed to retake the lock before/after this patch.

It's about the other callers, and they may need some more touch-ups case by
case.  Examples are follow_fault_pfn() in vfio and hva_to_pfn_remapped() in
KVM: both of them returns -EAGAIN when *unlocked==true.  We need to teach
them to know "*unlocked==true" does not necessarily mean a retry attempt.

I think I can look into them if this patch can be accepted as a follow up.

Thanks for taking a look!

-- 
Peter Xu



Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-27 Thread Peter Xu
e.net>, Chris Zankel , Michal Simek , 
Thomas Bogendoerfer , linux-par...@vger.kernel.org, 
Max Filippov , linux-ker...@vger.kernel.org, Dinh Nguyen 
, Palmer Dabbelt , Sven Schnelle 
, Guo Ren , Borislav Petkov 
, Johannes Berg , 
linuxppc-dev@lists.ozlabs.org, "David S . Miller" 
Errors-To: linuxppc-dev-bounces+archive=mail-archive@lists.ozlabs.org
Sender: "Linuxppc-dev" 


Hi, Heiko,

On Fri, May 27, 2022 at 02:23:42PM +0200, Heiko Carstens wrote:
> On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote:
> > I observed that for each of the shared file-backed page faults, we're very
> > likely to retry one more time for the 1st write fault upon no page.  It's
> > because we'll need to release the mmap lock for dirty rate limit purpose
> > with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
> > 
> > Then after that throttling we return VM_FAULT_RETRY.
> > 
> > We did that probably because VM_FAULT_RETRY is the only way we can return
> > to the fault handler at that time telling it we've released the mmap lock.
> > 
> > However that's not ideal because it's very likely the fault does not need
> > to be retried at all since the pgtable was well installed before the
> > throttling, so the next continuous fault (including taking mmap read lock,
> > walk the pgtable, etc.) could be in most cases unnecessary.
> > 
> > It's not only slowing down page faults for shared file-backed, but also add
> > more mmap lock contention which is in most cases not needed at all.
> > 
> > To observe this, one could try to write to some shmem page and look at
> > "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> > shmem write simply because we retried, and vm event "pgfault" will capture
> > that.
> > 
> > To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> > show that we've completed the whole fault and released the lock.  It's also
> > a hint that we should very possibly not need another fault immediately on
> > this page because we've just completed it.
> > 
> > This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> > program sequentially dirtying 400MB shmem file being mmap()ed and these are
> > the time it needs:
> > 
> >   Before: 650.980 ms (+-1.94%)
> >   After:  569.396 ms (+-1.38%)
> > 
> > I believe it could help more than that.
> > 
> > We need some special care on GUP and the s390 pgfault handler (for gmap
> > code before returning from pgfault), the rest changes in the page fault
> > handlers should be relatively straightforward.
> > 
> > Another thing to mention is that mm_account_fault() does take this new
> > fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
> > 
> > I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> > not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> > them as-is.
> > 
> > Signed-off-by: Peter Xu 
> ...
> > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> > index e173b6187ad5..9503a7cfaf03 100644
> > --- a/arch/s390/mm/fault.c
> > +++ b/arch/s390/mm/fault.c
> > @@ -339,6 +339,7 @@ static inline vm_fault_t do_exception(struct pt_regs 
> > *regs, int access)
> > unsigned long address;
> > unsigned int flags;
> > vm_fault_t fault;
> > +   bool need_unlock = true;
> > bool is_write;
> >  
> > tsk = current;
> > @@ -433,6 +434,13 @@ static inline vm_fault_t do_exception(struct pt_regs 
> > *regs, int access)
> > goto out_up;
> > goto out;
> > }
> > +
> > +   /* The fault is fully completed (including releasing mmap lock) */
> > +   if (fault & VM_FAULT_COMPLETED) {
> > +   need_unlock = false;
> > +   goto out_gmap;
> > +   }
> > +
> > if (unlikely(fault & VM_FAULT_ERROR))
> > goto out_up;
> >  
> > @@ -452,6 +460,7 @@ static inline vm_fault_t do_exception(struct pt_regs 
> > *regs, int access)
> > mmap_read_lock(mm);
> > goto retry;
> > }
> > +out_gmap:
> > if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
> > address =  __gmap_link(gmap, current->thread.gmap_addr,
> >address);
> > @@ -466,7 +475,8 @@ static inline vm_fault_t do_exception(struct pt_regs 
> > *regs, int access)
> > }
> > fault = 0;
> >  out_up:
> > -   mmap_read_unlock(mm);
> > +   if (need_unlock)
> > +   mmap_read_unlock(mm);
> >  out:
> 
> This seems to be incorrect. __gmap_link() requires the mmap_lock to be
> held. Christian, Janosch, or David, could you please check?

Thanks for pointing that out.  Indeed I see the clue right above the
comment of __gmap_link():

/*
 * ...
 * The mmap_lock of the mm that belongs to the address space must be held
 * when this function gets called.
 */
int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)

A further fact is it'll walk the pgtable right afterwards, assuming
gmap->mm will definitely be the current mm or it'll 

Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-27 Thread Heiko Carstens
e.net>, Chris Zankel , Michal Simek , 
Thomas Bogendoerfer , linux-par...@vger.kernel.org, 
Max Filippov , linux-ker...@vger.kernel.org, Dinh Nguyen 
, Palmer Dabbelt , Sven Schnelle 
, Guo Ren , Borislav Petkov 
, Johannes Berg , 
linuxppc-dev@lists.ozlabs.org, "David S . Miller" 
Errors-To: linuxppc-dev-bounces+archive=mail-archive@lists.ozlabs.org
Sender: "Linuxppc-dev" 


On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page.  It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
> 
> Then after that throttling we return VM_FAULT_RETRY.
> 
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
> 
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
> 
> It's not only slowing down page faults for shared file-backed, but also add
> more mmap lock contention which is in most cases not needed at all.
> 
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
> 
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock.  It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
> 
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed and these are
> the time it needs:
> 
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
> 
> I believe it could help more than that.
> 
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault), the rest changes in the page fault
> handlers should be relatively straightforward.
> 
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
> 
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
> 
> Signed-off-by: Peter Xu 
...
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index e173b6187ad5..9503a7cfaf03 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -339,6 +339,7 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>   unsigned long address;
>   unsigned int flags;
>   vm_fault_t fault;
> + bool need_unlock = true;
>   bool is_write;
>  
>   tsk = current;
> @@ -433,6 +434,13 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>   goto out_up;
>   goto out;
>   }
> +
> + /* The fault is fully completed (including releasing mmap lock) */
> + if (fault & VM_FAULT_COMPLETED) {
> + need_unlock = false;
> + goto out_gmap;
> + }
> +
>   if (unlikely(fault & VM_FAULT_ERROR))
>   goto out_up;
>  
> @@ -452,6 +460,7 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>   mmap_read_lock(mm);
>   goto retry;
>   }
> +out_gmap:
>   if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
>   address =  __gmap_link(gmap, current->thread.gmap_addr,
>  address);
> @@ -466,7 +475,8 @@ static inline vm_fault_t do_exception(struct pt_regs 
> *regs, int access)
>   }
>   fault = 0;
>  out_up:
> - mmap_read_unlock(mm);
> + if (need_unlock)
> + mmap_read_unlock(mm);
>  out:

This seems to be incorrect. __gmap_link() requires the mmap_lock to be
held. Christian, Janosch, or David, could you please check?


Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-27 Thread Ingo Molnar
et>, Chris Zankel , Michal Simek , Thomas 
Bogendoerfer , linux-par...@vger.kernel.org, Max 
Filippov , linux-ker...@vger.kernel.org, Dinh Nguyen 
, Palmer Dabbelt , Sven Schnelle 
, Guo Ren , Borislav Petkov 
, Johannes Berg , 
linuxppc-dev@lists.ozlabs.org, "David S . Miller" 
Errors-To: linuxppc-dev-bounces+archive=mail-archive@lists.ozlabs.org
Sender: "Linuxppc-dev" 



* Peter Xu  wrote:

> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed and these are
> the time it needs:
>
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)

Nice!

>  arch/x86/mm/fault.c   |  4 

Reviewed-by: Ingo Molnar 

Minor comment typo:

> + /*
> +  * We should do the same as VM_FAULT_RETRY, but let's not
> +  * return -EBUSY since that's not reflecting the reality on
> +  * what has happened - we've just fully completed a page
> +  * fault, with the mmap lock released.  Use -EAGAIN to show
> +  * that we want to take the mmap lock _again_.
> +  */

s/reflecting the reality on what has happened
 /reflecting the reality of what has happened

>   ret = handle_mm_fault(vma, address, fault_flags, NULL);
> +
> + if (ret & VM_FAULT_COMPLETED) {
> + /*
> +  * NOTE: it's a pity that we need to retake the lock here
> +  * to pair with the unlock() in the callers. Ideally we
> +  * could tell the callers so they do not need to unlock.
> +  */
> + mmap_read_lock(mm);
> + *unlocked = true;
> + return 0;

Indeed that's a pity - I guess more performance could be gained here, 
especially in highly parallel threaded workloads?

Thanks,

Ingo


Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-27 Thread Max Filippov
k.msu.ru>, Al Viro , Andy Lutomirski 
, Paul Walmsley , Thomas Gleixner 
, "open list:ALPHA PORT" , 
Andrew Morton , Vlastimil Babka , 
Richard Henderson , Chris Zankel , Michal 
Simek , Thomas Bogendoerfer , 
"open list:PARISC ARCHITECTURE" , LKML 
, Dinh Nguyen , Palmer 
Dabbelt , Sven Schnelle , Guo Ren 
, Borislav Petkov , Johannes Berg 
, linuxppc-dev@lists.ozlabs.org, "David S . Miller" 

Errors-To: linuxppc-dev-bounces+archive=mail-archive@lists.ozlabs.org
Sender: "Linuxppc-dev" 


On Tue, May 24, 2022 at 4:45 PM Peter Xu  wrote:
>
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page.  It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
>
> Then after that throttling we return VM_FAULT_RETRY.
>
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
>
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
>
> It's not only slowing down page faults for shared file-backed, but also add
> more mmap lock contention which is in most cases not needed at all.
>
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
>
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock.  It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
>
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed and these are
> the time it needs:
>
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
>
> I believe it could help more than that.
>
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault), the rest changes in the page fault
> handlers should be relatively straightforward.
>
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
>
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
>
> Signed-off-by: Peter Xu 
> ---
>
> v3:
> - Rebase to akpm/mm-unstable
> - Copy arch maintainers
> ---

For xtensa:
Acked-by: Max Filippov 

-- 
Thanks.
-- Max


Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-26 Thread Guo Ren
e.com>, Thomas Gleixner , linux-al...@vger.kernel.org, 
Andrew Morton , Vlastimil Babka , 
Richard Henderson , Chris Zankel , Michal 
Simek , Thomas Bogendoerfer , 
Parisc List , Max Filippov , 
Linux Kernel Mailing List , Dinh Nguyen 
, Palmer Dabbelt , Sven Schnelle 
, Borislav Petkov , Johannes Berg 
, linuxppc-dev , 
"David S . Miller" 
Errors-To: linuxppc-dev-bounces+archive=mail-archive@lists.ozlabs.org
Sender: "Linuxppc-dev" 


For csky part.

Acked-by: Guo Ren 

On Wed, May 25, 2022 at 7:45 AM Peter Xu  wrote:
>
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page.  It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
>
> Then after that throttling we return VM_FAULT_RETRY.
>
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
>
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
>
> It's not only slowing down page faults for shared file-backed, but also add
> more mmap lock contention which is in most cases not needed at all.
>
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
>
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock.  It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
>
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed and these are
> the time it needs:
>
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
>
> I believe it could help more than that.
>
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault), the rest changes in the page fault
> handlers should be relatively straightforward.
>
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
>
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
>
> Signed-off-by: Peter Xu 
> ---
>
> v3:
> - Rebase to akpm/mm-unstable
> - Copy arch maintainers
> ---
>  arch/alpha/mm/fault.c |  4 
>  arch/arc/mm/fault.c   |  4 
>  arch/arm/mm/fault.c   |  4 
>  arch/arm64/mm/fault.c |  4 
>  arch/csky/mm/fault.c  |  4 
>  arch/hexagon/mm/vm_fault.c|  4 
>  arch/ia64/mm/fault.c  |  4 
>  arch/m68k/mm/fault.c  |  4 
>  arch/microblaze/mm/fault.c|  4 
>  arch/mips/mm/fault.c  |  4 
>  arch/nios2/mm/fault.c |  4 
>  arch/openrisc/mm/fault.c  |  4 
>  arch/parisc/mm/fault.c|  4 
>  arch/powerpc/mm/copro_fault.c |  5 +
>  arch/powerpc/mm/fault.c   |  5 +
>  arch/riscv/mm/fault.c |  4 
>  arch/s390/mm/fault.c  | 12 +++-
>  arch/sh/mm/fault.c|  4 
>  arch/sparc/mm/fault_32.c  |  4 
>  arch/sparc/mm/fault_64.c  |  5 +
>  arch/um/kernel/trap.c |  4 
>  arch/x86/mm/fault.c   |  4 
>  arch/xtensa/mm/fault.c|  4 
>  include/linux/mm_types.h  |  2 ++
>  mm/gup.c  | 34 +-
>  mm/memory.c   |  2 +-
>  26 files changed, 138 insertions(+), 3 deletions(-)
>
> diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
> index ec20c1004abf..ef427a6bdd1a 100644
> --- a/arch/alpha/mm/fault.c
> +++ b/arch/alpha/mm/fault.c
> @@ -155,6 +155,10 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
> if (fault_signal_pending(fault, regs))
> return;
>
> +   /* The fault is fully completed (including releasing mmap lock) */
> +   if (fault & VM_FAULT_COMPLETED)
> +   return;
> +
> if (unlikely(fault & VM_FAULT_ERROR)) {
> if (fault & VM_FAULT_OOM)
> goto out_of_memory;
> diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
> index dad27e4d69ff..5ca59a482632 100644
> --- a/arch/arc/mm/fault.c
> +++ b/arch/arc/mm/fault.c
> @@ -146,6 +146,10 @@ void do_page_fault(unsigned long address, struct pt_regs 
> *regs)
> return;
>

Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-25 Thread Vineet Gupta

nkel , Michal Simek , Thomas Bogendoerfer , 
linux-par...@vger.kernel.org, linux-m...@vger.kernel.org, Dinh Nguyen , Palmer Dabbelt , Sven 
Schnelle , Guo Ren , Borislav Petkov , Johannes Berg 
, linuxppc-dev@lists.ozlabs.org, "David S . Miller" 
Errors-To: linuxppc-dev-bounces+archive=mail-archive@lists.ozlabs.org
Sender: "Linuxppc-dev" 




On 5/24/22 16:45, Peter Xu wrote:

I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page.  It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

Then after that throttling we return VM_FAULT_RETRY.

We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.

However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.

It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.

To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.

To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock.  It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.

This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:

   Before: 650.980 ms (+-1.94%)
   After:  569.396 ms (+-1.38%)

I believe it could help more than that.

We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.

Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.

Signed-off-by: Peter Xu
---

v3:
- Rebase to akpm/mm-unstable
- Copy arch maintainers
---
   arch/arc/mm/fault.c   |  4 


Acked-by: Vineet Gupta 

Thx,
-Vineet


Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-25 Thread Johannes Weiner
Michal Simek , Thomas Bogendoerfer 
, linux-par...@vger.kernel.org, Max Filippov 
, linux-ker...@vger.kernel.org, Dinh Nguyen 
, Palmer Dabbelt , Sven Schnelle 
, Guo Ren , Borislav Petkov 
, Johannes Berg , 
linuxppc-dev@lists.ozlabs.org, "David S . Miller" 
Errors-To: linuxppc-dev-bounces+archive=mail-archive@lists.ozlabs.org
Sender: "Linuxppc-dev" 


On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page.  It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
> 
> Then after that throttling we return VM_FAULT_RETRY.
> 
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
> 
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
> 
> It's not only slowing down page faults for shared file-backed, but also add
> more mmap lock contention which is in most cases not needed at all.
> 
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
> 
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock.  It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
> 
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed and these are
> the time it needs:
> 
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
> 
> I believe it could help more than that.
> 
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault), the rest changes in the page fault
> handlers should be relatively straightforward.
> 
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
> 
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
> 
> Signed-off-by: Peter Xu 

Acked-by: Johannes Weiner 


Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-25 Thread Peter Zijlstra
On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page.  It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
> 
> Then after that throttling we return VM_FAULT_RETRY.
> 
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
> 
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
> 
> It's not only slowing down page faults for shared file-backed, but also add
> more mmap lock contention which is in most cases not needed at all.
> 
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
> 
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock.  It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
> 
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed and these are
> the time it needs:
> 
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
> 
> I believe it could help more than that.
> 
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault), the rest changes in the page fault
> handlers should be relatively straightforward.
> 
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
> 
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
> 
> Signed-off-by: Peter Xu 

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-25 Thread Geert Uytterhoeven
On Wed, May 25, 2022 at 1:45 AM Peter Xu  wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page.  It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
>
> Then after that throttling we return VM_FAULT_RETRY.
>
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
>
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
>
> It's not only slowing down page faults for shared file-backed, but also add
> more mmap lock contention which is in most cases not needed at all.
>
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
>
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock.  It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
>
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed and these are
> the time it needs:
>
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
>
> I believe it could help more than that.
>
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault), the rest changes in the page fault
> handlers should be relatively straightforward.
>
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
>
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
>
> Signed-off-by: Peter Xu 

>  arch/m68k/mm/fault.c  |  4 

Acked-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-24 Thread Peter Xu
I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page.  It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

Then after that throttling we return VM_FAULT_RETRY.

We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.

However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.

It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.

To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.

To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock.  It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.

This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:

  Before: 650.980 ms (+-1.94%)
  After:  569.396 ms (+-1.38%)

I believe it could help more than that.

We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.

Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.

Signed-off-by: Peter Xu 
---

v3:
- Rebase to akpm/mm-unstable
- Copy arch maintainers
---
 arch/alpha/mm/fault.c |  4 
 arch/arc/mm/fault.c   |  4 
 arch/arm/mm/fault.c   |  4 
 arch/arm64/mm/fault.c |  4 
 arch/csky/mm/fault.c  |  4 
 arch/hexagon/mm/vm_fault.c|  4 
 arch/ia64/mm/fault.c  |  4 
 arch/m68k/mm/fault.c  |  4 
 arch/microblaze/mm/fault.c|  4 
 arch/mips/mm/fault.c  |  4 
 arch/nios2/mm/fault.c |  4 
 arch/openrisc/mm/fault.c  |  4 
 arch/parisc/mm/fault.c|  4 
 arch/powerpc/mm/copro_fault.c |  5 +
 arch/powerpc/mm/fault.c   |  5 +
 arch/riscv/mm/fault.c |  4 
 arch/s390/mm/fault.c  | 12 +++-
 arch/sh/mm/fault.c|  4 
 arch/sparc/mm/fault_32.c  |  4 
 arch/sparc/mm/fault_64.c  |  5 +
 arch/um/kernel/trap.c |  4 
 arch/x86/mm/fault.c   |  4 
 arch/xtensa/mm/fault.c|  4 
 include/linux/mm_types.h  |  2 ++
 mm/gup.c  | 34 +-
 mm/memory.c   |  2 +-
 26 files changed, 138 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index ec20c1004abf..ef427a6bdd1a 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -155,6 +155,10 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
if (fault_signal_pending(fault, regs))
return;
 
+   /* The fault is fully completed (including releasing mmap lock) */
+   if (fault & VM_FAULT_COMPLETED)
+   return;
+
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index dad27e4d69ff..5ca59a482632 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -146,6 +146,10 @@ void do_page_fault(unsigned long address, struct pt_regs 
*regs)
return;
}
 
+   /* The fault is fully completed (including releasing mmap lock) */
+   if (fault & VM_FAULT_COMPLETED)
+   return;
+
/*
 * Fault retry nuances, mmap_lock already relinquished by core mm
 */
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index a062e07516dd..46cccd6bf705 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -322,6 +322,10 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct 
pt_regs *regs)
return 0;
}
 
+   /* The fault is fully completed (including releasing mmap lock) */
+   if (fault & VM_FAULT_COMPLETED)
+   return 0;
+
if (!(fault & VM_FAULT_ERROR)) {