Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types

2022-05-25 Thread Johannes Weiner


On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote:
> I observed that for shared file-backed page faults, we are very likely to
> retry one extra time on the first write fault to a missing page.  That's
> because we need to release the mmap lock for dirty rate limiting via
> balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
>
> Then after that throttling we return VM_FAULT_RETRY.
>
> We probably did that because VM_FAULT_RETRY is the only way to return to
> the fault handler while telling it that we've released the mmap lock.
>
> However, that's not ideal, because the fault very likely does not need to
> be retried at all: the pgtable entry was already installed before the
> throttling, so the follow-up fault (taking the mmap read lock, walking the
> pgtable, etc.) is in most cases unnecessary.
>
> This not only slows down page faults on shared file-backed memory, it also
> adds mmap lock contention that is in most cases not needed at all.
>
> To observe this, write to some shmem pages and watch the "pgfault" value
> in /proc/vmstat: each shmem write currently shows up as 2 counts, simply
> because we retried and the "pgfault" vm event captured both faults.
>
> To make this more efficient, add a new VM_FAULT_COMPLETED return code to
> indicate that we've completed the whole fault and released the lock.  It's
> also a hint that another fault on this page is very likely unnecessary,
> because we've just completed it.
>
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program that sequentially dirties a 400MB mmap()ed shmem file; these are
> the times it needs:
> 
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
> 
> I believe it could help more than that.
> 
> GUP and the s390 pgfault handler (the gmap code before returning from the
> fault) need some special care; the rest of the changes in the page fault
> handlers should be relatively straightforward.
>
> Another thing to mention is that mm_account_fault() accounts this new
> fault as a generic fault, unlike VM_FAULT_RETRY.
>
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they
> don't handle VM_FAULT_RETRY even in the existing code, so they are kept
> as-is.
> 
> Signed-off-by: Peter Xu 

Acked-by: Johannes Weiner 
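For illustration, here is what the change boils down to on the arch side, plus a userspace reproducer for the pgfault/vmstat observation above. Both are sketches written against the description in this changelog, not quotes of the patch:

	/* Arch fault-handler side (sketch): VM_FAULT_COMPLETED means the
	 * fault is fully handled and the mmap lock was already dropped, so
	 * neither a retry nor mmap_read_unlock() is needed here. */
	fault = handle_mm_fault(vma, address, flags, regs);

	if (fault & VM_FAULT_COMPLETED)
		return;

	if (fault & VM_FAULT_RETRY) {
		flags |= FAULT_FLAG_TRIED;
		goto retry;
	}

	mmap_read_unlock(mm);

A minimal standalone program that dirties a 400MB shmem mapping one page at a time (the memfd_create() backing is my assumption; the changelog only says "shmem file"):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/time.h>
	#include <unistd.h>

	int main(void)
	{
		size_t size = 400UL << 20;		/* 400MB, as in the test above */
		long pgsz = sysconf(_SC_PAGESIZE);
		int fd = memfd_create("shmem-fault-test", 0);
		struct timeval t0, t1;
		char *buf;
		size_t i;

		if (fd < 0 || ftruncate(fd, size) < 0) {
			perror("memfd_create/ftruncate");
			return 1;
		}
		buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		gettimeofday(&t0, NULL);
		for (i = 0; i < size; i += pgsz)	/* one write fault per page */
			buf[i] = 1;
		gettimeofday(&t1, NULL);

		printf("dirtied %zu MB in %.3f ms\n", size >> 20,
		       (t1.tv_sec - t0.tv_sec) * 1000.0 +
		       (t1.tv_usec - t0.tv_usec) / 1000.0);

		/* Compare "grep pgfault /proc/vmstat" before and after this run:
		 * without VM_FAULT_COMPLETED each write fault is counted twice
		 * because of the retry described above. */
		return 0;
	}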


Re: mm: Question about the use of 'accessed' flags and pte_young() helper

2020-10-20 Thread Johannes Weiner
On Tue, Oct 20, 2020 at 05:52:07PM +0200, Vlastimil Babka wrote:
> On 10/8/20 11:49 AM, Christophe Leroy wrote:
> > In a 10-year-old commit
> > (https://github.com/linuxppc/linux/commit/d069cb4373fe0d451357c4d3769623a7564dfa9f),
> > powerpc 8xx made the handling of the PTE accessed bit conditional on
> > CONFIG_SWAP.  Since then, this has been extended to some other powerpc
> > variants.
> >
> > That commit means that when CONFIG_SWAP is not selected, the accessed
> > bit is not set by the SW TLB miss handlers, leading to pte_young()
> > returning garbage - or should I say, possibly returning false although
> > a page has been accessed since its access flag was reset.
> >
> > Looking at various mm/ places, pte_young() is used independently of
> > CONFIG_SWAP.
> >
> > Is it still valid to not manage accessed flags when CONFIG_SWAP is not
> > selected?
> 
> AFAIK it's wrong, reclaim needs it to detect accessed pages on inactive
> list, via page_referenced(), including file pages (page cache) where
> CONFIG_SWAP plays no role. Maybe it was different 10 years ago.

Yes, we require this bit for properly aging mmapped file pages. The
underlying assumption in the referenced commit is incorrect.

> > If yes, should pte_young() always return true in that case?
>
> Ideally it should work as intended. If that's not possible, true is maybe
> better, as false will lead to inactive file list thrashing.

An unconditional true will cause mmapped file pages to be permanently
mlocked / unevictable.

Either way will break some workloads. The only good answer is the
truth :-)
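To make the trade-off concrete, here is a toy userspace model (purely illustrative, not kernel code) of how aging depends on what pte_young() reports:

	#include <stdbool.h>
	#include <stdio.h>

	/* Toy model of inactive-list aging: each reclaim pass asks "was this
	 * page referenced since the last pass?" and clears the flag for the
	 * next round. */
	struct toy_page {
		bool accessed;		/* what pte_young() would report */
	};

	static bool keep_page(struct toy_page *p)
	{
		bool young = p->accessed;

		p->accessed = false;	/* aging resets the referenced state */
		return young;
	}

	int main(void)
	{
		struct toy_page hot  = { .accessed = true };
		struct toy_page cold = { .accessed = false };

		/* If pte_young() always returned false, "hot" would be treated
		 * like "cold" and reclaimed: inactive list thrashing.  If it
		 * always returned true, "cold" would be treated like "hot" and
		 * never reclaimed: effectively unevictable, as noted above. */
		printf("hot page kept:  %d\n", keep_page(&hot));
		printf("cold page kept: %d\n", keep_page(&cold));
		return 0;
	}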


Re: linux-next: runtime warning in Linus' tree

2020-08-13 Thread Johannes Weiner
On Thu, Aug 13, 2020 at 04:46:54PM +1000, Stephen Rothwell wrote:
> [0.055220][T0] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5220 mem_cgroup_css_alloc+0x350/0x904

> [The line numbers in the final linux next are 5226 and 5141 due to
> later patches.]
> 
> Introduced (or exposed) by commit
> 
>   3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the parent cgroup")
> 
> This commit actually adds the WARN_ON, so it either adds the bug that
> sets it off, or the bug already existed.
> 
> Unfortunately, the version of this patch in linux-next up until today
> is different.  :-(

Sorry, I made a last-minute request to include these checks in that
patch to make the code a bit more robust, but they trigger a false
positive here. Let's remove them.

---

From de8ea7c96c056c3cbe7b93995029986a158fb9cd Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Thu, 13 Aug 2020 10:40:54 -0400
Subject: [PATCH] mm: memcontrol: fix warning when allocating the root cgroup

Commit 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the
parent cgroup") adds memory tracking to the memcg kernel structures
themselves to make cgroups liable for the memory they are consuming
through the allocation of child groups (which can be significant).

This code is a bit awkward as it's spread out through several
functions: The outermost function does memalloc_use_memcg(parent) to
set up current->active_memcg, which designates which cgroup to charge,
and the inner functions pass GFP_ACCOUNT to request charging for
specific allocations. To make sure this dependency is satisfied at all
times - to make sure we don't randomly charge whoever is calling the
functions - the inner functions warn on !current->active_memcg.

However, this triggers a false warning when the root memcg itself is
allocated. No parent exists in this case, and so current->active_memcg
is rightfully NULL. It's a false positive, not indicative of a bug.

Delete the warnings for now, we can revisit this later.

Fixes: 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the parent cgroup")
Signed-off-by: Johannes Weiner 
---
 mm/memcontrol.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d59fd9af6e63..9d87082e64aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5137,9 +5137,6 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;
 
-   /* We charge the parent cgroup, never the current task */
-   WARN_ON_ONCE(!current->active_memcg);
-
pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
 GFP_KERNEL_ACCOUNT);
if (!pn->lruvec_stat_local) {
@@ -5222,9 +5219,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
goto fail;
}
 
-   /* We charge the parent cgroup, never the current task */
-   WARN_ON_ONCE(!current->active_memcg);
-
memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
GFP_KERNEL_ACCOUNT);
if (!memcg->vmstats_local)
-- 
2.28.0
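For context, the charging pattern the changelog describes looks roughly like this (schematic only, not the literal mem_cgroup_css_alloc() code; the helper name below is made up):

	/*
	 * Schematic: the outer function designates the parent cgroup as the
	 * charge target, and the inner allocations opt in with
	 * GFP_KERNEL_ACCOUNT.  For the root cgroup there is no parent, so
	 * current->active_memcg is legitimately NULL - which is what tripped
	 * the (now removed) warnings.
	 */
	static struct mem_cgroup *alloc_child_memcg(struct mem_cgroup *parent)
	{
		struct mem_cgroup *memcg;

		memalloc_use_memcg(parent);	/* sets current->active_memcg */
		memcg = mem_cgroup_alloc();	/* percpu allocs use GFP_KERNEL_ACCOUNT */
		memalloc_unuse_memcg();

		return memcg;
	}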



Re: [PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-04-14 Thread Johannes Weiner
On Mon, Apr 13, 2020 at 05:15:49PM -0400, Waiman Long wrote:
> As said by Linus:
> 
>   A symmetric naming is only helpful if it implies symmetries in use.
>   Otherwise it's actively misleading.

As the btrfs example proves - people can be tempted by this false
symmetry to pair kzalloc with kzfree, which isn't what we wanted.

>   In "kzalloc()", the z is meaningful and an important part of what the
>   caller wants.
> 
>   In "kzfree()", the z is actively detrimental, because maybe in the
>   future we really _might_ want to use that "memfill(0xdeadbeef)" or
>   something. The "zero" part of the interface isn't even _relevant_.
> 
> The main reason that kzfree() exists is to clear sensitive information
> that should not be leaked to other future users of the same memory
> objects.
> 
> Rename kzfree() to kfree_sensitive() to follow the example of the
> recently added kvfree_sensitive() and make the intention of the API
> more explicit. In addition, memzero_explicit() is used to clear the
> memory to make sure that it won't get optimized away by the compiler.
> 
> The renaming is done by using the command sequence:
> 
>   git grep -w --name-only kzfree |\
>   xargs sed -i 's/\bkzfree\b/kfree_sensitive/'
> 
> followed by some editing of the kfree_sensitive() kerneldoc and the
> use of memzero_explicit() instead of memset().
> 
> Suggested-by: Joe Perches 
> Signed-off-by: Waiman Long 

Looks good to me. Thanks for fixing this very old mistake.

Acked-by: Johannes Weiner 
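For reference, the renamed helper is roughly shaped like this (paraphrased from the description above, not a verbatim copy of the patch):

	void kfree_sensitive(const void *p)
	{
		size_t ks = ksize(p);	/* true size of the allocation, 0 for NULL */

		if (ks)
			memzero_explicit((void *)p, ks);	/* clear; cannot be optimized away */
		kfree(p);
	}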


Re: [PATCH 25/28] mm: remove vmalloc_user_node_flags

2020-04-13 Thread Johannes Weiner
On Thu, Apr 09, 2020 at 03:25:03PM -0700, Andrii Nakryiko wrote:
> cc Johannes who suggested this API call originally

I forgot why we did it this way - probably just cruft begetting more
cruft. Either way, Christoph's cleanup makes this look a lot better.

> On Wed, Apr 8, 2020 at 5:03 AM Christoph Hellwig  wrote:
> >
> > Open code it in __bpf_map_area_alloc, which is the only caller.  Also
> > clean up __bpf_map_area_alloc to have a single vmalloc call with
> > slightly different flags instead of the current two different calls.
> >
> > For this to compile for the nommu case add a __vmalloc_node_range stub
> > to nommu.c.
> >
> > Signed-off-by: Christoph Hellwig 

Acked-by: Johannes Weiner 
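A sketch of the single-call shape described above (illustrative only; the function name, alignment, and exact flag mix are my assumptions, not the bpf patch):

	static void *map_area_alloc(u64 size, int numa_node, bool user_mappable)
	{
		gfp_t gfp = GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT;
		/* VM_USERMAP only when the area must be mmap()able by userspace */
		unsigned long vm_flags = user_mappable ? VM_USERMAP : 0;

		return __vmalloc_node_range(size, PAGE_SIZE, VMALLOC_START,
					    VMALLOC_END, gfp, PAGE_KERNEL, vm_flags,
					    numa_node, __builtin_return_address(0));
	}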


Re: [PATCH v3 3/3] mm: use numa_mem_id() in alloc_pages_node()

2015-07-30 Thread Johannes Weiner
On Thu, Jul 30, 2015 at 06:34:31PM +0200, Vlastimil Babka wrote:
 numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
 so it's a more robust fallback than the currently used numa_node_id().

Won't it fall through to the next closest memory node in the zonelist
anyway? Is this for callers doing NUMA_NO_NODE with __GFP_THISNODE?
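For readers following along, the change under discussion essentially adds this fallback (sketch based on the quoted description, not the exact hunk):

	static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
						    unsigned int order)
	{
		/* numa_node_id() is the CPU's own node, which may be memoryless;
		 * numa_mem_id() is the nearest node that actually has memory. */
		if (nid == NUMA_NO_NODE)
			nid = numa_mem_id();

		return __alloc_pages_node(nid, gfp_mask, order);
	}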

Re: [PATCH 6/7][TRIVIAL][resend] mm: cleanup page reclaim comment error

2012-06-15 Thread Johannes Weiner
On Fri, Jun 15, 2012 at 09:19:45PM +0800, Wanpeng Li wrote:
 From: Wanpeng Li l...@linux.vnet.ibm.com
 
 Since there are five lists in LRU cache, the array nr in get_scan_count
 should be:
 
 nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
 nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
 
 Signed-off-by: Wanpeng Li liwp.li...@gmail.com
 Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com
 Acked-by: Minchan Kim minc...@kernel.org
 Reviewed-by: Rik van Riel r...@redhat.com
 
 ---
  mm/vmscan.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/mm/vmscan.c b/mm/vmscan.c
 index eeb3bc9..ed823df 100644
 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
 @@ -1567,7 +1567,8 @@ static int vmscan_swappiness(struct scan_control *sc)
   * by looking at the fraction of the pages scanned we did rotate back
   * onto the active list instead of evict.
   *
 - * nr[0] = anon pages to scan; nr[1] = file pages to scan
 + * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
 + * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
   */

Does including this in the comment have any merit in the first place?
We never access nr[0] or nr[1] etc. anywhere with magic numbers.  It's
a local function with one callsite, and the passed array is declared and
accessed exclusively through what is defined in enum lru_list - so what
is the point of repeating the enum items?  I'd rather the next change to
this comment be its removal.
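For reference, the indexing in question looks roughly like this (schematic, simplified from that era's enum; pages_to_scan() is a stand-in, not a real function):

	enum lru_list {			/* simplified */
		LRU_INACTIVE_ANON,
		LRU_ACTIVE_ANON,
		LRU_INACTIVE_FILE,
		LRU_ACTIVE_FILE,
		LRU_UNEVICTABLE,
		NR_LRU_LISTS
	};

	unsigned long nr[NR_LRU_LISTS];

	for_each_evictable_lru(lru)
		nr[lru] = pages_to_scan(lruvec, lru, sc);	/* hypothetical helper */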


Re: [PATCH] sparsemem/bootmem: catch greater than section size allocations

2012-02-29 Thread Johannes Weiner
On Tue, Feb 28, 2012 at 12:11:51PM -0800, Nishanth Aravamudan wrote:
 On 28.02.2012 [14:53:26 +0100], Johannes Weiner wrote:
  On Fri, Feb 24, 2012 at 11:33:58AM -0800, Nishanth Aravamudan wrote:
   While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
   Overcommit) on powerpc, we tripped the following:
   
   kernel BUG at mm/bootmem.c:483!
   cpu 0x0: Vector: 700 (Program Check) at [c0c03940]
   pc: c0a62bd8: .alloc_bootmem_core+0x90/0x39c
   lr: c0a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
   sp: c0c03bc0
  msr: 80021032
 current = 0xc0b0cce0
 paca= 0xc1d8
   pid   = 0, comm = swapper
   kernel BUG at mm/bootmem.c:483!
   enter ? for help
   [c0c03c80] c0a64bcc
   .sparse_early_usemaps_alloc_node+0x84/0x29c
   [c0c03d50] c0a64f10 .sparse_init+0x12c/0x28c
   [c0c03e20] c0a474f4 .setup_arch+0x20c/0x294
   [c0c03ee0] c0a4079c .start_kernel+0xb4/0x460
   [c0c03f90] c0009670 .start_here_common+0x1c/0x2c
   
   This is
   
   BUG_ON(limit && goal + size > limit);
   
   and after some debugging, it seems that
   
 goal = 0x700
 limit = 0x800
   
   and sparse_early_usemaps_alloc_node ->
   sparse_early_usemaps_alloc_pgdat_section -> alloc_bootmem_section calls
   
 return alloc_bootmem_section(usemap_size() * count, section_nr);
   
   This is on a system with 8TB available via the AMS pool, and as a quirk
   of AMS in firmware, all of that memory shows up in node 0. So, we end up
   with an allocation that will fail the goal/limit constraints. In theory,
   we could fall-back to alloc_bootmem_node() in
   sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
   defined, we'll BUG_ON() instead. A simple solution appears to be to
   disable the limit check if the size of the allocation in
   alloc_bootmem_section exceeds the section size.
  
  It makes sense to allow the usemaps to spill over to subsequent
  sections instead of panicking, so FWIW:
  
  Acked-by: Johannes Weiner han...@cmpxchg.org
  
  That being said, it would be good if check_usemap_section_nr() printed
  the cross-dependencies between pgdats and sections when the usemaps of
  a node spilled over to other sections than the ones holding the pgdat.
  
  How about this?
  
  ---
  From: Johannes Weiner han...@cmpxchg.org
  Subject: sparsemem/bootmem: catch greater than section size allocations fix
  
  If alloc_bootmem_section() no longer guarantees section-locality, we
  need check_usemap_section_nr() to print possible cross-dependencies
  between node descriptors and the usemaps allocated through it.
  
  Signed-off-by: Johannes Weiner han...@cmpxchg.org
  ---
  
  diff --git a/mm/sparse.c b/mm/sparse.c
  index 61d7cde..9e032dc 100644
  --- a/mm/sparse.c
  +++ b/mm/sparse.c
   @@ -359,6 +359,7 @@ static void __init sparse_early_usemaps_alloc_node(unsigned long**usemap_map,
  continue;
  usemap_map[pnum] = usemap;
  usemap += size;
  +   check_usemap_section_nr(nodeid, usemap_map[pnum]);
  }
  return;
  }
 
 This makes sense to me -- ok if I fold it into the re-worked patch
 (based upon Mel's comments)?

Sure thing!

  Furthermore, I wonder if we can remove the sparse-specific stuff from
  bootmem.c as well, as now even more so than before, calculating the
  desired area is really none of bootmem's business.
  
  Would something like this be okay?
  
  ---
  From: Johannes Weiner han...@cmpxchg.org
  Subject: [patch] mm: remove sparsemem allocation details from the bootmem 
  allocator
  
  alloc_bootmem_section() derives allocation area constraints from the
  specified sparsemem section.  This is a bit specific for a generic
  memory allocator like bootmem, though, so move it over to sparsemem.
  
  Since __alloc_bootmem_node() already retries failed allocations with
  relaxed area constraints, the fallback code in sparsemem.c can be
  removed and the code becomes a bit more compact overall.
  
  Signed-off-by: Johannes Weiner han...@cmpxchg.org
 
 I've not tested it, but the intention seems sensible. I think it should
 remain a separate change.

Yes, I agree.  I'll resend it in a bit as stand-alone patch.


Re: [PATCH v2] bootmem/sparsemem: remove limit constraint in alloc_bootmem_section

2012-02-29 Thread Johannes Weiner
 sparse_early_usemaps_alloc_node ->
 sparse_early_usemaps_alloc_pgdat_section calls
 
   return alloc_bootmem_section(usemap_size() * count, section_nr);
 
 This is on a system with 8TB available via the AMS pool, and as a quirk
 of AMS in firmware, all of that memory shows up in node 0. So, we end up
 with an allocation that will fail the goal/limit constraints. In theory,
 we could fall-back to alloc_bootmem_node() in
 sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
 defined, we'll BUG_ON() instead. A simple solution appears to be to
 unconditionally remove the limit condition in alloc_bootmem_section,
 meaning allocations are allowed to cross section boundaries (necessary
 for systems of this size).
 
 Johannes Weiner pointed out that if alloc_bootmem_section() no longer
 guarantees section-locality, we need check_usemap_section_nr() to print
 possible cross-dependencies between node descriptors and the usemaps
 allocated through it. That makes the two loops in
 sparse_early_usemaps_alloc_node() identical, so re-factor the code a
 bit.
 
 Signed-off-by: Nishanth Aravamudan n...@us.ibm.com

Acked-by: Johannes Weiner han...@cmpxchg.org


Re: [PATCH] sparsemem/bootmem: catch greater than section size allocations

2012-02-28 Thread Johannes Weiner
On Fri, Feb 24, 2012 at 11:33:58AM -0800, Nishanth Aravamudan wrote:
 While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
 Overcommit) on powerpc, we tripped the following:
 
 kernel BUG at mm/bootmem.c:483!
 cpu 0x0: Vector: 700 (Program Check) at [c0c03940]
 pc: c0a62bd8: .alloc_bootmem_core+0x90/0x39c
 lr: c0a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
 sp: c0c03bc0
msr: 80021032
   current = 0xc0b0cce0
   paca= 0xc1d8
 pid   = 0, comm = swapper
 kernel BUG at mm/bootmem.c:483!
 enter ? for help
 [c0c03c80] c0a64bcc
 .sparse_early_usemaps_alloc_node+0x84/0x29c
 [c0c03d50] c0a64f10 .sparse_init+0x12c/0x28c
 [c0c03e20] c0a474f4 .setup_arch+0x20c/0x294
 [c0c03ee0] c0a4079c .start_kernel+0xb4/0x460
 [c0c03f90] c0009670 .start_here_common+0x1c/0x2c
 
 This is
 
  BUG_ON(limit && goal + size > limit);
 
 and after some debugging, it seems that
 
   goal = 0x700
   limit = 0x800
 
  and sparse_early_usemaps_alloc_node ->
  sparse_early_usemaps_alloc_pgdat_section -> alloc_bootmem_section calls
 
   return alloc_bootmem_section(usemap_size() * count, section_nr);
 
 This is on a system with 8TB available via the AMS pool, and as a quirk
 of AMS in firmware, all of that memory shows up in node 0. So, we end up
 with an allocation that will fail the goal/limit constraints. In theory,
 we could fall-back to alloc_bootmem_node() in
 sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
 defined, we'll BUG_ON() instead. A simple solution appears to be to
 disable the limit check if the size of the allocation in
  alloc_bootmem_section exceeds the section size.

It makes sense to allow the usemaps to spill over to subsequent
sections instead of panicking, so FWIW:

Acked-by: Johannes Weiner han...@cmpxchg.org

That being said, it would be good if check_usemap_section_nr() printed
the cross-dependencies between pgdats and sections when the usemaps of
a node spilled over to other sections than the ones holding the pgdat.

How about this?

---
From: Johannes Weiner han...@cmpxchg.org
Subject: sparsemem/bootmem: catch greater than section size allocations fix

If alloc_bootmem_section() no longer guarantees section-locality, we
need check_usemap_section_nr() to print possible cross-dependencies
between node descriptors and the usemaps allocated through it.

Signed-off-by: Johannes Weiner han...@cmpxchg.org
---

diff --git a/mm/sparse.c b/mm/sparse.c
index 61d7cde..9e032dc 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -359,6 +359,7 @@ static void __init sparse_early_usemaps_alloc_node(unsigned long**usemap_map,
continue;
usemap_map[pnum] = usemap;
usemap += size;
+   check_usemap_section_nr(nodeid, usemap_map[pnum]);
}
return;
}

---

Furthermore, I wonder if we can remove the sparse-specific stuff from
bootmem.c as well, as now even more so than before, calculating the
desired area is really none of bootmem's business.

Would something like this be okay?

---
From: Johannes Weiner han...@cmpxchg.org
Subject: [patch] mm: remove sparsemem allocation details from the bootmem 
allocator

alloc_bootmem_section() derives allocation area constraints from the
specified sparsemem section.  This is a bit specific for a generic
memory allocator like bootmem, though, so move it over to sparsemem.

Since __alloc_bootmem_node() already retries failed allocations with
relaxed area constraints, the fallback code in sparsemem.c can be
removed and the code becomes a bit more compact overall.

Signed-off-by: Johannes Weiner han...@cmpxchg.org
---
 include/linux/bootmem.h |3 ---
 mm/bootmem.c|   26 --
 mm/sparse.c |   29 +
 3 files changed, 9 insertions(+), 49 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index ab344a5..001c248 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -135,9 +135,6 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 extern int reserve_bootmem_generic(unsigned long addr, unsigned long size,
   int flags);
 
-extern void *alloc_bootmem_section(unsigned long size,
-  unsigned long section_nr);
-
 #ifdef CONFIG_HAVE_ARCH_ALLOC_REMAP
 extern void *alloc_remap(int nid, unsigned long size);
 #else
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 7bc0557..d34026c 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -756,32 +756,6 @@ void * __init __alloc_bootmem_node_high(pg_data_t *pgdat, unsigned long size,
 
 }
 
-#ifdef CONFIG_SPARSEMEM
-/**
- * alloc_bootmem_section - allocate boot memory from a specific section
- * @size: size of the request in bytes