Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types
On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page. It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
>
> Then after that throttling we return VM_FAULT_RETRY.
>
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
>
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking the mmap read
> lock, walking the pgtable, etc.) could be in most cases unnecessary.
>
> It's not only slowing down page faults for shared file-backed mappings, but
> it also adds more mmap lock contention, which is in most cases not needed
> at all.
>
> To observe this, one could try to write to some shmem page and look at the
> "pgfault" value in /proc/vmstat; we should expect 2 counts for each shmem
> write simply because we retried, and the vm event "pgfault" will capture
> that.
>
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock. It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying a 400MB shmem file being mmap()ed, and these
> are the times it needs:
>
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
>
> I believe it could help more than that.
>
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault); the rest of the changes in the page
> fault handlers should be relatively straightforward.
>
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
>
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with the existing code, so I'm literally
> keeping them as-is.
>
> Signed-off-by: Peter Xu

Acked-by: Johannes Weiner
Re: mm: Question about the use of 'accessed' flags and pte_young() helper
On Tue, Oct 20, 2020 at 05:52:07PM +0200, Vlastimil Babka wrote:
> On 10/8/20 11:49 AM, Christophe Leroy wrote:
> > In a 10-year-old commit
> > (https://github.com/linuxppc/linux/commit/d069cb4373fe0d451357c4d3769623a7564dfa9f),
> > powerpc 8xx made the handling of the PTE accessed bit conditional on
> > CONFIG_SWAP. Since then, this has been extended to some other powerpc
> > variants.
> >
> > That commit means that when CONFIG_SWAP is not selected, the accessed
> > bit is not set by the SW TLB miss handlers, leading to pte_young()
> > returning garbage, or should I say possibly returning false although a
> > page has been accessed since its access flag was reset.
> >
> > Looking at various mm/ places, pte_young() is used independent of
> > CONFIG_SWAP.
> >
> > Is it still valid to not manage accessed flags when CONFIG_SWAP is not
> > selected?
>
> AFAIK it's wrong, reclaim needs it to detect accessed pages on the
> inactive list, via page_referenced(), including file pages (page cache)
> where CONFIG_SWAP plays no role. Maybe it was different 10 years ago.

Yes, we require this bit for properly aging mmapped file pages. The
underlying assumption in the referenced commit is incorrect.

> > If yes, should pte_young() always return true in that case?
>
> It should best work as intended. If not possible, true is maybe better,
> as false will lead to inactive file list thrashing.

An unconditional true will cause mmapped file pages to be permanently
mlocked / unevictable.

Either way will break some workloads. The only good answer is the truth :-)
Re: linux-next: runtime warning in Linus' tree
On Thu, Aug 13, 2020 at 04:46:54PM +1000, Stephen Rothwell wrote:
> [    0.055220][    T0] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5220
> mem_cgroup_css_alloc+0x350/0x904
>
> [The line numbers in the final linux-next are 5226 and 5141 due to
> later patches.]
>
> Introduced (or exposed) by commit
>
>   3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the parent cgroup")
>
> This commit actually adds the WARN_ON, so it either adds the bug that
> sets it off, or the bug already existed.
>
> Unfortunately, the version of this patch in linux-next up until today
> is different. :-(

Sorry, I made a last-minute request to include these checks in that
patch to make the code a bit more robust, but they trigger a false
positive here. Let's remove them.

---
From de8ea7c96c056c3cbe7b93995029986a158fb9cd Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Thu, 13 Aug 2020 10:40:54 -0400
Subject: [PATCH] mm: memcontrol: fix warning when allocating the root cgroup

Commit 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the
parent cgroup") adds memory tracking to the memcg kernel structures
themselves to make cgroups liable for the memory they are consuming
through the allocation of child groups (which can be significant).

This code is a bit awkward as it's spread out through several
functions: the outermost function does memalloc_use_memcg(parent) to
set up current->active_memcg, which designates which cgroup to charge,
and the inner functions pass GFP_ACCOUNT to request charging for
specific allocations. To make sure this dependency is satisfied at all
times - to make sure we don't randomly charge whoever is calling the
functions - the inner functions warn on !current->active_memcg.

However, this triggers a false warning when the root memcg itself is
allocated. No parent exists in this case, and so current->active_memcg
is rightfully NULL. It's a false positive, not indicative of a bug.

Delete the warnings for now, we can revisit this later.
Fixes: 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the parent cgroup")
Signed-off-by: Johannes Weiner
---
 mm/memcontrol.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d59fd9af6e63..9d87082e64aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5137,9 +5137,6 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
-	/* We charge the parent cgroup, never the current task */
-	WARN_ON_ONCE(!current->active_memcg);
-
 	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
						 GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_local) {
@@ -5222,9 +5219,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 	}
 
-	/* We charge the parent cgroup, never the current task */
-	WARN_ON_ONCE(!current->active_memcg);
-
 	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
						GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_local)
-- 
2.28.0
Re: [PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()
On Mon, Apr 13, 2020 at 05:15:49PM -0400, Waiman Long wrote:
> As said by Linus:
>
>   A symmetric naming is only helpful if it implies symmetries in use.
>   Otherwise it's actively misleading.

As the btrfs example proves - people can be tempted by this false
symmetry to pair kzalloc with kzfree, which isn't what we wanted.

>   In "kzalloc()", the z is meaningful and an important part of what the
>   caller wants.
>
>   In "kzfree()", the z is actively detrimental, because maybe in the
>   future we really _might_ want to use that "memfill(0xdeadbeef)" or
>   something. The "zero" part of the interface isn't even _relevant_.
>
> The main reason that kzfree() exists is to clear sensitive information
> that should not be leaked to other future users of the same memory
> objects.
>
> Rename kzfree() to kfree_sensitive() to follow the example of the
> recently added kvfree_sensitive() and make the intention of the API
> more explicit. In addition, memzero_explicit() is used to clear the
> memory to make sure that it won't get optimized away by the compiler.
>
> The renaming is done by using the command sequence:
>
>   git grep -w --name-only kzfree |\
>   xargs sed -i 's/\bkzfree\b/kfree_sensitive/'
>
> followed by some editing of the kfree_sensitive() kerneldoc and the
> use of memzero_explicit() instead of memset().
>
> Suggested-by: Joe Perches
> Signed-off-by: Waiman Long

Looks good to me. Thanks for fixing this very old mistake.

Acked-by: Johannes Weiner
Re: [PATCH 25/28] mm: remove vmalloc_user_node_flags
On Thu, Apr 09, 2020 at 03:25:03PM -0700, Andrii Nakryiko wrote:
> cc Johannes who suggested this API call originally

I forgot why we did it this way - probably just cruft begetting more
cruft. Either way, Christoph's cleanup makes this look a lot better.

> On Wed, Apr 8, 2020 at 5:03 AM Christoph Hellwig wrote:
> >
> > Open code it in __bpf_map_area_alloc, which is the only caller. Also
> > clean up __bpf_map_area_alloc to have a single vmalloc call with
> > slightly different flags instead of the current two different calls.
> >
> > For this to compile for the nommu case add a __vmalloc_node_range stub
> > to nommu.c.
> >
> > Signed-off-by: Christoph Hellwig

Acked-by: Johannes Weiner
Re: [PATCH v3 3/3] mm: use numa_mem_id() in alloc_pages_node()
On Thu, Jul 30, 2015 at 06:34:31PM +0200, Vlastimil Babka wrote:
> numa_mem_id() is able to handle allocation from CPUs on memory-less
> nodes, so it's a more robust fallback than the currently used
> numa_node_id().

Won't it fall through to the next closest memory node in the zonelist
anyway? Is this for callers doing NUMA_NO_NODE with __GFP_THISNODE?

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 6/7][TRIVIAL][resend] mm: cleanup page reclaim comment error
On Fri, Jun 15, 2012 at 09:19:45PM +0800, Wanpeng Li wrote:
> From: Wanpeng Li <l...@linux.vnet.ibm.com>
>
> Since there are five lists in the LRU cache, the array nr in
> get_scan_count should be:
>
>   nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
>   nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
>
> Signed-off-by: Wanpeng Li <liwp.li...@gmail.com>
> Acked-by: KOSAKI Motohiro <kosaki.motoh...@jp.fujitsu.com>
> Acked-by: Minchan Kim <minc...@kernel.org>
> Reviewed-by: Rik van Riel <r...@redhat.com>
> ---
>  mm/vmscan.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..ed823df 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1567,7 +1567,8 @@ static int vmscan_swappiness(struct scan_control *sc)
>   * by looking at the fraction of the pages scanned we did rotate back
>   * onto the active list instead of evict.
>   *
> - * nr[0] = anon pages to scan; nr[1] = file pages to scan
> + * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
> + * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
>   */

Does including this in the comment have any merit in the first place?
We never access nr[0] or nr[1] etc. anywhere with magic numbers. It's a
local function with one callsite, and the passed array is declared and
accessed exclusively through what is defined in enum lru_list, so where
is the point in repeating the enum items?

I'd rather the next change to this comment were its removal.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] sparsemem/bootmem: catch greater than section size allocations
On Tue, Feb 28, 2012 at 12:11:51PM -0800, Nishanth Aravamudan wrote:
> On 28.02.2012 [14:53:26 +0100], Johannes Weiner wrote:
> > On Fri, Feb 24, 2012 at 11:33:58AM -0800, Nishanth Aravamudan wrote:
> > > While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
> > > Overcommit) on powerpc, we tripped the following:
> > >
> > >   kernel BUG at mm/bootmem.c:483!
> > >   cpu 0x0: Vector: 700 (Program Check) at [c0c03940]
> > >       pc: c0a62bd8: .alloc_bootmem_core+0x90/0x39c
> > >       lr: c0a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
> > >       sp: c0c03bc0
> > >      msr: 80021032
> > >   current = 0xc0b0cce0
> > >   paca    = 0xc1d8
> > >     pid   = 0, comm = swapper
> > >   kernel BUG at mm/bootmem.c:483!
> > >   enter ? for help
> > >   [c0c03c80] c0a64bcc .sparse_early_usemaps_alloc_node+0x84/0x29c
> > >   [c0c03d50] c0a64f10 .sparse_init+0x12c/0x28c
> > >   [c0c03e20] c0a474f4 .setup_arch+0x20c/0x294
> > >   [c0c03ee0] c0a4079c .start_kernel+0xb4/0x460
> > >   [c0c03f90] c0009670 .start_here_common+0x1c/0x2c
> > >
> > > This is BUG_ON(limit && goal + size > limit); and after some
> > > debugging, it seems that
> > >
> > >   goal = 0x700
> > >   limit = 0x800
> > >
> > > and sparse_early_usemaps_alloc_node ->
> > > sparse_early_usemaps_alloc_pgdat_section -> alloc_bootmem_section
> > > calls
> > >
> > >   return alloc_bootmem_section(usemap_size() * count, section_nr);
> > >
> > > This is on a system with 8TB available via the AMS pool, and as a
> > > quirk of AMS in firmware, all of that memory shows up in node 0. So,
> > > we end up with an allocation that will fail the goal/limit
> > > constraints. In theory, we could fall back to alloc_bootmem_node()
> > > in sparse_early_usemaps_alloc_node(), but since we actually have
> > > HOTREMOVE defined, we'll BUG_ON() instead. A simple solution appears
> > > to be to disable the limit check if the size of the allocation in
> > > alloc_bootmem_section exceeds the section size.
> > It makes sense to allow the usemaps to spill over to subsequent
> > sections instead of panicking, so FWIW:
> >
> > Acked-by: Johannes Weiner <han...@cmpxchg.org>
> >
> > That being said, it would be good if check_usemap_section_nr()
> > printed the cross-dependencies between pgdats and sections when the
> > usemaps of a node spilled over to other sections than the ones
> > holding the pgdat. How about this?
> >
> > ---
> > From: Johannes Weiner <han...@cmpxchg.org>
> > Subject: sparsemem/bootmem: catch greater than section size allocations fix
> >
> > If alloc_bootmem_section() no longer guarantees section-locality, we
> > need check_usemap_section_nr() to print possible cross-dependencies
> > between node descriptors and the usemaps allocated through it.
> >
> > Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> > ---
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index 61d7cde..9e032dc 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -359,6 +359,7 @@ static void __init sparse_early_usemaps_alloc_node(unsigned long **usemap_map,
> >  			continue;
> >  		usemap_map[pnum] = usemap;
> >  		usemap += size;
> > +		check_usemap_section_nr(nodeid, usemap_map[pnum]);
> >  	}
> >  	return;
> >  }
>
> This makes sense to me -- ok if I fold it into the re-worked patch
> (based upon Mel's comments)?

Sure thing!

> > Furthermore, I wonder if we can remove the sparse-specific stuff from
> > bootmem.c as well, as now even more so than before, calculating the
> > desired area is really none of bootmem's business. Would something
> > like this be okay?
> >
> > ---
> > From: Johannes Weiner <han...@cmpxchg.org>
> > Subject: [patch] mm: remove sparsemem allocation details from the bootmem allocator
> >
> > alloc_bootmem_section() derives allocation area constraints from the
> > specified sparsemem section. This is a bit specific for a generic
> > memory allocator like bootmem, though, so move it over to sparsemem.
> >
> > Since __alloc_bootmem_node() already retries failed allocations with
> > relaxed area constraints, the fallback code in sparsemem.c can be
> > removed and the code becomes a bit more compact overall.
> > Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
>
> I've not tested it, but the intention seems sensible. I think it
> should remain a separate change.

Yes, I agree. I'll resend it in a bit as a stand-alone patch.
Re: [PATCH v2] bootmem/sparsemem: remove limit constraint in alloc_bootmem_section
-> sparse_early_usemaps_alloc_pgdat_section -> alloc_bootmem_section
calls

	return alloc_bootmem_section(usemap_size() * count, section_nr);

This is on a system with 8TB available via the AMS pool, and as a quirk
of AMS in firmware, all of that memory shows up in node 0. So, we end
up with an allocation that will fail the goal/limit constraints. In
theory, we could fall back to alloc_bootmem_node() in
sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
defined, we'll BUG_ON() instead.

A simple solution appears to be to unconditionally remove the limit
condition in alloc_bootmem_section, meaning allocations are allowed to
cross section boundaries (necessary for systems of this size).

Johannes Weiner pointed out that if alloc_bootmem_section() no longer
guarantees section-locality, we need check_usemap_section_nr() to print
possible cross-dependencies between node descriptors and the usemaps
allocated through it. That makes the two loops in
sparse_early_usemaps_alloc_node() identical, so re-factor the code a
bit.

Signed-off-by: Nishanth Aravamudan <n...@us.ibm.com>
Acked-by: Johannes Weiner <han...@cmpxchg.org>
Re: [PATCH] sparsemem/bootmem: catch greater than section size allocations
On Fri, Feb 24, 2012 at 11:33:58AM -0800, Nishanth Aravamudan wrote:
> While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
> Overcommit) on powerpc, we tripped the following:
>
>   kernel BUG at mm/bootmem.c:483!
>   cpu 0x0: Vector: 700 (Program Check) at [c0c03940]
>       pc: c0a62bd8: .alloc_bootmem_core+0x90/0x39c
>       lr: c0a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
>       sp: c0c03bc0
>      msr: 80021032
>   current = 0xc0b0cce0
>   paca    = 0xc1d8
>     pid   = 0, comm = swapper
>   kernel BUG at mm/bootmem.c:483!
>   enter ? for help
>   [c0c03c80] c0a64bcc .sparse_early_usemaps_alloc_node+0x84/0x29c
>   [c0c03d50] c0a64f10 .sparse_init+0x12c/0x28c
>   [c0c03e20] c0a474f4 .setup_arch+0x20c/0x294
>   [c0c03ee0] c0a4079c .start_kernel+0xb4/0x460
>   [c0c03f90] c0009670 .start_here_common+0x1c/0x2c
>
> This is BUG_ON(limit && goal + size > limit); and after some debugging,
> it seems that
>
>   goal = 0x700
>   limit = 0x800
>
> and sparse_early_usemaps_alloc_node ->
> sparse_early_usemaps_alloc_pgdat_section -> alloc_bootmem_section calls
>
>   return alloc_bootmem_section(usemap_size() * count, section_nr);
>
> This is on a system with 8TB available via the AMS pool, and as a quirk
> of AMS in firmware, all of that memory shows up in node 0. So, we end
> up with an allocation that will fail the goal/limit constraints. In
> theory, we could fall back to alloc_bootmem_node() in
> sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
> defined, we'll BUG_ON() instead. A simple solution appears to be to
> disable the limit check if the size of the allocation in
> alloc_bootmem_section exceeds the section size.

It makes sense to allow the usemaps to spill over to subsequent
sections instead of panicking, so FWIW:

Acked-by: Johannes Weiner <han...@cmpxchg.org>

That being said, it would be good if check_usemap_section_nr() printed
the cross-dependencies between pgdats and sections when the usemaps of
a node spilled over to other sections than the ones holding the pgdat.
How about this?
---
From: Johannes Weiner <han...@cmpxchg.org>
Subject: sparsemem/bootmem: catch greater than section size allocations fix

If alloc_bootmem_section() no longer guarantees section-locality, we
need check_usemap_section_nr() to print possible cross-dependencies
between node descriptors and the usemaps allocated through it.

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
diff --git a/mm/sparse.c b/mm/sparse.c
index 61d7cde..9e032dc 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -359,6 +359,7 @@ static void __init sparse_early_usemaps_alloc_node(unsigned long **usemap_map,
 			continue;
 		usemap_map[pnum] = usemap;
 		usemap += size;
+		check_usemap_section_nr(nodeid, usemap_map[pnum]);
 	}
 	return;
 }
---

Furthermore, I wonder if we can remove the sparse-specific stuff from
bootmem.c as well, as now even more so than before, calculating the
desired area is really none of bootmem's business. Would something like
this be okay?

---
From: Johannes Weiner <han...@cmpxchg.org>
Subject: [patch] mm: remove sparsemem allocation details from the bootmem allocator

alloc_bootmem_section() derives allocation area constraints from the
specified sparsemem section. This is a bit specific for a generic
memory allocator like bootmem, though, so move it over to sparsemem.

Since __alloc_bootmem_node() already retries failed allocations with
relaxed area constraints, the fallback code in sparsemem.c can be
removed and the code becomes a bit more compact overall.
Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
 include/linux/bootmem.h |  3 ---
 mm/bootmem.c            | 26 --
 mm/sparse.c             | 29 +
 3 files changed, 9 insertions(+), 49 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index ab344a5..001c248 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -135,9 +135,6 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 extern int reserve_bootmem_generic(unsigned long addr, unsigned long size,
				   int flags);
 
-extern void *alloc_bootmem_section(unsigned long size,
-				   unsigned long section_nr);
-
 #ifdef CONFIG_HAVE_ARCH_ALLOC_REMAP
 extern void *alloc_remap(int nid, unsigned long size);
 #else
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 7bc0557..d34026c 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -756,32 +756,6 @@ void * __init __alloc_bootmem_node_high(pg_data_t *pgdat, unsigned long size,
 }
 
-#ifdef CONFIG_SPARSEMEM
-/**
- * alloc_bootmem_section - allocate boot memory from a specific section
- * @size: size of the request in bytes