Re: [PATCH -next v6 0/2] Make memory reclamation measurable

2024-03-07 Thread Michal Hocko
On Thu 07-03-24 15:40:29, Bixuan Cui wrote:
[...]
> Currently, with the help of kernel trace events or tools like Perfetto, we
> can only see that kswapd is competing for CPU and the frequency of memory
> reclamation triggers, but we do not have detailed information or metrics
> about memory reclamation, such as the duration and amount of each
> reclamation, or who is releasing memory (super_cache, f2fs, ext4), etc. This
> makes it impossible to locate the above problems.

I am not sure I agree with you here. We do provide insight into LRU and
shrinker reclaim. Why isn't that enough? In general I would advise you
to focus more on describing why the existing infrastructure is
insufficient (examples would be really appreciated).

> Currently this patch helps us solve 2 actual performance problems (kswapd
> preempts the CPU causing game delay)
> 1. The increased memory allocation in the game (across different versions)
> has led to the degradation of kswapd.
> This is found by calculating the total amount of Reclaim(page) during
> the game startup phase.
> 
> 2. The adoption of a different file system in the new system version has
> resulted in a slower reclamation rate.
> This is discovered through the OBJ_NAME change. For example, OBJ_NAME
> changes from super_cache_scan to ext4_es_scan.
> 
> Subsequently, it is also possible to calculate the memory reclamation rate
> to evaluate the memory performance of different versions.

Why can't you achieve this with the existing tracing or /proc/vmstat
infrastructure?

> The main reasons for adding static tracepoints are:
> 1. To subdivide the time spent in the shrinker->count_objects() and
> shrinker->scan_objects() functions within the do_shrink_slab function. Using
> BPF kprobe, we can only track the time spent in the do_shrink_slab function.
> 2. When tracing frequently called functions, static tracepoints (BPF
> tp/tracepoint) have lower performance impact compared to dynamic tracepoints
> (BPF kprobe).

You can track the time a process has been preempted by other means, no? We
have context-switch tracepoints in place. Have you considered that
option?
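
For illustration, a minimal sketch of that option (not from this thread): a
libbpf program that uses only the existing sched:sched_switch tracepoint to
measure how long a kswapd thread stays scheduled out. The context struct name
comes from vmlinux.h, and kswapd_pid is a hypothetical knob filled in by the
loader; the sketch also ignores per-CPU races for brevity.

// SPDX-License-Identifier: GPL-2.0
/* Sketch: account time a given kswapd thread spends switched out,
 * using only the existing sched_switch tracepoint. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

const volatile pid_t kswapd_pid;   /* hypothetical: set by the loader */

u64 switched_out_ts;               /* when kswapd last lost the CPU */
u64 total_offcpu_ns;               /* accumulated off-CPU time */

SEC("tracepoint/sched/sched_switch")
int handle_switch(struct trace_event_raw_sched_switch *ctx)
{
    u64 now = bpf_ktime_get_ns();

    /* kswapd is being switched out: remember when */
    if (ctx->prev_pid == kswapd_pid)
        switched_out_ts = now;

    /* kswapd gets the CPU back: account the gap */
    if (ctx->next_pid == kswapd_pid && switched_out_ts) {
        total_offcpu_ns += now - switched_out_ts;
        switched_out_ts = 0;
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";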
-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-22 Thread Michal Hocko
On Wed 21-02-24 13:30:51, Carlos Galo wrote:
> On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko  wrote:
> >
> > Hi,
> > sorry I have missed this before.
> >
> > On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> > > The current implementation of the mark_victim tracepoint provides only
> > > the process ID (pid) of the victim process. This limitation poses
> > > challenges for userspace tools that need additional information
> > > about the OOM victim. The association between pid and the additional
> > > data may be lost after the kill, making it difficult for userspace to
> > > correlate the OOM event with the specific process.
> >
> > You are correct that post OOM all per-process information is lost. On
> > the other hand we do dump all this information to the kernel log. Could
> > you explain why that is not suitable for your purpose?
> 
> Userspace tools often need real-time visibility into OOM situations
> for userspace intervention. Our use case involves utilizing BPF
> programs, along with BPF ring buffers, to provide OOM notification to
> userspace. Parsing kernel logs would be significant overhead as
> opposed to the event based BPF approach.

Please put that into the changelog.
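
For reference, a minimal sketch of the event-based path described above (not
code from this series): a BPF program attached to the existing
oom:mark_victim tracepoint pushing an event into a BPF ring buffer for a
userspace daemon to poll. It assumes libbpf and a vmlinux.h for the running
kernel; at this point the tracepoint only carries the pid.

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct oom_event {
    int pid;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tracepoint/oom/mark_victim")
int notify_oom(struct trace_event_raw_mark_victim *ctx)
{
    struct oom_event *e;

    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;           /* ring buffer full: drop the notification */

    e->pid = ctx->pid;      /* only field available before this patch */
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";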

> > > In order to mitigate this limitation, add the following fields:
> > >
> > > - UID
> > >In Android each installed application has a unique UID. Including
> > >the `uid` assists in correlating OOM events with specific apps.
> > >
> > > - Process Name (comm)
> > >Enables identification of the affected process.
> > >
> > > - OOM Score
> > >Allows userspace to get additional insights of the relative kill
> > >priority of the OOM victim.
> >
> > What is the oom score useful for?
> >
> The OOM score provides us a measure of the victim's importance. On the
> android side, it allows us to identify if top or foreground apps are
> killed, which have user perceptible impact.

But the value on its own (without knowing the scores of other tasks) doesn't
really tell you anything, does it?
 
> > Is there any reason to provide different information from what is
> > reported to the kernel log?
> > __oom_kill_process:
> > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
> > message, task_pid_nr(victim), victim->comm, K(mm->total_vm),
> > K(get_mm_counter(mm, MM_ANONPAGES)),
> > K(get_mm_counter(mm, MM_FILEPAGES)),
> > K(get_mm_counter(mm, MM_SHMEMPAGES)),
> > from_kuid(&init_user_ns, task_uid(victim)),
> > mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
> >
> 
> We added these fields we need (UID, process name, and OOM score), but
> we're open to adding the others if you prefer that for consistency
> with the kernel log.

Yes, I think consistency would be better here. For one, it reports
numbers which can tell quite a lot about the killed victim. It is a
superset of what you are already asking for, with the notable exception of
the oom_score, which is really dubious without a wider context.
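
To make the consistency argument concrete, a hypothetical sketch (not the
patch under review) of a mark_victim tracepoint carrying the same numbers as
the pr_err() quoted above; the caller would pass the victim task and the uid
resolved via from_kuid(&init_user_ns, task_uid(victim)). Field names are
invented for illustration.

TRACE_EVENT(mark_victim,

    TP_PROTO(struct task_struct *task, uid_t uid),

    TP_ARGS(task, uid),

    TP_STRUCT__entry(
        __field(int, pid)
        __string(comm, task->comm)
        __field(unsigned long, total_vm_kb)
        __field(unsigned long, anon_rss_kb)
        __field(unsigned long, file_rss_kb)
        __field(unsigned long, shmem_rss_kb)
        __field(uid_t, uid)
        __field(unsigned long, pgtables_kb)
        __field(short, oom_score_adj)
    ),

    TP_fast_assign(
        __entry->pid = task->pid;
        __assign_str(comm, task->comm);
        __entry->total_vm_kb = task->mm->total_vm << (PAGE_SHIFT - 10);
        __entry->anon_rss_kb =
            get_mm_counter(task->mm, MM_ANONPAGES) << (PAGE_SHIFT - 10);
        __entry->file_rss_kb =
            get_mm_counter(task->mm, MM_FILEPAGES) << (PAGE_SHIFT - 10);
        __entry->shmem_rss_kb =
            get_mm_counter(task->mm, MM_SHMEMPAGES) << (PAGE_SHIFT - 10);
        __entry->uid = uid;
        __entry->pgtables_kb = mm_pgtables_bytes(task->mm) >> 10;
        __entry->oom_score_adj = task->signal->oom_score_adj;
    ),

    TP_printk("pid=%d comm=%s total-vm=%lukB anon-rss=%lukB file-rss=%lukB shmem-rss=%lukB uid=%u pgtables=%lukB oom_score_adj=%hd",
        __entry->pid, __get_str(comm), __entry->total_vm_kb,
        __entry->anon_rss_kb, __entry->file_rss_kb, __entry->shmem_rss_kb,
        __entry->uid, __entry->pgtables_kb, __entry->oom_score_adj)
);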
-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-20 Thread Michal Hocko
Hi,
sorry I have missed this before.

On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> The current implementation of the mark_victim tracepoint provides only
> the process ID (pid) of the victim process. This limitation poses
> challenges for userspace tools that need additional information
> about the OOM victim. The association between pid and the additional
> data may be lost after the kill, making it difficult for userspace to
> correlate the OOM event with the specific process.

You are correct that post OOM all per-process information is lost. On
the other hand we do dump all this information to the kernel log. Could
you explain why that is not suitable for your purpose?

> In order to mitigate this limitation, add the following fields:
> 
> - UID
>In Android each installed application has a unique UID. Including
>the `uid` assists in correlating OOM events with specific apps.
> 
> - Process Name (comm)
>Enables identification of the affected process.
> 
> - OOM Score
>Allows userspace to get additional insights of the relative kill
>priority of the OOM victim.

What is the oom score useful for?

Is there any reason to provide different information from what is
reported to the kernel log?
__oom_kill_process:
pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
message, task_pid_nr(victim), victim->comm, K(mm->total_vm),
K(get_mm_counter(mm, MM_ANONPAGES)),
K(get_mm_counter(mm, MM_FILEPAGES)),
K(get_mm_counter(mm, MM_SHMEMPAGES)),
from_kuid(&init_user_ns, task_uid(victim)),
mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);

-- 
Michal Hocko
SUSE Labs



Re: [PATCH -next v6 0/2] Make memory reclamation measurable

2024-02-20 Thread Michal Hocko
On Wed 21-02-24 11:00:53, Bixuan Cui wrote:
> 
> 
> On 2024/2/21 10:22, Steven Rostedt wrote:
> > It's up to the memory management folks to decide on this. -- Steve
> Noted with thanks.

It would be really helpful to have more details on why we need those
trace points.

It is my understanding that you would like to have more fine-grained
numbers for the time spent in different parts of the reclaim process.
I can imagine this could be useful in some cases, but is it useful enough,
and for a wide enough variety of workloads? Is that worth dedicated static
tracepoints? Why are ad-hoc dynamic tracepoints or BPF not sufficient for
such a special situation?

In other words, tell us more about the use cases and why this is
generally useful.
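
For comparison, a sketch of the ad-hoc alternative being asked about (not
from this series): timing do_shrink_slab() with a kprobe/kretprobe pair via
libbpf. It illustrates both the dynamic-probe option and its stated
limitation, namely that it times the whole function and cannot split out
->count_objects() vs ->scan_objects().

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, u64);      /* pid_tgid */
    __type(value, u64);    /* entry timestamp */
} start SEC(".maps");

SEC("kprobe/do_shrink_slab")
int BPF_KPROBE(shrink_enter)
{
    u64 id = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();

    bpf_map_update_elem(&start, &id, &ts, BPF_ANY);
    return 0;
}

SEC("kretprobe/do_shrink_slab")
int BPF_KRETPROBE(shrink_exit)
{
    u64 id = bpf_get_current_pid_tgid();
    u64 *ts = bpf_map_lookup_elem(&start, &id);

    if (ts) {
        bpf_printk("do_shrink_slab: %llu ns", bpf_ktime_get_ns() - *ts);
        bpf_map_delete_elem(&start, &id);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";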

Thanks!
-- 
Michal Hocko
SUSE Labs



Re: [PATCH] module: print module name on refcount error

2023-09-14 Thread Michal Hocko
On Mon 28-08-23 14:18:30, Jean Delvare wrote:
[...]
> > > It would likely be better to use refcount_t instead of atomic_t.  
> > 
> > Patches welcomed.
> 
> Michal, do I understand correctly that this would prevent the case our
> customer had (too many gets), but won't make a difference for actual
> too-many-puts situations?


Yes, refcount_t cannot protect from too-many-puts because there is no
real way to protect from those AFAICS. At a certain moment you just drop
to 0 and lose your object, and everything following that is a use-after-free.
But I do not think this is actually the interesting case at all.
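
A tiny kernel-style illustration of that asymmetry (hypothetical object and
release helper, not module code from this thread): refcount_t saturation
catches the too-many-gets pattern, while the first extra put is
indistinguishable from the legitimate final put.

#include <linux/refcount.h>
#include <linux/slab.h>

struct obj {
    refcount_t ref;
    /* ... payload ... */
};

static void obj_get(struct obj *o)
{
    /* Too many gets: a plain atomic_t can wrap back to 0 so a later put
     * frees a live object; refcount_inc() saturates and WARNs instead,
     * leaking the object but never freeing it early. */
    refcount_inc(&o->ref);
}

static void obj_put(struct obj *o)
{
    /* Too many puts: the first extra put looks exactly like the final
     * legitimate put, so the object is freed while users remain and every
     * later access is a use-after-free.  Only a put after that one trips
     * the refcount_t underflow warning, which is already too late. */
    if (refcount_dec_and_test(&o->ref))
        kfree(o);
}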
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] docs: proc.rst: meminfo: briefly describe gaps in memory accounting

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 15:57:08, Michal Hocko wrote:
[...]
> Usual memory consumption is something like LRU pages + slab
> memory + kernel stack + vmalloc used + pcp.
> 
> > But I know that KernelStack is allocated through vmalloc these days,
> > and I don't know whether VmallocUsed includes KernelStack or whether I
> > can sum them.  Similarly, is Mlocked a subset of Unevictable?

Forgot to reply to these two. Yes to both. So if we want to be precise
then we have to check the stack allocation configuration. There are
just so many traps lurking here. Something you get used to over time,
but this is certainly far, far away from an ideal state. What else can we
expect from an ad hoc approach to providing data to userspace that was
historically applied to this and many other proc interfaces. We have been
trying to be stricter for some time but some mistakes are simply hard to
fix up (e.g. accounting shmem as page cache, to name one more).

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] docs: proc.rst: meminfo: briefly describe gaps in memory accounting

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 14:24:30, Matthew Wilcox wrote:
> On Tue, Apr 20, 2021 at 03:13:54PM +0300, Mike Rapoport wrote:
> > Add a paragraph that explains that it may happen that the counters in
> > /proc/meminfo do not add up to the overall memory usage.
> 
> ... that is, the sum may be lower because memory is allocated for other
> purposes that is not reported here, right?

yes. Many direct page allocator users are not accounted in any of the
existing counters.

> Is it ever possible for it to be higher?  Maybe due to a race when
> sampling the counters?

Yes likely possible. You will never get an atomic snapshot of all
counters.

> >  Provides information about distribution and utilization of memory.  This
> > -varies by architecture and compile options.  The following is from a
> > -16GB PIII, which has highmem enabled.  You may not have all of these 
> > fields.
> > +varies by architecture and compile options. Please note that it may happen
> > +that the memory accounted here does not add up to the overall memory usage
> > +and the difference for some workloads can be substantial. In many cases 
> > there
> > +are other means to find out additional memory using subsystem specific
> > +interfaces, for instance /proc/net/sockstat for TCP memory allocations.
> 
> How about just:
> 
> +varies by architecture and compile options.  The memory reported here
> +may not add up to the overall memory usage and the difference for some
> +workloads can be substantial. [...]
> 
> But I'd like to be a bit more explicit about the reason, hence my question
> above to be sure I understand.
> 
> 
> It's also not entirely clear which of the fields in meminfo can be
> usefully summed.  VmallocTotal is larger than MemTotal, for example.

Yes. Many/most counters cannot simply be summed up. A trivial example:
Active/Inactive is a sum of both anon and file. Mlocked will be
accounted in LRU pages and Unevictable. MemAvailable is not really a
counter...

Usual memory consumption is something like LRU pages + slab
memory + kernel stack + vmalloc used + pcp.
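
As a concrete illustration of the gap, a rough userspace sketch only; the
chosen fields and the implied identity carry all the double-counting caveats
discussed here (e.g. Mlocked and KernelStack):

#include <stdio.h>
#include <string.h>

/* Sum a handful of /proc/meminfo fields and compare against
 * MemTotal - MemFree; the leftover is memory not covered by them. */
static const char *tracked[] = {
    "Active:", "Inactive:", "Unevictable:", "Slab:",
    "KernelStack:", "PageTables:", "VmallocUsed:", "Percpu:",
};

int main(void)
{
    char line[256], name[64];
    unsigned long long val, total = 0, free_kb = 0, sum = 0;
    FILE *f = fopen("/proc/meminfo", "r");
    size_t i;

    if (!f)
        return 1;

    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%63s %llu", name, &val) != 2)
            continue;
        if (!strcmp(name, "MemTotal:"))
            total = val;
        else if (!strcmp(name, "MemFree:"))
            free_kb = val;
        for (i = 0; i < sizeof(tracked) / sizeof(tracked[0]); i++)
            if (!strcmp(name, tracked[i]))
                sum += val;
    }
    fclose(f);

    printf("used (MemTotal - MemFree): %llu kB\n", total - free_kb);
    printf("sum of tracked counters:   %llu kB\n", sum);
    printf("difference:                %lld kB\n",
           (long long)(total - free_kb) - (long long)sum);
    return 0;
}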

> But I know that KernelStack is allocated through vmalloc these days,
> and I don't know whether VmallocUsed includes KernelStack or whether I
> can sum them.  Similarly, is Mlocked a subset of Unevictable?
> 
> There is some attempt at explaining how these numbers fit together, but
> it's outdated, and doesn't include Mlocked, Unevictable or KernelStack

Agreed there is a lot of tribal knowledge or even misconceptions flying
around and it will take much more work to put everything into shape.
This is only one tiny step forward.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] docs: proc.rst: meminfo: briefly describe gaps in memory accounting

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 15:21:08, Mike Rapoport wrote:
> On Tue, Apr 20, 2021 at 03:13:54PM +0300, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > Add a paragraph that explains that it may happen that the counters in
> > /proc/meminfo do not add up to the overall memory usage.
> > 
> > Signed-off-by: Mike Rapoport 
> 
> Ooops, forgot to add Michal's Ack, sorry.

Let's make it more explicit
Acked-by: Michal Hocko 

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 09:25:51, peter.enderb...@sony.com wrote:
> On 4/20/21 11:12 AM, Michal Hocko wrote:
> > On Tue 20-04-21 09:02:57, peter.enderb...@sony.com wrote:
> >>>> But that isn't really system memory at all, it's just allocated device
> >>>> memory.
> >>> OK, that was not really clear to me. So this is not really accounted to
> >>> MemTotal? If that is really the case then reporting it into the oom
> >>> report is completely pointless and I am not even sure /proc/meminfo is
> >>> the right interface either. It would just add more confusion I am
> >>> afraid.
> >>>  
> >> Why is it confusing? Documentation is quite clear:
> > Because a single counter without a wider context cannot be put into any
> > reasonable context. There is no notion of the total amount of device
> > memory usable for dma-buf. As Christian explained some of it can be RAM
> > based. So a single number is rather pointless on its own in many cases.
> >
> > Or let me just ask. What can you tell from dma-buf: $FOO kB in its
> > current form?
> 
> It is better to be blind?

No, it is better to have a sensible counter that can be reasoned about.
So far you are only claiming that having something is better than
nothing, and I would agree with you if this were a one-off debugging
interface. But /proc/meminfo and other proc files have to be maintained
with future portability in mind. This is not a dumping ground for _some_
counters that might be interesting at the _current_ moment. E.g. what
happens if somebody wants to have per-device resp. memory-based
dma-buf data? Are you going to change the semantics or add another
2 counters?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 7/8] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

2021-04-20 Thread Michal Hocko
On Fri 16-04-21 13:24:10, Oscar Salvador wrote:
> Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.
> 
> Signed-off-by: Oscar Salvador 
> Reviewed-by: David Hildenbrand 

Acked-by: Michal Hocko 

> ---
>  arch/x86/Kconfig | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2792879d398e..9f0211df1746 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2433,6 +2433,9 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
>   def_bool y
>   depends on MEMORY_HOTPLUG
>  
> +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> + def_bool y
> +
>  config USE_PERCPU_NUMA_NODE_ID
>   def_bool y
>   depends on NUMA
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-20 Thread Michal Hocko
> + nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> +   get_nr_vmemmap_pages_cb);
> + if (nr_vmemmap_pages) {
> + if (size != memory_block_size_bytes()) {
> + pr_warn("Refuse to remove %#llx - %#llx,"
> + "wrong granularity\n",
> + start, start + size);
> + return -EINVAL;
> + }
> +
> +         /*
> +  * Let remove_pmd_table->free_hugepage_table do the
> +  * right thing if we used vmem_altmap when hot-adding
> +  * the range.
> +  */
> + mhp_altmap.alloc = nr_vmemmap_pages;
> + altmap = &mhp_altmap;
> + }
> + }
> +
>   /* remove memmap entry */
>   firmware_map_remove(start, start + size, "System RAM");

I have to say I still dislike this; I would just turn it inside out
and do the operation from within walk_memory_blocks, but I will not
insist.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 3/8] mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count()

2021-04-20 Thread Michal Hocko
On Fri 16-04-21 13:24:06, Oscar Salvador wrote:
> From: David Hildenbrand 
> 
> Let's have a single place (inspired by adjust_managed_page_count()) where
> we adjust present pages.
> In contrast to adjust_managed_page_count(), only memory onlining/offlining
> is allowed to modify the number of present pages.
> 
> Signed-off-by: David Hildenbrand 
> Signed-off-by: Oscar Salvador 
> Reviewed-by: Oscar Salvador 

Not sure self review counts ;)

Acked-by: Michal Hocko 

Btw. I strongly suspect the resize lock is quite pointless here.
Something for a follow up patch.

> ---
>  mm/memory_hotplug.c | 22 --
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 25e59d5dc13c..d05056b3c173 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -829,6 +829,16 @@ struct zone * zone_for_pfn_range(int online_type, int 
> nid, unsigned start_pfn,
>   return default_zone_for_pfn(nid, start_pfn, nr_pages);
>  }
>  
> +static void adjust_present_page_count(struct zone *zone, long nr_pages)
> +{
> + unsigned long flags;
> +
> + zone->present_pages += nr_pages;
> + pgdat_resize_lock(zone->zone_pgdat, &flags);
> + zone->zone_pgdat->node_present_pages += nr_pages;
> + pgdat_resize_unlock(zone->zone_pgdat, &flags);
> +}
> +
>  int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  int online_type, int nid)
>  {
> @@ -882,11 +892,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages,
>   }
>  
>   online_pages_range(pfn, nr_pages);
> - zone->present_pages += nr_pages;
> -
> - pgdat_resize_lock(zone->zone_pgdat, &flags);
> - zone->zone_pgdat->node_present_pages += nr_pages;
> - pgdat_resize_unlock(zone->zone_pgdat, &flags);
> + adjust_present_page_count(zone, nr_pages);
>  
>   node_states_set_node(nid, &arg);
>   if (need_zonelists_rebuild)
> @@ -1701,11 +1707,7 @@ int __ref offline_pages(unsigned long start_pfn, 
> unsigned long nr_pages)
>  
>   /* removal success */
>   adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
> - zone->present_pages -= nr_pages;
> -
> - pgdat_resize_lock(zone->zone_pgdat, &flags);
> - zone->zone_pgdat->node_present_pages -= nr_pages;
> - pgdat_resize_unlock(zone->zone_pgdat, &flags);
> + adjust_present_page_count(zone, -nr_pages);
>  
>   init_per_zone_wmark_min();
>  
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 2/8] mm,memory_hotplug: Relax fully spanned sections check

2021-04-20 Thread Michal Hocko
On Fri 16-04-21 13:24:05, Oscar Salvador wrote:
> When using self-hosted vmemmap pages, the number of pages passed to
> {online,offline}_pages might not fully span sections, but they always
> fully span pageblocks.
> Relax the check account for that case.

It would be good to call those out explicitly. It would also be
great to explain why pageblock_nr_pages is an actual constraint. There
shouldn't be any real reason for that except for "we want online_pages
to operate on whole memblocks and memmap_on_memory will poke
pageblock_nr_pages-aligned holes in the beginning, which is a special
case we want to allow."

> Signed-off-by: Oscar Salvador 
> Reviewed-by: David Hildenbrand 

With the changelog extended and the comment clarification (see below)
feel free to add
Acked-by: Michal Hocko 

> ---
>  mm/memory_hotplug.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0cdbbfbc5757..25e59d5dc13c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -838,9 +838,14 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages,
>   int ret;
>   struct memory_notify arg;
>  
> - /* We can only online full sections (e.g., SECTION_IS_ONLINE) */
> + /* We can only offline full sections (e.g., SECTION_IS_ONLINE).
> +  * However, when using e.g: memmap_on_memory, some pages are initialized
> +  * prior to calling in here. The remaining amount of pages must be
> +  * pageblock aligned.


I would rephrase (and also note that multi-line comments usually have a
leading line without any content - not that I care much though).

/*
 * {on,off}lining is constrained to full memory sections (or
 * more precisely to memory blocks from the user space POV).
 * memmap_on_memory is an exception because it reserves initial
 * part of the physical memory space for vmemmaps. That space is
 * pageblock aligned.
> +  */

Same comment would apply to offline_pages.

>   if (WARN_ON_ONCE(!nr_pages ||
> -  !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
> +  !IS_ALIGNED(pfn, pageblock_nr_pages) ||
> +  !IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
>   return -EINVAL;
>  
>   mem_hotplug_begin();
> @@ -1573,9 +1578,14 @@ int __ref offline_pages(unsigned long start_pfn, 
> unsigned long nr_pages)
>   int ret, node;
>   char *reason;
>  
> - /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
> + /* We can only offline full sections (e.g., SECTION_IS_ONLINE).
> +  * However, when using e.g: memmap_on_memory, some pages are initialized
> +  * prior to calling in here. The remaining amount of pages must be
> +  * pageblock aligned.
> +  */
>   if (WARN_ON_ONCE(!nr_pages ||
> -  !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
> +  !IS_ALIGNED(start_pfn, pageblock_nr_pages) ||
> +      !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
>   return -EINVAL;
>  
>   mem_hotplug_begin();
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 1/8] drivers/base/memory: Introduce memory_block_{online,offline}

2021-04-20 Thread Michal Hocko
On Fri 16-04-21 13:24:04, Oscar Salvador wrote:
> This is a preparatory patch that introduces two new functions:
> memory_block_online() and memory_block_offline().
> 
> For now, these functions will only call online_pages() and offline_pages()
> respectively, but they will be later in charge of preparing the vmemmap
> pages, carrying out the initialization and proper accounting of such
> pages.
> 
> Since memory_block struct contains all the information, pass this struct
> down the chain till the end functions.
> 
> Signed-off-by: Oscar Salvador 
> Reviewed-by: David Hildenbrand 

Acked-by: Michal Hocko 

> ---
>  drivers/base/memory.c | 33 +
>  1 file changed, 21 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f35298425575..f209925a5d4e 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -169,30 +169,41 @@ int memory_notify(unsigned long val, void *v)
>   return blocking_notifier_call_chain(&memory_chain, val, v);
>  }
>  
> +static int memory_block_online(struct memory_block *mem)
> +{
> + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +
> + return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
> +}
> +
> +static int memory_block_offline(struct memory_block *mem)
> +{
> + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> +
> + return offline_pages(start_pfn, nr_pages);
> +}
> +
>  /*
>   * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
>   * OK to have direct references to sparsemem variables in here.
>   */
>  static int
> -memory_block_action(unsigned long start_section_nr, unsigned long action,
> - int online_type, int nid)
> +memory_block_action(struct memory_block *mem, unsigned long action)
>  {
> - unsigned long start_pfn;
> - unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
>   int ret;
>  
> - start_pfn = section_nr_to_pfn(start_section_nr);
> -
>   switch (action) {
>   case MEM_ONLINE:
> - ret = online_pages(start_pfn, nr_pages, online_type, nid);
> + ret = memory_block_online(mem);
>   break;
>   case MEM_OFFLINE:
> - ret = offline_pages(start_pfn, nr_pages);
> + ret = memory_block_offline(mem);
>   break;
>   default:
>   WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
> -  "%ld\n", __func__, start_section_nr, action, action);
> +  "%ld\n", __func__, mem->start_section_nr, action, action);
>   ret = -EINVAL;
>   }
>  
> @@ -210,9 +221,7 @@ static int memory_block_change_state(struct memory_block 
> *mem,
>   if (to_state == MEM_OFFLINE)
>   mem->state = MEM_GOING_OFFLINE;
>  
> -     ret = memory_block_action(mem->start_section_nr, to_state,
> -   mem->online_type, mem->nid);
> -
> + ret = memory_block_action(mem, to_state);
>   mem->state = ret ? from_state_req : to_state;
>  
>   return ret;
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 09:02:57, peter.enderb...@sony.com wrote:
> 
> >> But that isn't really system memory at all, it's just allocated device
> >> memory.
> > OK, that was not really clear to me. So this is not really accounted to
> > MemTotal? If that is really the case then reporting it into the oom
> > report is completely pointless and I am not even sure /proc/meminfo is
> > the right interface either. It would just add more confusion I am
> > afraid.
> >  
> 
> Why is it confusing? Documentation is quite clear:

Because a single counter without a wider context cannot be put into any
reasonable context. There is no notion of the total amount of device
memory usable for dma-buf. As Christian explained some of it can be RAM
based. So a single number is rather pointless on its own in many cases.

Or let me just ask. What can you tell from dma-buf: $FOO kB in its
current form?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] docs: proc.rst: meminfo: briefly describe gaps in memory accounting

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 11:51:05, Mike Rapoport wrote:
> From: Mike Rapoport 

Some trivial changelog would be better than nothing.

> Signed-off-by: Mike Rapoport 

But I do agree that this is useful information to have in the
documentation. Having networking counters as an example is helpful as
well. I am not very familiar with those myself, and I do remember there
is more to it than just sockstat. It would be great to consult some
networking expert on this and extend the documentation for that case, which
tends to be quite common AFAIK.

Anyway this is already an improvement and a step into the right
direction.

Acked-by: Michal Hocko 

one nit below
> ---
>  Documentation/filesystems/proc.rst | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst 
> b/Documentation/filesystems/proc.rst
> index 48fbfc336ebf..bf245151645b 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -929,8 +929,15 @@ meminfo
>  ~~~
>  
>  Provides information about distribution and utilization of memory.  This
> -varies by architecture and compile options.  The following is from a
> -16GB PIII, which has highmem enabled.  You may not have all of these fields.
> +varies by architecture and compile options. Please note that is may happen

that it may happen

> +that the memory accounted here does not add up to the overall memory usage
> +and the difference for some workloads can be substantial. In many cases
> +there are other means to find out additional memory using subsystem
> +specific interfaces, for instance /proc/net/sockstat for networking
> +buffers.
> +
> +The following is from a 16GB PIII, which has highmem enabled.
> +You may not have all of these fields.
>  
>  ::
>  
> -- 
> 2.29.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/2 V6]Add dma-buf counter

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 10:22:18, Peter Enderborg wrote:
> The dma-buf counter is a metric for mapped memory used by it's clients.
> It is a shared buffer that is typically used for interprocess communication
> or process to hardware communication. In android we used to have ION,. but
> it is now replaced with dma-buf. ION had some overview metrics that was 
> similar.

The discussion around the previous version is still not over, and it seems
your proposed approach is not really viable. So please do not send
new versions until that is sorted out.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 10:00:07, Christian König wrote:
> Am 20.04.21 um 09:46 schrieb Michal Hocko:
> > On Tue 20-04-21 09:32:14, Christian König wrote:
> > > Am 20.04.21 um 09:04 schrieb Michal Hocko:
> > > > On Mon 19-04-21 18:37:13, Christian König wrote:
> > > > > Am 19.04.21 um 18:11 schrieb Michal Hocko:
> > [...]
> > > > What I am trying to bring up with NUMA side is that the same problem can
> > > > happen on per-node basis. Let's say that some user consumes unexpectedly
> > > > large amount of dma-buf on a certain node. This can lead to observable
> > > > performance impact on anybody on allocating from that node and even
> > > > worse cause an OOM for node bound consumers. How do I find out that it
> > > > was dma-buf that has caused the problem?
> > > Yes, that is the direction my thinking goes as well, but also even 
> > > further.
> > > 
> > > See DMA-buf is also used to share device local memory between processes as
> > > well. In other words VRAM on graphics hardware.
> > > 
> > > On my test system here I have 32GB of system memory and 16GB of VRAM. I 
> > > can
> > > use DMA-buf to allocate that 16GB of VRAM quite easily which then shows up
> > > under /proc/meminfo as used memory.
> > This is something that would be really interesting in the changelog. I
> > mean the expected and extreme memory consumption of this memory. Ideally
> > with some hints on what to do when the number is really high (e.g. mount
> > debugfs and have a look here and there to check whether this is just too
> > many users or an unexpected pattern to be reported).
> > 
> > > But that isn't really system memory at all, it's just allocated device
> > > memory.
> > OK, that was not really clear to me. So this is not really accounted to
> > MemTotal?
> 
> It depends. In a lot of embedded systems you only have system memory and in
> this case that value here is indeed really useful.
> 
> > If that is really the case then reporting it into the oom
> > report is completely pointless and I am not even sure /proc/meminfo is
> > the right interface either. It would just add more confusion I am
> > afraid.
> 
> I kind of agree. As I said a DMA-buf could be backed by system memory or
> device memory.
> 
> In the case when it is backed by system memory it does make sense to report
> this in an OOM dump.
> 
> But only the exporting driver knows what the DMA-buf handle represents, the
> framework just provides the common ground for inter driver communication.

Then those drivers need to account that memory for meminfo/oom report
purposes.

> > > > See where I am heading?
> > > Yeah, totally. Thanks for pointing this out.
> > > 
> > > Suggestions how to handle that?
> > As I've pointed out in a previous reply, we do have an API to account
> > per-node memory, but now that you have brought up that this is not
> > something we account as regular memory, it doesn't really fit into that
> > model. But maybe I am just confused.
> 
> Well does that API also has a counter for memory used by device drivers?

I think that "memory used by device drivers" is immaterial. The only
important thing is to account that memory where it makes sense: report
RAM-based allocations via meminfo and find another way to report device
memory allocations.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-20 Thread Michal Hocko
On Mon 19-04-21 17:11:37, Johannes Weiner wrote:
> On Mon, Apr 19, 2021 at 01:26:29PM -0400, Waiman Long wrote:
[...]
> - the soft limit tree and soft limit reclaim
> 
> - the threshold and oom event notification stuff
> 
> - the charge moving code
> 
> - remaining v1 interface files, as well as their helper functions
> 
> From a quick scan, this adds up to ~2,500 lines of old code with no
> actual dependencies from the common code or from v2, and which could
> be moved out of the way without disrupting ongoing development much.

Moving those into their own file makes sense to me as well. If the code is
not conditional (e.g. like swap accounting and some others) then moving
it would make memcontrol.c easier to navigate.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 10:20:43, Mike Rapoport wrote:
> On Tue, Apr 20, 2021 at 09:04:51AM +0200, Michal Hocko wrote:
> > On Mon 19-04-21 18:37:13, Christian König wrote:
> > > Am 19.04.21 um 18:11 schrieb Michal Hocko:
> > [...]
> > > > The question is not whether it is NUMA aware but whether it is useful to
> > > > know per-numa data for the purpose the counter is supposed to serve.
> > > 
> > > No, not at all. The pages of a single DMA-buf could even be from different
> > > NUMA nodes if the exporting driver decides that this is somehow useful.
> > 
> > As the use of the counter hasn't been explained yet I can only
> > speculate. One thing that I can imagine to be useful is to fill gaps in
> > our accounting. It is quite often that the memory accounted in
> > /proc/meminfo (or the oom report) doesn't add up to the overall memory
> > usage. In some workloads the gap can be huge! In many cases there
> > are other means to find out additional memory via subsystem-specific
> > interfaces (e.g. networking buffers). I do assume that dma-buf is just
> > one of those and the counter can fill the said gap at least partially
> > for some workloads. That is definitely useful.
> 
> A bit off-topic.
> 
> Michal, I think it would have been nice to have an explanation like above
> in Documentation/proc/meminfo, what do you say?

Not sure which specific parts (likely the unaccounted memory?) but sure
why not. Our /proc/meminfo is rather underdocumented. More information
cannot hurt.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 09:32:14, Christian König wrote:
> Am 20.04.21 um 09:04 schrieb Michal Hocko:
> > On Mon 19-04-21 18:37:13, Christian König wrote:
> > > Am 19.04.21 um 18:11 schrieb Michal Hocko:
[...]
> > What I am trying to bring up with NUMA side is that the same problem can
> > happen on per-node basis. Let's say that some user consumes unexpectedly
> > large amount of dma-buf on a certain node. This can lead to observable
> > performance impact on anybody on allocating from that node and even
> > worse cause an OOM for node bound consumers. How do I find out that it
> > was dma-buf that has caused the problem?
> 
> Yes, that is the direction my thinking goes as well, but also even further.
> 
> See DMA-buf is also used to share device local memory between processes as
> well. In other words VRAM on graphics hardware.
> 
> On my test system here I have 32GB of system memory and 16GB of VRAM. I can
> use DMA-buf to allocate that 16GB of VRAM quite easily which then shows up
> under /proc/meminfo as used memory.

This is something that would be really interesting in the changelog. I
mean the expected and extreme memory consumption of this memory. Ideally
with some hints on what to do when the number is really high (e.g. mount
debugfs and have a look here and there to check whether this is just too
many users or an unexpected pattern to be reported).

> But that isn't really system memory at all, it's just allocated device
> memory.

OK, that was not really clear to me. So this is not really accounted to
MemTotal? If that is really the case then reporting it into the oom
report is completely pointless and I am not even sure /proc/meminfo is
the right interface either. It would just add more confusion I am
afraid.
 
> > See where I am heading?
> 
> Yeah, totally. Thanks for pointing this out.
> 
> Suggestions how to handle that?

As I've pointed out in a previous reply, we do have an API to account
per-node memory, but now that you have brought up that this is not
something we account as regular memory, it doesn't really fit into that
model. But maybe I am just confused.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Mon 19-04-21 18:37:13, Christian König wrote:
> Am 19.04.21 um 18:11 schrieb Michal Hocko:
[...]
> > The question is not whether it is NUMA aware but whether it is useful to
> > know per-numa data for the purpose the counter is supposed to serve.
> 
> No, not at all. The pages of a single DMA-buf could even be from different
> NUMA nodes if the exporting driver decides that this is somehow useful.

As the use of the counter hasn't been explained yet I can only
speculate. One thing that I can imagine to be useful is to fill gaps in
our accounting. It is quite often that the memory accounted in
/proc/meminfo (or the oom report) doesn't add up to the overall memory
usage. In some workloads the gap can be huge! In many cases there
are other means to find out additional memory via subsystem-specific
interfaces (e.g. networking buffers). I do assume that dma-buf is just
one of those and the counter can fill the said gap at least partially
for some workloads. That is definitely useful.

What I am trying to bring up with the NUMA side is that the same problem can
happen on a per-node basis. Let's say that some user consumes an unexpectedly
large amount of dma-buf memory on a certain node. This can lead to observable
performance impact on anybody allocating from that node and, even
worse, cause an OOM for node-bound consumers. How do I find out that it
was dma-buf that caused the problem?

See where I am heading?
-- 
Michal Hocko
SUSE Labs


Re: [RFC] memory reserve for userspace oom-killer

2021-04-20 Thread Michal Hocko
On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
> Proposal: Provide memory guarantees to userspace oom-killer.
> 
> Background:
> 
> Issues with kernel oom-killer:
> 1. Very conservative and prefer to reclaim. Applications can suffer
> for a long time.
> 2. Borrows the context of the allocator which can be resource limited
> (low sched priority or limited CPU quota).
> 3. Serialized by global lock.
> 4. Very simplistic oom victim selection policy.
> 
> These issues are resolved through userspace oom-killer by:
> 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> early detect suffering.
> 2. Independent process context which can be given dedicated CPU quota
> and high scheduling priority.
> 3. Can be more aggressive as required.
> 4. Can implement sophisticated business logic/policies.
> 
> Android's LMKD and Facebook's oomd are the prime examples of userspace
> oom-killers. One of the biggest challenges for userspace oom-killers
> is to potentially function under intense memory pressure and are prone
> to getting stuck in memory reclaim themselves. Current userspace
> oom-killers aim to avoid this situation by preallocating user memory
> and protecting themselves from global reclaim by either mlocking or
> memory.min. However a new allocation from userspace oom-killer can
> still get stuck in the reclaim and policy rich oom-killer do trigger
> new allocations through syscalls or even heap.

Can you be more specific please?

> Our attempt of userspace oom-killer faces similar challenges.
> Particularly at the tail on the very highly utilized machines we have
> observed userspace oom-killer spectacularly failing in many possible
> ways in the direct reclaim. We have seen oom-killer stuck in direct
> reclaim throttling, stuck in reclaim and allocations from interrupts
> keep stealing reclaimed memory. We have even observed systems where
> all the processes were stuck in throttle_direct_reclaim() and only
> kswapd was running and the interrupts kept stealing the memory
> reclaimed by kswapd.
> 
> To reliably solve this problem, we need to give guaranteed memory to
> the userspace oom-killer.

There is nothing like that. Even memory reserves are a finite resource
which can be depleted, as they are shared with other users who are not
necessarily coordinated. So before we start discussing making this even
more muddy by handing over memory reserves to userspace, we should
really examine whether pre-allocation is something that cannot be made
to work.

> At the moment we are contemplating between
> the following options and I would like to get some feedback.
> 
> 1. prctl(PF_MEMALLOC)
> 
> The idea is to give userspace oom-killer (just one thread which is
> finding the appropriate victims and will be sending SIGKILLs) access
> to MEMALLOC reserves. Most of the time the preallocation, mlock and
> memory.min will be good enough but for rare occasions, when the
> userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> protect it from reclaim and let the allocation dip into the memory
> reserves.

I do not think that handing over an unlimited ticket to the memory
reserves to userspace is a good idea. Even the in-kernel oom killer is
bound to partial access to the reserves. So if we really want this then
it should be in sync with and bound by ALLOC_OOM.

> The misuse of this feature would be risky but it can be limited to
> privileged applications. Userspace oom-killer is the only appropriate
> user of this feature. This option is simple to implement.
> 
> 2. Mempool
> 
> The idea is to preallocate mempool with a given amount of memory for
> userspace oom-killer. Preferably this will be per-thread and
> oom-killer can preallocate mempool for its specific threads. The core
> page allocator can check before going to the reclaim path if the task
> has private access to the mempool and return page from it if yes.

Could you elaborate some more on how this would be controlled from the
userspace? A dedicated syscall? A driver?

> This option would be more complicated than the previous option as the
> lifecycle of the page from the mempool would be more sophisticated.
> Additionally the current mempool does not handle higher order pages
> and we might need to extend it to allow such allocations. Though this
> feature might have more use-cases and it would be less risky than the
> previous option.

I would tend to agree.

> Another idea I had was to use kthread based oom-killer and provide the
> policies through eBPF program. Though I am not sure how to make it
> monitor arbitrary metrics and if that can be done without any
> allocations.

A kernel module or eBPF to implement oom decisions has already been
discussed a few years back. But I am afraid this would be hard to wire in
for anything except the victim selection. I am not sure it is
maintainable to also control when the OOM handling should trigger.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-19 Thread Michal Hocko
On Mon 19-04-21 17:44:13, Christian König wrote:
> Am 19.04.21 um 17:19 schrieb peter.enderb...@sony.com:
> > On 4/19/21 5:00 PM, Michal Hocko wrote:
> > > On Mon 19-04-21 12:41:58, peter.enderb...@sony.com wrote:
> > > > On 4/19/21 2:16 PM, Michal Hocko wrote:
> > > > > On Sat 17-04-21 12:40:32, Peter Enderborg wrote:
> > > > > > This adds a total used dma-buf memory. Details
> > > > > > can be found in debugfs, however it is not for everyone
> > > > > > and not always available. dma-buf are indirect allocated by
> > > > > > userspace. So with this value we can monitor and detect
> > > > > > userspace applications that have problems.
> > > > > The changelog would benefit from more background on why this is 
> > > > > needed,
> > > > > and who is the primary consumer of that value.
> > > > > 
> > > > > I cannot really comment on the dma-buf internals but I have two 
> > > > > remarks.
> > > > > Documentation/filesystems/proc.rst needs an update with the counter
> > > > > explanation and secondly is this information useful for OOM situations
> > > > > analysis? If yes then show_mem should dump the value as well.
> > > > > 
> > > > >  From the implementation point of view, is there any reason why this
> > > > > hasn't used the existing global_node_page_state infrastructure?
> > > > I fix doc in next version.  Im not sure what you expect the commit 
> > > > message to include.
> > > As I've said. Usual justification covers answers to following questions
> > >   - Why do we need it?
> > >   - Why the existing data is insufficient?
> > >   - Who is supposed to use the data and for what?
> > > 
> > > I can see an answer for the first two questions (because this can be a
> > > lot of memory and the existing infrastructure is not production suitable
> > > - debugfs). But the changelog doesn't really explain who is going to use
> > > the new data. Is this a monitoring to raise an early alarm when the
> > > value grows? Is this for debugging misbehaving drivers? How is it
> > > valuable for those?
> > > 
> > > > The function of the meminfo is: (From 
> > > > Documentation/filesystems/proc.rst)
> > > > 
> > > > "Provides information about distribution and utilization of memory."
> > > True. Yet we do not export any random counters, do we?
> > > 
> > > > Im not the designed of dma-buf, I think  global_node_page_state as a 
> > > > kernel
> > > > internal.
> > > It provides a node specific and optimized counters. Is this a good fit
> > > with your new counter? Or the NUMA locality is of no importance?
> > Sounds good to me, if Christian Koenig think it is good, I will use that.
> > It is only virtio in drivers that use the global_node_page_state if
> > that matters.
> 
> DMA-buf are not NUMA aware at all. On which node the pages are allocated
> (and if we use pages at all and not internal device memory) is up to the
> exporter and importer.

The question is not whether it is NUMA aware but whether it is useful to
know per-numa data for the purpose the counter is supposed to serve.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-19 Thread Michal Hocko
On Mon 19-04-21 12:41:58, peter.enderb...@sony.com wrote:
> On 4/19/21 2:16 PM, Michal Hocko wrote:
> > On Sat 17-04-21 12:40:32, Peter Enderborg wrote:
> >> This adds a total used dma-buf memory. Details
> >> can be found in debugfs, however it is not for everyone
> >> and not always available. dma-buf are indirect allocated by
> >> userspace. So with this value we can monitor and detect
> >> userspace applications that have problems.
> > The changelog would benefit from more background on why this is needed,
> > and who is the primary consumer of that value.
> >
> > I cannot really comment on the dma-buf internals but I have two remarks.
> > Documentation/filesystems/proc.rst needs an update with the counter
> > explanation and secondly is this information useful for OOM situations
> > analysis? If yes then show_mem should dump the value as well.
> >
> > From the implementation point of view, is there any reason why this
> > hasn't used the existing global_node_page_state infrastructure?
> 
> I fix doc in next version.  Im not sure what you expect the commit message to 
> include.

As I've said, the usual justification covers answers to the following
questions:
- Why do we need it?
- Why is the existing data insufficient?
- Who is supposed to use the data, and for what?

I can see an answer to the first two questions (because this can be a
lot of memory and the existing infrastructure - debugfs - is not production
suitable). But the changelog doesn't really explain who is going to use
the new data. Is this monitoring to raise an early alarm when the
value grows? Is this for debugging misbehaving drivers? How is it
valuable for those?

> The function of the meminfo is: (From Documentation/filesystems/proc.rst)
> 
> "Provides information about distribution and utilization of memory."

True. Yet we do not export any random counters, do we?

> Im not the designed of dma-buf, I think  global_node_page_state as a kernel
> internal.

It provides node-specific and optimized counters. Is this a good fit
for your new counter? Or is the NUMA locality of no importance?
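
For what it's worth, a hypothetical sketch of what hooking into that
infrastructure could look like; NR_DMABUF_PAGES is an invented item name,
while the surrounding helpers (enum node_stat_item, mod_node_page_state(),
global_node_page_state()) are the existing kernel APIs.

/* 1) a new item at the end of enum node_stat_item (include/linux/mmzone.h):
 *        NR_DMABUF_PAGES,   // RAM pages currently backing dma-buf objects
 */

/* 2) exporters that back a buffer with system RAM update the counter */
static void dmabuf_account_pages(struct page *page, long nr_pages)
{
    mod_node_page_state(page_pgdat(page), NR_DMABUF_PAGES, nr_pages);
}

/* 3) meminfo / show_mem / per-node sysfs can then read the aggregate */
static unsigned long dmabuf_pages_kb(void)
{
    return global_node_page_state(NR_DMABUF_PAGES) << (PAGE_SHIFT - 10);
}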

> dma-buf is a device driver that provides a function so I might be
> on the outside. However I also see that it might be relevant for a OOM.
> It is memory that can be freed by killing userspace processes.
> 
> The show_mem thing. Should it be a separate patch?

This is up to you but if you want to expose the counter then send it in
one series.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-19 Thread Michal Hocko
On Sat 17-04-21 12:40:32, Peter Enderborg wrote:
> This adds a total used dma-buf memory. Details
> can be found in debugfs, however it is not for everyone
> and not always available. dma-buf are indirect allocated by
> userspace. So with this value we can monitor and detect
> userspace applications that have problems.

The changelog would benefit from more background on why this is needed,
and who is the primary consumer of that value.

I cannot really comment on the dma-buf internals but I have two remarks.
First, Documentation/filesystems/proc.rst needs an update with an explanation
of the counter; secondly, is this information useful for OOM situation
analysis? If yes then show_mem should dump the value as well.

From the implementation point of view, is there any reason why this
hasn't used the existing global_node_page_state infrastructure?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard

2021-04-16 Thread Michal Hocko
On Fri 16-04-21 07:26:43, Dave Hansen wrote:
> On 4/16/21 5:35 AM, Michal Hocko wrote:
> >   I have to confess that I haven't grasped the initialization
> >   completely. There is a nice comment explaining a 2 socket system with
> >   3 different NUMA nodes attached to it with one node being terminal.
> >   This is OK if the terminal node is PMEM but how that fits into usual
> >   NUMA setups. E.g.
> >   4 nodes each with its set of CPUs
> >   node distances:
> >   node   0   1   2   3
> >   0:  10  20  20  20
> >   1:  20  10  20  20
> >   2:  20  20  10  20
> >   3:  20  20  20  10
> >   Do I get it right that Node 3 would be terminal?
> 
> Yes, I think Node 3 would end up being the terminal node in that setup.
> 
> That said, I'm not sure how much I expect folks to use this on
> traditional, non-tiered setups.  It's also hard to argue what the
> migration order *should* be when all the nodes are uniform.

Well, they are not really uniform. The access latencies differ, and I can
imagine that spreading page cache to a distant node might be much
better than the IO involved in a refault.

On the other hand I do understand that restricting the feature to CPU-less
NUMA setups is quite sane; it gives us a better understanding of how
this can be used so we can improve on top. Maybe we will learn that we want
to have the demotion path admin-controllable (on the system level or maybe
even more fine grained per memcg/cpuset).

[...]
> > - The demotion is implemented at shrink_page_list level which migrates
> >   pages in the first round and then falls back to the regular reclaim
> >   when migration fails. This means that the reclaim context
> >   (PF_MEMALLOC) will allocate memory so it has access to full memory
> >   reserves. Btw. I do not see __GFP_NO_MEMALLOC anywhere in the allocation
> >   mask which looks like a bug rather than an intention. Btw. using
> >   GFP_NOWAIT in the allocation callback would make more things clear
> >   IMO.
> 
> Yes, the lack of __GFP_NO_MEMALLOC is a bug.  I'll fix that up.
> 
> GFP_NOWAIT _seems_ like it will work.  I'll give it a shot.

Let me clarify a bit. The slow path does involve __gfp_pfmemalloc_flags
before bailing out for non-sleeping allocations. So you would need both,
unless you want to involve reclaim on the target node while you are
reclaiming the origin node.
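
A small sketch of the combined mask being suggested here (the exact flag set
is an assumption, not the final patch):

/* Hypothetical allocation mask for the demotion target page: GFP_NOWAIT
 * avoids direct reclaim/sleeping, __GFP_NOMEMALLOC keeps the PF_MEMALLOC
 * reclaim context away from the memory reserves, and __GFP_NOWARN keeps
 * failures quiet since we simply fall back to regular reclaim. */
static inline gfp_t demotion_gfp_mask(void)
{
    return GFP_NOWAIT | __GFP_NOWARN | __GFP_NOMEMALLOC |
           __GFP_THISNODE | __GFP_MOVABLE;
}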

> > - Memcg reclaim is excluded from all this because it is not NUMA aware
> >   which makes sense to me.
> > - Anonymous pages are bit tricky because they can be demoted even when
> >   they cannot be reclaimed due to no (or no available) swap storage.
> >   Unless I have missed something the second round will try to reclaim
> >   them even the later is true and I am not sure this is completely OK.
> 
> What we want is something like this:
> 
> Swap Space / Demotion OK  -> Can Reclaim
> Swap Space / Demotion Off -> Can Reclaim
> Swap Full  / Demotion OK  -> Can Reclaim
> Swap Full  / Demotion Off -> No Reclaim
> 
> I *think* that's what can_reclaim_anon_pages() ends up doing.  Maybe I'm
> misunderstanding what you are referring to, though.  By "second round"
> did you mean when we do reclaim on a node which is a terminal node?

No, I mean the migration failure case where you splice pages back to the
list to reclaim them. In that round you do not demote and want to reclaim.
But a lack of swap space will make those pages unreclaimable. I suspect
this would just work out fine but I am not sure off the top of my head.

> > I am still trying to digest the whole thing but at least jamming
> > node_reclaim logic into kswapd seems strange to me. Need to think more
> > about that though.
> 
> I'm entirely open to other ways to do the opt-in.  It seemed sane at the
> time, but I also understand the kswapd concern.
> 
> > Btw. do you have any numbers from running this with some real work
> > workload?
> 
> Yes, quite a bit.  Do you have a specific scenario in mind?  Folks seem
> to come at this in two different ways:
> 
> Some want to know how much DRAM they can replace by buying some PMEM.
> They tend to care about how much adding the (cheaper) PMEM slows them
> down versus (expensive) DRAM.  They're making a cost-benefit call
> 
> Others want to repurpose some PMEM they already have.  They want to know
> how much using PMEM in this way will speed them up.  They will basically
> take any speedup they can get.
> 
> I ask because as a kernel developer with PMEM in my systems, I find the
> "I'll take what I can get" case more personally appealing.  But, the
> business folks are much more keen on the "DRAM replacement" use.  Do you
> have any thoughts on what you would like to see?

I was thinking about typical large in-memory processing (e.g. in-memory
databases) where the hot part of the working set is only a portion, and
spilling over to slower memory can still be beneficial because IO +
data preprocessing on cold data is much slower.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard

2021-04-16 Thread Michal Hocko
Hi,
I am really sorry to jump into this train sooo late. I have quickly
glanced through the series and I have some questions/concerns. Let me
express them here rather than in specific patches.

First of all I do think that demotion is a useful way to balance the
memory in general. And that is not really bound to PMEM equipped
systems. There are larger NUMA machines which are not trivial to
partition and our existing NUMA APIs are far from ideal to help with
that.

I do appreciate that the whole thing is an opt-in because this might
break workloads which are careful with placement. I am not sure
there is a way to handle constraints in an optimal way, if that is
possible at all in some cases (e.g. do we have a way to track a page to
its cpuset resp. task mempolicy in all cases?).

The cover letter focuses on usecases but it doesn't really provide any,
so let me try to lay them down here (let's see whether I missed something
important).
- order for demotion defines a very simple fallback to a single node
  based on proximity, but cycles are not allowed in the fallback
  mask.
  I have to confess that I haven't grasped the initialization
  completely. There is a nice comment explaining a 2-socket system with
  3 different NUMA nodes attached to it, with one node being terminal.
  This is OK if the terminal node is PMEM, but how does that fit into usual
  NUMA setups? E.g.
  4 nodes each with its set of CPUs
  node distances:
  node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
  Do I get it right that Node 3 would be terminal?
- The demotion is controlled by node_reclaim_mode but unlike other modes
  it applies to both direct and kswapd reclaims.
  I do not see that explained anywhere though.
- The demotion is implemented at the shrink_page_list level which migrates
  pages in the first round and then falls back to the regular reclaim
  when migration fails. This means that the reclaim context
  (PF_MEMALLOC) will allocate memory, so it has access to the full memory
  reserves. Btw. I do not see __GFP_NOMEMALLOC anywhere in the allocation
  mask, which looks like a bug rather than an intention. Also, using
  GFP_NOWAIT in the allocation callback would make things clearer IMO
  (see the sketch after this list).
- Memcg reclaim is excluded from all this because it is not NUMA aware
  which makes sense to me.
- Anonymous pages are a bit tricky because they can be demoted even when
  they cannot be reclaimed due to no (or no available) swap storage.
  Unless I have missed something, the second round will try to reclaim
  them even when the latter is true and I am not sure this is completely OK.
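
To make the gfp point above concrete, this is roughly the kind of allocation
callback I have in mind - a sketch only, the function name and exact flag set
are illustrative and not taken from the series:

static struct page *alloc_demote_page(struct page *page, unsigned long node)
{
	/*
	 * Sketch: never block the reclaiming context and never dip into
	 * the memory reserves from PF_MEMALLOC. Kicking kswapd on the
	 * target node is still allowed.
	 */
	gfp_t gfp = GFP_NOWAIT | __GFP_THISNODE | __GFP_NOWARN |
		    __GFP_NOMEMALLOC;

	return alloc_pages_node(node, gfp, 0);
}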

I hope I've captured all important parts. There are some more details
but they do not seem that important. 

I am still trying to digest the whole thing but at least jamming
node_reclaim logic into kswapd seems strange to me. Need to think more
about that though.

Btw. do you have any numbers from running this with some real work
workload?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-16 Thread Michal Hocko
On Thu 15-04-21 15:31:46, Tim Chen wrote:
> 
> 
> On 4/9/21 12:24 AM, Michal Hocko wrote:
> > On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
> >> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi  wrote:
> > [...]
> >>> The low priority jobs should be able to be restricted by cpuset, for
> >>> example, just keep them on second tier memory nodes. Then all the
> >>> above problems are gone.
> > 
> > Yes, if the aim is to isolate some users from certain numa node then
> > cpuset is a good fit but as Shakeel says this is very likely not what
> > this work is aiming for.
> > 
> >> Yes that's an extreme way to overcome the issue but we can do less
> >> extreme by just (hard) limiting the top tier usage of low priority
> >> jobs.
> > 
> > Per numa node high/hard limit would help with a more fine grained control.
> > The configuration would be tricky though. All low priority memcgs would
> > have to be carefully configured to leave enough for your important
> > processes. That includes also memory which is not accounted to any
> > memcg. 
> > The behavior of those limits would be quite tricky for OOM situations
> > as well due to a lack of NUMA aware oom killer.
> > 
> 
> Another downside of putting limits on individual NUMA
> node is it would limit flexibility.

Let me just clarify one thing. I haven't been proposing per NUMA limits.
As I've said above it would be quite tricky to use and the behavior
would be tricky as well. All I am saying is that we do not want to have
an interface that is tightly bound to any specific HW setup (fast RAM as
a top tier and PMEM as a fallback) that you have proposed here. We want
to have a generic NUMA based abstraction. What that abstraction is going
to look like is an open question and it really depends on the usecases that
we expect to see.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 5/8] mm: memcontrol: rename lruvec_holds_page_lru_lock to page_matches_lruvec

2021-04-16 Thread Michal Hocko
On Fri 16-04-21 13:14:04, Muchun Song wrote:
> lruvec_holds_page_lru_lock() doesn't check anything about locking and is
> used to check whether the page belongs to the lruvec. So rename it to
> page_matches_lruvec().
> 
> Signed-off-by: Muchun Song 

Acked-by: Michal Hocko 

Thanks

> ---
>  include/linux/memcontrol.h | 7 +++
>  mm/vmscan.c| 2 +-
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2fc728492c9b..40b0c31ea4ba 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1492,8 +1492,7 @@ static inline void unlock_page_lruvec_irqrestore(struct 
> lruvec *lruvec,
>   spin_unlock_irqrestore(>lru_lock, flags);
>  }
>  
> -static inline bool lruvec_holds_page_lru_lock(struct page *page,
> -   struct lruvec *lruvec)
> +static inline bool page_matches_lruvec(struct page *page, struct lruvec 
> *lruvec)
>  {
>   return lruvec_pgdat(lruvec) == page_pgdat(page) &&
>  lruvec_memcg(lruvec) == page_memcg(page);
> @@ -1504,7 +1503,7 @@ static inline struct lruvec 
> *relock_page_lruvec_irq(struct page *page,
>   struct lruvec *locked_lruvec)
>  {
>   if (locked_lruvec) {
> - if (lruvec_holds_page_lru_lock(page, locked_lruvec))
> + if (page_matches_lruvec(page, locked_lruvec))
>   return locked_lruvec;
>  
>   unlock_page_lruvec_irq(locked_lruvec);
> @@ -1518,7 +1517,7 @@ static inline struct lruvec 
> *relock_page_lruvec_irqsave(struct page *page,
>   struct lruvec *locked_lruvec, unsigned long *flags)
>  {
>   if (locked_lruvec) {
> - if (lruvec_holds_page_lru_lock(page, locked_lruvec))
> + if (page_matches_lruvec(page, locked_lruvec))
>   return locked_lruvec;
>  
>   unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bb8321026c0c..2bc5cf409958 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2062,7 +2062,7 @@ static unsigned noinline_for_stack 
> move_pages_to_lru(struct lruvec *lruvec,
>* All pages were isolated from the same lruvec (and isolation
>* inhibits memcg migration).
>*/
> - VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
> + VM_BUG_ON_PAGE(!page_matches_lruvec(page, lruvec), page);
>   add_page_to_lru_list(page, lruvec);
>   nr_pages = thp_nr_pages(page);
>   nr_moved += nr_pages;
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v8 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

2021-04-15 Thread Michal Hocko
On Thu 15-04-21 12:35:44, Oscar Salvador wrote:
> pfn_range_valid_contig() bails out when it finds an in-use page or a
> hugetlb page, among other things.
> We can drop the in-use page check since __alloc_contig_pages can migrate
> away those pages, and the hugetlb page check can go too since
> isolate_migratepages_range is now capable of dealing with hugetlb pages.
> Either way, those checks are racy so let the end function handle it
> when the time comes.
> 
> Signed-off-by: Oscar Salvador 
> Suggested-by: David Hildenbrand 
> Reviewed-by: David Hildenbrand 
> Acked-by: Mike Kravetz 

Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b5a94de3cdde..c5338e912ace 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8901,12 +8901,6 @@ static bool pfn_range_valid_contig(struct zone *z, 
> unsigned long start_pfn,
>  
>   if (PageReserved(page))
>   return false;
> -
> - if (page_count(page) > 0)
> - return false;
> -
> - if (PageHuge(page))
> -     return false;
>   }
>   return true;
>  }
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v8 3/7] mm,hugetlb: Drop clearing of flag from prep_new_huge_page

2021-04-15 Thread Michal Hocko
On Thu 15-04-21 12:35:40, Oscar Salvador wrote:
> Pages allocated after boot get its private field cleared by means
> of post_alloc_hook().

You surely meant to say s@boot@page/cma allocator@ here

> Pages allocated during boot, that is directly from the memblock allocator,
> get cleared by paging_init()->..->memmap_init_zone->..->__init_single_page()
> before any memblock allocation.
> 
> Based on this ground, let us remove the clearing of the flag from
> prep_new_huge_page() as it is not needed.

I would also mention that this is a leftover from 6c0371490140
("hugetlb: convert PageHugeFreed to HPageFreed flag"). Previously the
explicit clearing was necessary because compound allocations do not get
this initialization (see prep_compound_page).

> Signed-off-by: Oscar Salvador 

with that
Acked-by: Michal Hocko 

> ---
>  mm/hugetlb.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 54d81d5947ed..2cb9fa79cbaa 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1493,7 +1493,6 @@ static void prep_new_huge_page(struct hstate *h, struct 
> page *page, int nid)
>   spin_lock_irq(_lock);
>   h->nr_huge_pages++;
>   h->nr_huge_pages_node[nid]++;
> - ClearHPageFreed(page);
>   spin_unlock_irq(_lock);
>  }
>  
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH 2/7] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm

2021-04-15 Thread Michal Hocko
On Thu 15-04-21 11:13:16, Muchun Song wrote:
> On Wed, Apr 14, 2021 at 6:15 PM Michal Hocko  wrote:
> >
> > On Wed 14-04-21 18:04:35, Muchun Song wrote:
> > > On Wed, Apr 14, 2021 at 5:24 PM Michal Hocko  wrote:
> > > >
> > > > On Tue 13-04-21 14:51:48, Muchun Song wrote:
> > > > > When mm is NULL, we do not need to hold rcu lock and call css_tryget 
> > > > > for
> > > > > the root memcg. And we also do not need to check !mm in every loop of
> > > > > while. So bail out early when !mm.
> > > >
> > > > mem_cgroup_charge and other callers unconditionally drop the reference
> > > > so how come this does not underflow reference count?
> > >
> > > For the root memcg, the CSS_NO_REF flag is set, so css_get
> > > and css_put do not get or put reference.
> >
> > Ohh, right you are. I must have forgotten about that special case. I am
> > pretty sure I (and likely a few more) will stumble over that in the future
> > again. A small comment explaining that the reference can be safely
> > ignored would be helpful.
> 
> OK. Will do.

I would go with the following if that helps
/*
 * No need to css_get on root memcg as the reference counting is
 * disabled on the root level in the cgroup core. See CSS_NO_REF
 */

Thanks
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH 4/7] mm: memcontrol: simplify lruvec_holds_page_lru_lock

2021-04-15 Thread Michal Hocko
On Wed 14-04-21 13:49:56, Johannes Weiner wrote:
> On Wed, Apr 14, 2021 at 06:00:42PM +0800, Muchun Song wrote:
> > On Wed, Apr 14, 2021 at 5:44 PM Michal Hocko  wrote:
> > >
> > > On Tue 13-04-21 14:51:50, Muchun Song wrote:
> > > > We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> > > > do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So 
> > > > use
> > > > lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> > > > page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> > > > of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> > > > to simplify the code. We can have a single definition for this function
> > > > that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> > > > CONFIG_MEMCG.
> > >
> > > Neat. While you are at it, wouldn't it make sense to rename the function
> > > as well? I do not want to bikeshed but this is really a misnomer. It
> > > doesn't check anything about locking. page_belongs_lruvec?
> > 
> > Right. lruvec_holds_page_lru_lock is used to check whether
> > the page belongs to the lruvec. page_belongs_lruvec
> > obviously more clearer. I can rename it to
> > page_belongs_lruvec the next version.
> 
> This sounds strange to me, I think 'belongs' needs a 'to' to be
> correct, so page_belongs_to_lruvec(). Still kind of a mouthful.
> 
> page_matches_lruvec()?
> 
> page_from_lruvec()?

Any of those is much better than what we have here.

-- 
Michal Hocko
SUSE Labs


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-15 Thread Michal Hocko
On Thu 15-04-21 10:53:00, Bharata B Rao wrote:
> On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> > 
> > Another approach may be to identify filesystem types that do not
> > need memcg awareness and feed that into alloc_super() to set/clear
> > the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
> > virtual filesystems that expose system information do not really
> > need full memcg awareness because they are generally only visible to
> > a single memcg instance...
> 
> Would something like below be appropriate?

No. First of all you are defining yet another way to say
SHRINKER_MEMCG_AWARE, which is messy. And secondly, why would shmem, proc
and ramfs be special so that it is OK for them to opt out? There is not a
single word about that reasoning in your changelog.

> From f314083ad69fde2a420a1b74febd6d3f7a25085f Mon Sep 17 00:00:00 2001
> From: Bharata B Rao 
> Date: Wed, 14 Apr 2021 11:21:24 +0530
> Subject: [PATCH 1/1] fs: Let filesystems opt out of memcg awareness
> 
> All filesystem mounts by default are memcg aware and end hence
> end up creating shrinker list_lrus for all the memcgs. Due to
> the way the memcg_nr_cache_ids grow and the list_lru heads are
> allocated for all memcgs, huge amount of memory gets consumed
> by kmalloc-32 slab cache when running thousands of containers.
> 
> Improve this situation by allowing filesystems to opt out
> of memcg awareness. In this patch, tmpfs, proc and ramfs
> opt out of memcg awareness. This leads to considerable memory
> savings when running 10k containers.
> 
> Signed-off-by: Bharata B Rao 
> ---
>  fs/proc/root.c |  1 +
>  fs/ramfs/inode.c   |  1 +
>  fs/super.c | 27 +++
>  include/linux/fs_context.h |  2 ++
>  mm/shmem.c |  1 +
>  5 files changed, 24 insertions(+), 8 deletions(-)

[...]
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 11/13] mm/mempolicy: huge-page allocation for many preferred

2021-04-14 Thread Michal Hocko
Please use hugetlb prefix to make it explicit that this is hugetlb
related.

On Wed 17-03-21 11:40:08, Feng Tang wrote:
> From: Ben Widawsky 
> 
> Implement the missing huge page allocation functionality while obeying
> the preferred node semantics.
> 
> This uses a fallback mechanism to try multiple preferred nodes first,
> and then all other nodes. It cannot use the helper function that was
> introduced because huge page allocation already has its own helpers and
> it was more LOC, and effort to try to consolidate that.
> 
> The weirdness is MPOL_PREFERRED_MANY can't be called yet because it is
> part of the UAPI we haven't yet exposed. Instead of make that define
> global, it's simply changed with the UAPI patch.
> 
> [ feng: add NOWARN flag, and skip the direct reclaim to speedup allocation
>   in some case ]
> 
> Link: 
> https://lore.kernel.org/r/20200630212517.308045-12-ben.widaw...@intel.com
> Signed-off-by: Ben Widawsky 
> Signed-off-by: Feng Tang 
> ---
>  mm/hugetlb.c   | 26 +++---
>  mm/mempolicy.c |  3 ++-
>  2 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8fb42c6..9dfbfa3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1105,7 +1105,7 @@ static struct page *dequeue_huge_page_vma(struct hstate 
> *h,
>   unsigned long address, int avoid_reserve,
>   long chg)
>  {
> - struct page *page;
> + struct page *page = NULL;
>   struct mempolicy *mpol;
>   gfp_t gfp_mask;
>   nodemask_t *nodemask;
> @@ -1126,7 +1126,17 @@ static struct page *dequeue_huge_page_vma(struct 
> hstate *h,
>  
>   gfp_mask = htlb_alloc_mask(h);
>   nid = huge_node(vma, address, gfp_mask, , );
> - page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> + if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */

Please use MPOL_PREFERRED_MANY explicitly here.

> + gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
> +
> + gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
> + page = dequeue_huge_page_nodemask(h,
> + gfp_mask1, nid, nodemask);
> + if (!page)
> + page = dequeue_huge_page_nodemask(h, gfp_mask, nid, 
> NULL);
> + } else {
> + page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> + }
>   if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
>   SetHPageRestoreReserve(page);
>   h->resv_huge_pages--;

__GFP_DIRECT_RECLAIM handling is not needed here. dequeue_huge_page_nodemask
only uses the gfp mask to get the zone and cpusets constraints. So the above
should have simply been
	if (mpol->mode == MPOL_PREFERRED_MANY) {
		page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
		if (page)
			goto got_page;
		/* fallback to all nodes */
		nodemask = NULL;
	}
	page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
got_page:
	if (page ...)

> @@ -1883,7 +1893,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct 
> hstate *h,
>   nodemask_t *nodemask;
>  
>   nid = huge_node(vma, addr, gfp_mask, , );
> - page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
> + if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
> + gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
> +
> + gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
> + page = alloc_surplus_huge_page(h,
> + gfp_mask1, nid, nodemask);
> + if (!page)
> + alloc_surplus_huge_page(h, gfp_mask, nid, NULL);
> + } else {
> + page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
> + }

And here similarly
	if (mpol->mode == MPOL_PREFERRED_MANY) {
		page = alloc_surplus_huge_page(h, (gfp_mask | __GFP_NOWARN) &
					       ~__GFP_DIRECT_RECLAIM, nid, nodemask);
		if (page)
			goto got_page;
		/* fallback to all nodes */
		nodemask = NULL;
	}
	page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
got_page:
>   mpol_cond_put(mpol);

You can have a dedicated gfp mask here if you prefer of course, but calling
out MPOL_PREFERRED_MANY explicitly will make the code easier to read.

>   return page;
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 10/13] mm/mempolicy: VMA allocation for many preferred

2021-04-14 Thread Michal Hocko
On Wed 17-03-21 11:40:07, Feng Tang wrote:
[...]
> @@ -2301,10 +2300,26 @@ alloc_pages_vma(gfp_t gfp, int order, struct 
> vm_area_struct *vma,
>* does not allow the current node in its nodemask, we allocate
>* the standard way.
>*/
> - if ((pol->mode == MPOL_PREFERRED ||
> -  pol->mode == MPOL_PREFERRED_MANY) &&
> - !(pol->flags & MPOL_F_LOCAL))
> + if (pol->mode == MPOL_PREFERRED || !(pol->flags & 
> MPOL_F_LOCAL)) {
>   hpage_node = first_node(pol->nodes);
> + } else if (pol->mode == MPOL_PREFERRED_MANY) {
> + struct zoneref *z;
> +
> + /*
> +  * In this policy, with direct reclaim, the normal
> +  * policy based allocation will do the right thing - try
> +  * twice using the preferred nodes first, and all nodes
> +  * second.
> +  */
> + if (gfp & __GFP_DIRECT_RECLAIM) {
> + page = alloc_pages_policy(pol, gfp, order, 
> NUMA_NO_NODE);
> + goto out;
> + }
> +
> + z = first_zones_zonelist(node_zonelist(numa_node_id(), 
> GFP_HIGHUSER),
> +  gfp_zone(GFP_HIGHUSER), 
> >nodes);
> + hpage_node = zone_to_nid(z->zone);
> + }
>  
>   nmask = policy_nodemask(gfp, pol);
>   if (!nmask || node_isset(hpage_node, *nmask)) {
> @@ -2330,9 +2345,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct 
> vm_area_struct *vma,
>   }
>   }
>  
> - nmask = policy_nodemask(gfp, pol);
> - preferred_nid = policy_node(gfp, pol, node);
> - page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> + page = alloc_pages_policy(pol, gfp, order, NUMA_NO_NODE);
>   mpol_cond_put(pol);
>  out:
>   return page;

OK, it took me a while to grasp this but the code is a mess I have to
say. Not that it was an act of beauty before, but this makes it much
harder to follow. And alloc_pages_policy doesn't really help either. I
would have expected that a dedicated alloc_pages_preferred with a
general fallback to __alloc_pages_nodemask would have been much easier
to follow.
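
For illustration, a rough sketch of what I have in mind - the helper name and
the exact nodemask field are illustrative, this is not a drop-in replacement:

static struct page *alloc_pages_preferred(gfp_t gfp, unsigned int order,
					  struct mempolicy *pol)
{
	struct page *page;

	/* optimistic first pass restricted to the preferred nodes */
	page = __alloc_pages_nodemask(gfp | __GFP_NOWARN, order,
				      numa_node_id(), &pol->nodes);
	if (page)
		return page;

	/* fallback to the full node set */
	return __alloc_pages_nodemask(gfp, order, numa_node_id(), NULL);
}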
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 08/13] mm/mempolicy: Create a page allocator for policy

2021-04-14 Thread Michal Hocko
On Wed 17-03-21 11:40:05, Feng Tang wrote:
> From: Ben Widawsky 
> 
> Add a helper function which takes care of handling multiple preferred
> nodes. It will be called by future patches that need to handle this,
> specifically VMA based page allocation, and task based page allocation.
> Huge pages don't quite fit the same pattern because they use different
> underlying page allocation functions. This consumes the previous
> interleave policy specific allocation function to make a one stop shop
> for policy based allocation.
> 
> With this, MPOL_PREFERRED_MANY's semantic is more like MPOL_PREFERRED
> that it will first try the preferred node/nodes, and fallback to all
> other nodes when first try fails. Thanks to Michal Hocko for suggestions
> on this.
> 
> For now, only interleaved policy will be used so there should be no
> functional change yet. However, if bisection points to issues in the
> next few commits, it was likely the fault of this patch.

I am not sure this is helping much. Let's see in later patches but I
would keep them separate and rather create a dedicated function for the
new policy allocation mode.

> Similar functionality is offered via policy_node() and
> policy_nodemask(). By themselves however, neither can achieve this
> fallback style of sets of nodes.
> 
> [ Feng: for the first try, add NOWARN flag, and skip the direct reclaim
>   to speedup allocation in some case ]
> 
> Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widaw...@intel.com
> Signed-off-by: Ben Widawsky 
> Signed-off-by: Feng Tang 
> ---
>  mm/mempolicy.c | 65 
> ++
>  1 file changed, 52 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d945f29..d21105b 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2187,22 +2187,60 @@ bool mempolicy_nodemask_intersects(struct task_struct 
> *tsk,
>   return ret;
>  }
>  
> -/* Allocate a page in interleaved policy.
> -   Own path because it needs to do special accounting. */
> -static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> - unsigned nid)
> +/* Handle page allocation for all but interleaved policies */
> +static struct page *alloc_pages_policy(struct mempolicy *pol, gfp_t gfp,
> +unsigned int order, int preferred_nid)
>  {
>   struct page *page;
> + gfp_t gfp_mask = gfp;
>  
> - page = __alloc_pages(gfp, order, nid);
> - /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
> - if (!static_branch_likely(_numa_stat_key))
> + if (pol->mode == MPOL_INTERLEAVE) {
> + page = __alloc_pages(gfp, order, preferred_nid);
> + /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is 
> disabled */
> + if (!static_branch_likely(_numa_stat_key))
> + return page;
> + if (page && page_to_nid(page) == preferred_nid) {
> + preempt_disable();
> + __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> + preempt_enable();
> + }
>   return page;
> - if (page && page_to_nid(page) == nid) {
> - preempt_disable();
> - __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> - preempt_enable();
>   }
> +
> + VM_BUG_ON(preferred_nid != NUMA_NO_NODE);
> +
> + preferred_nid = numa_node_id();
> +
> + /*
> +  * There is a two pass approach implemented here for
> +  * MPOL_PREFERRED_MANY. In the first pass we try the preferred nodes
> +  * but allow the allocation to fail. The below table explains how
> +  * this is achieved.
> +  *
> +  * | Policy| preferred nid | nodemask   |
> +  * |---|---||
> +  * | MPOL_DEFAULT  | local | NULL   |
> +  * | MPOL_PREFERRED| best  | NULL   |
> +  * | MPOL_INTERLEAVE   | ERR   | ERR|
> +  * | MPOL_BIND | local | pol->nodes |
> +  * | MPOL_PREFERRED_MANY   | best  | pol->nodes |
> +  * | MPOL_PREFERRED_MANY (round 2) | local | NULL   |
> +  * +---+---++
> +  */
> + if (pol->mode == MPOL_PREFERRED_MANY) {
> + gfp_mask |=  __GFP_NOWARN;
> +
> + /* Skip direct reclaim, as there will be a second try */
> + gfp_mask &

Re: [PATCH v4 07/13] mm/mempolicy: handle MPOL_PREFERRED_MANY like BIND

2021-04-14 Thread Michal Hocko
On Wed 17-03-21 11:40:04, Feng Tang wrote:
> From: Ben Widawsky 
> 
> Begin the real plumbing for handling this new policy. Now that the
> internal representation for preferred nodes and bound nodes is the same,
> and we can envision what multiple preferred nodes will behave like,
> there are obvious places where we can simply reuse the bind behavior.
> 
> In v1 of this series, the moral equivalent was:
> "mm: Finish handling MPOL_PREFERRED_MANY". Like that, this attempts to
> implement the easiest spots for the new policy. Unlike that, this just
> reuses BIND.

No, this is a big step back. I think we really want to treat this as
PREFERRED. It doesn't have much to do with the BIND semantic at all.
At this stage there should be 2 things remaining - the syscalls plumbing and
the 2 pass allocation request (an optimistic pass restricted to the preferred
nodes and a fallback to all nodes).
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 06/13] mm/mempolicy: kill v.preferred_nodes

2021-04-14 Thread Michal Hocko
On Wed 17-03-21 11:40:03, Feng Tang wrote:
> From: Ben Widawsky 
> 
> Now that preferred_nodes is just a mask, and policies are mutually
> exclusive, there is no reason to have a separate mask.
> 
> This patch is optional. It definitely helps clean up code in future
> patches, but there is no functional difference to leaving it with the
> previous name. I do believe it helps demonstrate the exclusivity of the
> fields.

Yeah, let's just do it after the whole thing is merged. The separation
helps a bit to review the code at this stage because it is so much
easier to grep for preferred_nodes than nodes.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 05/13] mm/mempolicy: refactor rebind code for PREFERRED_MANY

2021-04-14 Thread Michal Hocko
On Wed 17-03-21 11:40:02, Feng Tang wrote:
> From: Dave Hansen 
> 
> Again, this extracts the "only one node must be set" behavior of
> MPOL_PREFERRED.  It retains virtually all of the existing code so it can
> be used by MPOL_PREFERRED_MANY as well.
> 
> v2:
> Fixed typos in commit message. (Ben)
> Merged bits from other patches. (Ben)
> annotate mpol_rebind_preferred_many as unused (Ben)

I am giving up on the rebinding code for now until we clarify that in my
earlier email.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 04/13] mm/mempolicy: allow preferred code to take a nodemask

2021-04-14 Thread Michal Hocko
On Wed 17-03-21 11:40:01, Feng Tang wrote:
> From: Dave Hansen 
> 
> Create a helper function (mpol_new_preferred_many()) which is usable
> both by the old, single-node MPOL_PREFERRED and the new
> MPOL_PREFERRED_MANY.
> 
> Enforce the old single-node MPOL_PREFERRED behavior in the "new"
> version of mpol_new_preferred() which calls mpol_new_preferred_many().
> 
> v3:
>   * fix a stack overflow caused by emty nodemask (Feng)
> 
> Link: https://lore.kernel.org/r/20200630212517.308045-5-ben.widaw...@intel.com
> Signed-off-by: Dave Hansen 
> Signed-off-by: Ben Widawsky 
> Signed-off-by: Feng Tang 
> ---
>  mm/mempolicy.c | 21 +++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 1228d8e..6fb2cab 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -203,17 +203,34 @@ static int mpol_new_interleave(struct mempolicy *pol, 
> const nodemask_t *nodes)
>   return 0;
>  }
>  
> -static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
> +static int mpol_new_preferred_many(struct mempolicy *pol,
> +const nodemask_t *nodes)
>  {
>   if (!nodes)
>   pol->flags |= MPOL_F_LOCAL; /* local allocation */

Now you have confused me. I thought that MPOL_PREFERRED_MANY with a NULL
nodemask would be disallowed as it is effectively MPOL_PREFERRED aka
MPOL_F_LOCAL. Or am I misreading the code?

>   else if (nodes_empty(*nodes))
>   return -EINVAL; /*  no allowed nodes */
>   else
> - pol->v.preferred_nodes = nodemask_of_node(first_node(*nodes));
> + pol->v.preferred_nodes = *nodes;
>   return 0;
>  }
>  
> +static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
> +{
> + if (nodes) {
> + /* MPOL_PREFERRED can only take a single node: */
> + nodemask_t tmp;
> +
> + if (nodes_empty(*nodes))
> + return -EINVAL;
> +
> + tmp = nodemask_of_node(first_node(*nodes));
> + return mpol_new_preferred_many(pol, );
> + }
> +
> + return mpol_new_preferred_many(pol, NULL);
> +}
> +
>  static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
>  {
>   if (nodes_empty(*nodes))
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 03/13] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes

2021-04-14 Thread Michal Hocko
I do not think this is a correct behavior. Preferred policy, whether it
is a single node or a nodemask, is a hint not a requirement. So we
should always treat it as intersecting. I do understand that the naming
can be confusing because an intersect operation should indeed check
nodemasks, but this is yet another trap of the mempolicy code. It is
only used for the OOM victim selection.

Btw. the code is wrong for INTERLEAVE as well because the interleaving
node is only used as a hint; the allocation is not bound by the
interleave nodemask. Sigh...

>   case MPOL_BIND:
>   case MPOL_INTERLEAVE:
>   ret = nodes_intersects(mempolicy->v.nodes, *mask);
[...]

> @@ -2349,6 +2373,9 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy 
> *b)
>   case MPOL_BIND:
>   case MPOL_INTERLEAVE:
>   return !!nodes_equal(a->v.nodes, b->v.nodes);
> + case MPOL_PREFERRED_MANY:
> + return !!nodes_equal(a->v.preferred_nodes,
> +  b->v.preferred_nodes);

Again different from MPOL_PREFERRED...

>   case MPOL_PREFERRED:
>   /* a's ->flags is the same as b's */
>   if (a->flags & MPOL_F_LOCAL)
> @@ -2523,6 +2550,8 @@ int mpol_misplaced(struct page *page, struct 
> vm_area_struct *vma, unsigned long
>   polnid = zone_to_nid(z->zone);
>   break;
>  
> + /* case MPOL_PREFERRED_MANY: */
> +

I hope a follow up patch will make this not panic, but as you are already
plumbing everything in, it should really be as simple as a node_isset
check.
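
Something along these lines should do - a sketch only, modulo the exact field
name for the preferred nodemask:

	case MPOL_PREFERRED_MANY:
		if (node_isset(curnid, pol->v.preferred_nodes))
			goto out;
		polnid = first_node(pol->v.preferred_nodes);
		break;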

>   default:
>   BUG();

Besides that, this should really go!

> @@ -3035,6 +3066,9 @@ void mpol_to_str(char *buffer, int maxlen, struct 
> mempolicy *pol)
>   switch (mode) {
>   case MPOL_DEFAULT:
>   break;
> + case MPOL_PREFERRED_MANY:
> + WARN_ON(flags & MPOL_F_LOCAL);

Why WARN_ON here?

> + fallthrough;
>   case MPOL_PREFERRED:
>   if (flags & MPOL_F_LOCAL)
>   mode = MPOL_LOCAL;
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 02/13] mm/mempolicy: convert single preferred_node to full nodemask

2021-04-14 Thread Michal Hocko
   pol->w.cpuset_mems_allowed, preferred_node);
> + pol->v.preferred_nodes = tmp;
>   pol->w.cpuset_mems_allowed = *nodes;
>   }

I have to say that I really disliked the original code (because it
fiddles with the user provided input behind the back), but I got lost here
completely. What the heck is going on?
a) why do we even care about remapping a hint which is overridden by the
cpuset at the page allocator level and b) why do we need to allocate _two_
potentially large temporary bitmaps for that here?

I haven't spotted anything unexpected in the rest.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 00/13] Introduced multi-preference mempolicy

2021-04-14 Thread Michal Hocko
clearup (Feng)
>   * One RFC patch to speedup mem alloc in some case (Feng)
> 
>   Since v1:
>   * Dropped patch to replace numa_node_id in some places (mhocko)
>   * Dropped all the page allocation patches in favor of new mechanism to
> use fallbacks. (mhocko)
>   * Dropped the special snowflake preferred node algorithm (bwidawsk)
>   * If the preferred node fails, ALL nodes are rechecked instead of just
> the non-preferred nodes.
> 
> v4 Summary:
> 1: Random fix I found along the way
> 2-5: Represent node preference as a mask internally
> 6-7: Tread many preferred like bind
> 8-11: Handle page allocation for the new policy
> 12: Enable the uapi
> 13: unifiy 2 functions
> 
> Ben Widawsky (8):
>   mm/mempolicy: Add comment for missing LOCAL
>   mm/mempolicy: kill v.preferred_nodes
>   mm/mempolicy: handle MPOL_PREFERRED_MANY like BIND
>   mm/mempolicy: Create a page allocator for policy
>   mm/mempolicy: Thread allocation for many preferred
>   mm/mempolicy: VMA allocation for many preferred
>   mm/mempolicy: huge-page allocation for many preferred
>   mm/mempolicy: Advertise new MPOL_PREFERRED_MANY
> 
> Dave Hansen (4):
>   mm/mempolicy: convert single preferred_node to full nodemask
>   mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
>   mm/mempolicy: allow preferred code to take a nodemask
>   mm/mempolicy: refactor rebind code for PREFERRED_MANY
> 
> Feng Tang (1):
>   mem/mempolicy: unify mpol_new_preferred() and
> mpol_new_preferred_many()
> 
>  .../admin-guide/mm/numa_memory_policy.rst  |  22 +-
>  include/linux/mempolicy.h  |   6 +-
>  include/uapi/linux/mempolicy.h |   6 +-
>  mm/hugetlb.c   |  26 +-
>  mm/mempolicy.c | 272 
> ++---
>  5 files changed, 225 insertions(+), 107 deletions(-)
> 
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-14 Thread Michal Hocko
On Wed 14-04-21 12:49:53, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
[...]
> > > I checked, and when we get there in __alloc_bootmem_huge_page, 
> > > page->private is
> > > still zeroed, so I guess it should be safe to assume that we do not 
> > > really need
> > > to clear the flag in __prep_new_huge_page() routine?
> > 
> > It would be quite nasty if the struct pages content would be undefined.
> > Maybe that is possible but then I would rather stick the initialization
> > into __alloc_bootmem_huge_page.
> 
> Yes, but I do not think that is really possible unless I missed something.

Yeah, it should be fine. I was thinking of an alloc, modify struct pages,
free back to the bootmem allocator sequence. But I do not remember ever
seeing a sequence like that. Bootmem allocator users tend to be simple:
allocate storage and retain it for the lifetime of the system. Other than
the PageReserved bit they do not touch the metadata. If we want to be
paranoid then we can add a VM_WARN_ON for an unexpected state when
allocating from the bootmem. But I am not sure this is really worth it.
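
If we went down that road it could be as simple as the following in
__alloc_bootmem_huge_page (a sketch only):

	addr = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
					  0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
	/* catch bootmem handing out storage with a dirty struct page state */
	VM_WARN_ON(addr && page_private(virt_to_page(addr)));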
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-14 Thread Michal Hocko
On Wed 14-04-21 12:01:47, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 10:28:33AM +0200, Michal Hocko wrote:
> > You are right it doesn't do it there. But all struct pages, even those
> > that are allocated by the bootmem allocator should initialize its struct
> > pages. They would be poisoned otherwise, right? I would have to look at
> > the exact code path but IIRC this should be around the time bootmem
> > allocator state transitions to the page allocator.
> 
> Ok, you are right.
> struct pages are initialized a bit earlier through:
> 
> start_kernel
>  setup_arch
>   paging_init
>zone_sizes_init
> free_area_init
>  free_area_init_node
>   free_area_init_core
>memmap_init_zone
> memmap_init_range
>  __init_single_page
> 
> While the allocation of bootmem hugetlb happens
> 
> start_kernel
>  parse_args
>   ...
>hugepages_setup
> ...
>  hugetlb_hstate_alloc_pages
>   __alloc_bootmem_huge_page
> 
> which is after the setup_arch() call.

Thanks for pulling those paths. It is always painful to crawl that code.

> So by the time we get the page from __alloc_bootmem_huge_page(), fields are
> zeroed.
> I thought we might get in trouble because memblock_alloc_try_nid_raw() calls
> page_init_poison() which poisons the chunk with 0xff,e.g:
> 
> [1.955471] boot: ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> [1.955476] boot: ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> 
> 
>  but it seems that it does not touch the memmap struct pages.

Well, to be precise it does the very same thing with the memmap struct pages
but that is before the initialization code you have pointed out above.
In this context it just poisons the allocated content which is the GB
page storage.

> I checked, and when we get there in __alloc_bootmem_huge_page, page->private 
> is
> still zeroed, so I guess it should be safe to assume that we do not really 
> need
> to clear the flag in __prep_new_huge_page() routine?

It would be quite nasty if the struct pages content were undefined.
Maybe that is possible but then I would rather stick the initialization
into __alloc_bootmem_huge_page.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH 2/7] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm

2021-04-14 Thread Michal Hocko
On Wed 14-04-21 18:04:35, Muchun Song wrote:
> On Wed, Apr 14, 2021 at 5:24 PM Michal Hocko  wrote:
> >
> > On Tue 13-04-21 14:51:48, Muchun Song wrote:
> > > When mm is NULL, we do not need to hold rcu lock and call css_tryget for
> > > the root memcg. And we also do not need to check !mm in every loop of
> > > while. So bail out early when !mm.
> >
> > mem_cgroup_charge and other callers unconditionally drop the reference
> > so how come this does not underflow reference count?
> 
> For the root memcg, the CSS_NO_REF flag is set, so css_get
> and css_put do not get or put reference.

Ohh, right you are. I must have forgotten about that special case. I am
pretty sure I (and likely a few more) will stumble over that in the future
again. A small comment explaining that the reference can be safely
ignored would be helpful.

Anyway
Acked-by: Michal Hocko 

Thanks!

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 7/7] mm: vmscan: remove noinline_for_stack

2021-04-14 Thread Michal Hocko
On Tue 13-04-21 14:51:53, Muchun Song wrote:
> The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> set up pagevec as late as possible in shrink_inactive_list()"), its
> purpose is to delay the allocation of pagevec as late as possible to
> save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> not need noinline_for_stack, just remove it (let the compiler decide
> whether to inline).
>
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 
> Acked-by: Roman Gushchin 
> Reviewed-by: Shakeel Butt 

Acked-by: Michal Hocko 

> ---
>  mm/vmscan.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64bf07cc20f2..e40b21298d77 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2015,8 +2015,8 @@ static int too_many_isolated(struct pglist_data *pgdat, 
> int file,
>   *
>   * Returns the number of pages moved to the given lruvec.
>   */
> -static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
> -  struct list_head *list)
> +static unsigned int move_pages_to_lru(struct lruvec *lruvec,
> +   struct list_head *list)
>  {
>   int nr_pages, nr_moved = 0;
>   LIST_HEAD(pages_to_free);
> @@ -2096,7 +2096,7 @@ static int current_may_throttle(void)
>   * shrink_inactive_list() is a helper for shrink_node().  It returns the 
> number
>   * of reclaimed pages
>   */
> -static noinline_for_stack unsigned long
> +static unsigned long
>  shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>struct scan_control *sc, enum lru_list lru)
>  {
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/7] mm: memcontrol: simplify lruvec_holds_page_lru_lock

2021-04-14 Thread Michal Hocko
On Tue 13-04-21 14:51:50, Muchun Song wrote:
> We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> to simplify the code. We can have a single definition for this function
> that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> CONFIG_MEMCG.

Neat. While you are at it, wouldn't it make sense to rename the function
as well? I do not want to bikeshed but this is really a misnomer. It
doesn't check anything about locking. page_belongs_lruvec?

> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

Acked-by: Michal Hocko 

> ---
>  include/linux/memcontrol.h | 31 +++
>  1 file changed, 7 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4f49865c9958..38b8d3fb24ff 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -755,22 +755,6 @@ static inline struct lruvec 
> *mem_cgroup_page_lruvec(struct page *page)
>   return mem_cgroup_lruvec(memcg, pgdat);
>  }
>  
> -static inline bool lruvec_holds_page_lru_lock(struct page *page,
> -   struct lruvec *lruvec)
> -{
> - pg_data_t *pgdat = page_pgdat(page);
> - const struct mem_cgroup *memcg;
> - struct mem_cgroup_per_node *mz;
> -
> - if (mem_cgroup_disabled())
> - return lruvec == >__lruvec;
> -
> - mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - memcg = page_memcg(page) ? : root_mem_cgroup;
> -
> - return lruvec->pgdat == pgdat && mz->memcg == memcg;
> -}
> -
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  
>  struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
> @@ -1229,14 +1213,6 @@ static inline struct lruvec 
> *mem_cgroup_page_lruvec(struct page *page)
>   return >__lruvec;
>  }
>  
> -static inline bool lruvec_holds_page_lru_lock(struct page *page,
> -   struct lruvec *lruvec)
> -{
> - pg_data_t *pgdat = page_pgdat(page);
> -
> - return lruvec == >__lruvec;
> -}
> -
>  static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page 
> *page)
>  {
>  }
> @@ -1518,6 +1494,13 @@ static inline void 
> unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
>   spin_unlock_irqrestore(>lru_lock, flags);
>  }
>  
> +static inline bool lruvec_holds_page_lru_lock(struct page *page,
> +   struct lruvec *lruvec)
> +{
> + return lruvec_pgdat(lruvec) == page_pgdat(page) &&
> +        lruvec_memcg(lruvec) == page_memcg(page);
> +}
> +
>  /* Don't lock again iff page's lruvec locked */
>  static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
>   struct lruvec *locked_lruvec)
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 3/7] mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec

2021-04-14 Thread Michal Hocko
On Tue 13-04-21 14:51:49, Muchun Song wrote:
> All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page)
> as the 2nd parameter to it (except isolate_migratepages_block()). But
> for isolate_migratepages_block(), the page_pgdat(page) is also equal
> to the local variable of @pgdat. So mem_cgroup_page_lruvec() do not
> need the pgdat parameter. Just remove it to simplify the code.
> 
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

I like this. Two arguments where one can be directly inferred from the
other can just lead to subtle bugs. In this case it doesn't even
give any advantage for most callers.

Acked-by: Michal Hocko 

> ---
>  include/linux/memcontrol.h | 10 +-
>  mm/compaction.c|  2 +-
>  mm/memcontrol.c|  9 +++--
>  mm/swap.c  |  2 +-
>  mm/workingset.c|  2 +-
>  5 files changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c960fd49c3e8..4f49865c9958 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -743,13 +743,12 @@ static inline struct lruvec *mem_cgroup_lruvec(struct 
> mem_cgroup *memcg,
>  /**
>   * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
>   * @page: the page
> - * @pgdat: pgdat of the page
>   *
>   * This function relies on page->mem_cgroup being stable.
>   */
> -static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> - struct pglist_data *pgdat)
> +static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
>  {
> + pg_data_t *pgdat = page_pgdat(page);
>   struct mem_cgroup *memcg = page_memcg(page);
>  
>   VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);
> @@ -1223,9 +1222,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct 
> mem_cgroup *memcg,
>   return >__lruvec;
>  }
>  
> -static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> - struct pglist_data *pgdat)
> +static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
>  {
> + pg_data_t *pgdat = page_pgdat(page);
> +
>   return >__lruvec;
>  }
>  
> diff --git a/mm/compaction.c b/mm/compaction.c
> index caa4c36c1db3..e7da342003dd 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1033,7 +1033,7 @@ isolate_migratepages_block(struct compact_control *cc, 
> unsigned long low_pfn,
>   if (!TestClearPageLRU(page))
>   goto isolate_fail_put;
>  
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + lruvec = mem_cgroup_page_lruvec(page);
>  
>   /* If we already hold the lock, we can skip some rechecking */
>   if (lruvec != locked) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9cbfff59b171..1f807448233e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1177,9 +1177,8 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct 
> page *page)
>  struct lruvec *lock_page_lruvec(struct page *page)
>  {
>   struct lruvec *lruvec;
> - struct pglist_data *pgdat = page_pgdat(page);
>  
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + lruvec = mem_cgroup_page_lruvec(page);
>   spin_lock(>lru_lock);
>  
>   lruvec_memcg_debug(lruvec, page);
> @@ -1190,9 +1189,8 @@ struct lruvec *lock_page_lruvec(struct page *page)
>  struct lruvec *lock_page_lruvec_irq(struct page *page)
>  {
>   struct lruvec *lruvec;
> - struct pglist_data *pgdat = page_pgdat(page);
>  
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + lruvec = mem_cgroup_page_lruvec(page);
>   spin_lock_irq(>lru_lock);
>  
>   lruvec_memcg_debug(lruvec, page);
> @@ -1203,9 +1201,8 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
>  struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long 
> *flags)
>  {
>   struct lruvec *lruvec;
> - struct pglist_data *pgdat = page_pgdat(page);
>  
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + lruvec = mem_cgroup_page_lruvec(page);
>   spin_lock_irqsave(>lru_lock, *flags);
>  
>   lruvec_memcg_debug(lruvec, page);
> diff --git a/mm/swap.c b/mm/swap.c
> index a75a8265302b..e0d5699213cc 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -313,7 +313,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, 
> unsigned int nr_pages)
>  
>  void lru_note_cost_page(struct page *page)
>  {
> - lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),

Re: [PATCH 2/7] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm

2021-04-14 Thread Michal Hocko
On Tue 13-04-21 14:51:48, Muchun Song wrote:
> When mm is NULL, we do not need to hold rcu lock and call css_tryget for
> the root memcg. And we also do not need to check !mm in every loop of
> while. So bail out early when !mm.

mem_cgroup_charge and other callers unconditionally drop the reference
so how come this does not underflow reference count?
 
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 
> Reviewed-by: Shakeel Butt 
> ---
>  mm/memcontrol.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f229de925aa5..9cbfff59b171 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -901,20 +901,19 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct 
> mm_struct *mm)
>   if (mem_cgroup_disabled())
>   return NULL;
>  
> + /*
> +  * Page cache insertions can happen without an
> +  * actual mm context, e.g. during disk probing
> +  * on boot, loopback IO, acct() writes etc.
> +  */
> + if (unlikely(!mm))
> + return root_mem_cgroup;
> +
>   rcu_read_lock();
>   do {
> - /*
> -  * Page cache insertions can happen without an
> -  * actual mm context, e.g. during disk probing
> -  * on boot, loopback IO, acct() writes etc.
> -  */
> - if (unlikely(!mm))
> + memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + if (unlikely(!memcg))
>   memcg = root_mem_cgroup;
> - else {
> - memcg = 
> mem_cgroup_from_task(rcu_dereference(mm->owner));
> - if (unlikely(!memcg))
> - memcg = root_mem_cgroup;
> - }
>   } while (!css_tryget(>css));
>   rcu_read_unlock();
>   return memcg;
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/7] mm: memcontrol: fix page charging in page replacement

2021-04-14 Thread Michal Hocko
On Tue 13-04-21 14:51:47, Muchun Song wrote:
> The pages aren't accounted at the root level, so do not charge the page
> to the root memcg in page replacement. Although we do not display the
> value (mem_cgroup_usage) so there shouldn't be any actual problem, but
> there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it
> will trigger? So it is better to fix it.
> 
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 
> Reviewed-by: Shakeel Butt 

Acked-by: Michal Hocko 

> ---
>  mm/memcontrol.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 64ada9e650a5..f229de925aa5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6806,9 +6806,11 @@ void mem_cgroup_migrate(struct page *oldpage, struct 
> page *newpage)
>   /* Force-charge the new page. The old one will be freed soon */
>   nr_pages = thp_nr_pages(newpage);
>  
> - page_counter_charge(>memory, nr_pages);
> - if (do_memsw_account())
> - page_counter_charge(>memsw, nr_pages);
> + if (!mem_cgroup_is_root(memcg)) {
> + page_counter_charge(>memory, nr_pages);
> + if (do_memsw_account())
> + page_counter_charge(>memsw, nr_pages);
> +     }
>  
>   css_get(>css);
>   commit_charge(newpage, memcg);
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-14 Thread Michal Hocko
On Wed 14-04-21 09:41:32, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 08:04:21AM +0200, Michal Hocko wrote:
> > On Tue 13-04-21 14:19:03, Mike Kravetz wrote:
> > > On 4/13/21 6:23 AM, Michal Hocko wrote:
> > > The only place where page->private may not be initialized is when we do
> > > allocations at boot time from memblock.  In this case, we will add the
> > > pages to the free list via put_page/free_huge_page so the appropriate
> > > flags will be cleared before anyone notices.
> > 
> > Pages allocated by the bootmem should be pre initialized from the boot,
> > no?
> 
> I guess Mike means:
> 
> hugetlb_hstate_alloc_pages
>  alloc_bootmem_huge_page
>   __alloc_bootmem_huge_page
>memblock_alloc_try_nid_raw
> 
> and AFAICS, memblock_alloc_try_nid_raw() does not zero the memory.

You are right, it doesn't do it there. But all memory, even memory that
is allocated by the bootmem allocator, should have its struct pages
initialized. They would be poisoned otherwise, right? I would have to look
at the exact code path but IIRC this should happen around the time the
bootmem allocator state transitions to the page allocator.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-14 Thread Michal Hocko
On Tue 13-04-21 14:19:03, Mike Kravetz wrote:
> On 4/13/21 6:23 AM, Michal Hocko wrote:
> > On Tue 13-04-21 12:47:43, Oscar Salvador wrote:
[...]
> > Or do we need it for giga pages which are not allocated by the page
> > allocator? If yes then moving it to prep_compound_gigantic_page would be
> > better.
> 
> I am pretty sure dynamically allocated giga pages have page->Private
> cleared as well.  It is not obvious, but the alloc_contig_range code
> used to put together giga pages will end up calling isolate_freepages_range
> that will call split_map_pages and then post_alloc_hook for each MAX_ORDER
> block.

Thanks for saving me from crawling that code.

> As mentioned, this is not obvious and I would hate to rely on this
> behavior not changing.

Thinking about it some more, having some (page granularity) allocator
not clearing page private would be a serious problem for anybody relying
on its state. So I am not sure this can change.
 
> > So should we just drop it here?
> 
> The only place where page->private may not be initialized is when we do
> allocations at boot time from memblock.  In this case, we will add the
> pages to the free list via put_page/free_huge_page so the appropriate
> flags will be cleared before anyone notices.

Pages allocated by the bootmem allocator should be pre-initialized during
boot, no?

> I'm wondering if we should just do a set_page_private(page, 0) here in
> prep_new_huge_page since we now use that field for flags.  Or, is that
> overkill?

I would rather not duplicate the work done by underlying allocators. I
do not think other users of the allocator want to do the same so why
should hugetlb be any different.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-13 Thread Michal Hocko
On Tue 13-04-21 12:47:45, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contic_range,
> e.g: CMA and virtio-mem, where those would fail the call even when those
> pages lay in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such page.
> 
> Free hugepages are tricky to handle so as to no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> The free page replacement is done under hugetlb_lock, so no external
> users of hugetlb will notice the change.
> To allocate the new huge page, we use alloc_buddy_huge_page(), so we
> do not have to deal with any counters, and prep_new_huge_page() is not
> called. This is valulable because in case we need to free the new page,
> we only need to call __free_pages().
> 
> Once we know that the page to be replaced is a genuine 0-refcounted
> huge page, we remove the old page from the freelist by remove_hugetlb_page().
> Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page()
> for the new huge page to properly initialize it and increment the
> hstate->nr_huge_pages counter (previously decremented by
> remove_hugetlb_page()).
> Once done, the page is enqueued by enqueue_huge_page() and it is ready
> to be used.
> 
> There is one tricky case when
> page's refcount is 0 because it is in the process of being released.
> A missing PageHugeFreed bit will tell us that freeing is in flight so
> we retry after dropping the hugetlb_lock. The race window should be
> small and the next retry should make a forward progress.
> 
> E.g:
> 
> CPU0  CPU1
> free_huge_page()  isolate_or_dissolve_huge_page
> PageHuge() == T
> alloc_and_dissolve_huge_page
>   alloc_buddy_huge_page()
>   spin_lock_irq(hugetlb_lock)
>   // PageHuge() && !PageHugeFreed &&
>   // !PageCount()
>   spin_unlock_irq(hugetlb_lock)
>   spin_lock_irq(hugetlb_lock)
>   1) update_and_free_page
>PageHuge() == F
>__free_pages()
>   2) enqueue_huge_page
>SetPageHugeFreed()
>   spin_unlock(_lock)
> spin_lock_irq(hugetlb_lock)
>1) PageHuge() == F (freed by case#1 from 
> CPU0)
>  2) PageHuge() == T
>PageHugeFreed() == T
>- proceed with replacing the page
> 
> In the case above we retry as the window race is quite small and we have high
> chances to succeed next time.
> 
> With regard to the allocation, we restrict it to the node the page belongs
> to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador 

Acked-by: Michal Hocko 

One minor nit below
[...]

> + /*
> +  * Ok, old_page is still a genuine free hugepage. Remove it from
> +  * the freelist and decrease the counters. These will be
> +  * incremented again when calling __prep_account_new_huge_page()
> +  * and enqueue_huge_page() for new_page. The counters will 
> remain
> +  * stable since this happens under the lock.
> +  */
> + remove_hugetlb_page(h, old_page, false);
> +
> + /*
> +  * Call __prep_new_huge_page() to construct the hugetlb page, 
> and
> +  * enqueue it then to place it in the freelists. After this,
> +  * counters are back on track. Free hugepages have a refcount 
> of 0,
> +  * so we need to decrease new_page's count as well.
> +  */
> + __prep_new_huge_page(new_page);
> + __prep_account_new_huge_page(h, nid);

I think it would help to put something like the following into the
comment above this really strange construct.

/*
 * new_page needs to be initialized with the 

Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality

2021-04-13 Thread Michal Hocko
On Tue 13-04-21 12:47:44, Oscar Salvador wrote:
[...]
> +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> +{
> + __prep_new_huge_page(page);
>   spin_lock_irq(_lock);
> - h->nr_huge_pages++;
> - h->nr_huge_pages_node[nid]++;
> + __prep_account_new_huge_page(h, nid);
>   spin_unlock_irq(&hugetlb_lock);
>  }

Any reason to decouple the locking from the accounting?
>  
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality

2021-04-13 Thread Michal Hocko
On Tue 13-04-21 15:24:32, Michal Hocko wrote:
> On Tue 13-04-21 12:47:44, Oscar Salvador wrote:
> [...]
> > +static void prep_new_huge_page(struct hstate *h, struct page *page, int 
> > nid)
> > +{
> > +   __prep_new_huge_page(page);
> > spin_lock_irq(&hugetlb_lock);
> > -   h->nr_huge_pages++;
> > -   h->nr_huge_pages_node[nid]++;
> > +   __prep_account_new_huge_page(h, nid);
> > spin_unlock_irq(&hugetlb_lock);
> >  }
> 
> Any reason to decouple the locking from the accounting?

OK, I spoke too soon. The next patch already has the locking around when
calling this. Please add a lockdep assert into the helper to make the
locking expectations more obvious.
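
For illustration, a minimal sketch of what such an assert could look like
(based on the helper from the quoted patch; a sketch only, not the final hunk):

static void __prep_account_new_huge_page(struct hstate *h, int nid)
{
	/* callers are expected to hold hugetlb_lock */
	lockdep_assert_held(&hugetlb_lock);
	h->nr_huge_pages++;
	h->nr_huge_pages_node[nid]++;
}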

With that
Acked-by: Michal Hocko 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-13 Thread Michal Hocko
On Tue 13-04-21 12:47:43, Oscar Salvador wrote:
> Currently, the clearing of the flag is done under the lock, but this
> is unnecessary as we just allocated the page and we did not give it
> away yet, so no one should be messing with it.
> 
> Also, this helps making clear that here the lock is only protecting the
> counter.

While moving the flag clearing is ok I am wondering why we need that
in the first place. I think it is just a leftover from 6c0371490140
("hugetlb: convert PageHugeFreed to HPageFreed flag"). Prior to that a tail
page had been used to keep track of the state but now it all happens in the
head page and the flag uses page->private which is always initialized
when allocated by the allocator (post_alloc_hook).
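
For reference, a minimal sketch of how the flag ends up being backed by
page->private after that conversion (illustrative only, following the pattern
of the hugetlb page flag helpers):

static inline int HPageFreed(struct page *page)
{
	/* hugetlb specific flags live in the head page's page->private */
	return test_bit(HPG_freed, &page->private);
}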

Or do we need it for giga pages which are not allocated by the page
allocator? If yes then moving it to prep_compound_gigantic_page would be
better.

So should we just drop it here?

> Signed-off-by: Oscar Salvador 
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 54d81d5947ed..e40d5fe5c63c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1490,10 +1490,10 @@ static void prep_new_huge_page(struct hstate *h, 
> struct page *page, int nid)
>   hugetlb_set_page_subpool(page, NULL);
>   set_hugetlb_cgroup(page, NULL);
>   set_hugetlb_cgroup_rsvd(page, NULL);
> + ClearHPageFreed(page);
>   spin_lock_irq(&hugetlb_lock);
>   h->nr_huge_pages++;
>   h->nr_huge_pages_node[nid]++;
> -     ClearHPageFreed(page);
>   spin_unlock_irq(&hugetlb_lock);
>  }
>  
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 resend] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

2021-04-13 Thread Michal Hocko
On Tue 13-04-21 11:36:08, Vlastimil Babka wrote:
[...]
> AFAIK even Oscar's work on using the node to self-contain its own structures 
> is
> only applicable to struct pages, not percpu allocations?

Correct. Teaching the pcp storage to live on the movable zone sounds like a
large undertaking to me. Not sure it is worth it TBH. Even the idea of
synchronizing pcp access with memory hotplug makes for a decent headache.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-13 Thread Michal Hocko
On Fri 09-04-21 16:26:53, Tim Chen wrote:
> 
> On 4/8/21 4:52 AM, Michal Hocko wrote:
> 
> >> The top tier memory used is reported in
> >>
> >> memory.toptier_usage_in_bytes
> >>
> >> The amount of top tier memory usable by each cgroup without
> >> triggering page reclaim is controlled by the
> >>
> >> memory.toptier_soft_limit_in_bytes 
> > 
> 
> Michal,
> 
> Thanks for your comments.  I will like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the 
> tiered memory between the cgroups. 

OK, this is a good mission statement to start with. I would expect a follow
up to say what kind of granularity of control you want to achieve here.
Presumably you want more than all or nothing, because that is what
cpusets can be used for.

> A typical use case may be a system with two set of tasks.
> One set of task is very latency sensitive and we desire instantaneous
> response from them. Another set of tasks will be running batch jobs
> were latency and performance is not critical.   In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.  

While from a very high level this makes sense I would be interested in
more details though. Your highly latency sensitive applications very likely
want to be bound to the high performance node, right? Can they tolerate
memory reclaim? Can they consume more memory than the node size? What do
you expect to happen then?
 
> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
> 
> For generalization let's say that there are N tiers of memory t_0, t_1 ... 
> t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier. 

How is each tier defined? Is this an admin-defined set of NUMA nodes or
is it platform specific?

[...]

> Will such an interface looks sane and acceptable with everyone?

Let's talk more about usecases first before we even start talking about
the interface or which controller is the best fit for implementing it.
 
> The patch set I posted is meant to be a straw man cgroup v1 implementation
> and I readily admit that it falls short of the eventual functionality 
> we want to achieve.  It is meant to solicit feedback from everyone on how the 
> tiered
> memory management should work.

OK, fair enough. Let me then just state that I strongly believe that
the soft limit based approach is a dead end and it would be better to focus
on the actual usecases and try to understand what you want to achieve
first.

[...]

> > What I am trying to say (and I have brought that up when demotion has been
> > discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> > The specific technology shouldn't be imprinted into the interface.
> > Fundamentally you are trying to balance memory among NUMA nodes as we do
> > not have other abstraction to use. So rather than talking about top,
> > secondary, nth tier we have different NUMA nodes with different
> > characteristics and you want to express your "priorities" for them.
> 
> With node priorities, how would the system reserve enough
> high performance memory for those performance critical task cgroup? 
> 
> By priority, do you mean the order of allocation of nodes for a cgroup?
> Or you mean that all the similar performing memory node will be grouped in
> the same priority?

I have to say I do not yet have a clear idea on what those priorities
would look like. I just wanted to outline that the usecases you are
interested in likely want to implement some form of (application
transparent) control for memory distribution over several nodes. There
is a long way to land on something more specific I am afraid.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: optimize memory allocation

2021-04-13 Thread Michal Hocko
On Mon 12-04-21 15:49:53, ultrac...@163.com wrote:
> From: Chen Xiaoguang 
> 
> Check memory cgroup limit before allocating real memory. This may
> improve performance, especially in the slow path when the memory allocation
> exceeds the cgroup limit.

I would be really curious about any actual numbers because I have a really
hard time seeing scenarios where this would lead to an improvement.
Effectively only non-OOM allocations would benefit theoretically (e.g.
atomic or GFP_NORETRY etc). All others will trigger the memcg oom killer
to help forward progress.

Besides that I really dislike kmem and LRU pages to be handled
differently so for that reason
Nacked-by: Michal Hocko 

If the optimization can really be proven then the patch would need
to be much more invasive.

> Signed-off-by: Chen Xiaoguang 
> Signed-off-by: Chen He 
> ---
>  include/linux/memcontrol.h | 30 ++
>  mm/memcontrol.c| 34 --
>  mm/page_alloc.c| 24 +---
>  3 files changed, 55 insertions(+), 33 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0c04d39..59bb3ba 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1583,8 +1583,9 @@ static inline void memcg_set_shrinker_bit(struct 
> mem_cgroup *memcg,
>  #endif
>  
>  #ifdef CONFIG_MEMCG_KMEM
> -int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
> -void __memcg_kmem_uncharge_page(struct page *page, int order);
> +int __memcg_kmem_charge_page(struct mem_cgroup *memcg, gfp_t gfp, int order);
> +void __memcg_kmem_uncharge_page(struct page *page, int order,
> + struct mem_cgroup *memcg);
>  
>  struct obj_cgroup *get_obj_cgroup_from_current(void);
>  
> @@ -1610,18 +1611,30 @@ static inline bool memcg_kmem_enabled(void)
>   return static_branch_likely(_kmem_enabled_key);
>  }
>  
> +extern struct mem_cgroup *get_mem_cgroup_from_current(void);
> +
>  static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
>int order)
>  {
> - if (memcg_kmem_enabled())
> - return __memcg_kmem_charge_page(page, gfp, order);
> - return 0;
> + struct mem_cgroup *memcg;
> + int ret = 0;
> +
> + memcg = get_mem_cgroup_from_current();
> + if (memcg && memcg_kmem_enabled() && !mem_cgroup_is_root(memcg)) {
> + ret = __memcg_kmem_charge_page(memcg, gfp, order);
> + if (!ret) {
> + page->memcg_data = (unsigned long)memcg | 
> MEMCG_DATA_KMEM;
> + return 0;
> + }
> + css_put(&memcg->css);
> + }
> + return ret;
>  }
>  
>  static inline void memcg_kmem_uncharge_page(struct page *page, int order)
>  {
>   if (memcg_kmem_enabled())
> - __memcg_kmem_uncharge_page(page, order);
> + __memcg_kmem_uncharge_page(page, order, NULL);
>  }
>  
>  /*
> @@ -1647,13 +1660,14 @@ static inline void memcg_kmem_uncharge_page(struct 
> page *page, int order)
>  {
>  }
>  
> -static inline int __memcg_kmem_charge_page(struct page *page, gfp_t gfp,
> +static inline int __memcg_kmem_charge_page(struct mem_cgroup *memcg, gfp_t 
> gfp,
>  int order)
>  {
>   return 0;
>  }
>  
> -static inline void __memcg_kmem_uncharge_page(struct page *page, int order)
> +static inline void __memcg_kmem_uncharge_page(struct page *page, int order,
> + struct mem_cgroup *memcg)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e064ac0d..8df57b7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1085,7 +1085,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>  /**
>   * If active memcg is set, do not fallback to current->mm->memcg.
>   */
> -static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void)
> +struct mem_cgroup *get_mem_cgroup_from_current(void)
>  {
>   if (memcg_kmem_bypass())
>   return NULL;
> @@ -3113,21 +3113,11 @@ static void __memcg_kmem_uncharge(struct mem_cgroup 
> *memcg, unsigned int nr_page
>   *
>   * Returns 0 on success, an error code on failure.
>   */
> -int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> +int __memcg_kmem_charge_page(struct mem_cgroup *memcg, gfp_t gfp, int order)
>  {
> - struct mem_cgroup *memcg;
> - int ret = 0;
> + int ret;
>  
> - memcg = get_mem_cgroup_from_current();
> - if (memcg && !mem_cgroup_is_root(memcg)) {
> -  

Re: [PATCH v2 resend] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

2021-04-13 Thread Michal Hocko
On Mon 12-04-21 14:40:18, Vlastimil Babka wrote:
> On 4/12/21 2:08 PM, Mel Gorman wrote:
> > zone_pcp_reset allegedly protects against a race with drain_pages
> > using local_irq_save but this is bogus. local_irq_save only operates
> > on the local CPU. If memory hotplug is running on CPU A and drain_pages
> > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> > offers no protection.
> > 
> > This patch deletes IRQ disable/enable on the grounds that IRQs protect
> > nothing and assumes the existing hotplug paths guarantees the PCP cannot be
> > used after zone_pcp_enable(). That should be the case already because all
> > the pages have been freed and there is no page to put on the PCP lists.
> > 
> > Signed-off-by: Mel Gorman 
> 
> Yeah the irq disabling here is clearly bogus, so:
> 
> Acked-by: Vlastimil Babka 
> 
> But I think Michal has a point that we might best leave the pagesets around, 
> by
> a future change. I have some doubts that even with your reordering of the
> reset/destroy after zonelist rebuild in v1 they can't be reachable. We have 
> no
> protection between zonelist rebuild and zonelist traversal, and that's why we
> just leave pgdats around.
> 
> So I can imagine a task racing with memory hotremove might see watermarks as 
> ok
> in get_page_from_freelist() for the zone and proceeds to try_this_zone:, then
> gets stalled/scheduled out while hotremove rebuilds the zonelist and destroys
> the pcplists, then the first task is resumed and proceeds with 
> rmqueue_pcplist().
> 
> So that's very rare thus not urgent, and this patch doesn't make it less rare 
> so
> not a reason to block it.

Completely agreed here. Not an urgent thing to work on but something to
look into long term.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 resend] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

2021-04-13 Thread Michal Hocko
On Mon 12-04-21 13:08:42, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.
> 
> This patch deletes IRQ disable/enable on the grounds that IRQs protect
> nothing and assumes the existing hotplug paths guarantees the PCP cannot be
> used after zone_pcp_enable(). That should be the case already because all
> the pages have been freed and there is no page to put on the PCP lists.

Yes, that is the case since ec6e8c7e0314 ("mm, page_alloc: disable
pcplists during memory offline"). Prior to this commit the behavior was
undefined but full zone/node hotremove is rare enough that an existing
race was likely never observed.

Acked-by: Michal Hocko 

Thanks!
 
> Signed-off-by: Mel Gorman 
> ---
> Resending for email address correction and adding lists
> 
> Changelog since v1
> o Minimal fix
> 
>  mm/page_alloc.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5e8aedb64b57..9bf0db982f14 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8952,12 +8952,9 @@ void zone_pcp_enable(struct zone *zone)
>  
>  void zone_pcp_reset(struct zone *zone)
>  {
> - unsigned long flags;
>   int cpu;
>   struct per_cpu_pageset *pset;
>  
> - /* avoid races with drain_pages()  */
> - local_irq_save(flags);
>   if (zone->pageset != &boot_pageset) {
>   for_each_online_cpu(cpu) {
>   pset = per_cpu_ptr(zone->pageset, cpu);
> @@ -8966,7 +8963,6 @@ void zone_pcp_reset(struct zone *zone)
>   free_percpu(zone->pageset);
>   zone->pageset = &boot_pageset;
>   }
> - local_irq_restore(flags);
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

2021-04-10 Thread Michal Hocko
On Fri 09-04-21 16:12:59, Mel Gorman wrote:
[...]
> If anything, the minimal "fix" is to simply delete IRQ disable/enable on
> the grounds that IRQs protect nothing and assume the existing hotplug
> paths guarantees the PCP cannot be used after zone_pcp_enable(). That
> should be the case already because all the pages have been freed and
> there is nothing to even put into the PCPs but I worried that the PCP
> structure itself might still be reachable even if it's useless which is
> why I freed the structure once they could not be reached via zonelists.

OK. Let's do that for now and I will put a follow up on my todo list.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

2021-04-09 Thread Michal Hocko
On Fri 09-04-21 14:42:21, Mel Gorman wrote:
> On Fri, Apr 09, 2021 at 02:48:12PM +0200, Michal Hocko wrote:
> > On Fri 09-04-21 14:42:58, Michal Hocko wrote:
> > > On Fri 09-04-21 13:09:57, Mel Gorman wrote:
> > > > zone_pcp_reset allegedly protects against a race with drain_pages
> > > > using local_irq_save but this is bogus. local_irq_save only operates
> > > > on the local CPU. If memory hotplug is running on CPU A and drain_pages
> > > > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> > > > offers no protection.
> > > 
> > > Yes, the synchronization aspect is bogus indeed.
> > > 
> > > > This patch reorders memory hotremove such that the PCP structures
> > > > relevant to the zone are no longer reachable by the time the structures
> > > > are freed.  With this reordering, no protection is required to prevent
> > > > a use-after-free and the IRQs can be left enabled. zone_pcp_reset is
> > > > renamed to zone_pcp_destroy to make it clear that the per-cpu structures
> > > > are deleted when the function returns.
> > > 
> > > Wouldn't it be much easier to simply not destroy/reset pcp of an empty
> > > zone at all? The whole point of this exercise seems to be described in
> > > 340175b7d14d5. setup_zone_pageset can check for an already allocated pcp
> > > and simply reinitialize it. 
> > 
> > I meant this
> > 
> 
> It might be simpler but if the intention is to free as much memory
> as possible during hot-remove, it seems wasteful to leave the per-cpu
> structures behind if we do not have to.

We do leave the whole pgdat behind. I do not think pagesets really do
matter.

> If a problem with my patch can
> be spotted then I'm happy to go with an alternative fix but there are
> two minor issues with your proposed fix.

I will not insist but this code has proven to bitrot and I just find it
much simpler to drop it altogether rather than conserve it in some form.
Not something I would insist though.

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index e6a602e82860..b0fdda77e570 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6496,7 +6496,13 @@ void __meminit setup_zone_pageset(struct zone *zone)
> > struct per_cpu_pageset *p;
> > int cpu;
> >  
> > -   zone->pageset = alloc_percpu(struct per_cpu_pageset);
> > +   /*
> > +* zone could have gone completely offline during memory hotplug
> > +* when the pgdat is left behind for simplicity. On a next onlining
> > +* we do not need to reallocate pcp state.
> > +*/
> > +   if (!zone->pageset)
> > +   zone->pageset = alloc_percpu(struct per_cpu_pageset);
> 
> Should be "if (zone->pageset != &boot_pageset)" ?

Memory hotplug really wants the NULL
check. It doesn't use boot_pageset (if we drop the reset to boot_pageset).
But you are right that the boot time initialization first sets
boot_pageset (zone_pcp_init) and initializes the real pagesets later
(setup_per_cpu_pageset). This can be handled at the memory hotplug
layer I believe:

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 754026a9164d..1cadfec323fc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -883,7 +883,8 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
 */
if (!populated_zone(zone)) {
need_zonelists_rebuild = 1;
-   setup_zone_pageset(zone);
+   if (!zone->pageset)
+   setup_zone_pageset(zone);
}

> > for_each_possible_cpu(cpu) {
> > p = per_cpu_ptr(zone->pageset, cpu);
> > pageset_init(p);
> > @@ -8803,25 +8809,6 @@ void zone_pcp_enable(struct zone *zone)
> > mutex_unlock(&pcp_batch_high_lock);
> >  }
> >  
> > -void zone_pcp_reset(struct zone *zone)
> > -{
> > -   unsigned long flags;
> > -   int cpu;
> > -   struct per_cpu_pageset *pset;
> > -
> > -   /* avoid races with drain_pages()  */
> > -   local_irq_save(flags);
> > -   if (zone->pageset != &boot_pageset) {
> > -   for_each_online_cpu(cpu) {
> > -   pset = per_cpu_ptr(zone->pageset, cpu);
> > -   drain_zonestat(zone, pset);
> > -   }
> > -   free_percpu(zone->pageset);
> > -   zone->pageset = &boot_pageset;
> > -   }
> > -   local_irq_restore(flags);
> > -}
> > -
> 
> zone_pcp_reset still needs to exist to drain the remaining vmstats or
> it'll break 5a883813845a ("memory-hotplug: fix zone stat
> mismatch").

Are you sure we are resetting vmstats in the hotremove path? I do not see
anything like that. Maybe this was needed at the time. I will double
check.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

2021-04-09 Thread Michal Hocko
On Fri 09-04-21 14:42:58, Michal Hocko wrote:
> On Fri 09-04-21 13:09:57, Mel Gorman wrote:
> > zone_pcp_reset allegedly protects against a race with drain_pages
> > using local_irq_save but this is bogus. local_irq_save only operates
> > on the local CPU. If memory hotplug is running on CPU A and drain_pages
> > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> > offers no protection.
> 
> Yes, the synchronization aspect is bogus indeed.
> 
> > This patch reorders memory hotremove such that the PCP structures
> > relevant to the zone are no longer reachable by the time the structures
> > are freed.  With this reordering, no protection is required to prevent
> > a use-after-free and the IRQs can be left enabled. zone_pcp_reset is
> > renamed to zone_pcp_destroy to make it clear that the per-cpu structures
> > are deleted when the function returns.
> 
> Wouldn't it be much easier to simply not destroy/reset pcp of an empty
> zone at all? The whole point of this exercise seems to be described in
> 340175b7d14d5. setup_zone_pageset can check for an already allocated pcp
> and simply reinitialize it. 

I meant this

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 754026a9164d..7169342f5474 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1810,7 +1818,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned 
long nr_pages)
init_per_zone_wmark_min();
 
if (!populated_zone(zone)) {
-   zone_pcp_reset(zone);
build_all_zonelists(NULL);
} else
zone_pcp_update(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e6a602e82860..b0fdda77e570 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6496,7 +6496,13 @@ void __meminit setup_zone_pageset(struct zone *zone)
struct per_cpu_pageset *p;
int cpu;
 
-   zone->pageset = alloc_percpu(struct per_cpu_pageset);
+   /*
+* zone could have gone completely offline during memory hotplug
+* when the pgdat is left behind for simplicity. On a next onlining
+* we do not need to reallocate pcp state.
+*/
+   if (!zone->pageset)
+   zone->pageset = alloc_percpu(struct per_cpu_pageset);
for_each_possible_cpu(cpu) {
p = per_cpu_ptr(zone->pageset, cpu);
pageset_init(p);
@@ -8803,25 +8809,6 @@ void zone_pcp_enable(struct zone *zone)
	mutex_unlock(&pcp_batch_high_lock);
 }
 
-void zone_pcp_reset(struct zone *zone)
-{
-   unsigned long flags;
-   int cpu;
-   struct per_cpu_pageset *pset;
-
-   /* avoid races with drain_pages()  */
-   local_irq_save(flags);
-   if (zone->pageset != &boot_pageset) {
-   for_each_online_cpu(cpu) {
-   pset = per_cpu_ptr(zone->pageset, cpu);
-   drain_zonestat(zone, pset);
-   }
-   free_percpu(zone->pageset);
-   zone->pageset = &boot_pageset;
-   }
-   local_irq_restore(flags);
-}
-
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
  * All pages in the range must be in a single zone, must not contain holes,
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

2021-04-09 Thread Michal Hocko
On Fri 09-04-21 13:09:57, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.

Yes, the synchronization aspect is bogus indeed.

> This patch reorders memory hotremove such that the PCP structures
> relevant to the zone are no longer reachable by the time the structures
> are freed.  With this reordering, no protection is required to prevent
> a use-after-free and the IRQs can be left enabled. zone_pcp_reset is
> renamed to zone_pcp_destroy to make it clear that the per-cpu structures
> are deleted when the function returns.

Wouldn't it be much easier to simply not destroy/reset pcp of an empty
zone at all? The whole point of this exercise seems to be described in
340175b7d14d5. setup_zone_pageset can check for an already allocated pcp
and simply reinitialize it. 
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-09 Thread Michal Hocko
On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi  wrote:
[...]
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.

Yes, if the aim is to isolate some users from certain NUMA nodes then
cpuset is a good fit but as Shakeel says this is very likely not what
this work is aiming for.

> Yes that's an extreme way to overcome the issue but we can do less
> extreme by just (hard) limiting the top tier usage of low priority
> jobs.

A per NUMA node high/hard limit would help with more fine grained control.
The configuration would be tricky though. All low priority memcgs would
have to be carefully configured to leave enough for your important
processes. That also includes memory which is not accounted to any
memcg.
The behavior of those limits would be quite tricky for OOM situations
as well due to a lack of NUMA aware oom killer.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: page_counter: mitigate consequences of a page_counter underflow

2021-04-08 Thread Michal Hocko
On Thu 08-04-21 10:31:55, Johannes Weiner wrote:
> When the unsigned page_counter underflows, even just by a few pages, a
> cgroup will not be able to run anything afterwards and trigger the OOM
> killer in a loop.
> 
> Underflows shouldn't happen, but when they do in practice, we may just
> be off by a small amount that doesn't interfere with the normal
> operation - consequences don't need to be that dire.

Yes, I do agree.

> Reset the page_counter to 0 upon underflow. We'll issue a warning that
> the accounting will be off and then try to keep limping along.

I do not remember any reports about the existing WARN_ON, but it is not
hard to imagine a charging imbalance being introduced.

> [ We used to do this with the original res_counter, where it was a
>   more straight-forward correction inside the spinlock section. I
>   didn't carry it forward into the lockless page counters for
>   simplicity, but it turns out this is quite useful in practice. ]

The lack of external synchronization makes it more tricky because
certain charges might just get lost depending on the ordering. This
sucks but considering that the system is already botched and counters
cannot be trusted this is definitely better than a potentially
completely unusable memcg. It would be nice to mention that in the above
paragraph as a caveat.
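
For illustration, the caveat boils down to an interleaving like this (a
sketch only, not code from the patch):

	/*
	 * CPU0 (uncharge)                       CPU1 (charge)
	 * new = atomic_long_sub_return(...)     atomic_long_add_return(...)
	 * new < 0 -> WARN
	 * atomic_long_set(&counter->usage, 0)   <- CPU1's charge is overwritten
	 */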

> Signed-off-by: Johannes Weiner 

Acked-by: Michal Hocko 

> ---
>  mm/page_counter.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index c6860f51b6c6..7d83641eb86b 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -52,9 +52,13 @@ void page_counter_cancel(struct page_counter *counter, 
> unsigned long nr_pages)
>   long new;
>  
>   new = atomic_long_sub_return(nr_pages, &counter->usage);
> - propagate_protected_usage(counter, new);
>   /* More uncharges than charges? */
> - WARN_ON_ONCE(new < 0);
> + if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
> +   new, nr_pages)) {
> + new = 0;
> + atomic_long_set(&counter->usage, new);
> + }
> + propagate_protected_usage(counter, new);
>  }
>  
>  /**
> -- 
> 2.31.1

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-08 Thread Michal Hocko
On Wed 07-04-21 15:33:26, Tim Chen wrote:
> 
> 
> On 4/6/21 2:08 AM, Michal Hocko wrote:
> > On Mon 05-04-21 10:08:24, Tim Chen wrote:
> > [...]
> >> To make fine grain cgroup based management of the precious top tier
> >> DRAM memory possible, this patchset adds a few new features:
> >> 1. Provides memory monitors on the amount of top tier memory used per 
> >> cgroup 
> >>and by the system as a whole.
> >> 2. Applies soft limits on the top tier memory each cgroup uses 
> >> 3. Enables kswapd to demote top tier pages from cgroup with excess top
> >>tier memory usages.
> > 
> 
> Michal,
> 
> Thanks for giving your feedback.  Much appreciated.
> 
> > Could you be more specific on how this interface is supposed to be used?
> 
> We created a README section on the cgroup control part of this patchset:
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.71=20f20be02671384470c7cd8f66b56a9061a4071f
> to illustrate how this interface should be used.

I have to confess I didn't get to look at demotion patches yet.

> The top tier memory used is reported in
> 
> memory.toptier_usage_in_bytes
> 
> The amount of top tier memory usable by each cgroup without
> triggering page reclaim is controlled by the
> 
> memory.toptier_soft_limit_in_bytes 

Are you trying to say that the soft limit acts as some sort of guarantee?
Does that mean that if the memcg is under memory pressure, top tier
memory is opted out from any reclaim if the usage is not in excess?

From your previous email it sounds more like the limit is evaluated on
the global memory pressure to balance specific memcgs which are in
excess when trying to reclaim/demote a top tier NUMA node.

Soft limit reclaim has several problems. Those are historical and
therefore the behavior cannot be changed. E.g. go after the biggest
excessed memcg (with priority 0 - aka potential full LRU scan) and then
continue with a normal reclaim. This can be really disruptive to the top
user.

So you can likely define a more sane semantic. E.g. push back memcgs
proportionally to their excess, but then we have two different soft limit
behaviors, which is bad as well. I am not really sure there is a sensible
way out by (ab)using the soft limit here.

Also I am not really sure how this is going to be used in practice.
There is no soft limit by default. So opting in would effectively
discriminate against those memcgs. There has been a similar problem with the
soft limit we have in general. Is this really what you are looking for?
What would be a typical usecase?

[...]
> >> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> >> interface is the limit on the cgroup is a soft limit, so a cgroup can
> >> exceed the limit quite a bit before reclaim before page demotion reins
> >> it in. 
> > 
> > I have to say that I dislike abusing soft limit reclaim for this. In the
> > past we have learned that the existing implementation is unfixable and
> > changing the existing semantics is impossible due to backward compatibility.
> > So I would really prefer the soft limit just find its rest rather than
> > see new potential usecases.
> 
> Do you think we can reuse some of the existing soft reclaim machinery
> for the v2 interface?
> 
> More particularly, can we treat memory_toptier.high in cgroup v2 as a soft 
> limit?

No, you should follow the existing limit semantics. The high limit acts as
an allocation throttling interface.

> We sort how much each mem cgroup exceeds memory_toptier.high and
> go after the cgroup that have largest excess first for page demotion.
> Will appreciate if you can shed some insights on what could go wrong
> with such an approach. 

This cannot work as a throttling interface.
 
> > I haven't really looked into the details of this patchset but from a cursory
> > look it seems like you are actually introducing NUMA aware limits into
> > memcg that would control consumption from some nodes differently than
> > other nodes. This would be a rather alien concept to the existing memcg
> > infrastructure IMO. It looks like it is fusing the borders between memcg and
> > cpuset controllers.
> 
> Want to make sure I understand what you mean by NUMA aware limits.
> Yes, in the patch set, it does treat the NUMA nodes differently.
> We are putting constraint on the "top tier" RAM nodes vs the lower
> tier PMEM nodes.  Is this what you mean?

What I am trying to say (and I have brought that up when demotion has been
discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
The specific technology shouldn't be imprinted into the interface.
Fundamentally you are trying to balance memory among NUMA nodes as we do
not have othe

Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-07 Thread Michal Hocko
On Wed 07-04-21 19:13:42, Bharata B Rao wrote:
> On Wed, Apr 07, 2021 at 01:54:48PM +0200, Michal Hocko wrote:
> > On Mon 05-04-21 11:18:48, Bharata B Rao wrote:
> > > Hi,
> > > 
> > > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > > consumption increases quite a lot (around 172G) when the containers are
> > > running. Most of it comes from slab (149G) and within slab, the majority 
> > > of
> > > it comes from kmalloc-32 cache (102G)
> > 
> > Is this 10k cgroups a testing environment or does anybody really use that
> > in production? I would be really curious to hear how that behaves when
> > those containers are not idle. E.g. global memory reclaim iterating over
> > 10k memcgs will likely be very visible. I do remember playing with
> > similar setups few years back and the overhead was very high.
> 
> This 10k containers is only a test scenario that we are looking at.

OK, this is good to know. I would definitely recommend looking at the
runtime aspect of such a large scale deployment before optimizing for
memory footprint. I do agree that the latter is an interesting topic on
its own but I do not expect such a deployment on small machines so the
overhead shouldn't be a showstopper. I would definitely be interested
to hear about the runtime overhead. I do expect some interesting
findings there.

Btw. I do expect that the memory controller will not be the only one
deployed, right?

-- 
Michal Hocko
SUSE Labs


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-07 Thread Michal Hocko
On Mon 05-04-21 11:18:48, Bharata B Rao wrote:
> Hi,
> 
> When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> consumption increases quite a lot (around 172G) when the containers are
> running. Most of it comes from slab (149G) and within slab, the majority of
> it comes from kmalloc-32 cache (102G)

Is this 10k cgroups a testing environment or does anybody really use that
in production? I would be really curious to hear how that behaves when
those containers are not idle. E.g. global memory reclaim iterating over
10k memcgs will likely be very visible. I do remember playing with
similar setups few years back and the overhead was very high.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 7/8] hugetlb: make free_huge_page irq safe

2021-04-07 Thread Michal Hocko
On Wed 07-04-21 11:12:37, Oscar Salvador wrote:
> On Mon, Apr 05, 2021 at 04:00:42PM -0700, Mike Kravetz wrote:
> > Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> > non-task context") was added to address the issue of free_huge_page
> > being called from irq context.  That commit hands off free_huge_page
> > processing to a workqueue if !in_task.  However, this doesn't cover
> > all the cases as pointed out by 0day bot lockdep report [1].
> > 
> > :  Possible interrupt unsafe locking scenario:
> > :
> > :        CPU0                    CPU1
> > :
> > :   lock(hugetlb_lock);
> > :                                local_irq_disable();
> > :                                lock(slock-AF_INET);
> > :                                lock(hugetlb_lock);
> > :   <Interrupt>
> > :     lock(slock-AF_INET);
> > 
> > Shakeel has later explained that this is very likely TCP TX zerocopy
> > from hugetlb pages scenario when the networking code drops a last
> > reference to hugetlb page while having IRQ disabled. Hugetlb freeing
> > path doesn't disable IRQ while holding hugetlb_lock so a lock dependency
> > chain can lead to a deadlock.
> > 
> > This commit addresses the issue by doing the following:
> > - Make hugetlb_lock irq safe.  This is mostly a simple process of
> >   changing spin_*lock calls to spin_*lock_irq* calls.
> > - Make subpool lock irq safe in a similar manner.
> > - Revert the !in_task check and workqueue handoff.
> > 
> > [1] 
> > https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
> > 
> > Signed-off-by: Mike Kravetz 
> > Acked-by: Michal Hocko 
> > Reviewed-by: Muchun Song 
> 
> So, spin_lock_irqsave/spin_unlock_irqrestore is to be used in places
> that might have been called from an IRQ context?

Yes. spin_unlock_irq will enable interrupts unconditionally which is
certainly not what we want if the path is called with IRQ disabled by
the caller.
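
For illustration, a minimal sketch of the difference (not a hunk from the
patch itself):

	unsigned long flags;

	/* safe even when the caller already runs with IRQs disabled */
	spin_lock_irqsave(&hugetlb_lock, flags);
	/* ... critical section ... */
	spin_unlock_irqrestore(&hugetlb_lock, flags);

	/*
	 * spin_unlock_irq(&hugetlb_lock) would re-enable IRQs
	 * unconditionally and break such a caller.
	 */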
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-04-07 Thread Michal Hocko
On Wed 07-04-21 10:27:49, Oscar Salvador wrote:
> On Mon, Apr 05, 2021 at 04:00:40PM -0700, Mike Kravetz wrote:
[...]
> > @@ -2671,22 +2682,34 @@ static void try_to_free_low(struct hstate *h, 
> > unsigned long count,
> > nodemask_t *nodes_allowed)
> >  {
> > int i;
> > +   struct page *page, *next;
> > +   LIST_HEAD(page_list);
> >  
> > if (hstate_is_gigantic(h))
> > return;
> >  
> > +   /*
> > +* Collect pages to be freed on a list, and free after dropping lock
> > +*/
> > for_each_node_mask(i, *nodes_allowed) {
> > -   struct page *page, *next;
> > struct list_head *freel = &h->hugepage_freelists[i];
> > list_for_each_entry_safe(page, next, freel, lru) {
> > if (count >= h->nr_huge_pages)
> > -   return;
> > +   goto out;
> > if (PageHighMem(page))
> > continue;
> > remove_hugetlb_page(h, page, false);
> > -   update_and_free_page(h, page);
> > +   list_add(&page->lru, &page_list);
> > }
> > }
> > +
> > +out:
> > +   spin_unlock(&hugetlb_lock);
> > +   list_for_each_entry_safe(page, next, &page_list, lru) {
> > +   update_and_free_page(h, page);
> > +   cond_resched();
> > +   }
> > +   spin_lock(&hugetlb_lock);
> 
> Can we get here with an empty list?

An empty page_list? If yes then sure, this can happen, but
list_for_each_entry_safe will simply not iterate. Or what do you mean?
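
For illustration (a sketch only, mirroring the quoted hunk):

	struct page *page, *next;
	LIST_HEAD(page_list);		/* empty list */

	/* the loop body is never entered for an empty list */
	list_for_each_entry_safe(page, next, &page_list, lru)
		update_and_free_page(h, page);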
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-04-07 Thread Michal Hocko
On Tue 06-04-21 09:49:13, Mike Kravetz wrote:
> On 4/6/21 2:56 AM, Michal Hocko wrote:
> > On Mon 05-04-21 16:00:39, Mike Kravetz wrote:
[...]
> >> @@ -2298,6 +2312,7 @@ static int alloc_and_dissolve_huge_page(struct 
> >> hstate *h, struct page *old_page,
> >>/*
> >> * Freed from under us. Drop new_page too.
> >> */
> >> +  remove_hugetlb_page(h, new_page, false);
> >>update_and_free_page(h, new_page);
> >>goto unlock;
> >>} else if (page_count(old_page)) {
> >> @@ -2305,6 +2320,7 @@ static int alloc_and_dissolve_huge_page(struct 
> >> hstate *h, struct page *old_page,
> >> * Someone has grabbed the page, try to isolate it here.
> >> * Fail with -EBUSY if not possible.
> >> */
> >> +  remove_hugetlb_page(h, new_page, false);
> >>update_and_free_page(h, new_page);
> >>spin_unlock(_lock);
> >>if (!isolate_huge_page(old_page, list))
> > 
> > the page is not enqueued anywhere here so remove_hugetlb_page would blow
> > up when linked list debugging is enabled.
> 
> I also thought this would be an issue.  However, INIT_LIST_HEAD would
> have been called for the page so,

OK, this is true for a freshly allocated hugetlb page (prep_new_huge_page).
It's a very subtle dependency though. In case somebody ever wants to
fortify linked lists and decides to check list_del on an empty list then
this would be silently waiting to blow up.

> Going forward, I agree it would be better to perhaps add a list_empty
> check so that things do not blow up if the debugging code is changed.

Yes, this is less tricky than a bool flag or adding more stages to the
tear down. 2 stages are more than enough IMHO.
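
For illustration, a minimal sketch of such a guard (hypothetical, not a hunk
from the posted series):

	/* tolerate pages that were never enqueued on a hugetlb freelist */
	if (!list_empty(&page->lru))
		list_del(&page->lru);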

> At one time I also thought of splitting the functionality in
> alloc_fresh_huge_page and prep_new_huge_page so that it would only
> allocate/prep the page but not increment nr_huge_pages.

We already have that distinction. alloc_buddy_huge_page is there to
allocate a fresh huge page without any hstate accounting. Considering
that giga pages are not supported for the migration anyway, maybe this
would make Oscar's work slightly less tricky?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/vmscan.c: drop_slab_node with task's memcg

2021-04-06 Thread Michal Hocko
On Tue 06-04-21 23:12:34, Neil Sun wrote:
> 
> 
> On 2021/4/6 22:39, Michal Hocko wrote:
> > 
> > Have you considered using high limit for the pro-active memory reclaim?
> 
> Thanks, Michal, do you mean the procfs interfaces?
> We have set vm.vfs_cache_pressure=1000 and so on.
> could you please give an example?

No, I've meant high memory limit available in the memcg v2 interface:
Documentation/admin-guide/cgroup-v2.rst.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/vmscan.c: drop_slab_node with task's memcg

2021-04-06 Thread Michal Hocko
On Tue 06-04-21 22:34:02, Neil Sun wrote:
> 
> 
> On 2021/4/6 19:39, Michal Hocko wrote:
> > On Tue 06-04-21 19:30:22, Neil Sun wrote:
> > > On 2021/4/6 15:21, Michal Hocko wrote:
> > > > 
> > > > You are changing the semantics of the existing user interface. This knob has
> > > > never been memcg aware and it is supposed to have a global impact. I do
> > > > not think we can simply change that without some users being surprised
> > > > or even breaking them.
> > > 
> > > Yes, do you think add new interface to sysfs is a good way? such as
> > > /sys/fs/cgroup/memory/lxc/i-vbe1u8o7/memory.kmem.drop_caches
> > 
> > There were other attempts to add a memcg specific alternative to
> > drop_caches. A lack of a strong usecase has been a reason that no such
> > attempt has been merged until now. drop_caches is a problematic
> > interface because it is really coarse and people have learned to (ab)use
> > it to work around problems rather than fix them properly.
> > 
> > What is your usecase?
> > 
> 
> We have some lxc containers running on the server. When mysqld runs
> backup jobs in a container, the page cache grows and eats up all unused
> memory in the container. Then, when new jobs come, we can see that tasks
> are busy allocating memory while reclaiming, so we want to drop the page
> cache after the mysql backup job for an individual container; it will
> speed up memory allocation when new jobs come.
> 
> This patch only drops the slab cache but not the page cache; this can be
> the first step if people really need this interface.

Have you considered using high limit for the pro-active memory reclaim?
It really seems odd to drop a certain category of memory without the aging
information we already have. I do understand the start time overhead
concern but it seems to be a much better approach to drop old objects
rather than hammer a very specific type of memory.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/vmscan.c: drop_slab_node with task's memcg

2021-04-06 Thread Michal Hocko
On Tue 06-04-21 19:30:22, Neil Sun wrote:
> On 2021/4/6 15:21, Michal Hocko wrote:
> > 
> > You are changing the semantics of the existing user interface. This knob has
> > never been memcg aware and it is supposed to have a global impact. I do
> > not think we can simply change that without some users being surprised
> > or even breaking them.
> 
> Yes, do you think add new interface to sysfs is a good way? such as
> /sys/fs/cgroup/memory/lxc/i-vbe1u8o7/memory.kmem.drop_caches

There were other attempts to add a memcg specific alternative to
drop_caches. A lack of a strong usecase has been a reason that no such
attempt has been merged until now. drop_caches is a problematic
interface because it is really coarse and people have learned to (ab)use
it to work around problems rather than fix them properly.

What is your usecase?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 5/8] hugetlb: call update_and_free_page without hugetlb_lock

2021-04-06 Thread Michal Hocko
On Mon 05-04-21 16:00:40, Mike Kravetz wrote:
> With the introduction of remove_hugetlb_page(), there is no need for
> update_and_free_page to hold the hugetlb lock.  Change all callers to
> drop the lock before calling.
> 
> With additional code modifications, this will allow loops which decrease
> the huge page pool to drop the hugetlb_lock with each page to reduce
> long hold times.
> 
> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
> a subsequent patch which restructures free_pool_huge_page.
> 
> Signed-off-by: Mike Kravetz 

Still looks good.

Acked-by: Michal Hocko 

> ---
>  mm/hugetlb.c | 43 +--
>  1 file changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index df2a3d1f632b..be6031a8e2a9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1446,16 +1446,18 @@ static void __free_huge_page(struct page *page)
>  
>   if (HPageTemporary(page)) {
>   remove_hugetlb_page(h, page, false);
> + spin_unlock(&hugetlb_lock);
>   update_and_free_page(h, page);
>   } else if (h->surplus_huge_pages_node[nid]) {
>   /* remove the page from active list */
>   remove_hugetlb_page(h, page, true);
> + spin_unlock(&hugetlb_lock);
>   update_and_free_page(h, page);
>   } else {
>   arch_clear_hugepage_flags(page);
>   enqueue_huge_page(h, page);
> + spin_unlock(&hugetlb_lock);
>   }
> - spin_unlock(&hugetlb_lock);
>  }
>  
>  /*
> @@ -1736,7 +1738,13 @@ static int free_pool_huge_page(struct hstate *h, 
> nodemask_t *nodes_allowed,
>   list_entry(h->hugepage_freelists[node].next,
> struct page, lru);
>   remove_hugetlb_page(h, page, acct_surplus);
> + /*
> +  * unlock/lock around update_and_free_page is temporary
> +  * and will be removed with subsequent patch.
> +  */
> + spin_unlock(&hugetlb_lock);
>   update_and_free_page(h, page);
> + spin_lock(&hugetlb_lock);
>   ret = 1;
>   break;
>   }
> @@ -1805,8 +1813,9 @@ int dissolve_free_huge_page(struct page *page)
>   }
>   remove_hugetlb_page(h, page, false);
>   h->max_huge_pages--;
> + spin_unlock(&hugetlb_lock);
>   update_and_free_page(h, head);
> - rc = 0;
> + return 0;
>   }
>  out:
>   spin_unlock(&hugetlb_lock);
> @@ -2291,6 +2300,7 @@ static int alloc_and_dissolve_huge_page(struct hstate 
> *h, struct page *old_page,
>   gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>   int nid = page_to_nid(old_page);
>   struct page *new_page;
> + struct page *page_to_free;
>   int ret = 0;
>  
>   /*
> @@ -2313,16 +2323,16 @@ static int alloc_and_dissolve_huge_page(struct hstate 
> *h, struct page *old_page,
>* Freed from under us. Drop new_page too.
>*/
>   remove_hugetlb_page(h, new_page, false);
> - update_and_free_page(h, new_page);
> - goto unlock;
> + page_to_free = new_page;
> + goto unlock_free;
>   } else if (page_count(old_page)) {
>   /*
>* Someone has grabbed the page, try to isolate it here.
>* Fail with -EBUSY if not possible.
>*/
>   remove_hugetlb_page(h, new_page, false);
> - update_and_free_page(h, new_page);
>   spin_unlock(&hugetlb_lock);
> + update_and_free_page(h, new_page);
>   if (!isolate_huge_page(old_page, list))
>   ret = -EBUSY;
>   return ret;
> @@ -2344,11 +2354,12 @@ static int alloc_and_dissolve_huge_page(struct hstate 
> *h, struct page *old_page,
>* enqueue_huge_page for new page.  Net result is no change.
>*/
>   remove_hugetlb_page(h, old_page, false);
> - update_and_free_page(h, old_page);
>   enqueue_huge_page(h, new_page);
> + page_to_free = old_page;
>   }
> -unlock:
> +unlock_free:
>   spin_unlock(&hugetlb_lock);
> + update_and_free_page(h, page_to_free);
>  
>   return ret;
>  }
> @@ -2671,22 +2682,34 @@ static void try_to_free_low(struct hstate *h, 
> unsigned long count,
>   nodemask_t *nodes_allowed)
>  {
>   int i;

Re: [PATCH v4 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-04-06 Thread Michal Hocko
On Mon 05-04-21 16:00:39, Mike Kravetz wrote:
> The new remove_hugetlb_page() routine is designed to remove a hugetlb
> page from hugetlbfs processing.  It will remove the page from the active
> or free list, update global counters and set the compound page
> destructor to NULL so that PageHuge() will return false for the 'page'.
> After this call, the 'page' can be treated as a normal compound page or
> a collection of base size pages.
> 
> update_and_free_page no longer decrements h->nr_huge_pages{_node} as
> this is performed in remove_hugetlb_page.  The only functionality
> performed by update_and_free_page is to free the base pages to the lower
> level allocators.
> 
> update_and_free_page is typically called after remove_hugetlb_page.
> 
> remove_hugetlb_page is to be called with the hugetlb_lock held.
> 
> Creating this routine and separating functionality is in preparation for
> restructuring code to reduce lock hold times.  This commit should not
> introduce any changes to functionality.
> 
> Signed-off-by: Mike Kravetz 

Btw. I would prefer to reverse the ordering of this and Oscar's
patchset. This one is a bug fix which might be interesting for stable
backports while Oscar's work can be seen as a general functionality
improvement.

> @@ -2298,6 +2312,7 @@ static int alloc_and_dissolve_huge_page(struct hstate 
> *h, struct page *old_page,
>   /*
>* Freed from under us. Drop new_page too.
>*/
> + remove_hugetlb_page(h, new_page, false);
>   update_and_free_page(h, new_page);
>   goto unlock;
>   } else if (page_count(old_page)) {
> @@ -2305,6 +2320,7 @@ static int alloc_and_dissolve_huge_page(struct hstate 
> *h, struct page *old_page,
>* Someone has grabbed the page, try to isolate it here.
>* Fail with -EBUSY if not possible.
>*/
> + remove_hugetlb_page(h, new_page, false);
>   update_and_free_page(h, new_page);
>   spin_unlock(_lock);
>   if (!isolate_huge_page(old_page, list))

the page is not enqueued anywhere here so remove_hugetlb_page would blow
up when linked list debugging is enabled.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-04-06 Thread Michal Hocko
On Mon 05-04-21 16:00:36, Mike Kravetz wrote:
> cma_release is currently a sleepable operation because the bitmap
> manipulation is protected by cma->lock mutex. Hugetlb code which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
> 
> The lock doesn't protect any sleepable operation so it can be changed to
> a (irq aware) spin lock. The bitmap processing should be quite fast in
> typical case but if cma sizes grow to TB then we will likely need to
> replace the lock by a more optimized bitmap implementation.
> 
> Signed-off-by: Mike Kravetz 

I believe I have acked the previous version already. Anyway
Acked-by: Michal Hocko 

> ---
>  mm/cma.c   | 18 +-
>  mm/cma.h   |  2 +-
>  mm/cma_debug.c |  8 
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index f3bca4178c7f..995e15480937 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned 
> long pfn,
>unsigned long count)
>  {
>   unsigned long bitmap_no, bitmap_count;
> + unsigned long flags;
>  
>   bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>   bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>  
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>pfn += pageblock_nr_pages)
>   init_cma_reserved_pageblock(pfn_to_page(pfn));
>  
> - mutex_init(&cma->lock);
> + spin_lock_init(&cma->lock);
>  
>  #ifdef CONFIG_CMA_DEBUGFS
>   INIT_HLIST_HEAD(&cma->mem_head);
> @@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   unsigned long nr_part, nr_total = 0;
>   unsigned long nbits = cma_bitmap_maxno(cma);
>  
> - mutex_lock(&cma->lock);
> + spin_lock_irq(&cma->lock);
>   pr_info("number of available pages: ");
>   for (;;) {
>   next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   start = next_zero_bit + nr_zero;
>   }
>   pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> - mutex_unlock(&cma->lock);
> + spin_unlock_irq(&cma->lock);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, unsigned long 
> count,
>   goto out;
>  
>   for (;;) {
> - mutex_lock(&cma->lock);
> + spin_lock_irq(&cma->lock);
>   bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>   bitmap_maxno, start, bitmap_count, mask,
>   offset);
>   if (bitmap_no >= bitmap_maxno) {
> - mutex_unlock(&cma->lock);
> + spin_unlock_irq(&cma->lock);
>   break;
>   }
>   bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long 
> count,
>* our exclusive use. If the migration fails we will take the
>* lock again and unmark it.
>*/
> - mutex_unlock(&cma->lock);
> + spin_unlock_irq(&cma->lock);
>  
>   pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>   ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>   unsigned long   count;
>   unsigned long   *bitmap;
>   unsigned int order_per_bit; /* Order of pages represented by one bit */
> - struct mutexlock;
> + spinlock_t  lock;
>  #ifdef CONFIG_CMA_DEBUGFS
>   struct hlist_head mem_head;
>   spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..2e7704955f4f 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
>   struct cma *cma = data;

Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-06 Thread Michal Hocko
On Mon 05-04-21 10:08:24, Tim Chen wrote:
[...]
> To make fine grain cgroup based management of the precious top tier
> DRAM memory possible, this patchset adds a few new features:
> 1. Provides memory monitors on the amount of top tier memory used per cgroup 
>and by the system as a whole.
> 2. Applies soft limits on the top tier memory each cgroup uses 
> 3. Enables kswapd to demote top tier pages from cgroup with excess top
>tier memory usages.

Could you be more specific on how this interface is supposed to be used?

> This allows us to provision different amount of top tier memory to each
> cgroup according to the cgroup's latency need.
> 
> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> interface is the limit on the cgroup is a soft limit, so a cgroup can
> exceed the limit quite a bit before reclaim before page demotion reins
> it in. 

I have to say that I dislike abusing soft limit reclaim for this. In the
past we have learned that the existing implementation is unfixable and
changing the existing semantics is impossible due to backward compatibility.
So I would really prefer the soft limit just find its rest rather than
see new potential usecases.

I haven't really looked into the details of this patchset but from a cursory
look it seems like you are actually introducing NUMA aware limits into
memcg that would control consumption from some nodes differently than
other nodes. This would be a rather alien concept to the existing memcg
infrastructure IMO. It looks like it is fusing the borders between memcg and
cpuset controllers.

You also seem to be basing the interface on a very specific usecase.
Can we expect that there will be many different tiers requiring their
own balancing?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/vmscan.c: drop_slab_node with task's memcg

2021-04-06 Thread Michal Hocko
On Fri 02-04-21 18:38:54, Neil Sun wrote:
> This patch makes drop_slab_node() call shrink_slab() with the task's memcg,
> so we can free reclaimable slab objects belonging to the memcg /lxc/i-vbe1u8o7
> with the following command:

You are changing the semantics of the existing user interface. This knob has
never been memcg aware and it is supposed to have a global impact. I do
not think we can simply change that without some users being surprised
or even breaking them.

> cgexec -g memory:/lxc/i-vbe1u8o7 sysctl vm.drop_caches=2
> 
> Test with following steps:
> 
> root@i-yl0pwrt8:~# free -h
>   totalusedfree  shared  buff/cache   
> available
> Mem:   62Gi   265Mi62Gi   1.0Mi   290Mi
> 61Gi
> Swap:  31Gi  0B31Gi
> root@i-yl0pwrt8:~# (cd /tmp && /root/generate_slab_cache)
> root@i-yl0pwrt8:~# free -h
>   totalusedfree  shared  buff/cache   
> available
> Mem:   62Gi   266Mi60Gi   1.0Mi   2.2Gi
> 61Gi
> Swap:  31Gi  0B31Gi
> root@i-yl0pwrt8:~# cgcreate -g memory:/lxc/i-vbe1u8o7
> root@i-yl0pwrt8:~# cgexec -g memory:/lxc/i-vbe1u8o7 /root/generate_slab_cache
> root@i-yl0pwrt8:~# free -h
>   totalusedfree  shared  buff/cache   
> available
> Mem:   62Gi   267Mi58Gi   1.0Mi   4.1Gi
> 61Gi
> Swap:  31Gi  0B31Gi
> root@i-yl0pwrt8:~# cgexec -g memory:/lxc/i-vbe1u8o7 sysctl vm.drop_caches=2
> vm.drop_caches = 2
> root@i-yl0pwrt8:~# free -h
>   totalusedfree  shared  buff/cache   
> available
> Mem:   62Gi   268Mi60Gi   1.0Mi   2.2Gi
> 61Gi
> Swap:  31Gi  0B31Gi
> root@i-yl0pwrt8:~# sysctl vm.drop_caches=2
> vm.drop_caches = 2
> root@i-yl0pwrt8:~# free -h
>   totalusedfree  shared  buff/cache   
> available
> Mem:   62Gi   267Mi62Gi   1.0Mi   290Mi
> 61Gi
> Swap:  31Gi  0B31Gi
> 
> Signed-off-by: Neil Sun 
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 562e87cb..81d770a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -702,7 +702,7 @@ void drop_slab_node(int nid)
>   return;
>  
>   freed = 0;
> - memcg = mem_cgroup_iter(NULL, NULL, NULL);
> + memcg = mem_cgroup_from_task(current);
>       do {
>   freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>   } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v2] writeback: fix obtain a reference to a freeing memcg css

2021-04-01 Thread Michal Hocko
On Thu 01-04-21 21:59:13, Muchun Song wrote:
> On Thu, Apr 1, 2021 at 6:26 PM Michal Hocko  wrote:
[...]
> > Even if the css ref count is not really necessary it shouldn't cause any
> > harm and it makes the code easier to understand. At least a comment
> > explaining why that is not necessary would be required without it
> 
> OK. I will add a comment here to explain why we need to hold a
> ref.

I do not think this is necessary. Taking the reference is a standard
way and I am not sure it requires a comment. I meant to say that not
having a reference should really have a comment explaining why.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] writeback: fix obtain a reference to a freeing memcg css

2021-04-01 Thread Michal Hocko
On Thu 01-04-21 17:33:43, Muchun Song wrote:
> The caller of wb_get_create() should pin the memcg, because
> wb_get_create() relies on this guarantee. The rcu read lock
> only can guarantee that the memcg css returned by css_from_id()
> cannot be released, but the reference of the memcg can be zero.
> Fix it by holding a reference to the css before calling
> wb_get_create(). This is not a problem I encountered in the
> real world. Just the result of a code review.
> 
> And it is unnecessary to use GFP_ATOMIC, so replace it with
> GFP_NOIO.

This should go into it's own patch. With more explanation why NOIO is
required.

> Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction 
> and use it for stat updates")
> Signed-off-by: Muchun Song 

For the css part feel free to add
Acked-by: Michal Hocko 

Even if the css ref count is not really necessary it shouldn't cause any
harm and it makes the code easier to understand. At least a comment
explaining why that is not necessary would be required without it.

Thanks!

> ---
> Changelog in v2:
>  1. Replace GFP_ATOMIC with GFP_NOIO suggested by Matthew.
> 
>  fs/fs-writeback.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index e91980f49388..df7f89f8f771 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -501,16 +501,21 @@ static void inode_switch_wbs(struct inode *inode, int 
> new_wb_id)
>   if (atomic_read(&isw_nr_in_flight) > WB_FRN_MAX_IN_FLIGHT)
>   return;
>  
> - isw = kzalloc(sizeof(*isw), GFP_ATOMIC);
> + isw = kzalloc(sizeof(*isw), GFP_NOIO);
>   if (!isw)
>   return;
>  
>   /* find and pin the new wb */
>   rcu_read_lock();
>   memcg_css = css_from_id(new_wb_id, &memory_cgrp_subsys);
> - if (memcg_css)
> - isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
> + if (memcg_css && !css_tryget(memcg_css))
> + memcg_css = NULL;
>   rcu_read_unlock();
> + if (!memcg_css)
> + goto out_free;
> +
> + isw->new_wb = wb_get_create(bdi, memcg_css, GFP_NOIO);
> + css_put(memcg_css);
>   if (!isw->new_wb)
>   goto out_free;
>  
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] sysfs: Unconditionally use vmalloc for buffer

2021-04-01 Thread Michal Hocko
On Wed 31-03-21 19:21:45, Kees Cook wrote:
> The sysfs interface to seq_file continues to be rather fragile
> (seq_get_buf() should not be used outside of seq_file), as seen with
> some recent exploits[1]. Move the seq_file buffer to the vmap area
> (while retaining the accounting flag), since it has guard pages that
> will catch and stop linear overflows.

I thought the previous discussion has led to a conclusion that the
preferred way is to disallow direct seq_file buffer usage. But this is
obviously up to sysfs maintainers. I am happy you do not want to spread
this out to all seq_file users anymore.

> This seems justified given that
> sysfs's use of seq_file already uses kvmalloc(), is almost always using
> a PAGE_SIZE or larger allocation, has normally short-lived allocations,
> and is not normally on a performance critical path.

Let me clarify this, because it is not quite right. kvmalloc vs
vmalloc (both with GFP_KERNEL) on PAGE_SIZE are two different beasts.
The first one is almost always going to use kmalloc because the page
allocator almost never fails those requests.

> Once seq_get_buf() has been removed (and all sysfs callbacks using
> seq_file directly), this change can also be removed.
> 
> [1] https://blog.grimm-co.com/2021/03/new-old-bugs-in-linux-kernel.html
> 
> Signed-off-by: Kees Cook 
> ---
> v3:
> - Limit to only sysfs (instead of all of seq_file).
> v2: 
> https://lore.kernel.org/lkml/20210315174851.68-1-keesc...@chromium.org/
> v1: 
> https://lore.kernel.org/lkml/20210312205558.2947488-1-keesc...@chromium.org/
> ---
>  fs/sysfs/file.c | 23 +++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
> index 9aefa7779b29..70e7a450e5d1 100644
> --- a/fs/sysfs/file.c
> +++ b/fs/sysfs/file.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "sysfs.h"
>  
> @@ -32,6 +33,25 @@ static const struct sysfs_ops *sysfs_file_ops(struct 
> kernfs_node *kn)
>   return kobj->ktype ? kobj->ktype->sysfs_ops : NULL;
>  }
>  
> +/*
> + * To be proactively defensive against sysfs show() handlers that do not
> + * correctly stay within their PAGE_SIZE buffer, use the vmap area to gain
> + * the trailing guard page which will stop linear buffer overflows.
> + */
> +static void *sysfs_kf_seq_start(struct seq_file *sf, loff_t *ppos)
> +{
> + struct kernfs_open_file *of = sf->private;
> + struct kernfs_node *kn = of->kn;
> +
> + WARN_ON_ONCE(sf->buf);
> + sf->buf = __vmalloc(kn->attr.size, GFP_KERNEL_ACCOUNT);
> + if (!sf->buf)
> + return ERR_PTR(-ENOMEM);
> + sf->size = kn->attr.size;
> +
> + return NULL + !*ppos;
> +}
> +
>  /*
>   * Reads on sysfs are handled through seq_file, which takes care of hairy
>   * details like buffering and seeking.  The following function pipes
> @@ -206,14 +226,17 @@ static const struct kernfs_ops sysfs_file_kfops_empty = 
> {
>  };
>  
>  static const struct kernfs_ops sysfs_file_kfops_ro = {
> + .seq_start  = sysfs_kf_seq_start,
>   .seq_show   = sysfs_kf_seq_show,
>  };
>  
>  static const struct kernfs_ops sysfs_file_kfops_wo = {
> + .seq_start  = sysfs_kf_seq_start,
>   .write  = sysfs_kf_write,
>  };
>  
>  static const struct kernfs_ops sysfs_file_kfops_rw = {
> + .seq_start  = sysfs_kf_seq_start,
>   .seq_show   = sysfs_kf_seq_show,
>   .write  = sysfs_kf_write,
>  };
> -- 
> 2.25.1

-- 
Michal Hocko
SUSE Labs
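
For reference, a rough sketch of the kvmalloc() fallback behaviour discussed
above. This is an approximation of the mm/util.c logic, not the exact code,
and the helper name is made up for illustration:

/*
 * kvmalloc() tries the slab/page allocator first and only falls back to
 * vmalloc when that fails.  A plain __vmalloc() call, by contrast, always
 * pays the vmalloc overhead (and always gets the vmap guard pages).
 */
void *kvmalloc_approx(size_t size, gfp_t flags)
{
	gfp_t kmalloc_flags = flags;
	void *ret;

	/* For larger-than-page requests, do not let kmalloc retry hard or
	 * warn on failure; the vmalloc fallback handles it. */
	if (size > PAGE_SIZE)
		kmalloc_flags |= __GFP_NOWARN | __GFP_NORETRY;

	ret = kmalloc(size, kmalloc_flags);
	if (ret || size <= PAGE_SIZE)
		return ret;

	/* Only reached when the page allocator gave up. */
	return __vmalloc(size, flags);
}

In other words, a PAGE_SIZE sysfs buffer allocated via kvmalloc() will
practically never land in the vmap area, which is why it does not get the
guard-page protection that the unconditional __vmalloc() in the patch
provides.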


Re: [PATCH RFC 0/3] drivers/char: remove /dev/kmem for good

2021-03-31 Thread Michal Hocko
On Wed 31-03-21 17:08:03, Enrico Weigelt, metux IT consult wrote:
> On 19.03.21 18:33, Sebastian Andrzej Siewior wrote:
> > On 2021-03-19 10:14:02 [-0700], Linus Torvalds wrote:
> > > On Fri, Mar 19, 2021 at 7:35 AM David Hildenbrand  
> > > wrote:
> > > > 
> > > > Let's start a discussion if /dev/kmem is worth keeping around and
> > > > fixing/maintaining or if we should just remove it now for good.
> > > 
> > > I'll happily do this for the next merge window, but would really want
> > > distros to confirm that they don't enable it.
> > > 
> > > I can confirm that it's certainly not enabled on any of the machines I
> > > have, but..
> > 
> > Debian has CONFIG_DEVKMEM disabled since 2.6.31.
> 
> SLES, too. (but no idea since when exactly)

15-SP2 IIRC

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] userfaultfd: Write protect when virtual memory range has no page table entry

2021-03-31 Thread Michal Hocko
On Mon 22-03-21 14:49:35, Michal Hocko wrote:
> On Mon 22-03-21 15:00:37, Mike Rapoport wrote:
> > On Mon, Mar 22, 2021 at 11:14:37AM +0100, Michal Hocko wrote:
> > > Let's CC Andrea and Mike
> > > 
> > > On Fri 19-03-21 22:24:28, Bui Quang Minh wrote:
> > > > userfaultfd_writeprotect() use change_protection() to clear write bit in
> > > > page table entries (pte/pmd). So, later write to this virtual address
> > > > range causes a page fault, which is then handled by userspace program.
> > > > However, change_protection() has no effect when there is no page table
> > > > entries associated with that virtual memory range (a newly mapped memory
> > > > range). As a result, later access to that memory range causes 
> > > > allocating a
> > > > page table entry with write bit still set (due to VM_WRITE flag in
> > > > vma->vm_flags).
> > > > 
> > > > Add checks for VM_UFFD_WP in vma->vm_flags when allocating new page 
> > > > table
> > > > entry in missing page table entry page fault path.
> > > 
> > > From the above it is not really clear whether this is a usability
> > > problem or a bug of the interface.
> > 
> > I'd say it's usability/documentation clarity issue. 
> > Userspace can register an area with
> > 
> > UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP
> > 
> > and then it will be notified either when page table has no entry for a
> > virtual address or when there is a write to a write protected address.
> 
> Thanks for the clarification! I have suspected this to be the case but
> I am not really familiar with the interface to have any strong statement
> here. Maybe we want to document this explicitly.

Btw. Andrew the patch still seems to be in mmotm. Do you plan to keep it
there?

-- 
Michal Hocko
SUSE Labs
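
To make the MISSING|WP registration discussed in this thread concrete, here is
a minimal, hypothetical userspace sketch (helper name invented, error handling
mostly omitted, assumes a kernel with userfaultfd write-protect support):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int uffd_register_missing_wp(void *addr, unsigned long len)
{
	/* Create the userfaultfd and negotiate the API/features. */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return -1;

	/* Get notified both for missing page table entries and for writes
	 * to write-protected addresses in the range. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;

	/* Write-protect the range; subsequent writes generate WP events
	 * that are read from the uffd descriptor. */
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		return -1;

	return uffd;
}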


Re: [PATCH v3 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-31 Thread Michal Hocko
On Tue 30-03-21 20:41:41, Mike Kravetz wrote:
> cma_release is currently a sleepable operation because the bitmap
> manipulation is protected by cma->lock mutex. Hugetlb code which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
> 
> The lock doesn't protect any sleepable operation so it can be changed to
> an (irq aware) spin lock. The bitmap processing should be quite fast in
> typical case but if cma sizes grow to TB then we will likely need to
> replace the lock by a more optimized bitmap implementation.
> 
> Signed-off-by: Mike Kravetz 

Acked-by: Michal Hocko 

> ---
>  mm/cma.c   | 18 +-
>  mm/cma.h   |  2 +-
>  mm/cma_debug.c |  8 
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index b2393b892d3b..2380f2571eb5 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned 
> long pfn,
>unsigned int count)
>  {
>   unsigned long bitmap_no, bitmap_count;
> + unsigned long flags;
>  
>   bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>   bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>  
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>pfn += pageblock_nr_pages)
>   init_cma_reserved_pageblock(pfn_to_page(pfn));
>  
> - mutex_init(&cma->lock);
> + spin_lock_init(&cma->lock);
>  
>  #ifdef CONFIG_CMA_DEBUGFS
>   INIT_HLIST_HEAD(&cma->mem_head);
> @@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   unsigned long nr_part, nr_total = 0;
>   unsigned long nbits = cma_bitmap_maxno(cma);
>  
> - mutex_lock(&cma->lock);
> + spin_lock_irq(&cma->lock);
>   pr_info("number of available pages: ");
>   for (;;) {
>   next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   start = next_zero_bit + nr_zero;
>   }
>   pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> - mutex_unlock(&cma->lock);
> + spin_unlock_irq(&cma->lock);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
> unsigned int align,
>   goto out;
>  
>   for (;;) {
> - mutex_lock(&cma->lock);
> + spin_lock_irq(&cma->lock);
>   bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>   bitmap_maxno, start, bitmap_count, mask,
>   offset);
>   if (bitmap_no >= bitmap_maxno) {
> - mutex_unlock(&cma->lock);
> + spin_unlock_irq(&cma->lock);
>   break;
>   }
>   bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
> unsigned int align,
>* our exclusive use. If the migration fails we will take the
>* lock again and unmark it.
>*/
> - mutex_unlock(&cma->lock);
> + spin_unlock_irq(&cma->lock);
>  
>   pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>   ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>   unsigned long   count;
>   unsigned long   *bitmap;
>   unsigned int order_per_bit; /* Order of pages represented by one bit */
> - struct mutex lock;
> + spinlock_t  lock;
>  #ifdef CONFIG_CMA_DEBUGFS
>   struct hlist_head mem_head;
>   spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..2e7704955f4f 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
>   struct cma *cma = data;
>   unsigned long used;
> 

Re: [PATCH] linux/memcontrol.h: Remove duplicate struct declaration

2021-03-30 Thread Michal Hocko
On Tue 30-03-21 10:02:36, Wan Jiabing wrote:
> struct mem_cgroup is declared twice. One has been declared
> at forward struct declaration. Remove the duplicate.

Duplicate declaration shouldn't really cause any problems. You are right
that there is one which is independent of CONFIG_MEMCG, so the one below
is not needed though. It would be great if the changelog mentioned that.

> Signed-off-by: Wan Jiabing 

Acked-by: Michal Hocko 

> ---
>  include/linux/memcontrol.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0c04d39a7967..f0ae33a0f175 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1068,8 +1068,6 @@ void split_page_memcg(struct page *head, unsigned int 
> nr);
>  #define MEM_CGROUP_ID_SHIFT  0
>  #define MEM_CGROUP_ID_MAX 0
>  
> -struct mem_cgroup;
> -
>  static inline struct mem_cgroup *page_memcg(struct page *page)
>  {
>   return NULL;
> -- 
> 2.25.1

-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-30 Thread Michal Hocko
On Tue 30-03-21 16:08:36, Muchun Song wrote:
> On Tue, Mar 30, 2021 at 4:01 PM Michal Hocko  wrote:
> >
> > On Mon 29-03-21 16:23:55, Mike Kravetz wrote:
> > > Ideally, cma_release could be called from any context.  However, that is
> > > not possible because a mutex is used to protect the per-area bitmap.
> > > Change the bitmap to an irq safe spinlock.
> >
> > I would phrase the changelog slightly differently
> > "
> > cma_release is currently a sleepable operation because the bitmap
> > manipulation is protected by cma->lock mutex. Hugetlb code which relies
> > on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> > irq safe.
> >
> > The lock doesn't protect any sleepable operation so it can be changed to
> > an (irq aware) spin lock. The bitmap processing should be quite fast in
> > typical case but if cma sizes grow to TB then we will likely need to
> > replace the lock by a more optimized bitmap implementation.
> > "
> >
> > it seems that you are overusing irqsave variants even from context which
> > are never called from the IRQ context so they do not need storing flags.
> >
> > [...]
> > > @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
> > >   unsigned long start = 0;
> > >   unsigned long nr_part, nr_total = 0;
> > >   unsigned long nbits = cma_bitmap_maxno(cma);
> > > + unsigned long flags;
> > >
> > > - mutex_lock(>lock);
> > > + spin_lock_irqsave(>lock, flags);
> >
> > spin_lock_irq should be sufficient. This is only called from the
> > allocation context and that is never called from IRQ context.
> 
> This makes me think more. I think that spin_lock should be
> sufficient. Right?

Nope. Think of the following scenario:
	spin_lock(cma->lock);
	<IRQ>
	put_page
	  __free_huge_page
	    cma_release
	      spin_lock_irqsave() DEADLOCK
-- 
Michal Hocko
SUSE Labs
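
A schematic of that scenario, for readers skimming the thread (sketch only,
not kernel code):

/*
 * task context                        interrupt on the same CPU
 * ------------                        -------------------------
 * spin_lock(&cma->lock);
 *   ... bitmap scan ...               put_page()
 *                                       __free_huge_page()
 *                                         cma_release()
 *                                           spin_lock_irqsave(&cma->lock, ...)
 *                                           -> spins on a lock held by the
 *                                              context it interrupted: deadlock
 *
 * Taking the lock with spin_lock_irq()/spin_lock_irqsave() keeps the
 * interrupt (and hence cma_release()) off this CPU for the duration of
 * the critical section, so the self-deadlock cannot happen.
 */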


Re: [PATCH v2 6/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page

2021-03-30 Thread Michal Hocko
On Mon 29-03-21 16:24:00, Mike Kravetz wrote:
> free_pool_huge_page was called with hugetlb_lock held.  It would remove
> a hugetlb page, and then free the corresponding pages to the lower level
> allocators such as buddy.  free_pool_huge_page was called in a loop to
> remove hugetlb pages and these loops could hold the hugetlb_lock for a
> considerable time.
> 
> Create new routine remove_pool_huge_page to replace free_pool_huge_page.
> remove_pool_huge_page will remove the hugetlb page, and it must be
> called with the hugetlb_lock held.  It will return the removed page and
> it is the responsibility of the caller to free the page to the lower
> level allocators.  The hugetlb_lock is dropped before freeing to these
> allocators which results in shorter lock hold times.
> 
> Add new helper routine to call update_and_free_page for a list of pages.
> 
> Note: Some changes to the routine return_unused_surplus_pages are in
> need of explanation.  Commit e5bbc8a6c992 ("mm/hugetlb.c: fix reservation
> race when freeing surplus pages") modified this routine to address a
> race which could occur when dropping the hugetlb_lock in the loop that
> removes pool pages.  Accounting changes introduced in that commit were
> subtle and took some thought to understand.  This commit removes the
> cond_resched_lock() and the potential race.  Therefore, remove the
> subtle code and restore the more straight forward accounting effectively
> reverting the commit.
> 
> Signed-off-by: Mike Kravetz 

Please drop INIT_LIST_HEAD which seems to be a left over from rebasing
to use LIST_HEAD.

Acked-by: Michal Hocko 

> ---
>  mm/hugetlb.c | 95 +---
>  1 file changed, 53 insertions(+), 42 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dec7bd0dc63d..d3f3cb8766b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1209,7 +1209,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
>  }
>  
>  /*
> - * helper for free_pool_huge_page() - return the previously saved
> + * helper for remove_pool_huge_page() - return the previously saved
>   * node ["this node"] from which to free a huge page.  Advance the
>   * next node id whether or not we find a free huge page to free so
>   * that the next attempt to free addresses the next node.
> @@ -1391,6 +1391,16 @@ static void update_and_free_page(struct hstate *h, 
> struct page *page)
>   }
>  }
>  
> +static void update_and_free_pages_bulk(struct hstate *h, struct list_head 
> *list)
> +{
> + struct page *page, *t_page;
> +
> + list_for_each_entry_safe(page, t_page, list, lru) {
> + update_and_free_page(h, page);
> + cond_resched();
> + }
> +}
> +
>  struct hstate *size_to_hstate(unsigned long size)
>  {
>   struct hstate *h;
> @@ -1721,16 +1731,18 @@ static int alloc_pool_huge_page(struct hstate *h, 
> nodemask_t *nodes_allowed,
>  }
>  
>  /*
> - * Free huge page from pool from next node to free.
> - * Attempt to keep persistent huge pages more or less
> - * balanced over allowed nodes.
> + * Remove huge page from pool from next node to free.  Attempt to keep
> + * persistent huge pages more or less balanced over allowed nodes.
> + * This routine only 'removes' the hugetlb page.  The caller must make
> + * an additional call to free the page to low level allocators.
>   * Called with hugetlb_lock locked.
>   */
> -static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> -  bool acct_surplus)
> +static struct page *remove_pool_huge_page(struct hstate *h,
> + nodemask_t *nodes_allowed,
> +  bool acct_surplus)
>  {
>   int nr_nodes, node;
> - int ret = 0;
> + struct page *page = NULL;
>  
>   for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
>   /*
> @@ -1739,23 +1751,14 @@ static int free_pool_huge_page(struct hstate *h, 
> nodemask_t *nodes_allowed,
>*/
>   if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
>   !list_empty(&h->hugepage_freelists[node])) {
> - struct page *page =
> - list_entry(h->hugepage_freelists[node].next,
> + page = list_entry(h->hugepage_freelists[node].next,
> struct page, lru);
>   remove_hugetlb_page(h, page, acct_surplus);
> - /*
> -  * unlock/lock around update_and_free_page is temporary
> -   

Re: [PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release

2021-03-30 Thread Michal Hocko
On Mon 29-03-21 16:23:56, Mike Kravetz wrote:
> Now that cma_release is non-blocking and irq safe, there is no need to
> drop hugetlb_lock before calling.
> 
> Signed-off-by: Mike Kravetz 

Acked-by: Michal Hocko 

> ---
>  mm/hugetlb.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3c3e4baa4156..1d62f0492e7b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, 
> struct page *page)
>   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
>   set_page_refcounted(page);
>   if (hstate_is_gigantic(h)) {
> - /*
> -  * Temporarily drop the hugetlb_lock, because
> -  * we might block in free_gigantic_page().
> -  */
> - spin_unlock(&hugetlb_lock);
>   destroy_compound_gigantic_page(page, huge_page_order(h));
>   free_gigantic_page(page, huge_page_order(h));
> - spin_lock(&hugetlb_lock);
>   } else {
>       __free_pages(page, huge_page_order(h));
>   }
> -- 
> 2.30.2
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-30 Thread Michal Hocko
> unsigned long maxchunk = 0;
>   unsigned long start, end = 0;
>   unsigned long bitmap_maxno = cma_bitmap_maxno(cma);
> + unsigned long flags;
>  
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   for (;;) {
>   start = find_next_zero_bit(cma->bitmap, bitmap_maxno, end);
>   if (start >= bitmap_maxno)
> @@ -61,7 +63,7 @@ static int cma_maxchunk_get(void *data, u64 *val)
>   end = find_next_bit(cma->bitmap, bitmap_maxno, start);
>   maxchunk = max(end - start, maxchunk);
>   }
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>   *val = (u64)maxchunk << cma->order_per_bit;
>  
>   return 0;

and here.
-- 
Michal Hocko
SUSE Labs


Re: [External] [PATCH 7/8] hugetlb: make free_huge_page irq safe

2021-03-29 Thread Michal Hocko
On Sat 27-03-21 15:06:36, Muchun Song wrote:
> On Thu, Mar 25, 2021 at 8:29 AM Mike Kravetz  wrote:
> >
> > Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> > non-task context") was added to address the issue of free_huge_page
> > being called from irq context.  That commit hands off free_huge_page
> > processing to a workqueue if !in_task.  However, as seen in [1] this
> > does not cover all cases.  Instead, make the locks taken in the
> > free_huge_page irq safe.
> >
> > This patch does the following:
> > - Make hugetlb_lock irq safe.  This is mostly a simple process of
> >   changing spin_*lock calls to spin_*lock_irq* calls.
> > - Make subpool lock irq safe in a similar manner.
> > - Revert the !in_task check and workqueue handoff.
> >
> > [1] 
> > https://lore.kernel.org/linux-mm/f1c03b05bc43a...@google.com/
> >
> > Signed-off-by: Mike Kravetz 
> 
> The changes are straightforward.
> 
> Reviewed-by: Muchun Song 
> 
> Since this patchset aims to fix a real word issue. Should we add a Fixes
> tag?

Do we know since when it is possible to use hugetlb in the networking
context? Maybe this has always been possible, but I am wondering why
lockdep started complaining only now. Maybe fuzzing just finally started
using this setup, which nobody does normally.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()

2021-03-29 Thread Michal Hocko
On Fri 26-03-21 14:32:01, Mike Kravetz wrote:
[...]
> - Just change the mutex to an irq safe spinlock.

Yes please.

>   AFAICT, the potential
>   downsides could be:
>   - Interrupts disabled during long bitmap scans

How large those bitmaps are in practice?

>   - Wasted cpu cycles (spinning) if there is much contention on lock
>   Both of these would be more of an issue on small/embedded systems. I
>   took a quick look at the callers of cma_alloc/cma_release and nothing
>   stood out that could lead to high degrees of contention.  However, I
>   could have missed something.

If this is really a practical concern then we can try a more complex
solution based on some data.

-- 
Michal Hocko
SUSE Labs
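
As a rough answer to the bitmap-size question above (hypothetical helper,
assuming 4 KiB pages; the real numbers depend on PAGE_SIZE and each area's
order_per_bit):

/*
 * One bitmap bit covers (PAGE_SIZE << order_per_bit) bytes of CMA area,
 * so the bitmap needs (area_size / PAGE_SIZE) >> order_per_bit bits:
 *
 *   1 GiB area, order_per_bit = 0:  262,144 bits  ->  32 KiB bitmap
 *   1 TiB area, order_per_bit = 0:  ~268M bits    ->  32 MiB bitmap
 *   1 TiB area, order_per_bit = 9:  524,288 bits  ->  64 KiB bitmap
 */
static unsigned long cma_bitmap_bytes(unsigned long area_bytes,
				      unsigned int order_per_bit)
{
	unsigned long bits = (area_bytes >> PAGE_SHIFT) >> order_per_bit;

	return BITS_TO_LONGS(bits) * sizeof(long);
}

So even a multi-gigabyte CMA area keeps the scan in the tens of kilobytes;
the interrupts-disabled window only becomes a real concern for terabyte-scale
areas with order_per_bit = 0.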


Re: [PATCH] mm/page_alloc: try oom if reclaim is unable to make forward progress

2021-03-26 Thread Michal Hocko
On Fri 26-03-21 11:22:54, Aaron Tomlin wrote:
[...]
> > Both reclaim and compaction maintain their own retries counters as they
> > are targeting a different operation. Although the compaction really
> > depends on the reclaim to do some progress.
> 
> Yes. Looking at should_compact_retry() if the last known compaction result
> was skipped i.e. suggesting there was not enough order-0 pages to support
> compaction, so assistance is needed from reclaim
> (see __compaction_suitable()).
> 
> I noticed that the value of compaction_retries, compact_result and
> compact_priority was 0, COMPACT_SKIPPED and 1 i.e. COMPACT_PRIO_SYNC_LIGHT,
> respectively.
> 
> > OK, this sound unexpected as it indicates that the reclaim is able to
> > make a forward progress but compaction doesn't want to give up and keeps
> > retrying. Are you able to reproduce this or could you find out which
> > specific condition keeps compaction retrying? I would expect that it is
> > one of the 3 conditions before the max_retries is checked.
> 
> Unfortunately, I have been told it is not entirely reproducible.
> I suspect it is the following in should_compact_retry() - as I indicated
> above the last known value stored in compaction_retries was 0:
> 
> 
> if (order > PAGE_ALLOC_COSTLY_ORDER)
> max_retries /= 4;
> if (*compaction_retries <= max_retries) {
> ret = true;
> goto out;
> }

OK, I kinda expected this would be not easily reproducible. The reason I
dislike your patch is that it adds yet another criterion for oom while
we already do have 2 which doesn't make the resulting code easier to
reason about. We should be focusing on the compaction retry logic and
see whether we can have some "run away" scenarios there. Seeing so many
retries without compaction bailing out sounds like a bug in that retry
logic. Vlastimil is much more familiar with that.
-- 
Michal Hocko
SUSE Labs
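
For reference, a worked example of the retry cap in the snippet quoted above,
assuming MAX_COMPACT_RETRIES == 16 and PAGE_ALLOC_COSTLY_ORDER == 3 (the
values in the kernel at the time of this discussion); the helper is
illustrative only:

static bool costly_order_would_retry(unsigned int order, int compaction_retries)
{
	int max_retries = 16;		/* MAX_COMPACT_RETRIES */

	if (order > 3)			/* PAGE_ALLOC_COSTLY_ORDER */
		max_retries /= 4;	/* -> 4 for e.g. an order-9 request */

	/* With the reported compaction_retries == 0, "0 <= 4" holds, so this
	 * branch on its own keeps granting another compaction retry. */
	return compaction_retries <= max_retries;
}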


Re: [PATCH v5 1/5] mm,memory_hotplug: Allocate memmap from the added memory range

2021-03-26 Thread Michal Hocko
On Fri 26-03-21 15:53:41, David Hildenbrand wrote:
> On 26.03.21 15:38, Michal Hocko wrote:
> > On Fri 26-03-21 09:52:58, David Hildenbrand wrote:
[...]
> > > 2. We won't allocate kasan shadow memory. We most probably have to do it
> > > explicitly via kasan_add_zero_shadow()/kasan_remove_zero_shadow(), see
> > > mm/memremap.c:pagemap_range()
> > 
> > I think this is similar to the above. Does kasan has to know about
> > memory which will never be used for anything?
> 
> IIRC, kasan will track read/writes to the vmemmap as well. So it could
> theoretically detect if we read from the vmemmap before writing
> (initializing) it IIUC.
> This is also why mm/memremap.c does a kasan_add_zero_shadow() before the
> move_pfn_range_to_zone()->memmap_init_range() for the whole region,
> including altmap space.
> 
> Now, I am no expert on KASAN, what would happen in case we have access to
> non-tracked memory.
> 
> commit 0207df4fa1a869281ddbf72db6203dbf036b3e1a
> Author: Andrey Ryabinin 
> Date:   Fri Aug 17 15:47:04 2018 -0700
> 
> kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN
> 
> indicates that kasan will crash the system on "non-existent shadow memory"

Interesting. Thanks for the pointer.

> > > Further a locking rework might be necessary. We hold the device hotplug
> > > lock, but not the memory hotplug lock. E.g., for get_online_mems(). Might
> > > have to move that out online_pages.
> > 
> > Could you be more explicit why this locking is needed? What it would
> > protect from for vmemmap pages?
> > 
> 
> One example is in mm/kmemleak.c:kmemleak_scan(), where we scan the vmemmap
> for pointers. We don't want the vmemmap to get unmapped while we are working
> on it (-> fault).
 
Hmm, but they are not going away during offline. They just have a less
defined state. Or what exactly do you mean by unmapped?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 1/5] mm,memory_hotplug: Allocate memmap from the added memory range

2021-03-26 Thread Michal Hocko
On Fri 26-03-21 09:52:58, David Hildenbrand wrote:
[...]
> Something else to note:
> 
> 
> We'll not call the memory notifier (e.g., MEM_ONLINE) for the vmemmap. The
> result is that
> 
> 1. We won't allocate extended struct pages for the range. Don't think this
> is really problematic (pages are never allocated/freed, so I guess we don't
> care - like ZONE_DEVICE code).

Agreed. I do not think we need them. Future might disagree but let's
handle it when we have a clear demand.

> 2. We won't allocate kasan shadow memory. We most probably have to do it
> explicitly via kasan_add_zero_shadow()/kasan_remove_zero_shadow(), see
> mm/memremap.c:pagemap_range()

I think this is similar to the above. Does kasan has to know about
memory which will never be used for anything?

> Further a locking rework might be necessary. We hold the device hotplug
> lock, but not the memory hotplug lock. E.g., for get_online_mems(). Might
> have to move that out online_pages.

Could you be more explicit why this locking is needed? What it would
protect from for vmemmap pages?
-- 
Michal Hocko
SUSE Labs

