Re: [PATCH] MAINTAINERS: add myself as slab allocators maintainer

2021-01-14 Thread Vlastimil Babka
On 1/8/21 7:46 PM, Christoph Lameter wrote:
> I am ok with you as a slab maintainer. I have seen some good work from
> you.
> 
> Acked-by: Christoph Lameter 

Thanks!

Vlastimil


Re: [PATCH v2] mm/slub: disable user tracing for kmemleak caches by default

2021-01-14 Thread Vlastimil Babka
On 1/13/21 9:51 PM, Johannes Berg wrote:
> From: Johannes Berg 
> 
> If kmemleak is enabled, it uses a kmem cache for its own objects.
> These objects are used to hold information kmemleak uses, including
> a stack trace. If slub_debug is also turned on, each of them has
> *another* stack trace, so the overhead adds up, and on my tests (on
> ARCH=um, admittedly) 2/3rds of the allocations end up doing
> the stack tracing.
> 
> Turn off SLAB_STORE_USER if SLAB_NOLEAKTRACE was given, to avoid
> storing essentially the same data twice.
> 
> Signed-off-by: Johannes Berg 

Acked-by: Vlastimil Babka 

> ---
> Perhaps instead it should go the other way around, and kmemleak
> could even use/access the stack trace that's already in there ...
> But I don't really care too much, I can just turn off slub debug
> for the kmemleak caches via the command line anyway :-)
> 
> v2:
>  - strip SLAB_STORE_USER only coming from slub_debug so that the
>command line args always take effect
> 
> ---
>  mm/slub.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 34dcc09e2ec9..a66c9948c529 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1412,6 +1412,15 @@ slab_flags_t kmem_cache_flags(unsigned int object_size,
>   size_t len;
>   char *next_block;
>   slab_flags_t block_flags;
> + slab_flags_t slub_debug_local = slub_debug;
> +
> + /*
> +  * If the slab cache is for debugging (e.g. kmemleak) then
> +  * don't store user (stack trace) information by default,
> +  * but let the user enable it via the command line below.
> +  */
> + if (flags & SLAB_NOLEAKTRACE)
> + slub_debug_local &= ~SLAB_STORE_USER;
>  
>   len = strlen(name);
>   next_block = slub_debug_string;
> @@ -1446,7 +1455,7 @@ slab_flags_t kmem_cache_flags(unsigned int object_size,
>   }
>   }
>  
> - return flags | slub_debug;
> + return flags | slub_debug_local;
>  }
>  #else /* !CONFIG_SLUB_DEBUG */
>  static inline void setup_object_debug(struct kmem_cache *s,
> 



SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-14 Thread Vlastimil Babka
On 1/12/21 5:35 PM, Christoph Lameter wrote:
> On Tue, 12 Jan 2021, Jann Horn wrote:
> 
>> [This is not something I intend to work on myself. But since I
>> stumbled over this issue, I figured I should at least document/report
>> it, in case anyone is willing to pick it up.]
> 
> Well yeah all true. There is however a slabinfo tool that has an -s option
> to shrink all slabs.
> 
>   slabinfo -s
> 
> So you could put that somewhere that executes if the system is
> idle or put it into cron or so.

Hm, this would be similar to recommending a periodic "echo > drop_caches"
operation. We actually discourage that (and yeah, some tools do it, and
we now report those in dmesg). I believe the kernel, including SLUB, should
respond to memory pressure by itself and not OOM prematurely.


Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-13 Thread Vlastimil Babka
On 1/12/21 12:12 AM, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]
> 
> Hi!

Hi, thanks for saving me a lot of typing!

...

> This means that in practice, SLUB actually ends up keeping as many
> **pages** on the percpu partial lists as it intends to keep **free
> objects** there.

Yes, I concluded the same thing.

...

> I suspect that this may have also contributed to the memory wastage
> problem with memory cgroups that was fixed in v5.9
> (https://lore.kernel.org/linux-mm/20200623174037.3951353-1-g...@fb.com/);
> meaning that servers with lots of CPU cores running pre-5.9 kernels
> with memcg and systemd (which tends to stick every service into its
> own memcg) might be even worse off.

Very much yes. Investigating an increase of kmemcg usage of a workload between
an older kernel with SLAB and a 5.3-based kernel with SLUB led us to find the
same issue as you did. It doesn't help that slabinfo (global or per-memcg) is
also inaccurate, as it cannot count free objects on per-cpu partial slabs and
thus reports them as active. I was aware that some empty slab pages might linger
on per-cpu lists, but only seeing how many were freed after "echo 1 >
.../shrink" made me realize the extent of this.

> It also seems unsurprising to me that flushing ~30 pages out of the
> percpu partial caches at once with IRQs disabled would cause tail
> latency spikes (as noted by Joonsoo Kim and Christoph Lameter in
> commit 345c905d13a4e "slub: Make cpu partial slab support
> configurable").
> 
> At first I thought that this wasn't a significant issue because SLUB
> has a reclaim path that can trim the percpu partial lists; but as it
> turns out, that reclaim path is not actually wired up to the page
> allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
> (very rare) subsystem-specific calls into SLUB for specific slabs and
> by sysfs entries. So in userland processes will OOM even if SLUB still
> has megabytes of entirely unused pages lying around.

Yeah, we considered wiring the shrinking to memcg OOM, but it's a poor
solution. I'm considering introducing a proper shrinker that would be registered
and work like other shrinkers for reclaimable caches. Then we would make it
memcg-aware in our backport - upstream after v5.9 obviously doesn't need that.
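
Roughly this kind of thing (a sketch only, with hypothetical names, against the
shrinker API of this era; not an actual patch):

/*
 * Rough sketch: a shrinker that lets the page allocator's reclaim flush
 * empty slabs off SLUB's percpu partial lists.
 */
static unsigned long slub_partial_count(struct shrinker *shrink,
					struct shrink_control *sc)
{
	/* would return an estimate of free objects sitting on percpu
	 * partial lists across caches (0 means nothing to reclaim) */
	return 0;
}

static unsigned long slub_partial_scan(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	/* would flush (some of) the percpu partial lists, e.g. via a
	 * lighter variant of __kmem_cache_shrink() for the worst caches */
	return SHRINK_STOP;
}

static struct shrinker slub_partial_shrinker = {
	.count_objects	= slub_partial_count,
	.scan_objects	= slub_partial_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* registered once at init: register_shrinker(&slub_partial_shrinker); */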

> It might be a good idea to figure out whether it is possible to
> efficiently keep track of a more accurate count of the free objects on

As long as there are some in-use objects, it shouldn't matter much whether the
slab is sitting on a per-cpu partial list or the per-node list, as it can't be
freed anyway. It becomes a real problem only after the slab becomes fully free.
If we detected that in __slab_free() also for already-frozen slabs, we would need
to know which CPU the slab belongs to (currently that's not tracked afaik), and
send it an IPI to do some light version of unfreeze_partials() that would only
remove empty slabs. The trick would be not to cause too many IPIs by this,
obviously :/

Actually I'm somewhat wrong above. If a CPU's partial list and the per-node
partial list run out of free objects, it's wasteful to allocate new slabs while
almost-empty slabs sit on another CPU's per-cpu partial list.
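
A very rough sketch of the detection idea above (entirely hypothetical; the
page->frozen_cpu field and the helpers below don't exist today, this just shows
the shape):

/*
 * Hypothetical: if we tracked which CPU owns a frozen slab, __slab_free()
 * could notice the slab becoming fully free and ask that CPU to drop its
 * empty percpu partial slabs.
 */
static void flush_empty_percpu_partials(void *info)
{
	/* would walk this CPU's percpu partial list of cache 'info' and
	 * discard only the slabs that have no objects in use */
}

static void note_frozen_slab_became_empty(struct kmem_cache *s,
					  struct page *page)
{
	int cpu = page->frozen_cpu;	/* hypothetical owner tracking */

	/* would need rate limiting here to avoid an IPI storm */
	smp_call_function_single(cpu, flush_empty_percpu_partials, s, 0);
}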

> percpu partial lists; and if not, maybe change the accounting to
> explicitly track the number of partial pages, and use limits that are

That would probably be the simplest solution. Maybe it's sufficient upstream,
where the wastage only depends on the number of caches and not memcgs. For
pre-5.9 I also considered limiting the number of pages only for the per-memcg
clones :/ Currently, writing to the /sys/...//cpu_partial file is propagated to
all the clones and the root cache.

> more appropriate for that? And perhaps the page allocator reclaim path
> should also occasionally rip unused pages out of the percpu partial
> lists?

That would be best done by a shrinker?

BTW, SLAB does this by reaping its per-cpu and shared arrays from timers (which
works, but is not ideal). Those arrays also can't grow as large as this.




Re: [PATCH 1/2] kasan, mm: fix conflicts with init_on_alloc/free

2021-01-13 Thread Vlastimil Babka
On 1/13/21 5:03 PM, Andrey Konovalov wrote:
> A few places where SLUB accesses object's data or metadata were missed in
> a previous patch. This leads to false positives with hardware tag-based
> KASAN when bulk allocations are used with init_on_alloc/free.
> 
> Fix the false-positives by resetting pointer tags during these accesses.
> 
> Link: 
> https://linux-review.googlesource.com/id/I50dd32838a666e173fe06c3c5c766f2c36aae901
> Fixes: aa1ef4d7b3f67 ("kasan, mm: reset tags when accessing metadata")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Andrey Konovalov 

Acked-by: Vlastimil Babka 

> ---
>  mm/slub.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index dc5b42e700b8..75fb097d990d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2791,7 +2791,8 @@ static __always_inline void 
> maybe_wipe_obj_freeptr(struct kmem_cache *s,
>  void *obj)
>  {
>   if (unlikely(slab_want_init_on_free(s)) && obj)
> - memset((void *)((char *)obj + s->offset), 0, sizeof(void *));
> + memset((void *)((char *)kasan_reset_tag(obj) + s->offset),
> + 0, sizeof(void *));
>  }
>  
>  /*
> @@ -2883,7 +2884,7 @@ static __always_inline void *slab_alloc_node(struct 
> kmem_cache *s,
>   stat(s, ALLOC_FASTPATH);
>   }
>  
> - maybe_wipe_obj_freeptr(s, kasan_reset_tag(object));
> + maybe_wipe_obj_freeptr(s, object);

And in that case the reset was unnecessary, right? (The commit log only mentions
adding missing resets.)

>   if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object)
>   memset(kasan_reset_tag(object), 0, s->object_size);
> @@ -3329,7 +3330,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
> flags, size_t size,
>   int j;
>  
>   for (j = 0; j < i; j++)
> - memset(p[j], 0, s->object_size);
> + memset(kasan_reset_tag(p[j]), 0, s->object_size);
>   }
>  
>   /* memcg and kmem_cache debug support */
> 



Re: [PATCH] mm/slub: disable user tracing for kmemleak caches

2021-01-13 Thread Vlastimil Babka
On 1/13/21 5:09 PM, Johannes Berg wrote:
> From: Johannes Berg 
> 
> If kmemleak is enabled, it uses a kmem cache for its own objects.
> These objects are used to hold information kmemleak uses, including
> a stack trace. If slub_debug is also turned on, each of them has
> *another* stack trace, so the overhead adds up, and on my tests (on
> ARCH=um, admittedly) 2/3rds of the allocations end up doing
> the stack tracing.
> 
> Turn off SLAB_STORE_USER if SLAB_NOLEAKTRACE was given, to avoid
> storing essentially the same data twice.
> 
> Signed-off-by: Johannes Berg 

How about stripping away SLAB_STORE_USER only if it's added from the global
slub_debug variable? In case somebody lists one of the kmemleak caches
explicitly in "slub_debug=..." instead of just booting with "slub_debug", we
should honor that.
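
I.e. a minimal sketch of what I mean in kmem_cache_flags() (untested):

	slab_flags_t slub_debug_local = slub_debug;

	/*
	 * Strip the stack trace storage only from the global slub_debug
	 * flags; anything requested explicitly for this cache via
	 * slub_debug=... is parsed below and still takes effect.
	 */
	if (flags & SLAB_NOLEAKTRACE)
		slub_debug_local &= ~SLAB_STORE_USER;

	/* ... the existing slub_debug_string parsing ... */

	return flags | slub_debug_local;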

> ---
> Perhaps instead it should go the other way around, and kmemleak
> could even use/access the stack trace that's already in there ...
> But I don't really care too much, I can just turn off slub debug
> for the kmemleak caches via the command line anyway :-)
> 
> ---
>  mm/slub.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 34dcc09e2ec9..625a32a6645b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1446,7 +1446,16 @@ slab_flags_t kmem_cache_flags(unsigned int object_size,
>   }
>   }
>  
> - return flags | slub_debug;
> + flags |= slub_debug;
> +
> + /*
> +  * If the slab cache is for debugging (e.g. kmemleak) then
> +  * don't store user (stack trace) information.
> +  */
> + if (flags & SLAB_NOLEAKTRACE)
> + flags &= ~SLAB_STORE_USER;
> +
> + return flags;
>  }
>  #else /* !CONFIG_SLUB_DEBUG */
>  static inline void setup_object_debug(struct kmem_cache *s,
> 



Re: [PATCH] mm: slub: Convert sys slab alloc_calls, free_calls to bin attribute

2021-01-13 Thread Vlastimil Babka
On 1/12/21 10:21 AM, Faiyaz Mohammed wrote:
> Reading the sysfs slab alloc_calls, free_calls files returns the available
> object owners, but because sysfs attributes are limited to PAGE_SIZE, only
> partial owner info is returned. That is not sufficient to debug/account the
> slab memory, and the alloc_calls output does not match /proc/slabinfo.
> 
> To remove the PAGE_SIZE limitation, convert the sysfs slab
> alloc_calls, free_calls files to bin attributes.
> 
> Signed-off-by: Faiyaz Mohammed 
> ---
>  mm/slub.c | 61 +++--
>  1 file changed, 47 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b52384e..8744e5ec 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4710,13 +4710,14 @@ static void process_slab(struct loc_track *t, struct 
> kmem_cache *s,
>  }
>  
>  static int list_locations(struct kmem_cache *s, char *buf,
> - enum track_item alloc)
> + loff_t offset, enum track_item alloc)
>  {
>   int len = 0;
>   unsigned long i;
>   struct loc_track t = { 0, 0, NULL };
>   int node;
>   struct kmem_cache_node *n;
> + static unsigned int previous_read_count;

Hmm, static? What about parallel reads from different files? I guess you'll have
to somehow employ the offset parameter here, and it won't be pretty, because you
are still printing free text and not fixed-size binary chunks where seeking
is simple.
Also it's wasteful to repeat the data gathering for each printed page; you'd
need a mechanism that allows holding private data between printing out the
pages. If bin_attribute doesn't have that, you'd need e.g. seq_file, which we use
for /proc/pid/(s)maps etc.
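
Roughly the seq_file shape I mean, as a sketch only (hypothetical names, reusing
the existing struct loc_track / struct location; the gathering would happen once
at open time and be stored in m->private, so there is no static state and
parallel or partial reads just work):

static void *loc_start(struct seq_file *m, loff_t *pos)
{
	struct loc_track *t = m->private;	/* gathered at open time */

	return *pos < t->count ? &t->loc[*pos] : NULL;
}

static void *loc_next(struct seq_file *m, void *v, loff_t *pos)
{
	struct loc_track *t = m->private;

	++*pos;
	return *pos < t->count ? &t->loc[*pos] : NULL;
}

static void loc_stop(struct seq_file *m, void *v)
{
}

static int loc_show(struct seq_file *m, void *v)
{
	struct location *l = v;

	/* one location record per ->show() call */
	seq_printf(m, "%7ld ", l->count);
	if (l->addr)
		seq_printf(m, "%pS", (void *)l->addr);
	else
		seq_puts(m, "<not-available>");
	seq_putc(m, '\n');
	return 0;
}

static const struct seq_operations loc_seq_ops = {
	.start	= loc_start,
	.next	= loc_next,
	.stop	= loc_stop,
	.show	= loc_show,
};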

>   unsigned long *map = bitmap_alloc(oo_objects(s->max), GFP_KERNEL);

This line hasn't existed since 90e9f6a66c78f in v5.6-rc1; is the patch based on
an old kernel?

>   if (!map || !alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
> @@ -4742,11 +4743,9 @@ static int list_locations(struct kmem_cache *s, char 
> *buf,
>   spin_unlock_irqrestore(&n->list_lock, flags);
>   }
>  
> - for (i = 0; i < t.count; i++) {
> + for (i = previous_read_count; i < t.count; i++) {
>   struct location *l = &t.loc[i];
>  
> - if (len > PAGE_SIZE - KSYM_SYMBOL_LEN - 100)
> - break;
>   len += sprintf(buf + len, "%7ld ", l->count);
>  
>   if (l->addr)
> @@ -4784,12 +4783,20 @@ static int list_locations(struct kmem_cache *s, char 
> *buf,
>nodemask_pr_args(&l->nodes));
>  
>   len += sprintf(buf + len, "\n");
> +
> + if (len > PAGE_SIZE - KSYM_SYMBOL_LEN - 100) {
> + previous_read_count = i + 1;
> + break;
> + }
>   }
>  
> + if ((offset != 0) && ((i >= t.count) || (previous_read_count > 
> t.count))) {
> + previous_read_count = 0;
> + len = 0;
> + } else if (!t.count)
> + len += sprintf(buf, "No data\n");
>   free_loc_track(&t);
>   bitmap_free(map);
> - if (!t.count)
> - len += sprintf(buf, "No data\n");
>   return len;
>  }
>  
> @@ -5180,6 +5187,7 @@ static int any_slab_objects(struct kmem_cache *s)
>  
>  struct slab_attribute {
>   struct attribute attr;
> + struct bin_attribute bin_attr;
>   ssize_t (*show)(struct kmem_cache *s, char *buf);
>   ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
>  };
> @@ -5192,6 +5200,12 @@ struct slab_attribute {
>   static struct slab_attribute _name##_attr =  \
>   __ATTR(_name, 0600, _name##_show, _name##_store)
>  
> +#define SLAB_BIN_ATTR_RO(_name) \
> + static struct slab_attribute _name##_attr = { \
> + .bin_attr = \
> + __BIN_ATTR_RO(_name, 0) \
> + } \
> +
>  static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
>  {
>   return sprintf(buf, "%u\n", s->size);
> @@ -5535,21 +5549,33 @@ static ssize_t validate_store(struct kmem_cache *s,
>  }
>  SLAB_ATTR(validate);
>  
> -static ssize_t alloc_calls_show(struct kmem_cache *s, char *buf)
> +static ssize_t alloc_calls_read(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *bin_attr, char *buf,
> + loff_t offset, size_t count)
>  {
> + struct kmem_cache *s;
> +
> + s = to_slab(kobj);
>   if (!(s->flags & SLAB_STORE_USER))
>   return -ENOSYS;
> - return list_locations(s, buf, TRACK_ALLOC);
> +
> + return list_locations(s, buf, offset, TRACK_ALLOC);
>  }
> -SLAB_ATTR_RO(alloc_calls);
> +SLAB_BIN_ATTR_RO(alloc_calls);
>  
> -static ssize_t free_calls_show(struct kmem_cache *s, char *buf)
> +static ssize_t free_calls_read(struct file *filp, struct kobject *kobj,
> + struct bin_a

[PATCH 3/3] mm, slab, slub: stop taking cpu hotplug lock

2021-01-13 Thread Vlastimil Babka
SLAB has been using get/put_online_cpus() around creating, destroying and
shrinking kmem caches since 95402b382901 ("cpu-hotplug: replace per-subsystem
mutexes with get_online_cpus()") in 2008, which was supposed to replace
a private mutex (cache_chain_mutex, called slab_mutex today) with a system-wide
mechanism, but in the case of SLAB it is in fact used in addition to the existing
mutex, without explanation why.

SLUB appears to have avoided the cpu hotplug lock initially, but gained it due
to common code unification, such as 20cea9683ecc ("mm, sl[aou]b: Move
kmem_cache_create mutex handling to common code").

Regardless of the history, checking if the hotplug lock is actually needed
today suggests that it's not, and therefore it's better to avoid this
system-wide lock and the ordering this imposes wrt other locks (such as
slab_mutex).

Specifically, in SLAB we have for_each_online_cpu() in do_tune_cpucache()
protected by slab_mutex, and cpu hotplug callbacks that also take the
slab_mutex, which is also taken by the common slab functions that currently also
take the hotplug lock. Thus the slab_mutex protection should be sufficient.
Also, per-cpu array caches are allocated for each possible cpu, so they are not
affected by the cpus' online/offline state.

In SLUB we have for_each_online_cpu() in functions that show statistics and are
already unprotected today, as racing with hotplug is not harmful. Otherwise
SLUB relies on the percpu allocator. The slub_cpu_dead() hotplug callback takes
the slab_mutex.

To sum up, this patch removes get/put_online_cpus() calls from slab as it
should be safe without further adjustments.

Signed-off-by: Vlastimil Babka 
---
 mm/slab_common.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index e040b3820a75..0ca99ebefbf2 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -310,8 +310,6 @@ kmem_cache_create_usercopy(const char *name,
const char *cache_name;
int err;
 
-   get_online_cpus();
-
mutex_lock(&slab_mutex);
 
err = kmem_cache_sanity_check(name, size);
@@ -360,8 +358,6 @@ kmem_cache_create_usercopy(const char *name,
 out_unlock:
mutex_unlock(&slab_mutex);
 
-   put_online_cpus();
-
if (err) {
if (flags & SLAB_PANIC)
panic("kmem_cache_create: Failed to create slab '%s'. 
Error %d\n",
@@ -487,8 +483,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (unlikely(!s))
return;
 
-   get_online_cpus();
-
mutex_lock(&slab_mutex);
 
s->refcount--;
@@ -503,8 +497,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
}
 out_unlock:
mutex_unlock(&slab_mutex);
-
-   put_online_cpus();
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
@@ -521,12 +513,10 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 {
int ret;
 
-   get_online_cpus();
 
kasan_cache_shrink(cachep);
ret = __kmem_cache_shrink(cachep);
 
-   put_online_cpus();
return ret;
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
-- 
2.29.2



[PATCH 1/3] mm, slub: stop freeing kmem_cache_node structures on node offline

2021-01-13 Thread Vlastimil Babka
Commit e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()") has
fixed a problematic locking order by removing the memory hotplug lock
get/put_online_mems() from show_slab_objects(). During the discussion, it was
argued [1] that this is OK, because existing slabs on the node would prevent
a hotremove from proceeding.

That's true, but per-node kmem_cache_node structures are not necessarily
allocated on the same node and may exist even without actual slab pages
on the same node. Any path that uses get_node() directly or via
for_each_kmem_cache_node() (such as show_slab_objects()) can race with
freeing of kmem_cache_node even with the !NULL check, resulting in
use-after-free.

To that end, commit e4f8e513c3d3 argues in a comment that:

 * We don't really need mem_hotplug_lock (to hold off
 * slab_mem_going_offline_callback) here because slab's memory hot
 * unplug code doesn't destroy the kmem_cache->node[] data.

While it's true that slab_mem_going_offline_callback() doesn't free
the kmem_cache_node, the later callback slab_mem_offline_callback() actually
does, so the race and use-after-free exist. Not just for show_slab_objects()
after commit e4f8e513c3d3, but also in many other places that are not under
slab_mutex. And adding slab_mutex locking or other synchronization to SLUB
paths such as get_any_partial() would be bad for performance and error-prone.

The easiest solution is therefore to make the abovementioned comment true and
stop freeing the kmem_cache_node structures, accepting some wasted memory in
the full memory node removal scenario. Analogously, we also don't free a
hotremoved pgdat as mentioned in [1], nor the similar per-node structures in
SLAB. Importantly, this approach will not block the hotremove, as generally such
nodes should be movable for hotremove to succeed in the first place, and thus
the GFP_KERNEL-allocated kmem_cache_node will come from elsewhere.

[1] https://lore.kernel.org/linux-mm/20190924151147.gb23...@dhcp22.suse.cz/

Signed-off-by: Vlastimil Babka 
---
 mm/slub.c | 28 +++-
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index daf5ca1755d5..0d01a893cb64 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4286,8 +4286,6 @@ static int slab_mem_going_offline_callback(void *arg)
 
 static void slab_mem_offline_callback(void *arg)
 {
-   struct kmem_cache_node *n;
-   struct kmem_cache *s;
struct memory_notify *marg = arg;
int offline_node;
 
@@ -4301,21 +4299,11 @@ static void slab_mem_offline_callback(void *arg)
return;
 
mutex_lock(&slab_mutex);
-   list_for_each_entry(s, &slab_caches, list) {
-   n = get_node(s, offline_node);
-   if (n) {
-   /*
-* if n->nr_slabs > 0, slabs still exist on the node
-* that is going down. We were unable to free them,
-* and offline_pages() function shouldn't call this
-* callback. So, we must fail.
-*/
-   BUG_ON(slabs_node(s, offline_node));
-
-   s->node[offline_node] = NULL;
-   kmem_cache_free(kmem_cache_node, n);
-   }
-   }
+   /*
+* We no longer free kmem_cache_node structures here, as it would be
+* racy with all get_node() users, and infeasible to protect them with
+* slab_mutex.
+*/
mutex_unlock(&slab_mutex);
 }
 
@@ -4341,6 +4329,12 @@ static int slab_mem_going_online_callback(void *arg)
 */
mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
+   /*
+* The structure may already exist if the node was previously
+* onlined and offlined.
+*/
+   if (get_node(s, nid))
+   continue;
/*
 * XXX: kmem_cache_alloc_node will fallback to other nodes
 *  since memory is not yet available from the node that
-- 
2.29.2



[PATCH 2/3] mm, slab, slub: stop taking memory hotplug lock

2021-01-13 Thread Vlastimil Babka
Since commit 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for SLAB
and SLUB when creating, destroying or shrinking a cache. It is quite a heavy
lock and it's best to avoid it if possible, as we had several issues with
lockdep complaining about ordering in the past, see e.g. e4f8e513c3d3
("mm/slub: fix a deadlock in show_slab_objects()").

The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock) can be
summarized as follows: while there's slab_mutex synchronizing new kmem cache
creation and SLUB's MEM_GOING_ONLINE callback slab_mem_going_online_callback(),
we may miss creation of kmem_cache_node for the hotplugged node in the new kmem
cache, because the hotplug callback doesn't yet see the new cache, and cache
creation in init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
N_NORMAL_MEMORY nodemask, which however may not yet include the new node, as
that happens only later after the MEM_GOING_ONLINE callback.

Instead of using get/put_online_mems(), the problem can be solved by SLUB
maintaining its own nodemask of nodes for which it has allocated the per-node
kmem_cache_node structures. This nodemask would generally mirror the
N_NORMAL_MEMORY nodemask, but would be updated only under SLUB's control in
its memory hotplug callbacks, under the slab_mutex. This patch adds such a
nodemask and its handling.

Commit 03afc0e25f7f mentions "issues like [the one above]", but there don't
appear to be further issues. All the paths (shared for SLAB and SLUB) taking
the memory hotplug locks are also taking the slab_mutex, except
kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
get/put_online_mems().

We however cannot simply restore slab_mutex in kmem_cache_shrink(), as SLUB can
enter the function from a write to the sysfs 'shrink' file, thus holding the
kernfs lock, and in kmem_cache_create() the kernfs lock is nested within
slab_mutex. But on closer inspection we don't actually need to protect
kmem_cache_shrink() from hotplug callbacks: While SLUB's __kmem_cache_shrink()
does for_each_kmem_cache_node(), missing a new node added in parallel hotplug is
not fatal, and parallel hotremove does not free kmem_cache_node's anymore after
the previous patch, so a use-after-free cannot happen. The per-node shrinking
itself is protected by n->list_lock. The same is true for SLAB, and SLOB is a
no-op.

SLAB also doesn't need the memory hotplug locking, which it only gained by
03afc0e25f7f through the shared paths in slab_common.c. Its memory hotplug
callbacks are also protected by slab_mutex against races with these paths. The
problem of SLUB relying on N_NORMAL_MEMORY doesn't apply to SLAB, as its
setup_kmem_cache_nodes relies on N_ONLINE, and the new node is already set
there during the MEM_GOING_ONLINE callback, so no special care is needed
for SLAB.

As such, this patch removes all get/put_online_mems() usage by the slab
subsystem.

Signed-off-by: Vlastimil Babka 
---
 mm/slab_common.c |  8 ++--
 mm/slub.c| 28 +---
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index d5986ebb84ea..e040b3820a75 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -311,7 +311,6 @@ kmem_cache_create_usercopy(const char *name,
int err;
 
get_online_cpus();
-   get_online_mems();
 
mutex_lock(&slab_mutex);
 
@@ -361,7 +360,6 @@ kmem_cache_create_usercopy(const char *name,
 out_unlock:
mutex_unlock(&slab_mutex);
 
-   put_online_mems();
put_online_cpus();
 
if (err) {
@@ -490,7 +488,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
return;
 
get_online_cpus();
-   get_online_mems();
 
mutex_lock(&slab_mutex);
 
@@ -507,7 +504,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
 out_unlock:
mutex_unlock(&slab_mutex);
 
-   put_online_mems();
put_online_cpus();
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
@@ -526,10 +522,10 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
int ret;
 
get_online_cpus();
-   get_online_mems();
+
kasan_cache_shrink(cachep);
ret = __kmem_cache_shrink(cachep);
-   put_online_mems();
+
put_online_cpus();
return ret;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 0d01a893cb64..0d4bdf6783ee 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -236,6 +236,14 @@ static inline void stat(const struct kmem_cache *s, enum 
stat_item si)
 #endif
 }
 
+/*
+ * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
+ * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
+ * differ during memory hotplug/hotremove operations.
+ * Protected by slab_mutex.
+ */
+static nodemask_t slab_nodes;
+
 /

[PATCH 0/3] mm, slab, slub: remove cpu and memory hotplug locks

2021-01-13 Thread Vlastimil Babka
Changes since RFC:
- lockdep didn't like reintroducing slab_mutex in kmem_cache_shrink(), so
  instead explain why it's not needed after all (in Patch 2)
- the above is only safe with Patch 1 already in place, so order it first
  (current Patch 1 was Patch 3 in RFC)

Some related work caused me to look at how we use get/put_online_mems() and
get/put_online_cpus() during kmem cache creation/destruction/shrinking, and
realize that it should actually be safe to remove all of that with rather small
effort (as e.g. Michal Hocko already suspected in some of the past discussions).
This has the benefit of avoiding rather heavy locks that have caused
locking order issues in the past. So this is the result: Patches 2 and
3 remove memory hotplug and cpu hotplug locking, respectively. Patch 1 is due
to the realization that in fact some races exist despite the locks (even if not
removed), but the most sane solution is not to introduce more locking, but
rather accept some wasted memory in scenarios that should be rare anyway (full
memory hot remove), as we do the same in other contexts already.

Vlastimil Babka (3):
  mm, slub: stop freeing kmem_cache_node structures on node offline
  mm, slab, slub: stop taking memory hotplug lock
  mm, slab, slub: stop taking cpu hotplug lock

 mm/slab_common.c | 18 ++--
 mm/slub.c| 56 +++-
 2 files changed, 38 insertions(+), 36 deletions(-)

-- 
2.29.2



Re: [PATCH v3] proc_sysctl: fix oops caused by incorrect command parameters.

2021-01-12 Thread Vlastimil Babka
On 1/12/21 8:24 AM, Michal Hocko wrote:
>> > > 
>> > > If we're going to do a separate "patch: make process_sysctl_arg()
>> > > return an errno instead of 0" then fine, we can discuss that.  But it's
>> > > conceptually a different work from fixing this situation.
>> > > .
>> > > 
>> > However, are the logs generated by process_sysctl_arg() clearer and more 
>> > accurate than parse_args()? Should the logs generated by 
>> > process_sysctl_arg() be deleted?
>> 
>> I think the individual logs are very useful and should be retained.
> 
> Yes, other sysfs specific error messages are likely useful. I just fail
> to see why a missing value should be handled here when there is an
> existing handling in the caller. Not sure whether a complete shadow
> reporting in process_sysctl_arg is a deliberate decision or not.
> Vlastimil?

Yes, it's a way to have more useful sysctl-specific reports than the generic
ones. And I think I was inspired by some other existing code, but don't remember
exactly. The options are:

1) keep the current sysctl-specific reports and return 0, as the values are only
consumed here
2) be silent and return an error, inventing new error codes to make the generic
report more useful for sysctl, but inevitably losing some nuances anyway
3) a mix where 2) is used for situations where the generic report is sufficient,
and 1) where it's not

Patch v2 went with option 1), v3 with option 3). I think it's down to
preferences. I would personally go with v2 and a message similar to the existing
ones, i.e.:

"Failed to set sysctl parameter '%s': no value given\n"

Also, we seem to be silently doing nothing when strlen(val) == 0, i.e. when
"hung_task_panic=" was passed. That's worth reporting with the same error.

But v3 is fine with me as well. The generic error message works. We could just
add "if (!len) return -EINVAL" below the strlen() call.

Also please Cc: stable.

> Anyway one way or the other, all I care about is to have a reporting in
> place because this shouldn't be a silent failure.
> 



Re: [PATCH] mm, compaction: move high_pfn to the for loop scope.

2021-01-12 Thread Vlastimil Babka
On 1/12/21 10:47 AM, Rokudo Yan wrote:
> In fast_isolate_freepages(), high_pfn will be used if a preferred one (PFN >=
> low_pfn) is not found. But high_pfn is not reset before searching a free area,
> so when it is used as a freepage, it may come from another free area searched
> before. move_freelist_head(freelist, freepage) will then have unexpected
> behavior (e.g. corrupt the MOVABLE freelist).
> 
> Unable to handle kernel paging request at virtual address dead0200
> Mem abort info:
>   ESR = 0x9644
>   Exception class = DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
> Data abort info:
>   ISV = 0, ISS = 0x0044
>   CM = 0, WnR = 1
> [dead0200] address between user and kernel address ranges
> 
> -000|list_cut_before(inline)
> -000|move_freelist_head(inline)
> -000|fast_isolate_freepages(inline)
> -000|isolate_freepages(inline)
> -000|compaction_alloc(?, ?)
> -001|unmap_and_move(inline)
> -001|migrate_pages([NSD:0xFF80088CBBD0] from = 0xFF80088CBD88, 
> [NSD:0xFF80088CBBC8] get_new_p
> -002|__read_once_size(inline)
> -002|static_key_count(inline)
> -002|static_key_false(inline)
> -002|trace_mm_compaction_migratepages(inline)
> -002|compact_zone(?, [NSD:0xFF80088CBCB0] capc = 0x0)
> -003|kcompactd_do_work(inline)
> -003|kcompactd([X19] p = 0xFF93227FBC40)
> -004|kthread([X20] _create = 0xFFE1AFB26380)
> -005|ret_from_fork(asm)
> ---|end of frame
> 
> The issue was reported on a smartphone product with 6GB RAM and 3GB zram as
> the swap device.
> 
> This patch fixes the issue by resetting high_pfn before searching each free
> area, which ensures that freepage and freelist match when calling
> move_freelist_head() in fast_isolate_freepages().
> 
> Link: 
> http://lkml.kernel.org/r/20190118175136.31341-12-mgor...@techsingularity.net
> Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a 
> migration target")

Sounds serious enough. Wonder how it wasn't reported sooner.

Cc: 
Acked-by: Vlastimil Babka 

> ---
>  mm/compaction.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index cc1a7f600a86..75f0e550b18f 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1303,7 +1303,7 @@ fast_isolate_freepages(struct compact_control *cc)
>  {
>   unsigned int limit = min(1U, freelist_scan_limit(cc) >> 1);
>   unsigned int nr_scanned = 0;
> - unsigned long low_pfn, min_pfn, high_pfn = 0, highest = 0;
> + unsigned long low_pfn, min_pfn, highest = 0;
>   unsigned long nr_isolated = 0;
>   unsigned long distance;
>   struct page *page = NULL;
> @@ -1348,6 +1348,7 @@ fast_isolate_freepages(struct compact_control *cc)
>   struct page *freepage;
>   unsigned long flags;
>   unsigned int order_scanned = 0;
> + unsigned long high_pfn = 0;
>  
>   if (!area->nr_free)
>   continue;
> 



Re: [RFC 0/3] mm, slab, slub: remove cpu and memory hotplug locks

2021-01-11 Thread Vlastimil Babka


On 1/6/21 8:09 PM, Christoph Lameter wrote:
> On Wed, 6 Jan 2021, Vlastimil Babka wrote:
> 
>> rather accept some wasted memory in scenarios that should be rare anyway 
>> (full
>> memory hot remove), as we do the same in other contexts already. It's all RFC
>> for now, as I might have missed some reason why it's not safe.
> 
> Looks good to me. My only concern is the kernel that has hotplug disabled.
> Current code allows the online/offline checks to be optimized away.
> 
> Can this patch be enhanced to do the same?

Thanks, indeed I didn't think about that.
But on closer inspection, there doesn't seem to be a need for such an enhancement:

- Patch 1 adds the new slab_nodes nodemask, which is basically a copy of
N_NORMAL_MEMORY. Without memory hotplug, the callbacks that would update it
don't occur (maybe they are even eliminated as dead code?), and other code that
uses the nodemask is unaffected wrt performance; it's just using a different
nodemask for the same operations. The extra memory usage of adding the nodemask
is negligible and not worth complicating the code to distinguish between the new
nodemask and N_NORMAL_MEMORY depending on hotplug being disabled or enabled.

- Patch 1 also restores the slab_mutex lock in kmem_cache_shrink(). Commit
03afc0e25f7f removed it because the memory hotplug lock was deemed to be a
sufficient replacement, but probably didn't consider the case where hotplug is
disabled and thus the hotplug lock is a no-op. Maybe it's safe not to take
slab_mutex in kmem_cache_shrink() in that case, but kmem_cache_shrink() is only
called from a sysfs handler, thus a very cold path anyway.
But I found out that lockdep complains about it, so I have to rethink this part
anyway... (when kmem_cache_shrink() is called from a write to the 'shrink' file
we already hold the kernfs lock "kn->active#28" and try to take slab_mutex, but
there's an existing dependency in the reverse order, where in kmem_cache_create()
we start with slab_mutex and sysfs_slab_add() takes the kernfs lock; I wonder how
this wasn't a problem before 03afc0e25f7f)

- Patch 2 purely just removes calls to cpu hotplug lock.

- Patch 3 only affects memory hotplug callbacks so nothing to enhance if hotplug
is disabled




Re: [PATCH mm,percpu_ref,rcu 1/6] mm: Add mem_dump_obj() to print source of memory block

2021-01-08 Thread Vlastimil Babka
On 1/8/21 8:01 PM, Paul E. McKenney wrote:
> 
> Andrew pushed this to an upstream maintainer, but I have not seen these
> patches appear anywhere.  So if that upstream maintainer was Linus, I can
> send a follow-up patch once we converge.  If the upstream maintainer was
> in fact me, I can of course update the commit directly.  If the upstream
> maintainer was someone else, please let me know who it is.  ;-)
> 
> (Linus does not appear to have pushed anything out since before Andrew's
> email, hence my uncertainty.)

I've wondered about the mm-commits messages too, and concluded that the most
probable explanation is that Andrew tried to add your series to mmotm, then
tried an mmotm merge to linux-next and found out the series is already there via
your rcu tree :)
 
>> > --- a/mm/slab.c
>> > +++ b/mm/slab.c
>> > @@ -3635,6 +3635,26 @@ void *__kmalloc_node_track_caller(size_t size, 
>> > gfp_t flags,
>> >  EXPORT_SYMBOL(__kmalloc_node_track_caller);
>> >  #endif /* CONFIG_NUMA */
>> >  
>> > +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page 
>> > *page)
>> > +{
>> > +  struct kmem_cache *cachep;
>> > +  unsigned int objnr;
>> > +  void *objp;
>> > +
>> > +  kpp->kp_ptr = object;
>> > +  kpp->kp_page = page;
>> > +  cachep = page->slab_cache;
>> > +  kpp->kp_slab_cache = cachep;
>> > +  objp = object - obj_offset(cachep);
>> > +  kpp->kp_data_offset = obj_offset(cachep);
>> > +  page = virt_to_head_page(objp);
>> 
>> Hm when can this page be different from the "page" we got as function 
>> parameter?
>> I guess only if "object" was so close to the beginning of page that "object -
>> obj_offset(cachep)" underflowed it. So either "object" pointed to the
>> padding/redzone, or even below page->s_mem. Both situations sounds like we
>> should handle them differently than continuing with an unrelated page that's
>> below our slab page?
> 
> I examined other code to obtain this.  I have been assuming that the
> point was to be able to handle multipage slabs, but I freely confess to
> having no idea.  But I am reluctant to change this sequence unless the
> other code translating from pointer to in-slab object is also changed.

OK, I will check the other code.

>> > +  objnr = obj_to_index(cachep, page, objp);
>> 
>> Related, this will return bogus value for objp below page->s_mem.
>> And if our "object" pointer pointed beyond last valid object, this will give 
>> us
>> too large index.
>> 
>> 
>> > +  objp = index_to_obj(cachep, page, objnr);
>> 
>> Too large index can cause dbg_userword to be beyond our page.
>> In SLUB version you have the WARN_ON_ONCE that catches such invalid pointers
>> (before first valid object or after last valid object) and skips getting the
>> backtrace for those, so analogical thing should probably be done here?
> 
> Like this, just before the "objp =" statement?
> 
>   WARN_ON_ONCE(objnr >= cachep->num);

Yeah, that should do the trick to prevent accessing garbage dbg_userword.

But I wrote the comments about SLAB first, and only in the SLUB part did I
realize the larger picture. So now I would consider something like below, which
should find the closest valid object index, and the resulting pointer offset in
kmem_dump_obj() might become negative. Pointers to padding, below page->s_mem,
or beyond the last object just become respectively large negative or positive
pointer offsets, and we probably don't need to warn for them specially unless we
also warn for all other pointers that are not within the "data area" of the
object.

void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page)
{
	struct kmem_cache *cachep;
	unsigned int objnr;
	void *objp;

	kpp->kp_ptr = object;
	kpp->kp_page = page;
	cachep = page->slab_cache;
	kpp->kp_slab_cache = cachep;
	kpp->kp_data_offset = obj_offset(cachep);
	if (object < page->s_mem)
		objnr = 0;
	else
		objnr = obj_to_index(cachep, page, object);
	if (objnr >= cachep->num)
		objnr = cachep->num - 1;
	objp = index_to_obj(cachep, page, objnr);
	kpp->kp_objp = objp;
	if (DEBUG && cachep->flags & SLAB_STORE_USER)
		kpp->kp_ret = *dbg_userword(cachep, objp);
}

 
>> > +  kpp->kp_objp = objp;
>> > +  if (DEBUG && cachep->flags & SLAB_STORE_USER)
>> > +  kpp->kp_ret = *dbg_userword(cachep, objp);
>> > +}
>> > +
>> > diff --git a/mm/slub.c b/mm/slub.c
>> > index 0c8b43a..3c1a843 100644
>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -3919,6 +3919,46 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
>> >return 0;
>> >  }
>> >  
>> > +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page 
>> > *page)
>> > +{
>> > +  void *base;
>> > +  int __maybe_unused i;
>> > +  unsigned int objnr;
>> > +  void *objp;
>> > +  void *objp0;
>> > +  struct kmem_cache *s = page->slab_cache;
>> > +  struct track __maybe_unused *trackp;
>> > +
>> > +  kpp->kp_

Re: [PATCH v4 sl-b 0/6] Export return addresses etc. for better diagnostics

2021-01-08 Thread Vlastimil Babka
On 1/8/21 1:26 AM, Paul E. McKenney wrote:
> On Wed, Jan 06, 2021 at 03:42:12PM -0800, Paul E. McKenney wrote:
>> On Wed, Jan 06, 2021 at 01:48:43PM -0800, Andrew Morton wrote:
>> > On Tue, 5 Jan 2021 17:16:03 -0800 "Paul E. McKenney"  
>> > wrote:
>> > 
>> > > This is v4 of the series the improves diagnostics by providing access
>> > > to additional information including the return addresses, slab names,
>> > > offsets, and sizes collected by the sl*b allocators and by vmalloc().
>> > 
>> > Looks reasonable.  And not as bloaty as I feared, but it does add ~300
>> > bytes to my allnoconfig build.  Is the CONFIG_ coverage as tight as it
>> > could be?
>> 
>> Glad I managed to exceed your expectations.  ;-)
>> 
>> Let's see...  When I do an allnoconfig build, it has CONFIG_PRINTK=n.
>> Given that this facility is based on printk(), I could create an
>> additional patch that #ifdef'ed everything out in CONFIG_PRINTK=n kernels.
>> This would uglify things a bit, but it would save ~300 bytes.
>> 
>> If I don't hear otherwise from you, I will put something together,
>> test it, and send it along.
> 
> And is a first draft, very lightly tested.  Thoughts?

Looks OK, but I'm not sure it's worth the trouble; there's probably tons of
other code that could be treated like this, and nobody is doing that
systematically (at least I haven't heard about a kernel tinyfication effort for
years)... Up to Andrew, I guess.

Thanks,
Vlastimil

>   Thanx, Paul
> 
> 
> 
> commit 4164efdca255093a423b55f44bd788b46d9c648f
> Author: Paul E. McKenney 
> Date:   Thu Jan 7 13:46:11 2021 -0800
> 
> mm: Don't build mm_dump_obj() on CONFIG_PRINTK=n kernels
> 
> The mem_dump_obj() functionality adds a few hundred bytes, which is a
> small price to pay.  Except on kernels built with CONFIG_PRINTK=n, in
> which mem_dump_obj() messages will be suppressed.  This commit therefore
> makes mem_dump_obj() be a static inline empty function on kernels built
> with CONFIG_PRINTK=n and excludes all of its support functions as well.
> This avoids kernel bloat on systems that cannot use mem_dump_obj().
> 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> Cc: David Rientjes 
> Cc: Joonsoo Kim 
> Cc: 
> Suggested-by: Andrew Morton 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af7d050..ed75a3a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3169,7 +3169,11 @@ unsigned long wp_shared_mapping_range(struct 
> address_space *mapping,
>  
>  extern int sysctl_nr_trim_pages;
>  
> +#ifdef CONFIG_PRINTK
>  void mem_dump_obj(void *object);
> +#else
> +static inline void mem_dump_obj(void *object) {}
> +#endif
>  
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 7ae6040..0c97d78 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -186,8 +186,10 @@ void kfree(const void *);
>  void kfree_sensitive(const void *);
>  size_t __ksize(const void *);
>  size_t ksize(const void *);
> +#ifdef CONFIG_PRINTK
>  bool kmem_valid_obj(void *object);
>  void kmem_dump_obj(void *object);
> +#endif
>  
>  #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
>  void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index c18f475..3c8a765 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -246,7 +246,7 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
>  int register_vmap_purge_notifier(struct notifier_block *nb);
>  int unregister_vmap_purge_notifier(struct notifier_block *nb);
>  
> -#ifdef CONFIG_MMU
> +#if defined(CONFIG_MMU) && defined(CONFIG_PRINTK)
>  bool vmalloc_dump_obj(void *object);
>  #else
>  static inline bool vmalloc_dump_obj(void *object) { return false; }
> diff --git a/mm/slab.c b/mm/slab.c
> index dcc55e7..965d277 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3635,6 +3635,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t 
> flags,
>  EXPORT_SYMBOL(__kmalloc_node_track_caller);
>  #endif /* CONFIG_NUMA */
>  
> +#ifdef CONFIG_PRINTK
>  void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page 
> *page)
>  {
>   struct kmem_cache *cachep;
> @@ -3654,6 +3655,7 @@ void kmem_obj_info(struct kmem_obj_info *kpp, void 
> *object, struct page *page)
>   if (DEBUG && cachep->flags & SLAB_STORE_USER)
>   kpp->kp_ret = *dbg_userword(cachep, objp);
>  }
> +#endif
>  
>  /**
>   * __do_kmalloc - allocate memory
> diff --git a/mm/slab.h b/mm/slab.h
> index ecad9b5..d2e39ab 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -615,6 +615,7 @@ static inline bool slab_want_init_on_free(struct 
> kmem_cache *c)
>   return false;
>  }
>  
> +#ifdef CONFIG_PRINTK
>  #define KS_ADDRS_COUNT

Re: [PATCH mm,percpu_ref,rcu 4/6] mm: Make mem_obj_dump() vmalloc() dumps include start and length

2021-01-08 Thread Vlastimil Babka
On 1/6/21 2:17 AM, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> This commit adds the starting address and number of pages to the vmalloc()
> information dumped by way of vmalloc_dump_obj().
> 
> Cc: Andrew Morton 
> Cc: Joonsoo Kim 
> Cc: 
> Reported-by: Andrii Nakryiko 
> Suggested-by: Vlastimil Babka 
> Signed-off-by: Paul E. McKenney 

Acked-by: Vlastimil Babka 

> ---
>  mm/vmalloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c274ea4..e3229ff 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3456,7 +3456,8 @@ bool vmalloc_dump_obj(void *object)
>   vm = find_vm_area(objp);
>   if (!vm)
>   return false;
> - pr_cont(" vmalloc allocated at %pS\n", vm->caller);
> + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
> + vm->nr_pages, (unsigned long)vm->addr, vm->caller);
>   return true;
>  }
>  
> 



Re: [PATCH mm,percpu_ref,rcu 3/6] mm: Make mem_dump_obj() handle vmalloc() memory

2021-01-08 Thread Vlastimil Babka
On 1/6/21 2:17 AM, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> This commit adds vmalloc() support to mem_dump_obj().  Note that the
> vmalloc_dump_obj() function combines the checking and dumping, in
> contrast with the split between kmem_valid_obj() and kmem_dump_obj().
> The reason for the difference is that the checking in the vmalloc()
> case involves acquiring a global lock, and redundant acquisitions of
> global locks should be avoided, even on not-so-fast paths.
> 
> Note that this change causes on-stack variables to be reported as
> vmalloc() storage from kernel_clone() or similar, depending on the degree
> of inlining that your compiler does.  This is likely more helpful than
> the earlier "non-paged (local) memory".
> 
> Cc: Andrew Morton 
> Cc: Joonsoo Kim 
> Cc: 
> Reported-by: Andrii Nakryiko 
> Signed-off-by: Paul E. McKenney 

Acked-by: Vlastimil Babka 


Re: [PATCH mm,percpu_ref,rcu 1/6] mm: Add mem_dump_obj() to print source of memory block

2021-01-08 Thread Vlastimil Babka
On 1/6/21 2:17 AM, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> There are kernel facilities such as per-CPU reference counts that give
> error messages in generic handlers or callbacks, whose messages are
> unenlightening.  In the case of per-CPU reference-count underflow, this
> is not a problem when creating a new use of this facility because in that
> case the bug is almost certainly in the code implementing that new use.
> However, trouble arises when deploying across many systems, which might
> exercise corner cases that were not seen during development and testing.
> Here, it would be really nice to get some kind of hint as to which of
> several uses the underflow was caused by.
> 
> This commit therefore exposes a mem_dump_obj() function that takes
> a pointer to memory (which must still be allocated if it has been
> dynamically allocated) and prints available information on where that
> memory came from.  This pointer can reference the middle of the block as
> well as the beginning of the block, as needed by things like RCU callback
> functions and timer handlers that might not know where the beginning of
> the memory block is.  These functions and handlers can use mem_dump_obj()
> to print out better hints as to where the problem might lie.
> 
> The information printed can depend on kernel configuration.  For example,
> the allocation return address can be printed only for slab and slub,
> and even then only when the necessary debug has been enabled.  For slab,
> build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
> to the next power of two or use the SLAB_STORE_USER when creating the
> kmem_cache structure.  For slub, build with CONFIG_SLUB_DEBUG=y and
> boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
> if more focused use is desired.  Also for slub, use CONFIG_STACKTRACE
> to enable printing of the allocation-time stack trace.
> 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> Cc: David Rientjes 
> Cc: Joonsoo Kim 
> Cc: Andrew Morton 
> Cc: 
> Reported-by: Andrii Nakryiko 
> [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
> [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
> [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
> [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
> [ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
> Acked-by: Joonsoo Kim 

Acked-by: Vlastimil Babka 

Some nits below:

> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3635,6 +3635,26 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t 
> flags,
>  EXPORT_SYMBOL(__kmalloc_node_track_caller);
>  #endif /* CONFIG_NUMA */
>  
> +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page 
> *page)
> +{
> + struct kmem_cache *cachep;
> + unsigned int objnr;
> + void *objp;
> +
> + kpp->kp_ptr = object;
> + kpp->kp_page = page;
> + cachep = page->slab_cache;
> + kpp->kp_slab_cache = cachep;
> + objp = object - obj_offset(cachep);
> + kpp->kp_data_offset = obj_offset(cachep);
> + page = virt_to_head_page(objp);

Hm, when can this page be different from the "page" we got as a function
parameter? I guess only if "object" was so close to the beginning of the page
that "object - obj_offset(cachep)" underflowed it. So either "object" pointed to
the padding/redzone, or even below page->s_mem. Both situations sound like we
should handle them differently than continuing with an unrelated page that's
below our slab page?

> + objnr = obj_to_index(cachep, page, objp);

Related, this will return a bogus value for objp below page->s_mem.
And if our "object" pointer pointed beyond the last valid object, this will give
us too large an index.


> + objp = index_to_obj(cachep, page, objnr);

A too large index can cause dbg_userword to be beyond our page.
In the SLUB version you have the WARN_ON_ONCE that catches such invalid pointers
(before the first valid object or after the last valid object) and skips getting
the backtrace for those, so an analogous thing should probably be done here?

> + kpp->kp_objp = objp;
> + if (DEBUG && cachep->flags & SLAB_STORE_USER)
> + kpp->kp_ret = *dbg_userword(cachep, objp);
> +}
> +
> diff --git a/mm/slub.c b/mm/slub.c
> index 0c8b43a..3c1a843 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3919,6 +3919,46 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
>   return 0;
>  }
>  
> +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page 
> *page)
> +{
> + void *base;
> + int __maybe_unused i;
> + u

[PATCH] MAINTAINERS: add myself as slab allocators maintainer

2021-01-08 Thread Vlastimil Babka
I would like to help with slab allocators maintenance, from the perspective of
being responsible for SLAB and more recently also SLUB in an enterprise distro
kernel and supporting its users. Recently I've been focusing on improving
SLUB's debugging features, and patch review in the area, including the kmemcg
accounting rewrite last year.

Signed-off-by: Vlastimil Babka 
---
Hi,

this might perhaps look odd with 4 people (plus Andrew) already listed, but on
closer look we have 2 (or 3 if you count SLOB) allocators and the focus of each
maintainer varies. Maybe this would also be an opportunity to split the area
explicitly, if the maintainers want that? I'm fine with all of them
personally - while we moved to SLUB in new code streams, older ones with SLAB
will continue to be supported for years anyway.

Thanks,
Vlastimil

 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index da4892dc52b4..fd657bb4157f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16351,6 +16351,7 @@ M:  Pekka Enberg 
 M: David Rientjes 
 M: Joonsoo Kim 
 M: Andrew Morton 
+M: Vlastimil Babka 
 L: linux...@kvack.org
 S: Maintained
 F: include/linux/sl?b*.h
-- 
2.29.2



Re: [PATCH] mm/memcontrol: fix warning in mem_cgroup_page_lruvec()

2021-01-07 Thread Vlastimil Babka
On 1/4/21 6:03 AM, Hugh Dickins wrote:
> Boot a CONFIG_MEMCG=y kernel with "cgroup_disabled=memory" and you are
> met by a series of warnings from the VM_WARN_ON_ONCE_PAGE(!memcg, page)
> recently added to the inline mem_cgroup_page_lruvec().
> 
> An earlier attempt to place that warning, in mem_cgroup_lruvec(), had
> been careful to do so after weeding out the mem_cgroup_disabled() case;
> but was itself invalid because of the mem_cgroup_lruvec(NULL, pgdat) in
> clear_pgdat_congested() and age_active_anon().
> 
> Warning in mem_cgroup_page_lruvec() was once useful in detecting a KSM
> charge bug, so may be worth keeping: but skip if mem_cgroup_disabled().
> 
> Fixes: 9a1ac2288cf1 ("mm/memcontrol:rewrite mem_cgroup_page_lruvec()")
> Signed-off-by: Hugh Dickins 

Acked-by: Vlastimil Babka 

> ---
> 
>  include/linux/memcontrol.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- 5.11-rc2/include/linux/memcontrol.h   2020-12-27 20:39:36.751923135 
> -0800
> +++ linux/include/linux/memcontrol.h  2021-01-03 19:38:24.822978559 -0800
> @@ -665,7 +665,7 @@ static inline struct lruvec *mem_cgroup_
>  {
>   struct mem_cgroup *memcg = page_memcg(page);
>  
> - VM_WARN_ON_ONCE_PAGE(!memcg, page);
> + VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);

Nit: I would reverse the order of the conditions, as mem_cgroup_disabled() is
either "return true" or a static key. Not that it matters too much on DEBUG_VM
configs...
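
I.e. (illustration of the nit):

	VM_WARN_ON_ONCE_PAGE(!mem_cgroup_disabled() && !memcg, page);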

>   return mem_cgroup_lruvec(memcg, pgdat);
>  }
>  
> 



Re: [PATCH] mm/mmap: replace if (cond) BUG() with BUG_ON()

2021-01-07 Thread Vlastimil Babka
On 1/7/21 6:36 PM, Andrea Arcangeli wrote:
> On Thu, Jan 07, 2021 at 06:28:29PM +0100, Vlastimil Babka wrote:
>> On 1/6/21 9:18 PM, Hugh Dickins wrote:
>> > On Wed, 6 Jan 2021, Andrea Arcangeli wrote:
>> >> 
>> >> I'd be surprised if the kernel can boot with BUG_ON() defined as "do
>> >> {}while(0)" so I guess it doesn't make any difference.
>> > 
>> > I had been afraid of that too, when CONFIG_BUG is not set:
>> > but I think it's actually "if (cond) do {} while (0)".
>> 
>> It's a maze of configs and arch-specific vs generic headers, but I do see 
>> this
>> in include/asm-generic/bug.h:
>> 
>> #else /* !CONFIG_BUG */
>> #ifndef HAVE_ARCH_BUG
>> #define BUG() do {} while (1)
>> #endif
>> 
>> So seems to me there *are* configurations possible where side-effects are 
>> indeed
>> thrown away, right?
> 
> But this is not BUG_ON, 

Oh, you're right, I got lost in the maze.

> and that is an infinite loop while(1), not an

And I overlooked that "1" too.

So that AFAICS means *both* VM_BUG_ON and VM_WARN_ON behave differently wrt
side-effects when disabled than BUG_ON and WARN_ON.

> optimization away as in while (0) that I was suggesting to just throw
> away cond and make it a noop. BUG() is actually the thing to use to
> move functional stuff out of BUG_ON so it's not going to be causing
> issues if it just loops.
> 
> This overall feels mostly an aesthetically issue.
> 
> Thanks,
> Andrea
> 



Re: [PATCH] mm/mmap: replace if (cond) BUG() with BUG_ON()

2021-01-07 Thread Vlastimil Babka
On 1/6/21 9:18 PM, Hugh Dickins wrote:
> On Wed, 6 Jan 2021, Andrea Arcangeli wrote:
>> 
>> I'd be surprised if the kernel can boot with BUG_ON() defined as "do
>> {}while(0)" so I guess it doesn't make any difference.
> 
> I had been afraid of that too, when CONFIG_BUG is not set:
> but I think it's actually "if (cond) do {} while (0)".

It's a maze of configs and arch-specific vs generic headers, but I do see this
in include/asm-generic/bug.h:

#else /* !CONFIG_BUG */
#ifndef HAVE_ARCH_BUG
#define BUG() do {} while (1)
#endif

So seems to me there *are* configurations possible where side-effects are indeed
thrown away, right?

WARN_ON is different, as the result of the "inner" condition should remain
usable for constructing "outer" conditions:

(still in !CONFIG_BUG section)
#ifndef HAVE_ARCH_WARN_ON
#define WARN_ON(condition) ({
int __ret_warn_on = !!(condition);
unlikely(__ret_warn_on);
})
#endif

For completeness, let's look at our own extensions when CONFIG_DEBUG_VM is
disabled, which is quite analogous to disabling CONFIG_BUG, and thus should
better be consistent with the generic stuff.

#define VM_BUG_ON(cond) BUILD_BUG_ON_INVALID(cond)

where BUILD_BUG_ON_INVALID generates no code, so it's consistent with BUG_ON()
and !CONFIG_BUG.

#define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond)

... well, that's not consistent with WARN_ON. Hmm, if you had asked me before I
checked, I would have said that it is, and that I had checked it already in the
past and/or that there had already been some discussion about it. My memory is
failing me, it seems. We should better fix this?
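
For illustration, a minimal sketch (my assumption of how it could look, not an
actual proposal) of a !CONFIG_DEBUG_VM fallback that would stay consistent with
the generic WARN_ON, i.e. still evaluate the condition and return its value:

#define VM_WARN_ON(cond) ({			\
	int __ret_warn_on = !!(cond);		\
	unlikely(__ret_warn_on);		\
})

That way the condition's side-effects are not thrown away and the result stays
usable in outer conditions even with CONFIG_DEBUG_VM disabled (just without the
splat).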


[RFC 0/3] mm, slab, slub: remove cpu and memory hotplug locks

2021-01-06 Thread Vlastimil Babka
Hi,

some related work caused me to look at how we use get/put_online_mems() and
get/put_online_cpus() during kmem cache creation/destruction/shrinking, and
realize that it should actually be safe to remove all of that with rather small
effort (as e.g. Michal Hocko suspected in some of the past discussions
already). This has the benefit of avoiding rather heavy locks that have caused
locking order issues in the past. So this is the result: Patches 1 and
2 remove memory hotplug and cpu hotplug locking, respectively. Patch 3 is due
to the realization that in fact some races exist despite the locks (even if they
were not removed), but the most sane solution is not to introduce more locking,
but rather accept some wasted memory in scenarios that should be rare anyway
(full memory hot remove), as we do the same in other contexts already. It's all
RFC for now, as I might have missed some reason why it's not safe.

Vlastimil Babka (3):
  mm, slab, slub: stop taking memory hotplug lock
  mm, slab, slub: stop taking cpu hotplug lock
  mm, slub: stop freeing kmem_cache_node structures on node offline

 mm/slab_common.c | 20 --
 mm/slub.c| 54 
 2 files changed, 40 insertions(+), 34 deletions(-)

-- 
2.29.2



[RFC 3/3] mm, slub: stop freeing kmem_cache_node structures on node offline

2021-01-06 Thread Vlastimil Babka
Commit e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()") has
fixed a problematic locking order by removing the memory hotplug lock
get/put_online_mems() from show_slab_objects(). During the discussion, it was
argued [1] that this is OK, because existing slabs on the node would prevent
a hotremove to proceed.

That's true, but per-node kmem_cache_node structures are not necessarily
allocated on the same node and may exist even without actual slab pages
on the same node. Any path that uses get_node() directly or via
for_each_kmem_cache_node() (such as show_slab_objects()) can race with
freeing of kmem_cache_node even with the !NULL check, resulting in
use-after-free.

To that end, commit e4f8e513c3d3 argues in a comment that:

 * We don't really need mem_hotplug_lock (to hold off
 * slab_mem_going_offline_callback) here because slab's memory hot
 * unplug code doesn't destroy the kmem_cache->node[] data.

While it's true that slab_mem_going_offline_callback() doesn't free
the kmem_cache_node, the later callback slab_mem_offline_callback() actually
does, so the race and use-after-free exist. Not just for show_slab_objects()
after commit e4f8e513c3d3, but also for many other places that are not under
slab_mutex. And adding slab_mutex locking or other synchronization to SLUB
paths such as get_any_partial() would be bad for performance and error-prone.

The easiest solution is therefore to make the abovementioned comment true and
stop freeing the kmem_cache_node structures, accepting some wasted memory in
the full memory node removal scenario. Analogously, we also don't free a
hotremoved pgdat, as mentioned in [1], nor the similar per-node structures in
SLAB. Importantly, this approach will not block the hotremove, as generally such
nodes should be movable in order to succeed hotremove in the first place, and
thus the GFP_KERNEL allocated kmem_cache_node will come from elsewhere.

[1] https://lore.kernel.org/linux-mm/20190924151147.gb23...@dhcp22.suse.cz/

Signed-off-by: Vlastimil Babka 
---
 mm/slub.c | 26 +++---
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 2e2edd5c9cfc..d7c4f08dcf39 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4268,21 +4268,11 @@ static void slab_mem_offline_callback(void *arg)
 
mutex_lock(&slab_mutex);
node_clear(offline_node, slab_nodes);
-   list_for_each_entry(s, &slab_caches, list) {
-   n = get_node(s, offline_node);
-   if (n) {
-   /*
-* if n->nr_slabs > 0, slabs still exist on the node
-* that is going down. We were unable to free them,
-* and offline_pages() function shouldn't call this
-* callback. So, we must fail.
-*/
-   BUG_ON(slabs_node(s, offline_node));
-
-   s->node[offline_node] = NULL;
-   kmem_cache_free(kmem_cache_node, n);
-   }
-   }
+   /*
+* We no longer free kmem_cache_node structures here, as it would be
+* racy with all get_node() users, and infeasible to protect them with
+* slab_mutex.
+*/
mutex_unlock(&slab_mutex);
 }
 
@@ -4308,6 +4298,12 @@ static int slab_mem_going_online_callback(void *arg)
 */
mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
+   /*
+* The structure may already exist if the node was previously
+* onlined and offlined.
+*/
+   if (get_node(s, nid))
+   continue;
/*
 * XXX: kmem_cache_alloc_node will fallback to other nodes
 *  since memory is not yet available from the node that
-- 
2.29.2



[RFC 1/3] mm, slab, slub: stop taking memory hotplug lock

2021-01-06 Thread Vlastimil Babka
Since commit 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for SLAB
and SLUB when creating, destroying or shrinking a cache. It is quite a heavy
lock and it's best to avoid it if possible, as we had several issues with
lockdep complaining about ordering in the past, see e.g. e4f8e513c3d3
("mm/slub: fix a deadlock in show_slab_objects()").

The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock) can be
summarized as follows: while there's slab_mutex synchronizing new kmem cache
creation and SLUB's MEM_GOING_ONLINE callback slab_mem_going_online_callback(),
we may miss creation of kmem_cache_node for the hotplugged node in the new kmem
cache, because the hotplug callback doesn't yet see the new cache, and cache
creation in init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
N_NORMAL_MEMORY nodemask, which however may not yet include the new node, as
that happens only later after the MEM_GOING_ONLINE callback.

Instead of using get/put_online_mems(), the problem can be solved by SLUB
maintaining its own nodemask of nodes for which it has allocated the per-node
kmem_cache_node structures. This nodemask would generally mirror the
N_NORMAL_MEMORY nodemask, but would be updated only in under SLUB's control in
its memory hotplug callbacks under the slab_mutex. This patch adds such
nodemask and its handling.

Commit 03afc0e25f7f mentions "issues like [the one above]", but there don't
appear to be further issues. All the paths (shared for SLAB and SLUB) taking
the memory hotplug locks are also taking the slab_mutex, except
kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
get/put_online_mems(). So this patch restores slab_mutex in
kmem_cache_shrink(). slab_mutex should be otherwise sufficient, as all the
memory hotplug callbacks in SLUB take it as well.

SLAB also doesn't need the memory hotplug locking, which it only gained by
03afc0e25f7f through the shared paths in slab_common.c. Its memory hotplug
callbacks are also protected by slab_mutex against races with these paths. The
problem of SLUB relying on N_NORMAL_MEMORY doesn't apply to SLAB, as its
setup_kmem_cache_nodes relies on N_ONLINE, and the new node is already set
there during the MEM_GOING_ONLINE callback, so no special care is needed
for SLAB.

As such, this patch removes all get/put_online_mems() usage by the slab
subsystem.

Signed-off-by: Vlastimil Babka 
---
 mm/slab_common.c | 10 --
 mm/slub.c| 28 +---
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 075b23ce94ec..e728265c8b7d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -311,7 +311,6 @@ kmem_cache_create_usercopy(const char *name,
int err;
 
get_online_cpus();
-   get_online_mems();
 
mutex_lock(&slab_mutex);
 
@@ -361,7 +360,6 @@ kmem_cache_create_usercopy(const char *name,
 out_unlock:
mutex_unlock(&slab_mutex);
 
-   put_online_mems();
put_online_cpus();
 
if (err) {
@@ -490,7 +488,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
return;
 
get_online_cpus();
-   get_online_mems();
 
mutex_lock(&slab_mutex);
 
@@ -507,7 +504,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
 out_unlock:
mutex_unlock(&slab_mutex);
 
-   put_online_mems();
put_online_cpus();
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
@@ -526,10 +522,12 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
int ret;
 
get_online_cpus();
-   get_online_mems();
+   mutex_lock(&slab_mutex);
+
kasan_cache_shrink(cachep);
ret = __kmem_cache_shrink(cachep);
-   put_online_mems();
+
+   mutex_unlock(&slab_mutex);
put_online_cpus();
return ret;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 1f4584954f4c..2e2edd5c9cfc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -236,6 +236,14 @@ static inline void stat(const struct kmem_cache *s, enum 
stat_item si)
 #endif
 }
 
+/*
+ * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
+ * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
+ * differ during memory hotplug/hotremove operations.
+ * Protected by slab_mutex.
+ */
+static nodemask_t slab_nodes;
+
 /
  * Core slab cache functions
  ***/
@@ -2678,7 +2686,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t 
gfpflags, int node,
 * ignore the node constraint
 */
if (unlikely(node != NUMA_NO_NODE &&
-!node_state(node, N_NORMAL_MEMORY)))
+!node_isset(node, slab_nodes)))

[RFC 2/3] mm, slab, slub: stop taking cpu hotplug lock

2021-01-06 Thread Vlastimil Babka
SLAB has been using get/put_online_cpus() around creating, destroying and
shrinking kmem caches since 95402b382901 ("cpu-hotplug: replace per-subsystem
mutexes with get_online_cpus()") in 2008, which was supposed to replace
a private mutex (cache_chain_mutex, called slab_mutex today) with a system-wide
mechanism, but in the case of SLAB it's in fact used in addition to the existing
mutex, without an explanation why.

SLUB appears to have avoided the cpu hotplug lock initially, but gained it due
to common code unification, such as 20cea9683ecc ("mm, sl[aou]b: Move
kmem_cache_create mutex handling to common code").

Regardless of the history, checking if the hotplug lock is actually needed
today suggests that it's not, and therefore it's better to avoid this
system-wide lock and the ordering this imposes wrt other locks (such as
slab_mutex).

Specifically, in SLAB we have for_each_online_cpu() in do_tune_cpucache()
protected by slab_mutex, and cpu hotplug callbacks that also take the
slab_mutex, which is also taken by the common slab functions that currently also
take the hotplug lock. Thus the slab_mutex protection should be sufficient.
Also, the per-cpu array caches are allocated for each possible cpu, so they are
not affected by cpu online/offline state.

In SLUB we have for_each_online_cpu() in functions that show statistics and are
already unprotected today, as racing with hotplug is not harmful. Otherwise
SLUB relies on percpu allocator. The slub_cpu_dead() hotplug callback takes the
slab_mutex.

To sum up, this patch removes get/put_online_cpus() calls from slab as it
should be safe without further adjustments.

Signed-off-by: Vlastimil Babka 
---
 mm/slab_common.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index e728265c8b7d..0f29a2b59dac 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -310,8 +310,6 @@ kmem_cache_create_usercopy(const char *name,
const char *cache_name;
int err;
 
-   get_online_cpus();
-
mutex_lock(&slab_mutex);
 
err = kmem_cache_sanity_check(name, size);
@@ -360,8 +358,6 @@ kmem_cache_create_usercopy(const char *name,
 out_unlock:
mutex_unlock(&slab_mutex);
 
-   put_online_cpus();
-
if (err) {
if (flags & SLAB_PANIC)
panic("kmem_cache_create: Failed to create slab '%s'. 
Error %d\n",
@@ -487,8 +483,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (unlikely(!s))
return;
 
-   get_online_cpus();
-
mutex_lock(&slab_mutex);
 
s->refcount--;
@@ -503,8 +497,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
}
 out_unlock:
mutex_unlock(&slab_mutex);
-
-   put_online_cpus();
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
@@ -521,14 +513,12 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 {
int ret;
 
-   get_online_cpus();
mutex_lock(&slab_mutex);
 
kasan_cache_shrink(cachep);
ret = __kmem_cache_shrink(cachep);
 
mutex_unlock(&slab_mutex);
-   put_online_cpus();
return ret;
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
-- 
2.29.2



Re: [PATCH] mm/mremap: fix BUILD_BUG_ON() error in get_extent

2021-01-04 Thread Vlastimil Babka
The subject should say BUILD_BUG()

On 12/30/20 4:40 PM, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> clang cannt evaluate this function argument at compile time
> when the function is not inlined, which leads to a link
> time failure:
> 
> ld.lld: error: undefined symbol: __compiletime_assert_414
 referenced by mremap.c
   mremap.o:(get_extent) in archive mm/built-in.a
> 
> Mark the function as __always_inline to avoid it.

Not sure if it's the ideal fix - maybe just convert it to BUG() instead?
Functions shouldn't really have a BUILD_BUG() that depends on a parameter and
rely on inlining to make it work...
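
A rough sketch of that alternative (assuming the BUILD_BUG() sits in the default
branch of the switch on 'entry' inside get_extent()); a runtime BUG() doesn't
rely on the compiler proving the branch dead, so no __always_inline is needed:

	switch (entry) {
	/* ... the existing pgt_entry cases computing mask/size stay as-is ... */
	default:
		BUG();	/* runtime check instead of the compile-time BUILD_BUG() */
	}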

> Fixes: 9ad9718bfa41 ("mm/mremap: calculate extent in one place")

I think it's rather this one that introduces the BUILD_BUG() ?
c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")

> Signed-off-by: Arnd Bergmann 
> ---
>  mm/mremap.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index c5590afe7165..1cb464a07184 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -336,8 +336,9 @@ enum pgt_entry {
>   * valid. Else returns a smaller extent bounded by the end of the source and
>   * destination pgt_entry.
>   */
> -static unsigned long get_extent(enum pgt_entry entry, unsigned long old_addr,
> - unsigned long old_end, unsigned long new_addr)
> +static __always_inline unsigned long get_extent(enum pgt_entry entry,
> + unsigned long old_addr, unsigned long old_end,
> + unsigned long new_addr)
>  {
>   unsigned long next, extent, mask, size;
>  
> 



Re: [PATCH v6 0/3] mm,thp,shm: limit shmem THP alloc gfp_mask

2020-12-14 Thread Vlastimil Babka
On 12/14/20 10:16 PM, Hugh Dickins wrote:
> On Tue, 24 Nov 2020, Rik van Riel wrote:
> 
>> The allocation flags of anonymous transparent huge pages can be controlled
>> through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
>> help the system from getting bogged down in the page reclaim and compaction
>> code when many THPs are getting allocated simultaneously.
>> 
>> However, the gfp_mask for shmem THP allocations were not limited by those
>> configuration settings, and some workloads ended up with all CPUs stuck
>> on the LRU lock in the page reclaim code, trying to allocate dozens of
>> THPs simultaneously.
>> 
>> This patch applies the same configurated limitation of THPs to shmem
>> hugepage allocations, to prevent that from happening.
>> 
>> This way a THP defrag setting of "never" or "defer+madvise" will result
>> in quick allocation failures without direct reclaim when no 2MB free
>> pages are available.
>> 
>> With this patch applied, THP allocations for tmpfs will be a little
>> more aggressive than today for files mmapped with MADV_HUGEPAGE,
>> and a little less aggressive for files that are not mmapped or
>> mapped without that flag.
>> 
>> v6: make khugepaged actually obey tmpfs mount flags
>> v5: reduce gfp mask further if needed, to accomodate i915 (Matthew Wilcox)
>> v4: rename alloc_hugepage_direct_gfpmask to vma_thp_gfp_mask (Matthew Wilcox)
>> v3: fix NULL vma issue spotted by Hugh Dickins & tested
>> v2: move gfp calculation to shmem_getpage_gfp as suggested by Yu Xu
> 
> Andrew, please don't rush
> 
> mmthpshmem-limit-shmem-thp-alloc-gfp_mask.patch
> mmthpshm-limit-gfp-mask-to-no-more-than-specified.patch
> mmthpshmem-make-khugepaged-obey-tmpfs-mount-flags.patch
> 
> to Linus in your first wave of mmotm->5.11 sendings.
> Or, alternatively, go ahead and send them to Linus, but
> be aware that I'm fairly likely to want adjustments later.
> 
> Sorry for limping along so far behind, but I still have more
> re-reading of the threads to do, and I'm still investigating
> why tmpfs huge=always becomes so ineffective in my testing with
> these changes, even if I ramp up from default defrag=madvise to
> defrag=always:
> 5.10   mmotm
> thp_file_alloc   4641788  216027
> thp_file_fallback 275339 8895647

So AFAICS before the patch shmem allocated hugepages basically with:
mapping_gfp_mask(inode->i_mapping) |  __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN
where mapping_gfp_mask() should be the default GFP_HIGHUSER_MOVABLE unless I
missed some shmem-specific override of the mask.

So the important flags mean all zones available, both __GFP_DIRECT_RECLAIM and
__GFP_KSWAPD_RECLAIM, but also __GFP_NORETRY which makes it less aggressive.

Now, with defrag=madvise and without madvised vma, there's just
GFP_TRANSHUGE_LIGHT, which means no __GFP_DIRECT_RECLAIM (and no
__GFP_KSWAPD_RECLAIM). Thus no reclaim and compaction at all. Indeed "little
less aggressive" is an understatement.

On the other hand, with defrag=always and again without madvised vma there
should be GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM | __GFP_NORETRY, so
compared to "before the patch" this is only missing __GFP_KSWAPD_RECLAIM. I
would be surprised if this meant so much difference in your testing as you show
above - I think you would have to be allocating those THPs just at a rate where
kswapd+kcompactd can keep up and nothing else "steals" the pages that background
reclaim+compaction creates.
In that (subjectively unlikely) case, I think significant improvement should be
visible with defrag=defer over defrag=madvise.
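
To make the comparison concrete, a sketch of the effective masks discussed above
(written as C expressions only, not actual kernel code; the "before" case
assumes the default mapping_gfp_mask() of GFP_HIGHUSER_MOVABLE):

	/* before the patch (shmem default) */
	gfp_t before        = GFP_HIGHUSER_MOVABLE | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;
	/* after, defrag=madvise, vma not madvised: no reclaim/compaction at all */
	gfp_t after_madvise = GFP_TRANSHUGE_LIGHT;
	/* after, defrag=always, vma not madvised: only __GFP_KSWAPD_RECLAIM missing */
	gfp_t after_always  = GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM | __GFP_NORETRY;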

> I've been looking into it off and on for weeks (gfp_mask wrangling is
> not my favourite task! so tend to find higher priorities to divert me);
> hoped to arrive at a conclusion before merge window, but still have
> nothing constructive to say yet, hence my silence so far.
> 
> Above's "a little less aggressive" appears understatement at present.
> I respect what Rik is trying to achieve here, and I may end up
> concluding that there's nothing better to be done than what he has.
> My kind of hugepage-thrashing-in-low-memory may be so remote from
> normal usage, and could be skirting the latency horrors we all want
> to avoid: but I haven't reached that conclusion yet - the disparity
> in effectiveness still deserves more investigation.
> 
> (There's also a specific issue with the gfp_mask limiting: I have
> not yet reviewed the allowing and denying in detail, but it looks
> like it does not respect the caller's GFP_ZONEMASK - the gfp in
> shmem_getpage_gfp() and shmem_read_mapping_page_gfp() is there to
> satisfy the gma500, which wanted to use shmem but could only manage
> DMA32.  I doubt it wants THPS, but shmem_enabled=force forces them.)
> 
> Thanks,
> Hugh
> 



[PATCH] mm, slab, slub: clear the slab_cache field when freeing page

2020-12-10 Thread Vlastimil Babka
The page allocator expects that page->mapping is NULL for a page being freed.
SLAB and SLUB use the slab_cache field which is in union with mapping, but
before freeing the page, the field is referenced with the "mapping" name when
set to NULL.

It's IMHO more correct (albeit functionally the same) to use the slab_cache
name as that's the field we use in SL*B, and document why we clear it in a
comment (we don't clear fields such as s_mem or freelist, as page allocator
doesn't care about those). While using the 'mapping' name would automagically
keep the code correct if the unions in struct page changed, such changes should
be done consciously and needed changes evaluated - the comment should help with
that.

Signed-off-by: Vlastimil Babka 
---
 mm/slab.c | 3 ++-
 mm/slub.c | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 72b6743bdccf..b667f03095f1 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1399,7 +1399,8 @@ static void kmem_freepages(struct kmem_cache *cachep, 
struct page *page)
__ClearPageSlabPfmemalloc(page);
__ClearPageSlab(page);
page_mapcount_reset(page);
-   page->mapping = NULL;
+   /* In union with page->mapping where page allocator expects NULL */
+   page->slab_cache = NULL;
 
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += 1 << order;
diff --git a/mm/slub.c b/mm/slub.c
index d3406ef65863..81c22f4c7e63 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1836,8 +1836,8 @@ static void __free_slab(struct kmem_cache *s, struct page 
*page)
 
__ClearPageSlabPfmemalloc(page);
__ClearPageSlab(page);
-
-   page->mapping = NULL;
+   /* In union with page->mapping where page allocator expects NULL */
+   page->slab_cache = NULL;
if (current->reclaim_state)
current->reclaim_state->reclaimed_slab += pages;
unaccount_slab_page(page, order, s);
-- 
2.29.2



Re: [PATCH v2 sl-b 1/5] mm: Add mem_dump_obj() to print source of memory block

2020-12-10 Thread Vlastimil Babka
On 12/10/20 12:04 AM, Paul E. McKenney wrote:
>> > +/**
>> > + * kmem_valid_obj - does the pointer reference a valid slab object?
>> > + * @object: pointer to query.
>> > + *
>> > + * Return: %true if the pointer is to a not-yet-freed object from
>> > + * kmalloc() or kmem_cache_alloc(), either %true or %false if the pointer
>> > + * is to an already-freed object, and %false otherwise.
>> > + */
>> 
>> It should be possible to find out more about object being free or not, than 
>> you
>> currently do. At least to find out if it's definitely free. When it appears
>> allocated, it can be actually still free in some kind of e.g. per-cpu or
>> per-node cache that would be infeasible to check. But that improvement to the
>> output can be also added later. Also SLUB stores the freeing stacktrace, 
>> which
>> might be useful...
> 
> I can see how this could help debugging a use-after-free situation,
> at least as long as the poor sap that subsequently allocated it doesn't
> free it.
> 
> I can easily add more fields to the kmem_provenance structure.  Maybe
> it would make sense to have another exported API that you provide a
> kmem_provenance structure to, and it fills it in.
> 
> One caution though...  I rely on the object being allocated.
> If it officially might already be freed, complex and high-overhead
> synchronization seems to be required to safely access the various data
> structures.

Good point! It's easy to forget that when one is used to doing similar digging
in a crash dump, where nothing changes.

> So any use on an already-freed object is on a "you break it you get to
> keep the pieces" basis.  On the other hand, if you are dealing with a
> use-after-free situation, life is hard anyway.

Yeah, even now I think it's potentially dangerous, as you can get
kmem_valid_obj() returning true just because PageSlab(page) is true. But the
object might already be free, so as soon as another CPU frees another object
from the same slab page, the page gets freed as well... or it was already freed
and then allocated by another slab, so it's PageSlab() again.
I guess at least some safety could be achieved by pinning the page with
get_page_unless_zero(). But maybe your current implementation is already safe;
I'd need to check in detail.
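
For illustration, a minimal sketch (my assumption, not the posted patch) of what
pinning in kmem_valid_obj() could look like:

bool kmem_valid_obj(void *object)
{
	struct page *page;
	bool ret;

	if (!virt_addr_valid(object))
		return false;
	page = virt_to_head_page(object);
	/* pin the head page so it can't be freed and reused under us */
	if (!get_page_unless_zero(page))
		return false;
	ret = PageSlab(page);
	put_page(page);
	return ret;
}

Even then the result is only a snapshot, as the page can change right after
put_page(), so it's "some safety" rather than a guarantee.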

> Or am I missing your point?
> 
>> > +bool kmem_valid_obj(void *object)
>> > +{
>> > +  struct page *page;
>> > +
>> > +  if (!virt_addr_valid(object))
>> > +  return false;
>> > +  page = virt_to_head_page(object);
>> > +  return PageSlab(page);
>> > +}
>> > +EXPORT_SYMBOL_GPL(kmem_valid_obj);
>> > +
>> > +/**
>> > + * kmem_dump_obj - Print available slab provenance information
>> > + * @object: slab object for which to find provenance information.
>> > + *
>> > + * This function uses pr_cont(), so that the caller is expected to have
>> > + * printed out whatever preamble is appropriate.  The provenance 
>> > information
>> > + * depends on the type of object and on how much debugging is enabled.
>> > + * For a slab-cache object, the fact that it is a slab object is printed,
>> > + * and, if available, the slab name, return address, and stack trace from
>> > + * the allocation of that object.
>> > + *
>> > + * This function will splat if passed a pointer to a non-slab object.
>> > + * If you are not sure what type of object you have, you should instead
>> > + * use mem_dump_obj().
>> > + */
>> > +void kmem_dump_obj(void *object)
>> > +{
>> > +  int i;
>> > +  struct page *page;
>> > +  struct kmem_provenance kp;
>> > +
>> > +  if (WARN_ON_ONCE(!virt_addr_valid(object)))
>> > +  return;
>> > +  page = virt_to_head_page(object);
>> > +  if (WARN_ON_ONCE(!PageSlab(page))) {
>> > +  pr_cont(" non-slab memory.\n");
>> > +  return;
>> > +  }
>> > +  kp.kp_ptr = object;
>> > +  kp.kp_page = page;
>> > +  kp.kp_nstack = KS_ADDRS_COUNT;
>> > +  kmem_provenance(&kp);
>> 
>> You don't seem to be printing kp.kp_objp anywhere? (unless in later patch, 
>> but
>> would make sense in this patch already).
> 
> Good point!
> 
> However, please note that the various debugging options that reserve
> space at the beginning.  This can make the meaning of kp.kp_objp a bit
> different than one might expect.

Yeah, I think the best would be to match the address that
kmalloc/kmem_cache_alloc() would return, thus the beginning of the object
itself, so you can calculate the offset within it, etc.

>> > --- a/mm/util.c
>> > +++ b/mm/util.c
>> 
>> I think mm/debug.c is a better fit as it already has dump_page() of a similar
>> nature. Also you can call that from from mem_dump_obj() at least in case when
>> the more specific handlers fail. It will even include page_owner info if 
>> enabled! :)
> 
> I will count this as one vote for mm/debug.c.
> 
> Two things to consider, though...  First, Joonsoo suggests that the fact
> that this produces useful information without any debugging information
> enabled makes it not be debugging as such.

Well there's already dump_page() which also produces information wit

Re: [PATCH v2 sl-b 3/5] mm: Make mem_dump_obj() handle vmalloc() memory

2020-12-10 Thread Vlastimil Babka
On 12/10/20 12:23 AM, Paul E. McKenney wrote:
> On Wed, Dec 09, 2020 at 06:51:20PM +0100, Vlastimil Babka wrote:
>> On 12/9/20 2:13 AM, paul...@kernel.org wrote:
>> > From: "Paul E. McKenney" 
>> > 
>> > This commit adds vmalloc() support to mem_dump_obj().  Note that the
>> > vmalloc_dump_obj() function combines the checking and dumping, in
>> > contrast with the split between kmem_valid_obj() and kmem_dump_obj().
>> > The reason for the difference is that the checking in the vmalloc()
>> > case involves acquiring a global lock, and redundant acquisitions of
>> > global locks should be avoided, even on not-so-fast paths.
>> > 
>> > Note that this change causes on-stack variables to be reported as
>> > vmalloc() storage from kernel_clone() or similar, depending on the degree
>> > of inlining that your compiler does.  This is likely more helpful than
>> > the earlier "non-paged (local) memory".
>> > 
>> > Cc: Andrew Morton 
>> > Cc: Joonsoo Kim 
>> > Cc: 
>> > Reported-by: Andrii Nakryiko 
>> > Signed-off-by: Paul E. McKenney 
>> 
>> ...
>> 
>> > --- a/mm/vmalloc.c
>> > +++ b/mm/vmalloc.c
>> > @@ -3431,6 +3431,18 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int 
>> > nr_vms)
>> >  }
>> >  #endif/* CONFIG_SMP */
>> >  
>> > +bool vmalloc_dump_obj(void *object)
>> > +{
>> > +  struct vm_struct *vm;
>> > +  void *objp = (void *)PAGE_ALIGN((unsigned long)object);
>> > +
>> > +  vm = find_vm_area(objp);
>> > +  if (!vm)
>> > +  return false;
>> > +  pr_cont(" vmalloc allocated at %pS\n", vm->caller);
>> 
>> Would it be useful to print the vm area boundaries too?
> 
> Like this?

Yeah, thanks!

> I also considered instead using vm->size, but that always seems to include
> an extra page, so a 4-page span is listed as having 20480 bytes and a
> one-page span is 8192 bytes.  This might be more accurate in some sense,
> but would be quite confusing to someone trying to compare this size with
> that requested in the vmalloc() call.

Right.

> 
>   Thanx, Paul
> 
> 
> 
> commit 33e0469c289c2f78e5f0d0c463c8ee3357d273c0
> Author: Paul E. McKenney 
> Date:   Wed Dec 9 15:15:27 2020 -0800
> 
> mm: Make mem_obj_dump() vmalloc() dumps include start and length
> 
> This commit adds the starting address and number of pages to the vmalloc()
> information dumped by way of vmalloc_dump_obj().
> 
> Cc: Andrew Morton 
> Cc: Joonsoo Kim 
> Cc: 
> Reported-by: Andrii Nakryiko 
> Suggested-by: Vlastimil Babka 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 7421719..77b1100 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3439,7 +3439,8 @@ bool vmalloc_dump_obj(void *object)
>   vm = find_vm_area(objp);
>   if (!vm)
>   return false;
> - pr_cont(" vmalloc allocated at %pS\n", vm->caller);
> + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
> + vm->nr_pages, (unsigned long)vm->addr, vm->caller);
>   return true;
>  }
>  
> 



Re: [PATCH v2 sl-b 3/5] mm: Make mem_dump_obj() handle vmalloc() memory

2020-12-09 Thread Vlastimil Babka
On 12/9/20 2:13 AM, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> This commit adds vmalloc() support to mem_dump_obj().  Note that the
> vmalloc_dump_obj() function combines the checking and dumping, in
> contrast with the split between kmem_valid_obj() and kmem_dump_obj().
> The reason for the difference is that the checking in the vmalloc()
> case involves acquiring a global lock, and redundant acquisitions of
> global locks should be avoided, even on not-so-fast paths.
> 
> Note that this change causes on-stack variables to be reported as
> vmalloc() storage from kernel_clone() or similar, depending on the degree
> of inlining that your compiler does.  This is likely more helpful than
> the earlier "non-paged (local) memory".
> 
> Cc: Andrew Morton 
> Cc: Joonsoo Kim 
> Cc: 
> Reported-by: Andrii Nakryiko 
> Signed-off-by: Paul E. McKenney 

...

> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3431,6 +3431,18 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int 
> nr_vms)
>  }
>  #endif   /* CONFIG_SMP */
>  
> +bool vmalloc_dump_obj(void *object)
> +{
> + struct vm_struct *vm;
> + void *objp = (void *)PAGE_ALIGN((unsigned long)object);
> +
> + vm = find_vm_area(objp);
> + if (!vm)
> + return false;
> + pr_cont(" vmalloc allocated at %pS\n", vm->caller);

Would it be useful to print the vm area boundaries too?

> + return true;
> +}
> +
>  #ifdef CONFIG_PROC_FS
>  static void *s_start(struct seq_file *m, loff_t *pos)
>   __acquires(&vmap_purge_lock)
> 



Re: [PATCH v2 sl-b 2/5] mm: Make mem_dump_obj() handle NULL and zero-sized pointers

2020-12-09 Thread Vlastimil Babka
On 12/9/20 2:13 AM, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> This commit makes mem_dump_obj() call out NULL and zero-sized pointers
> specially instead of classifying them as non-paged memory.
> 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> Cc: David Rientjes 
> Cc: Joonsoo Kim 
> Cc: Andrew Morton 
> Cc: 
> Reported-by: Andrii Nakryiko 
> Signed-off-by: Paul E. McKenney 

Acked-by: Vlastimil Babka 

> ---
>  mm/util.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index d0e60d2..8c2449f 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -985,7 +985,12 @@ int __weak memcmp_pages(struct page *page1, struct page 
> *page2)
>  void mem_dump_obj(void *object)
>  {
>   if (!virt_addr_valid(object)) {
> - pr_cont(" non-paged (local) memory.\n");
> + if (object == NULL)
> + pr_cont(" NULL pointer.\n");
> + else if (object == ZERO_SIZE_PTR)
> + pr_cont(" zero-size pointer.\n");
> + else
> + pr_cont(" non-paged (local) memory.\n");
>   return;
>   }
>   if (kmem_valid_obj(object)) {
> 



Re: [PATCH v2 sl-b 1/5] mm: Add mem_dump_obj() to print source of memory block

2020-12-09 Thread Vlastimil Babka
On 12/9/20 2:12 AM, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> There are kernel facilities such as per-CPU reference counts that give
> error messages in generic handlers or callbacks, whose messages are
> unenlightening.  In the case of per-CPU reference-count underflow, this
> is not a problem when creating a new use of this facility because in that
> case the bug is almost certainly in the code implementing that new use.
> However, trouble arises when deploying across many systems, which might
> exercise corner cases that were not seen during development and testing.
> Here, it would be really nice to get some kind of hint as to which of
> several uses the underflow was caused by.
> 
> This commit therefore exposes a mem_dump_obj() function that takes
> a pointer to memory (which must still be allocated if it has been
> dynamically allocated) and prints available information on where that
> memory came from.  This pointer can reference the middle of the block as
> well as the beginning of the block, as needed by things like RCU callback
> functions and timer handlers that might not know where the beginning of
> the memory block is.  These functions and handlers can use mem_dump_obj()
> to print out better hints as to where the problem might lie.

Sounds useful, yeah. It occurred to me at least once that we don't have a nice
generic way to print this kind of info. I usually dig it out of a crash dump...

> The information printed can depend on kernel configuration.  For example,
> the allocation return address can be printed only for slab and slub,
> and even then only when the necessary debug has been enabled.  For slab,
> build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
> to the next power of two or use the SLAB_STORE_USER when creating the
> kmem_cache structure.  For slub, build with CONFIG_SLUB_DEBUG=y and
> boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
> if more focused use is desired.  Also for slub, use CONFIG_STACKTRACE
> to enable printing of the allocation-time stack trace.
> 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> Cc: David Rientjes 
> Cc: Joonsoo Kim 
> Cc: Andrew Morton 
> Cc: 
> Reported-by: Andrii Nakryiko 
> [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
> [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
> Signed-off-by: Paul E. McKenney 

...

> +/**
> + * kmem_valid_obj - does the pointer reference a valid slab object?
> + * @object: pointer to query.
> + *
> + * Return: %true if the pointer is to a not-yet-freed object from
> + * kmalloc() or kmem_cache_alloc(), either %true or %false if the pointer
> + * is to an already-freed object, and %false otherwise.
> + */

It should be possible to find out more about the object being free or not than
you currently do - at least to find out if it's definitely free. When it appears
allocated, it can actually still be free in some kind of e.g. per-cpu or
per-node cache that would be infeasible to check. But that improvement to the
output can also be added later. Also, SLUB stores the freeing stacktrace, which
might be useful...

> +bool kmem_valid_obj(void *object)
> +{
> + struct page *page;
> +
> + if (!virt_addr_valid(object))
> + return false;
> + page = virt_to_head_page(object);
> + return PageSlab(page);
> +}
> +EXPORT_SYMBOL_GPL(kmem_valid_obj);
> +
> +/**
> + * kmem_dump_obj - Print available slab provenance information
> + * @object: slab object for which to find provenance information.
> + *
> + * This function uses pr_cont(), so that the caller is expected to have
> + * printed out whatever preamble is appropriate.  The provenance information
> + * depends on the type of object and on how much debugging is enabled.
> + * For a slab-cache object, the fact that it is a slab object is printed,
> + * and, if available, the slab name, return address, and stack trace from
> + * the allocation of that object.
> + *
> + * This function will splat if passed a pointer to a non-slab object.
> + * If you are not sure what type of object you have, you should instead
> + * use mem_dump_obj().
> + */
> +void kmem_dump_obj(void *object)
> +{
> + int i;
> + struct page *page;
> + struct kmem_provenance kp;
> +
> + if (WARN_ON_ONCE(!virt_addr_valid(object)))
> + return;
> + page = virt_to_head_page(object);
> + if (WARN_ON_ONCE(!PageSlab(page))) {
> + pr_cont(" non-slab memory.\n");
> + return;
> + }
> + kp.kp_ptr = object;
> + kp.kp_page = page;
> + kp.kp_nstack = KS_ADDRS_COUNT;
> + kmem_provenance(&kp);

You don't seem to be printing kp.kp_objp anywhere? (unless in later patch, but
would make sense in this patch already).

> + if (page->slab_cache)
> + pr_cont(" slab %s", page->slab_cache->name);
> + else
> + pr_cont(" slab ");
> + if (kp.kp_ret)
> + pr_cont(" allocated at %pS\n", kp.kp

Re: [PATCH v2] mm/page_owner: Record timestamp and pid

2020-12-09 Thread Vlastimil Babka
On 12/9/20 1:51 PM, Georgi Djakov wrote:
> From: Liam Mark 
> 
> Collect the time for each allocation recorded in page owner so that
> allocation "surges" can be measured.
> 
> Record the pid for each allocation recorded in page owner so that
> the source of allocation "surges" can be better identified.
> 
> The above is very useful when doing memory analysis. On a crash for
> example, we can get this information from kdump (or ramdump) and parse
> it to figure out memory allocation problems.

Yes, I can imagine this to be useful.

> Please note that on x86_64 this increases the size of struct page_owner
> from 16 bytes to 32.

That's the tradeoff, but this is not functionality intended for production, so
unless somebody says they need to enable page_owner for debugging and this
increase prevents them from fitting into available memory, let's not complicate
things by making this optional.

> Signed-off-by: Liam Mark 
> Signed-off-by: Georgi Djakov 

Acked-by: Vlastimil Babka 

> ---
> 
> v2:
> - Improve the commit message (Andrew and Vlastimil)
> - Update page_owner.rst with more recent object size information (Andrew)
> - Use pid_t for the pid (Andrew)
> - Print the info also in __dump_page_owner() (Vlastimil)
> 
> v1: https://lore.kernel.org/r/20201112184106.733-1-georgi.dja...@linaro.org/
> 
> 
>  Documentation/vm/page_owner.rst | 12 ++--
>  mm/page_owner.c | 17 +
>  2 files changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/vm/page_owner.rst b/Documentation/vm/page_owner.rst
> index 02deac76673f..cf7c0c361621 100644
> --- a/Documentation/vm/page_owner.rst
> +++ b/Documentation/vm/page_owner.rst
> @@ -41,17 +41,17 @@ size change due to this facility.
>  - Without page owner::
>  
> textdata bss dec hex filename
> -   40662   1493 644   42799a72f mm/page_alloc.o
> +  483922333 644   51369c8a9 mm/page_alloc.o
>  
>  - With page owner::
>  
> textdata bss dec hex filename
> -   40892   1493 644   43029a815 mm/page_alloc.o
> -   1427  24   81459 5b3 mm/page_ext.o
> -   2722  50   02772 ad4 mm/page_owner.o
> +  488002445 644   51889cab1 mm/page_alloc.o
> +   6574 108  2967111a37 mm/page_owner.o
> +   1025   8   81041 411 mm/page_ext.o
>  
> -Although, roughly, 4 KB code is added in total, page_alloc.o increase by
> -230 bytes and only half of it is in hotpath. Building the kernel with
> +Although, roughly, 8 KB code is added in total, page_alloc.o increase by
> +520 bytes and less than half of it is in hotpath. Building the kernel with
>  page owner and turning it on if needed would be great option to debug
>  kernel memory problem.
>  
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index b735a8eafcdb..af464bb7fbe7 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "internal.h"
>  
> @@ -25,6 +26,8 @@ struct page_owner {
>   gfp_t gfp_mask;
>   depot_stack_handle_t handle;
>   depot_stack_handle_t free_handle;
> + u64 ts_nsec;
> + pid_t pid;
>  };
>  
>  static bool page_owner_enabled = false;
> @@ -172,6 +175,8 @@ static inline void __set_page_owner_handle(struct page 
> *page,
>   page_owner->order = order;
>   page_owner->gfp_mask = gfp_mask;
>   page_owner->last_migrate_reason = -1;
> + page_owner->pid = current->pid;
> + page_owner->ts_nsec = local_clock();
>   __set_bit(PAGE_EXT_OWNER, &page_ext->flags);
>   __set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
>  
> @@ -236,6 +241,8 @@ void __copy_page_owner(struct page *oldpage, struct page 
> *newpage)
>   new_page_owner->last_migrate_reason =
>   old_page_owner->last_migrate_reason;
>   new_page_owner->handle = old_page_owner->handle;
> + new_page_owner->pid = old_page_owner->pid;
> + new_page_owner->ts_nsec = old_page_owner->ts_nsec;
>  
>   /*
>* We don't clear the bit on the oldpage as it's going to be freed
> @@ -349,9 +356,10 @@ print_page_owner(char __user *buf, size_t count, 
> unsigned long pfn,
>   return -ENOMEM;
>  
>   ret = snprintf(kbuf, count,
> - "Page allocated via order %u, mask %#x(%pGg)\n",
> + "Page allocated via order %u, mask %#x(%pGg), pid %d, 
> ts %llu ns\n",
>  

Re: [PATCH] mm,hwpoison: Return -EBUSY when migration fails

2020-12-09 Thread Vlastimil Babka
On 12/9/20 10:28 AM, Oscar Salvador wrote:
> Currently, we return -EIO when we fail to migrate the page.
> 
> Migrations' failures are rather transient as they can happen due to
> several reasons, e.g: high page refcount bump, mapping->migrate_page
> failing etc.
> All meaning that at that time the page could not be migrated, but
> that has nothing to do with an EIO error.
> 
> Let us return -EBUSY instead, as we do in case we failed to isolate
> the page.
> 
> While are it, let us remove the "ret" print as its value does not change.
> 
> Signed-off-by: Oscar Salvador 

Acked-by: Vlastimil Babka 

Technically this affects madvise(2) so let's cc linux-api. The manpage doesn't
document error codes of MADV_HWPOISON and MADV_SOFT_OFFLINE (besides EPERM)
though so nothing to adjust there. It's meant only for the hwpoison testing
suite anyway.

> ---
>  mm/memory-failure.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 428991e297e2..1942fb83ac64 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1849,11 +1849,11 @@ static int __soft_offline_page(struct page *page)
>   pr_info("soft offline: %#lx: %s migration failed %d, 
> type %lx (%pGp)\n",
>   pfn, msg_page[huge], ret, page->flags, 
> &page->flags);
>   if (ret > 0)
> - ret = -EIO;
> + ret = -EBUSY;
>   }
>   } else {
> - pr_info("soft offline: %#lx: %s isolation failed: %d, page 
> count %d, type %lx (%pGp)\n",
> - pfn, msg_page[huge], ret, page_count(page), 
> page->flags, &page->flags);
> + pr_info("soft offline: %#lx: %s isolation failed, page count 
> %d, type %lx (%pGp)\n",
> + pfn, msg_page[huge], page_count(page), page->flags, 
> &page->flags);
>   ret = -EBUSY;
>   }
>   return ret;
> 



Re: [Ksummit-discuss] crediting bug reports and fixes folded into original patch

2020-12-09 Thread Vlastimil Babka
On 12/9/20 8:58 AM, Dan Carpenter wrote:
> On Tue, Dec 08, 2020 at 09:01:49PM -0800, Joe Perches wrote:
>> On Tue, 2020-12-08 at 16:34 -0800, Kees Cook wrote:
>> 
>> > If not "Adjusted-by", what about "Tweaked-by", "Helped-by",
>> > "Corrected-by"?
>> 
>> Improved-by: / Enhanced-by: / Revisions-by: 
>> 
> 
> I don't think we should give any credit for improvements or enhancements,

Well, some are actually useful and not about the reviewer's preferred style :)
But if an author redoes the patch as a result, it's their choice to mention
useful improvements in the next version's change log.

> only for fixes.  Complaining about style is its own reward.

Right, let's focus on fixes and reports of bugs that would have resulted in a
standalone commit, but don't.

> Having to redo a patch is already a huge headache.  Normally, I already
> considered the reviewer's prefered style and decided I didn't like it.
> Then to make me redo the patch in an ugly style and say thankyou on
> top of that???  Forget about it.  Plus, as a reviewer I hate reviewing
> patches over and over.
> 
> I've argued for years that we should have a Fixes-from: tag.  The zero

Standardizing the Fixes-from: tag (or any better name) would be forward
progress, yes.

> day bot is already encouraging people to add Reported-by tags for this
> and a lot of people do.

"Reported-by:" becomes ambiguous once the bugfix for the reported issue in the
patch is folded, as it's no longer clear whether the bot reported the original
issue the patch is fixing, or a bug in the fix. So we should have a different
variant. "Fixes-reported-by:" so it has the same prefix?

> regards,
> dan carpenter
> 



Re: [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

2020-12-04 Thread Vlastimil Babka
On 12/1/20 12:35 PM, Oscar Salvador wrote:
> On Wed, Nov 25, 2020 at 07:20:33PM +0100, Vlastimil Babka wrote:
>> On 11/19/20 11:57 AM, Oscar Salvador wrote:
>> > From: Naoya Horiguchi 
>> > 
>> > The call to get_user_pages_fast is only to get the pointer to a struct
>> > page of a given address, pinning it is memory-poisoning handler's job,
>> > so drop the refcount grabbed by get_user_pages_fast().
>> > 
>> > Note that the target page is still pinned after this put_page() because
>> > the current process should have refcount from mapping.
>> 
>> Well, but can't it go away due to reclaim, migration or whatever?
> 
> Yes, it can.
> 
>> > @@ -900,20 +900,23 @@ static int madvise_inject_error(int behavior,
>> > */
>> >size = page_size(compound_head(page));
>> > +  /*
>> > +   * The get_user_pages_fast() is just to get the pfn of the
>> > +   * given address, and the refcount has nothing to do with
>> > +   * what we try to test, so it should be released immediately.
>> > +   * This is racy but it's intended because the real hardware
>> > +   * errors could happen at any moment and memory error handlers
>> > +   * must properly handle the race.
>> 
>> Sure they have to. We might just be unexpectedly messing with other process'
>> memory. Or does anything else prevent that?
> 
> No, nothing does, and I have to confess that I managed to confuse myself here.
> If we release such page and that page ends up in buddy, nothing prevents 
> someone
> else to get that page, and then we would be messing with other process memory.
> 
> I guess the right thing to do is just to make sure we got that page and that
> that page remains pinned as long as the memory failure handling goes.

OK, so that means we don't introduce this race for MADV_SOFT_OFFLINE, but it's
already (and still) there for MADV_HWPOISON since Dan's 23e7b5c2e271 ("mm,
madvise_inject_error: Let memory_failure() optionally take a page reference") 
no?

> I will remove those patches from the patchset and re-submit with only the
> refactoring and pcp-disabling.
> 
> Thanks Vlastimil
> 



Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints

2020-12-04 Thread Vlastimil Babka
On 12/2/20 2:11 AM, Shakeel Butt wrote:
> On Tue, Dec 1, 2020 at 5:07 PM Steven Rostedt  wrote:
>>
>> On Tue, 1 Dec 2020 16:36:32 -0800
>> Shakeel Butt  wrote:
>>
>> > SGTM but note that usually Andrew squash all the patches into one
>> > before sending to Linus. If you plan to replace the path buffer with
>> > integer IDs then no need to spend time fixing buffer related bug.
>>
>> I don't think Andrew squashes all the patches. I believe he sends Linus
>> a patch series.
> 
> I am talking about the patch and the following fixes to that patch.
> Those are usually squashed into one patch.

Yeah, if there's a way forward that avoids constructing the full path (and the
associated complexity) on each event and just uses an ID, let's convert to the
ID and squash it, for less churn. Especially if there are other existing
tracepoints that use the ID.

If there's further (somewhat orthogonal) work to make the IDs easier for
userspace, it can be added on top later, but we really shouldn't need to add the
current complex solution only to remove it later?

Thanks,
Vlastimil


Re: [PATCH v2] mm: fix a race on nr_swap_pages

2020-12-04 Thread Vlastimil Babka
On 12/4/20 3:52 AM, Zhaoyang Huang wrote:
> The scenario on which "Free swap = -4kB" happens in my system, which is caused
> by several get_swap_pages racing with each other and show_swap_cache_info
> happens simutaniously. No need to add a lock on get_swap_page_of_type as we
> remove "Presub/PosAdd" here.
> 
> ProcessA  ProcessBProcessC
> ngoals = 1ngoals = 1
> avail = nr_swap_pages(1)  avail = nr_swap_pages(1)
> nr_swap_pages(1) -= ngoals
>   nr_swap_pages(0) -= ngoals
>   nr_swap_pages = 
> -1
> 
> Signed-off-by: Zhaoyang Huang 

Better now.
Acked-by: Vlastimil Babka 

> ---
> change of v2: fix bug of unpaired of spin_lock
> ---
> ---
>  mm/swapfile.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index cf63b5f..1212f17 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -974,9 +974,13 @@ int get_swap_pages(int n_goal, swp_entry_t 
> swp_entries[], int entry_size)
>   /* Only single cluster request supported */
>   WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>  
> + spin_lock(&swap_avail_lock);
> +
>   avail_pgs = atomic_long_read(&nr_swap_pages) / size;
> - if (avail_pgs <= 0)
> + if (avail_pgs <= 0) {
> + spin_unlock(&swap_avail_lock);
>   goto noswap;
> + }
>  
>   if (n_goal > SWAP_BATCH)
>   n_goal = SWAP_BATCH;
> @@ -986,8 +990,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], 
> int entry_size)
>  
>   atomic_long_sub(n_goal * size, &nr_swap_pages);
>  
> - spin_lock(&swap_avail_lock);
> -
>  start_over:
>   node = numa_node_id();
>   plist_for_each_entry_safe(si, next, &swap_avail_heads[node], 
> avail_lists[node]) {
> @@ -1061,14 +1063,13 @@ swp_entry_t get_swap_page_of_type(int type)
>  
>   spin_lock(&si->lock);
>   if (si->flags & SWP_WRITEOK) {
> - atomic_long_dec(&nr_swap_pages);
>   /* This is called for allocating swap entry, not cache */
>   offset = scan_swap_map(si, 1);
>   if (offset) {
> + atomic_long_dec(&nr_swap_pages);
>   spin_unlock(&si->lock);
>   return swp_entry(type, offset);
>   }
> - atomic_long_inc(&nr_swap_pages);
>   }
>   spin_unlock(&si->lock);
>  fail:
> 



Re: [PATCH v2] mm/page_alloc: speeding up the iteration of max_order

2020-12-04 Thread Vlastimil Babka
On 12/4/20 1:56 PM, Muchun Song wrote:
> When we free a page whose order is very close to MAX_ORDER and greater
> than pageblock_order, it wastes some CPU cycles to increase max_order
> to MAX_ORDER one by one and check the pageblock migratetype of that page
> repeatedly especially when MAX_ORDER is much larger than pageblock_order.

I would add:

We also should not be checking the migratetype of the buddy when "order ==
MAX_ORDER - 1", as the buddy pfn may be invalid, so adjust the condition. With
the new check, we don't need the max_order check anymore, so replace it.

Also adjust the max_order initialization so that it's lower by one than
previously, which hopefully makes the code clearer.

> Signed-off-by: Muchun Song 

Fixes: d9dddbf55667 ("mm/page_alloc: prevent merging between isolated and other
pageblocks")
Acked-by: Vlastimil Babka 
Thanks!

> ---
> Changes in v2:
>  - Rework the code suggested by Vlastimil. Thanks.
> 
>  mm/page_alloc.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f91df593bf71..56e603eea1dd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1002,7 +1002,7 @@ static inline void __free_one_page(struct page *page,
>   struct page *buddy;
>   bool to_tail;
>  
> - max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
> + max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
>  
>   VM_BUG_ON(!zone_is_initialized(zone));
>   VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1015,7 +1015,7 @@ static inline void __free_one_page(struct page *page,
>   VM_BUG_ON_PAGE(bad_range(zone, page), page);
>  
>  continue_merging:
> - while (order < max_order - 1) {
> + while (order < max_order) {
>   if (compaction_capture(capc, page, order, migratetype)) {
>   __mod_zone_freepage_state(zone, -(1 << order),
>   migratetype);
> @@ -1041,7 +1041,7 @@ static inline void __free_one_page(struct page *page,
>   pfn = combined_pfn;
>   order++;
>   }
> - if (max_order < MAX_ORDER) {
> + if (order < MAX_ORDER - 1) {
>   /* If we are here, it means order is >= pageblock_order.
>* We want to prevent merge between freepages on isolate
>* pageblock and normal pageblock. Without this, pageblock
> @@ -1062,7 +1062,7 @@ static inline void __free_one_page(struct page *page,
>   is_migrate_isolate(buddy_mt)))
>   goto done_merging;
>   }
> - max_order++;
> + max_order = order + 1;
>   goto continue_merging;
>   }
>  
> 



Re: [External] Re: [PATCH] mm/page_alloc: speeding up the iteration of max_order

2020-12-04 Thread Vlastimil Babka
On 12/4/20 5:03 AM, Muchun Song wrote:
> On Fri, Dec 4, 2020 at 1:37 AM Vlastimil Babka  wrote:
>>
>> On 12/2/20 1:18 PM, Muchun Song wrote:
>> > When we free a page whose order is very close to MAX_ORDER and greater
>> > than pageblock_order, it wastes some CPU cycles to increase max_order
>> > to MAX_ORDER one by one and check the pageblock migratetype of that page
>>
>> But we have to do that. It's not the same page, it's the merged page and the 
>> new
>> buddy is a different pageblock and we need to check if they have compatible
>> migratetypes and can merge, or we have to bail out. So the patch is wrong.
>>
>> > repeatedly especially when MAX_ORDER is much larger than pageblock_order.
>>
>> Do we have such architectures/configurations anyway?
>>
>> > Signed-off-by: Muchun Song 
>> > ---
>> >  mm/page_alloc.c | 4 +++-
>> >  1 file changed, 3 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > index 141f12e5142c..959541234e1d 100644
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -1041,7 +1041,7 @@ static inline void __free_one_page(struct page *page,
>> >   pfn = combined_pfn;
>> >   order++;
>> >   }
>> > - if (max_order < MAX_ORDER) {
> 
> If we free a page with order == MAX_ORDER - 1, it has no buddy.
> The following pageblock operation is also pointless.

OK, I see.

>> > + if (max_order < MAX_ORDER && order < MAX_ORDER - 1) {

Yes, this makes sense, as in your other patch we shouldn't check the buddy when
order == MAX_ORDER - 1 already.

>> >   /* If we are here, it means order is >= pageblock_order.
>> >* We want to prevent merge between freepages on isolate
>> >* pageblock and normal pageblock. Without this, pageblock
>> > @@ -1062,6 +1062,8 @@ static inline void __free_one_page(struct page *page,
>> >   
>> > is_migrate_isolate(buddy_mt)))
>> >   goto done_merging;
>> >   }
>> > + if (unlikely(order != max_order - 1))
>> > + max_order = order + 1;
>> >   max_order++;

OK, I see now what you want to do here. The "if" may be true if we already
entered the function with order > pageblock_order.
I think we could just simplify the "if" and "max_order++" above to:

max_order = order + 2

which starts to get a bit ugly, so why not make max_order one lower (compared
to now) throughout the whole function:

max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
...
continue_merging:
while (order < max_order) {
...
if (order < MAX_ORDER - 1) {
// it's redundant to keep checking max_order < MAX_ORDER - 1 here after your
change, right?
...

max_order = order + 1; // less weird than "+ 2"

Off by one errors, here we go!
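
To make the bound arithmetic above easier to follow, here is a minimal
standalone model of the sketched loop (plain userspace C, not a patch;
MAX_ORDER and pageblock_order are assumed to be the common x86 values, and the
buddy/isolation checks are stubbed out as always succeeding):

#include <stdio.h>

#define MAX_ORDER 11		/* assumed x86 default */
#define pageblock_order 9	/* assumed x86 default */

int main(void)
{
	unsigned int order = 9;		/* order of the page being freed */
	unsigned int max_order = MAX_ORDER - 1 < pageblock_order ?
				 MAX_ORDER - 1 : pageblock_order;

continue_merging:
	while (order < max_order) {
		/* pretend the buddy is free and compatible; merge */
		order++;
	}
	if (order < MAX_ORDER - 1) {
		/* the isolation check on the buddy's pageblock would go here */
		max_order = order + 1;
		goto continue_merging;
	}
	printf("final order: %u (bounded by MAX_ORDER - 1 = %d)\n",
	       order, MAX_ORDER - 1);
	return 0;
}

With the shifted max_order, the loop condition and the final bound check use
the same "order < something" form, which is what keeps the off-by-one
reasoning manageable.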

>> Or maybe I just don't understand what this is doing. When is the new 'if' 
>> even
>> true? We just bailed out of "while (order < max_order - 1)" after the last
>> "order++", which means it should hold that "order == max_order - 1")?
> 
> No, I do not agree. The MAX_ORDER may be greater than 11.
> 
> # git grep "CONFIG_FORCE_MAX_ZONEORDER"
> # arch/arm/configs/imx_v6_v7_defconfig:CONFIG_FORCE_MAX_ZONEORDER=14
> # arch/powerpc/configs/85xx/ge_imp3a_defconfig:CONFIG_FORCE_MAX_ZONEORDER=17
> # arch/powerpc/configs/fsl-emb-nonhw.config:CONFIG_FORCE_MAX_ZONEORDER=13
> 
> Have you seen it? On some architecture, the MAX_ORDER
> can be 17. When we free a page with an order 16. Without this
> patch, the max_order should be increased one by one from 10 to
> 17.
> 
> Thanks.
> 
> 
>> Your description sounds like you want to increase max_order to MAX_ORDER in 
>> one
>> step, which as I explained would be wrong. But the implementation looks 
>> actually
>> like a no-op.
>>
>> >   max_order++;
>> >   goto continue_merging;
>> >   }
>> >
>>
> 
> 
> --
> Yours,
> Muchun
> 



Re: [PATCH] mm: fix a race on nr_swap_pages

2020-12-03 Thread Vlastimil Babka
On 12/3/20 12:36 PM, Zhaoyang Huang wrote:
> The scenario on which "Free swap -4kB" happens in my system, which is caused 
> by
>  get_swap_page_of_type or get_swap_pages racing with show_mem. Remove the race
>  here.
> 
> Signed-off-by: Zhaoyang Huang 
> ---
>  mm/swapfile.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index cf63b5f..13201b6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -974,6 +974,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], 
> int entry_size)
>   /* Only single cluster request supported */
>   WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>  
> + spin_lock(&swap_avail_lock);
> +
>   avail_pgs = atomic_long_read(&nr_swap_pages) / size;
>   if (avail_pgs <= 0)
>   goto noswap;

This goto will leave with the spin lock locked, so that's a bug.

> @@ -986,8 +988,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], 
> int entry_size)
>  
>   atomic_long_sub(n_goal * size, &nr_swap_pages);
>  
> - spin_lock(&swap_avail_lock);
> -

Is the problem that while we adjust n_goal with a min3(..., avail_pgs), somebody
else can decrease nr_swap_pages in the meanwhile and then we underflow? If yes,
the spin lock won't eliminate all such cases it seems, as e.g.
get_swap_page_of_type isn't done under the same lock, AFAIK.

>  start_over:
>   node = numa_node_id();
>   plist_for_each_entry_safe(si, next, &swap_avail_heads[node], 
> avail_lists[node]) {
> @@ -1061,14 +1061,13 @@ swp_entry_t get_swap_page_of_type(int type)
>  
>   spin_lock(&si->lock);
>   if (si->flags & SWP_WRITEOK) {
> - atomic_long_dec(&nr_swap_pages);
>   /* This is called for allocating swap entry, not cache */
>   offset = scan_swap_map(si, 1);
>   if (offset) {
> + atomic_long_dec(&nr_swap_pages);
>   spin_unlock(&si->lock);
>   return swp_entry(type, offset);
>   }
> - atomic_long_inc(&nr_swap_pages);

This hunk looks safer, unless I miss something. Did you check if it's enough to
prevent the negative values on your systems?
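
As an aside, the kind of transient state that makes show_mem print bogus
values can be reproduced with a contrived standalone demo (plain C11 +
pthreads, not kernel code): a counter decremented before the allocation is
known to succeed and re-incremented on failure can be observed negative by a
concurrent reader, while decrementing only after success cannot.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long nr_free;	/* starts at 0: nothing is free */
static atomic_int stop;

static void *reader(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		long v = atomic_load(&nr_free);

		if (v < 0)
			printf("observed bogus negative value: %ld\n", v);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	int i;

	pthread_create(&t, NULL, reader, NULL);
	for (i = 0; i < 1000000; i++) {
		/* old scheme: decrement first, allocation fails, undo */
		atomic_fetch_sub(&nr_free, 1);
		atomic_fetch_add(&nr_free, 1);
	}
	atomic_store(&stop, 1);
	pthread_join(t, NULL);
	return 0;
}

Build with "cc -pthread"; whether the message actually triggers depends on
scheduling, but the window exists, and moving the decrement after the
successful allocation removes it.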

>   }
>   spin_unlock(&si->lock);
>  fail:
> 



Re: [PATCH] mm/page_alloc: speeding up the iteration of max_order

2020-12-03 Thread Vlastimil Babka
On 12/2/20 1:18 PM, Muchun Song wrote:
> When we free a page whose order is very close to MAX_ORDER and greater
> than pageblock_order, it wastes some CPU cycles to increase max_order
> to MAX_ORDER one by one and check the pageblock migratetype of that page

But we have to do that. It's not the same page, it's the merged page and the new
buddy is a different pageblock and we need to check if they have compatible
migratetypes and can merge, or we have to bail out. So the patch is wrong.

> repeatedly especially when MAX_ORDER is much larger than pageblock_order.

Do we have such architectures/configurations anyway?

> Signed-off-by: Muchun Song 
> ---
>  mm/page_alloc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 141f12e5142c..959541234e1d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1041,7 +1041,7 @@ static inline void __free_one_page(struct page *page,
>   pfn = combined_pfn;
>   order++;
>   }
> - if (max_order < MAX_ORDER) {
> + if (max_order < MAX_ORDER && order < MAX_ORDER - 1) {
>   /* If we are here, it means order is >= pageblock_order.
>* We want to prevent merge between freepages on isolate
>* pageblock and normal pageblock. Without this, pageblock
> @@ -1062,6 +1062,8 @@ static inline void __free_one_page(struct page *page,
>   is_migrate_isolate(buddy_mt)))
>   goto done_merging;
>   }
> + if (unlikely(order != max_order - 1))
> + max_order = order + 1;

Or maybe I just don't understand what this is doing. When is the new 'if' even
true? We just bailed out of "while (order < max_order - 1)" after the last
"order++", which means it should hold that "order == max_order - 1")?
Your description sounds like you want to increase max_order to MAX_ORDER in one
step, which as I explained would be wrong. But the implementation looks actually
like a no-op.

>   max_order++;
>   goto continue_merging;
>   }
> 



Re: [PATCH] mm/page_isolation: do not isolate the max order page

2020-12-03 Thread Vlastimil Babka
On 12/3/20 5:26 PM, David Hildenbrand wrote:
> On 03.12.20 01:03, Vlastimil Babka wrote:
>> On 12/2/20 1:21 PM, Muchun Song wrote:
>>> The max order page has no buddy page and never merge to other order.
>>> So isolating and then freeing it is pointless.
>>>
>>> Signed-off-by: Muchun Song 
>> 
>> Acked-by: Vlastimil Babka 
>> 
>>> ---
>>>  mm/page_isolation.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index a254e1f370a3..bddf788f45bf 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -88,7 +88,7 @@ static void unset_migratetype_isolate(struct page *page, 
>>> unsigned migratetype)
>>>  */
>>> if (PageBuddy(page)) {
>>> order = buddy_order(page);
>>> -   if (order >= pageblock_order) {
>>> +   if (order >= pageblock_order && order < MAX_ORDER - 1) {
>>> pfn = page_to_pfn(page);
>>> buddy_pfn = __find_buddy_pfn(pfn, order);
>>> buddy = page + (buddy_pfn - pfn);
>> 
>> Hm I wonder if order == MAX_ORDER - 1, then the buddy can actually be a
>> !pfn_valid() in some corner case? pfn_valid_within(buddy_pfn) that follows 
>> would
>> only catch it on archs with holes in zone. Then 
>> is_migrate_isolate_page(buddy)
>> might access an invalid buddy. So this might be actually a bug fix and not 
>> just
>> optimization, just the bug hasn't been observed in practice.
> 
> I think we have no users that isolate/unisolate close to holes.
> 
> CMA regions are properly aligned (to max of page_order /
> max_order_nr_pages) and don't contain holes.

The problem as I see it is that buddy_order(page) might already be MAX_ORDER -
1 (e.g. two pageblocks on x86), and then finding the buddy of that one is beyond
the guaranteed alignment (if they merged, which they can't, it would be four
pageblocks). It might not be just a hole within the zone, but also across a zone
boundary? While the range is isolated and in-use pages are migrated away, the
freed pages shouldn't merge to MAX_ORDER-1, but what if the MAX_ORDER-1 free
page was already there before the isolation?
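
To make the alignment concern concrete, here is a tiny standalone calculation
(assuming 4K pages and the x86 default MAX_ORDER of 11, so a MAX_ORDER - 1
block is two pageblocks, i.e. 4MB); the buddy of such a block is a whole 4MB
away, which nothing in the isolation API guarantees to be a valid pfn or even
in the same zone:

#include <stdio.h>

#define MAX_ORDER 11	/* assumed x86 default */

/* same formula as __find_buddy_pfn() */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

int main(void)
{
	unsigned int order = MAX_ORDER - 1;
	unsigned long pfn = 0x100000;	/* start of a MAX_ORDER-1 aligned block */

	printf("buddy of pfn 0x%lx at order %u is pfn 0x%lx\n",
	       pfn, order, find_buddy_pfn(pfn, order));
	return 0;
}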

> virtio-mem does not apply as it knows its range has no holes.
> 
> gigantic pages are aligned naturally and we check that there are no
> holes upfront.
> 
> There are no other users.
> 
> 
> I don't see a need for stable/fixes.
> 



Re: [External] Re: [PATCH] mm/page_isolation: do not isolate the max order page

2020-12-03 Thread Vlastimil Babka
On 12/3/20 3:43 AM, Muchun Song wrote:
> On Thu, Dec 3, 2020 at 8:03 AM Vlastimil Babka  wrote:
>>
>> On 12/2/20 1:21 PM, Muchun Song wrote:
>> > The max order page has no buddy page and never merge to other order.
>> > So isolating and then freeing it is pointless.
>> >
>> > Signed-off-by: Muchun Song 
>>
>> Acked-by: Vlastimil Babka 
>>
>> > ---
>> >  mm/page_isolation.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> > index a254e1f370a3..bddf788f45bf 100644
>> > --- a/mm/page_isolation.c
>> > +++ b/mm/page_isolation.c
>> > @@ -88,7 +88,7 @@ static void unset_migratetype_isolate(struct page *page, 
>> > unsigned migratetype)
>> >*/
>> >   if (PageBuddy(page)) {
>> >   order = buddy_order(page);
>> > - if (order >= pageblock_order) {
>> > + if (order >= pageblock_order && order < MAX_ORDER - 1) {
>> >   pfn = page_to_pfn(page);
>> >   buddy_pfn = __find_buddy_pfn(pfn, order);
>> >   buddy = page + (buddy_pfn - pfn);
>>
>> Hm I wonder if order == MAX_ORDER - 1, then the buddy can actually be a
>> !pfn_valid() in some corner case? pfn_valid_within(buddy_pfn) that follows 
>> would
>> only catch it on archs with holes in zone. Then 
>> is_migrate_isolate_page(buddy)
>> might access an invalid buddy. So this might be actually a bug fix and not 
>> just
>> optimization, just the bug hasn't been observed in practice.
> 
> Agree. Should we add a Fixes tag in the commit log? Thanks.

Right.

Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated
pageblock")

The criteria for CC stable are not met though, as the issue is theoretical.

>>
>> >
>>
> 
> 



Re: [PATCH] mm/page_isolation: do not isolate the max order page

2020-12-02 Thread Vlastimil Babka
On 12/2/20 1:21 PM, Muchun Song wrote:
> The max order page has no buddy page and never merge to other order.
> So isolating and then freeing it is pointless.
> 
> Signed-off-by: Muchun Song 

Acked-by: Vlastimil Babka 

> ---
>  mm/page_isolation.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index a254e1f370a3..bddf788f45bf 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -88,7 +88,7 @@ static void unset_migratetype_isolate(struct page *page, 
> unsigned migratetype)
>*/
>   if (PageBuddy(page)) {
>   order = buddy_order(page);
> - if (order >= pageblock_order) {
> + if (order >= pageblock_order && order < MAX_ORDER - 1) {
>   pfn = page_to_pfn(page);
>   buddy_pfn = __find_buddy_pfn(pfn, order);
>   buddy = page + (buddy_pfn - pfn);

Hm I wonder if order == MAX_ORDER - 1, then the buddy can actually be a
!pfn_valid() in some corner case? pfn_valid_within(buddy_pfn) that follows would
only catch it on archs with holes in zone. Then is_migrate_isolate_page(buddy)
might access an invalid buddy. So this might be actually a bug fix and not just
optimization, just the bug hasn't been observed in practice.

> 



crediting bug reports and fixes folded into original patch

2020-12-02 Thread Vlastimil Babka
Hi,

there was a bit of debate on Twitter about this, so I thought I would bring it
here. Imagine a scenario where a patch sits as a commit in -next and there's a bug
report or fix, possibly by a bot or from some static analysis. The maintainer
decides to fold it into the original patch, which makes sense for e.g.
bisectability. But there seem to be no clear rules about attribution in this
case, and it looks like there should be, probably in
Documentation/maintainer/modifying-patches.rst.

The original bug fix might include a From: $author, a Reported-by: (e.g.
syzbot), Fixes: $next-commit, some tag such as Addresses-Coverity: to credit the
static analysis tool, and an SoB. After folding, all that's left might be a line
such as "include fix from $author" in the SoB area. This is a loss of
metadata/attribution just due to folding, and might make contributors unhappy.
Had they sent the fix after the original commit was mainline and immutable, all
the info above would "survive" in the form of a new commit.

So I think we could decide what the proper format would be, and document it
properly. I personally wouldn't mind just copy/pasting the whole commit message
of the fix (with just a short issue description, no need to include stacktraces
etc. if the fix is folded); we could just standardize where and how to delimit
it from the main commit message. If it's a report (person or bot) of a bug that
the main author then fixed, preserve the Reported-by in the same way (making
clear it's not a Reported-By for the "main thing" addressed by the commit).
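
For illustration, the folded-fix part of a commit message could look something
like this (the placement and the indented block are just one hypothetical
option, not an existing convention; $author, $fix_author, $reporter and
$maintainer are placeholders):

  mm/foo: do the main thing

  Long description of the main change.

  Includes a folded fix for a bug found in -next:

      mm/foo: fix a null deref in do_the_main_thing()

      Reported-by: $reporter (e.g. a bot such as syzbot)
      Addresses-Coverity: ("some finding")
      Signed-off-by: $fix_author

  Signed-off-by: $author
  Signed-off-by: $maintainer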

In the debate, one less verbose alternative proposed was a SoB with a comment
describing that it's for a fix and not the whole patch, as some see SoB as the
main mark of contribution, which can be easily found and counted etc. I'm not so sure about
it myself, as AFAIK SoB is mainly a DCO thing, and for a maintainer it means
something else ("passed through my tree") than for a patch author. And this
approach would still lose the other tags.

Thoughts?
Vlastimil


Re: [External] Re: [PATCH] mm/memcg: fix NULL pointer dereference at workingset_eviction

2020-11-30 Thread Vlastimil Babka

On 11/30/20 2:45 PM, Michal Hocko wrote:

On Mon 30-11-20 21:36:49, Muchun Song wrote:

On Mon, Nov 30, 2020 at 9:23 PM Michal Hocko  wrote:
>
> On Mon 30-11-20 21:15:12, Muchun Song wrote:
> > We found a case of kernel panic. The stack trace is as follows
> > (omit some irrelevant information):
> >
> > BUG: kernel NULL pointer dereference, address: 00c8
> > RIP: 0010:workingset_eviction+0x26b/0x450
> > Call Trace:
> >  __remove_mapping+0x224/0x2b0
> >  shrink_page_list+0x8c2/0x14e0
> >  shrink_inactive_list+0x1bf/0x3f0
> >  ? do_raw_spin_unlock+0x49/0xc0
> >  ? _raw_spin_unlock+0xa/0x20
> >  shrink_lruvec+0x401/0x640
> >
> > This was caused by commit 76761ffa9ea1 ("mm/memcg: bail out early when
> > !memcg in mem_cgroup_lruvec"). When the parameter of memcg is NULL, we
> > should not use the &pgdat->__lruvec. So this just reverts commit
> > 76761ffa9ea1 to fix it.
> >
> > Fixes: 76761ffa9ea1 ("mm/memcg: bail out early when !memcg in 
mem_cgroup_lruvec")
>
> I do not see any commits like that in the current Linus tree. Is this a
> commit id from the linux-next? If yes, can we just fold it into the
> respective patch in mmotm tree please?

Yes. This commit is on the linux-next tree.


FYI, patches coming from mmotm are constantly rebased in linux-next so
the sha1 is meaningless and shouldn't be added as a reference in the
changelog.


Of course can.


Thanks! I believe Andrew should be able to just pick up the patch and
make it -fix patch.


Well, the fix is a revert, so just remove the patch from mmotm/next?
BTW, it looks like it wasn't sent to linux-mm [1], probably due to a missing To: header.

[1] 
https://lore.kernel.org/lkml/1606446515-36069-1-git-send-email-alex@linux.alibaba.com/





Re: [PATCH] mm/page_owner: Record timestamp and pid

2020-11-30 Thread Vlastimil Babka

On 11/27/20 8:23 PM, Souptick Joarder wrote:

On Sat, Nov 28, 2020 at 12:36 AM Vlastimil Babka  wrote:


On 11/27/20 7:57 PM, Georgi Djakov wrote:
> Hi Vlastimil,
>
> Thanks for the comment!
>
> On 11/27/20 19:52, Vlastimil Babka wrote:
>> On 11/12/20 8:14 PM, Andrew Morton wrote:
>>> On Thu, 12 Nov 2020 20:41:06 +0200 Georgi Djakov 
>>> wrote:
>>>
>>>> From: Liam Mark 
>>>>
>>>> Collect the time for each allocation recorded in page owner so that
>>>> allocation "surges" can be measured.
>>>>
>>>> Record the pid for each allocation recorded in page owner so that
>>>> the source of allocation "surges" can be better identified.
>>>
>>> Please provide a description of why this is considered useful.  What
>>> has it been used for, what problems has it been used to solve?
>>
>> Worth noting that on x86_64 it doubles the size of struct page_owner
>> from 16 bytes to 32, so it better be justified:
>
> Well, that's true. But for debug options there is almost always some penalty.
> The timestamp and pid information is very useful for me (and others, i 
believe)
> when doing memory analysis. On a crash for example, we can get this 
information
> from kdump (or RAM-dump) and look into it to catch memory allocation problems
> more easily.

Right. Btw, you should add printing the info to __dump_page_owner().

> If you find the above argument not strong enough, how about a separate config
> option for this? Maybe something like CONFIG_PAGE_OWNER_EXTENDED, which could
> be enabled in addition to CONFIG_PAGE_OWNER?

It might be strong enough if it's mentioned in changelog, and also what exactly
the space tradeoff is :)


Just a thought ... putting it inside CONFIG_PAGE_OWNER_DEBUG might be
better if it is used
purely for debugging purposes.


I don't think we need to introduce a new config just yet, until someone makes the 
case for it. Even then, it could instead be done as an extension to the 
"page_owner=on" boot option.

I mean let's add those fields, but improve the changelog.



You can also mention that SLUB object tracking also records pid+timestamp.

> Thanks,
> Georgi
>
>>
>> struct page_owner {
>>  short unsigned int order;/* 0 2 */
>>  short int  last_migrate_reason;  /* 2 2 */
>>  gfp_t  gfp_mask; /* 4 4 */
>>  depot_stack_handle_t   handle;   /* 8 4 */
>>  depot_stack_handle_t   free_handle;  /*12 4 */
>>  u64ts_nsec;  /*16 8 */
>>  intpid;  /*24 4 */
>>
>>  /* size: 32, cachelines: 1, members: 7 */
>>  /* padding: 4 */
>>  /* last cacheline: 32 bytes */
>> };
>>
>








Re: [PATCH] mm/page_owner: Record timestamp and pid

2020-11-27 Thread Vlastimil Babka

On 11/27/20 7:57 PM, Georgi Djakov wrote:

Hi Vlastimil,

Thanks for the comment!

On 11/27/20 19:52, Vlastimil Babka wrote:

On 11/12/20 8:14 PM, Andrew Morton wrote:
On Thu, 12 Nov 2020 20:41:06 +0200 Georgi Djakov  
wrote:



From: Liam Mark 

Collect the time for each allocation recorded in page owner so that
allocation "surges" can be measured.

Record the pid for each allocation recorded in page owner so that
the source of allocation "surges" can be better identified.


Please provide a description of why this is considered useful.  What
has it been used for, what problems has it been used to solve?


Worth noting that on x86_64 it doubles the size of struct page_owner
from 16 bytes to 32, so it better be justified:


Well, that's true. But for debug options there is almost always some penalty.
The timestamp and pid information is very useful for me (and others, i believe)
when doing memory analysis. On a crash for example, we can get this information
from kdump (or RAM-dump) and look into it to catch memory allocation problems
more easily.


Right. Btw, you should add printing the info to __dump_page_owner().


If you find the above argument not strong enough, how about a separate config
option for this? Maybe something like CONFIG_PAGE_OWNER_EXTENDED, which could
be enabled in addition to CONFIG_PAGE_OWNER?


It might be strong enough if it's mentioned in the changelog, along with what 
exactly the space tradeoff is :)


You can also mention that SLUB object tracking also records pid+timestamp.


Thanks,
Georgi



struct page_owner {
     short unsigned int order;    /* 0 2 */
     short int  last_migrate_reason;  /* 2 2 */
     gfp_t  gfp_mask; /* 4 4 */
     depot_stack_handle_t   handle;   /* 8 4 */
     depot_stack_handle_t   free_handle;  /*    12 4 */
     u64    ts_nsec;  /*    16 8 */
     int    pid;  /*    24 4 */

     /* size: 32, cachelines: 1, members: 7 */
     /* padding: 4 */
     /* last cacheline: 32 bytes */
};







Re: [PATCH] mm/page_owner: Record timestamp and pid

2020-11-27 Thread Vlastimil Babka

On 11/12/20 8:14 PM, Andrew Morton wrote:

On Thu, 12 Nov 2020 20:41:06 +0200 Georgi Djakov  
wrote:


From: Liam Mark 

Collect the time for each allocation recorded in page owner so that
allocation "surges" can be measured.

Record the pid for each allocation recorded in page owner so that
the source of allocation "surges" can be better identified.


Please provide a description of why this is considered useful.  What
has it been used for, what problems has it been used to solve?


Worth noting that on x86_64 it doubles the size of struct page_owner
from 16 bytes to 32, so it better be justified:

struct page_owner {
short unsigned int order;/* 0 2 */
short int  last_migrate_reason;  /* 2 2 */
gfp_t  gfp_mask; /* 4 4 */
depot_stack_handle_t   handle;   /* 8 4 */
depot_stack_handle_t   free_handle;  /*12 4 */
u64ts_nsec;  /*16 8 */
intpid;  /*24 4 */

/* size: 32, cachelines: 1, members: 7 */
/* padding: 4 */
/* last cacheline: 32 bytes */
};



Are there userspace tools which aid in the processing of this new
information?

Can/should Documentation/vm/page_owner.rst be updated?


--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -25,6 +26,8 @@ struct page_owner {

gfp_t gfp_mask;
depot_stack_handle_t handle;
depot_stack_handle_t free_handle;
+   u64 ts_nsec;
+   int pid;


pid_t would be nicer?







Re: [PATCH v3] mm/shmem.c: make shmem_mapping() inline

2020-11-27 Thread Vlastimil Babka

On 11/15/20 5:52 PM, Hui Su wrote:

shmem_mapping() isn't worth an out-of-line call
from any callsite.

So make it inline by
- make shmem_aops global
- export shmem_aops
- inline the shmem_mapping()

and replace the direct call 'shmem_aops' with shmem_mapping()
in shmem.c.

v1->v2:
remove the inline for func declaration in shmem_fs.h

v2->v3:
make shmem_aops global, and export it to modules.

Signed-off-by: Hui Su 


Acked-by: Vlastimil Babka 


---
  include/linux/shmem_fs.h |  6 +-
  mm/shmem.c   | 16 ++--
  2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index a5a5d1d4d7b1..d82b6f396588 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -67,7 +67,11 @@ extern unsigned long shmem_get_unmapped_area(struct file *, 
unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags);
  extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
  #ifdef CONFIG_SHMEM
-extern bool shmem_mapping(struct address_space *mapping);
+extern const struct address_space_operations shmem_aops;
+static inline bool shmem_mapping(struct address_space *mapping)
+{
+   return mapping->a_ops == &shmem_aops;
+}
  #else
  static inline bool shmem_mapping(struct address_space *mapping)
  {
diff --git a/mm/shmem.c b/mm/shmem.c
index 537c137698f8..b7361fce50bc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -246,7 +246,7 @@ static inline void shmem_inode_unacct_blocks(struct inode 
*inode, long pages)
  }
  
  static const struct super_operations shmem_ops;

-static const struct address_space_operations shmem_aops;
+const struct address_space_operations shmem_aops;
  static const struct file_operations shmem_file_operations;
  static const struct inode_operations shmem_inode_operations;
  static const struct inode_operations shmem_dir_inode_operations;
@@ -1152,7 +1152,7 @@ static void shmem_evict_inode(struct inode *inode)
struct shmem_inode_info *info = SHMEM_I(inode);
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
  
-	if (inode->i_mapping->a_ops == &shmem_aops) {

+   if (shmem_mapping(inode->i_mapping)) {
shmem_unacct_size(info->flags, inode->i_size);
inode->i_size = 0;
shmem_truncate_range(inode, 0, (loff_t)-1);
@@ -1858,7 +1858,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t 
index,
}
  
  	/* shmem_symlink() */

-   if (mapping->a_ops != &shmem_aops)
+   if (!shmem_mapping(mapping))
goto alloc_nohuge;
if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
goto alloc_nohuge;
@@ -2352,11 +2352,6 @@ static struct inode *shmem_get_inode(struct super_block 
*sb, const struct inode
return inode;
  }
  
-bool shmem_mapping(struct address_space *mapping)

-{
-   return mapping->a_ops == &shmem_aops;
-}
-
  static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
  pmd_t *dst_pmd,
  struct vm_area_struct *dst_vma,
@@ -3865,7 +3860,7 @@ static void shmem_destroy_inodecache(void)
kmem_cache_destroy(shmem_inode_cachep);
  }
  
-static const struct address_space_operations shmem_aops = {

+const struct address_space_operations shmem_aops = {
.writepage  = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
  #ifdef CONFIG_TMPFS
@@ -3877,6 +3872,7 @@ static const struct address_space_operations shmem_aops = 
{
  #endif
.error_remove_page = generic_error_remove_page,
  };
+EXPORT_SYMBOL(shmem_aops);
  
  static const struct file_operations shmem_file_operations = {

.mmap   = shmem_mmap,
@@ -4312,7 +4308,7 @@ struct page *shmem_read_mapping_page_gfp(struct 
address_space *mapping,
struct page *page;
int error;
  
-	BUG_ON(mapping->a_ops != &shmem_aops);

+   BUG_ON(!shmem_mapping(mapping));
error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE,
  gfp, NULL, NULL, NULL);
if (error)





Re: [PATCH] mm/shmem: use kmem_cache_zalloc in shmem_alloc_inode()

2020-11-27 Thread Vlastimil Babka

On 11/15/20 6:40 PM, Hui Su wrote:

in shmem_get_inode():
new_inode();
   new_inode_pseudo();
 alloc_inode();
   ops->alloc_inode(); -> shmem_alloc_inode()
 kmem_cache_alloc();

memset(info, 0, (char *)inode - (char *)info);

So use kmem_cache_zalloc() in shmem_alloc_inode(),
and remove the memset in shmem_get_inode().

Signed-off-by: Hui Su 


Looks correct, but now we also clear the inode part, which seems to be handled by 
alloc_inode() well enough. So, unsure. The btrfs and ext4 variants don't use kzalloc 
either. It's also more obvious with the current way that the info is cleared. Hmm?



---
  mm/shmem.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 537c137698f8..b84adda45461 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2308,7 +2308,6 @@ static struct inode *shmem_get_inode(struct super_block 
*sb, const struct inode
inode->i_atime = inode->i_mtime = inode->i_ctime = 
current_time(inode);
inode->i_generation = prandom_u32();
info = SHMEM_I(inode);
-   memset(info, 0, (char *)inode - (char *)info);
spin_lock_init(&info->lock);
atomic_set(&info->stop_eviction, 0);
info->seals = F_SEAL_SEAL;
@@ -3828,7 +3827,7 @@ static struct kmem_cache *shmem_inode_cachep;
  static struct inode *shmem_alloc_inode(struct super_block *sb)
  {
struct shmem_inode_info *info;
-   info = kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL);
+   info = kmem_cache_zalloc(shmem_inode_cachep, GFP_KERNEL);
if (!info)
return NULL;
return &info->vfs_inode;





Re: [PATCH -next] mm/page_alloc: Mark some symbols with static keyword

2020-11-27 Thread Vlastimil Babka

On 11/16/20 10:02 AM, Zou Wei wrote:

Fix the following sparse warnings:

mm/page_alloc.c:3040:6: warning: symbol '__drain_all_pages' was not declared. 
Should it be static?
mm/page_alloc.c:6349:6: warning: symbol '__zone_set_pageset_high_and_batch' was 
not declared. Should it be static?

Signed-off-by: Zou Wei 


Acked-by: Vlastimil Babka 


---
  mm/page_alloc.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63d8d8b..e7548344 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3037,7 +3037,7 @@ static void drain_local_pages_wq(struct work_struct *work)
   * that need the guarantee that every CPU has drained can disable the
   * optimizing racy check.
   */
-void __drain_all_pages(struct zone *zone, bool force_all_cpus)
+static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
  {
int cpu;
  
@@ -6346,7 +6346,7 @@ static void pageset_init(struct per_cpu_pageset *p)

pcp->batch = BOOT_PAGESET_BATCH;
  }
  
-void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,

+static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long 
high,
unsigned long batch)
  {
struct per_cpu_pageset *p;





Re: [PATCH] mm: cma: improve pr_debug log in cma_release()

2020-11-27 Thread Vlastimil Babka

On 11/25/20 4:32 PM, Charan Teja Reddy wrote:

It is required to print 'count' of pages, along with the pages, passed
to cma_release to debug the cases of mismatched count value passed
between cma_alloc() and cma_release() from a code path.

As an example, consider the below scenario:
1) CMA pool size is 4MB and
2) User doing the erroneous step of allocating 2 pages but freeing 1
page in a loop from this CMA pool.
The step 2 causes cma_alloc() to return NULL at one point of time
because of -ENOMEM condition.

And the current pr_debug logs is not giving the info about these types
of allocation patterns because of count value not being printed in
cma_release().

We are printing the count value in the trace logs, just extend the same
to pr_debug logs too.

Signed-off-by: Charan Teja Reddy 


Acked-by: Vlastimil Babka 


---
  mm/cma.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..07c904b 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -512,7 +512,7 @@ bool cma_release(struct cma *cma, const struct page *pages, 
unsigned int count)
if (!cma || !pages)
return false;
  
-	pr_debug("%s(page %p)\n", __func__, (void *)pages);

+   pr_debug("%s(page %p, count %zu)\n", __func__, (void *)pages, count);
  
  	pfn = page_to_pfn(pages);
  





Re: [PATCH] mm/page_alloc: Do not isolate redundant pageblock

2020-11-27 Thread Vlastimil Babka

On 11/27/20 3:19 PM, Muchun Song wrote:

Current pageblock isolation logic could isolate each pageblock individually
since commit d9dddbf55667 ("mm/page_alloc: prevent merging between isolated
and other pageblocks"). So we not need to concern about page allocator
merges buddies from different pageblocks and changes MIGRATE_ISOLATE to
some other migration type.


Yeah, that should be the case now.


Signed-off-by: Muchun Song 
---
  mm/page_alloc.c | 26 --
  1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cefbef32bf4a..608a2c2b8ab7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8313,16 +8313,14 @@ struct page *has_unmovable_pages(struct zone *zone, 
struct page *page,
  }
  
  #ifdef CONFIG_CONTIG_ALLOC

-static unsigned long pfn_max_align_down(unsigned long pfn)
+static unsigned long pfn_align_down(unsigned long pfn)
  {
-   return pfn & ~(max_t(unsigned long, MAX_ORDER_NR_PAGES,
-pageblock_nr_pages) - 1);
+   return pfn & ~(pageblock_nr_pages - 1);
  }
  
-static unsigned long pfn_max_align_up(unsigned long pfn)

+static unsigned long pfn_align_up(unsigned long pfn)
  {
-   return ALIGN(pfn, max_t(unsigned long, MAX_ORDER_NR_PAGES,
-   pageblock_nr_pages));
+   return ALIGN(pfn, pageblock_nr_pages);
  }


How about just removing these wrappers completely and using ALIGN and ALIGN_DOWN 
directly, as there are just two uses for each?
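
That is, something along these lines (untested sketch of the two call sites):

	ret = start_isolate_page_range(ALIGN_DOWN(start, pageblock_nr_pages),
				       ALIGN(end, pageblock_nr_pages),
				       migratetype, 0);
	...
	undo_isolate_page_range(ALIGN_DOWN(start, pageblock_nr_pages),
				ALIGN(end, pageblock_nr_pages), migratetype);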



  /* [start, end) must belong to a single zone. */
@@ -8415,14 +8413,6 @@ int alloc_contig_range(unsigned long start, unsigned 
long end,
INIT_LIST_HEAD(&cc.migratepages);
  
  	/*

-* What we do here is we mark all pageblocks in range as
-* MIGRATE_ISOLATE.  Because pageblock and max order pages may
-* have different sizes, and due to the way page allocator
-* work, we align the range to biggest of the two pages so
-* that page allocator won't try to merge buddies from
-* different pageblocks and change MIGRATE_ISOLATE to some
-* other migration type.
-*
 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
 * migrate the pages from an unaligned range (ie. pages that
 * we are interested in).  This will put all the pages in
@@ -8438,8 +8428,8 @@ int alloc_contig_range(unsigned long start, unsigned long 
end,
 * put back to page allocator so that buddy can use them.
 */
  
-	ret = start_isolate_page_range(pfn_max_align_down(start),

-  pfn_max_align_up(end), migratetype, 0);
+   ret = start_isolate_page_range(pfn_align_down(start), pfn_align_up(end),
+  migratetype, 0);
if (ret)
return ret;
  
@@ -8522,8 +8512,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,

free_contig_range(end, outer_end - end);
  
  done:

-   undo_isolate_page_range(pfn_max_align_down(start),
-   pfn_max_align_up(end), migratetype);
+   undo_isolate_page_range(pfn_align_down(start), pfn_align_up(end),
+   migratetype);
return ret;
  }
  EXPORT_SYMBOL(alloc_contig_range);





Re: [PATCH 3/3] mm,thp,shmem: make khugepaged obey tmpfs mount flags

2020-11-26 Thread Vlastimil Babka
On 11/26/20 7:14 PM, Rik van Riel wrote:
> On Thu, 2020-11-26 at 18:18 +0100, Vlastimil Babka wrote:
>> On 11/24/20 8:49 PM, Rik van Riel wrote:
>>> Currently if thp enabled=[madvise], mounting a tmpfs filesystem
>>> with huge=always and mmapping files from that tmpfs does not
>>> result in khugepaged collapsing those mappings, despite the
>>> mount flag indicating that it should.
>>>
>>> Fix that by breaking up the blocks of tests in hugepage_vma_check
>>> a little bit, and testing things in the correct order.
>>>
>>> Signed-off-by: Rik van Riel 
>>> Fixes: c2231020ea7b ("mm: thp: register mm for khugepaged when
>>> merging vma for shmem")
>>
>> Looks ok. But, it we have sysfs thp enabled=never, and shmem mount
>> explicitly 
>> thp enabled, then shmem mount overrides the global sysfs setting and
>> thp's will 
>> be allocated there, right? However, khugepaged_enabled() will be
>> false and thus 
>> khugepaged won't run at all? So a similar situation than what you're
>> fixing here.
> 
> Indeed, that is somewhat similar. Whether or not shmem
> allocations attempt huge pages is controlled by both
> the file /sys/kernel/mm/transparent_hugepage/shmem_enabled

Ah right, there's also that sysfs file.

> and mount options.
> 
> This patch makes khugepaged treat the mount options
> and/or
> sysfs flag as enabling collapsing of huge pages in case
> enabled = [always] for regular THPs.
> 
> Should I send another patch on top
> of this that causes
> khugepaged to be enabled when regular THPs are disabled,
> but shmem THPs are enabled in any way?

I think it would make sense. Although it might involve counting
thp-enabled shmem mounts and only running khugepaged when there are >0 of them.


Re: [PATCH 3/3] mm,thp,shmem: make khugepaged obey tmpfs mount flags

2020-11-26 Thread Vlastimil Babka

On 11/24/20 8:49 PM, Rik van Riel wrote:

Currently if thp enabled=[madvise], mounting a tmpfs filesystem
with huge=always and mmapping files from that tmpfs does not
result in khugepaged collapsing those mappings, despite the
mount flag indicating that it should.

Fix that by breaking up the blocks of tests in hugepage_vma_check
a little bit, and testing things in the correct order.

Signed-off-by: Rik van Riel 
Fixes: c2231020ea7b ("mm: thp: register mm for khugepaged when merging vma for 
shmem")


Looks ok. But, if we have sysfs thp enabled=never, and a shmem mount has thp 
explicitly enabled, then the shmem mount overrides the global sysfs setting and 
THPs will be allocated there, right? However, khugepaged_enabled() will be false 
and thus khugepaged won't run at all? So a similar situation to what you're fixing here.



---
  include/linux/khugepaged.h |  2 ++
  mm/khugepaged.c| 22 --
  2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index c941b7377321..2fcc01891b47 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -3,6 +3,7 @@
  #define _LINUX_KHUGEPAGED_H
  
  #include  /* MMF_VM_HUGEPAGE */

+#include 
  
  
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE

@@ -57,6 +58,7 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
  {
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
if ((khugepaged_always() ||
+(shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) ||
 (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
!(vm_flags & VM_NOHUGEPAGE) &&
!test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4e3dff13eb70..abab394c4206 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -440,18 +440,28 @@ static inline int khugepaged_test_exit(struct mm_struct 
*mm)
  static bool hugepage_vma_check(struct vm_area_struct *vma,
   unsigned long vm_flags)
  {
-   if ((!(vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
-   (vm_flags & VM_NOHUGEPAGE) ||
+   /* Explicitly disabled through madvise. */
+   if ((vm_flags & VM_NOHUGEPAGE) ||
test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
return false;
  
-	if (shmem_file(vma->vm_file) ||

-   (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
-vma->vm_file &&
-(vm_flags & VM_DENYWRITE))) {
+   /* Enabled via shmem mount options or sysfs settings. */
+   if (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) {
return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
HPAGE_PMD_NR);
}
+
+   /* THP settings require madvise. */
+   if (!(vm_flags & VM_HUGEPAGE) && !khugepaged_always())
+   return false;
+
+   /* Read-only file mappings need to be aligned for THP to work. */
+   if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
+   (vm_flags & VM_DENYWRITE)) {
+   return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
+   HPAGE_PMD_NR);
+   }
+
if (!vma->anon_vma || vma->vm_ops)
return false;
if (vma_is_temporary_stack(vma))





Re: [PATCH 1/3] mm,thp,shmem: limit shmem THP alloc gfp_mask

2020-11-26 Thread Vlastimil Babka

On 11/24/20 8:49 PM, Rik van Riel wrote:

The allocation flags of anonymous transparent huge pages can be controlled
through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
help the system from getting bogged down in the page reclaim and compaction
code when many THPs are getting allocated simultaneously.

However, the gfp_mask for shmem THP allocations were not limited by those
configuration settings, and some workloads ended up with all CPUs stuck
on the LRU lock in the page reclaim code, trying to allocate dozens of
THPs simultaneously.

This patch applies the same configurated limitation of THPs to shmem
hugepage allocations, to prevent that from happening.

Controlling the gfp_mask of THP allocations through the knobs in
sysfs allows users to determine the balance between how aggressively
the system tries to allocate THPs at fault time, and how much the
application may end up stalling attempting those allocations.

This way a THP defrag setting of "never" or "defer+madvise" will result
in quick allocation failures without direct reclaim when no 2MB free
pages are available.

With this patch applied, THP allocations for tmpfs will be a little
more aggressive than today for files mmapped with MADV_HUGEPAGE,
and a little less aggressive for files that are not mmapped or
mapped without that flag.

Signed-off-by: Rik van Riel 


Acked-by: Vlastimil Babka 


Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

2020-11-26 Thread Vlastimil Babka

On 11/26/20 12:22 PM, Vlastimil Babka wrote:

On 11/26/20 8:24 AM, Yu Zhao wrote:

On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:



On 2020/11/26 at 12:52 PM, Yu Zhao wrote:
>>   */
>>  void __pagevec_lru_add(struct pagevec *pvec)
>>  {
>> -  int i;
>> -  struct lruvec *lruvec = NULL;
>> +  int i, nr_lruvec;
>>unsigned long flags = 0;
>> +  struct page *page;
>> +  struct lruvecs lruvecs;
>>  
>> -	for (i = 0; i < pagevec_count(pvec); i++) {

>> -  struct page *page = pvec->pages[i];
>> +  nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
> Simply looping pvec multiple times (15 at most) for different lruvecs
> would be better because 1) it requires no extra data structures and
> therefore has better cache locality (theoretically faster) 2) it only
> loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
> impact on Android and Chrome OS.
> 


With multiple memcgs, it do help a lot, I had gotten 30% grain on readtwice
case. but yes, w/o MEMCG and NUMA, it's good to keep old behavior. So 
would you like has a proposal for this?


Oh, no, I'm not against your idea. I was saying it doesn't seem
necessary to sort -- a nested loop would just do the job given
pagevec is small.


Yeah, that could work. The worst case doesn't look nice (O(n^2)) but it should be
rather rare to have pages from many different memcgs/nodes?


However, Matthew wanted to increase pagevec size [1] and once 15^2 becomes 63^2, 
it starts to be somewhat more worrying.


[1] https://lore.kernel.org/linux-mm/20201105172651.2455-1-wi...@infradead.org/


Maybe with the following change? Avoids going through the nulls if we got lucky
(or have !MEMCG !NUMA).


diff --git a/mm/swap.c b/mm/swap.c
index cb3794e13b48..1d238edc2907 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, 
struct lruvec *lruvec)
   */
  void __pagevec_lru_add(struct pagevec *pvec)
  {
-   int i;
+   int i, j;


int i, j, n;


struct lruvec *lruvec = NULL;
unsigned long flags = 0;
  


n = pagevec_count(pvec);

for (i = 0; i < pagevec_count(pvec); i++) {


for (i = 0; n; i++) {


struct page *page = pvec->pages[i];
  
+		if (!page)

+   continue;
+
lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-   __pagevec_lru_add_fn(page, lruvec);


n--;


+
+   for (j = i; j < pagevec_count(pvec); j++) {
+   if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
+   page_memcg(pvec->pages[j]) != page_memcg(page))
+   continue;
+
+   __pagevec_lru_add_fn(pvec->pages[j], lruvec);
+   pvec->pages[j] = NULL;


n--;

+   }
}
if (lruvec)
unlock_page_lruvec_irqrestore(lruvec, flags);
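
For what it's worth, the skip-the-nulls idea boils down to a generic "group an
array by key while counting down the remaining items" pattern; a minimal
standalone illustration (plain C, with integers standing in for the
(nid, memcg) key and a done[] array standing in for NULLing the page pointer)
could look like:

#include <stdio.h>

#define N 15	/* pagevec-sized array */

int main(void)
{
	/* key of each "page": stands in for (page_to_nid(), page_memcg()) */
	int key[N]  = { 1, 1, 2, 1, 3, 2, 2, 1, 3, 1, 2, 3, 1, 2, 3 };
	int done[N] = { 0 };
	int i, j, n = N;

	for (i = 0; n; i++) {
		if (done[i])
			continue;
		/* "relock" to key[i]'s lruvec once ... */
		printf("lruvec %d:", key[i]);
		for (j = i; j < N; j++) {
			if (key[j] != key[i])
				continue;
			/* ... and add every page with the same key under it */
			printf(" %d", j);
			done[j] = 1;
			n--;
		}
		printf("\n");
	}
	return 0;
}

Each outer iteration takes the "lock" once per distinct key, and the remaining
counter lets the loop stop as soon as everything has been handled.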







Re: [PATCH next] mm/vmscan: __isolate_lru_page_prepare clean up

2020-11-26 Thread Vlastimil Babka

On 11/26/20 3:25 AM, Alex Shi wrote:



On 2020/11/26 at 7:43 AM, Andrew Morton wrote:

On Tue, 24 Nov 2020 12:21:28 +0100 Vlastimil Babka  wrote:


On 11/22/20 3:00 PM, Alex Shi wrote:

Thanks a lot for all comments, I picked all up and here is the v3:

 From 167131dd106a96fd08af725df850e0da6ec899af Mon Sep 17 00:00:00 2001
From: Alex Shi 
Date: Fri, 20 Nov 2020 14:49:16 +0800
Subject: [PATCH v3 next] mm/vmscan: __isolate_lru_page_prepare clean up

The function just return 2 results, so use a 'switch' to deal with its
result is unnecessary, and simplify it to a bool func as Vlastimil
suggested.

Also remove 'goto' by reusing list_move(), and take Matthew Wilcox's
suggestion to update comments in function.


I wouldn't mind if the goto stayed, but it's not repeating that much 
without it (list_move() + continue, 3 times) so...


I tried that, and .text became significantly larger, for reasons which
I didn't investigate ;)


I found out that comparing the whole .text size often doesn't work, as changes
might be lost in alignment, or once in a while cross an alignment boundary and
become exaggerated. bloat-o-meter works nicely though.


Uh, BTW, with the gcc 8.3.1 and centos 7, goto or continue version has same size
on my side with or w/o DEBUG_LIST. But actually, this clean up patch could
add 10 bytes also with or w/o DEDBUG_LIST.

Maybe related with different compiler?


gcc (SUSE Linux) 10.2.1 20201117 [revision 
98ba03ffe0b9f37b4916ce6238fad754e00d720b]

./scripts/bloat-o-meter vmscan.o.before mm/vmscan.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-1 (-1)
Function old new   delta
isolate_lru_pages   11251124  -1
Total: Before=57283, After=57282, chg -0.00%

Not surprising, as I'd expect the compiler to figure out by itself that
list_move + continue repeats and can be unified. The reason for the goto to stay
would rather be readability (subjective).


Thanks
Alex





Re: [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page

2020-11-26 Thread Vlastimil Babka

On 11/19/20 11:57 AM, Oscar Salvador wrote:

get_hwpoison_page already drains pcplists, previously disabling
them when trying to grab a refcount.
We do not need shake_page to take care of it anymore.

Signed-off-by: Oscar Salvador 
---
  mm/memory-failure.c | 7 ++-
  1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 512613e9a1bd..ad976e1c3178 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -263,8 +263,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, 
int flags)
  }
  
  /*

- * When a unknown page type is encountered drain as many buffers as possible
- * in the hope to turn the page into a LRU or free page, which we can handle.
+ * Unknown page type encountered. Try to check whether it can turn PageLRU by
+ * lru_add_drain_all, or a free page by reclaiming slabs when possible.
   */
  void shake_page(struct page *p, int access)
  {
@@ -275,9 +275,6 @@ void shake_page(struct page *p, int access)
lru_add_drain_all();
if (PageLRU(p))
return;
-   drain_all_pages(page_zone(p));
-   if (PageLRU(p) || is_free_buddy_page(p))
-   return;


I wonder if a page in the lru pagevec can in fact become free after draining 
the lru - in that case we could keep the is_free_buddy_page() check.
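
That is, roughly keep (sketch):

	lru_add_drain_all();
	if (PageLRU(p) || is_free_buddy_page(p))
		return;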



}
  
  	/*






Re: [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount

2020-11-26 Thread Vlastimil Babka

On 11/19/20 11:57 AM, Oscar Salvador wrote:

Currently, we have a sort of retry mechanism to make sure pages in
pcp-lists are spilled to the buddy system, so we can handle those.

We can save us this extra checks with the new disable-pcplist mechanism
that is available with [1].

zone_pcplist_disable makes sure to 1) disable pcplists, so any page
that is freed up from that point onwards will end up in the buddy
system and 2) drain pcplists, so those pages that already in pcplists
are spilled to buddy.

With that, we can make a common entry point for grabbing a refcount
from both soft_offline and memory_failure paths that is guarded by
zone_pcplist_disable/zone_pcplist_enable.

[1] 
https://patchwork.kernel.org/project/linux-mm/cover/2020092812.11329-1-vba...@suse.cz/

Signed-off-by: Oscar Salvador 


Acked-by: Vlastimil Babka 

Note that, as you say, the series should go after [1] above, which means after 
mm-page_alloc-disable-pcplists-during-memory-offline.patch in mmots, but 
currently it's ordered before that point, where zone_pcp_disable() etc. doesn't 
yet exist. I found this out as I reviewed with this commit checked out from -next.



---
  mm/memory-failure.c | 120 +---
  1 file changed, 58 insertions(+), 62 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4abf5d5ffc96..512613e9a1bd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -985,26 +985,67 @@ static int __get_hwpoison_page(struct page *page)
return 0;
  }
  
-static int get_hwpoison_page(struct page *p)

+/*
+ * Safely get reference count of an arbitrary page.
+ *
+ * Returns 0 for a free page, 1 for an in-use page,
+ * -EIO for a page-type we cannot handle and -EBUSY if we raced with an
+ * allocation.
+ * We only incremented refcount in case the page was already in-use and it
+ * is a known type we can handle.
+ */
+static int get_any_page(struct page *p)
  {
-   int ret;
-   bool drained = false;
+   int ret = 0, pass = 0;
  
-retry:

-   ret = __get_hwpoison_page(p);
-   if (!ret && !is_free_buddy_page(p) && !page_count(p) && !drained) {
-   /*
-* The page might be in a pcplist, so try to drain those
-* and see if we are lucky.
-*/
-   drain_all_pages(page_zone(p));
-   drained = true;
-   goto retry;
+try_again:
+   if (!__get_hwpoison_page(p)) {
+   if (page_count(p)) {
+   /* We raced with an allocation, retry. */
+   if (pass++ < 3)
+   goto try_again;
+   ret = -EBUSY;
+   } else if (!PageHuge(p) && !is_free_buddy_page(p)) {
+   /* We raced with put_page, retry. */
+   if (pass++ < 3)
+   goto try_again;
+   ret = -EIO;
+   }
+   } else {
+   if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) {
+   ret = 1;
+   } else {
+   /*
+* A page we cannot handle. Check whether we can turn
+* it into something we can handle.
+*/
+   if (pass++ < 3) {
+   put_page(p);
+   shake_page(p, 1);
+   goto try_again;
+   }
+   put_page(p);
+   ret = -EIO;
+   }
}
  
  	return ret;

  }
  
+static int get_hwpoison_page(struct page *p, enum mf_flags ctxt)

+{
+   int ret;
+
+   zone_pcp_disable(page_zone(p));
+   if (ctxt == MF_SOFT_OFFLINE)
+   ret = get_any_page(p);
+   else
+   ret = __get_hwpoison_page(p);
+   zone_pcp_enable(page_zone(p));
+
+   return ret;
+}
+
  /*
   * Do all that is necessary to remove user space mappings. Unmap
   * the pages and send SIGBUS to the processes if the data was dirty.
@@ -1185,7 +1226,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int 
flags)
  
  	num_poisoned_pages_inc();
  
-	if (!get_hwpoison_page(p)) {

+   if (!get_hwpoison_page(p, 0)) {
/*
 * Check "filter hit" and "race with other subpage."
 */
@@ -1387,7 +1428,7 @@ int memory_failure(unsigned long pfn, int flags)
 * In fact it's dangerous to directly bump up page count from 0,
 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 */
-   if (!get_hwpoison_page(p)) {
+   if (!get_hwpoison_page(p, 0)) {
if (is_free_buddy_page(p)) {
if (take_page_off_buddy(p)) {
page_ref_inc(p);
@@ -1674,7 +1715,7 @@ int unpoison_memory(unsigned long pfn)
 

Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

2020-11-26 Thread Vlastimil Babka

On 11/26/20 8:24 AM, Yu Zhao wrote:

On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:



On 2020/11/26 at 12:52 PM, Yu Zhao wrote:
>>   */
>>  void __pagevec_lru_add(struct pagevec *pvec)
>>  {
>> -  int i;
>> -  struct lruvec *lruvec = NULL;
>> +  int i, nr_lruvec;
>>unsigned long flags = 0;
>> +  struct page *page;
>> +  struct lruvecs lruvecs;
>>  
>> -	for (i = 0; i < pagevec_count(pvec); i++) {

>> -  struct page *page = pvec->pages[i];
>> +  nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
> Simply looping pvec multiple times (15 at most) for different lruvecs
> would be better because 1) it requires no extra data structures and
> therefore has better cache locality (theoretically faster) 2) it only
> loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
> impact on Android and Chrome OS.
> 


With multiple memcgs, it do help a lot, I had gotten 30% grain on readtwice
case. but yes, w/o MEMCG and NUMA, it's good to keep old behavior. So 
would you like has a proposal for this?


Oh, no, I'm not against your idea. I was saying it doesn't seem
necessary to sort -- a nested loop would just do the job given
pagevec is small.


Yeah, that could work. The worst case doesn't look nice (O(n^2)) but it should be 
rather rare to have pages from many different memcgs/nodes?


Maybe with the following change? Avoids going through the nulls if we got lucky 
(or have !MEMCG !NUMA).



diff --git a/mm/swap.c b/mm/swap.c
index cb3794e13b48..1d238edc2907 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, 
struct lruvec *lruvec)
   */
  void __pagevec_lru_add(struct pagevec *pvec)
  {
-   int i;
+   int i, j;


int i, j, n;


struct lruvec *lruvec = NULL;
unsigned long flags = 0;
  


n = pagevec_count(pvec);

for (i = 0; i < pagevec_count(pvec); i++) {


for (i = 0; n; i++) {


struct page *page = pvec->pages[i];
  
+		if (!page)

+   continue;
+
lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-   __pagevec_lru_add_fn(page, lruvec);


n--;


+
+   for (j = i; j < pagevec_count(pvec); j++) {
+   if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
+   page_memcg(pvec->pages[j]) != page_memcg(page))
+   continue;
+
+   __pagevec_lru_add_fn(pvec->pages[j], lruvec);
+   pvec->pages[j] = NULL;


n--;

+   }
}
if (lruvec)
unlock_page_lruvec_irqrestore(lruvec, flags);





Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

2020-11-26 Thread Vlastimil Babka

On 11/26/20 4:12 AM, Alex Shi wrote:



On 2020/11/25 at 11:38 PM, Vlastimil Babka wrote:

On 11/20/20 9:27 AM, Alex Shi wrote:

The current relock logical will change lru_lock when found a new
lruvec, so if 2 memcgs are reading file or alloc page at same time,
they could hold the lru_lock alternately, and wait for each other for
fairness attribute of ticket spin lock.

This patch will sort that all lru_locks and only hold them once in
above scenario. That could reduce fairness waiting for lock reget.
Than, vm-scalability/case-lru-file-readtwice could get ~5% performance
gain on my 2P*20core*HT machine.


Hm, once you sort the pages like this, it's a shame not to splice them instead 
of more list_del() + list_add() iterations. update_lru_size() could be also 
called once?


Yes, looks it's a good idea to use splice instead of list_del/add, but pages
may on different lru list in a same lruvec, and also may come from different
zones. That could involve 5 cycles for different lists, and more for zones...


Hmm, zones wouldn't affect splicing (there's a per-node lru these days), but 
they would affect accounting. And yeah, there are 5 lru lists, and we probably 
need to be under the lru lock to stabilize where a page belongs, so pre-sorting 
without the lock wouldn't be safe? Bummer.



I give up the try.





Re: [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

2020-11-25 Thread Vlastimil Babka

On 11/19/20 11:57 AM, Oscar Salvador wrote:

From: Naoya Horiguchi 

The call to get_user_pages_fast is only to get the pointer to a struct
page of a given address, pinning it is memory-poisoning handler's job,
so drop the refcount grabbed by get_user_pages_fast().

Note that the target page is still pinned after this put_page() because
the current process should have refcount from mapping.


Well, but can't it go away due to reclaim, migration or whatever?


Signed-off-by: Naoya Horiguchi 
Signed-off-by: Oscar Salvador 
---
  mm/madvise.c | 19 +++
  1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index c6b5524add58..7a0f64b93635 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -900,20 +900,23 @@ static int madvise_inject_error(int behavior,
 */
size = page_size(compound_head(page));
  
+		/*

+* The get_user_pages_fast() is just to get the pfn of the
+* given address, and the refcount has nothing to do with
+* what we try to test, so it should be released immediately.
+* This is racy but it's intended because the real hardware
+* errors could happen at any moment and memory error handlers
+* must properly handle the race.


Sure they have to. We might just be unexpectedly messing with another process's 
memory. Or does anything else prevent that?



+*/
+   put_page(page);
+
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address 
%#lx\n",
 pfn, start);
-   ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
+   ret = soft_offline_page(pfn, 0);
} else {
pr_info("Injecting memory failure for pfn %#lx at process 
virtual address %#lx\n",
 pfn, start);
-   /*
-* Drop the page reference taken by 
get_user_pages_fast(). In
-* the absence of MF_COUNT_INCREASED the 
memory_failure()
-* routine is responsible for pinning the page to 
prevent it
-* from being released back to the page allocator.
-*/
-   put_page(page);
ret = memory_failure(pfn, 0);
}
  





Re: [PATCH 1/7] mm,hwpoison: Refactor get_any_page

2020-11-25 Thread Vlastimil Babka

On 11/19/20 11:57 AM, Oscar Salvador wrote:

When we want to grab a refcount via get_any_page, we call
__get_any_page that calls get_hwpoison_page to get the
actual refcount.
get_any_page is only there because we have a sort of retry
mechanism in case the page we met is unknown to us or
if we raced with an allocation.

Also __get_any_page prints some messages about the page type
in case the page was a free page or the page type was unknown,
but if anything, we only need to print a message in case the
pagetype was unknown, as that is reporting an error down the chain.

Let us merge get_any_page and __get_any_page, and let the message
be printed in soft_offline_page.

Signed-off-by: Oscar Salvador 


Acked-by: Vlastimil Babka 



Re: [PATCH 2/7] mm,hwpoison: Drop pfn parameter

2020-11-25 Thread Vlastimil Babka

On 11/19/20 11:57 AM, Oscar Salvador wrote:

pfn parameter is no longer needed, drop it.


This could also have been part of the previous patch.


Signed-off-by: Oscar Salvador 


Acked-by: Vlastimil Babka 


---
  mm/memory-failure.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0d2323ba4b8e..04237fc04a00 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1712,7 +1712,7 @@ EXPORT_SYMBOL(unpoison_memory);
   * We only incremented refcount in case the page was already in-use and it is
   * a known type we can handle.
   */
-static int get_any_page(struct page *p, unsigned long pfn, int flags)
+static int get_any_page(struct page *p, int flags)
  {
int ret = 0, pass = 0;
  
@@ -1926,7 +1926,7 @@ int soft_offline_page(unsigned long pfn, int flags)
  
  retry:

get_online_mems();
-   ret = get_any_page(page, pfn, flags);
+   ret = get_any_page(page, flags);
put_online_mems();
  
  	if (ret > 0) {






Re: [PATCH v5 4/4] mm,hwpoison: drop unneeded pcplist draining

2020-11-25 Thread Vlastimil Babka

On 10/13/20 4:44 PM, Oscar Salvador wrote:

memory_failure and soft_offline paths now drain pcplists by calling
get_hwpoison_page.

memory_failure flags the page as HWPoison before, so that page can no
longer go into a pcplist, and soft_offline_page only flags a page as
HWPoison if 1) we took the page off a buddy freelist 2) the page was
in-use and we migrated it 3) was a clean pagecache.

Because of that, a page can no longer be both poisoned and in a pcplist.

Signed-off-by: Oscar Salvador 


Acked-by: Vlastimil Babka 


---
  mm/madvise.c | 5 -
  1 file changed, 5 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 416a56b8e757..c6b5524add58 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -877,7 +877,6 @@ static long madvise_remove(struct vm_area_struct *vma,
  static int madvise_inject_error(int behavior,
unsigned long start, unsigned long end)
  {
-   struct zone *zone;
unsigned long size;
  
  	if (!capable(CAP_SYS_ADMIN))

@@ -922,10 +921,6 @@ static int madvise_inject_error(int behavior,
return ret;
}
  
-	/* Ensure that all poisoned pages are removed from per-cpu lists */

-   for_each_populated_zone(zone)
-   drain_all_pages(zone);
-
return 0;
  }
  #endif





Re: [PATCH v5 3/4] mm,hwpoison: take free pages off the buddy freelists for hugetlb

2020-11-25 Thread Vlastimil Babka

On 10/13/20 4:44 PM, Oscar Salvador wrote:

Currently, free hugetlb pages get dissolved, but we also need to make sure
to take the poisoned subpage off the buddy freelists, so no one stumbles
upon it (see previous patch for more information).

Signed-off-by: Oscar Salvador 


Acked-by: Vlastimil Babka 


---
  mm/memory-failure.c | 19 +--
  1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 181bed890c16..30aadeca97d2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -809,7 +809,7 @@ static int me_swapcache_clean(struct page *p, unsigned long 
pfn)
   */
  static int me_huge_page(struct page *p, unsigned long pfn)
  {
-   int res = 0;
+   int res;
struct page *hpage = compound_head(p);
struct address_space *mapping;
  
@@ -820,6 +820,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)

if (mapping) {
res = truncate_error_page(hpage, pfn, mapping);
} else {
+   res = MF_FAILED;
unlock_page(hpage);
/*
 * migration entry prevents later access on error anonymous
@@ -828,8 +829,10 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 */
if (PageAnon(hpage))
put_page(hpage);
-   dissolve_free_huge_page(p);
-   res = MF_RECOVERED;
+   if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) {
+   page_ref_inc(p);
+   res = MF_RECOVERED;
+   }
lock_page(hpage);
}
  
@@ -1198,9 +1201,13 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)

}
}
unlock_page(head);
-   dissolve_free_huge_page(p);
-   action_result(pfn, MF_MSG_FREE_HUGE, MF_DELAYED);
-   return 0;
+   res = MF_FAILED;
+   if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) {
+   page_ref_inc(p);
+   res = MF_RECOVERED;
+   }
+   action_result(pfn, MF_MSG_FREE_HUGE, res);
+   return res == MF_RECOVERED ? 0 : -EBUSY;
}
  
  	lock_page(head);






Re: [PATCH v5 2/4] mm,hwpoison: take free pages off the buddy freelists

2020-11-25 Thread Vlastimil Babka

On 10/13/20 4:44 PM, Oscar Salvador wrote:

The crux of the matter is that historically we left poisoned pages in the
buddy system because we have some checks in place when allocating a page
that act as gatekeepers for poisoned pages.  Unfortunately, we do have other
users (e.g: compaction [1]) that scan buddy freelists and try to get a
page from there without checking whether the page is HWPoison.

As I stated already, I think it is fundamentally wrong to keep HWPoison
pages within the buddy systems, checks in place or not.

Let us fix this the same way we did for soft_offline [2], taking the page
off the buddy freelist so it is completely unreachable.

Note that this is fairly simple to trigger, as we only need to poison free
buddy pages (madvise MADV_HWPOISON) and then run some sort of memory stress
system.

Just for a matter of reference, I put a dump_page() in compaction_alloc()
to trigger for HWPoison pages:

kernel: page:12b2982b refcount:1 mapcount:0 mapping: 
index:0x1 pfn:0x1d5db
kernel: flags: 0xfc080(hwpoison)
kernel: raw: 000fc080 ea7573c8 c9857de0 
kernel: raw: 0001  0001 
kernel: page dumped because: compaction_alloc

kernel: CPU: 4 PID: 123 Comm: kcompactd0 Tainted: GE 
5.9.0-rc2-mm1-1-default+ #5
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
kernel: Call Trace:
kernel:  dump_stack+0x6d/0x8b
kernel:  compaction_alloc+0xb2/0xc0
kernel:  migrate_pages+0x2a6/0x12a0
kernel:  ? isolate_freepages+0xc80/0xc80
kernel:  ? __ClearPageMovable+0xb0/0xb0
kernel:  compact_zone+0x5eb/0x11c0
kernel:  ? finish_task_switch+0x74/0x300
kernel:  ? lock_timer_base+0xa8/0x170
kernel:  proactive_compact_node+0x89/0xf0
kernel:  ? kcompactd+0x2d0/0x3a0
kernel:  kcompactd+0x2d0/0x3a0
kernel:  ? finish_wait+0x80/0x80
kernel:  ? kcompactd_do_work+0x350/0x350
kernel:  kthread+0x118/0x130
kernel:  ? kthread_associate_blkcg+0xa0/0xa0
kernel:  ret_from_fork+0x22/0x30

After that, if e.g. a process faults in the page, it will get killed
unexpectedly.
Fix it by containing the page immediately.

Besides that, two more changes can be noticed:

* MF_DELAYED no longer suits as we are fixing the issue by containing
   the page immediately, so it no longer relies on the allocation-time
   checks to stop the HWPoison page from being handed over again unless
   it is unpoisoned, so we fixed the situation.
   Because of that, let us use MF_RECOVERED from now on.

* The second block that handles PageBuddy pages is no longer needed:
   We call shake_page and then check whether the page is Buddy
   because shake_page calls drain_all_pages, which sends pcp-pages back to
   the buddy freelists, so we could have a chance to handle free pages.
   Currently, get_hwpoison_page already calls drain_all_pages, and we call
   get_hwpoison_page right before coming here, so we should be on the safe
   side.

[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
[2] https://patchwork.kernel.org/cover/11792607/

Signed-off-by: Oscar Salvador 
Acked-by: Naoya Horiguchi 


Makes a lot of sense.
Acked-by: Vlastimil Babka 


---
  mm/memory-failure.c | 27 +--
  1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e2f12410c594..181bed890c16 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1341,6 +1341,7 @@ int memory_failure(unsigned long pfn, int flags)
struct dev_pagemap *pgmap;
int res;
unsigned long page_flags;
+   bool retry = true;
  
  	if (!sysctl_memory_failure_recovery)

panic("Memory failure on page %lx", pfn);
@@ -1358,6 +1359,7 @@ int memory_failure(unsigned long pfn, int flags)
return -ENXIO;
}
  
+try_again:

if (PageHuge(p))
return memory_failure_hugetlb(pfn, flags);
if (TestSetPageHWPoison(p)) {
@@ -1382,8 +1384,21 @@ int memory_failure(unsigned long pfn, int flags)
 */
if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
if (is_free_buddy_page(p)) {
-   action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
-   return 0;
+   if (take_page_off_buddy(p)) {
+   page_ref_inc(p);
+   res = MF_RECOVERED;
+   } else {
+   /* We lost the race, try again */
+   if (retry) {
+   ClearPageHWPoison(p);
+   num_poisoned_pages_dec();
+   retry = false;
+   goto try_again;
+   }
+

Re: [PATCH v5 1/4] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page

2020-11-25 Thread Vlastimil Babka

On 10/13/20 4:44 PM, Oscar Salvador wrote:

A page with 0-refcount and !PageBuddy could perfectly be a pcppage.
Currently, we bail out with an error if we encounter such a page, meaning
that we handle pcppages neither from the hard-offline nor from the
soft-offline path.

Fix this by draining pcplists whenever we find this kind of page and
retrying the check.  It might be that pcplists have been spilled into the
buddy allocator and so we can handle it.

Signed-off-by: Oscar Salvador 
Acked-by: Naoya Horiguchi 


Acked-by: Vlastimil Babka 


---
  mm/memory-failure.c | 24 ++--
  1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c0bb186bba62..e2f12410c594 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -946,13 +946,13 @@ static int page_action(struct page_state *ps, struct page 
*p,
  }
  
  /**

- * get_hwpoison_page() - Get refcount for memory error handling:
+ * __get_hwpoison_page() - Get refcount for memory error handling:
   * @page: raw error page (hit by memory error)
   *
   * Return: return 0 if failed to grab the refcount, otherwise true (some
   * non-zero value.)
   */
-static int get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page)
  {
struct page *head = compound_head(page);
  
@@ -982,6 +982,26 @@ static int get_hwpoison_page(struct page *page)

return 0;
  }
  
+static int get_hwpoison_page(struct page *p)

+{
+   int ret;
+   bool drained = false;
+
+retry:
+   ret = __get_hwpoison_page(p);
+   if (!ret && !is_free_buddy_page(p) && !page_count(p) && !drained) {
+   /*
+* The page might be in a pcplist, so try to drain those
+* and see if we are lucky.
+*/
+   drain_all_pages(page_zone(p));
+   drained = true;
+   goto retry;
+   }
+
+   return ret;
+}
+
  /*
   * Do all that is necessary to remove user space mappings. Unmap
   * the pages and send SIGBUS to the processes if the data was dirty.





Re: [PATCH v2] mm/page_alloc: clear all pages in post_alloc_hook() with init_on_alloc=1

2020-11-25 Thread Vlastimil Babka

On 11/20/20 7:04 PM, David Hildenbrand wrote:

commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and
init_on_free=1 boot options") resulted with init_on_alloc=1 in all pages
leaving the buddy via alloc_pages() and friends to be
initialized/cleared/zeroed on allocation.

However, the same logic is currently not applied to
alloc_contig_pages(): allocated pages leaving the buddy aren't cleared
with init_on_alloc=1 and init_on_free=0. Let's also properly clear
pages on that allocation path.

To achieve that, let's move clearing into post_alloc_hook(). This will not
only affect alloc_contig_pages() allocations but also any pages used as
migration target in compaction code via compaction_alloc().

While this sounds sub-optimal, it's the very same handling as when
allocating migration targets via alloc_migration_target() - pages will
get properly cleared with init_on_free=1. In case we ever want to optimize
migration in that regard, we should tackle all such migration users - if we
believe migration code can be fully trusted.

With this change, we will see double clearing of pages in some
cases. One example are gigantic pages (either allocated via CMA, or
allocated dynamically via alloc_contig_pages()) - which is the right
thing to do (and to be optimized outside of the buddy in the callers) as
discussed in:
   https://lkml.kernel.org/r/20201019182853.7467-1-gpicc...@canonical.com

This change implies that with init_on_alloc=1
- All CMA allocations will be cleared
- Gigantic pages allocated via alloc_contig_pages() will be cleared
- virtio-mem memory to be unplugged will be cleared. While this is
   suboptimal, it's similar to memory balloon drivers handling, where
   all pages to be inflated will get cleared as well.
- Pages isolated for compaction will be cleared

Cc: Andrew Morton 
Cc: Alexander Potapenko 
Cc: Michal Hocko 
Cc: Mike Kravetz 
Cc: Vlastimil Babka 
Cc: Mike Rapoport 
Cc: Oscar Salvador 
Cc: Kees Cook 
Cc: Michael Ellerman 
Signed-off-by: David Hildenbrand 


Acked-by: Vlastimil Babka 


---

This is the follow-up of:
   "[PATCH v1] mm/page_alloc: clear pages in alloc_contig_pages() with
   init_on_alloc=1 or __GFP_ZERO"

v1 -> v2:
- Let's clear anything that leaves the buddy, also affecting compaction.
- Don't implement __GFP_ZERO support for now

---
  mm/page_alloc.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..108b81c0dfa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2275,6 +2275,9 @@ inline void post_alloc_hook(struct page *page, unsigned 
int order,
kasan_alloc_pages(page, order);
kernel_poison_pages(page, 1 << order, 1);
set_page_owner(page, order, gfp_flags);
+
+   if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags))
+   kernel_init_free_pages(page, 1 << order);
  }
  
  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,

@@ -2282,9 +2285,6 @@ static void prep_new_page(struct page *page, unsigned int 
order, gfp_t gfp_flags
  {
post_alloc_hook(page, order, gfp_flags);
  
-	if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags))

-   kernel_init_free_pages(page, 1 << order);
-
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
  





Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

2020-11-25 Thread Vlastimil Babka

On 11/20/20 9:27 AM, Alex Shi wrote:

The current relock logic changes the lru_lock whenever a new lruvec is
found, so if 2 memcgs are reading a file or allocating pages at the same
time, they could hold the lru_lock alternately and wait for each other
due to the fairness attribute of the ticket spin lock.

This patch sorts the pages by lruvec and holds each lru_lock only once
in the above scenario, which reduces the waiting caused by repeatedly
re-taking the locks. With that, vm-scalability/case-lru-file-readtwice
gets a ~5% performance gain on my 2P*20core*HT machine.


Hm, once you sort the pages like this, it's a shame not to splice them 
instead of more list_del() + list_add() iterations. update_lru_size() 
could be also called once?



Suggested-by: Konstantin Khlebnikov 
Signed-off-by: Alex Shi 
Cc: Konstantin Khlebnikov 
Cc: Andrew Morton 
Cc: Hugh Dickins 
Cc: Yu Zhao 
Cc: Michal Hocko 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
  mm/swap.c | 57 +++
  1 file changed, 49 insertions(+), 8 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 490553f3f9ef..c787b38bf9c0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1009,24 +1009,65 @@ static void __pagevec_lru_add_fn(struct page *page, 
struct lruvec *lruvec)
trace_mm_lru_insertion(page, lru);
  }
  
+struct lruvecs {

+   struct list_head lists[PAGEVEC_SIZE];
+   struct lruvec *vecs[PAGEVEC_SIZE];
+};
+
+/* Sort pvec pages on their lruvec */
+int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
+{
+   int i, j, nr_lruvec;
+   struct page *page;
+   struct lruvec *lruvec = NULL;
+
+   lruvecs->vecs[0] = NULL;
+   for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
+   page = pvec->pages[i];
+   lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+   /* Try to find a same lruvec */
+   for (j = 0; j <= nr_lruvec; j++)
+   if (lruvec == lruvecs->vecs[j])
+   break;
+
+   /* A new lruvec */
+   if (j > nr_lruvec) {
+   INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
+   lruvecs->vecs[nr_lruvec] = lruvec;
+   j = nr_lruvec++;
+   lruvecs->vecs[nr_lruvec] = 0;
+   }
+
+   list_add_tail(&page->lru, &lruvecs->lists[j]);
+   }
+
+   return nr_lruvec;
+}
+
  /*
   * Add the passed pages to the LRU, then drop the caller's refcount
   * on them.  Reinitialises the caller's pagevec.
   */
  void __pagevec_lru_add(struct pagevec *pvec)
  {
-   int i;
-   struct lruvec *lruvec = NULL;
+   int i, nr_lruvec;
unsigned long flags = 0;
+   struct page *page;
+   struct lruvecs lruvecs;
  
-	for (i = 0; i < pagevec_count(pvec); i++) {

-   struct page *page = pvec->pages[i];
+   nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
  
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);

-   __pagevec_lru_add_fn(page, lruvec);
+   for (i = 0; i < nr_lruvec; i++) {
+   spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
+   while (!list_empty(&lruvecs.lists[i])) {
+   page = lru_to_page(&lruvecs.lists[i]);
+   list_del(&page->lru);
+   __pagevec_lru_add_fn(page, lruvecs.vecs[i]);
+   }
+   spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
}
-   if (lruvec)
-   unlock_page_lruvec_irqrestore(lruvec, flags);
+
release_pages(pvec->pages, pvec->nr);
pagevec_reinit(pvec);
  }





Re: [PATCH] [RFC] init/main: fix broken buffer_init when DEFERRED_STRUCT_PAGE_INIT set

2020-11-25 Thread Vlastimil Babka

On 11/23/20 12:05 PM, Lin Feng wrote:

In the booting phase, if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set,
we have the following call chain:

start_kernel
...
   mm_init
 mem_init
  memblock_free_all
reset_all_zones_managed_pages
free_low_memory_core_early
...
   buffer_init
 nr_free_buffer_pages
   zone->managed_pages
...
   rest_init
 kernel_init
   kernel_init_freeable
 page_alloc_init_late
   kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", 
nid);
   wait_for_completion(&pgdat_init_all_done_comp);
   ...
   files_maxfiles_init

It's clear that buffer_init depends on zone->managed_pages, but that counter
is reset in reset_all_zones_managed_pages and pages are only re-added to
  zone->managed_pages afterwards; when buffer_init runs, this process is half
  done and most of the pages are only added once deferred_init_memmap is done.
On large memory systems the count of nr_free_buffer_pages thus drifts too
much, and also drifts from kernel to kernel on the same hardware.

The fix is simple: delay the buffer_init run until deferred_init_memmap is all done.


Hmm nobody should need bh_cachep to allocate buffer heads until then, right.


But as corrected by this patch, max_buffer_heads becomes very large,
roughly 4 times totalram_pages, per the formula:
max_buffer_heads = nrpages * (10%) * (PAGE_SIZE / sizeof(struct buffer_head));

Say in a 64GB memory box we have 16777216 pages, then max_buffer_heads
turns out to be roughly 67,108,864.
In common cases, shouldn't a buffer_head be mapped to one page/block (4KB)?
If so, max_buffer_heads never exceeds totalram_pages.
IMO this likely makes the buffer_heads_over_limit bool value always false,
and thus makes the 'if (buffer_heads_over_limit)' tests in vmscan unnecessary.
Correct me if that's not true.


Maybe we could compile that out with CONFIG_HIGHMEM then?
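
Roughly something like this (completely untested sketch, and the helper
name is made up), so that !CONFIG_HIGHMEM configs can compile the checks
down to nothing:

#ifdef CONFIG_HIGHMEM
/* without highmem the limit is arguably never reached, see above */
static inline bool reclaim_buffer_heads_over_limit(void)
{
	return buffer_heads_over_limit;
}
#else
static inline bool reclaim_buffer_heads_over_limit(void)
{
	return false;
}
#endif

and the 'if (buffer_heads_over_limit)' tests in mm/vmscan.c would use the
helper instead of reading the variable directly.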


So this patch will change the original behavior related to
buffer_heads_over_limit in vmscan since we used a half done value
of zone->managed_pages before, or should we use a smaller factor(<10%) in
previous formula.

Signed-off-by: Lin Feng 


Acked-by: Vlastimil Babka 


---
  init/main.c | 2 --
  mm/page_alloc.c | 3 +++
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 20baced721ad..a3f7c3416286 100644
--- a/init/main.c
+++ b/init/main.c
@@ -58,7 +58,6 @@
  #include 
  #include 
  #include 
-#include 
  #include 
  #include 
  #include 
@@ -1034,7 +1033,6 @@ asmlinkage __visible void __init __no_sanitize_address 
start_kernel(void)
fork_init();
proc_caches_init();
uts_ns_init();
-   buffer_init();
key_init();
security_init();
dbg_late_init();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..2931d706fb52 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -70,6 +70,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -2103,6 +2104,8 @@ void __init page_alloc_init_late(void)
files_maxfiles_init();
  #endif
  
+	buffer_init();

+
/* Discard memblock private memory */
memblock_discard();
  





Re: [PATCH 2/2] mm: Move free_unref_page to mm/internal.h

2020-11-25 Thread Vlastimil Babka

On 11/25/20 4:46 AM, Matthew Wilcox (Oracle) wrote:

Code outside mm/ should not be calling free_unref_page().  Also
move free_unref_page_list().


Good idea.


Signed-off-by: Matthew Wilcox (Oracle) 


Acked-by: Vlastimil Babka 

There seems to be some effort to remove "extern" from function 
declarations from headers. Do we want to do that, at once, or piecemeal? 
If the latter, this is a chance for these functions at least :)



---
  include/linux/gfp.h | 2 --
  mm/internal.h   | 3 +++
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c603237e006c..6e479e9c48ce 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -580,8 +580,6 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t 
size, gfp_t gfp_mask);
  
  extern void __free_pages(struct page *page, unsigned int order);

  extern void free_pages(unsigned long addr, unsigned int order);
-extern void free_unref_page(struct page *page);
-extern void free_unref_page_list(struct list_head *list);
  
  struct page_frag_cache;

  extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/internal.h b/mm/internal.h
index 75ae680d0a2c..5864815947fe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -201,6 +201,9 @@ extern void post_alloc_hook(struct page *page, unsigned int 
order,
gfp_t gfp_flags);
  extern int user_min_free_kbytes;
  
+extern void free_unref_page(struct page *page);

+extern void free_unref_page_list(struct list_head *list);
+
  extern void zone_pcp_update(struct zone *zone);
  extern void zone_pcp_reset(struct zone *zone);
  extern void zone_pcp_disable(struct zone *zone);





Re: [PATCH 1/2] sparc: Fix handling of page table constructor failure

2020-11-25 Thread Vlastimil Babka

On 11/25/20 4:46 AM, Matthew Wilcox (Oracle) wrote:

The page has just been allocated, so its refcount is 1.  free_unref_page()
is for use on pages which have a zero refcount.  Use __free_page()
like the other implementations of pte_alloc_one().

Fixes: 1ae9ae5f7df7 ("sparc: handle pgtable_page_ctor() fail")
Signed-off-by: Matthew Wilcox (Oracle) 


Acked-by: Vlastimil Babka 


---
  arch/sparc/mm/init_64.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 96edf64d4fb3..182bb7bdaa0a 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2894,7 +2894,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
if (!page)
return NULL;
if (!pgtable_pte_page_ctor(page)) {
-   free_unref_page(page);
+   __free_page(page);
return NULL;
}
return (pte_t *) page_address(page);





Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-25 Thread Vlastimil Babka

On 11/25/20 6:34 AM, Andrea Arcangeli wrote:

Hello,

On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:

On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
> A corollary issue was fixed in
> 39639000-39814fff : Unknown E820 type
> 
> pfn 0x7a200 -> 0x7a20 min_pfn hit non-RAM:
> 
> 7a17b000-7a216fff : Unknown E820 type


It would be nice to also provide a /proc/zoneinfo and how exactly the 
"zone_spans_pfn" was violated. I assume we end up below zone's 
start_pfn, but is it true?


Agreed, I was about to grab that info along with all page struct
around the pfn 0x7a200 and phys address 0x7a216fff.

# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bff : System RAM

DMA  zone_start_pfn 1zone_end_pfn() 4096 contiguous 1
DMA32zone_start_pfn 4096 zone_end_pfn() 1048576  contiguous 0
Normal   zone_start_pfn 1048576  zone_end_pfn() 4715392  contiguous 1
Movable  zone_start_pfn 0zone_end_pfn() 0contiguous 0


So the above means that around the "Unknown E820 type" we have:

pfn 499712 - start of pageblock in ZONE_DMA32
pfn 500091 - start of the "Unknown E820 type" range
pfn 500224 - start of another pageblock
pfn 500246 - end of "Unknown E820 type"

So this is indeed not a zone boundary issue, but basically a hole not 
aligned to pageblock boundary and really unexpected.
We have CONFIG_HOLES_IN_ZONE (that x86 doesn't set) for architectures 
that do this, and even that config only affects pfn_valid_within(). But 
here pfn_valid() is true, but the zone/node linkage is unexpected.



However, the real bug seems to be that reserved pages have a zero zone_id in
the page->flags when they should have the real zone id/nid. The patch I
sent earlier to validate 'highest' would only be needed to deal with
pfn_valid.

Something must have changed more recently than v5.1 that caused the
zoneid of reserved pages to be wrong, a possible candidate for the
real would be this change below:

+   __init_single_page(pfn_to_page(pfn), pfn, 0, 0);

Even if it may not be it, in light of how the reserved page
zoneid/nid initialization went wrong, the above line looks too flakey
to stay.

It'd be preferable if the pfn_valid fails and the
pfn_to_section_nr(pfn) returns an invalid section for the intermediate
step. Even better memset 0xff over the whole page struct until the
second stage comes around.

Whenever pfn_valid is true, it's better that the zoneid/nid is correct
all times, otherwise if the second stage fails we end up in a bug with
weird side effects.


Yeah I guess it would be simpler if zoneid/nid was correct for 
pfn_valid() pfns within a zone's range, even if they are reserved due 
to not being really usable memory.


I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the 
chosen solution is to make this to a real hole, the hole should be 
extended to MAX_ORDER_NR_PAGES aligned boundaries.


In any case, compaction code can't fix this with better range checks.


Maybe it's not the above that left a zero zoneid though, I haven't
tried to bisect it yet to look how the page->flags looked like on a
older kernel that didn't seem to reproduce this crash, I'm just
guessing.

Thanks,
Andrea





Re: [PATCH v4] mm: Optional full ASLR for mmap() and mremap()

2020-11-24 Thread Vlastimil Babka

Please CC linux-api on future versions.

On 10/26/20 5:05 PM, Topi Miettinen wrote:

Writing a new value of 3 to /proc/sys/kernel/randomize_va_space
enables full randomization of memory mappings created with mmap(NULL,
...). With 2, the base of the VMA used for such mappings is random,
but the mappings are created in predictable places within the VMA and
in sequential order. With 3, new VMAs are created to fully randomize
the mappings. Also mremap(..., MREMAP_MAYMOVE) will move the mappings
even if not necessary.

The method is to randomize the new address without considering
VMAs. If the address fails checks because of overlap with the stack
area (or in case of mremap(), overlap with the old mapping), the
operation is retried a few times before falling back to old method.

On 32 bit systems this may cause problems due to increased VM
fragmentation if the address space gets crowded.

On all systems, it will reduce performance and increase memory
usage due to less efficient use of page tables and inability to
merge adjacent VMAs with compatible attributes.

In this example with value of 2, dynamic loader, libc, anonymous
memory reserved with mmap() and locale-archive are located close to
each other:

$ cat /proc/self/maps (only first line for each object shown for brevity)
58c1175b1000-58c1175b3000 r--p  fe:0c 1868624
/usr/bin/cat
79752ec17000-79752f179000 r--p  fe:0c 2473999
/usr/lib/locale/locale-archive
79752f179000-79752f279000 rw-p  00:00 0
79752f279000-79752f29e000 r--p  fe:0c 2402415
/usr/lib/x86_64-linux-gnu/libc-2.31.so
79752f43a000-79752f44 rw-p  00:00 0
79752f46f000-79752f47 r--p  fe:0c 2400484
/usr/lib/x86_64-linux-gnu/ld-2.31.so
79752f49b000-79752f49c000 rw-p  00:00 0
7ffdcad9e000-7ffdcadbf000 rw-p  00:00 0  [stack]
7ffdcadd2000-7ffdcadd6000 r--p  00:00 0  [vvar]
7ffdcadd6000-7ffdcadd8000 r-xp  00:00 0  [vdso]

With 3, they are located at unrelated addresses:
$ echo 3 > /proc/sys/kernel/randomize_va_space
$ cat /proc/self/maps (only first line for each object shown for brevity)
1206a8fa000-1206a8fb000 r--p  fe:0c 2400484  
/usr/lib/x86_64-linux-gnu/ld-2.31.so
1206a926000-1206a927000 rw-p  00:00 0
19174173000-19174175000 rw-p  00:00 0
ac82f419000-ac82f519000 rw-p  00:00 0
afa66a42000-afa66fa4000 r--p  fe:0c 2473999  
/usr/lib/locale/locale-archive
d8656ba9000-d8656bce000 r--p  fe:0c 2402415  
/usr/lib/x86_64-linux-gnu/libc-2.31.so
d8656d6a000-d8656d6e000 rw-p  00:00 0
5df90b712000-5df90b714000 r--p  fe:0c 1868624
/usr/bin/cat
7ffe1be4c000-7ffe1be6d000 rw-p  00:00 0  [stack]
7ffe1bf07000-7ffe1bf0b000 r--p  00:00 0  [vvar]
7ffe1bf0b000-7ffe1bf0d000 r-xp  00:00 0  [vdso]

CC: Andrew Morton 
CC: Jann Horn 
CC: Kees Cook 
CC: Matthew Wilcox 
CC: Mike Rapoport 
Signed-off-by: Topi Miettinen 
---
v2: also randomize mremap(..., MREMAP_MAYMOVE)
v3: avoid stack area and retry in case of bad random address (Jann
Horn), improve description in kernel.rst (Matthew Wilcox)
v4: use /proc/$pid/maps in the example (Mike Rapaport), CCs (Andrew
Morton), only check randomize_va_space == 3
---
  Documentation/admin-guide/hw-vuln/spectre.rst |  6 ++--
  Documentation/admin-guide/sysctl/kernel.rst   | 15 ++
  init/Kconfig  |  2 +-
  mm/internal.h |  8 +
  mm/mmap.c | 30 +--
  mm/mremap.c   | 27 +
  6 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst 
b/Documentation/admin-guide/hw-vuln/spectre.rst
index e05e581af5cf..9ea250522077 100644
--- a/Documentation/admin-guide/hw-vuln/spectre.rst
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -254,7 +254,7 @@ Spectre variant 2
 left by the previous process will also be cleared.
  
 User programs should use address space randomization to make attacks

-   more difficult (Set /proc/sys/kernel/randomize_va_space = 1 or 2).
+   more difficult (Set /proc/sys/kernel/randomize_va_space = 1, 2 or 3).
  
  3. A virtualized guest attacking the host

  ^
@@ -499,8 +499,8 @@ Spectre variant 2
 more overhead and run slower.
  
 User programs should use address space randomization

-   (/proc/sys/kernel/randomize_va_space = 1 or 2) to make attacks more
-   difficult.
+   (/proc/sys/kernel/randomize_va_space = 1, 2 or 3) to make attacks
+   more difficult.
  
  3. VM mitigation

  
diff --git a/Documentation/adm

Re: [PATCH] mm: memory_hotplug: put migration failure information under DEBUG_VM

2020-11-24 Thread Vlastimil Babka

On 11/23/20 4:10 PM, Charan Teja Kalla wrote:


Thanks Michal!
On 11/23/2020 7:43 PM, Michal Hocko wrote:

On Mon 23-11-20 19:33:16, Charan Teja Reddy wrote:

When the pages are failed to get isolate or migrate, the page owner
information along with page info is dumped. If there are continuous
failures in migration(say page is pinned) or isolation, the log buffer
is simply getting flooded with the page owner information. As most of
the times page info is sufficient to know the causes for failures of
migration or isolation, place the page owner information under DEBUG_VM.


I do not see why this path is any different from others that call
dump_page. Page owner can add a very valuable information to debug
the underlying reasons for failures here. It is an opt-in debugging
feature which needs to be enabled explicitly. So I would argue users
are ready to accept a lot of data in the kernel log.


Just thinking how frequently failures can happen in those paths. In the
memory hotplug path, we can flood the page owner logs just by having one
page pinned. Say if it is an anonymous page, the page owner information


So you say it's flooded when page_owner info is included, but not 
flooded when only the rest of __dump_page() is printed? (which is also 
not just one or two lines). That has to be a very specific rate of failures.
Anyway I don't like the solution with arbitrary config option. To 
prevent flooding we generally have ratelimiting, how about that?
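
I mean something like this (untested sketch; the state name and message
are just for illustration) at the spots in the hotplug/migration path
that currently call dump_page() on every failure:

static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL,
			      DEFAULT_RATELIMIT_BURST);

	/* at the failure site, instead of an unconditional dump_page() */
	if (ret && __ratelimit(&migrate_rs)) {
		pr_warn("migrating pfn %lx failed ret:%d\n",
			page_to_pfn(page), ret);
		dump_page(page, "migration failure");
	}

That way the page owner info can stay, and the log can't be flooded by a
single stuck page.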


Also agreed with Michal that page_owner is explicitly enabled debugging 
option and if you use it in production, that's rather surprising to me, 
and possibly more rare than DEBUG_VM, which IIRC Fedora kernels use.


Re: [PATCH] mm/compaction: make defer_compaction and compaction_deferred static

2020-11-24 Thread Vlastimil Babka

On 11/23/20 6:08 PM, Hui Su wrote:

defer_compaction(), compaction_deferred() and
compaction_restarting() in mm/compaction.c aren't
used in other files, so make them static, and
remove the declarations from the header file.

Take the chance to fix a typo.

Signed-off-by: Hui Su 


Acked-by: Vlastimil Babka 


---
  include/linux/compaction.h | 12 
  mm/compaction.c|  8 
  2 files changed, 4 insertions(+), 16 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 1de5a1151ee7..ed4070ed41ef 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -98,11 +98,8 @@ extern void reset_isolation_suitable(pg_data_t *pgdat);
  extern enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags, int highest_zoneidx);
  
-extern void defer_compaction(struct zone *zone, int order);

-extern bool compaction_deferred(struct zone *zone, int order);
  extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
-extern bool compaction_restarting(struct zone *zone, int order);
  
  /* Compaction has made some progress and retrying makes sense */

  static inline bool compaction_made_progress(enum compact_result result)
@@ -194,15 +191,6 @@ static inline enum compact_result 
compaction_suitable(struct zone *zone, int ord
return COMPACT_SKIPPED;
  }
  
-static inline void defer_compaction(struct zone *zone, int order)

-{
-}
-
-static inline bool compaction_deferred(struct zone *zone, int order)
-{
-   return true;
-}
-
  static inline bool compaction_made_progress(enum compact_result result)
  {
return false;
diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..60135a820b55 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -157,7 +157,7 @@ EXPORT_SYMBOL(__ClearPageMovable);
   * allocation success. 1 << compact_defer_shift, compactions are skipped up
   * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
   */
-void defer_compaction(struct zone *zone, int order)
+static void defer_compaction(struct zone *zone, int order)
  {
zone->compact_considered = 0;
zone->compact_defer_shift++;
@@ -172,7 +172,7 @@ void defer_compaction(struct zone *zone, int order)
  }
  
  /* Returns true if compaction should be skipped this time */

-bool compaction_deferred(struct zone *zone, int order)
+static bool compaction_deferred(struct zone *zone, int order)
  {
unsigned long defer_limit = 1UL << zone->compact_defer_shift;
  
@@ -209,7 +209,7 @@ void compaction_defer_reset(struct zone *zone, int order,

  }
  
  /* Returns true if restarting compaction after many failures */

-bool compaction_restarting(struct zone *zone, int order)
+static bool compaction_restarting(struct zone *zone, int order)
  {
if (order < zone->compact_order_failed)
return false;
@@ -237,7 +237,7 @@ static void reset_cached_positions(struct zone *zone)
  }
  
  /*

- * Compound pages of >= pageblock_order should consistenly be skipped until
+ * Compound pages of >= pageblock_order should consistently be skipped until
   * released. It is always pointless to compact pages of such order (if they 
are
   * migratable), and the pageblocks they occupy cannot contain any free pages.
   */





Re: [PATCH next] mm/vmscan: __isolate_lru_page_prepare clean up

2020-11-24 Thread Vlastimil Babka

On 11/22/20 3:00 PM, Alex Shi wrote:

Thanks a lot for all comments, I picked all up and here is the v3:

 From 167131dd106a96fd08af725df850e0da6ec899af Mon Sep 17 00:00:00 2001
From: Alex Shi 
Date: Fri, 20 Nov 2020 14:49:16 +0800
Subject: [PATCH v3 next] mm/vmscan: __isolate_lru_page_prepare clean up

The function just returns 2 results, so using a 'switch' to deal with its
result is unnecessary; simplify it to a bool function, as Vlastimil
suggested.

Also remove 'goto' by reusing list_move(), and take Matthew Wilcox's
suggestion to update comments in function.


I wouldn't mind if the goto stayed, but it's not repeating that much 
without it (list_move() + continue, 3 times) so...


Acked-by: Vlastimil Babka 


Signed-off-by: Alex Shi 
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Hugh Dickins 
Cc: Yu Zhao 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
---
  include/linux/swap.h |  2 +-
  mm/compaction.c  |  2 +-
  mm/vmscan.c  | 68 
  3 files changed, 33 insertions(+), 39 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 596bc2f4d9b0..5bba15ac5a2e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -356,7 +356,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct 
page *page,
  extern unsigned long zone_reclaimable_pages(struct zone *zone);
  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
-extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
+extern bool __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
  unsigned long nr_pages,
  gfp_t gfp_mask,
diff --git a/mm/compaction.c b/mm/compaction.c
index b68931854253..8d71ffebe6cb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -988,7 +988,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (unlikely(!get_page_unless_zero(page)))
goto isolate_fail;
  
-		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)

+   if (!__isolate_lru_page_prepare(page, isolate_mode))
goto isolate_fail_put;
  
  		/* Try isolate the page */

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6f94e55c3fe..4d2703c43310 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1536,19 +1536,17 @@ unsigned int reclaim_clean_pages_from_list(struct zone 
*zone,
   * page:  page to consider
   * mode:  one of the LRU isolation modes defined above
   *
- * returns 0 on success, -ve errno on failure.
+ * returns true on success, false on failure.
   */
-int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
+bool __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
  {
-   int ret = -EBUSY;
-
/* Only take pages on the LRU. */
if (!PageLRU(page))
-   return ret;
+   return false;
  
  	/* Compaction should not handle unevictable pages but CMA can do so */

if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
-   return ret;
+   return false;
  
  	/*

 * To minimise LRU disruption, the caller can indicate that it only
@@ -1561,7 +1559,7 @@ int __isolate_lru_page_prepare(struct page *page, 
isolate_mode_t mode)
if (mode & ISOLATE_ASYNC_MIGRATE) {
/* All the caller can do on PageWriteback is block */
if (PageWriteback(page))
-   return ret;
+   return false;
  
  		if (PageDirty(page)) {

struct address_space *mapping;
@@ -1577,20 +1575,20 @@ int __isolate_lru_page_prepare(struct page *page, 
isolate_mode_t mode)
 * from the page cache.
 */
if (!trylock_page(page))
-   return ret;
+   return false;
  
  			mapping = page_mapping(page);

migrate_dirty = !mapping || mapping->a_ops->migratepage;
unlock_page(page);
if (!migrate_dirty)
-   return ret;
+   return false;
}
}
  
  	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))

-   return ret;
+   return false;
  
-	return 0;

+   return true;
  }
  
  /*

@@ -1674,35 +1672,31 @@ static unsigned long isolate_lru_pages(unsigned long 
nr_to_scan,
 * only when the page is being freed somewhere else.
 */
scan += nr

Re: Pinning ZONE_MOVABLE pages

2020-11-23 Thread Vlastimil Babka

+CC John Hubbard

On 11/20/20 9:27 PM, Pavel Tatashin wrote:

Recently, I encountered a hang that is happening during memory hot
remove operation. It turns out that the hang is caused by pinned user
pages in ZONE_MOVABLE.

Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
this is not the case if user applications, e.g. through dpdk
libraries, have pinned them via vfio dma map. Kernel keeps trying to
hot-remove them, but refcnt never gets to zero, so we are looping
until the hardware watchdog kicks in.

We cannot do dma unmaps before hot-remove, because hot-remove is a
slow operation, and we have thousands of network flows handled by
dpdk that we just cannot suspend for the duration of hot-remove
operation.

The solution is for dpdk to allocate pages from a zone below
ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
There is no user interface that we have that allows applications to
select what zone the memory should come from.

I've spoken with Stephen Hemminger, and he said that DPDK is moving in
the direction of using transparent huge pages instead of HugeTLBs,
which means that we need to allow at least anonymous, and anonymous
transparent huge pages to come from non-movable zones on demand.

Here is what I am proposing:
1. Add a new flag that is passed through pin_user_pages_* down to
fault handlers, and allow the fault handler to allocate from a
non-movable zone.

Sample function stacks through which this info needs to be passed are these:

pin_user_pages_remote(gup_flags)
  __get_user_pages_remote(gup_flags)
   __gup_longterm_locked(gup_flags)
__get_user_pages_locked(gup_flags)
 __get_user_pages(gup_flags)
  faultin_page(gup_flags)
   Convert gup_flags into fault_flags
   handle_mm_fault(fault_flags)

 From handle_mm_fault(), the stack diverges into various faults,
examples include:

Transparent Huge Page
handle_mm_fault(fault_flags)
__handle_mm_fault(fault_flags)
Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
create_huge_pmd(vmf);
do_huge_pmd_anonymous_page(vmf);
mm_get_huge_zero_page(vma->vm_mm); -> flag is lost, so flag from
vmf.gfp_mask should be passed as well.

There are several other similar paths in a transparent huge page, also
there is a named path where allocation is based on filesystems, and
the flag should be honored there as well, but it does not have to be
added at the same time.

Regular Pages
handle_mm_fault(fault_flags)
__handle_mm_fault(fault_flags)
Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
handle_pte_fault(vmf)
do_anonymous_page(vmf);
page = alloc_zeroed_user_highpage_movable(vma, vmf->address); ->
change this call according to gfp_mask.

The above only takes care of the case where the user application faults on the
page during pinning time, but there are also cases where pages already
exist.


Makes sense, as this means no userspace change.


2. Add an internal move_pages_zone() similar to move_pages() syscall
but instead of migrating to a different NUMA node, migrate pages from
ZONE_MOVABLE to another zone.
Call move_pages_zone() on demand prior to pinning pages from
vfio_pin_map_dma() for instance.


As others already said, migrating away before the longterm pin should be 
the solution. IIRC it was one of the goals of the long term pinning API 
proposed a long time ago by Peter Zijlstra I think? The implementation 
that was merged relatively recently doesn't do that (yet?) for all 
movable pages, just CMA, but it could.
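
The check itself could be something along these lines (untested, and the
helper name is made up), i.e. extend what the longterm gup path already
does for CMA pages to ZONE_MOVABLE:

/* should this page be migrated away before a longterm pin? */
static bool page_needs_migration_before_longterm_pin(struct page *page)
{
	if (is_migrate_cma_page(page))
		return true;
	return zone_idx(page_zone(page)) == ZONE_MOVABLE;
}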



3. Perhaps, it also makes sense to add madvise() flag, to allocate
pages from non-movable zone. When a user application knows that it
will do DMA mapping, and pin pages for a long time, the memory that it
allocates should never be migrated or hot-removed, so make sure that
it comes from the appropriate place.
The benefit of adding madvise() flag is that we won't have to deal
with slow page migration during pin time, but the disadvantage is that
we would need to change the user interface.


It's best if we avoid involving userspace until it's shown that it's 
insufficient.



Before I start working on the above approaches, I would like to get an
opinion from the community on an appropriate path forward for this
problem: whether what I described sounds reasonable, or whether there are
other ideas on how to address the problem that I am seeing.

Thank you,
Pasha





Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-11-23 Thread Vlastimil Babka

On 11/21/20 8:45 PM, Andrea Arcangeli wrote:

A corollary issue was fixed in
e577c8b64d58fe307ea4d5149d31615df2d90861. A second issue remained in
v5.7:

https://lkml.kernel.org/r/8c537eb7-85ee-4dcf-943e-3cc0ed0df...@lca.pw

==
page:eaaa refcount:1 mapcount:0 mapping:2243743b index:0x0
flags: 0x1fffe01000(reserved)
==

73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly the
second issue, but I still reproduced it twice with v5.9 on two
different systems:

==
page:62b3e92f refcount:1 mapcount:0 mapping: index:0x0 
pfn:0x39800
flags: 0x1000(reserved)
==
page:2a7114f8 refcount:1 mapcount:0 mapping: index:0x0 
pfn:0x7a200
flags: 0x1fff1000(reserved)
==

I actually never reproduced it until v5.9, but it's still the same bug
as it was reported first for v5.7.

See the page is "reserved" in all 3 cases. In the last two crashes
with the pfn:

pfn 0x39800 -> 0x3980 min_pfn hit non-RAM:

39639000-39814fff : Unknown E820 type

pfn 0x7a200 -> 0x7a20 min_pfn hit non-RAM:

7a17b000-7a216fff : Unknown E820 type


It would be nice to also provide a /proc/zoneinfo and how exactly the 
"zone_spans_pfn" was violated. I assume we end up below zone's 
start_pfn, but is it true?



This actually seems to be a false positive bugcheck: the page structures are
valid and the zones are correct, it's just non-RAM, but setting
pageblock skip should do no harm. However it's possible to solve the
crash without lifting the bugcheck, by enforcing the invariant that
the free_pfn cursor doesn't point to reserved pages (which would be
otherwise implicitly achieved through the PageBuddy check, except in
the new fast_isolate_around() path).

Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration 
target")
Signed-off-by: Andrea Arcangeli 
---
  mm/compaction.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..d17e69549d34 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1433,7 +1433,10 @@ fast_isolate_freepages(struct compact_control *cc)
page = pageblock_pfn_to_page(min_pfn,
pageblock_end_pfn(min_pfn),
cc->zone);
-   cc->free_pfn = min_pfn;
+   if (likely(!PageReserved(page)))


The PageReserved check seems like a rather awkward solution to me. Wouldn't
it be more obvious to make sure we don't end up below the zone's start_pfn
(if my assumption is correct) in the first place?


When I check the code:

unsigned long distance;
distance = (cc->free_pfn - cc->migrate_pfn);
low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));

I think what can happen is that cc->free_pfn <= cc->migrate_pfn after 
the very last isolate_migratepages(). Then compact_finished() detects 
that in compact_zone(), but only after migrate_pages() and thus 
fast_isolate_freepages() is called.


That would mean distance can be negative, or rather a large unsigned 
number and low_pfn and min_pfn end up away from the zone?


Or maybe the above doesn't happen, but cc->free_pfn gets so close to 
start of the zone, that the calculations above make min_pfn go below 
start_pfn?


In any case I would rather make sure we stay within the expected zone 
boundaries, than play tricks with PageReserved. Mel?
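
I mean something like this (untested), instead of special-casing
PageReserved afterwards:

	distance = (cc->free_pfn - cc->migrate_pfn);
	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));

	/* never look below the zone, even if the distance above wrapped */
	low_pfn = clamp(low_pfn, cc->zone->zone_start_pfn, cc->free_pfn);
	min_pfn = clamp(min_pfn, cc->zone->zone_start_pfn, cc->free_pfn);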



+   cc->free_pfn = min_pfn;
+   else
+   page = NULL;
}
}
}





Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2020-11-18 Thread Vlastimil Babka

On 11/18/20 9:27 AM, Bharata B Rao wrote:

The page order of the slab that gets chosen for a given slab
cache depends on the number of objects that can be fit in the
slab while meeting other requirements. We start with a value
of minimum objects based on nr_cpu_ids that is driven by
possible number of CPUs and hence could be higher than the
actual number of CPUs present in the system. This leads to
calculate_order() choosing a page order that is on the higher
side leading to increased slab memory consumption on systems
that have bigger page sizes.

Hence rely on the number of online CPUs when determining the
minimum objects, thereby increasing the chances of choosing
a lower, more conservative page order for the slab.

Signed-off-by: Bharata B Rao 


Acked-by: Vlastimil Babka 

Ideally, we would react to hotplug events and update existing caches 
accordingly. But for that, recalculation of order for existing caches 
would have to be made safe, while not affecting hot paths. We have 
removed the sysfs interface with 32a6f409b693 ("mm, slub: remove runtime 
allocation order changes") as it didn't seem easy and worth the trouble.


In case somebody wants to start with a large order right from the boot 
because they know they will hotplug lots of cpus later, they can use 
slub_min_objects= boot param to override this heuristic. So in case this 
change regresses somebody's performance, there's a way around it and 
thus the risk is low IMHO.



---
This is a generic change and I am unsure how it would affect
other archs, but as a start, here are some numbers from
PowerPC pseries KVM guest with and without this patch:

This table shows how this change has affected some of the slab
caches.
======================================================================
                            Current                  Patched
Cache               objperslab pagesperslab  objperslab pagesperslab
======================================================================
TCPv6                   53         2              26         1
net_namespace           53         4              26         2
dtl                     32         2              16         1
names_cache             32         2              16         1
task_struct             53         8              13         2
thread_stack            32         8               8         2
pgtable-2^11            16         8               8         4
pgtable-2^8             32         2              16         1
kmalloc-32k             16         8               8         4
kmalloc-16k             32         8               8         2
kmalloc-8k              32         4               8         1
kmalloc-4k              32         2              16         1
======================================================================

Slab memory (kB) consumption comparison
==
Current Patched
==
After-boot  205760  156096
During-hackbench629145  506752 (Avg of 5 runs)
After-hackbench 474176  331840 (after drop_caches)
==

Hackbench Time (Avg of 5 runs)
(hackbench -s 1024 -l 200 -g 200 -f 25 -P)
==
Current Patched
==
10.990  11.010
==

Measuring the effect due to CPU hotplug

Since the patch doesn't consider all the possible CPUs for page
order calcluation, let's see how affects the case when CPUs are
hotplugged. Here I compare a system that is booted with 64CPUs
with a system that is booted with 16CPUs but hotplugged with
48CPUs after boot. These numbers are with the patch applied.

Slab memory (kB) consumption comparison
===
64bootCPUs  16bootCPUs+48HotPluggedCPUs
===
After-boot  390272  159744
After-hotplug   -   251328
During-hackbench1001267 941926 (Avg of 5 runs)
After-hackbench 913600  827200 (after drop_caches)
===

Hackbench Time (Avg of 5 runs)
(hackbench -s 1024 -l 200 -g 200 -f 25 -P)
===
64bootCPUs  16bootCPUs+48HotPluggedCPUs
===
12.554  12.589
===
  mm/slub.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 34dcc09e2ec9..8342c0a167b2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
 */
min_objects = slub_min_objects;
if (!min_objects)
-   min_objects = 4 * (fls(nr_cpu_ids) + 1);
+   min_objects = 4 * (fls(num_online_cpus()) + 1);
max_objects =

Re: [PATCH v3 3/5] kernel/power: allow hibernation with page_poison sanity checking

2020-11-13 Thread Vlastimil Babka

On 11/13/20 1:10 PM, David Hildenbrand wrote:

@@ -1186,12 +1194,12 @@ void clear_free_pages(void)
if (WARN_ON(!(free_pages_map)))
return;
  
-	if (IS_ENABLED(CONFIG_PAGE_POISONING_ZERO) || want_init_on_free()) {

+   if (page_poisoning_enabled() || want_init_on_free()) {


Any reason why not to use the static version here?


It's a single check per resume, so not worth live-patching this place.


Reviewed-by: David Hildenbrand 






[PATCH v3 3/5] kernel/power: allow hibernation with page_poison sanity checking

2020-11-13 Thread Vlastimil Babka
Page poisoning used to be incompatible with hibernation, as the state of
poisoned pages was lost after resume, thus enabling CONFIG_HIBERNATION forces
CONFIG_PAGE_POISONING_NO_SANITY. For the same reason, the poisoning with zeroes
variant CONFIG_PAGE_POISONING_ZERO used to disable hibernation. The latter
restriction was removed by commit 1ad1410f632d ("PM / Hibernate: allow
hibernation with PAGE_POISONING_ZERO") and similarly for init_on_free by commit
18451f9f9e58 ("PM: hibernate: fix crashes with init_on_free=1") by making sure
free pages are cleared after resume.

We can use the same mechanism to instead poison free pages with PAGE_POISON
after resume. This covers both zero and 0xAA patterns. Thus we can remove the
Kconfig restriction that disables page poison sanity checking when hibernation
is enabled.

Signed-off-by: Vlastimil Babka 
Acked-by: Rafael J. Wysocki  # for hibernation
---
 include/linux/mm.h   |  1 +
 kernel/power/hibernate.c |  2 +-
 kernel/power/power.h |  2 +-
 kernel/power/snapshot.c  | 14 +++---
 mm/Kconfig.debug |  1 -
 5 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5ab5358be2b3..b1585ff2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2891,6 +2891,7 @@ static inline void kernel_unpoison_pages(struct page 
*page, int numpages)
 #else
 static inline bool page_poisoning_enabled(void) { return false; }
 static inline bool page_poisoning_enabled_static(void) { return false; }
+static inline void __kernel_poison_pages(struct page *page, int nunmpages) { }
 static inline void kernel_poison_pages(struct page *page, int numpages) { }
 static inline void kernel_unpoison_pages(struct page *page, int numpages) { }
 #endif
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index d5d48404db37..559acef3fddb 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -329,7 +329,7 @@ static int create_image(int platform_mode)
 
if (!in_suspend) {
events_check_enabled = false;
-   clear_free_pages();
+   clear_or_poison_free_pages();
}
 
platform_leave(platform_mode);
diff --git a/kernel/power/power.h b/kernel/power/power.h
index 24f12d534515..778bf431ec02 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -106,7 +106,7 @@ extern int create_basic_memory_bitmaps(void);
 extern void free_basic_memory_bitmaps(void);
 extern int hibernate_preallocate_memory(void);
 
-extern void clear_free_pages(void);
+extern void clear_or_poison_free_pages(void);
 
 /**
  * Auxiliary structure used for reading the snapshot image data and
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index af534982062d..64b7aab9aee4 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1178,7 +1178,15 @@ void free_basic_memory_bitmaps(void)
pr_debug("Basic memory bitmaps freed\n");
 }
 
-void clear_free_pages(void)
+static void clear_or_poison_free_page(struct page *page)
+{
+   if (page_poisoning_enabled_static())
+   __kernel_poison_pages(page, 1);
+   else if (want_init_on_free())
+   clear_highpage(page);
+}
+
+void clear_or_poison_free_pages(void)
 {
struct memory_bitmap *bm = free_pages_map;
unsigned long pfn;
@@ -1186,12 +1194,12 @@ void clear_free_pages(void)
if (WARN_ON(!(free_pages_map)))
return;
 
-   if (IS_ENABLED(CONFIG_PAGE_POISONING_ZERO) || want_init_on_free()) {
+   if (page_poisoning_enabled() || want_init_on_free()) {
memory_bm_position_reset(bm);
pfn = memory_bm_next_pfn(bm);
while (pfn != BM_END_OF_MAP) {
if (pfn_valid(pfn))
-   clear_highpage(pfn_to_page(pfn));
+   clear_or_poison_free_page(pfn_to_page(pfn));
 
pfn = memory_bm_next_pfn(bm);
}
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 864f129f1937..c57786ad5be9 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -64,7 +64,6 @@ config PAGE_OWNER
 
 config PAGE_POISONING
bool "Poison pages after freeing"
-   select PAGE_POISONING_NO_SANITY if HIBERNATION
help
  Fill the pages with poison patterns after free_pages() and verify
  the patterns before alloc_pages. The filling of the memory helps
-- 
2.29.2



[PATCH v3 4/5] mm, page_poison: remove CONFIG_PAGE_POISONING_NO_SANITY

2020-11-13 Thread Vlastimil Babka
CONFIG_PAGE_POISONING_NO_SANITY skips the check on page alloc whether the
poison pattern was corrupted, suggesting a use-after-free. The motivation to
introduce it in commit 8823b1dbc05f ("mm/page_poison.c: enable PAGE_POISONING
as a separate option") was to simply sanitize freed pages, optimally together
with CONFIG_PAGE_POISONING_ZERO.

These days we have an init_on_free=1 boot option, which makes this use case of
page poisoning redundant. For sanitizing, writing zeroes is sufficient, there
is pretty much no benefit from writing the 0xAA poison pattern to freed pages,
without checking it back on alloc. Thus, remove this option and suggest
init_on_free instead in the main config's help.

Signed-off-by: Vlastimil Babka 
Acked-by: David Hildenbrand 
---
 drivers/virtio/virtio_balloon.c |  4 +---
 mm/Kconfig.debug| 15 ---
 mm/page_poison.c|  3 ---
 3 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index e53faed6ba93..8985fc2cea86 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1114,9 +1114,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
 * page reporting as it could potentially change the contents
 * of our free pages.
 */
-   if (!want_init_on_free() &&
-   (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
-!page_poisoning_enabled_static()))
+   if (!want_init_on_free() && !page_poisoning_enabled_static())
__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index c57786ad5be9..14e29fe5bfa6 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -74,18 +74,11 @@ config PAGE_POISONING
  Note that "poison" here is not the same thing as the "HWPoison"
  for CONFIG_MEMORY_FAILURE. This is software poisoning only.
 
- If unsure, say N
+ If you are only interested in sanitization of freed pages without
+ checking the poison pattern on alloc, you can boot the kernel with
+ "init_on_free=1" instead of enabling this.
 
-config PAGE_POISONING_NO_SANITY
-   depends on PAGE_POISONING
-   bool "Only poison, don't sanity check"
-   help
-  Skip the sanity checking on alloc, only fill the pages with
-  poison on free. This reduces some of the overhead of the
-  poisoning feature.
-
-  If you are only interested in sanitization, say Y. Otherwise
-  say N.
+ If unsure, say N
 
 config PAGE_POISONING_ZERO
bool "Use zero for poisoning instead of debugging value"
diff --git a/mm/page_poison.c b/mm/page_poison.c
index 0d899a01d107..65cdf844c8ad 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -51,9 +51,6 @@ static void check_poison_mem(unsigned char *mem, size_t bytes)
unsigned char *start;
unsigned char *end;
 
-   if (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY))
-   return;
-
start = memchr_inv(mem, PAGE_POISON, bytes);
if (!start)
return;
-- 
2.29.2



[PATCH v3 1/5] mm, page_alloc: do not rely on the order of page_poison and init_on_alloc/free parameters

2020-11-13 Thread Vlastimil Babka
Enabling page_poison=1 together with init_on_alloc=1 or init_on_free=1 produces
a warning in dmesg that page_poison takes precedence. However, as these
warnings are printed in early_param handlers for init_on_alloc/free, they are
not printed if page_poison is enabled later on the command line (handlers are
called in the order of their parameters), or when init_on_alloc/free is always
enabled by the respective config option - before the page_poison early param
handler is called, it is not considered to be enabled. This is inconsistent.

We can remove the dependency on order by making the init_on_* parameters only
set a boolean variable, and postponing the evaluation after all early params
have been processed. Introduce a new init_mem_debugging_and_hardening()
function for that, and move the related debug_pagealloc processing there as
well.

As a result, init_mem_debugging_and_hardening() always knows accurately whether
the init_on_* and/or page_poison options were enabled. Thus we can also optimize
want_init_on_alloc() and want_init_on_free(). We don't need to check
page_poisoning_enabled() there; we can instead not enable the init_on_* static
keys at all if page poisoning is enabled. This results in simpler and more
efficient code.
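
Roughly, the resulting pattern is the following (a simplified sketch based on
the description and the diff below, not the literal patched code; error
handling and the analogous init_on_free/debug_pagealloc parts are omitted):

static bool _init_on_alloc_enabled_early __read_mostly
	= IS_ENABLED(CONFIG_INIT_ON_ALLOC_DEFAULT_ON);

/* the early_param handler only records what was requested ... */
static int __init early_init_on_alloc(char *buf)
{
	return kstrtobool(buf, &_init_on_alloc_enabled_early);
}
early_param("init_on_alloc", early_init_on_alloc);

/*
 * ... and the decision is made once, after all early params have been parsed,
 * so it no longer depends on the order of the parameters on the command line.
 */
void init_mem_debugging_and_hardening(void)
{
	if (_init_on_alloc_enabled_early) {
		if (page_poisoning_enabled())
			pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_alloc\n");
		else
			static_branch_enable(&init_on_alloc);
	}
}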

Signed-off-by: Vlastimil Babka 
Reviewed-by: David Hildenbrand 
Reviewed-by: Mike Rapoport 
---
 include/linux/mm.h | 20 ++-
 init/main.c|  2 +-
 mm/page_alloc.c| 88 ++
 3 files changed, 46 insertions(+), 64 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 24f47e140a4c..82ab5c894d94 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2860,6 +2860,7 @@ extern int apply_to_existing_page_range(struct mm_struct *mm,
   unsigned long address, unsigned long size,
   pte_fn_t fn, void *data);
 
+extern void init_mem_debugging_and_hardening(void);
 #ifdef CONFIG_PAGE_POISONING
 extern bool page_poisoning_enabled(void);
 extern void kernel_poison_pages(struct page *page, int numpages, int enable);
@@ -2869,35 +2870,20 @@ static inline void kernel_poison_pages(struct page *page, int numpages,
int enable) { }
 #endif
 
-#ifdef CONFIG_INIT_ON_ALLOC_DEFAULT_ON
-DECLARE_STATIC_KEY_TRUE(init_on_alloc);
-#else
 DECLARE_STATIC_KEY_FALSE(init_on_alloc);
-#endif
 static inline bool want_init_on_alloc(gfp_t flags)
 {
-   if (static_branch_unlikely(&init_on_alloc) &&
-   !page_poisoning_enabled())
+   if (static_branch_unlikely(&init_on_alloc))
return true;
return flags & __GFP_ZERO;
 }
 
-#ifdef CONFIG_INIT_ON_FREE_DEFAULT_ON
-DECLARE_STATIC_KEY_TRUE(init_on_free);
-#else
 DECLARE_STATIC_KEY_FALSE(init_on_free);
-#endif
 static inline bool want_init_on_free(void)
 {
-   return static_branch_unlikely(&init_on_free) &&
-  !page_poisoning_enabled();
+   return static_branch_unlikely(&init_on_free);
 }
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
-extern void init_debug_pagealloc(void);
-#else
-static inline void init_debug_pagealloc(void) {}
-#endif
 extern bool _debug_pagealloc_enabled_early;
 DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);
 
diff --git a/init/main.c b/init/main.c
index 7262d0a8861b..524089384155 100644
--- a/init/main.c
+++ b/init/main.c
@@ -816,7 +816,7 @@ static void __init mm_init(void)
 * bigger than MAX_ORDER unless SPARSEMEM.
 */
page_ext_init_flatmem();
-   init_debug_pagealloc();
+   init_mem_debugging_and_hardening();
kfence_alloc_pool();
report_meminit();
mem_init();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63d8d8b72c10..567060c2ad83 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -165,53 +165,26 @@ unsigned long totalcma_pages __read_mostly;
 
 int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
-#ifdef CONFIG_INIT_ON_ALLOC_DEFAULT_ON
-DEFINE_STATIC_KEY_TRUE(init_on_alloc);
-#else
 DEFINE_STATIC_KEY_FALSE(init_on_alloc);
-#endif
 EXPORT_SYMBOL(init_on_alloc);
 
-#ifdef CONFIG_INIT_ON_FREE_DEFAULT_ON
-DEFINE_STATIC_KEY_TRUE(init_on_free);
-#else
 DEFINE_STATIC_KEY_FALSE(init_on_free);
-#endif
 EXPORT_SYMBOL(init_on_free);
 
+static bool _init_on_alloc_enabled_early __read_mostly
+   = IS_ENABLED(CONFIG_INIT_ON_ALLOC_DEFAULT_ON);
 static int __init early_init_on_alloc(char *buf)
 {
-   int ret;
-   bool bool_result;
 
-   ret = kstrtobool(buf, &bool_result);
-   if (ret)
-   return ret;
-   if (bool_result && page_poisoning_enabled())
-   pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_alloc\n");
-   if (bool_result)
-   static_branch_enable(&init_on_alloc);
-   else
-   static_branch_disab

[PATCH v3 5/5] mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO

2020-11-13 Thread Vlastimil Babka
CONFIG_PAGE_POISONING_ZERO uses the zero pattern instead of 0xAA. It was
introduced by commit 1414c7f4f7d7 ("mm/page_poisoning.c: allow for zero
poisoning"), noting that using zeroes retains the benefit of sanitizing content
of freed pages, with the benefit of not having to zero them again on alloc, and
the downside of making some forms of corruption (stray writes of NULLs) harder
to detect than with the 0xAA pattern. Together with
CONFIG_PAGE_POISONING_NO_SANITY it made possible to sanitize the contents on
free without checking it back on alloc.

These days we have the init_on_free=1 option to achieve sanitization with
zeroes and to save clearing on alloc (and without checking on alloc). Arguably,
if someone does choose to check the poison for corruption on alloc, the savings
of not clearing the page are secondary, and it makes sense to always use the
0xAA poison pattern. Thus, remove the CONFIG_PAGE_POISONING_ZERO option as
redundant.
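
A tiny userspace illustration of the detection argument (hypothetical helper
names, not kernel code): with a zero poison pattern, a stray write of zeroes
after free is indistinguishable from intact poison, while 0xAA catches it:

#include <stdio.h>
#include <string.h>

/* return the offset of the first byte that doesn't match the poison, or -1 */
static int find_corruption(const unsigned char *obj, size_t size,
			   unsigned char poison)
{
	for (size_t i = 0; i < size; i++) {
		if (obj[i] != poison)
			return (int)i;
	}
	return -1;
}

int main(void)
{
	unsigned char page[32];

	memset(page, 0xaa, sizeof(page));
	memset(page + 8, 0, sizeof(void *));	/* stray write of zeroes (e.g. a NULL store) */
	printf("0xaa poison: corruption at %d\n",
	       find_corruption(page, sizeof(page), 0xaa));	/* reports 8 */

	memset(page, 0x00, sizeof(page));
	memset(page + 8, 0, sizeof(void *));	/* the same stray write */
	printf("0x00 poison: corruption at %d\n",
	       find_corruption(page, sizeof(page), 0x00));	/* reports -1 */

	return 0;
}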

Signed-off-by: Vlastimil Babka 
Acked-by: David Hildenbrand 
---
 include/linux/poison.h   |  4 
 mm/Kconfig.debug | 12 
 mm/page_alloc.c  |  8 +---
 tools/include/linux/poison.h |  6 +-
 4 files changed, 2 insertions(+), 28 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index dc8ae5d8db03..aff1c9250c82 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -27,11 +27,7 @@
 #define TIMER_ENTRY_STATIC ((void *) 0x300 + POISON_POINTER_DELTA)
 
 /** mm/page_poison.c **/
-#ifdef CONFIG_PAGE_POISONING_ZERO
-#define PAGE_POISON 0x00
-#else
 #define PAGE_POISON 0xaa
-#endif
 
 /** mm/page_alloc.c /
 
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 14e29fe5bfa6..1e73717802f8 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -80,18 +80,6 @@ config PAGE_POISONING
 
  If unsure, say N
 
-config PAGE_POISONING_ZERO
-   bool "Use zero for poisoning instead of debugging value"
-   depends on PAGE_POISONING
-   help
-  Instead of using the existing poison value, fill the pages with
-  zeros. This makes it harder to detect when errors are occurring
-  due to sanitization but the zeroing at free means that it is
-  no longer necessary to write zeros when GFP_ZERO is used on
-  allocation.
-
-  If unsure, say N
-
 config DEBUG_PAGE_REF
bool "Enable tracepoint to track down page reference manipulation"
depends on DEBUG_KERNEL
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd966829bed3..e80d5ce1b292 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2226,12 +2226,6 @@ static inline int check_new_page(struct page *page)
return 1;
 }
 
-static inline bool free_pages_prezeroed(void)
-{
-   return (IS_ENABLED(CONFIG_PAGE_POISONING_ZERO) &&
-   page_poisoning_enabled_static()) || want_init_on_free();
-}
-
 #ifdef CONFIG_DEBUG_VM
 /*
  * With DEBUG_VM enabled, order-0 pages are checked for expected state when
@@ -2300,7 +2294,7 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
 {
post_alloc_hook(page, order, gfp_flags);
 
-   if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags))
+   if (!want_init_on_free() && want_init_on_alloc(gfp_flags))
kernel_init_free_pages(page, 1 << order);
 
if (order && (gfp_flags & __GFP_COMP))
diff --git a/tools/include/linux/poison.h b/tools/include/linux/poison.h
index d29725769107..2e6338ac5eed 100644
--- a/tools/include/linux/poison.h
+++ b/tools/include/linux/poison.h
@@ -35,12 +35,8 @@
  */
 #define TIMER_ENTRY_STATIC ((void *) 0x300 + POISON_POINTER_DELTA)
 
-/** mm/debug-pagealloc.c **/
-#ifdef CONFIG_PAGE_POISONING_ZERO
-#define PAGE_POISON 0x00
-#else
+/** mm/page_poison.c **/
 #define PAGE_POISON 0xaa
-#endif
 
 /** mm/page_alloc.c /
 
-- 
2.29.2



[PATCH v3 2/5] mm, page_poison: use static key more efficiently

2020-11-13 Thread Vlastimil Babka
Commit 11c9c7edae06 ("mm/page_poison.c: replace bool variable with static key")
changed page_poisoning_enabled() to a static key check. However, the function
is not inlined, so each check still involves a function call with overhead not
eliminated when page poisoning is disabled.

Analogously to how debug_pagealloc is handled, this patch converts
page_poisoning_enabled() back to a boolean check and introduces
page_poisoning_enabled_static() for fast paths. Both functions are inlined.

The function kernel_poison_pages() is also called unconditionally and does
the static key check inside. Remove it from there and put it in the callers. Also
split it into two functions, kernel_poison_pages() and kernel_unpoison_pages(),
instead of using the confusing bool parameter.

Also optimize the check that enables page poisoning instead of debug_pagealloc
for architectures without proper debug_pagealloc support. Move the check to
init_mem_debugging_and_hardening() to enable a single static key instead of
having two static branches in page_poisoning_enabled_static().

Signed-off-by: Vlastimil Babka 
---
 drivers/virtio/virtio_balloon.c |  2 +-
 include/linux/mm.h  | 33 +---
 mm/page_alloc.c | 18 +--
 mm/page_poison.c| 53 +
 4 files changed, 52 insertions(+), 54 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 481611c09dae..e53faed6ba93 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1116,7 +1116,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
 */
if (!want_init_on_free() &&
(IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
-!page_poisoning_enabled()))
+!page_poisoning_enabled_static()))
__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 82ab5c894d94..5ab5358be2b3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2862,12 +2862,37 @@ extern int apply_to_existing_page_range(struct mm_struct *mm,
 
 extern void init_mem_debugging_and_hardening(void);
 #ifdef CONFIG_PAGE_POISONING
-extern bool page_poisoning_enabled(void);
-extern void kernel_poison_pages(struct page *page, int numpages, int enable);
+extern void __kernel_poison_pages(struct page *page, int numpages);
+extern void __kernel_unpoison_pages(struct page *page, int numpages);
+extern bool _page_poisoning_enabled_early;
+DECLARE_STATIC_KEY_FALSE(_page_poisoning_enabled);
+static inline bool page_poisoning_enabled(void)
+{
+   return _page_poisoning_enabled_early;
+}
+/*
+ * For use in fast paths after init_mem_debugging() has run, or when a
+ * false negative result is not harmful when called too early.
+ */
+static inline bool page_poisoning_enabled_static(void)
+{
+   return static_branch_unlikely(&_page_poisoning_enabled);
+}
+static inline void kernel_poison_pages(struct page *page, int numpages)
+{
+   if (page_poisoning_enabled_static())
+   __kernel_poison_pages(page, numpages);
+}
+static inline void kernel_unpoison_pages(struct page *page, int numpages)
+{
+   if (page_poisoning_enabled_static())
+   __kernel_unpoison_pages(page, numpages);
+}
 #else
 static inline bool page_poisoning_enabled(void) { return false; }
-static inline void kernel_poison_pages(struct page *page, int numpages,
-   int enable) { }
+static inline bool page_poisoning_enabled_static(void) { return false; }
+static inline void kernel_poison_pages(struct page *page, int numpages) { }
+static inline void kernel_unpoison_pages(struct page *page, int numpages) { }
 #endif
 
 DECLARE_STATIC_KEY_FALSE(init_on_alloc);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 567060c2ad83..cd966829bed3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -775,6 +775,17 @@ void init_mem_debugging_and_hardening(void)
static_branch_enable(&init_on_free);
}
 
+#ifdef CONFIG_PAGE_POISONING
+   /*
+* Page poisoning is debug page alloc for some arches. If
+* either of those options are enabled, enable poisoning.
+*/
+   if (page_poisoning_enabled() ||
+(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
+ debug_pagealloc_enabled()))
+   static_branch_enable(&_page_poisoning_enabled);
+#endif
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
if (!debug_pagealloc_enabled())
return;
@@ -1262,7 +1273,8 @@ static __always_inline bool free_pages_prepare(struct page *page,
if (want_init_on_free())
kernel_init_free_pages(page, 1 << order);
 
-   kernel_poison_pages(page, 1 <&

[PATCH v3 0/5] cleanup page poisoning

2020-11-13 Thread Vlastimil Babka
Changes since v2 [2]

- rebase to next-20201113
- apply review feedback from David
- acks from David and Rafael (thanks!)

The first version was called "optimize handling of memory debugging parameters"
[1], changes are:

- apply review feedback
- drop former Patch 3
- add new patches 3-5, change name and cover letter of series

I have identified a number of issues and opportunities for cleanup with
CONFIG_PAGE_POISON and friends:
- interaction with init_on_alloc and init_on_free parameters depends on
  the order of parameters (Patch 1)
- the boot time enabling uses a static key, but inefficiently (Patch 2)
- sanity checking is incompatible with hibernation (Patch 3)
- CONFIG_PAGE_POISONING_NO_SANITY can be removed now that we have init_on_free
  (Patch 4)
- CONFIG_PAGE_POISONING_ZERO can be most likely removed now that we have
  init_on_free (Patch 5)

[1] https://lore.kernel.org/r/20201026173358.14704-1-vba...@suse.cz
[2] https://lore.kernel.org/linux-mm/20201103152237.9853-1-vba...@suse.cz/

Vlastimil Babka (5):
  mm, page_alloc: do not rely on the order of page_poison and
init_on_alloc/free parameters
  mm, page_poison: use static key more efficiently
  kernel/power: allow hibernation with page_poison sanity checking
  mm, page_poison: remove CONFIG_PAGE_POISONING_NO_SANITY
  mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO

 drivers/virtio/virtio_balloon.c |   4 +-
 include/linux/mm.h  |  54 +--
 include/linux/poison.h  |   4 --
 init/main.c |   2 +-
 kernel/power/hibernate.c|   2 +-
 kernel/power/power.h|   2 +-
 kernel/power/snapshot.c |  14 +++-
 mm/Kconfig.debug|  28 ++--
 mm/page_alloc.c | 112 
 mm/page_poison.c|  56 ++--
 tools/include/linux/poison.h|   6 +-
 11 files changed, 117 insertions(+), 167 deletions(-)

-- 
2.29.2



Re: [PATCH v3 7/7] mm, page_alloc: disable pcplists during memory offline

2020-11-12 Thread Vlastimil Babka

On 11/11/20 6:58 PM, David Hildenbrand wrote:

On 11.11.20 10:28, Vlastimil Babka wrote:

-   /*
-* per-cpu pages are drained after start_isolate_page_range, but
-* if there are still pages that are not free, make sure that we
-* drain again, because when we isolated range we might have
-* raced with another thread that was adding pages to pcp list.
-*
-* Forward progress should be still guaranteed because
-* pages on the pcp list can only belong to MOVABLE_ZONE
-* because has_unmovable_pages explicitly checks for
-* PageBuddy on freed pages on other zones.
-*/
ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
-   if (ret)
-   drain_all_pages(zone);
+


Why two empty lines before the "} while (ret);" ? (unless I'm confused
while looking at this diff)



No, there's just a single empty line after "ret = test_pages_isolated...".
Before, there was none, which looked OK with the extra indentation of the
now-removed "drain_all_pages(zone);".


+void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
+   unsigned long batch)
+{
+   struct per_cpu_pageset *p;
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   p = per_cpu_ptr(zone->pageset, cpu);
+   pageset_update(&p->pcp, high, batch);
+   }
+}
+
  /*
   * Calculate and set new high and batch values for all per-cpu pagesets of a
   * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
@@ -6315,8 +6338,6 @@ static void pageset_init(struct per_cpu_pageset *p)
  static void zone_set_pageset_high_and_batch(struct zone *zone)
  {
unsigned long new_high, new_batch;
-   struct per_cpu_pageset *p;
-   int cpu;
  
  	if (percpu_pagelist_fraction) {

new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
@@ -6336,10 +6357,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
zone->pageset_high = new_high;
zone->pageset_batch = new_batch;
  
-	for_each_possible_cpu(cpu) {

-   p = per_cpu_ptr(zone->pageset, cpu);
-   pageset_update(&p->pcp, new_high, new_batch);
-   }
+   __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
  }


These two hunks look like an unrelated cleanup, or am I missing something?


It's extracting part of the functionality into __zone_set_pageset_high_and_batch(),
which is now also called from zone_pcp_enable() and zone_pcp_disable() - to only
adjust the per-cpu zone->pageset values without the usual calculation.
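
For illustration, the new helpers are meant to be used roughly like this (a
sketch based on the description in this thread, not the literal patch; the
batch value of 1 and the serialization against other high/batch updates are
assumptions/omissions here):

/* force pcplist bypass: any subsequent free goes directly to the buddy */
void zone_pcp_disable(struct zone *zone)
{
	__zone_set_pageset_high_and_batch(zone, 0, 1);
	__drain_all_pages(zone, true);	/* force_all_cpus */
}

/* restore the previously calculated values */
void zone_pcp_enable(struct zone *zone)
{
	__zone_set_pageset_high_and_batch(zone, zone->pageset_high,
					  zone->pageset_batch);
}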


Thanks for looking into this!


Thanks for the review. Here's an updated version that adds a more detailed
comment about the force_all_cpus parameter to the __drain_all_pages() header;
hopefully that clarifies your concerns.

8<
From 5077d00c195fa66e613bdb27a195bbcc1e892515 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka 
Date: Fri, 18 Sep 2020 18:35:32 +0200
Subject: [PATCH] mm, page_alloc: disable pcplists during memory offline

Memory offlining relies on page isolation to guarantee forward
progress because pages cannot be reused while they are isolated. But
page isolation itself doesn't prevent races while freed pages are
stored on pcp lists and thus can be reused.  This can be worked around by
repeated draining of pcplists, as done by commit 968318261221
("mm/memory_hotplug: drain per-cpu pages again during memory offline").

David and Michal would prefer that this race was closed in a way that callers
of page isolation who need stronger guarantees don't need to repeatedly drain.
David suggested disabling pcplists usage completely during page isolation,
instead of repeatedly draining them.

To achieve this without adding special cases in alloc/free fastpath, we can use
the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
will be immediately flushed.

The race can thus be closed by setting pcp->high to 0 and draining pcplists
once, before calling start_isolate_page_range(). The draining will serialize
after processes that already disabled interrupts and read the old value of
pcp->high in free_unref_page_commit(), while processes that have not yet disabled
interrupts will observe pcp->high == 0 when they are rescheduled, and skip
pcplists. This guarantees no stray pages on pcplists in zones where isolation
happens.

This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
page isolation users can call before start_isolate_page_range() and after
unisolating (or offlining) the isolated pages.
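
So the expected calling convention for an isolation user like memory offlining
is, schematically (the wrapper below is hypothetical and only illustrates the
ordering; the real offlining code has more steps):

static int example_offline_range(unsigned long start_pfn,
				 unsigned long end_pfn, struct zone *zone)
{
	int ret;

	/* pcp->high = 0 plus a single drain: no stray pages on pcplists */
	zone_pcp_disable(zone);

	ret = start_isolate_page_range(start_pfn, end_pfn,
				       MIGRATE_MOVABLE, MEMORY_OFFLINE);
	if (ret) {
		zone_pcp_enable(zone);
		return ret;
	}

	/* ... migrate away and offline the isolated pages ... */

	/* restore the normal high/batch values */
	zone_pcp_enable(zone);
	return 0;
}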

Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
not empty. The check can however race with a free to pcplist that has not yet

Re: [PATCH v2 3/5] kernel/power: allow hibernation with page_poison sanity checking

2020-11-12 Thread Vlastimil Babka

On 11/11/20 4:42 PM, David Hildenbrand wrote:
...

@@ -1152,12 +1152,18 @@ void clear_free_pages(void)
if (WARN_ON(!(free_pages_map)))
return;
  
-	if (IS_ENABLED(CONFIG_PAGE_POISONING_ZERO) || want_init_on_free()) {

+   if (page_poisoning_enabled() || want_init_on_free()) {
memory_bm_position_reset(bm);
pfn = memory_bm_next_pfn(bm);
while (pfn != BM_END_OF_MAP) {
-   if (pfn_valid(pfn))
-   clear_highpage(pfn_to_page(pfn));
+   if (pfn_valid(pfn)) {
+   struct page *page = pfn_to_page(pfn);


^ empty line missing. And at least I prefer to declare all variables in
the function header.

I'd even suggest to move this into a separate function like

clear_or_poison_free_page(struct page *page)




Ok, fixup below.

8<
From cae1e8ccfa57c28ed1b2f5f8a47319b86cbdcfbf Mon Sep 17 00:00:00 2001
From: Vlastimil Babka 
Date: Thu, 12 Nov 2020 15:33:07 +0100
Subject: [PATCH] kernel/power: allow hibernation with page_poison sanity
 checking-fix

Adapt to __kernel_unpoison_pages fixup. Split out clear_or_poison_free_page()
per David Hildenbrand.

Signed-off-by: Vlastimil Babka 
---
 include/linux/mm.h  |  1 +
 kernel/power/snapshot.c | 18 ++
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 861b9392b5dc..d4cfb06a611e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2896,6 +2896,7 @@ static inline void kernel_unpoison_pages(struct page *page, int numpages)
 #else
 static inline bool page_poisoning_enabled(void) { return false; }
 static inline bool page_poisoning_enabled_static(void) { return false; }
+static inline void __kernel_poison_pages(struct page *page, int nunmpages) { }
 static inline void kernel_poison_pages(struct page *page, int numpages) { }
 static inline void kernel_unpoison_pages(struct page *page, int numpages) { }
 #endif
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 6b1c84afa891..a3491b29c5cc 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1144,6 +1144,14 @@ void free_basic_memory_bitmaps(void)
pr_debug("Basic memory bitmaps freed\n");
 }
 
+static void clear_or_poison_free_page(struct page *page)

+{
+   if (page_poisoning_enabled_static())
+   __kernel_poison_pages(page, 1);
+   else if (want_init_on_free())
+   clear_highpage(page);
+}
+
 void clear_or_poison_free_pages(void)
 {
struct memory_bitmap *bm = free_pages_map;
@@ -1156,14 +1164,8 @@ void clear_or_poison_free_pages(void)
memory_bm_position_reset(bm);
pfn = memory_bm_next_pfn(bm);
while (pfn != BM_END_OF_MAP) {
-   if (pfn_valid(pfn)) {
-   struct page *page = pfn_to_page(pfn);
-   if (page_poisoning_enabled_static())
-   kernel_poison_pages(page, 1);
-   else if (want_init_on_free())
-   clear_highpage(page);
-
-   }
+   if (pfn_valid(pfn))
+   clear_or_poison_free_page(pfn_to_page(pfn));
 
 			pfn = memory_bm_next_pfn(bm);

}
--
2.29.1




Re: [PATCH v2 2/5] mm, page_poison: use static key more efficiently

2020-11-12 Thread Vlastimil Babka

On 11/11/20 4:38 PM, David Hildenbrand wrote:

On 03.11.20 16:22, Vlastimil Babka wrote:

Commit 11c9c7edae06 ("mm/page_poison.c: replace bool variable with static key")
changed page_poisoning_enabled() to a static key check. However, the function
is not inlined, so each check still involves a function call with overhead not
eliminated when page poisoning is disabled.

Analogously to how debug_pagealloc is handled, this patch converts
page_poisoning_enabled() back to a boolean check and introduces
page_poisoning_enabled_static() for fast paths. Both functions are inlined.

The function kernel_poison_pages() is also called unconditionally and does
the static key check inside. Remove it from there and put it in the callers. Also
split it into two functions, kernel_poison_pages() and kernel_unpoison_pages(),
instead of using the confusing bool parameter.

Also optimize the check that enables page poisoning instead of debug_pagealloc
for architectures without proper debug_pagealloc support. Move the check to
init_mem_debugging_and_hardening() to enable a single static key instead of
having two static branches in page_poisoning_enabled_static().


[...]


+ * For use in fast paths after init_mem_debugging() has run, or when a
+ * false negative result is not harmful when called too early.
+ */
+static inline bool page_poisoning_enabled_static(void)
+{
+   return (static_branch_unlikely(&_page_poisoning_enabled));


As already mentioned IIRC:


Yes, it was, and I thought I fixed it. Guess not.


return static_branch_unlikely(&_page_poisoning_enabled);


+}



@@ -1260,7 +1271,8 @@ static __always_inline bool free_pages_prepare(struct page *page,
if (want_init_on_free())
kernel_init_free_pages(page, 1 << order);
  
-	kernel_poison_pages(page, 1 << order, 0);

+   if (page_poisoning_enabled_static())
+   kernel_poison_pages(page, 1 << order);


This would look much better by having kernel_poison_pages() simply be
implemented in a header, where the static check is performed.

Take a look at how it's handled in mm/shuffle.h
 
Ok. Fixup below.


8<

From 7ce26ba61296f583f0f9089e7887f07424f25d2c Mon Sep 17 00:00:00 2001
From: Vlastimil Babka 
Date: Thu, 12 Nov 2020 15:20:58 +0100
Subject: [PATCH] mm, page_poison: use static key more efficiently-fix

Non-functional cleanups, per David Hildenbrand.

Signed-off-by: Vlastimil Babka 
---
 include/linux/mm.h | 16 +---
 mm/page_alloc.c|  7 +++
 mm/page_poison.c   |  4 ++--
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4d6dd9f44571..861b9392b5dc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2867,8 +2867,8 @@ extern int apply_to_existing_page_range(struct mm_struct *mm,
 
 extern void init_mem_debugging_and_hardening(void);

 #ifdef CONFIG_PAGE_POISONING
-extern void kernel_poison_pages(struct page *page, int numpages);
-extern void kernel_unpoison_pages(struct page *page, int numpages);
+extern void __kernel_poison_pages(struct page *page, int numpages);
+extern void __kernel_unpoison_pages(struct page *page, int numpages);
 extern bool _page_poisoning_enabled_early;
 DECLARE_STATIC_KEY_FALSE(_page_poisoning_enabled);
 static inline bool page_poisoning_enabled(void)
@@ -2881,7 +2881,17 @@ static inline bool page_poisoning_enabled(void)
  */
 static inline bool page_poisoning_enabled_static(void)
 {
-   return (static_branch_unlikely(&_page_poisoning_enabled));
+   return static_branch_unlikely(&_page_poisoning_enabled);
+}
+static inline void kernel_poison_pages(struct page *page, int numpages)
+{
+   if (page_poisoning_enabled_static())
+   __kernel_poison_pages(page, numpages);
+}
+static inline void kernel_unpoison_pages(struct page *page, int numpages)
+{
+   if (page_poisoning_enabled_static())
+   __kernel_unpoison_pages(page, numpages);
 }
 #else
 static inline bool page_poisoning_enabled(void) { return false; }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd7f9345adc0..1388b5939551 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1271,8 +1271,8 @@ static __always_inline bool free_pages_prepare(struct page *page,
if (want_init_on_free())
kernel_init_free_pages(page, 1 << order);
 
-	if (page_poisoning_enabled_static())

-   kernel_poison_pages(page, 1 << order);
+   kernel_poison_pages(page, 1 << order);
+
/*
 * arch_free_page() can make the page's contents inaccessible.  s390
 * does this.  So nothing which can access the page's contents should
@@ -2281,8 +2281,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
if (debug_pagealloc_enabled_static())
kernel_map_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
-   if (page_poisoning_enabled_static())
-   

Re: [PATCH v21 19/19] mm/lru: revise the comments of lru_lock

2020-11-12 Thread Vlastimil Babka

On 11/5/20 9:55 AM, Alex Shi wrote:

From: Hugh Dickins 

Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to
fix the incorrect comments in code. Also fixed some zone->lru_lock comment
error from ancient time. etc.

I struggled to understand the comment above move_pages_to_lru() (surely
it never calls page_referenced()), and eventually realized that most of
it had got separated from shrink_active_list(): move that comment back.

Signed-off-by: Hugh Dickins 
Signed-off-by: Alex Shi 
Acked-by: Johannes Weiner 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: Andrey Ryabinin 
Cc: Jann Horn 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Cc: Matthew Wilcox 
Cc: Hugh Dickins 
Cc: cgro...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org


Acked-by: Vlastimil Babka 


Re: [PATCH v21 18/19] mm/lru: introduce the relock_page_lruvec function

2020-11-12 Thread Vlastimil Babka

On 11/5/20 9:55 AM, Alex Shi wrote:

From: Alexander Duyck 

Use this new function to replace the repeated same code; no functional change.

When testing for relock we can avoid the need for RCU locking if we simply
compare the page pgdat and memcg pointers versus those that the lruvec is
holding. By doing this we can avoid the extra pointer walks and accesses of
the memory cgroup.


Ah, clever. Seems to address my worry from the previous patch (except for the
potential to improve the documenting comments).



In addition we can avoid the checks entirely if lruvec is currently NULL.

Signed-off-by: Alexander Duyck 
Signed-off-by: Alex Shi 
Acked-by: Hugh Dickins 
Acked-by: Johannes Weiner 


Acked-by: Vlastimil Babka 


Cc: Johannes Weiner 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Andrey Ryabinin 
Cc: Matthew Wilcox 
Cc: Mel Gorman 
Cc: Konstantin Khlebnikov 
Cc: Hugh Dickins 
Cc: Tejun Heo 
Cc: linux-kernel@vger.kernel.org
Cc: cgro...@vger.kernel.org
Cc: linux...@kvack.org


Re: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock

2020-11-12 Thread Vlastimil Babka

On 11/5/20 9:55 AM, Alex Shi wrote:

This patch moves the per-node lru_lock into the lruvec, thus bringing an
lru_lock for each memcg per node. So on a large machine, each memcg doesn't
have to suffer from per-node pgdat->lru_lock competition; they can go
fast with their own lru_lock.

After moving the memcg charge before lru insertion, page isolation can
serialize the page's memcg, so the per-memcg lruvec lock is stable and can
replace the per-node lru lock.

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function in the locking code which may give some clues if
something gets out of hand.

Daniel Jordan's testing show 62% improvement on modified readtwice case
on his 2P * 10 core * 2 HT broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3lo...@ca-dmjordan1.us.oracle.com/

On a large machine with memcg enabled but not used, looking up the page's
lruvec passes through a few extra pointers, which may increase lru_lock
holding time and cause a slight regression.

Hugh Dickins helped on the patch polish, thanks!

Signed-off-by: Alex Shi 
Acked-by: Hugh Dickins 
Cc: Rong Chen 
Cc: Hugh Dickins 
Cc: Andrew Morton 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Yang Shi 
Cc: Matthew Wilcox 
Cc: Konstantin Khlebnikov 
Cc: Tejun Heo 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
Cc: cgro...@vger.kernel.org


I think I need some explanation about the rcu_read_lock() usage in 
lock_page_lruvec*() (and places effectively opencoding it).
Preferably in the form of a code comment, but that can also be added as an
additional patch later; I don't want to block the series.


mem_cgroup_page_lruvec() comment says

 * This function relies on page->mem_cgroup being stable - see the
 * access rules in commit_charge().

commit_charge() comment:

 * Any of the following ensures page->mem_cgroup stability:
 *
 * - the page lock
 * - LRU isolation
 * - lock_page_memcg()
 * - exclusive reference

"LRU isolation" used to be quite clear, but now is it after 
TestClearPageLRU(page) or after deleting from the lru list as well?

Also it doesn't mention rcu_read_lock(), should it?

So what exactly are we protecting by rcu_read_lock() in e.g. lock_page_lruvec()?

rcu_read_lock();
lruvec = mem_cgroup_page_lruvec(page, pgdat);
spin_lock(&lruvec->lru_lock);
rcu_read_unlock();

Looks like we are protecting the lruvec from going away, and it can't go away
anymore once we take the lru_lock?


But then e.g. in __munlock_pagevec() we are doing this without an 
rcu_read_lock():

new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));

where new_lruvec is potentially not the one that we have locked

And the last thing mem_cgroup_page_lruvec() is doing is:

if (unlikely(lruvec->pgdat != pgdat))
lruvec->pgdat = pgdat;
return lruvec;

So without the rcu_read_lock(), is this potentially accessing the pgdat field
of a lruvec that might have just gone away?


Thanks,
Vlastimil

