Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-21 Thread Waiman Long
On 5/21/19 3:23 PM, Roman Gushchin wrote:
> On Tue, May 21, 2019 at 02:39:50PM -0400, Waiman Long wrote:
>> On 5/14/19 8:06 PM, Shakeel Butt wrote:
 @@ -2651,20 +2652,35 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 struct mem_cgroup *memcg;
 struct kmem_cache *memcg_cachep;
 int kmemcg_id;
 +   struct memcg_cache_array *arr;

 VM_BUG_ON(!is_root_cache(cachep));

 if (memcg_kmem_bypass())
 return cachep;

 -   memcg = get_mem_cgroup_from_current();
 +   rcu_read_lock();
 +
 +   if (unlikely(current->active_memcg))
 +   memcg = current->active_memcg;
 +   else
 +   memcg = mem_cgroup_from_task(current);
 +
 +   if (!memcg || memcg == root_mem_cgroup)
 +   goto out_unlock;
 +
 kmemcg_id = READ_ONCE(memcg->kmemcg_id);
 if (kmemcg_id < 0)
 -   goto out;
 +   goto out_unlock;

 -   memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id);
 -   if (likely(memcg_cachep))
 -   return memcg_cachep;
 +   arr = rcu_dereference(cachep->memcg_params.memcg_caches);
 +
 +   /*
 +* Make sure we will access the up-to-date value. The code updating
 +* memcg_caches issues a write barrier to match this (see
 +* memcg_create_kmem_cache()).
 +*/
 +   memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);

 /*
  * If we are in a safe context (can wait, and not in interrupt
 @@ -2677,10 +2693,20 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
  * memcg_create_kmem_cache, this means no further allocation
  * could happen with the slab_mutex held. So it's better to
  * defer everything.
 +*
 +* If the memcg is dying or memcg_cache is about to be released,
 +* don't bother creating new kmem_caches. Because memcg_cachep
 +* is ZEROed as the first step of kmem offlining, we don't need
 +* percpu_ref_tryget() here. css_tryget_online() check in
>>> *percpu_ref_tryget_live()
>>>
 +* memcg_schedule_kmem_cache_create() will prevent the
 +* creation of a new kmem_cache.
  */
 -   memcg_schedule_kmem_cache_create(memcg, cachep);
 -out:
 -   css_put(&memcg->css);
 +   if (unlikely(!memcg_cachep))
 +   memcg_schedule_kmem_cache_create(memcg, cachep);
 +   else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
 +   cachep = memcg_cachep;
 +out_unlock:
 +   rcu_read_lock();
>> There is one more bug that causes the kernel to panic on bootup when I
>> turned on debugging options.
>>
>> [   49.871437] =
>> [   49.875452] WARNING: suspicious RCU usage
>> [   49.879476] 5.2.0-rc1.bz1699202_memcg_test+ #2 Not tainted
>> [   49.884967] -
>> [   49.888991] include/linux/rcupdate.h:268 Illegal context switch in
>> RCU read-side critical section!
>> [   49.897950]
>> [   49.897950] other info that might help us debug this:
>> [   49.897950]
>> [   49.905958]
>> [   49.905958] rcu_scheduler_active = 2, debug_locks = 1
>> [   49.912492] 3 locks held by systemd/1:
>> [   49.916252]  #0: 633673c5 (&type->i_mutex_dir_key#5){.+.+},
>> at: lookup_slow+0x42/0x70
>> [   49.924788]  #1: 29fa8c75 (rcu_read_lock){....}, at:
>> memcg_kmem_get_cache+0x12b/0x910
>> [   49.933316]  #2: 29fa8c75 (rcu_read_lock){....}, at:
>> memcg_kmem_get_cache+0x3da/0x910
>>
>> It should be "rcu_read_unlock();" at the end.
> Oops. Good catch, thanks Waiman!
>
> I'm somewhat surprised it didn't show up in my tests, nor did any of the
> test bots catch it. Anyway, I'll fix it and send v5.

In a non-preempt kernel rcu_read_lock() is almost a no-op, so you probably
won't see any ill effect from this bug.
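
For reference, a minimal sketch of the corrected epilogue (an
illustration of the fix, not the committed code):

	rcu_read_lock();
	/* ... RCU-protected lookup of the per-memcg cache ... */
out_unlock:
	/*
	 * Must drop the read lock here: v4 calls rcu_read_lock() again,
	 * so the read-side critical section is never closed and lockdep
	 * flags the next sleep as an illegal context switch.
	 */
	rcu_read_unlock();
	return cachep;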

>
> Does the rest of the patchset look sane to you?

I haven't done a full review of the patch, but from my cursory look it
seems sane to me. We hit a similar problem at Red Hat, which is why I am
looking at your patch. Looking forward to your v5.

Cheers,
Longman



Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-21 Thread Roman Gushchin
On Tue, May 21, 2019 at 02:39:50PM -0400, Waiman Long wrote:
> On 5/14/19 8:06 PM, Shakeel Butt wrote:
> >> @@ -2651,20 +2652,35 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
> >> struct mem_cgroup *memcg;
> >> struct kmem_cache *memcg_cachep;
> >> int kmemcg_id;
> >> +   struct memcg_cache_array *arr;
> >>
> >> VM_BUG_ON(!is_root_cache(cachep));
> >>
> >> if (memcg_kmem_bypass())
> >> return cachep;
> >>
> >> -   memcg = get_mem_cgroup_from_current();
> >> +   rcu_read_lock();
> >> +
> >> +   if (unlikely(current->active_memcg))
> >> +   memcg = current->active_memcg;
> >> +   else
> >> +   memcg = mem_cgroup_from_task(current);
> >> +
> >> +   if (!memcg || memcg == root_mem_cgroup)
> >> +   goto out_unlock;
> >> +
> >> kmemcg_id = READ_ONCE(memcg->kmemcg_id);
> >> if (kmemcg_id < 0)
> >> -   goto out;
> >> +   goto out_unlock;
> >>
> >> -   memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id);
> >> -   if (likely(memcg_cachep))
> >> -   return memcg_cachep;
> >> +   arr = rcu_dereference(cachep->memcg_params.memcg_caches);
> >> +
> >> +   /*
> >> +* Make sure we will access the up-to-date value. The code updating
> >> +* memcg_caches issues a write barrier to match this (see
> >> +* memcg_create_kmem_cache()).
> >> +*/
> >> +   memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
> >>
> >> /*
> >>  * If we are in a safe context (can wait, and not in interrupt
> >> @@ -2677,10 +2693,20 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
> >>  * memcg_create_kmem_cache, this means no further allocation
> >>  * could happen with the slab_mutex held. So it's better to
> >>  * defer everything.
> >> +*
> >> +* If the memcg is dying or memcg_cache is about to be released,
> >> +* don't bother creating new kmem_caches. Because memcg_cachep
> >> +* is ZEROed as the first step of kmem offlining, we don't need
> >> +* percpu_ref_tryget() here. css_tryget_online() check in
> > *percpu_ref_tryget_live()
> >
> >> +* memcg_schedule_kmem_cache_create() will prevent the
> >> +* creation of a new kmem_cache.
> >>  */
> >> -   memcg_schedule_kmem_cache_create(memcg, cachep);
> >> -out:
> >> -   css_put(&memcg->css);
> >> +   if (unlikely(!memcg_cachep))
> >> +   memcg_schedule_kmem_cache_create(memcg, cachep);
> >> +   else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
> >> +   cachep = memcg_cachep;
> >> +out_unlock:
> >> +   rcu_read_lock();
> 
> There is one more bug that causes the kernel to panic on bootup when I
> turned on debugging options.
> 
> [   49.871437] =
> [   49.875452] WARNING: suspicious RCU usage
> [   49.879476] 5.2.0-rc1.bz1699202_memcg_test+ #2 Not tainted
> [   49.884967] -
> [   49.888991] include/linux/rcupdate.h:268 Illegal context switch in
> RCU read-side critical section!
> [   49.897950]
> [   49.897950] other info that might help us debug this:
> [   49.897950]
> [   49.905958]
> [   49.905958] rcu_scheduler_active = 2, debug_locks = 1
> [   49.912492] 3 locks held by systemd/1:
> [   49.916252]  #0: 633673c5 (&type->i_mutex_dir_key#5){.+.+},
> at: lookup_slow+0x42/0x70
> [   49.924788]  #1: 29fa8c75 (rcu_read_lock){....}, at:
> memcg_kmem_get_cache+0x12b/0x910
> [   49.933316]  #2: 29fa8c75 (rcu_read_lock){....}, at:
> memcg_kmem_get_cache+0x3da/0x910
> 
> It should be "rcu_read_unlock();" at the end.

Oops. Good catch, thanks Waiman!

I'm somewhat surprised it didn't show up in my tests, nor did any of the
test bots catch it. Anyway, I'll fix it and send v5.

Does the rest of the patchset look sane to you?

Thank you!

Roman


Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-20 Thread Roman Gushchin
On Mon, May 20, 2019 at 10:54:24AM -0400, Waiman Long wrote:
> On 5/14/19 8:06 PM, Shakeel Butt wrote:
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 4e5b4292a763..1ee967b4805e 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -45,6 +45,8 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work);
> >  static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
> > slab_caches_to_rcu_destroy_workfn);
> >
> > +static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref);
> > +
> 
> kmemcg_queue_cache_shutdown is only defined if CONFIG_MEMCG_KMEM is
> defined. If it is not defined, a compilation warning can be produced.
> Maybe putting the declaration inside a CONFIG_MEMCG_KMEM block:

Hi Waiman!

Yes, that makes total sense to me. Thank you for letting me know!
How about this one?

--

From 0fa19369adc240cc93281911a59713822a4f3e07 Mon Sep 17 00:00:00 2001
From: Roman Gushchin 
Date: Mon, 20 May 2019 10:52:07 -0700
Subject: [PATCH] mm: guard kmemcg_queue_cache_shutdown() with
 CONFIG_MEMCG_KMEM

Currently kmemcg_queue_cache_shutdown() is defined only if
CONFIG_MEMCG_KMEM is set; however, the declaration is not guarded by
the corresponding ifdefs, so a compilation warning might be produced.

Let's move the declaration to the section of slab_common.c, where all
kmemcg-specific stuff is defined.

Reported-by: Waiman Long 
Signed-off-by: Roman Gushchin 
---
 mm/slab_common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9d2a3d6245dc..e818609c8209 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -45,8 +45,6 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work);
 static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
slab_caches_to_rcu_destroy_workfn);
 
-static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref);
-
 /*
  * Set of flags that will prevent slab merging
  */
@@ -134,6 +132,8 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 LIST_HEAD(slab_root_caches);
 static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
 
+static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref);
+
 void slab_init_memcg_params(struct kmem_cache *s)
 {
s->memcg_params.root_cache = NULL;
-- 
2.20.1


Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-20 Thread Waiman Long
On 5/14/19 8:06 PM, Shakeel Butt wrote:
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4e5b4292a763..1ee967b4805e 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -45,6 +45,8 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work);
>  static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
> slab_caches_to_rcu_destroy_workfn);
>
> +static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref);
> +

kmemcg_queue_cache_shutdown is only defined if CONFIG_MEMCG_KMEM is
defined. If it is not defined, a compilation warning can be produced.
Maybe putting the declaration inside a CONFIG_MEMCG_KMEM block:

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 61d7a96a917b..57ba6cf3dc39 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -45,7 +45,9 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work);
 static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
         slab_caches_to_rcu_destroy_workfn);
 
+#ifdef CONFIG_MEMCG_KMEM
 static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref);
+#endif
 
 /*
  * Set of flags that will prevent slab merging
-- 

Cheers,
Longman
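

As a minimal illustration of the warning Waiman describes (hypothetical
CONFIG_FOO and function name, assuming a -Wall build): when the option
is off, the unconditional declaration is left with neither a definition
nor a user.

	static void queue_shutdown(void);	/* always declared */

	#ifdef CONFIG_FOO
	/* in real code the definition and all callers live in this block */
	static void queue_shutdown(void)
	{
	}
	#endif

	/*
	 * With CONFIG_FOO unset, gcc warns along the lines of:
	 *   'queue_shutdown' declared 'static' but never defined
	 */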



Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-15 Thread Shakeel Butt
From: Christopher Lameter 
Date: Wed, May 15, 2019 at 7:00 AM
To: Roman Gushchin
Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Michal Hocko,
Rik van Riel, Vladimir Davydov

> On Tue, 14 May 2019, Roman Gushchin wrote:
>
> > To make this possible we need to introduce a new percpu refcounter
> > for non-root kmem_caches. The counter is initialized to the percpu
> > mode, and is switched to atomic mode after deactivation, so we never
> > shutdown an active cache. The counter is bumped for every charged page
> > and also for every running allocation. So the kmem_cache can't
> > be released unless all allocations complete.
>
> Increase refcounts during each allocation? Looks to be quite heavy
> processing.

Not really, it's a percpu refcnt. Basically the memcg's
percpu_ref_tryget* is replaced with the kmem_cache's percpu_ref_tryget,
so there is no additional processing.
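
To make that concrete, a rough sketch of the hot path after this patch
(simplified from the memcg_kmem_get_cache() hunk quoted earlier;
percpu_ref internals elided):

	rcu_read_lock();
	/*
	 * Pins the kmem_cache instead of the memcg. In percpu mode this
	 * tryget is roughly a this_cpu_inc(), with no shared atomics.
	 */
	if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
		cachep = memcg_cachep;
	rcu_read_unlock();

The counter only falls back to a shared atomic once the cache has been
switched to atomic mode at deactivation, which is also when the fast
path stops mattering.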


Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-15 Thread Christopher Lameter
On Tue, 14 May 2019, Roman Gushchin wrote:

> To make this possible we need to introduce a new percpu refcounter
> for non-root kmem_caches. The counter is initialized to the percpu
> mode, and is switched to atomic mode after deactivation, so we never
> shutdown an active cache. The counter is bumped for every charged page
> and also for every running allocation. So the kmem_cache can't
> be released unless all allocations complete.

Increase refcounts during each allocation? Looks to be quite heavy
processing.


Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-14 Thread Shakeel Butt
From: Roman Gushchin 
Date: Tue, May 14, 2019 at 2:55 PM
To: Andrew Morton, Shakeel Butt
Cc: Johannes Weiner, Michal Hocko, Rik van Riel, Christoph Lameter,
Vladimir Davydov, Roman Gushchin

> This commit makes several important changes in the lifecycle
> of a non-root kmem_cache, which also affect the lifecycle
> of a memory cgroup.
>
> Currently each charged slab page has a page->mem_cgroup pointer
> to the memory cgroup and holds a reference to it.
> Kmem_caches are held by the memcg and are released with it.
> It means that none of the kmem_caches can be released unless at least
> one reference to the memcg exists, which is not optimal.
>
> So the current scheme can be illustrated as:
> page->mem_cgroup->kmem_cache.
>
> To implement the slab memory reparenting we need to invert the scheme
> into: page->kmem_cache->mem_cgroup.
>
> Let's make every page hold a reference to the kmem_cache (we already
> have a stable pointer), and make each kmem_cache hold a single
> reference to the memory cgroup.
>
> To make this possible we need to introduce a new percpu refcounter
> for non-root kmem_caches. The counter is initialized to the percpu
> mode, and is switched to atomic mode after deactivation, so we never
> shutdown an active cache. The counter is bumped for every charged page
> and also for every running allocation. So the kmem_cache can't
> be released unless all allocations complete.
>
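A rough sketch of the lifecycle this describes (the release callback
name follows the patch; call sites are simplified):

	/* creation: the counter starts in fast percpu mode */
	percpu_ref_init(&s->memcg_params.refcnt,
			kmemcg_queue_cache_shutdown, 0, GFP_KERNEL);

	/* every charged page and every running allocation */
	if (percpu_ref_tryget(&s->memcg_params.refcnt)) {
		/* ... use the cache ... */
		percpu_ref_put(&s->memcg_params.refcnt);
	}

	/* deactivation: switch to atomic mode and drop the base ref;
	 * the release callback runs once the count reaches zero */
	percpu_ref_kill(&s->memcg_params.refcnt);
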
> To shutdown non-active empty kmem_caches, let's reuse the
> infrastructure of the RCU-delayed work queue, used previously for
> the deactivation. After the generalization, it's perfectly suited
> for our needs.
>
> Since now we can release a kmem_cache at any moment after the
> deactivation, let's call sysfs_slab_remove() only from the shutdown
> path. It makes deactivation path simpler.
>
> Because we don't set the page->mem_cgroup pointer, we need to change
> how memcg-level stats work for slab pages. We can't use the
> mod_lruvec_page_state() helpers anymore, so switch over to
> mod_lruvec_state().
>
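For instance, the charging path can no longer go through the page; a
sketch under these assumptions (helper names as used in this series,
v5.2-era mem_cgroup_lruvec() signature):

	/* page->mem_cgroup is no longer set for slab pages, so resolve
	 * the lruvec from the cache's memcg rather than from the page */
	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
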
> * I used the following simple approach to test the performance
> (stolen from another patchset by T. Harding):
>
> time find / -name fname-no-exist
> echo 2 > /proc/sys/vm/drop_caches
> repeat 10 times
>
> Results (I've chosen best results in several runs):
>
> orig                    patched
>
> real    0m0.648s        real    0m0.593s
> user    0m0.148s        user    0m0.162s
> sys     0m0.295s        sys     0m0.253s
>
> real    0m0.581s        real    0m0.649s
> user    0m0.119s        user    0m0.136s
> sys     0m0.254s        sys     0m0.250s
>
> real    0m0.645s        real    0m0.705s
> user    0m0.138s        user    0m0.138s
> sys     0m0.263s        sys     0m0.250s
>
> real    0m0.691s        real    0m0.718s
> user    0m0.139s        user    0m0.134s
> sys     0m0.262s        sys     0m0.253s
>
> real    0m0.654s        real    0m0.715s
> user    0m0.146s        user    0m0.128s
> sys     0m0.247s        sys     0m0.261s
>
> real    0m0.675s        real    0m0.717s
> user    0m0.129s        user    0m0.137s
> sys     0m0.277s        sys     0m0.248s
>
> real    0m0.631s        real    0m0.719s
> user    0m0.137s        user    0m0.134s
> sys     0m0.255s        sys     0m0.251s
>
> real    0m0.622s        real    0m0.715s
> user    0m0.108s        user    0m0.124s
> sys     0m0.279s        sys     0m0.264s
>
> real    0m0.651s        real    0m0.669s
> user    0m0.139s        user    0m0.139s
> sys     0m0.252s        sys     0m0.247s
>
> real    0m0.671s        real    0m0.632s
> user    0m0.130s        user    0m0.139s
> sys     0m0.263s        sys     0m0.245s
>
> So it looks like the difference is not noticeable in this test.
>
> Signed-off-by: Roman Gushchin 

Reviewed-by: Shakeel Butt 

> ---
>  include/linux/slab.h |  3 +-
>  mm/memcontrol.c  | 57 +-
>  mm/slab.h| 82 +---
>  mm/slab_common.c | 74 +++
>  mm/slub.c| 12 +--
>  5 files changed, 135 insertions(+), 93 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 47923c173f30..1b54e5f83342 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -16,6 +16,7 @@
>  #include <linux/overflow.h>
>  #include <linux/types.h>
>  #include <linux/workqueue.h>
> +#include <linux/percpu-refcount.h>
>
>
>  /*
> @@ -152,7 +153,6 @@ int kmem_cache_shrink(struct kmem_cache *);
>
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> -void memcg_destroy_kmem_caches(struct mem_cgroup *);
>
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> @@ -641,6 +641,7 @@ struct memcg_cache_params {
> struct mem_cgroup *memcg;
> struct list_head children_node;
> struct list_head kmem_caches_node;
> +   struct percpu_ref refcnt;
>
>