Re: [PATCH] mm: idle-page: fix oops because end_pfn is larger than max_pfn

2019-06-18 Thread Vladimir Davydov
On Tue, Jun 18, 2019 at 12:45:02PM -0700, Andrew Morton wrote:
> On Tue, 18 Jun 2019 13:43:52 +0100 Colin King wrote:
> 
> > From: Colin Ian King 
> > 
> > Currently the calculation of end_pfn can round the pfn number up to
> > more than the actual maximum number of pfns, causing an Oops. Fix
> > this by ensuring end_pfn is never more than max_pfn.
> > 
> > This can be easily triggered on systems where end_pfn gets
> > rounded up to more than max_pfn, using the idle-page stress-ng
> > stress test:
> > 
> 
> cc Vladimir.  This seems rather obvious - I'm wondering if the code was
> that way for some subtle reason?

No subtle reason at all - just a bug. The patch looks good to me,

Acked-by: Vladimir Davydov 


Re: [PATCH v7 10/10] mm: reparent memcg kmem_caches on cgroup removal

2019-06-16 Thread Vladimir Davydov
On Tue, Jun 11, 2019 at 04:18:13PM -0700, Roman Gushchin wrote:
> Let's reparent non-root kmem_caches on memcg offlining. This allows us
> to release the memory cgroup without waiting for the last outstanding
> kernel object (e.g. dentry used by another application).
> 
> Since the parent cgroup is already charged, everything we need to do
> is to splice the list of kmem_caches to the parent's kmem_caches list,
> swap the memcg pointer, drop the css refcounter for each kmem_cache
> and adjust the parent's css refcounter.
> 
> Please, note that kmem_cache->memcg_params.memcg isn't a stable
> pointer anymore. It's safe to read it under rcu_read_lock(),
> cgroup_mutex held, or any other way that protects the memory cgroup
> from being released.
> 
> We can race with the slab allocation and deallocation paths. It's not
> a big problem: parent's charge and slab global stats are always
> correct, and we don't care anymore about the child usage and global
> stats. The child cgroup is already offline, so we don't use or show it
> anywhere.
> 
> Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE)
> aren't used anywhere except count_shadow_nodes(). But even there it
> won't break anything: after reparenting "nodes" will be 0 on child
> level (because we're already reparenting shrinker lists), and on
> parent level page stats always were 0, and this patch won't change
> anything.
> 
> Signed-off-by: Roman Gushchin 

Acked-by: Vladimir Davydov 
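
For illustration, the reparenting steps described in the commit message
boil down to roughly the following (a sketch only; the helper name and
exact field layout are assumptions, not the patch itself):

/*
 * Sketch: reparent all of @memcg's kmem_caches to @parent.
 * The caller is assumed to hold slab_mutex.
 */
static void reparent_memcg_kmem_caches(struct mem_cgroup *memcg,
				       struct mem_cgroup *parent)
{
	struct kmem_cache *s;
	int nr_reparented = 0;

	list_for_each_entry(s, &memcg->kmem_caches,
			    memcg_params.kmem_caches_node) {
		/* swap the memcg pointer; readers use rcu_read_lock() */
		rcu_assign_pointer(s->memcg_params.memcg, parent);
		/* drop the child's css reference held by this cache */
		css_put(&memcg->css);
		nr_reparented++;
	}
	/* take the same number of references on the parent ... */
	css_get_many(&parent->css, nr_reparented);
	/* ... and splice the whole list onto the parent's list at once */
	list_splice_init(&memcg->kmem_caches, &parent->kmem_caches);
}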


Re: [PATCH v7 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock

2019-06-16 Thread Vladimir Davydov
On Tue, Jun 11, 2019 at 04:18:10PM -0700, Roman Gushchin wrote:
> Currently the memcg_params.dying flag and the corresponding
> workqueue used for the asynchronous deactivation of kmem_caches
> is synchronized using the slab_mutex.
> 
> It makes it impossible to check this flag from the irq context,
> which will be required in order to implement asynchronous release
> of kmem_caches.
> 
> So let's switch over to the irq-save flavor of the spinlock-based
> synchronization.
> 
> Signed-off-by: Roman Gushchin 

Acked-by: Vladimir Davydov 


Re: [PATCH v7 06/10] mm: don't check the dying flag on kmem_cache creation

2019-06-16 Thread Vladimir Davydov
On Tue, Jun 11, 2019 at 04:18:09PM -0700, Roman Gushchin wrote:
> There is no point in checking the root_cache->memcg_params.dying
> flag on kmem_cache creation path. New allocations shouldn't be
> performed using a dead root kmem_cache, so no new memcg kmem_cache
> creation can be scheduled after the flag is set. And if it was
> scheduled before, flush_memcg_workqueue() will wait for it anyway.
> 
> So let's drop this check to simplify the code.
> 
> Signed-off-by: Roman Gushchin 

Acked-by: Vladimir Davydov 


Re: [PATCH v6 10/10] mm: reparent slab memory on cgroup removal

2019-06-09 Thread Vladimir Davydov
On Tue, Jun 04, 2019 at 07:44:54PM -0700, Roman Gushchin wrote:
> Let's reparent memcg slab memory on memcg offlining. This allows us
> to release the memory cgroup without waiting for the last outstanding
> kernel object (e.g. dentry used by another application).
> 
> So instead of reparenting all accounted slab pages, let's reparent
> a relatively small number of kmem_caches. Reparenting is performed as
> part of the deactivation process.
> 
> Since the parent cgroup is already charged, everything we need to do
> is to splice the list of kmem_caches to the parent's kmem_caches list,
> swap the memcg pointer and drop the css refcounter for each kmem_cache
> and adjust the parent's css refcounter. Quite simple.
> 
> Please, note that kmem_cache->memcg_params.memcg isn't a stable
> pointer anymore. It's safe to read it under rcu_read_lock() or
> with slab_mutex held.
> 
> We can race with the slab allocation and deallocation paths. It's not
> a big problem: parent's charge and slab global stats are always
> correct, and we don't care anymore about the child usage and global
> stats. The child cgroup is already offline, so we don't use or show it
> anywhere.
> 
> Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE)
> aren't used anywhere except count_shadow_nodes(). But even there it
> won't break anything: after reparenting "nodes" will be 0 on child
> level (because we're already reparenting shrinker lists), and on
> parent level page stats always were 0, and this patch won't change
> anything.
> 
> Signed-off-by: Roman Gushchin 
> ---
>  include/linux/slab.h |  4 ++--
>  mm/list_lru.c|  8 +++-
>  mm/memcontrol.c  | 14 --
>  mm/slab.h| 23 +--
>  mm/slab_common.c | 22 +++---
>  5 files changed, 53 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 1b54e5f83342..109cab2ad9b4 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -152,7 +152,7 @@ void kmem_cache_destroy(struct kmem_cache *);
>  int kmem_cache_shrink(struct kmem_cache *);
>  
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
> -void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> +void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
>  
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> @@ -638,7 +638,7 @@ struct memcg_cache_params {
>   bool dying;
>   };
>   struct {
> - struct mem_cgroup *memcg;
> + struct mem_cgroup __rcu *memcg;
>   struct list_head children_node;
>   struct list_head kmem_caches_node;
>   struct percpu_ref refcnt;
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 0f1f6b06b7f3..0b2319897e86 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -77,11 +77,15 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
>   if (!nlru->memcg_lrus)
>   goto out;
>  
> + rcu_read_lock();
>   memcg = mem_cgroup_from_kmem(ptr);
> - if (!memcg)
> + if (!memcg) {
> + rcu_read_unlock();
>   goto out;
> + }
>  
>   l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
> + rcu_read_unlock();
>  out:
>   if (memcg_ptr)
>   *memcg_ptr = memcg;
> @@ -131,12 +135,14 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>  
>   spin_lock(&nlru->lock);
>   if (list_empty(item)) {
> + rcu_read_lock();
>   l = list_lru_from_kmem(nlru, item, &memcg);
>   list_add_tail(item, &l->list);
>   /* Set shrinker bit if the first element was added */
>   if (!l->nr_items++)
>   memcg_set_shrinker_bit(memcg, nid,
>  lru_shrinker_id(lru));
> + rcu_read_unlock();

AFAICS we don't need rcu_read_lock here, because holding nlru->lock
guarantees that the cgroup doesn't get freed. If that's correct,
I think we better remove __rcu mark and use READ_ONCE for accessing
memcg_params.memcg, thus making it the caller's responsibility to
ensure the cgroup lifetime.
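
Something like this, maybe (just a sketch; the helper name is made up):

/*
 * Plain READ_ONCE() accessor: the pointer may be changed by reparenting
 * at any time, so the caller must guarantee the lifetime of whichever
 * memcg it observes (nlru->lock, rcu_read_lock(), slab_mutex, ...).
 */
static inline struct mem_cgroup *memcg_from_slab_cache(struct kmem_cache *s)
{
	return READ_ONCE(s->memcg_params.memcg);
}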

>   nlru->nr_items++;
>   spin_unlock(&nlru->lock);
>   return true;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c097b1fc74ec..0f64a2c06803 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3209,15 +3209,15 @@ static void memcg_offline_kmem(struct mem_cgroup 
> *memcg)
>*/
>   memcg->kmem_state = KMEM_ALLOCATED;
>  
> - memcg_deactivate_kmem_caches(memcg);
> -
> - kmemcg_id = memcg->kmemcg_id;
> - BUG_ON(kmemcg_id < 0);
> -
>   parent = parent_mem_cgroup(memcg);
>   if (!parent)
>   parent = root_mem_cgroup;
>  
> + memcg_deactivate_kmem_caches(memcg, parent);
> +
> + kmemcg_id = 

Re: [PATCH v6 09/10] mm: stop setting page->mem_cgroup pointer for slab pages

2019-06-09 Thread Vladimir Davydov
On Tue, Jun 04, 2019 at 07:44:53PM -0700, Roman Gushchin wrote:
> Every slab page charged to a non-root memory cgroup has a pointer
> to the memory cgroup and holds a reference to it, which protects
> a non-empty memory cgroup from being released. At the same time
> the page has a pointer to the corresponding kmem_cache, and also
> hold a reference to the kmem_cache. And kmem_cache by itself
> holds a reference to the cgroup.
> 
> So there is clearly some redundancy, which allows to stop setting
> the page->mem_cgroup pointer and rely on getting memcg pointer
> indirectly via kmem_cache. Further it will allow to change this
> pointer easier, without a need to go over all charged pages.
> 
> So let's stop setting page->mem_cgroup pointer for slab pages,
> and stop using the css refcounter directly for protecting
> the memory cgroup from going away. Instead rely on kmem_cache
> as an intermediate object.
> 
> Make sure that vmstats and shrinker lists are working as previously,
> as well as /proc/kpagecgroup interface.
> 
> Signed-off-by: Roman Gushchin 

Acked-by: Vladimir Davydov 


Re: [PATCH v6 08/10] mm: rework non-root kmem_cache lifecycle management

2019-06-09 Thread Vladimir Davydov
On Tue, Jun 04, 2019 at 07:44:52PM -0700, Roman Gushchin wrote:
> Currently each charged slab page holds a reference to the cgroup to
> which it's charged. Kmem_caches are held by the memcg and are released
> all together with the memory cgroup. It means that none of kmem_caches
> are released unless at least one reference to the memcg exists, which
> is very far from optimal.
> 
> Let's rework it in a way that allows releasing individual kmem_caches
> as soon as the cgroup is offline, the kmem_cache is empty and there
> are no pending allocations.
> 
> To make it possible, let's introduce a new percpu refcounter for
> non-root kmem caches. The counter is initialized to the percpu mode,
> and is switched to the atomic mode during kmem_cache deactivation. The
> counter is bumped for every charged page and also for every running
> allocation. So the kmem_cache can't be released unless all allocations
> complete.
> 
> To shutdown non-active empty kmem_caches, let's reuse the work queue,
> previously used for the kmem_cache deactivation. Once the reference
> counter reaches 0, let's schedule an asynchronous kmem_cache release.
> 
> * I used the following simple approach to test the performance
> (stolen from another patchset by T. Harding):
> 
> time find / -name fname-no-exist
> echo 2 > /proc/sys/vm/drop_caches
> repeat 10 times
> 
> Results:
> 
> orig  patched
> 
> real  0m1.455sreal0m1.355s
> user  0m0.206suser0m0.219s
> sys   0m0.855ssys 0m0.807s
> 
> real  0m1.487sreal0m1.699s
> user  0m0.221suser0m0.256s
> sys   0m0.806ssys 0m0.948s
> 
> real  0m1.515sreal0m1.505s
> user  0m0.183suser0m0.215s
> sys   0m0.876ssys 0m0.858s
> 
> real  0m1.291sreal0m1.380s
> user  0m0.193suser0m0.198s
> sys   0m0.843ssys 0m0.786s
> 
> real  0m1.364sreal0m1.374s
> user  0m0.180suser0m0.182s
> sys   0m0.868ssys 0m0.806s
> 
> real  0m1.352sreal0m1.312s
> user  0m0.201suser0m0.212s
> sys   0m0.820ssys 0m0.761s
> 
> real  0m1.302sreal0m1.349s
> user  0m0.205suser0m0.203s
> sys   0m0.803ssys 0m0.792s
> 
> real  0m1.334sreal0m1.301s
> user  0m0.194suser0m0.201s
> sys   0m0.806ssys 0m0.779s
> 
> real  0m1.426sreal0m1.434s
> user  0m0.216suser0m0.181s
> sys   0m0.824ssys 0m0.864s
> 
> real  0m1.350sreal    0m1.295s
> user  0m0.200suser0m0.190s
> sys   0m0.842ssys 0m0.811s
> 
> So it looks like the difference is not noticeable in this test.
> 
> Signed-off-by: Roman Gushchin 

Acked-by: Vladimir Davydov 


Re: [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock

2019-06-09 Thread Vladimir Davydov
On Tue, Jun 04, 2019 at 07:44:51PM -0700, Roman Gushchin wrote:
> Currently the memcg_params.dying flag and the corresponding
> workqueue used for the asynchronous deactivation of kmem_caches
> is synchronized using the slab_mutex.
> 
> It makes it impossible to check this flag from the irq context,
> which will be required in order to implement asynchronous release
> of kmem_caches.
> 
> So let's switch over to the irq-save flavor of the spinlock-based
> synchronization.
> 
> Signed-off-by: Roman Gushchin 
> ---
>  mm/slab_common.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 09b26673b63f..2914a8f0aa85 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -130,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
> flags, size_t nr,
>  #ifdef CONFIG_MEMCG_KMEM
>  
>  LIST_HEAD(slab_root_caches);
> +static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
>  
>  void slab_init_memcg_params(struct kmem_cache *s)
>  {
> @@ -629,6 +630,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
>   struct memcg_cache_array *arr;
>   struct kmem_cache *s = NULL;
>   char *cache_name;
> + bool dying;
>   int idx;
>  
>   get_online_cpus();
> @@ -640,7 +642,13 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
>* The memory cgroup could have been offlined while the cache
>* creation work was pending.
>*/
> - if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying)
> + if (memcg->kmem_state != KMEM_ONLINE)
> + goto out_unlock;
> +
> + spin_lock_irq(&memcg_kmem_wq_lock);
> + dying = root_cache->memcg_params.dying;
> + spin_unlock_irq(&memcg_kmem_wq_lock);
> + if (dying)
>   goto out_unlock;

I do understand why we need to sync setting dying flag for a kmem cache
about to be destroyed in flush_memcg_workqueue vs checking the flag in
kmemcg_cache_deactivate: this is needed so that we don't schedule a new
deactivation work after we flush RCU/workqueue. However, I don't think
it's necessary to check the dying flag here, in memcg_create_kmem_cache:
we can't schedule a new cache creation work after kmem_cache_destroy has
started, because one mustn't allocate from a dead kmem cache; since we
flush the queue before getting to actual destruction, no cache creation
work can be pending. Yeah, it might happen that a cache creation work
starts execution while flush_memcg_workqueue is in progress, but I don't
see any point in optimizing this case - after all, cache destruction is
a very cold path. Since checking the flag in memcg_create_kmem_cache
raises questions, I suggest simply dropping this check.

Anyway, it would be nice to see a comment in the code explaining why
we check the dying flag under a spinlock in kmemcg_cache_deactivate.
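
For example (illustrative only - this is the hunk above with the kind of
comment I mean added):

static void kmemcg_cache_deactivate(struct kmem_cache *s)
{
	__kmemcg_cache_deactivate(s);

	/*
	 * memcg_kmem_wq_lock serializes this check against
	 * flush_memcg_workqueue() setting the dying flag: once the flag
	 * is seen set here, no new deactivation rcu/work callback gets
	 * scheduled, so flushing the rcu/workqueue in kmem_cache_destroy()
	 * is enough to guarantee nothing is left in flight.
	 */
	spin_lock_irq(&memcg_kmem_wq_lock);
	if (s->memcg_params.root_cache->memcg_params.dying)
		goto unlock;

	/* pin memcg so that @s doesn't get destroyed in the middle */
	css_get(&s->memcg_params.memcg->css);

	s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu;
	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
unlock:
	spin_unlock_irq(&memcg_kmem_wq_lock);
}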

>  
>   idx = memcg_cache_id(memcg);
> @@ -735,14 +743,17 @@ static void kmemcg_cache_deactivate(struct kmem_cache 
> *s)
>  
>   __kmemcg_cache_deactivate(s);
>  
> + spin_lock_irq(&memcg_kmem_wq_lock);
>   if (s->memcg_params.root_cache->memcg_params.dying)
> - return;
> + goto unlock;
>  
>   /* pin memcg so that @s doesn't get destroyed in the middle */
>   css_get(&s->memcg_params.memcg->css);
>  
>   s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu;
>   call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
> +unlock:
> + spin_unlock_irq(&memcg_kmem_wq_lock);
>  }
>  
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
> @@ -852,9 +863,9 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>  
>  static void flush_memcg_workqueue(struct kmem_cache *s)
>  {
> - mutex_lock(&slab_mutex);
> + spin_lock_irq(&memcg_kmem_wq_lock);
>   s->memcg_params.dying = true;
> - mutex_unlock(&slab_mutex);
> + spin_unlock_irq(&memcg_kmem_wq_lock);
>  
>   /*
>* SLAB and SLUB deactivate the kmem_caches through call_rcu. Make


Re: [PATCH v6 05/10] mm: introduce __memcg_kmem_uncharge_memcg()

2019-06-09 Thread Vladimir Davydov
On Tue, Jun 04, 2019 at 07:44:49PM -0700, Roman Gushchin wrote:
> Let's separate the page counter modification code out of
> __memcg_kmem_uncharge(), similar to how __memcg_kmem_charge()
> and __memcg_kmem_charge_memcg() are split.
> 
> This will allow reusing this code later via a new
> memcg_kmem_uncharge_memcg() wrapper, which calls
> __memcg_kmem_uncharge_memcg() if the memcg_kmem_enabled()
> check passes.
> 
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Shakeel Butt 

Acked-by: Vladimir Davydov 


Re: [PATCH v6 04/10] mm: generalize postponed non-root kmem_cache deactivation

2019-06-09 Thread Vladimir Davydov
On Tue, Jun 04, 2019 at 07:44:48PM -0700, Roman Gushchin wrote:
> Currently SLUB uses a work scheduled after an RCU grace period
> to deactivate a non-root kmem_cache. This mechanism can be reused
> for kmem_caches release, but requires generalization for SLAB
> case.
> 
> Introduce kmemcg_cache_deactivate() function, which calls
> allocator-specific __kmem_cache_deactivate() and schedules
> execution of __kmem_cache_deactivate_after_rcu() with all
> necessary locks in a worker context after an rcu grace period.
> 
> Here is the new calling scheme:
>   kmemcg_cache_deactivate()
> __kmemcg_cache_deactivate()  SLAB/SLUB-specific
> kmemcg_rcufn()   rcu
>   kmemcg_workfn()work
> __kmemcg_cache_deactivate_after_rcu()SLAB/SLUB-specific
> 
> instead of:
>   __kmemcg_cache_deactivate()SLAB/SLUB-specific
> slab_deactivate_memcg_cache_rcu_sched()  SLUB-only
>   kmemcg_rcufn() rcu
> kmemcg_workfn()  work
>   kmemcg_cache_deact_after_rcu() SLUB-only
> 
> For consistency, all allocator-specific functions start with "__".
> 
> Signed-off-by: Roman Gushchin 

Much easier to review after extracting renaming, thanks.

Acked-by: Vladimir Davydov 


Re: [PATCH v6 03/10] mm: rename slab delayed deactivation functions and fields

2019-06-09 Thread Vladimir Davydov
On Tue, Jun 04, 2019 at 07:44:47PM -0700, Roman Gushchin wrote:
> The delayed work/rcu deactivation infrastructure of non-root
> kmem_caches can be also used for asynchronous release of these
> objects. Let's get rid of the word "deactivation" in corresponding
> names to make the code look better after generalization.
> 
> It's easier to make the renaming first, so that the generalized
> code will look consistent from scratch.
> 
> Let's rename struct memcg_cache_params fields:
>   deact_fn -> work_fn
>   deact_rcu_head -> rcu_head
>   deact_work -> work
> 
> And RCU/delayed work callbacks in slab common code:
>   kmemcg_deactivate_rcufn -> kmemcg_rcufn
>   kmemcg_deactivate_workfn -> kmemcg_workfn
> 
> This patch contains no functional changes, only renamings.
> 
> Signed-off-by: Roman Gushchin 

Acked-by: Vladimir Davydov 


Re: [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer

2019-06-09 Thread Vladimir Davydov
On Tue, Jun 04, 2019 at 07:44:45PM -0700, Roman Gushchin wrote:
> Johannes noticed that reading the memcg kmem_cache pointer in
> cache_from_memcg_idx() is performed using READ_ONCE() macro,
> which doesn't implement a SMP barrier, which is required
> by the logic.
> 
> Add a proper smp_rmb() to be paired with smp_wmb() in
> memcg_create_kmem_cache().
> 
> The same applies to memcg_create_kmem_cache() itself,
> which reads the same value without barriers and READ_ONCE().
> 
> Suggested-by: Johannes Weiner 
> Signed-off-by: Roman Gushchin 
> ---
>  mm/slab.h| 1 +
>  mm/slab_common.c | 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 739099af6cbb..1176b61bb8fc 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -260,6 +260,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
>* memcg_caches issues a write barrier to match this (see
>* memcg_create_kmem_cache()).
>*/
> + smp_rmb();
>   cachep = READ_ONCE(arr->entries[idx]);

Hmm, we used to have lockless_dereference() here, but it was replaced
with READ_ONCE some time ago. The commit message claims that READ_ONCE
has an implicit read barrier in it.
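
To spell out the pairing in question (a toy sketch, not the slab code):

#include <linux/compiler.h>
#include <asm/barrier.h>

struct obj {
	int ready;
};

static struct obj *slot;

static void publish(struct obj *o)
{
	o->ready = 1;
	smp_wmb();			/* order init before publication */
	WRITE_ONCE(slot, o);
}

static int consume(void)
{
	/*
	 * READ_ONCE() heads the dependency chain and, since commit
	 * 76ebbe78f739 below, includes smp_read_barrier_depends(), so
	 * the dependent load is ordered without an explicit smp_rmb().
	 */
	struct obj *o = READ_ONCE(slot);

	return o ? o->ready : 0;
}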

commit 506458efaf153c1ea480591c5602a5a3ba5a3b76
Author: Will Deacon 
Date:   Tue Oct 24 11:22:48 2017 +0100

locking/barriers: Convert users of lockless_dereference() to READ_ONCE()

READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.

Signed-off-by: Will Deacon 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.dea...@arm.com
Signed-off-by: Ingo Molnar 

commit 76ebbe78f7390aee075a7f3768af197ded1bdfbb
Author: Will Deacon 
Date:   Tue Oct 24 11:22:47 2017 +0100

locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()

In preparation for the removal of lockless_dereference(), which is the
same as READ_ONCE() on all architectures other than Alpha, add an
implicit smp_read_barrier_depends() to READ_ONCE() so that it can be
used to head dependency chains on all architectures.

Signed-off-by: Will Deacon 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1508840570-22169-3-git-send-email-will.dea...@arm.com
Signed-off-by: Ingo Molnar 


Re: [PATCH v5 6/7] mm: reparent slab memory on cgroup removal

2019-05-28 Thread Vladimir Davydov
On Tue, May 28, 2019 at 07:58:17PM +, Roman Gushchin wrote:
> It looks like outstanding questions are:
> 1) synchronization around the dying flag
> 2) removing CONFIG_SLOB in 2/7
> 3) early sysfs_slab_remove()
> 4) mem_cgroup_from_kmem in 7/7
> 
> Please, let me know if I missed anything.

Also, I think that it might be possible to get rid of RCU call in kmem
cache destructor, because the cgroup subsystem already handles it and
we could probably piggyback - see my comment to 5/7. Not sure if it's
really necessary, since we already have RCU in SLUB, but worth looking
into, I guess, as it might simplify the code a bit.


Re: [PATCH v5 6/7] mm: reparent slab memory on cgroup removal

2019-05-28 Thread Vladimir Davydov
On Tue, May 21, 2019 at 01:07:34PM -0700, Roman Gushchin wrote:
> Let's reparent memcg slab memory on memcg offlining. This allows us
> to release the memory cgroup without waiting for the last outstanding
> kernel object (e.g. dentry used by another application).
> 
> So instead of reparenting all accounted slab pages, let's reparent
> a relatively small number of kmem_caches. Reparenting is performed as
> part of the deactivation process.
> 
> Since the parent cgroup is already charged, everything we need to do
> is to splice the list of kmem_caches to the parent's kmem_caches list,
> swap the memcg pointer and drop the css refcounter for each kmem_cache
> and adjust the parent's css refcounter. Quite simple.
> 
> Please, note that kmem_cache->memcg_params.memcg isn't a stable
> pointer anymore. It's safe to read it under rcu_read_lock() or
> with slab_mutex held.
> 
> We can race with the slab allocation and deallocation paths. It's not
> a big problem: parent's charge and slab global stats are always
> correct, and we don't care anymore about the child usage and global
> stats. The child cgroup is already offline, so we don't use or show it
> anywhere.
> 
> Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE)
> aren't used anywhere except count_shadow_nodes(). But even there it
> won't break anything: after reparenting "nodes" will be 0 on child
> level (because we're already reparenting shrinker lists), and on
> parent level page stats always were 0, and this patch won't change
> anything.
> 
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Shakeel Butt 

This one looks good to me. I can't see why anything could possibly go
wrong after this change.


Re: [PATCH v5 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-28 Thread Vladimir Davydov
On Tue, May 28, 2019 at 08:08:28PM +0300, Vladimir Davydov wrote:
> Hello Roman,
> 
> On Tue, May 21, 2019 at 01:07:33PM -0700, Roman Gushchin wrote:
> > This commit makes several important changes in the lifecycle
> > of a non-root kmem_cache, which also affect the lifecycle
> > of a memory cgroup.
> > 
> > Currently each charged slab page has a page->mem_cgroup pointer
> > to the memory cgroup and holds a reference to it.
> > Kmem_caches are held by the memcg and are released with it.
> > It means that none of kmem_caches are released unless at least one
> > reference to the memcg exists, which is not optimal.
> > 
> > So the current scheme can be illustrated as:
> > page->mem_cgroup->kmem_cache.
> > 
> > To implement the slab memory reparenting we need to invert the scheme
> > into: page->kmem_cache->mem_cgroup.
> > 
> > Let's make every page to hold a reference to the kmem_cache (we
> > already have a stable pointer), and make kmem_caches to hold a single
> > reference to the memory cgroup.
> 
> Is there any reason why we can't reference both mem cgroup and kmem
> cache per each charged kmem page? I mean,
> 
>   page->mem_cgroup references mem_cgroup
>   page->kmem_cache references kmem_cache
>   mem_cgroup references kmem_cache while it's online
> 
> TBO it seems to me that not taking a reference to mem cgroup per charged
> kmem page makes the code look less straightforward, e.g. as you
> mentioned in the commit log, we have to use mod_lruvec_state() for memcg
> pages and mod_lruvec_page_state() for root pages.

I think I completely missed the point here. In the following patch you
move kmem caches from a child to the parent cgroup on offline (aka
reparent them). That's why you can't maintain page->mem_cgroup. Sorry
for misunderstanding.


Re: [PATCH v5 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-28 Thread Vladimir Davydov
On Tue, May 28, 2019 at 01:37:50PM -0400, Waiman Long wrote:
> On 5/28/19 1:08 PM, Vladimir Davydov wrote:
> >>  static void flush_memcg_workqueue(struct kmem_cache *s)
> >>  {
> >> +  /*
> >> +   * memcg_params.dying is synchronized using slab_mutex AND
> >> +   * memcg_kmem_wq_lock spinlock, because it's not always
> >> +   * possible to grab slab_mutex.
> >> +   */
> >>	mutex_lock(&slab_mutex);
> >> +  spin_lock(&memcg_kmem_wq_lock);
> >>	s->memcg_params.dying = true;
> >> +  spin_unlock(&memcg_kmem_wq_lock);
> > I would completely switch from the mutex to the new spin lock -
> > acquiring them both looks weird.
> >
> >>	mutex_unlock(&slab_mutex);
> >>  
> >>/*
> 
> There are places where the slab_mutex is held and sleeping functions
> like kvzalloc() are called. I understand that taking both mutex and
> spinlocks look ugly, but converting all the slab_mutex critical sections
> to spinlock critical sections will be a major undertaking by itself. So
> I would suggest leaving that for now.

I didn't mean that. I meant taking spin_lock wherever we need to access
the 'dying' flag, even if slab_mutex is held. So that we don't need to
take mutex_lock in flush_memcg_workqueue, where it's used solely for
'dying' synchronization.
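
IOW, something like this (sketch; this is essentially what v7 of the
patch ends up doing):

static void flush_memcg_workqueue(struct kmem_cache *s)
{
	/*
	 * The dying flag is serialized by memcg_kmem_wq_lock alone;
	 * slab_mutex is no longer involved in this synchronization.
	 */
	spin_lock_irq(&memcg_kmem_wq_lock);
	s->memcg_params.dying = true;
	spin_unlock_irq(&memcg_kmem_wq_lock);

	/* then, as before: wait for pending rcu callbacks and work items */
	rcu_barrier();
	flush_workqueue(memcg_kmem_cache_wq);
}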


Re: [PATCH v5 7/7] mm: fix /proc/kpagecgroup interface for slab pages

2019-05-28 Thread Vladimir Davydov
On Tue, May 21, 2019 at 01:07:35PM -0700, Roman Gushchin wrote:
> Switching to an indirect scheme of getting the mem_cgroup pointer for
> !root slab pages broke the /proc/kpagecgroup interface for them.
> 
> Let's fix it by teaching page_cgroup_ino() how to get the memcg
> pointer for slab pages.
> 
> Reported-by: Shakeel Butt 
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Shakeel Butt 
> ---
>  mm/memcontrol.c  |  5 -
>  mm/slab.h| 25 +
>  mm/slab_common.c |  1 +
>  3 files changed, 30 insertions(+), 1 deletion(-)

What about mem_cgroup_from_kmem, see mm/list_lru.c?
Shouldn't we fix it, too?


Re: [PATCH v5 1/7] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache()

2019-05-28 Thread Vladimir Davydov
On Tue, May 21, 2019 at 01:07:29PM -0700, Roman Gushchin wrote:
> Initialize kmem_cache->memcg_params.memcg pointer in
> memcg_link_cache() rather than in init_memcg_params().
> 
> Once the kmem_cache holds a reference to the memory cgroup,
> it will simplify the refcounting.
> 
> For non-root kmem_caches memcg_link_cache() is always called
> before the kmem_cache becomes visible to a user, so it's safe.
> 
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Shakeel Butt 

This one is a no-brainer.

Acked-by: Vladimir Davydov 

Provided it's necessary for the rest of the series.


Re: [PATCH v5 4/7] mm: unify SLAB and SLUB page accounting

2019-05-28 Thread Vladimir Davydov
On Tue, May 21, 2019 at 01:07:32PM -0700, Roman Gushchin wrote:
> Currently the page accounting code is duplicated in SLAB and SLUB
> internals. Let's move it into new (un)charge_slab_page helpers
> in the slab_common.c file. These helpers will be responsible
> for statistics (global and memcg-aware) and memcg charging.
> So they are replacing direct memcg_(un)charge_slab() calls.
> 
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Shakeel Butt 
> Acked-by: Christoph Lameter 

Makes sense even without the rest of the series,

Acked-by: Vladimir Davydov 


Re: [PATCH v5 2/7] mm: generalize postponed non-root kmem_cache deactivation

2019-05-28 Thread Vladimir Davydov
On Tue, May 21, 2019 at 01:07:30PM -0700, Roman Gushchin wrote:
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 6e00bdf8618d..4e5b4292a763 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -866,11 +859,12 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
>   mutex_unlock(&slab_mutex);
>  
>   /*
> -  * SLUB deactivates the kmem_caches through call_rcu. Make
> +  * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
>* sure all registered rcu callbacks have been invoked.
>*/
> - if (IS_ENABLED(CONFIG_SLUB))
> - rcu_barrier();
> +#ifndef CONFIG_SLOB
> + rcu_barrier();
> +#endif

Nit: you don't need to check CONFIG_SLOB here as this code is under
CONFIG_MEMCG_KMEM which depends on !SLOB.

Other than that, the patch looks good to me, provided using rcu for slab
deactivation proves to be really necessary, although I'd split it into two
patches - one doing the renaming, another the refactoring - easier to
review that way.


Re: [PATCH v5 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-28 Thread Vladimir Davydov
Hello Roman,

On Tue, May 21, 2019 at 01:07:33PM -0700, Roman Gushchin wrote:
> This commit makes several important changes in the lifecycle
> of a non-root kmem_cache, which also affect the lifecycle
> of a memory cgroup.
> 
> Currently each charged slab page has a page->mem_cgroup pointer
> to the memory cgroup and holds a reference to it.
> Kmem_caches are held by the memcg and are released with it.
> It means that none of kmem_caches are released unless at least one
> reference to the memcg exists, which is not optimal.
> 
> So the current scheme can be illustrated as:
> page->mem_cgroup->kmem_cache.
> 
> To implement the slab memory reparenting we need to invert the scheme
> into: page->kmem_cache->mem_cgroup.
> 
> Let's make every page to hold a reference to the kmem_cache (we
> already have a stable pointer), and make kmem_caches to hold a single
> reference to the memory cgroup.

Is there any reason why we can't reference both mem cgroup and kmem
cache per each charged kmem page? I mean,

  page->mem_cgroup references mem_cgroup
  page->kmem_cache references kmem_cache
  mem_cgroup references kmem_cache while it's online

TBO it seems to me that not taking a reference to mem cgroup per charged
kmem page makes the code look less straightforward, e.g. as you
mentioned in the commit log, we have to use mod_lruvec_state() for memcg
pages and mod_lruvec_page_state() for root pages.

> 
> To make this possible we need to introduce a new percpu refcounter
> for non-root kmem_caches. The counter is initialized to the percpu
> mode, and is switched to atomic mode after deactivation, so we never
> shutdown an active cache. The counter is bumped for every charged page
> and also for every running allocation. So the kmem_cache can't
> be released unless all allocations complete.
> 
> To shutdown non-active empty kmem_caches, let's reuse the
> infrastructure of the RCU-delayed work queue, used previously for
> the deactivation. After the generalization, it's perfectly suited
> for our needs.
> 
> Since now we can release a kmem_cache at any moment after the
> deactivation, let's call sysfs_slab_remove() only from the shutdown
> path. It makes deactivation path simpler.

But a cache can be dangling for quite a while after cgroup was taken
down, even after this patch, because there still can be pages charged to
it. The reason why we call sysfs_slab_remove() is to delete associated
files from sysfs ASAP. I'd try to preserve the current behavior if
possible.

> 
> Because we don't set the page->mem_cgroup pointer, we need to change
> the way how memcg-level stats is working for slab pages. We can't use
> mod_lruvec_page_state() helpers anymore, so switch over to
> mod_lruvec_state().

> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4e5b4292a763..8d68de4a2341 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -727,9 +737,31 @@ static void kmemcg_schedule_work_after_rcu(struct 
> rcu_head *head)
>   queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
>  }
>  
> +static void kmemcg_cache_shutdown_after_rcu(struct kmem_cache *s)
> +{
> + WARN_ON(shutdown_cache(s));
> +}
> +
> +static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref)
> +{
> + struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
> + memcg_params.refcnt);
> +
> + spin_lock(&memcg_kmem_wq_lock);

This code may be called from irq context AFAIU so you should use
irq-safe primitive.
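
I.e. something like this (a sketch of the same hunk with the irq-safe
primitive):

static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref)
{
	struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
					    memcg_params.refcnt);
	unsigned long flags;

	/* percpu_ref release callbacks may fire from irq context */
	spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
	if (s->memcg_params.root_cache->memcg_params.dying)
		goto unlock;

	WARN_ON(s->memcg_params.work_fn);
	s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
	call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
unlock:
	spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);
}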

> + if (s->memcg_params.root_cache->memcg_params.dying)
> + goto unlock;
> +
> + WARN_ON(s->memcg_params.work_fn);
> + s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
> + call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);

I may be totally wrong here, but I have a suspicion we don't really need
rcu here.

As I see it, you add this code so as to prevent memcg_kmem_get_cache
from dereferencing a destroyed kmem cache. Can't we continue using
css_tryget_online for that? I mean, take rcu_read_lock() and try to get
css reference. If you succeed, then the cgroup must be online, and
css_offline won't be called until you unlock rcu, right? This means that
the cache is guaranteed to be alive until then, because the cgroup holds
a reference to all its kmem caches until it's taken offline.
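
Roughly like this (a sketch of the lifetime argument only, not a drop-in
replacement; a real implementation would also pin the cache itself, e.g.
via its percpu refcount, before dropping the css reference):

static struct kmem_cache *get_memcg_cache(struct kmem_cache *root,
					  struct mem_cgroup *memcg)
{
	struct kmem_cache *cachep = root;

	rcu_read_lock();
	if (memcg && css_tryget_online(&memcg->css)) {
		/*
		 * The cgroup is online here, so css_offline() can't run
		 * until we drop the reference, and the cgroup holds its
		 * kmem caches until it goes offline - hence the cache
		 * can't be destroyed under us.
		 */
		cachep = cache_from_memcg_idx(root, memcg_cache_id(memcg));
		if (!cachep)
			cachep = root;
		css_put(&memcg->css);
	}
	rcu_read_unlock();
	return cachep;
}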

> +unlock:
> + spin_unlock(&memcg_kmem_wq_lock);
> +}
> +
>  static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>   __kmemcg_cache_deactivate_after_rcu(s);
> + percpu_ref_kill(&s->memcg_params.refcnt);
>  }
>  
>  static void kmemcg_cache_deactivate(struct kmem_cache *s)
> @@ -854,8 +861,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>  
>  static void flush_memcg_workqueue(struct kmem_cache *s)
>  {
> + /*
> +  * memcg_params.dying is synchronized using slab_mutex AND
> +  * memcg_kmem_wq_lock spinlock, because it's not always
> +  * possible to grab slab_mutex.
> +  */
>   

Re: [PATCH] memcg: make it work on sparse non-0-node systems

2019-05-17 Thread Vladimir Davydov
On Fri, May 17, 2019 at 06:48:37AM +0200, Jiri Slaby wrote:
> On 16. 05. 19, 15:59, Michal Hocko wrote:
> >> However, I tend to agree with Michal that (ab)using node[0].memcg_lrus
> >> to check if a list_lru is memcg aware looks confusing. I guess we could
> >> simply add a bool flag to list_lru instead. Something like this, may be:
> > 
> > Yes, this makes much more sense to me!
> 
> I am not sure if I should send a patch with this solution or Vladimir
> will (given he is an author and has a diff already)?

I didn't even try to compile it, let alone test it. I'd appreciate it if
you could wrap it up and send it out under your authorship. Feel free
to add my acked-by.


Re: [PATCH] memcg: make it work on sparse non-0-node systems

2019-05-09 Thread Vladimir Davydov
On Mon, Apr 29, 2019 at 12:59:39PM +0200, Jiri Slaby wrote:
> We have a single node system with node 0 disabled:
>   Scanning NUMA topology in Northbridge 24
>   Number of physical nodes 2
>   Skipping disabled node 0
>   Node 1 MemBase  Limit fbff
>   NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfe]
> 
> This causes crashes in memcg when system boots:
>   BUG: unable to handle kernel NULL pointer dereference at 0008
>   #PF error: [normal kernel read fault]
> ...
>   RIP: 0010:list_lru_add+0x94/0x170
> ...
>   Call Trace:
>d_lru_add+0x44/0x50
>dput.part.34+0xfc/0x110
>__fput+0x108/0x230
>task_work_run+0x9f/0xc0
>exit_to_usermode_loop+0xf5/0x100
> 
> It is reproducible as far as 4.12. I did not try older kernels. You have
> to have a new enough systemd, e.g. 241 (the reason is unknown -- was not
> investigated). Cannot be reproduced with systemd 234.
> 
> The system crashes because the size of lru array is never updated in
> memcg_update_all_list_lrus and the reads are past the zero-sized array,
> causing dereferences of random memory.
> 
> The root cause are list_lru_memcg_aware checks in the list_lru code.
> The test in list_lru_memcg_aware is broken: it assumes node 0 is always
> present, but it is not true on some systems as can be seen above.
> 
> So fix this by checking the first online node instead of node 0.
> 
> Signed-off-by: Jiri Slaby 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Vladimir Davydov 
> Cc: 
> Cc: 
> Cc: Raghavendra K T 
> ---
>  mm/list_lru.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 0730bf8ff39f..7689910f1a91 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -37,11 +37,7 @@ static int lru_shrinker_id(struct list_lru *lru)
>  
>  static inline bool list_lru_memcg_aware(struct list_lru *lru)
>  {
> - /*
> -  * This needs node 0 to be always present, even
> -  * in the systems supporting sparse numa ids.
> -  */
> - return !!lru->node[0].memcg_lrus;
> + return !!lru->node[first_online_node].memcg_lrus;
>  }
>  
>  static inline struct list_lru_one *

Yep, I didn't expect node 0 could ever be unavailable, my bad.
The patch looks fine to me:

Acked-by: Vladimir Davydov 

However, I tend to agree with Michal that (ab)using node[0].memcg_lrus
to check if a list_lru is memcg aware looks confusing. I guess we could
simply add a bool flag to list_lru instead. Something like this, may be:

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index aa5efd9351eb..d5ceb2839a2d 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -54,6 +54,7 @@ struct list_lru {
 #ifdef CONFIG_MEMCG_KMEM
	struct list_head	list;
	int			shrinker_id;
+	bool			memcg_aware;
 #endif
 };
 
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 0730bf8ff39f..8e605e40a4c6 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -37,11 +37,7 @@ static int lru_shrinker_id(struct list_lru *lru)
 
 static inline bool list_lru_memcg_aware(struct list_lru *lru)
 {
-   /*
-* This needs node 0 to be always present, even
-* in the systems supporting sparse numa ids.
-*/
-   return !!lru->node[0].memcg_lrus;
+   return lru->memcg_aware;
 }
 
 static inline struct list_lru_one *
@@ -451,6 +447,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool 
memcg_aware)
 {
int i;
 
+   lru->memcg_aware = memcg_aware;
if (!memcg_aware)
return 0;
 


Re: [PATCH 0/5] mm: reparent slab memory on cgroup removal

2019-04-18 Thread Vladimir Davydov
Hello Roman,

On Wed, Apr 17, 2019 at 02:54:29PM -0700, Roman Gushchin wrote:
> There is however a significant problem with reparenting of slab memory:
> there is no list of charged pages. Some of them are in shrinker lists,
> but not all. Introducing a new list is really not an option.

True, introducing a list of charged pages would negatively affect
SL[AU]B performance since we would need to protect it with some kind
of lock.

> 
> But fortunately there is a way forward: every slab page has a stable pointer
> to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> instead of slab pages.
> 
> It's actually simpler and cheaper, but requires some underlying changes:
> 1) Make kmem_caches to hold a single reference to the memory cgroup,
>instead of a separate reference per every slab page.
> 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
>page->kmem_cache->memcg indirection instead. It's used only on
>slab page release, so it shouldn't be a big issue.
> 3) Introduce a refcounter for non-root slab caches. It's required to
>be able to destroy kmem_caches when they become empty and release
>the associated memory cgroup.

Which means an unconditional atomic inc/dec on charge/uncharge paths
AFAIU. Note, we have per cpu batching so charging a kmem page in cgroup
v2 doesn't require an atomic variable modification. I guess you could
use some sort of per cpu ref counting though.
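
For the record, the percpu_ref API fits that pattern (sketch; this is
what the later versions of the series end up doing with
memcg_params.refcnt):

#include <linux/percpu-refcount.h>

static void cache_release(struct percpu_ref *ref)
{
	/* schedule the actual kmem_cache destruction from here */
}

static int cache_refcnt_init(struct percpu_ref *ref)
{
	/* starts in percpu mode: get/put stay off the atomic path */
	return percpu_ref_init(ref, cache_release, 0, GFP_KERNEL);
}

/*
 * Charge path:   percpu_ref_get(ref)  - per charged page / in-flight alloc
 * Uncharge path: percpu_ref_put(ref)  - may invoke cache_release() at zero
 * Deactivation:  percpu_ref_kill(ref) - switches to atomic mode and drops
 *                the initial reference
 */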

Anyway, releasing mem_cgroup objects, but leaving kmem_cache objects
dangling looks kinda awkward to me. It would be great if we could
release both, but I assume it's hardly possible due to SL[AU]B
complexity.

What about reusing dead cgroups instead? Yeah, it would be kinda unfair,
because a fresh cgroup would get a legacy of objects left from previous
owners, but still, if we delete a cgroup, the workload must be dead and
so apart from a few long-lived objects, there should mostly be cached
objects charged to it, which should be easily released on memory
pressure. Sorry if somebody's asked this question before - I must have
missed that.

Thanks,
Vladimir


Re: [PATCH] mm: slowly shrink slabs with a relatively small number of objects

2018-09-04 Thread Vladimir Davydov
On Tue, Sep 04, 2018 at 10:52:46AM -0700, Roman Gushchin wrote:
> Reparenting of all pages is definitely an option to consider,

Reparenting pages would be great indeed, but I'm not sure we could do
that, because we'd have to walk over page lists of semi-active kmem
caches and do it consistently while some pages may be freed as we go.
Kmem caches are so optimized for performance that implementing such a
procedure without impacting any hot paths would be nearly impossible
IMHO. And there are two implementations (SLAB/SLUB), both of which we'd
have to support.

> but it's not free in any case, so if there is no problem,
> why should we? Let's keep it as a last measure. In my case,
> the proposed patch works perfectly: the number of dying cgroups
> jumps around 100, where it grew steadily to 2k and more before.
> 
> I believe that reparenting of LRU lists is required to minimize
> the number of LRU lists to scan, but I'm not sure.

AFAIR the sole purpose of LRU reparenting is releasing kmemcg_id as soon
as a cgroup directory is deleted. If we didn't do that, dead cgroups
would occupy slots in per memcg arrays (list_lru, kmem_cache) so if we
had say 10K dead cgroups, we'd have to allocate 80 KB arrays to store
per memcg data for each kmem_cache and list_lru. Back when kmem
accounting was introduced, we used kmalloc() for allocating those arrays
so growing the size up to 80 KB would result in getting ENOMEM when
trying to create a cgroup too often. Now, we fall back on vmalloc(), so
maybe it wouldn't be a problem...
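
For the record, the 80 KB figure assumes 64-bit pointers; each kmem_cache
and list_lru keeps a per-memcg pointer array indexed by kmemcg_id, so

	10,000 dead cgroups * sizeof(void *) = 10,000 * 8 bytes ~= 80 KB

per kmem_cache and per list_lru.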

Alternatively, I guess we could "reparent" those dangling LRU objects
not to the parent cgroup's list_lru_memcg, but instead to a special
list_lru_memcg which wouldn't be assigned to any cgroup and which would
be reclaimed ASAP on both global or memcg pressure.


Re: [PATCH] memcg: Remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure

2018-08-01 Thread Vladimir Davydov
On Wed, Aug 01, 2018 at 11:55:52AM -0400, Johannes Weiner wrote:
> On Tue, Jul 31, 2018 at 04:39:08PM -0700, Andrew Morton wrote:
> > On Mon, 30 Jul 2018 11:31:13 -0400 Johannes Weiner wrote:
> > 
> > > Subject: [PATCH] mm: memcontrol: simplify memcg idr allocation and error
> > >  unwinding
> > > 
> > > The memcg ID is allocated early in the multi-step memcg creation
> > > process, which needs 2-step ID allocation and IDR publishing, as well
> > > as two separate IDR cleanup/unwind sites on error.
> > > 
> > > Defer the IDR allocation until the last second during onlining to
> > > eliminate all this complexity. There is no requirement to have the ID
> > > and IDR entry earlier than that. And the root reference to the ID is
> > > put in the offline path, so this matches nicely.
> > 
> > This patch isn't aware of Kirill's later "mm, memcg: assign memcg-aware
> > shrinkers bitmap to memcg", which altered mem_cgroup_css_online():
> > 
> > @@ -4356,6 +4470,11 @@ static int mem_cgroup_css_online(struct
> >  {
> > struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> >  
> > +   if (memcg_alloc_shrinker_maps(memcg)) {
> > +   mem_cgroup_id_remove(memcg);
> > +   return -ENOMEM;
> > +   }
> > +
> > /* Online state pins memcg ID, memcg ID pins CSS */
> > atomic_set(&memcg->id.ref, 1);
> > css_get(css);
> > 
> 
> Hm, that looks out of place too. The bitmaps are allocated for the
> entire lifetime of the css, not just while it's online.
> 
> Any objections to the following fixup to that patch?

That would be incorrect. Memory cgroups that haven't been put online
are invisible to for_each_mem_cgroup(), which is used for expanding
shrinker maps of all cgroups - see memcg_expand_shrinker_maps(). So if
memcg_expand_shrinker_maps() is called between css_alloc and css_online,
it will miss this cgroup and its shrinker_map won't be reallocated to
fit the new id. Allocating the shrinker map in css_online guarantees
that it won't happen. Looks like this code lacks a comment...


Re: [PATCH] memcg: Remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure

2018-08-01 Thread Vladimir Davydov
On Mon, Jul 30, 2018 at 11:31:13AM -0400, Johannes Weiner wrote:
> On Sun, Jul 29, 2018 at 10:26:21PM +0300, Vladimir Davydov wrote:
> > On Fri, Jul 27, 2018 at 03:31:34PM -0400, Johannes Weiner wrote:
> > > That said, the lifetime of the root reference on the ID is the online
> > > state, we put that in css_offline. Is there a reason we need to have
> > > the ID ready and the memcg in the IDR before onlining it?
> > 
> > I fail to see any reason for this in the code.
> 
> Me neither, thanks for double checking.
> 
> The patch also survives stress testing cgroup creation and destruction
> with the script from 73f576c04b94 ("mm: memcontrol: fix cgroup
> creation failure after many small jobs").
> 
> > > Can we do something like this and not mess with the alloc/free
> > > sequence at all?
> > 
> > I guess so, and this definitely looks better to me.
> 
> Cool, then I think we should merge Kirill's patch as the fix and mine
> as a follow-up cleanup.
> 
> ---
> 
> From b4106ea1f163479da805eceada60c942bd66e524 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner 
> Date: Mon, 30 Jul 2018 11:03:55 -0400
> Subject: [PATCH] mm: memcontrol: simplify memcg idr allocation and error
>  unwinding
> 
> The memcg ID is allocated early in the multi-step memcg creation
> process, which needs 2-step ID allocation and IDR publishing, as well
> as two separate IDR cleanup/unwind sites on error.
> 
> Defer the IDR allocation until the last second during onlining to
> eliminate all this complexity. There is no requirement to have the ID
> and IDR entry earlier than that. And the root reference to the ID is
> put in the offline path, so this matches nicely.
> 
> Signed-off-by: Johannes Weiner 

Acked-by: Vladimir Davydov 


Re: [PATCH] memcg: Remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure

2018-07-29 Thread Vladimir Davydov
On Fri, Jul 27, 2018 at 03:31:34PM -0400, Johannes Weiner wrote:
> That said, the lifetime of the root reference on the ID is the online
> state, we put that in css_offline. Is there a reason we need to have
> the ID ready and the memcg in the IDR before onlining it?

I fail to see any reason for this in the code.

> Can we do something like this and not mess with the alloc/free
> sequence at all?

I guess so, and this definitely looks better to me.


Re: [PATCH v8 05/17] mm: Assign memcg-aware shrinkers bitmap to memcg

2018-07-06 Thread Vladimir Davydov
On Thu, Jul 05, 2018 at 03:10:30PM -0700, Andrew Morton wrote:
> On Wed, 4 Jul 2018 18:51:12 +0300 Kirill Tkhai  wrote:
> 
> > > - why aren't we decreasing shrinker_nr_max in
> > >   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
> > >   work in shrink_slab_memcg() and avoids memory waste in future
> > >   prealloc_memcg_shrinker() calls.
> > 
> > You sure, but there are some things. Initially I went in the same way
> > as memcg_nr_cache_ids is made and just took the same x2 arithmetic.
> > It never decreases, so it looked good to make shrinker maps like it.
> > It's the only reason, so, it should not be a problem to rework.
> > 
> > The only moment is Vladimir strongly recommends modularity, i.e.
> > to have memcg_shrinker_map_size and shrinker_nr_max as different variables.
> 
> For what reasons?

Having a single global variable updated in vmscan.c and used in
memcontrol.c, or vice versa, didn't look good to me. So I suggested
introducing two separate static variables: one for max shrinker id (local
to vmscan.c) and another for max allocated size of per memcg shrinker
maps (local to memcontrol.c). Having the two variables instead of one
allows us to define a clear API between memcontrol.c and shrinker
infrastructure without sharing variables, which makes the code easier
to follow IMHO.

It is also more flexible: with the two variables we can decrease
shrinker_id_max when a shrinker is destroyed so that we can speed up
shrink_slab - this is fairly easy to do - but leave per memcg shrinker
maps the same size, because shrinking them is rather complicated and
doesn't seem to be worthwhile - the workload is likely to use the same
amount of memory again in the future.
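
Schematically, the split looks like this (a rough sketch, not the exact
code; locking and error unwinding omitted):

/* mm/vmscan.c - only tracks the maximum shrinker id */
static int shrinker_nr_max;

static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
	int id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);

	if (id < 0)
		return id;
	if (id >= shrinker_nr_max) {
		/* the only point where memcontrol.c learns the new size */
		if (memcg_expand_shrinker_maps(id))
			return -ENOMEM;
		shrinker_nr_max = id + 1;
	}
	shrinker->id = id;
	return 0;
}

/* mm/memcontrol.c - only tracks the allocated map size */
static int memcg_shrinker_map_size;

int memcg_expand_shrinker_maps(int new_id)
{
	int size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);

	if (size <= memcg_shrinker_map_size)
		return 0;
	/* ... reallocate every online memcg's map to the new size ... */
	memcg_shrinker_map_size = size;
	return 0;
}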

> 
> > After the rework we won't be able to have this anymore, since memcontrol.c
> > will have to know actual shrinker_nr_max value and it will have to be 
> > exported.

Not necessarily. You can pass max shrinker id instead of the new id to
memcontrol.c in function arguments. But as I said before, I really don't
think that shrinking per memcg maps would make much sense.

> >
> > Could this be a problem?


Re: [PATCH v8 05/17] mm: Assign memcg-aware shrinkers bitmap to memcg

2018-07-06 Thread Vladimir Davydov
On Tue, Jul 03, 2018 at 01:50:00PM -0700, Andrew Morton wrote:
> On Tue, 03 Jul 2018 18:09:26 +0300 Kirill Tkhai  wrote:
> 
> > Imagine a big node with many cpus, memory cgroups and containers.
> > Let we have 200 containers, every container has 10 mounts,
> > and 10 cgroups. All container tasks don't touch foreign
> > containers mounts. If there is intensive pages write,
> > and global reclaim happens, a writing task has to iterate
> > over all memcgs to shrink slab, before it's able to go
> > to shrink_page_list().
> > 
> > Iteration over all the memcg slabs is very expensive:
> > the task has to visit 200 * 10 = 2000 shrinkers
> > for every memcg, and since there are 2000 memcgs,
> > the total calls are 2000 * 2000 = 4,000,000.
> > 
> > So, the shrinker makes 4 million do_shrink_slab() calls
> > just to try to isolate SWAP_CLUSTER_MAX pages in one
> > of the actively writing memcg via shrink_page_list().
> > I've observed a node spending almost 100% in kernel,
> > making useless iteration over already shrinked slab.
> > 
> > This patch adds bitmap of memcg-aware shrinkers to memcg.
> > The size of the bitmap depends on bitmap_nr_ids, and during
> > memcg life it's maintained to be enough to fit bitmap_nr_ids
> > shrinkers. Every bit in the map is related to corresponding
> > shrinker id.
> > 
> > Next patches will maintain set bit only for really charged
> > memcg. This will allow shrink_slab() to increase its
> > performance in significant way. See the last patch for
> > the numbers.
> > 
> > ...
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -182,6 +182,11 @@ static int prealloc_memcg_shrinker(struct shrinker 
> > *shrinker)
> > if (id < 0)
> > goto unlock;
> >  
> > +   if (memcg_expand_shrinker_maps(id)) {
> > +   idr_remove(&shrinker_idr, id);
> > +   goto unlock;
> > +   }
> > +
> > if (id >= shrinker_nr_max)
> > shrinker_nr_max = id + 1;
> > shrinker->id = id;
> 
> This function ends up being a rather sad little thing.
> 
> : static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> : {
> : int id, ret = -ENOMEM;
> : 
> : down_write(&shrinker_rwsem);
> : id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> : if (id < 0)
> : goto unlock;
> : 
> : if (memcg_expand_shrinker_maps(id)) {
> : idr_remove(&shrinker_idr, id);
> : goto unlock;
> : }
> : 
> : if (id >= shrinker_nr_max)
> : shrinker_nr_max = id + 1;
> : shrinker->id = id;
> : ret = 0;
> : unlock:
> : up_write(&shrinker_rwsem);
> : return ret;
> : }
> 
> - there's no need to call memcg_expand_shrinker_maps() unless id >=
>   shrinker_nr_max so why not move the code and avoid calling
>   memcg_expand_shrinker_maps() in most cases.

memcg_expand_shrinker_maps will return immediately if per memcg shrinker
maps can accommodate the new id. Since prealloc_memcg_shrinker is
definitely not a hot path, I don't see any penalty in calling this
function on each prealloc_memcg_shrinker invocation.

> 
> - why aren't we decreasing shrinker_nr_max in
>   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
>   work in shrink_slab_memcg() and avoids memory waste in future
>   prealloc_memcg_shrinker() calls.

We can shrink the maps, but IMHO it isn't worth the complexity it would
introduce, because in my experience if a workload used N mount points
(containers, whatever) at some point of its lifetime, it is likely to
use the same amount in the future.

> 
>   It should be possible to find the highest ID in an IDR tree with a
>   straightforward descent of the underlying radix tree, but I doubt if
>   that has been wired up.  Otherwise a simple loop in
>   unregister_memcg_shrinker() would be needed.
> 
> 


Re: [PATCH v4] mm: fix race between kmem_cache destroy, create and deactivate

2018-06-12 Thread Vladimir Davydov
On Mon, Jun 11, 2018 at 12:29:51PM -0700, Shakeel Butt wrote:
> The memcg kmem cache creation and deactivation (SLUB only) is
> asynchronous. If a root kmem cache is destroyed whose memcg cache is in
> the process of creation or deactivation, the kernel may crash.
> 
> Example of one such crash:
>   general protection fault:  [#1] SMP PTI
>   CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
>   ...
>   Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
>   RIP: 0010:has_cpu_slab
>   ...
>   Call Trace:
>   ? on_each_cpu_cond
>   __kmem_cache_shrink
>   kmemcg_cache_deact_after_rcu
>   kmemcg_deactivate_workfn
>   process_one_work
>   worker_thread
>   kthread
>   ret_from_fork+0x35/0x40
> 
> To fix this race, on root kmem cache destruction, mark the cache as
> dying and flush the workqueue used for memcg kmem cache creation and
> deactivation. SLUB's memcg kmem cache deactivation also includes RCU
> callback and thus make sure all previous registered RCU callbacks
> have completed as well.
> 
> Signed-off-by: Shakeel Butt 

Acked-by: Vladimir Davydov 

Thanks.


Re: [PATCH v3] mm: fix race between kmem_cache destroy, create and deactivate

2018-06-09 Thread Vladimir Davydov
On Tue, May 29, 2018 at 05:12:04PM -0700, Shakeel Butt wrote:
> The memcg kmem cache creation and deactivation (SLUB only) is
> asynchronous. If a root kmem cache is destroyed whose memcg cache is in
> the process of creation or deactivation, the kernel may crash.
> 
> Example of one such crash:
>   general protection fault:  [#1] SMP PTI
>   CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
>   ...
>   Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
>   RIP: 0010:has_cpu_slab
>   ...
>   Call Trace:
>   ? on_each_cpu_cond
>   __kmem_cache_shrink
>   kmemcg_cache_deact_after_rcu
>   kmemcg_deactivate_workfn
>   process_one_work
>   worker_thread
>   kthread
>   ret_from_fork+0x35/0x40
> 
> To fix this race, on root kmem cache destruction, mark the cache as
> dying and flush the workqueue used for memcg kmem cache creation and
> deactivation.

> @@ -845,6 +862,8 @@ void kmem_cache_destroy(struct kmem_cache *s)
>   if (unlikely(!s))
>   return;
>  
> + flush_memcg_workqueue(s);
> +

This should definitely help against async memcg_kmem_cache_create(),
but I'm afraid it doesn't eliminate the race with async destruction,
unfortunately, because the latter uses call_rcu_sched():

  memcg_deactivate_kmem_caches
   __kmem_cache_deactivate
    slab_deactivate_memcg_cache_rcu_sched
     call_rcu_sched
                                          kmem_cache_destroy
                                           shutdown_memcg_caches
                                            shutdown_cache
    memcg_deactivate_rcufn
     <the rcu callback runs on an already destroyed cache>

Can we somehow flush those pending rcu requests?
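
For illustration only, a tiny single-threaded userspace model of the ordering problem above (plain C, no real RCU; in the kernel the drain would presumably be some rcu_barrier()-style wait, which is only hinted at here): work deferred before destruction must be flushed before the object is torn down.

#include <stdio.h>
#include <stdlib.h>

struct cache { int alive; };

/* A deferred-callback queue standing in for call_rcu_sched(). */
struct deferred {
    void (*fn)(struct cache *);
    struct cache *c;
    struct deferred *next;
};

static struct deferred *pending;

static void defer(void (*fn)(struct cache *), struct cache *c)
{
    struct deferred *d = malloc(sizeof(*d));

    d->fn = fn;
    d->c = c;
    d->next = pending;
    pending = d;
}

/* Stand-in for "flush those pending requests": run everything deferred so far. */
static void flush_pending(void)
{
    while (pending) {
        struct deferred *d = pending;

        pending = d->next;
        d->fn(d->c);
        free(d);
    }
}

static void deact_after_rcu(struct cache *c)
{
    /* With the broken ordering this would be a use-after-free. */
    printf("deactivation callback sees alive=%d\n", c->alive);
}

int main(void)
{
    struct cache *c = malloc(sizeof(*c));

    c->alive = 1;
    defer(deact_after_rcu, c);  /* slab_deactivate_memcg_cache_rcu_sched() analogue */
    flush_pending();            /* drain deferred work *before* destroying */
    c->alive = 0;               /* kmem_cache_destroy() analogue */
    free(c);
    return 0;
}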


Re: [PATCH v2] mm: fix race between kmem_cache destroy, create and deactivate

2018-05-26 Thread Vladimir Davydov
On Tue, May 22, 2018 at 01:13:36PM -0700, Shakeel Butt wrote:
> The memcg kmem cache creation and deactivation (SLUB only) is
> asynchronous. If a root kmem cache is destroyed whose memcg cache is in
> the process of creation or deactivation, the kernel may crash.
> 
> Example of one such crash:
>   general protection fault:  [#1] SMP PTI
>   CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
>   ...
>   Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
>   RIP: 0010:has_cpu_slab
>   ...
>   Call Trace:
>   ? on_each_cpu_cond
>   __kmem_cache_shrink
>   kmemcg_cache_deact_after_rcu
>   kmemcg_deactivate_workfn
>   process_one_work
>   worker_thread
>   kthread
>   ret_from_fork+0x35/0x40
> 
> This issue is due to the lack of real reference counting for the root
> kmem_caches. Currently kmem_cache does have a field named refcount which
> has been used for multiple purposes i.e. shared count, reference count
> and noshare flag. Due to its conflated nature, it can not be used for
> reference counting by other subsystems.
> 
> This patch decoupled the reference counting from shared count and
> noshare flag. The new field 'shared_count' represents the shared count
> and noshare flag while 'refcount' is converted into a real reference
> counter.
> 
> The reference counting is only implemented for root kmem_caches for
> simplicity. The reference of a root kmem_cache is elevated on sharing or
> while its memcg kmem_cache creation or deactivation request is in
> flight, and thus it is made sure that the root kmem_cache is not destroyed
> in the middle. As the reference of kmem_cache is elevated on sharing,
> the 'shared_count' does not need any locking protection as at worst it
> can be out-dated for a small window which is tolerable.

I wonder if we could fix this problem without introducing reference
counting for kmem caches (which seems a bit of an overkill to me TBO),
e.g. by flushing memcg_kmem_cache_wq before root cache destruction?


Re: [PATCH] memcg: force charge kmem counter too

2018-05-26 Thread Vladimir Davydov
On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote:
> Based on several conditions the kernel can decide to force charge an
> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
> on kmem counter is set and reached.

memory.kmem.limit is broken and unlikely to ever be fixed as this knob
was deprecated in cgroup-v2. The fact that hitting the limit doesn't
trigger reclaim can result in unexpected behavior from user's pov, like
getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL
allocations isn't going to fix those problems. So I'd suggest avoiding
setting memory.kmem.limit instead of trying to fix it or, even better,
switch to cgroup-v2.

> 
> Signed-off-by: Shakeel Butt 
> ---
>  mm/memcontrol.c | 21 +++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab5673dbfc4e..0a88f824c550 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1893,6 +1893,18 @@ void mem_cgroup_handle_over_high(void)
>   current->memcg_nr_pages_over_high = 0;
>  }
>  
> +/*
> + * Based on try_charge() force charge conditions.
> + */
> +static inline bool should_force_charge(gfp_t gfp_mask)
> +{
> + return (unlikely(tsk_is_oom_victim(current) ||
> +  fatal_signal_pending(current) ||
> +  current->flags & PF_EXITING ||
> +  current->flags & PF_MEMALLOC ||
> +  gfp_mask & __GFP_NOFAIL));
> +}
> +
>  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> unsigned int nr_pages)
>  {
> @@ -2008,6 +2020,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
> gfp_mask,
>* The allocation either can't fail or will lead to more memory
>* being freed very soon.  Allow memory usage go over the limit
>* temporarily by force charging it.
> +  *
> +  * NOTE: Please keep the should_force_charge() conditions in sync.
>*/
>   page_counter_charge(&memcg->memory, nr_pages);
>   if (do_memsw_account())
>   page_counter_charge(&memcg->memsw, nr_pages);
> @@ -2331,8 +2345,11 @@ int memcg_kmem_charge_memcg(struct page *page, gfp_t 
> gfp, int order,
>  
>   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
>   !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> - cancel_charge(memcg, nr_pages);
> - return -ENOMEM;
> + if (!should_force_charge(gfp)) {
> + cancel_charge(memcg, nr_pages);
> + return -ENOMEM;
> + }
> + page_counter_charge(&memcg->kmem, nr_pages);
>   }
>  
>   page->mem_cgroup = memcg;


Re: [PATCH v7 00/17] Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))

2018-05-26 Thread Vladimir Davydov
Hello Kirill,

The whole patch set looks good to me now.

Acked-by: Vladimir Davydov 

Thanks,
Vladimir

On Tue, May 22, 2018 at 01:07:10PM +0300, Kirill Tkhai wrote:
> Hi,
> 
> this patch set solves the problem of slow shrink_slab() occurring
> on machines having many shrinkers and memory cgroups (i.e.,
> with many containers). The problem is that the complexity of shrink_slab()
> is O(n^2), and it grows too fast with the number of
> containers.
> 
> Let we have 200 containers, and every container has 10 mounts
> and 10 cgroups. All container tasks are isolated, and they don't
> touch foreign containers mounts.
> 
> In case of global reclaim, a task has to iterate all over the memcgs
> and to call all the memcg-aware shrinkers for all of them. This means,
> the task has to visit 200 * 10 = 2000 shrinkers for every memcg,
> and since there are 2000 memcgs, the total calls of do_shrink_slab()
> are 2000 * 2000 = 4,000,000.
> 
> 4 million calls are not trivial operations that take a single cpu cycle each.
> E.g., super_cache_count() accesses at least two lists and makes arithmetic
> calculations. Even if there are no charged objects, we do these calculations,
> and we pollute the cpu caches with the memory we read. I observed nodes spending almost 100%
> time in kernel, in case of intensive writing and global reclaim. The writer
> consumes pages fast, but it has to go through shrink_slab() before the reclaimer
> reaches the page shrinking function (and frees SWAP_CLUSTER_MAX pages). Even if
> there is no writing, the iterations just waste time and slow reclaim
> down.
> 
> Let's see the small test below:
> 
> $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
> $mkdir /sys/fs/cgroup/memory/ct
> $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
> $for i in `seq 0 4000`;
>   do mkdir /sys/fs/cgroup/memory/ct/$i;
>   echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
>   mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
> done
> 
> Then, let's see drop caches time (5 sequential calls):
> $time echo 3 > /proc/sys/vm/drop_caches
> 
> 0.00user 13.78system 0:13.78elapsed 99%CPU
> 0.00user 5.59system 0:05.60elapsed 99%CPU
> 0.00user 5.48system 0:05.48elapsed 99%CPU
> 0.00user 8.35system 0:08.35elapsed 99%CPU
> 0.00user 8.34system 0:08.35elapsed 99%CPU
> 
> 
> The last four calls don't actually shrink anything. So, the iterations
> over slab shrinkers take 5.48 seconds. Not so good for scalability.
> 
> The patchset solves the problem by making shrink_slab() of O(n)
> complexity. There are following functional actions:
> 
> 1)Assign id to every registered memcg-aware shrinker.
> 2)Maintain per-memcgroup bitmap of memcg-aware shrinkers,
>   and set a shrinker-related bit after the first element
>   is added to lru list (also, when removed child memcg
>   elements are reparented).
> 3)Split memcg-aware shrinkers and !memcg-aware shrinkers,
>   and call a shrinker if its bit is set in memcg's shrinker
>   bitmap.
>   (Also, there is a functionality to clear the bit, after
>   last element is shrinked).
> 
> This gives a significant performance increase. The result after the patchset is applied:
> 
> $time echo 3 > /proc/sys/vm/drop_caches
> 
> 0.00user 1.10system 0:01.10elapsed 99%CPU
> 0.00user 0.00system 0:00.01elapsed 64%CPU
> 0.00user 0.01system 0:00.01elapsed 82%CPU
> 0.00user 0.00system 0:00.01elapsed 64%CPU
> 0.00user 0.01system 0:00.01elapsed 82%CPU
> 
> The results show the performance increases at least 548 times.
> 
> So, the patchset makes shrink_slab() of less complexity and improves
> the performance in such types of load I pointed. This will give a profit
> in case of !global reclaim case, since there also will be less
> do_shrink_slab() calls.
> 
> This patchset is made against linux-next.git tree.
> 
> v7: Refactorings and readability improvements.
> 
> v6: Added missed rcu_dereference() to memcg_set_shrinker_bit().
> Use different functions for allocation and expanding map.
> Use new memcg_shrinker_map_size variable in memcontrol.c.
> Refactorings.
> 
> v5: Make the optimizing logic under CONFIG_MEMCG_SHRINKER instead of MEMCG && 
> !SLOB
> 
> v4: Do not use memcg mem_cgroup_idr for iteration over mem cgroups
> 
> v3: Many changes requested in commentaries to v2:
> 
> 1)rebase on prealloc_shrinker() code base
> 2)root_mem_cgroup is made out of memcg maps
> 3)rwsem replaced with shrinkers_nr_max_mutex
> 4)changes around assignment of shrinker id to list lru
> 5)everything renamed
> 
> v2: Many changes requested in commentaries to v1:
> 
> 1)the code mostly moved to mm/memcontrol.c;
> 2)using IDR instead of array of shrinkers;
> 3)added 
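
To put the call-count arithmetic from the cover letter above into runnable form, here is a small userspace model (plain C; the ownership layout where each memcg only has objects on its own container's 10 shrinkers is an assumption matching the example, not measured data). It counts do_shrink_slab() calls for the old iterate-everything scheme versus a per-memcg bitmap:

#include <stdio.h>

#define NR_MEMCGS    2000   /* 200 containers * 10 cgroups */
#define NR_SHRINKERS 2000   /* 200 containers * 10 mounts  */

/* populated[m][s] != 0 means shrinker s has charged objects in memcg m */
static unsigned char populated[NR_MEMCGS][NR_SHRINKERS];

static long old_scheme(void)
{
    long calls = 0;

    for (int m = 0; m < NR_MEMCGS; m++)
        for (int s = 0; s < NR_SHRINKERS; s++)
            calls++;                    /* do_shrink_slab() even for empty pairs */
    return calls;
}

static long bitmap_scheme(void)
{
    long calls = 0;

    for (int m = 0; m < NR_MEMCGS; m++)
        for (int s = 0; s < NR_SHRINKERS; s++)
            if (populated[m][s])        /* for_each_set_bit() analogue */
                calls++;                /* only charged shrinkers are visited */
    return calls;
}

int main(void)
{
    /* each container's 10 cgroups only touch that container's 10 mounts */
    for (int m = 0; m < NR_MEMCGS; m++)
        for (int s = 0; s < 10; s++)
            populated[m][(m / 10) * 10 + s] = 1;

    printf("old:    %ld do_shrink_slab() calls\n", old_scheme());
    printf("bitmap: %ld do_shrink_slab() calls\n", bitmap_scheme());
    return 0;
}
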

Re: [PATCH v6 05/17] mm: Assign memcg-aware shrinkers bitmap to memcg

2018-05-21 Thread Vladimir Davydov
On Mon, May 21, 2018 at 01:16:40PM +0300, Kirill Tkhai wrote:
> >> +static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> >> +   int size, int old_size)
> > 
> > Nit: No point in passing old_size here. You can instead use
> > memcg_shrinker_map_size directly.
> 
> This is made for the readability. All the actions with global variable
> is made in the same function -- memcg_expand_shrinker_maps(), all
> the actions with local variables are also in the same -- 
> memcg_expand_one_shrinker_map().
> Accessing memcg_shrinker_map_size in memcg_expand_one_shrinker_map()
> looks not intuitive and breaks modularity. 

I guess it depends on how you look at it. Anyway, it's nitpicking so I
won't mind if you leave it as is.

> >> +static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >> +{
> >> +  struct memcg_shrinker_map *map;
> >> +  int nid, size, ret = 0;
> >> +
> >> +  if (mem_cgroup_is_root(memcg))
> >> +  return 0;
> >> +
> >> +  mutex_lock(&memcg_shrinker_map_mutex);
> >> +  size = memcg_shrinker_map_size;
> >> +  for_each_node(nid) {
> >> +  map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
> >> +  if (!map) {
> > 
> >> +  memcg_free_shrinker_maps(memcg);
> > 
> > Nit: Please don't call this function under the mutex as it isn't
> > necessary. Set 'ret', break the loop, then check 'ret' after releasing
> > the mutex, and call memcg_free_shrinker_maps() if it's not 0.
> 
> No, it must be called under the mutex. See the race with 
> memcg_expand_one_shrinker_map().
> NULL maps are not expanded, and this is the indicator we use to differ memcg, 
> which is
> not completely online. If the allocations in memcg_alloc_shrinker_maps() fail 
> at nid == 1,
> then freeing of nid == 0 can race with expanding.

Ah, I see, you're right.

> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 3de12a9bdf85..f09ea20d7270 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -171,6 +171,7 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >>  
> >>  #ifdef CONFIG_MEMCG_KMEM
> >>  static DEFINE_IDR(shrinker_idr);
> > 
> >> +static int memcg_shrinker_nr_max;
> > 
> > Nit: Please rename it to shrinker_id_max and make it store max shrinker
> > id, not the max number shrinkers that have ever been allocated. This
> > will make it easier to understand IMO.
> >
> > Also, this variable doesn't belong to this patch as you don't really
> > need it to expand mem cgroup maps. Let's please move it to patch 3
> > (the one that introduces shrinker_idr).
> > 
> >>  
> >>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >>  {
> >> @@ -181,6 +182,15 @@ static int prealloc_memcg_shrinker(struct shrinker 
> >> *shrinker)
> >>ret = id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> >>if (ret < 0)
> >>goto unlock;
> > 
> >> +
> >> +  if (id >= memcg_shrinker_nr_max) {
> >> +  if (memcg_expand_shrinker_maps(id + 1)) {
> >> +  idr_remove(&shrinker_idr, id);
> >> +  goto unlock;
> >> +  }
> >> +  memcg_shrinker_nr_max = id + 1;
> >> +  }
> >> +
> > 
> > Then we'll have here:
> > 
> > if (memcg_expand_shrinker_maps(id)) {
> > idr_remove(&shrinker_idr, id);
> > goto unlock;
> > }
> > 
> > and from patch 3:
> > 
> > shrinker_id_max = MAX(shrinker_id_max, id);
> 
> So, shrinker_id_max contains "the max number shrinkers that have ever been 
> allocated" minus 1.
> The only difference to existing logic is "minus 1", which will be needed to 
> reflect in
> shrink_slab_memcg()->for_each_set_bit()...
> 
> To have "minus 1" instead of "not to have minus 1" looks a little subjective.

OK, leave 'nr' then, but please consider my other comments:

 - rename memcg_shrinker_nr_max to shrinker_nr_max so that the variable
   name is consistent with shrinker_idr

 - move shrinker_nr_max to patch 3 as you don't need it for expanding
   memcg shrinker maps

 - don't use shrinker_nr_max to check whether we need to expand memcg
   maps - simply call memcg_expand_shrinker_maps() and let it decide -
   this will neatly isolate all the logic related to memcg shrinker map
   allocation in memcontrol.c


Re: [PATCH v6 12/17] mm: Set bit in memcg shrinker bitmap on first list_lru item apearance

2018-05-21 Thread Vladimir Davydov
On Mon, May 21, 2018 at 12:31:34PM +0300, Kirill Tkhai wrote:
> On 20.05.2018 10:55, Vladimir Davydov wrote:
> > On Fri, May 18, 2018 at 11:43:42AM +0300, Kirill Tkhai wrote:
> >> Introduce set_shrinker_bit() function to set shrinker-related
> >> bit in memcg shrinker bitmap, and set the bit after the first
> >> item is added and in case of reparenting destroyed memcg's items.
> >>
> >> This will allow the next patch to make shrinkers be called only
> >> when they have charged objects at the moment, and
> >> to improve shrink_slab() performance.
> >>
> >> Signed-off-by: Kirill Tkhai 
> >> ---
> >>  include/linux/memcontrol.h |   14 ++
> >>  mm/list_lru.c  |   22 --
> >>  2 files changed, 34 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >> index e51c6e953d7a..7ae1b94becf3 100644
> >> --- a/include/linux/memcontrol.h
> >> +++ b/include/linux/memcontrol.h
> >> @@ -1275,6 +1275,18 @@ static inline int memcg_cache_id(struct mem_cgroup 
> >> *memcg)
> >>  
> >>  extern int memcg_expand_shrinker_maps(int new_id);
> >>  
> >> +static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
> >> +int nid, int shrinker_id)
> >> +{
> > 
> >> +  if (shrinker_id >= 0 && memcg && memcg != root_mem_cgroup) {
> > 
> > Nit: I'd remove these checks from this function and require the caller
> > to check that shrinker_id >= 0 and memcg != NULL or root_mem_cgroup.
> > See below how the call sites would look then.
> > 
> >> +  struct memcg_shrinker_map *map;
> >> +
> >> +  rcu_read_lock();
> >> +  map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
> >> +  set_bit(shrinker_id, map->map);
> >> +  rcu_read_unlock();
> >> +  }
> >> +}
> >>  #else
> >>  #define for_each_memcg_cache_index(_idx)  \
> >>for (; NULL; )
> >> @@ -1297,6 +1309,8 @@ static inline void memcg_put_cache_ids(void)
> >>  {
> >>  }
> >>  
> >> +static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
> >> +int nid, int shrinker_id) { }
> >>  #endif /* CONFIG_MEMCG_KMEM */
> >>  
> >>  #endif /* _LINUX_MEMCONTROL_H */
> >> diff --git a/mm/list_lru.c b/mm/list_lru.c
> >> index cab8fad7f7e2..7df71ab0de1c 100644
> >> --- a/mm/list_lru.c
> >> +++ b/mm/list_lru.c
> >> @@ -31,6 +31,11 @@ static void list_lru_unregister(struct list_lru *lru)
> >>mutex_unlock(&list_lrus_mutex);
> >>  }
> >>  
> >> +static int lru_shrinker_id(struct list_lru *lru)
> >> +{
> >> +  return lru->shrinker_id;
> >> +}
> >> +
> >>  static inline bool list_lru_memcg_aware(struct list_lru *lru)
> >>  {
> >>/*
> >> @@ -94,6 +99,11 @@ static void list_lru_unregister(struct list_lru *lru)
> >>  {
> >>  }
> >>  
> >> +static int lru_shrinker_id(struct list_lru *lru)
> >> +{
> >> +  return -1;
> >> +}
> >> +
> >>  static inline bool list_lru_memcg_aware(struct list_lru *lru)
> >>  {
> >>return false;
> >> @@ -119,13 +129,17 @@ bool list_lru_add(struct list_lru *lru, struct 
> >> list_head *item)
> >>  {
> >>int nid = page_to_nid(virt_to_page(item));
> >>struct list_lru_node *nlru = &lru->node[nid];
> >> +  struct mem_cgroup *memcg;
> >>struct list_lru_one *l;
> >>  
> >>spin_lock(&nlru->lock);
> >>if (list_empty(item)) {
> >> -  l = list_lru_from_kmem(nlru, item, NULL);
> >> +  l = list_lru_from_kmem(nlru, item, &memcg);
> >>list_add_tail(item, &l->list);
> >> -  l->nr_items++;
> >> +  /* Set shrinker bit if the first element was added */
> >> +  if (!l->nr_items++)
> >> +  memcg_set_shrinker_bit(memcg, nid,
> >> + lru_shrinker_id(lru));
> > 
> > This would turn into
> > 
> > if (!l->nr_items++ && memcg)
> > memcg_set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
> > 
> > Note, you don't need to check that lru_shrinker_id(lru) is >= 0 here as
>
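
A stripped-down userspace model of the 0 -> 1 transition being reviewed above (locking, the list_lru internals and the memcg lookup are all omitted; names are simplified): only the first item added to an empty list flips the shrinker's bit.

#include <stdio.h>

struct list_lru_one {
    long nr_items;
};

static unsigned long shrinker_bitmap;       /* one bit per shrinker id */

static void set_shrinker_bit(int shrinker_id)
{
    shrinker_bitmap |= 1UL << shrinker_id;
}

static void lru_add(struct list_lru_one *l, int shrinker_id)
{
    /* Set the bit only when the list goes from empty to non-empty. */
    if (!l->nr_items++)
        set_shrinker_bit(shrinker_id);
}

int main(void)
{
    struct list_lru_one l = { 0 };

    lru_add(&l, 3);
    lru_add(&l, 3);     /* the second add does not touch the bitmap */
    printf("bitmap=%#lx nr_items=%ld\n", shrinker_bitmap, l.nr_items);
    return 0;
}
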

Re: [PATCH v6 14/17] mm: Iterate only over charged shrinkers during memcg shrink_slab()

2018-05-21 Thread Vladimir Davydov
On Mon, May 21, 2018 at 12:17:07PM +0300, Kirill Tkhai wrote:
> >> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> >> +  struct mem_cgroup *memcg, int priority)
> >> +{
> >> +  struct memcg_shrinker_map *map;
> >> +  unsigned long freed = 0;
> >> +  int ret, i;
> >> +
> >> +  if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
> >> +  return 0;
> >> +
> >> +  if (!down_read_trylock(&shrinker_rwsem))
> >> +  return 0;
> >> +
> >> +  /*
> >> +   * 1) Caller passes only alive memcg, so map can't be NULL.
> >> +   * 2) shrinker_rwsem protects from maps expanding.
> >> +   */
> >> +  map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
> >> +  true);
> >> +  BUG_ON(!map);
> >> +
> >> +  for_each_set_bit(i, map->map, memcg_shrinker_nr_max) {
> >> +  struct shrink_control sc = {
> >> +  .gfp_mask = gfp_mask,
> >> +  .nid = nid,
> >> +  .memcg = memcg,
> >> +  };
> >> +  struct shrinker *shrinker;
> >> +
> >> +  shrinker = idr_find(&shrinker_idr, i);
> >> +  if (unlikely(!shrinker)) {
> > 
> > Nit: I don't think 'unlikely' is required here as this is definitely not
> > a hot path.
> 
> In case of big machines with many containers and overcommit, shrink_slab()
> in general is a very hot path. See the patchset description. There are
> configurations where only shrink_slab() is executing and occupies the cpu
> for 100% of the time; that is what this patchset is made for.
> 
> Here is the place where the shrinker can only be NULL because of a race
> with parallel registering, so I don't see anything wrong with giving the
> compiler some information about branch prediction.

OK. If you're confident this 'unlikely' is useful, let's leave it as is.
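
For readers following the 'unlikely' discussion: the kernel macro boils down to __builtin_expect(), which only influences branch layout/prediction. A minimal userspace illustration (the macro spelling matches the kernel's; the toy lookup around it is made up):

#include <stdio.h>
#include <stddef.h>

#define unlikely(x)  __builtin_expect(!!(x), 0)

/* Toy lookup: the NULL result is the cold path, as in the race described above. */
static const char *table[] = { "ext4", "xfs", "tmpfs" };

static const char *find(size_t i)
{
    const char *p = (i < 3) ? table[i] : NULL;

    if (unlikely(!p))   /* tell the compiler the miss is not the common case */
        return "miss";
    return p;
}

int main(void)
{
    printf("%s %s\n", find(1), find(9));
    return 0;
}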


Re: [PATCH v6 15/17] mm: Generalize shrink_slab() calls in shrink_node()

2018-05-20 Thread Vladimir Davydov
On Fri, May 18, 2018 at 11:44:11AM +0300, Kirill Tkhai wrote:
> From: Vladimir Davydov <vdavydov@gmail.com>
> 
> The patch makes shrink_slab() be called for root_mem_cgroup
> in the same way as it's called for the rest of cgroups.
> This simplifies the logic and improves the readability.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@gmail.com>
> ktkhai: Description written.
> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
> ---
>  mm/vmscan.c |   13 +++--
>  1 file changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2fbf3b476601..f1d23e2df988 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c

You forgot to patch the comment to shrink_slab(). Please take a closer
look at the diff I sent you:

@@ -486,10 +486,8 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
  * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
  * unaware shrinkers will receive a node id of 0 instead.
  *
- * @memcg specifies the memory cgroup to target. If it is not NULL,
- * only shrinkers with SHRINKER_MEMCG_AWARE set will be called to scan
- * objects from the memory cgroup specified. Otherwise, only unaware
- * shrinkers are called.
+ * @memcg specifies the memory cgroup to target. Unaware shrinkers
+ * are called only if it is the root cgroup.
  *
  * @priority is sc->priority, we take the number of objects and >> by priority
  * in order to get the scan target.

> @@ -661,9 +661,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   .memcg = memcg,
>   };

If you made !MEMCG version of mem_cgroup_is_root return true, as I
suggested in reply to patch 13, you could also simplify the memcg
related check in the beginning of shrink_slab() as in case of
CONFIG_MEMCG 'memcg' is now guaranteed to be != NULL in this function
while in case if !CONFIG_MEMCG mem_cgroup_is_root() would always
return true:

@@ -501,7 +501,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
struct shrinker *shrinker;
unsigned long freed = 0;
 
-   if (memcg && !mem_cgroup_is_root(memcg))
+   if (!mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
 
if (!down_read_trylock(&shrinker_rwsem))

>  
> - if (!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> - continue;
> -
>   if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>   sc.nid = 0;
>  
> @@ -693,6 +690,7 @@ void drop_slab_node(int nid)
>   struct mem_cgroup *memcg = NULL;
>  
>   freed = 0;
> + memcg = mem_cgroup_iter(NULL, NULL, NULL);
>   do {
>   freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>   } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
> @@ -2712,9 +2710,8 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
>   shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
>   node_lru_pages += lru_pages;
>  
> - if (memcg)
> - shrink_slab(sc->gfp_mask, pgdat->node_id,
> - memcg, sc->priority);
> + shrink_slab(sc->gfp_mask, pgdat->node_id,
> + memcg, sc->priority);
>  
>   /* Record the group's reclaim efficiency */
>   vmpressure(sc->gfp_mask, memcg, false,
> @@ -2738,10 +2735,6 @@ static bool shrink_node(pg_data_t *pgdat, struct 
> scan_control *sc)
>   }
>   } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
> - if (global_reclaim(sc))
> - shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
> - sc->priority);
> -
>   if (reclaim_state) {
>   sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>   reclaim_state->reclaimed_slab = 0;
> 


Re: [PATCH v6 14/17] mm: Iterate only over charged shrinkers during memcg shrink_slab()

2018-05-20 Thread Vladimir Davydov
On Fri, May 18, 2018 at 11:44:01AM +0300, Kirill Tkhai wrote:
> Using the preparations made in previous patches, in case of memcg
> shrink, we may avoid shrinkers, which are not set in memcg's shrinkers
> bitmap. To do that, we separate iterations over memcg-aware and
> !memcg-aware shrinkers, and memcg-aware shrinkers are chosen
> via for_each_set_bit() from the bitmap. In case of big nodes,
> having many isolated environments, this gives significant
> performance growth. See next patches for the details.
> 
> Note, that the patch does not respect to empty memcg shrinkers,
> since we never clear the bitmap bits after we set it once.
> Their shrinkers will be called again, with no shrinked objects
> as result. This functionality is provided by next patches.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  mm/vmscan.c |   87 
> +--
>  1 file changed, 78 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f09ea20d7270..2fbf3b476601 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -373,6 +373,20 @@ int prealloc_shrinker(struct shrinker *shrinker)
>   goto free_deferred;
>   }
>  
> + /*
> +  * There is a window between prealloc_shrinker()
> +  * and register_shrinker_prepared(). We don't want
> +  * to clear bit of a shrinker in such the state
> +  * in shrink_slab_memcg(), since this will impose
> +  * restrictions on a code registering a shrinker
> +  * (they would have to guarantee, their LRU lists
> +  * are empty till shrinker is completely registered).
> +  * So, we differ the situation, when 1)a shrinker
> +  * is semi-registered (id is assigned, but it has
> +  * not yet linked to shrinker_list) and 2)shrinker
> +  * is not registered (id is not assigned).
> +  */
> + INIT_LIST_HEAD(&shrinker->list);
>   return 0;
>  
>  free_deferred:
> @@ -544,6 +558,67 @@ static unsigned long do_shrink_slab(struct 
> shrink_control *shrinkctl,
>   return freed;
>  }
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> + struct mem_cgroup *memcg, int priority)
> +{
> + struct memcg_shrinker_map *map;
> + unsigned long freed = 0;
> + int ret, i;
> +
> + if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
> + return 0;
> +
> + if (!down_read_trylock(&shrinker_rwsem))
> + return 0;
> +
> + /*
> +  * 1) Caller passes only alive memcg, so map can't be NULL.
> +  * 2) shrinker_rwsem protects from maps expanding.
> +  */
> + map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
> + true);
> + BUG_ON(!map);
> +
> + for_each_set_bit(i, map->map, memcg_shrinker_nr_max) {
> + struct shrink_control sc = {
> + .gfp_mask = gfp_mask,
> + .nid = nid,
> + .memcg = memcg,
> + };
> + struct shrinker *shrinker;
> +
> + shrinker = idr_find(&shrinker_idr, i);
> + if (unlikely(!shrinker)) {

Nit: I don't think 'unlikely' is required here as this is definitely not
a hot path.
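
For illustration, without the hint the lookup would simply read (untested
sketch of the same hunk):

	shrinker = idr_find(&shrinker_idr, i);
	if (!shrinker) {
		clear_bit(i, map->map);
		continue;
	}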

> + clear_bit(i, map->map);
> + continue;
> + }
> + BUG_ON(!(shrinker->flags & SHRINKER_MEMCG_AWARE));
> +
> + /* See comment in prealloc_shrinker() */
> + if (unlikely(list_empty(&shrinker->list)))

Ditto.

> + continue;
> +
> + ret = do_shrink_slab(&sc, shrinker, priority);
> + freed += ret;
> +
> + if (rwsem_is_contended(&shrinker_rwsem)) {
> + freed = freed ? : 1;
> + break;
> + }
> + }
> +
> + up_read(&shrinker_rwsem);
> + return freed;
> +}
> +#else /* CONFIG_MEMCG_KMEM */
> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> + struct mem_cgroup *memcg, int priority)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_MEMCG_KMEM */
> +
>  /**
>   * shrink_slab - shrink slab caches
>   * @gfp_mask: allocation context
> @@ -573,8 +648,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   struct shrinker *shrinker;
>   unsigned long freed = 0;
>  
> - if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)))
> - return 0;
> + if (memcg && !mem_cgroup_is_root(memcg))
> + return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
>   if (!down_read_trylock(&shrinker_rwsem))
>   goto out;
> @@ -586,13 +661,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   .memcg = memcg,
>   };
>  
> - /*
> -  * If kernel memory accounting is disabled, we ignore
> -  * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> -  * passing NULL for 

Re: [PATCH v6 13/17] mm: Export mem_cgroup_is_root()

2018-05-20 Thread Vladimir Davydov
On Fri, May 18, 2018 at 11:43:53AM +0300, Kirill Tkhai wrote:
> This will be used in next patch.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  include/linux/memcontrol.h |   10 ++
>  mm/memcontrol.c|5 -
>  2 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7ae1b94becf3..cd44c1fac22b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -311,6 +311,11 @@ struct mem_cgroup {
>  
>  extern struct mem_cgroup *root_mem_cgroup;
>  
> +static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> +{
> + return (memcg == root_mem_cgroup);
> +}
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>   return !cgroup_subsys_enabled(memory_cgrp_subsys);
> @@ -780,6 +785,11 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  
>  struct mem_cgroup;
>  
> +static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> +{
> + return false;
> +}
> +

This stub must return true as one can think of !MEMCG as of the case
when there's the only cgroup - the root.
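
For illustration, the corrected !MEMCG stub would then be the following
minimal sketch:

static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
{
	/* With !CONFIG_MEMCG there is effectively only the root cgroup. */
	return true;
}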


Re: [PATCH v6 12/17] mm: Set bit in memcg shrinker bitmap on first list_lru item apearance

2018-05-20 Thread Vladimir Davydov
On Fri, May 18, 2018 at 11:43:42AM +0300, Kirill Tkhai wrote:
> Introduce set_shrinker_bit() function to set shrinker-related
> bit in memcg shrinker bitmap, and set the bit after the first
> item is added and in case of reparenting destroyed memcg's items.
> 
> This will allow next patch to make shrinkers be called only,
> in case of they have charged objects at the moment, and
> to improve shrink_slab() performance.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  include/linux/memcontrol.h |   14 ++
>  mm/list_lru.c  |   22 --
>  2 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e51c6e953d7a..7ae1b94becf3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1275,6 +1275,18 @@ static inline int memcg_cache_id(struct mem_cgroup 
> *memcg)
>  
>  extern int memcg_expand_shrinker_maps(int new_id);
>  
> +static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
> +   int nid, int shrinker_id)
> +{

> + if (shrinker_id >= 0 && memcg && memcg != root_mem_cgroup) {

Nit: I'd remove these checks from this function and require the caller
to check that shrinker_id >= 0 and memcg != NULL or root_mem_cgroup.
See below how the call sites would look then.

> + struct memcg_shrinker_map *map;
> +
> + rcu_read_lock();
> + map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
> + set_bit(shrinker_id, map->map);
> + rcu_read_unlock();
> + }
> +}
>  #else
>  #define for_each_memcg_cache_index(_idx) \
>   for (; NULL; )
> @@ -1297,6 +1309,8 @@ static inline void memcg_put_cache_ids(void)
>  {
>  }
>  
> +static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
> +   int nid, int shrinker_id) { }
>  #endif /* CONFIG_MEMCG_KMEM */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index cab8fad7f7e2..7df71ab0de1c 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -31,6 +31,11 @@ static void list_lru_unregister(struct list_lru *lru)
>   mutex_unlock(&list_lrus_mutex);
>  }
>  
> +static int lru_shrinker_id(struct list_lru *lru)
> +{
> + return lru->shrinker_id;
> +}
> +
>  static inline bool list_lru_memcg_aware(struct list_lru *lru)
>  {
>   /*
> @@ -94,6 +99,11 @@ static void list_lru_unregister(struct list_lru *lru)
>  {
>  }
>  
> +static int lru_shrinker_id(struct list_lru *lru)
> +{
> + return -1;
> +}
> +
>  static inline bool list_lru_memcg_aware(struct list_lru *lru)
>  {
>   return false;
> @@ -119,13 +129,17 @@ bool list_lru_add(struct list_lru *lru, struct 
> list_head *item)
>  {
>   int nid = page_to_nid(virt_to_page(item));
>   struct list_lru_node *nlru = &lru->node[nid];
> + struct mem_cgroup *memcg;
>   struct list_lru_one *l;
>  
>   spin_lock(&nlru->lock);
>   if (list_empty(item)) {
> - l = list_lru_from_kmem(nlru, item, NULL);
> + l = list_lru_from_kmem(nlru, item, &memcg);
>   list_add_tail(item, &l->list);
> - l->nr_items++;
> + /* Set shrinker bit if the first element was added */
> + if (!l->nr_items++)
> + memcg_set_shrinker_bit(memcg, nid,
> +lru_shrinker_id(lru));

This would turn into

if (!l->nr_items++ && memcg)
memcg_set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));

Note, you don't need to check that lru_shrinker_id(lru) is >= 0 here as
the fact that memcg != NULL guarantees that. Also, memcg can't be
root_mem_cgroup here as kmem objects allocated for the root cgroup go
unaccounted.

>   nlru->nr_items++;
>   spin_unlock(&nlru->lock);
>   return true;
> @@ -520,6 +534,7 @@ static void memcg_drain_list_lru_node(struct list_lru 
> *lru, int nid,
>   struct list_lru_node *nlru = &lru->node[nid];
>   int dst_idx = dst_memcg->kmemcg_id;
>   struct list_lru_one *src, *dst;
> + bool set;
>  
>   /*
>* Since list_lru_{add,del} may be called under an IRQ-safe lock,
> @@ -531,7 +546,10 @@ static void memcg_drain_list_lru_node(struct list_lru 
> *lru, int nid,
>   dst = list_lru_from_memcg_idx(nlru, dst_idx);
>  
>   list_splice_init(&src->list, &dst->list);
> + set = (!dst->nr_items && src->nr_items);
>   dst->nr_items += src->nr_items;
> + if (set)
> + memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));

This would turn into

if (set && dst_idx >= 0)
memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));

Again, the shrinker is guaranteed to be memcg aware in this function and
dst_memcg != NULL.

IMHO such a change would make the code a bit more straightforward.
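
For illustration, with the checks pushed to the callers the helper itself
would reduce to something like this (untested sketch):

static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
					  int nid, int shrinker_id)
{
	struct memcg_shrinker_map *map;

	rcu_read_lock();
	map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
	set_bit(shrinker_id, map->map);
	rcu_read_unlock();
}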

>   src->nr_items = 0;
>  
>   spin_unlock_irq(&nlru->lock);
> 


Re: [PATCH v6 05/17] mm: Assign memcg-aware shrinkers bitmap to memcg

2018-05-20 Thread Vladimir Davydov
On Fri, May 18, 2018 at 11:42:37AM +0300, Kirill Tkhai wrote:
> Imagine a big node with many cpus, memory cgroups and containers.
> Let we have 200 containers, every container has 10 mounts,
> and 10 cgroups. All container tasks don't touch foreign
> containers mounts. If there is intensive pages write,
> and global reclaim happens, a writing task has to iterate
> over all memcgs to shrink slab, before it's able to go
> to shrink_page_list().
> 
> Iteration over all the memcg slabs is very expensive:
> the task has to visit 200 * 10 = 2000 shrinkers
> for every memcg, and since there are 2000 memcgs,
> the total calls are 2000 * 2000 = 4000000.
> 
> So, the shrinker makes 4 million do_shrink_slab() calls
> just to try to isolate SWAP_CLUSTER_MAX pages in one
> of the actively writing memcg via shrink_page_list().
> I've observed a node spending almost 100% in kernel,
> making useless iteration over already shrinked slab.
> 
> This patch adds bitmap of memcg-aware shrinkers to memcg.
> The size of the bitmap depends on bitmap_nr_ids, and during
> memcg life it's maintained to be enough to fit bitmap_nr_ids
> shrinkers. Every bit in the map is related to corresponding
> shrinker id.
> 
> Next patches will maintain set bit only for really charged
> memcg. This will allow shrink_slab() to increase its
> performance in significant way. See the last patch for
> the numbers.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  include/linux/memcontrol.h |   14 +
>  mm/memcontrol.c|  120 
> 
>  mm/vmscan.c|   10 
>  3 files changed, 144 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 996469bc2b82..e51c6e953d7a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -112,6 +112,15 @@ struct lruvec_stat {
>   long count[NR_VM_NODE_STAT_ITEMS];
>  };
>  
> +/*
> + * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> + * which have elements charged to this memcg.
> + */
> +struct memcg_shrinker_map {
> + struct rcu_head rcu;
> + unsigned long map[0];
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -125,6 +134,9 @@ struct mem_cgroup_per_node {
>  
>   struct mem_cgroup_reclaim_iter  iter[DEF_PRIORITY + 1];
>  
> +#ifdef CONFIG_MEMCG_KMEM
> + struct memcg_shrinker_map __rcu *shrinker_map;
> +#endif
>   struct rb_node  tree_node;  /* RB tree node */
>   unsigned long   usage_in_excess;/* Set to the value by which */
>   /* the soft limit is exceeded*/
> @@ -1261,6 +1273,8 @@ static inline int memcg_cache_id(struct mem_cgroup 
> *memcg)
>   return memcg ? memcg->kmemcg_id : -1;
>  }
>  
> +extern int memcg_expand_shrinker_maps(int new_id);
> +
>  #else
>  #define for_each_memcg_cache_index(_idx) \
>   for (; NULL; )
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 023a1e9c900e..317a72137b95 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -320,6 +320,120 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
>  
>  struct workqueue_struct *memcg_kmem_cache_wq;
>  
> +static int memcg_shrinker_map_size;
> +static DEFINE_MUTEX(memcg_shrinker_map_mutex);
> +
> +static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> +{
> + kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> +}
> +
> +static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> +  int size, int old_size)

Nit: No point in passing old_size here. You can instead use
memcg_shrinker_map_size directly.

> +{
> + struct memcg_shrinker_map *new, *old;
> + int nid;
> +
> + lockdep_assert_held(&memcg_shrinker_map_mutex);
> +
> + for_each_node(nid) {
> + old = rcu_dereference_protected(
> + memcg->nodeinfo[nid]->shrinker_map, true);

Nit: Sometimes you use mem_cgroup_nodeinfo() helper, sometimes you
access mem_cgroup->nodeinfo directly. Please, be consistent.
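
Taking both nits together, the function would read roughly as follows
(untested sketch: the old size is read from memcg_shrinker_map_size and
nodeinfo is always accessed through the helper):

static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg, int size)
{
	struct memcg_shrinker_map *new, *old;
	int old_size = memcg_shrinker_map_size;
	int nid;

	lockdep_assert_held(&memcg_shrinker_map_mutex);

	for_each_node(nid) {
		old = rcu_dereference_protected(
			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
		/* Not yet online memcg */
		if (!old)
			return 0;

		new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
		if (!new)
			return -ENOMEM;

		/* Set all old bits, clear all new bits */
		memset(new->map, (int)0xff, old_size);
		memset((void *)new->map + old_size, 0, size - old_size);

		rcu_assign_pointer(mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, new);
		if (old)
			call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
	}

	return 0;
}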

> + /* Not yet online memcg */
> + if (!old)
> + return 0;
> +
> + new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
> + if (!new)
> + return -ENOMEM;
> +
> + /* Set all old bits, clear all new bits */
> + memset(new->map, (int)0xff, old_size);
> + memset((void *)new->map + old_size, 0, size - old_size);
> +
> + rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> + if (old)
> + call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
> + }
> +
> + return 0;
> +}
> +
> +static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
> +{
> + struct mem_cgroup_per_node *pn;
> + struct memcg_shrinker_map *map;
> + int nid;
> +
> + if (mem_cgroup_is_root(memcg))
> + return;
> +
> + 

Re: [PATCH v6 03/17] mm: Assign id to every memcg-aware shrinker

2018-05-20 Thread Vladimir Davydov
Hello Kirill,

Generally, all patches in the series look OK to me, but I'm going to do
some nitpicking before I ack them. See below.

On Fri, May 18, 2018 at 11:42:08AM +0300, Kirill Tkhai wrote:
> The patch introduces shrinker::id number, which is used to enumerate
> memcg-aware shrinkers. The number start from 0, and the code tries
> to maintain it as small as possible.
> 
> This will be used as to represent a memcg-aware shrinkers in memcg
> shrinkers map.
> 
> Since all memcg-aware shrinkers are based on list_lru, which is per-memcg
> in case of !CONFIG_MEMCG_KMEM only, the new functionality will be under
> this config option.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  include/linux/shrinker.h |4 +++
>  mm/vmscan.c  |   60 
> ++
>  2 files changed, 64 insertions(+)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 6794490f25b2..7ca9c18cf130 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -66,6 +66,10 @@ struct shrinker {
>  
>   /* These are for internal use */
>   struct list_head list;
> +#ifdef CONFIG_MEMCG_KMEM
> + /* ID in shrinker_idr */
> + int id;
> +#endif
>   /* objs pending delete, per node */
>   atomic_long_t *nr_deferred;
>  };
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 50055d72f294..3de12a9bdf85 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -169,6 +169,48 @@ unsigned long vm_total_pages;
>  static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +static DEFINE_IDR(shrinker_idr);
> +
> +static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> +{
> + int id, ret;
> +
> + shrinker->id = -1;
> + down_write(&shrinker_rwsem);
> + ret = id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> + if (ret < 0)
> + goto unlock;
> + shrinker->id = id;
> + ret = 0;
> +unlock:
> + up_write(&shrinker_rwsem);
> + return ret;
> +}
> +
> +static void unregister_memcg_shrinker(struct shrinker *shrinker)
> +{
> + int id = shrinker->id;
> +
> + if (id < 0)
> + return;

Nit: Please replace this with BUG_ON(id < 0) - this function can only
be called for a memcg-aware shrinker that has been fully initialized
(prealloc_shrinker() sets nr_deferred after id; unregister_shrinker()
returns immediately if nr_deferred is NULL).

> +
> + down_write(&shrinker_rwsem);
> + idr_remove(&shrinker_idr, id);
> + up_write(&shrinker_rwsem);

> + shrinker->id = -1;

Nit: I'd move this under shrinker_rwsem as you set it with the rwsem
taken.
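
Both nits taken together, the helper would look roughly like this (untested
sketch):

static void unregister_memcg_shrinker(struct shrinker *shrinker)
{
	int id = shrinker->id;

	BUG_ON(id < 0);

	down_write(&shrinker_rwsem);
	idr_remove(&shrinker_idr, id);
	shrinker->id = -1;
	up_write(&shrinker_rwsem);
}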

> +}
> +#else /* CONFIG_MEMCG_KMEM */
> +static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> +{
> + return 0;
> +}
> +
> +static void unregister_memcg_shrinker(struct shrinker *shrinker)
> +{
> +}
> +#endif /* CONFIG_MEMCG_KMEM */
> +
>  #ifdef CONFIG_MEMCG
>  static bool global_reclaim(struct scan_control *sc)
>  {
> @@ -306,6 +348,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum 
> lru_list lru, int zone
>  int prealloc_shrinker(struct shrinker *shrinker)
>  {
>   size_t size = sizeof(*shrinker->nr_deferred);
> + int ret;
>  
>   if (shrinker->flags & SHRINKER_NUMA_AWARE)
>   size *= nr_node_ids;
> @@ -313,11 +356,26 @@ int prealloc_shrinker(struct shrinker *shrinker)
>   shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
>   if (!shrinker->nr_deferred)
>   return -ENOMEM;
> +
> + if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> + ret = prealloc_memcg_shrinker(shrinker);
> + if (ret)
> + goto free_deferred;

Nit: 'ret' is not really needed here.
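
For illustration, without the local variable the block would simply be
(sketch):

	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
		if (prealloc_memcg_shrinker(shrinker))
			goto free_deferred;
	}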

> + }
> +
>   return 0;
> +
> +free_deferred:
> + kfree(shrinker->nr_deferred);
> + shrinker->nr_deferred = NULL;
> + return -ENOMEM;
>  }
>  
>  void free_prealloced_shrinker(struct shrinker *shrinker)
>  {
> + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> + unregister_memcg_shrinker(shrinker);
> +
>   kfree(shrinker->nr_deferred);
>   shrinker->nr_deferred = NULL;
>  }
> @@ -347,6 +405,8 @@ void unregister_shrinker(struct shrinker *shrinker)
>  {
>   if (!shrinker->nr_deferred)
>   return;
> + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> + unregister_memcg_shrinker(shrinker);
>   down_write(&shrinker_rwsem);
>   list_del(&shrinker->list);
>   up_write(&shrinker_rwsem);
> 


Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

2018-05-17 Thread Vladimir Davydov
On Thu, May 17, 2018 at 02:49:26PM +0300, Kirill Tkhai wrote:
> On 17.05.2018 07:16, Vladimir Davydov wrote:
> > On Tue, May 15, 2018 at 05:49:59PM +0300, Kirill Tkhai wrote:
> >>>> @@ -589,13 +647,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, 
> >>>> int nid,
> >>>>  .memcg = memcg,
> >>>>  };
> >>>>  
> >>>> -/*
> >>>> - * If kernel memory accounting is disabled, we ignore
> >>>> - * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> >>>> - * passing NULL for memcg.
> >>>> - */
> >>>> -if (memcg_kmem_enabled() &&
> >>>> -!!memcg != !!(shrinker->flags & 
> >>>> SHRINKER_MEMCG_AWARE))
> >>>> +if (!!memcg != !!(shrinker->flags & 
> >>>> SHRINKER_MEMCG_AWARE))
> >>>>  continue;
> >>>
> >>> I want this check gone. It's easy to achieve, actually - just remove the
> >>> following lines from shrink_node()
> >>>
> >>>   if (global_reclaim(sc))
> >>>   shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
> >>>   sc->priority);
> >>
> >> This check is not related to the patchset.
> > 
> > Yes, it is. This patch modifies shrink_slab which is used only by
> > shrink_node. Simplifying shrink_node along the way looks right to me.
> 
> shrink_slab() is used not only in this place.

drop_slab_node() doesn't really count as it is an extract from shrink_node()

> It does not seem a trivial change for me.
> 
> >> Let's don't mix everything in the single series of patches, because
> >> after your last remarks it will grow at least up to 15 patches.
> > 
> > Most of which are trivial so I don't see any problem here.
> > 
> >> This patchset can't be responsible for everything.
> > 
> > I don't understand why you balk at simplifying the code a bit while you
> > are patching related functions anyway.
> 
> Because this function is used in several places, and we have some particulars
> on root_mem_cgroup initialization, and this function called from these places
> with different states of root_mem_cgroup. It does not seem trivial fix for me.

Let me do it for you then:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b697323a88c..e778569538de 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -486,10 +486,8 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
  * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
  * unaware shrinkers will receive a node id of 0 instead.
  *
- * @memcg specifies the memory cgroup to target. If it is not NULL,
- * only shrinkers with SHRINKER_MEMCG_AWARE set will be called to scan
- * objects from the memory cgroup specified. Otherwise, only unaware
- * shrinkers are called.
+ * @memcg specifies the memory cgroup to target. Unaware shrinkers
+ * are called only if it is the root cgroup.
  *
  * @priority is sc->priority, we take the number of objects and >> by priority
  * in order to get the scan target.
@@ -554,6 +552,7 @@ void drop_slab_node(int nid)
struct mem_cgroup *memcg = NULL;
 
freed = 0;
+   memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
@@ -2557,9 +2556,8 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
node_lru_pages += lru_pages;
 
-   if (memcg)
-   shrink_slab(sc->gfp_mask, pgdat->node_id,
-   memcg, sc->priority);
+   shrink_slab(sc->gfp_mask, pgdat->node_id,
+   memcg, sc->priority);
 
/* Record the group's reclaim efficiency */
vmpressure(sc->gfp_mask, memcg, false,
@@ -2583,10 +2581,6 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
}
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
-   if (global_reclaim(sc))
-   shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
-   sc->priority);
-
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;


Seems simple enough to fold it into this patch, doesn't it?
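
For reference, with the diff folded in, drop_slab_node() would end up looking
roughly like this (sketch, assuming the surrounding "freed > 10" retry loop
stays as it is today):

void drop_slab_node(int nid)
{
	unsigned long freed;

	do {
		struct mem_cgroup *memcg = NULL;

		freed = 0;
		memcg = mem_cgroup_iter(NULL, NULL, NULL);
		do {
			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
	} while (freed > 10);
}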


Re: [PATCH v5 13/13] mm: Clear shrinker bit if there are no objects related to memcg

2018-05-16 Thread Vladimir Davydov
On Tue, May 15, 2018 at 11:55:04AM +0300, Kirill Tkhai wrote:
> >> @@ -586,8 +586,23 @@ static unsigned long shrink_slab_memcg(gfp_t 
> >> gfp_mask, int nid,
> >>continue;
> >>  
> >>    ret = do_shrink_slab(&sc, shrinker, priority);
> >> -  if (ret == SHRINK_EMPTY)
> >> -  ret = 0;
> >> +  if (ret == SHRINK_EMPTY) {
> >> +  clear_bit(i, map->map);
> >> +  /*
> >> +   * Pairs with mb in memcg_set_shrinker_bit():
> >> +   *
> >> +   * list_lru_add()       shrink_slab_memcg()
> >> +   *   list_add_tail()      clear_bit()
> >> +   *   <MB>                 <MB>
> >> +   *   set_bit()            do_shrink_slab()
> >> +   */
> > 
> > Please improve the comment so that it isn't just a diagram.
> 
> Please, say, which comment you want to see here.

I want the reader to understand why we need to invoke the shrinker twice
if it returns SHRINK_EMPTY. The diagram doesn't really help here IMO. So
I'd write something like this:

ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY) {
clear_bit(i, map->map);
/*
 * After the shrinker reported that it had no objects to free,
 * but before we cleared the corresponding bit in the memcg
 * shrinker map, a new object might have been added. To make
 * sure, we have the bit set in this case, we invoke the
 * shrinker one more time and re-set the bit if it reports that
 * it is not empty anymore. The memory barrier here pairs with
 * the barrier in memcg_set_shrinker_bit():
 *
 * list_lru_add()       shrink_slab_memcg()
 *   list_add_tail()      clear_bit()
 *   <MB>                 <MB>
 *   set_bit()            do_shrink_slab()
 */
smp_mb__after_atomic();
ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY)
ret = 0;
else
memcg_set_shrinker_bit(memcg, nid, i);
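
The pairing side would then carry the matching barrier, e.g. a sketch of
memcg_set_shrinker_bit(), assuming smp_mb__before_atomic() is used there:

static void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid,
				   int shrinker_id)
{
	struct memcg_shrinker_map *map;

	rcu_read_lock();
	map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
	/* Pairs with smp_mb__after_atomic() in shrink_slab_memcg() */
	smp_mb__before_atomic();
	set_bit(shrinker_id, map->map);
	rcu_read_unlock();
}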


Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

2018-05-16 Thread Vladimir Davydov
On Tue, May 15, 2018 at 01:12:20PM +0300, Kirill Tkhai wrote:
> >> +#define root_mem_cgroup NULL
> > 
> > Let's instead export mem_cgroup_is_root(). In case if MEMCG is disabled
> > it will always return false.
> 
> export == move to header file

That and adding a stub function in case !MEMCG.

> >> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> >> +  struct mem_cgroup *memcg, int priority)
> >> +{
> >> +  struct memcg_shrinker_map *map;
> >> +  unsigned long freed = 0;
> >> +  int ret, i;
> >> +
> >> +  if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
> >> +  return 0;
> >> +
> >> +  if (!down_read_trylock(&shrinker_rwsem))
> >> +  return 0;
> >> +
> >> +  /*
> >> +   * 1)Caller passes only alive memcg, so map can't be NULL.
> >> +   * 2)shrinker_rwsem protects from maps expanding.
> > 
> > ^^
> > Nit: space missing here :-)
> 
> I don't understand what you mean here. Please, clarify...

This is just a trivial remark regarding comment formatting. They usually
put a space between the number and the first word in the sentence, i.e.
between '1)' and 'Caller' in your case.
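
I.e. something like:

	/*
	 * 1) Caller passes only alive memcg, so map can't be NULL.
	 * 2) shrinker_rwsem protects from maps expanding.
	 */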

> 
> >> +   */
> >> +  map = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), true);
> >> +  BUG_ON(!map);
> >> +
> >> +  for_each_set_bit(i, map->map, memcg_shrinker_nr_max) {
> >> +  struct shrink_control sc = {
> >> +  .gfp_mask = gfp_mask,
> >> +  .nid = nid,
> >> +  .memcg = memcg,
> >> +  };
> >> +  struct shrinker *shrinker;
> >> +
> >> +  shrinker = idr_find(&shrinker_idr, i);
> >> +  if (!shrinker) {
> >> +  clear_bit(i, map->map);
> >> +  continue;
> >> +  }
> >> +  if (list_empty(&shrinker->list))
> >> +  continue;
> > 
> > I don't like using shrinker->list as an indicator that the shrinker has
> > been initialized. IMO if you do need such a check, you should split
> > shrinker_idr registration in two steps - allocate a slot in 'prealloc'
> > and set the pointer in 'register'. However, can we really encounter an
> > unregistered shrinker here? AFAIU a bit can be set in the shrinker map
> > only after the corresponding shrinker has been initialized, no?
> 
> 1)No, it's not so. Here is a race:
> cpu#0cpu#1   cpu#2
> prealloc_shrinker()
>  prealloc_shrinker()
>memcg_expand_shrinker_maps()
>  memcg_expand_one_shrinker_map()
>      memset(&new->map, 0xff);
>  
> do_shrink_slab() (on uninitialized LRUs)
> init LRUs
> register_shrinker_prepared()
> 
> So, the check is needed.

OK, I see.

> 
> 2)Assigning NULL pointer can't be used here, since NULL pointer is already 
> used
> to clear unregistered shrinkers from the map. See the check right after 
> idr_find().

But it won't break anything if we clear bit for prealloc-ed, but not yet
registered shrinkers, will it?

> 
> list_empty() is used since it's the already existing indicator, which does not
> require additional member in struct shrinker.

It just looks rather counter-intuitive to me to use shrinker->list to
differentiate between registered and unregistered shrinkers. Maybe I'm
wrong. If you are sure that this is OK, I'm fine with it, but then
please add a comment here explaining what this check is needed for.

Thanks.
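
For what it's worth, the two-step registration mentioned above could look
roughly like this (untested sketch, assuming idr_alloc() may reserve a slot
with a NULL placeholder; otherwise a distinct sentinel pointer would be
needed):

static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
	int id;

	down_write(&shrinker_rwsem);
	/* Reserve the id, but don't expose the shrinker yet. */
	id = idr_alloc(&shrinker_idr, NULL, 0, 0, GFP_KERNEL);
	if (id >= 0)
		shrinker->id = id;
	up_write(&shrinker_rwsem);
	return id < 0 ? id : 0;
}

static void register_memcg_shrinker(struct shrinker *shrinker)
{
	down_write(&shrinker_rwsem);
	/* Publish the pointer once the shrinker is fully set up. */
	idr_replace(&shrinker_idr, shrinker, shrinker->id);
	up_write(&shrinker_rwsem);
}

shrink_slab_memcg() could then skip entries whose slot is still unset instead
of testing shrinker->list.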


Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

2018-05-16 Thread Vladimir Davydov
On Tue, May 15, 2018 at 05:49:59PM +0300, Kirill Tkhai wrote:
> >> @@ -589,13 +647,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int 
> >> nid,
> >>.memcg = memcg,
> >>};
> >>  
> >> -  /*
> >> -   * If kernel memory accounting is disabled, we ignore
> >> -   * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> >> -   * passing NULL for memcg.
> >> -   */
> >> -  if (memcg_kmem_enabled() &&
> >> -  !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> >> +  if (!!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> >>continue;
> > 
> > I want this check gone. It's easy to achieve, actually - just remove the
> > following lines from shrink_node()
> > 
> > if (global_reclaim(sc))
> > shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
> > sc->priority);
> 
> This check is not related to the patchset.

Yes, it is. This patch modifies shrink_slab which is used only by
shrink_node. Simplifying shrink_node along the way looks right to me.

> Let's don't mix everything in the single series of patches, because
> after your last remarks it will grow at least up to 15 patches.

Most of which are trivial so I don't see any problem here.

> This patchset can't be responsible for everything.

I don't understand why you balk at simplifying the code a bit while you
are patching related functions anyway.

> 
> >>  
> >>if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> >>
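
For reference, a rough sketch of what the call site in shrink_node() could look like once the global_reclaim() branch is dropped and shrink_slab() always receives the memcg from the iterator; this is an illustration of the idea, not the actual diff:

static unsigned long shrink_node_slabs_sketch(pg_data_t *pgdat,
					      struct scan_control *sc)
{
	struct mem_cgroup *root = sc->target_mem_cgroup;
	struct mem_cgroup *memcg;
	unsigned long freed = 0;

	memcg = mem_cgroup_iter(root, NULL, NULL);
	do {
		/*
		 * For global reclaim the iterator starts at root_mem_cgroup,
		 * so the separate "if (global_reclaim(sc)) shrink_slab(...,
		 * NULL, ...)" call site is no longer needed, and neither is
		 * the SHRINKER_MEMCG_AWARE check inside shrink_slab().
		 */
		freed += shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
				     sc->priority);
	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)) != NULL);

	return freed;
}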


Re: [PATCH v5 13/13] mm: Clear shrinker bit if there are no objects related to memcg

2018-05-14 Thread Vladimir Davydov
On Thu, May 10, 2018 at 12:54:15PM +0300, Kirill Tkhai wrote:
> To avoid further unneeded calls of do_shrink_slab()
> for shrinkers, which already do not have any charged
> objects in a memcg, their bits have to be cleared.
> 
> This patch introduces a lockless mechanism to do that
> without races with parallel list_lru add. After
> do_shrink_slab() returns SHRINK_EMPTY the first time,
> we clear the bit and call it once again. Then we restore
> the bit, if the new return value is different.
> 
> Note that the single smp_mb__after_atomic() in shrink_slab_memcg()
> covers two situations:
> 
> 1)list_lru_add()          shrink_slab_memcg
>     list_add_tail()          for_each_set_bit() <--- read bit
>                                do_shrink_slab() <--- missed list update (no barrier)
>     <MB>                     <MB>
>     set_bit()                do_shrink_slab() <--- seen list update
> 
> This situation, when the first do_shrink_slab() sees the set bit,
> but does not see the list update (i.e., a race with the first element
> queueing), is rare. So we don't add <MB> before the first call
> of do_shrink_slab(), so as not to slow down the generic
> case. Also, the second call is needed, as shown below in (2).
> 
> 2)list_lru_add()          shrink_slab_memcg()
>     list_add_tail()          ...
>     set_bit()                ...
>   ...                        for_each_set_bit()
>   do_shrink_slab()             do_shrink_slab()
>     clear_bit()                ...
>   ...                          ...
>   list_lru_add()               ...
>     list_add_tail()            clear_bit()
>                                <MB>
>     set_bit()                  do_shrink_slab()
> 
> The barriers guarantee that the second do_shrink_slab()
> in the right-side task sees the list update if it really
> cleared the bit. This case is drawn in the code comment.
> 
> [Results/performance of the patchset]
> 
> After the whole patchset is applied, the test below shows a significant
> increase in performance:
> 
> $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
> $mkdir /sys/fs/cgroup/memory/ct
> $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
> $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > 
> /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i 
> s/$i; touch s/$i/file; done
> 
> Then, 5 sequential calls of drop caches:
> $time echo 3 > /proc/sys/vm/drop_caches
> 
> 1)Before:
> 0.00user 13.78system 0:13.78elapsed 99%CPU
> 0.00user 5.59system 0:05.60elapsed 99%CPU
> 0.00user 5.48system 0:05.48elapsed 99%CPU
> 0.00user 8.35system 0:08.35elapsed 99%CPU
> 0.00user 8.34system 0:08.35elapsed 99%CPU
> 
> 2)After
> 0.00user 1.10system 0:01.10elapsed 99%CPU
> 0.00user 0.00system 0:00.01elapsed 64%CPU
> 0.00user 0.01system 0:00.01elapsed 82%CPU
> 0.00user 0.00system 0:00.01elapsed 64%CPU
> 0.00user 0.01system 0:00.01elapsed 82%CPU
> 
> The results show that performance increases by at least 548 times.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  include/linux/memcontrol.h |2 ++
>  mm/vmscan.c|   19 +--
>  2 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 436691a66500..82c0bf2d0579 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1283,6 +1283,8 @@ static inline void memcg_set_shrinker_bit(struct 
> mem_cgroup *memcg, int nid, int
>  
>   rcu_read_lock();
>   map = MEMCG_SHRINKER_MAP(memcg, nid);
> + /* Pairs with smp mb in shrink_slab() */
> + smp_mb__before_atomic();
>   set_bit(nr, map->map);
>   rcu_read_unlock();
>   }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b0075612d73..189b163bef4a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -586,8 +586,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>   continue;
>  
>   ret = do_shrink_slab(&sc, shrinker, priority);
> - if (ret == SHRINK_EMPTY)
> - ret = 0;
> + if (ret == SHRINK_EMPTY) {
> + clear_bit(i, map->map);
> + /*
> +  * Pairs with mb in memcg_set_shrinker_bit():
> +  *
> +  * list_lru_add()     shrink_slab_memcg()
> +  *   list_add_tail()    clear_bit()
> +  *   <MB>               <MB>
> +  *   set_bit()          do_shrink_slab()
> +  */

Please improve the comment so that it isn't just a diagram.
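
For example, one possible wording (just a suggestion for the kind of prose that could accompany the diagram):

	/*
	 * After the shrinker reported SHRINK_EMPTY, but before we cleared
	 * the bit in the memcg shrinker map, a new object might have been
	 * added to the corresponding list_lru.  To make sure the bit stays
	 * set in that case, call the shrinker once again with the bit
	 * already cleared and restore the bit if it finds something.
	 * The barrier pairs with the one in memcg_set_shrinker_bit().
	 */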

> + smp_mb__after_atomic();
> + ret = do_shrink_slab(&sc, shrinker, priority);
> + if (ret == SHRINK_EMPTY)
> + ret = 0;
> + else
> + memcg_set_shrinker_bit(memcg, nid, i);
> + }
>   freed += ret;
>  
>   if (rwsem_is_contended(_rwsem)) {
>   if (rwsem_is_contended(&shrinker_rwsem)) {



Re: [PATCH v5 11/13] mm: Iterate only over charged shrinkers during memcg shrink_slab()

2018-05-14 Thread Vladimir Davydov
On Thu, May 10, 2018 at 12:53:55PM +0300, Kirill Tkhai wrote:
> Using the preparations made in previous patches, in case of memcg
> shrink, we may avoid shrinkers, which are not set in memcg's shrinkers
> bitmap. To do that, we separate iterations over memcg-aware and
> !memcg-aware shrinkers, and memcg-aware shrinkers are chosen
> via for_each_set_bit() from the bitmap. In case of big nodes,
> having many isolated environments, this gives significant
> performance growth. See next patches for the details.
> 
> Note, that the patch does not respect to empty memcg shrinkers,
> since we never clear the bitmap bits after we set it once.
> Their shrinkers will be called again, with no shrinked objects
> as result. This functionality is provided by next patches.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  include/linux/memcontrol.h |1 +
>  mm/vmscan.c|   70 
> ++--
>  2 files changed, 62 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 82f892e77637..436691a66500 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -760,6 +760,7 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  #define MEM_CGROUP_ID_MAX0
>  
>  struct mem_cgroup;
> +#define root_mem_cgroup NULL

Let's instead export mem_cgroup_is_root(). If MEMCG is disabled,
it will always return false.
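
I.e. roughly the following in memcontrol.h (a sketch; the MEMCG-enabled variant mirrors the helper that already exists in memcontrol.c, and it assumes root_mem_cgroup is visible in the header):

#ifdef CONFIG_MEMCG
static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
{
	return (memcg == root_mem_cgroup);
}
#else
static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
{
	return false;
}
#endif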

>  
>  static inline bool mem_cgroup_disabled(void)
>  {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d8a2870710e0..a2e38e05adb5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -376,6 +376,7 @@ int prealloc_shrinker(struct shrinker *shrinker)
>   goto free_deferred;
>   }
>  
> + INIT_LIST_HEAD(&shrinker->list);

IMO this shouldn't be here, see my comment below.

>   return 0;
>  
>  free_deferred:
> @@ -547,6 +548,63 @@ static unsigned long do_shrink_slab(struct 
> shrink_control *shrinkctl,
>   return freed;
>  }
>  
> +#ifdef CONFIG_MEMCG_SHRINKER
> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> + struct mem_cgroup *memcg, int priority)
> +{
> + struct memcg_shrinker_map *map;
> + unsigned long freed = 0;
> + int ret, i;
> +
> + if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))
> + return 0;
> +
> + if (!down_read_trylock(_rwsem))
> + return 0;
> +
> + /*
> +  * 1)Caller passes only alive memcg, so map can't be NULL.
> +  * 2)shrinker_rwsem protects from maps expanding.

^^
Nit: space missing here :-)

> +  */
> + map = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), true);
> + BUG_ON(!map);
> +
> + for_each_set_bit(i, map->map, memcg_shrinker_nr_max) {
> + struct shrink_control sc = {
> + .gfp_mask = gfp_mask,
> + .nid = nid,
> + .memcg = memcg,
> + };
> + struct shrinker *shrinker;
> +
> + shrinker = idr_find(&shrinker_idr, i);
> + if (!shrinker) {
> + clear_bit(i, map->map);
> + continue;
> + }

The shrinker must be memcg aware so please add

  BUG_ON((shrinker->flags & SHRINKER_MEMCG_AWARE) == 0);

> + if (list_empty(&shrinker->list))
> + continue;

I don't like using shrinker->list as an indicator that the shrinker has
been initialized. IMO if you do need such a check, you should split
shrinker_idr registration in two steps - allocate a slot in 'prealloc'
and set the pointer in 'register'. However, can we really encounter an
unregistered shrinker here? AFAIU a bit can be set in the shrinker map
only after the corresponding shrinker has been initialized, no?

> +
> + ret = do_shrink_slab(, shrinker, priority);
> + freed += ret;
> +
> + if (rwsem_is_contended(_rwsem)) {
> + freed = freed ? : 1;
> + break;
> + }
> + }
> +
> + up_read(_rwsem);
> + return freed;
> +}
> +#else /* CONFIG_MEMCG_SHRINKER */
> +static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> + struct mem_cgroup *memcg, int priority)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_MEMCG_SHRINKER */
> +
>  /**
>   * shrink_slab - shrink slab caches
>   * @gfp_mask: allocation context
> @@ -576,8 +634,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   struct shrinker *shrinker;
>   unsigned long freed = 0;
>  
> - if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)))
> - return 0;
> + if (memcg && memcg != root_mem_cgroup)

if (!mem_cgroup_is_root(memcg))

> + return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
>   if (!down_read_trylock(_rwsem))
>   goto out;
> @@ -589,13 +647,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,


Re: [PATCH v5 10/13] mm: Set bit in memcg shrinker bitmap on first list_lru item apearance

2018-05-14 Thread Vladimir Davydov
On Thu, May 10, 2018 at 12:53:45PM +0300, Kirill Tkhai wrote:
> Introduce set_shrinker_bit() function to set shrinker-related
> bit in memcg shrinker bitmap, and set the bit after the first
> item is added and in case of reparenting destroyed memcg's items.
> 
> This will allow next patch to make shrinkers be called only,
> in case of they have charged objects at the moment, and
> to improve shrink_slab() performance.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  include/linux/memcontrol.h |   15 +++
>  mm/list_lru.c  |   22 --
>  2 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e5e7e0fc7158..82f892e77637 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1274,6 +1274,21 @@ static inline void memcg_put_cache_ids(void)
>  
>  extern int memcg_shrinker_nr_max;
>  extern int memcg_expand_shrinker_maps(int old_id, int id);
> +
> +static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, 
> int nr)

Nit: too long line (> 80 characters)
Nit: let's rename 'nr' to 'shrinker_id'

> +{
> + if (nr >= 0 && memcg && memcg != root_mem_cgroup) {
> + struct memcg_shrinker_map *map;
> +
> + rcu_read_lock();
> + map = MEMCG_SHRINKER_MAP(memcg, nid);

Missing rcu_dereference.

> + set_bit(nr, map->map);
> + rcu_read_unlock();
> + }
> +}
> +#else
> +static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
> +   int node, int id) { }

Nit: please keep the signature (including argument names) the same as in
MEMCG-enabled definition, namely 'node' => 'nid', 'id' => 'shrinker_id'.

Thanks.
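
Put together, the nits above amount to something like this (a sketch, not the final patch; the empty stub keeps the same signature for the !MEMCG case):

#ifdef CONFIG_MEMCG_SHRINKER
static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
					  int nid, int shrinker_id)
{
	if (shrinker_id >= 0 && memcg && memcg != root_mem_cgroup) {
		struct memcg_shrinker_map *map;

		rcu_read_lock();
		map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
		set_bit(shrinker_id, map->map);
		rcu_read_unlock();
	}
}
#else
static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
					  int nid, int shrinker_id)
{
}
#endif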


Re: [PATCH v5 03/13] mm: Assign memcg-aware shrinkers bitmap to memcg

2018-05-14 Thread Vladimir Davydov
On Mon, May 14, 2018 at 12:34:45PM +0300, Kirill Tkhai wrote:
> >> +static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
> >> +{
> >> +  struct mem_cgroup_per_node *pn;
> >> +  struct memcg_shrinker_map *map;
> >> +  int nid;
> >> +
> >> +  if (memcg == root_mem_cgroup)
> >> +  return;
> >> +
> >> +  mutex_lock(_nr_max_mutex);
> > 
> > Why do you need to take the mutex here? You don't access shrinker map
> > capacity here AFAICS.
> 
> Allocation of shrinkers map is in css_online() now, and this wants its pay.
> memcg_expand_one_shrinker_map() must be able to differ mem cgroups with
> allocated maps, mem cgroups with not allocated maps, and mem cgroups with
> failed/failing css_online. So, the mutex is used for synchronization with
> expanding. See "old_size && !old" check in memcg_expand_one_shrinker_map().

Another reason to have 'expand' and 'alloc' paths separated - you
wouldn't need to take the mutex here as 'free' wouldn't be used for
undoing initial allocation, instead 'alloc' would cleanup by itself
while still holding the mutex.

> 
> >> +  for_each_node(nid) {
> >> +  pn = mem_cgroup_nodeinfo(memcg, nid);
> >> +  map = rcu_dereference_protected(pn->shrinker_map, true);
> >> +  if (map)
> >> +  call_rcu(&map->rcu, memcg_free_shrinker_map_rcu);
> >> +  rcu_assign_pointer(pn->shrinker_map, NULL);
> >> +  }
> >> +  mutex_unlock(_nr_max_mutex);
> >> +}
> >> +
> >> +static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >> +{
> >> +  int ret, size = memcg_shrinker_nr_max/BITS_PER_BYTE;
> >> +
> >> +  if (memcg == root_mem_cgroup)
> >> +  return 0;
> >> +
> >> +  mutex_lock(_nr_max_mutex);
> >> +  ret = memcg_expand_one_shrinker_map(memcg, size, 0);
> > 
> > I don't think it's worth reusing the function designed for reallocating
> > shrinker maps for initial allocation. Please just fold the code here -
> > it will make both 'alloc' and 'expand' easier to follow IMHO.
> 
> These function will have 80% code the same. What are the reasons to duplicate
> the same functionality? Two functions are more difficult for support, and
> everywhere in kernel we try to avoid this IMHO.

IMHO two functions with clear semantics are easier to maintain than
a function that does one of two things depending on some condition.
Separating 'alloc' from 'expand' would only add 10-15 SLOC.
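
For the record, a rough sketch of a self-contained 'alloc' that cleans up after itself while holding the mutex (names follow the series; it assumes memcg_free_shrinker_maps() does not take the mutex itself):

static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
{
	struct memcg_shrinker_map *map;
	int size = memcg_shrinker_nr_max / BITS_PER_BYTE;
	int nid, ret = 0;

	if (memcg == root_mem_cgroup)
		return 0;

	mutex_lock(&shrinkers_nr_max_mutex);
	for_each_node(nid) {
		map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
		if (!map) {
			memcg_free_shrinker_maps(memcg);
			ret = -ENOMEM;
			break;
		}
		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
	}
	mutex_unlock(&shrinkers_nr_max_mutex);

	return ret;
}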

> >> +  mutex_unlock(_nr_max_mutex);
> >> +
> >> +  if (ret)
> >> +  memcg_free_shrinker_maps(memcg);
> >> +
> >> +  return ret;
> >> +}
> >> +
> >> +static struct idr mem_cgroup_idr;
> >> +
> >> +int memcg_expand_shrinker_maps(int old_nr, int nr)
> >> +{
> >> +  int size, old_size, ret = 0;
> >> +  struct mem_cgroup *memcg;
> >> +
> >> +  old_size = old_nr / BITS_PER_BYTE;
> >> +  size = nr / BITS_PER_BYTE;
> >> +
> >> +  mutex_lock(_nr_max_mutex);
> >> +
> >> +  if (!root_mem_cgroup)
> >> +  goto unlock;
> > 
> > This wants a comment.
> 
> Which comment does this want? "root_mem_cgroup is not initialized, so
> it does not have child mem cgroups"?

Looking at this code again, I find it pretty self-explaining, sorry.

Thanks.


Re: [PATCH v5 01/13] mm: Assign id to every memcg-aware shrinker

2018-05-14 Thread Vladimir Davydov
On Mon, May 14, 2018 at 12:03:38PM +0300, Kirill Tkhai wrote:
> On 13.05.2018 08:15, Vladimir Davydov wrote:
> > On Thu, May 10, 2018 at 12:52:18PM +0300, Kirill Tkhai wrote:
> >> The patch introduces shrinker::id number, which is used to enumerate
> >> memcg-aware shrinkers. The number start from 0, and the code tries
> >> to maintain it as small as possible.
> >>
> >> This will be used as to represent a memcg-aware shrinkers in memcg
> >> shrinkers map.
> >>
> >> Since all memcg-aware shrinkers are based on list_lru, which is per-memcg
> >> in case of !SLOB only, the new functionality will be under MEMCG && !SLOB
> >> ifdef (symlinked to CONFIG_MEMCG_SHRINKER).
> > 
> > Using MEMCG && !SLOB instead of introducing a new config option was done
> > deliberately, see:
> > 
> >   http://lkml.kernel.org/r/20151210202244.ga4...@cmpxchg.org
> > 
> > I guess, this doesn't work well any more, as there are more and more
> > parts depending on kmem accounting, like shrinkers. If you really want
> > to introduce a new option, I think you should call it CONFIG_MEMCG_KMEM
> > and use it consistently throughout the code instead of MEMCG && !SLOB.
> > And this should be done in a separate patch.
> 
> What do you mean under "consistently throughout the code"? Should I replace
> all MEMCG && !SLOB with CONFIG_MEMCG_KMEM over existing code?

Yes, otherwise it looks messy - in some places we check !SLOB, in others
we use CONFIG_MEMCG_SHRINKER (or whatever it will be called).

> 
> >> diff --git a/fs/super.c b/fs/super.c
> >> index 122c402049a2..16c153d2f4f1 100644
> >> --- a/fs/super.c
> >> +++ b/fs/super.c
> >> @@ -248,6 +248,9 @@ static struct super_block *alloc_super(struct 
> >> file_system_type *type, int flags,
> >>s->s_time_gran = 10;
> >>s->cleancache_poolid = CLEANCACHE_NO_POOL;
> >>  
> >> +#ifdef CONFIG_MEMCG_SHRINKER
> >> +  s->s_shrink.id = -1;
> >> +#endif
> > 
> > No point doing that - you are going to overwrite the id anyway in
> > prealloc_shrinker().
> 
> Not so, this is done deliberately. alloc_super() has the only "fail" label,
> and it handles all the allocation errors there. The patch just behaves in
> the same style. It sets "-1" to make destroy_unused_super() able to differ
> the cases, when shrinker is really initialized, and when it's not.
> If you don't like this, I can move "s->s_shrink.id = -1;" into
> prealloc_memcg_shrinker() instead of this.

Yes, please do so that we don't have MEMCG ifdefs in fs code.

Thanks.


Re: [PATCH v5 06/13] fs: Propagate shrinker::id to list_lru

2018-05-13 Thread Vladimir Davydov
On Thu, May 10, 2018 at 12:53:06PM +0300, Kirill Tkhai wrote:
> The patch adds list_lru::shrinker_id field, and populates
> it by registered shrinker id.
> 
> This will be used to set correct bit in memcg shrinkers
> map by lru code in next patches, after there appeared
> the first related to memcg element in list_lru.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  fs/super.c   |4 
>  include/linux/list_lru.h |3 +++
>  mm/list_lru.c|6 ++
>  mm/workingset.c  |3 +++
>  4 files changed, 16 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 2ccacb78f91c..dfa85e725e45 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -258,6 +258,10 @@ static struct super_block *alloc_super(struct 
> file_system_type *type, int flags,
>   goto fail;
>   if (list_lru_init_memcg(>s_inode_lru))
>   goto fail;
> +#ifdef CONFIG_MEMCG_SHRINKER
> + s->s_dentry_lru.shrinker_id = s->s_shrink.id;
> + s->s_inode_lru.shrinker_id = s->s_shrink.id;
> +#endif

I don't like this. Can't you simply pass struct shrinker to
list_lru_init_memcg() and let it extract the id?
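
I.e. something like the following, so the id never has to be touched in fs code (a sketch of the suggested interface, not the API as it was at the time):

/* list_lru extracts the shrinker id itself. */
int __list_lru_init(struct list_lru *lru, bool memcg_aware,
		    struct lock_class_key *key, struct shrinker *shrinker);

#define list_lru_init_memcg(lru, shrinker)	\
	__list_lru_init((lru), true, NULL, (shrinker))

Inside __list_lru_init() the only memcg-specific addition would be along the lines of "lru->shrinker_id = shrinker ? shrinker->id : -1;", and the fs/super.c call sites would simply pass &s->s_shrink instead of copying the id around under an #ifdef.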


Re: [PATCH v5 03/13] mm: Assign memcg-aware shrinkers bitmap to memcg

2018-05-13 Thread Vladimir Davydov
On Thu, May 10, 2018 at 12:52:36PM +0300, Kirill Tkhai wrote:
> Imagine a big node with many cpus, memory cgroups and containers.
> Let we have 200 containers, every container has 10 mounts,
> and 10 cgroups. All container tasks don't touch foreign
> containers mounts. If there is intensive pages write,
> and global reclaim happens, a writing task has to iterate
> over all memcgs to shrink slab, before it's able to go
> to shrink_page_list().
> 
> Iteration over all the memcg slabs is very expensive:
> the task has to visit 200 * 10 = 2000 shrinkers
> for every memcg, and since there are 2000 memcgs,
> the total calls are 2000 * 2000 = 400.
> 
> So, the shrinker makes 4 million do_shrink_slab() calls
> just to try to isolate SWAP_CLUSTER_MAX pages in one
> of the actively writing memcg via shrink_page_list().
> I've observed a node spending almost 100% in kernel,
> making useless iteration over already shrinked slab.
> 
> This patch adds bitmap of memcg-aware shrinkers to memcg.
> The size of the bitmap depends on bitmap_nr_ids, and during
> memcg life it's maintained to be enough to fit bitmap_nr_ids
> shrinkers. Every bit in the map is related to corresponding
> shrinker id.
> 
> Next patches will maintain set bit only for really charged
> memcg. This will allow shrink_slab() to increase its
> performance in significant way. See the last patch for
> the numbers.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  include/linux/memcontrol.h |   21 
>  mm/memcontrol.c|  116 
> 
>  mm/vmscan.c|   16 ++
>  3 files changed, 152 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6cbea2f25a87..e5e7e0fc7158 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -105,6 +105,17 @@ struct lruvec_stat {
>   long count[NR_VM_NODE_STAT_ITEMS];
>  };
>  
> +#ifdef CONFIG_MEMCG_SHRINKER
> +/*
> + * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> + * which have elements charged to this memcg.
> + */
> +struct memcg_shrinker_map {
> + struct rcu_head rcu;
> + unsigned long map[0];
> +};
> +#endif /* CONFIG_MEMCG_SHRINKER */
> +

AFAIR we don't normally ifdef structure definitions.

>  /*
>   * per-zone information in memory controller.
>   */
> @@ -118,6 +129,9 @@ struct mem_cgroup_per_node {
>  
>   struct mem_cgroup_reclaim_iter  iter[DEF_PRIORITY + 1];
>  
> +#ifdef CONFIG_MEMCG_SHRINKER
> + struct memcg_shrinker_map __rcu *shrinker_map;
> +#endif
>   struct rb_node  tree_node;  /* RB tree node */
>   unsigned long   usage_in_excess;/* Set to the value by which */
>   /* the soft limit is exceeded*/
> @@ -1255,4 +1269,11 @@ static inline void memcg_put_cache_ids(void)
>  
>  #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
>  
> +#ifdef CONFIG_MEMCG_SHRINKER

> +#define MEMCG_SHRINKER_MAP(memcg, nid) (memcg->nodeinfo[nid]->shrinker_map)

I don't really like this helper macro. Accessing shrinker_map directly
looks cleaner IMO.

> +
> +extern int memcg_shrinker_nr_max;

As I've mentioned before, the capacity of shrinker map should be a
private business of memcontrol.c IMHO. We shouldn't use it in vmscan.c
as max shrinker id, instead we should introduce another variable for
this, private to vmscan.c.

> +extern int memcg_expand_shrinker_maps(int old_id, int id);

... Then this function would take just one argument, max id, and would
update shrinker_map capacity if necessary in memcontrol.c under the
corresponding mutex, which would look much more readable IMHO as all
shrinker_map related manipulations would be isolated in memcontrol.c.
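
Roughly like this, with the capacity kept static in memcontrol.c (a sketch; the symbol names and the DIV_ROUND_UP sizing are illustrative):

/* Capacity, in bytes, of every per-memcg shrinker map. */
static int memcg_shrinker_map_size;
static DEFINE_MUTEX(memcg_shrinker_map_mutex);

int memcg_expand_shrinker_maps(int new_id)
{
	int size, old_size, ret = 0;
	struct mem_cgroup *memcg;

	size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
	mutex_lock(&memcg_shrinker_map_mutex);
	old_size = memcg_shrinker_map_size;
	if (size <= old_size)
		goto unlock;

	for_each_mem_cgroup(memcg) {
		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
		if (ret) {
			mem_cgroup_iter_break(NULL, memcg);
			goto unlock;
		}
	}
	memcg_shrinker_map_size = size;
unlock:
	mutex_unlock(&memcg_shrinker_map_mutex);
	return ret;
}

vmscan.c would then only have to call memcg_expand_shrinker_maps(id) with the new shrinker id and keep its own private counter of the maximum id.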

> +#endif /* CONFIG_MEMCG_SHRINKER */
> +
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3df3efa7ff40..18e0fdf302a9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -322,6 +322,116 @@ struct workqueue_struct *memcg_kmem_cache_wq;
>  
>  #endif /* !CONFIG_SLOB */
>  
> +#ifdef CONFIG_MEMCG_SHRINKER
> +int memcg_shrinker_nr_max;

memcg_shrinker_map_capacity, may be?

> +static DEFINE_MUTEX(shrinkers_nr_max_mutex);

memcg_shrinker_map_mutex?

> +
> +static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> +{
> + kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> +}
> +
> +static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> +  int size, int old_size)

If you followed my advice and made the shrinker_map_capacity private to
memcontrol.c, you wouldn't need to pass old_size here either, just max
shrinker id.

> +{
> + struct memcg_shrinker_map *new, *old;
> + int nid;
> +
> + lockdep_assert_held(_nr_max_mutex);
> +
> + for_each_node(nid) {
> + old = rcu_dereference_protected(MEMCG_SHRINKER_MAP(memcg, nid), 
> true);
> + /* Not yet online memcg */


Re: [PATCH v5 01/13] mm: Assign id to every memcg-aware shrinker

2018-05-12 Thread Vladimir Davydov
On Thu, May 10, 2018 at 12:52:18PM +0300, Kirill Tkhai wrote:
> The patch introduces shrinker::id number, which is used to enumerate
> memcg-aware shrinkers. The number start from 0, and the code tries
> to maintain it as small as possible.
> 
> This will be used as to represent a memcg-aware shrinkers in memcg
> shrinkers map.
> 
> Since all memcg-aware shrinkers are based on list_lru, which is per-memcg
> in case of !SLOB only, the new functionality will be under MEMCG && !SLOB
> ifdef (symlinked to CONFIG_MEMCG_SHRINKER).

Using MEMCG && !SLOB instead of introducing a new config option was done
deliberately, see:

  http://lkml.kernel.org/r/20151210202244.ga4...@cmpxchg.org

I guess, this doesn't work well any more, as there are more and more
parts depending on kmem accounting, like shrinkers. If you really want
to introduce a new option, I think you should call it CONFIG_MEMCG_KMEM
and use it consistently throughout the code instead of MEMCG && !SLOB.
And this should be done in a separate patch.

> diff --git a/fs/super.c b/fs/super.c
> index 122c402049a2..16c153d2f4f1 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -248,6 +248,9 @@ static struct super_block *alloc_super(struct 
> file_system_type *type, int flags,
>   s->s_time_gran = 10;
>   s->cleancache_poolid = CLEANCACHE_NO_POOL;
>  
> +#ifdef CONFIG_MEMCG_SHRINKER
> + s->s_shrink.id = -1;
> +#endif

No point doing that - you are going to overwrite the id anyway in
prealloc_shrinker().

>   s->s_shrink.seeks = DEFAULT_SEEKS;
>   s->s_shrink.scan_objects = super_cache_scan;
>   s->s_shrink.count_objects = super_cache_count;

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 10c8a38c5eef..d691beac1048 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -169,6 +169,47 @@ unsigned long vm_total_pages;
>  static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>  
> +#ifdef CONFIG_MEMCG_SHRINKER
> +static DEFINE_IDR(shrinker_idr);
> +
> +static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> +{
> + int id, ret;
> +
> + down_write(&shrinker_rwsem);
> + ret = id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> + if (ret < 0)
> + goto unlock;
> + shrinker->id = id;
> + ret = 0;
> +unlock:
> + up_write(_rwsem);
> + return ret;
> +}
> +
> +static void del_memcg_shrinker(struct shrinker *shrinker)

Nit: IMO unregister_memcg_shrinker() would be a better name as it
matches unregister_shrinker(), just like prealloc_memcg_shrinker()
matches prealloc_shrinker().

> +{
> + int id = shrinker->id;
> +

> + if (id < 0)
> + return;

Nit: I think this should be BUG_ON(id < 0) as this function is only
called for memcg-aware shrinkers AFAICS.

> +
> + down_write(&shrinker_rwsem);
> + idr_remove(&shrinker_idr, id);
> + up_write(&shrinker_rwsem);
> + shrinker->id = -1;
> +}
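
For completeness, with the two nits above applied the teardown helper would look roughly like this (a sketch):

static void unregister_memcg_shrinker(struct shrinker *shrinker)
{
	int id = shrinker->id;

	BUG_ON(id < 0);

	down_write(&shrinker_rwsem);
	idr_remove(&shrinker_idr, id);
	up_write(&shrinker_rwsem);
	shrinker->id = -1;
}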


Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

2018-04-28 Thread Vladimir Davydov
On Tue, Apr 24, 2018 at 03:24:53PM +0300, Kirill Tkhai wrote:
> >> +int expand_shrinker_maps(int old_nr, int nr)
> >> +{
> >> +  int id, size, old_size, node, ret;
> >> +  struct mem_cgroup *memcg;
> >> +
> >> +  old_size = old_nr / BITS_PER_BYTE;
> >> +  size = nr / BITS_PER_BYTE;
> >> +
> >> +  down_write(_max_nr_rwsem);
> >> +  for_each_node(node) {
> >
> > Iterating over cgroups first, numa nodes second seems like a better idea
> > to me. I think you should fold for_each_node in memcg_expand_maps.
> >
> >> +  idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
> >
> > Iterating over mem_cgroup_idr looks strange. Why don't you use
> > for_each_mem_cgroup?
> 
>  We want to allocate shrinkers maps in mem_cgroup_css_alloc(), since
>  mem_cgroup_css_online() mustn't fail (it's a requirement of currently
>  existing design of memcg_cgroup::id).
> 
>  A new memcg is added to parent's list between two of these calls:
> 
>  css_create()
>    ss->css_alloc()
>    list_add_tail_rcu(&css->sibling, &parent_css->children)
>    ss->css_online()
> 
>  for_each_mem_cgroup() does not see allocated, but not linked children.
> >>>
> >>> Why don't we move shrinker map allocation to css_online then?
> >>
> >> Because the design of memcg_cgroup::id prohibits mem_cgroup_css_online() 
> >> to fail.
> >> This function can't fail.
> > 
> > I fail to understand why it is so. Could you please elaborate?
> 
> mem_cgroup::id is freed not in mem_cgroup_css_free(), but earlier. It's freed
> between mem_cgroup_css_offline() and mem_cgroup_free(), after the last 
> reference
> is put.
> 
> In case of sometimes we want to free it in mem_cgroup_css_free(), this will
> introduce assymmetric in the logic, which makes it more difficult. There is
> already a bug, which I fixed in
> 
> "memcg: remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure"
> 
> new change will make this code completely not-modular and unreadable.

How is that? AFAIU all we need to do to handle css_online failure
properly is call mem_cgroup_id_remove() from mem_cgroup_css_free().
That's it, as mem_cgroup_id_remove() is already safe to call more
than once for the same cgroup - the first call will free the id
while all subsequent calls will do nothing.
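
To be concrete, a minimal sketch of what I mean (assuming the existing
mem_cgroup_id_remove() helper; the rest of css_free is omitted):

static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(css);

	/*
	 * No-op if the id has already been freed on the normal offline
	 * path, so it is safe to call unconditionally here.
	 */
	mem_cgroup_id_remove(memcg);
	...
}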

>  
> >>
> >> I don't think it will be good to dive into reworking of this stuff for 
> >> this patchset,
> >> which is really already big. Also, it will be assymmetric to allocate one 
> >> part of
> >> data in css_alloc(), while another data in css_free(). This breaks cgroup 
> >> design,
> >> which specially introduces this two function to differ allocation and 
> >> onlining.
> >> Also, I've just move the allocation to alloc_mem_cgroup_per_node_info() 
> >> like it was
> >> suggested in comments to v1...
> > 
> > Yeah, but (ab)using mem_cgroup_idr for iterating over all allocated
> > memory cgroups looks rather dubious to me...
> 
> But we have to iterate over all allocated memory cgroups in any way,
> as all of them must have expanded maps. What is the problem?
> It's rather simple method, and it faster then for_each_mem_cgroup()
> cycle, since it does not have to play with get and put of refcounters.

I don't like this, because mem_cgroup_idr was initially introduced to
avoid depletion of css ids by offline cgroups. We could fix that problem
by extending swap_cgroup to UINT_MAX, in which case mem_cgroup_idr
wouldn't be needed at all. Reusing mem_cgroup_idr for iterating over
allocated cgroups deprives us of the ability to reconsider that design
decision in future, neither does it look natural IMO. Besides, in order
to use mem_cgroup_idr for your purpose, you have to reshuffle the code
of mem_cgroup_css_alloc in a rather contrived way IMO.

I agree that allocating parts of struct mem_cgroup in css_online may
look dubious, but IMHO it's better than inventing a new way to iterate
over cgroups instead of using the iterator provided by cgroup core.
Maybe you should ask Tejun which way he thinks is better.

Thanks,
Vladimir


Re: [PATCH v2] mm: introduce memory.min

2018-04-25 Thread Vladimir Davydov
On Tue, Apr 24, 2018 at 02:54:15PM +0100, Roman Gushchin wrote:
> > On Mon, Apr 23, 2018 at 01:36:10PM +0100, Roman Gushchin wrote:
> > > +  memory.min
> > > + A read-write single value file which exists on non-root
> > > + cgroups.  The default is "0".
> > > +
> > > + Hard memory protection.  If the memory usage of a cgroup
> > > + is within its effective min boundary, the cgroup's memory
> > > + won't be reclaimed under any conditions. If there is no
> > > + unprotected reclaimable memory available, OOM killer
> > > + is invoked.
> > 
> > What will happen if all tasks attached to a cgroup are killed by OOM,
> > but its memory usage is still within memory.min? Will memory.min be
> > ignored then?
> 
> Not really.
> 
> I don't think it's a big problem as long as a user isn't doing
> something weird (e.g. moving processes with significant
> amount of charged memory to other cgroups).

The user doesn't have to do anything weird for this to happen - just
read a file. This will allocate and charge page cache pages that are
not mapped to any process and hence cannot be freed by the OOM killer.

> 
> But what we can do here, is to ignore memory.min of empty cgroups
> (patch below), it will resolve some edge cases like this.

Makes sense to me.

Thanks,
Vladimir


Re: [PATCH v2] mm: introduce memory.min

2018-04-24 Thread Vladimir Davydov
Hi Roman,

On Mon, Apr 23, 2018 at 01:36:10PM +0100, Roman Gushchin wrote:
> +  memory.min
> + A read-write single value file which exists on non-root
> + cgroups.  The default is "0".
> +
> + Hard memory protection.  If the memory usage of a cgroup
> + is within its effective min boundary, the cgroup's memory
> + won't be reclaimed under any conditions. If there is no
> + unprotected reclaimable memory available, OOM killer
> + is invoked.

What will happen if all tasks attached to a cgroup are killed by OOM,
but its memory usage is still within memory.min? Will memory.min be
ignored then?

> +
> + Effective low boundary is limited by memory.min values of
> + all ancestor cgroups. If there is memory.min overcommitment
> + (child cgroup or cgroups are requiring more protected memory
> + than parent will allow), then each child cgroup will get
> + the part of parent's protection proportional to its
> + actual memory usage below memory.min.
> +
> + Putting more memory than generally available under this
> + protection is discouraged and may lead to constant OOMs.
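
To illustrate the proportional split described above with made-up numbers (my
reading of the text, not taken from the patch): say the parent's effective min
is 100M and it has two children, each with memory.min=80M, currently using 80M
and 40M. Their protected usages sum to 120M, which overcommits the parent's
100M, so the first child ends up with an effective min of roughly
100M * 80/120 ~ 67M and the second with roughly 100M * 40/120 ~ 33M.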


Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

2018-04-24 Thread Vladimir Davydov
On Tue, Apr 24, 2018 at 02:38:51PM +0300, Kirill Tkhai wrote:
> On 24.04.2018 14:28, Vladimir Davydov wrote:
> > On Mon, Apr 23, 2018 at 01:54:50PM +0300, Kirill Tkhai wrote:
> >>>> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
> >>>>  void memcg_get_cache_ids(void);
> >>>>  void memcg_put_cache_ids(void);
> >>>>  
> >>>> +extern int shrinkers_max_nr;
> >>>> +
> >>>
> >>> memcg_shrinker_id_max?
> >>
> memcg_shrinker_id_max sounds like an inclusive value, doesn't it?
> >> While shrinker->id < shrinker_max_nr.
> >>
> >> Let's better use memcg_shrinker_nr_max.
> > 
> > or memcg_nr_shrinker_ids (to match memcg_nr_cache_ids), not sure...
> > 
> > Come to think of it, this variable is kinda awkward: it is defined in
> > vmscan.c but declared in memcontrol.h; it is used by vmscan.c for max
> > shrinker id and by memcontrol.c for shrinker map capacity. Just a raw
> > idea: what about splitting it in two: one is private to vmscan.c, used
> > as max id, say we call it shrinker_id_max; the other is defined in
> > memcontrol.c and is used for shrinker map capacity, say we call it
> > memcg_shrinker_map_capacity. What do you think?
> 
> I don't much like a duplication of the single variable...

Well, it's not really a duplication. For example, shrinker_id_max could
decrease when a shrinker is unregistered while shrinker_map_capacity can
only grow exponentially.
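
Roughly what I have in mind (a sketch with the names from above; the comments
and placement are my assumption):

/* mm/vmscan.c -- private to the shrinker code */
static int shrinker_id_max;		/* highest id in use; may decrease on unregister */

/* mm/memcontrol.c -- used only for sizing the per-memcg bitmaps */
int memcg_shrinker_map_capacity;	/* map size in bits; only ever grows */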

> Are there real problems, if it defined in memcontrol.{c,h} and use in
> both of the places?

The code is more difficult to follow when variables are shared like that
IMHO. I suggest you try it and see how it looks. Maybe it will only
get worse and we'll have to revert to what we have now. Difficult to say
without seeing the code.

>  
> >>>> +int expand_shrinker_maps(int old_nr, int nr)
> >>>> +{
> >>>> +int id, size, old_size, node, ret;
> >>>> +struct mem_cgroup *memcg;
> >>>> +
> >>>> +old_size = old_nr / BITS_PER_BYTE;
> >>>> +size = nr / BITS_PER_BYTE;
> >>>> +
> >>>> +down_write(&shrinkers_max_nr_rwsem);
> >>>> +for_each_node(node) {
> >>>
> >>> Iterating over cgroups first, numa nodes second seems like a better idea
> >>> to me. I think you should fold for_each_node in memcg_expand_maps.
> >>>
> >>>> +idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
> >>>
> >>> Iterating over mem_cgroup_idr looks strange. Why don't you use
> >>> for_each_mem_cgroup?
> >>
> >> We want to allocate shrinkers maps in mem_cgroup_css_alloc(), since
> >> mem_cgroup_css_online() mustn't fail (it's a requirement of currently
> >> existing design of memcg_cgroup::id).
> >>
> >> A new memcg is added to parent's list between two of these calls:
> >>
> >> css_create()
> >>   ss->css_alloc()
> >>   list_add_tail_rcu(&css->sibling, &parent_css->children)
> >>   ss->css_online()
> >>
> >> for_each_mem_cgroup() does not see allocated, but not linked children.
> > 
> > Why don't we move shrinker map allocation to css_online then?
> 
> Because the design of memcg_cgroup::id prohibits mem_cgroup_css_online() to 
> fail.
> This function can't fail.

I fail to understand why it is so. Could you please elaborate?

> 
> I don't think it will be good to dive into reworking of this stuff for this 
> patchset,
> which is really already big. Also, it will be assymmetric to allocate one 
> part of
> data in css_alloc(), while another data in css_free(). This breaks cgroup 
> design,
> which specially introduces this two function to differ allocation and 
> onlining.
> Also, I've just move the allocation to alloc_mem_cgroup_per_node_info() like 
> it was
> suggested in comments to v1...

Yeah, but (ab)using mem_cgroup_idr for iterating over all allocated
memory cgroups looks rather dubious to me...


Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

2018-04-24 Thread Vladimir Davydov
On Mon, Apr 23, 2018 at 01:54:50PM +0300, Kirill Tkhai wrote:
> >> @@ -1200,6 +1206,8 @@ extern int memcg_nr_cache_ids;
> >>  void memcg_get_cache_ids(void);
> >>  void memcg_put_cache_ids(void);
> >>  
> >> +extern int shrinkers_max_nr;
> >> +
> > 
> > memcg_shrinker_id_max?
> 
> memcg_shrinker_id_max sounds like an inclusive value, doesn't it?
> While shrinker->id < shrinker_max_nr.
> 
> Let's better use memcg_shrinker_nr_max.

or memcg_nr_shrinker_ids (to match memcg_nr_cache_ids), not sure...

Come to think of it, this variable is kinda awkward: it is defined in
vmscan.c but declared in memcontrol.h; it is used by vmscan.c for max
shrinker id and by memcontrol.c for shrinker map capacity. Just a raw
idea: what about splitting it in two: one is private to vmscan.c, used
as max id, say we call it shrinker_id_max; the other is defined in
memcontrol.c and is used for shrinker map capacity, say we call it
memcg_shrinker_map_capacity. What do you think?

> >> +int expand_shrinker_maps(int old_nr, int nr)
> >> +{
> >> +  int id, size, old_size, node, ret;
> >> +  struct mem_cgroup *memcg;
> >> +
> >> +  old_size = old_nr / BITS_PER_BYTE;
> >> +  size = nr / BITS_PER_BYTE;
> >> +
> >> +  down_write(&shrinkers_max_nr_rwsem);
> >> +  for_each_node(node) {
> > 
> > Iterating over cgroups first, numa nodes second seems like a better idea
> > to me. I think you should fold for_each_node in memcg_expand_maps.
> >
> >> +  idr_for_each_entry(&mem_cgroup_idr, memcg, id) {
> > 
> > Iterating over mem_cgroup_idr looks strange. Why don't you use
> > for_each_mem_cgroup?
> 
> We want to allocate shrinkers maps in mem_cgroup_css_alloc(), since
> mem_cgroup_css_online() mustn't fail (it's a requirement of currently
> existing design of memcg_cgroup::id).
> 
> A new memcg is added to parent's list between two of these calls:
> 
> css_create()
>   ss->css_alloc()
>   list_add_tail_rcu(&css->sibling, &parent_css->children)
>   ss->css_online()
> 
> for_each_mem_cgroup() does not see allocated, but not linked children.

Why don't we move shrinker map allocation to css_online then?

>  
> >> +  if (id == 1)
> >> +  memcg = NULL;
> >> +  ret = memcg_expand_maps(memcg, node, size, old_size);
> >> +  if (ret)
> >> +  goto unlock;
> >> +  }
> >> +
> >> +  /* root_mem_cgroup is not initialized yet */
> >> +  if (id == 0)
> >> +  ret = memcg_expand_maps(NULL, node, size, old_size);
> >> +  }
> >> +unlock:
> >> +  up_write(_max_nr_rwsem);
> >> +  return ret;
> >> +}


Re: [PATCH v2 04/12] mm: Assign memcg-aware shrinkers bitmap to memcg

2018-04-24 Thread Vladimir Davydov
On Mon, Apr 23, 2018 at 02:06:31PM +0300, Kirill Tkhai wrote:
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 4f02fe83537e..f63eb5596c35 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -172,6 +172,22 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >>  #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> >>  static DEFINE_IDR(shrinkers_id_idr);
> >>  
> >> +static int expand_shrinker_id(int id)
> >> +{
> >> +  if (likely(id < shrinkers_max_nr))
> >> +  return 0;
> >> +
> >> +  id = shrinkers_max_nr * 2;
> >> +  if (id == 0)
> >> +  id = BITS_PER_BYTE;
> >> +
> >> +  if (expand_shrinker_maps(shrinkers_max_nr, id))
> >> +  return -ENOMEM;
> >> +
> >> +  shrinkers_max_nr = id;
> >> +  return 0;
> >> +}
> >> +
> > 
> > I think this function should live in memcontrol.c and shrinkers_max_nr
> > should be private to memcontrol.c.
> 
> It can't be private as shrink_slab_memcg() uses this value to get the last 
> bit of bitmap.

Yeah, you're right, sorry, I hadn't noticed that.

What about moving id allocation to this function as well? IMHO it would
make the code flow a little bit more straightforward. I mean,

alloc_shrinker_id()
{
	int id = idr_alloc(...);

	if (id >= memcg_nr_shrinker_ids)
		memcg_grow_shrinker_map(...);
	return id;
}

> 
> >>  static int add_memcg_shrinker(struct shrinker *shrinker)
> >>  {
> >>int id, ret;
> >> @@ -180,6 +196,11 @@ static int add_memcg_shrinker(struct shrinker 
> >> *shrinker)
> >>ret = id = idr_alloc(&shrinkers_id_idr, shrinker, 0, 0, GFP_KERNEL);
> >>if (ret < 0)
> >>goto unlock;
> >> +  ret = expand_shrinker_id(id);
> >> +  if (ret < 0) {
> >> +  idr_remove(&shrinkers_id_idr, id);
> >> +  goto unlock;
> >> +  }
> >>shrinker->id = id;
> >>ret = 0;
> >>  unlock:
> >>

