Re: [PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path

2017-01-14 Thread Vladimir Davydov
On Sat, Jan 14, 2017 at 12:54:48AM -0500, Tejun Heo wrote:
> With kmem cgroup support enabled, kmem_caches can be created and
> destroyed frequently and a great number of near empty kmem_caches can
> accumulate if there are a lot of transient cgroups and the system is
> not under memory pressure.  When memory reclaim starts under such
> conditions, it can lead to consecutive deactivation and destruction of
> many kmem_caches, easily hundreds of thousands on moderately large
> systems, exposing scalability issues in the current slab management
> code.  This is one of the patches to address the issue.
> 
> slub uses synchronize_sched() to deactivate a memcg cache.
> synchronize_sched() is an expensive and slow operation and doesn't
> scale when a huge number of caches are destroyed back-to-back.  While
> there used to be a simple batching mechanism, the batching was too
> restricted to be helpful.
> 
> This patch implements slab_deactivate_memcg_cache_rcu_sched() which
> slub can use to schedule sched RCU callback instead of performing
> synchronize_sched() synchronously while holding cgroup_mutex.  While
> this adds online cpus, mems and slab_mutex operations, operating on
> these locks back-to-back from the same kworker, which is what's gonna
> happen when there are many to deactivate, isn't expensive at all and
> this gets rid of the scalability problem completely.
> 
> Signed-off-by: Tejun Heo 
> Reported-by: Jay Vana 
> Cc: Vladimir Davydov 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> Cc: David Rientjes 
> Cc: Joonsoo Kim 
> Cc: Andrew Morton 

I don't think there's much point in having the infrastructure for this
in slab_common.c, as only SLUB needs it, but it isn't a show stopper.

Acked-by: Vladimir Davydov 


[PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

slub uses synchronize_sched() to deactivate a memcg cache.
synchronize_sched() is an expensive and slow operation and doesn't
scale when a huge number of caches are destroyed back-to-back.  While
there used to be a simple batching mechanism, the batching was too
restricted to be helpful.

This patch implements slab_deactivate_memcg_cache_rcu_sched() which
slub can use to schedule sched RCU callback instead of performing
synchronize_sched() synchronously while holding cgroup_mutex.  While
this adds online cpus, mems and slab_mutex operations, operating on
these locks back-to-back from the same kworker, which is what's gonna
happen when there are many to deactivate, isn't expensive at all and
this gets rid of the scalability problem completely.

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
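
Not part of the patch: a minimal userspace sketch of the two-stage deferral
described above, for readers who want to see the control flow in isolation.
Everything in it is a hypothetical stand-in (fake_rcu_head, fake_work,
fake_cache, the two arrays used as queues, and all function names); it does
not call any kernel API, and a plain counter takes the place of the memcg css
reference. It only tries to show the shape: pin, queue an RCU callback, and
let that callback reuse the same storage as a work item that later runs the
deactivation function.

```c
#include <assert.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define MAX_PENDING 8

/* Hypothetical userspace stand-ins for struct rcu_head and struct work_struct. */
struct fake_rcu_head { void (*func)(struct fake_rcu_head *); };
struct fake_work { void (*func)(struct fake_work *); };

struct fake_cache {
	int refs;		/* stands in for the memcg css reference */
	int deactivated;
	void (*deact_fn)(struct fake_cache *);
	union {			/* the two stages never need the space at once */
		struct fake_rcu_head deact_rcu_head;
		struct fake_work deact_work;
	};
};

static struct fake_rcu_head *rcu_queue[MAX_PENDING];
static int rcu_count;
static struct fake_work *work_queue[MAX_PENDING];
static int work_count;

/* Stage 2: process context; the real code takes the blocking locks here. */
static void deactivate_workfn(struct fake_work *work)
{
	struct fake_cache *s = container_of(work, struct fake_cache, deact_work);

	s->deact_fn(s);
	s->refs--;		/* drop the pin taken at scheduling time */
}

/* Stage 1: runs after the (simulated) grace period; must not block. */
static void deactivate_rcufn(struct fake_rcu_head *head)
{
	struct fake_cache *s = container_of(head, struct fake_cache, deact_rcu_head);

	/* The RCU head is no longer needed, so its space becomes the work item. */
	s->deact_work.func = deactivate_workfn;
	assert(work_count < MAX_PENDING);
	work_queue[work_count++] = &s->deact_work;
}

/* Returns 0 on success, -1 if a deactivation is already scheduled. */
int schedule_deactivation(struct fake_cache *s,
			  void (*deact_fn)(struct fake_cache *))
{
	if (s->deact_fn)
		return -1;

	s->refs++;		/* pin so the cache outlives both stages */
	s->deact_fn = deact_fn;
	s->deact_rcu_head.func = deactivate_rcufn;
	assert(rcu_count < MAX_PENDING);
	rcu_queue[rcu_count++] = &s->deact_rcu_head;
	return 0;		/* the caller never waits for the grace period */
}

/* Simulated end of a grace period: invoke every queued RCU callback. */
void run_grace_period(void)
{
	for (int i = 0; i < rcu_count; i++)
		rcu_queue[i]->func(rcu_queue[i]);
	rcu_count = 0;
}

/* Simulated kworker: run every queued work item back-to-back. */
void run_workqueue(void)
{
	for (int i = 0; i < work_count; i++)
		work_queue[i]->func(work_queue[i]);
	work_count = 0;
}

/* Example deactivation function. */
void mark_deactivated(struct fake_cache *s)
{
	s->deactivated = 1;
}
```

To try it, call schedule_deactivation(&c, mark_deactivated) on a
zero-initialized struct fake_cache from a small main(), then
run_grace_period() followed by run_workqueue(). The cache is only marked
deactivated after the second step, and the scheduling call itself returns
immediately, which is the property the patch is after.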
 include/linux/slab.h |  6 ++
 mm/slab.h            |  2 ++
 mm/slab_common.c     | 60
 mm/slub.c            | 12 +++
 4 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 63d543d..84701bb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -576,6 +576,12 @@ struct memcg_cache_params {
 		struct {
 			struct mem_cgroup *memcg;
 			struct list_head kmem_caches_node;
+
+			void (*deact_fn)(struct kmem_cache *);
+			union {
+				struct rcu_head deact_rcu_head;
+				struct work_struct deact_work;
+			};
 		};
 	};
 };
diff --git a/mm/slab.h b/mm/slab.h
index 73ed6b5..b67ac8f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -299,6 +299,8 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
+extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+   void (*deact_fn)(struct kmem_cache *));
 
 #else /* CONFIG_MEMCG && !CONFIG_SLOB */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 87e5535..4730ef6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -583,6 +583,66 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }
 
+static void kmemcg_deactivate_workfn(struct work_struct *work)
+{
+   struct kmem_cache *s = container_of(work, struct kmem_cache,
+   memcg_params.deact_work);
+
+   get_online_cpus();
+   get_online_mems();
+
+   mutex_lock(&slab_mutex);
+
+   s->memcg_params.deact_fn(s);
+
+   mutex_unlock(&slab_mutex);
+
+   put_online_mems();
+   put_online_cpus();
+
+   /* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
+   css_put(&s->memcg_params.memcg->css);
+}
+
+static void kmemcg_deactivate_rcufn(struct rcu_head *head)
+{
+   struct kmem_cache *s = container_of(head, struct kmem_cache,
+   memcg_params.deact_rcu_head);
+
+   /*
+    * We need to grab blocking locks.  Bounce to ->deact_work.  The
+    * work item shares the space with the RCU head and can't be
+    * initialized earlier.
+    */
+   INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
+   schedule_work(&s->memcg_params.deact_work);
+}
+
+/**
+ * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
+ *sched RCU grace period
+ * @s: target kmem_cache
+ * @deact_fn: deactivation function to call
+ *
+ * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
+ * held after a sched RCU grace period.  The slab is guaranteed to stay
+ * alive until @deact_fn is finished.  This is to be used from
+ * __kmemcg_cache_deactivate().
+ */
+void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+  void (*deact_fn)(struct kmem_cache *))
+{
+   if (WARN_ON_ONCE(is_root_cache(s)) ||
+   WARN_ON_ONCE(s->memcg_params.deact_fn))
+   return;
+
+   /* pin memcg so that @s doesn't get destroyed in the middle */
+   css_get(&s->memcg_params.memcg->css);
+
+   s->memcg_params.deact_fn = deact_fn;
+