Re: [PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path
On Sat, Jan 14, 2017 at 12:54:48AM -0500, Tejun Heo wrote:
> With kmem cgroup support enabled, kmem_caches can be created and
> destroyed frequently and a great number of near-empty kmem_caches can
> accumulate if there are a lot of transient cgroups and the system is
> not under memory pressure.  When memory reclaim starts under such
> conditions, it can lead to consecutive deactivation and destruction of
> many kmem_caches, easily hundreds of thousands on moderately large
> systems, exposing scalability issues in the current slab management
> code.  This is one of the patches to address the issue.
>
> slub uses synchronize_sched() to deactivate a memcg cache.
> synchronize_sched() is an expensive and slow operation and doesn't
> scale when a huge number of caches are destroyed back-to-back.  While
> there used to be a simple batching mechanism, the batching was too
> restricted to be helpful.
>
> This patch implements slab_deactivate_memcg_cache_rcu_sched(), which
> slub can use to schedule a sched RCU callback instead of performing
> synchronize_sched() synchronously while holding cgroup_mutex.  While
> this adds online cpus, mems and slab_mutex operations, operating on
> these locks back-to-back from the same kworker, which is what's gonna
> happen when there are many to deactivate, isn't expensive at all and
> this gets rid of the scalability problem completely.
>
> Signed-off-by: Tejun Heo
> Reported-by: Jay Vana
> Cc: Vladimir Davydov
> Cc: Christoph Lameter
> Cc: Pekka Enberg
> Cc: David Rientjes
> Cc: Joonsoo Kim
> Cc: Andrew Morton

I don't think there's much point in having the infrastructure for this
in slab_common.c, as only SLUB needs it, but it isn't a show stopper.

Acked-by: Vladimir Davydov
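[Editorial note for readers of the archive: to make the scalability problem concrete, here is a minimal sketch of the pattern the patch removes. It illustrates the pre-patch behaviour described above; it is not the actual mm/slub.c code (which isn't quoted in this thread), and the function name is made up for the example.]

	/*
	 * Illustrative sketch only: pre-patch, each memcg cache
	 * deactivation blocked on a full sched RCU grace period while
	 * cgroup_mutex was held, so destroying N transient cgroups paid
	 * for N grace periods back-to-back.
	 */
	static void kmemcg_cache_deactivate_sync(struct kmem_cache *s)
	{
		/* ... unfreeze per-cpu partials, disable further caching ... */

		/*
		 * Wait until no CPU can still be inside a slub fastpath
		 * that observed the old cache state.  Blocking here, once
		 * per cache, is the part that doesn't scale.
		 */
		synchronize_sched();

		/* ... shrink and free the now-idle slabs ... */
	}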
[PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near-empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

slub uses synchronize_sched() to deactivate a memcg cache.
synchronize_sched() is an expensive and slow operation and doesn't
scale when a huge number of caches are destroyed back-to-back.  While
there used to be a simple batching mechanism, the batching was too
restricted to be helpful.

This patch implements slab_deactivate_memcg_cache_rcu_sched(), which
slub can use to schedule a sched RCU callback instead of performing
synchronize_sched() synchronously while holding cgroup_mutex.  While
this adds online cpus, mems and slab_mutex operations, operating on
these locks back-to-back from the same kworker, which is what's gonna
happen when there are many to deactivate, isn't expensive at all and
this gets rid of the scalability problem completely.

Signed-off-by: Tejun Heo
Reported-by: Jay Vana
Cc: Vladimir Davydov
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrew Morton
---
 include/linux/slab.h |  6 ++++++
 mm/slab.h            |  2 ++
 mm/slab_common.c     | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/slub.c            | 12 ++++++++----
 4 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 63d543d..84701bb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -576,6 +576,12 @@ struct memcg_cache_params {
 		struct {
 			struct mem_cgroup *memcg;
 			struct list_head kmem_caches_node;
+
+			void (*deact_fn)(struct kmem_cache *);
+			union {
+				struct rcu_head deact_rcu_head;
+				struct work_struct deact_work;
+			};
 		};
 	};
 };
diff --git a/mm/slab.h b/mm/slab.h
index 73ed6b5..b67ac8f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -299,6 +299,8 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 }

 extern void slab_init_memcg_params(struct kmem_cache *);
+extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+			void (*deact_fn)(struct kmem_cache *));

 #else /* CONFIG_MEMCG && !CONFIG_SLOB */

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 87e5535..4730ef6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -583,6 +583,66 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }

+static void kmemcg_deactivate_workfn(struct work_struct *work)
+{
+	struct kmem_cache *s = container_of(work, struct kmem_cache,
+					    memcg_params.deact_work);
+
+	get_online_cpus();
+	get_online_mems();
+
+	mutex_lock(&slab_mutex);
+
+	s->memcg_params.deact_fn(s);
+
+	mutex_unlock(&slab_mutex);
+
+	put_online_mems();
+	put_online_cpus();
+
+	/* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
+	css_put(&s->memcg_params.memcg->css);
+}
+
+static void kmemcg_deactivate_rcufn(struct rcu_head *head)
+{
+	struct kmem_cache *s = container_of(head, struct kmem_cache,
+					    memcg_params.deact_rcu_head);
+
+	/*
+	 * We need to grab blocking locks.  Bounce to ->deact_work.  The
+	 * work item shares the space with the RCU head and can't be
+	 * initialized earlier.
+	 */
+	INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
+	schedule_work(&s->memcg_params.deact_work);
+}
+
+/**
+ * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
+ *					   sched RCU grace period
+ * @s: target kmem_cache
+ * @deact_fn: deactivation function to call
+ *
+ * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
+ * held after a sched RCU grace period.  The slab is guaranteed to stay
+ * alive until @deact_fn is finished.  This is to be used from
+ * __kmemcg_cache_deactivate().
+ */
+void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+					   void (*deact_fn)(struct kmem_cache *))
+{
+	if (WARN_ON_ONCE(is_root_cache(s)) ||
+	    WARN_ON_ONCE(s->memcg_params.deact_fn))
+		return;
+
+	/* pin memcg so that @s doesn't get destroyed in the middle */
+	css_get(&s->memcg_params.memcg->css);
+
+	s->memcg_params.deact_fn = deact_fn;
+	call_rcu_sched(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn);
+}
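[Editorial note: the mail is truncated before the mm/slub.c hunk, so the caller side of the new helper isn't visible above. Below is a minimal sketch of how SLUB would use it, assuming a deactivation callback named kmemcg_cache_deact_after_rcu(); the name and the exact body are assumptions, not the patch's verbatim code.]

	/* Runs with online cpus, mems and slab_mutex held, after a grace period. */
	static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
	{
		/*
		 * No CPU can still be inside a slub fastpath that saw the
		 * old cache state, so it is safe to drain the cache now.
		 */
		__kmem_cache_shrink(s);
	}

	static void __kmemcg_cache_deactivate(struct kmem_cache *s)
	{
		/* Disable per-cpu partials and empty-slab caching. */
		s->cpu_partial = 0;
		s->min_partial = 0;

		/*
		 * Instead of a synchronous synchronize_sched(), queue the
		 * blocking part of the deactivation as a sched RCU callback;
		 * this returns immediately, so the grace periods of many
		 * caches being destroyed back-to-back can overlap.
		 */
		slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
	}

Note the design point behind the union in memcg_cache_params: the two stages never overlap. The rcu_head is live only until the callback fires, and the work item is initialized only inside that callback, which is why the INIT_WORK() cannot happen any earlier and the two can share storage.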