Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
On Tue, Jan 17, 2017 at 08:37:45AM -0800, Tejun Heo wrote:
> The call sequence doesn't matter. Whether you're using call_rcu() or
> rcu_barrier(), you're just waiting for a grace period to pass before
> continuing. It doesn't give any other ordering guarantees, so the new
> code should be equivalent to the old one except for being asynchronous.

Oh, I was confusing synchronize_rcu() with rcu_barrier(), so you're
right, the kmem_cache struct needs to stay around for the slab pages to
be freed after an RCU grace period. Will revise the patch accordingly,
thanks.

Thanks.

--
tejun
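[Editorial note: the distinction conceded above is easy to mix up, so here is a
condensed kernel-style sketch of it. This is not part of the patch; `struct foo`
and `free_foo()` are hypothetical names used only for illustration.]

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct rcu_head rcu;
};

static void free_foo(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static void demo(struct foo *f)
{
	call_rcu(&f->rcu, free_foo);	/* queue callback to run after a
					 * grace period */

	synchronize_rcu();		/* waits for a grace period only;
					 * free_foo() may or may not have
					 * executed by the time this returns */

	rcu_barrier();			/* waits until every call_rcu()
					 * callback queued before this point,
					 * including free_foo() above, has
					 * finished executing */
}
```

In other words, only rcu_barrier() guarantees that previously queued
callbacks have completed, which is exactly the property the shutdown path
relies on here.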
Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
Hello, Joonsoo.

On Tue, Jan 17, 2017 at 09:07:54AM +0900, Joonsoo Kim wrote:
> Long time no see! :)

Yeah, happy new year!

> IIUC, rcu_barrier() here prevents the kmem_cache from being destroyed
> until all slab pages in it are freed. These slab pages are freed
> through call_rcu().

Hmm... why do we need that tho? SLAB_DESTROY_BY_RCU only needs to
protect the slab pages, not the kmem_cache struct. I thought that this
was because kmem_cache destruction is allowed to release pages w/o RCU
delaying it.

> Your patch changes it to another call_rcu() and, I think, if the
> sequence of executing rcu callbacks is the same as the sequence of
> adding rcu callbacks, it would work. However, I'm not sure that that
> is guaranteed by the RCU API. Am I missing something?

The call sequence doesn't matter. Whether you're using call_rcu() or
rcu_barrier(), you're just waiting for a grace period to pass before
continuing. It doesn't give any other ordering guarantees, so the new
code should be equivalent to the old one except for being asynchronous.

Thanks.

--
tejun
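[Editorial note: the asynchronous shape being described can be reduced to the
following sketch. It uses names from the patch (rcu_head, sysfs_slab_release,
shutdown_cache), but the control flow is heavily condensed and the release
callback name is hypothetical; the real patch only takes the call_rcu() path
for SLAB_DESTROY_BY_RCU caches, and the approach is revisited later in the
thread.]

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Runs after a grace period; frees the cache itself. */
static void kmem_cache_release(struct rcu_head *head)
{
	struct kmem_cache *s = container_of(head, struct kmem_cache,
					    rcu_head);

	sysfs_slab_release(s);	/* sysfs removal was already done
				 * synchronously in shutdown */
	/* ... free per-node / per-cpu data and the kmem_cache ... */
}

static void shutdown_cache(struct kmem_cache *s)
{
	/* Instead of rcu_barrier() + batched release, unlink the cache
	 * and queue the release as an RCU callback; nothing blocks. */
	list_del(&s->list);
	call_rcu(&s->rcu_head, kmem_cache_release);
}
```

The ->rcu_head can share storage with ->list (as in the patch) precisely
because the cache is off all lists by the time the callback is queued.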
Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
On Sat, Jan 14, 2017 at 10:19:21AM -0500, Tejun Heo wrote:
> Hello, Vladimir.
>
> On Sat, Jan 14, 2017 at 04:19:39PM +0300, Vladimir Davydov wrote:
> > On Sat, Jan 14, 2017 at 12:54:42AM -0500, Tejun Heo wrote:
> > > This patch updates the cache release path so that it simply uses
> > > call_rcu() instead of the synchronous rcu_barrier() + custom batching.
> > > This doesn't cost more while being logically simpler and way more
> > > scalable.
> >
> > The point of rcu_barrier() is to wait until all rcu calls freeing slabs
> > from the cache being destroyed are over (rcu_free_slab, kmem_rcu_free).
> > I'm not sure if call_rcu() guarantees that for all rcu implementations
> > too. If it did, why would we need rcu_barrier() at all?
>
> Yeah, I had a similar question and scanned its users briefly. Looks
> like it's used in combination with ctors so that its users can
> opportunistically dereference objects and e.g. check ids / state /
> whatever without worrying about the objects' lifetimes.

Hello, Tejun. Long time no see! :)

IIUC, rcu_barrier() here prevents the kmem_cache from being destroyed
until all slab pages in it are freed. These slab pages are freed through
call_rcu(). Your patch changes it to another call_rcu() and, I think, if
the sequence of executing rcu callbacks is the same as the sequence of
adding rcu callbacks, it would work. However, I'm not sure that that is
guaranteed by the RCU API. Am I missing something?

Thanks.
Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
Hello, Vladimir.

On Sat, Jan 14, 2017 at 04:19:39PM +0300, Vladimir Davydov wrote:
> On Sat, Jan 14, 2017 at 12:54:42AM -0500, Tejun Heo wrote:
> > This patch updates the cache release path so that it simply uses
> > call_rcu() instead of the synchronous rcu_barrier() + custom batching.
> > This doesn't cost more while being logically simpler and way more
> > scalable.
>
> The point of rcu_barrier() is to wait until all rcu calls freeing slabs
> from the cache being destroyed are over (rcu_free_slab, kmem_rcu_free).
> I'm not sure if call_rcu() guarantees that for all rcu implementations
> too. If it did, why would we need rcu_barrier() at all?

Yeah, I had a similar question and scanned its users briefly. Looks
like it's used in combination with ctors so that its users can
opportunistically dereference objects and e.g. check ids / state /
whatever without worrying about the objects' lifetimes.

Thanks.

--
tejun
Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
Hello Tejun,

Thanks a lot for looking into this issue as it seems to affect a lot of
users!

On Sat, Jan 14, 2017 at 12:54:42AM -0500, Tejun Heo wrote:
> This patch updates the cache release path so that it simply uses
> call_rcu() instead of the synchronous rcu_barrier() + custom batching.
> This doesn't cost more while being logically simpler and way more
> scalable.

The point of rcu_barrier() is to wait until all rcu calls freeing slabs
from the cache being destroyed are over (rcu_free_slab, kmem_rcu_free).
I'm not sure if call_rcu() guarantees that for all rcu implementations
too. If it did, why would we need rcu_barrier() at all?
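[Editorial note: for context, the slab-side pattern being referred to looks
roughly like this. It is a condensed sketch in the spirit of mm/slab.c's
kmem_rcu_free() path, not a verbatim excerpt; most details are elided.]

```c
#include <linux/rcupdate.h>
#include <linux/mm.h>

/* RCU callback: actually returns the slab's pages to the page allocator. */
static void kmem_rcu_free(struct rcu_head *head)
{
	/* ... recover the slab page from head and free it ... */
}

static void slab_destroy(struct kmem_cache *cachep, struct page *page)
{
	if (cachep->flags & SLAB_DESTROY_BY_RCU)
		/* Pages are freed only after a grace period, so readers
		 * may keep opportunistically dereferencing objects. */
		call_rcu(&page->rcu_head, kmem_rcu_free);
	/* ... otherwise free the page immediately ... */
}
```

Because those page-freeing callbacks may still be queued when the cache is
being destroyed, the destruction path needs rcu_barrier(), which waits for
previously queued callbacks to *complete*, not merely for a grace period to
elapse.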
[PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near-empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such conditions,
it can lead to consecutive deactivation and destruction of many
kmem_caches, easily hundreds of thousands on moderately large systems,
exposing scalability issues in the current slab management code. This is
one of the patches to address the issue.

SLAB_DESTROY_BY_RCU caches need an RCU grace period before destruction.
Currently, it's done synchronously with rcu_barrier(). As rcu_barrier()
is expensive time-wise, slab implements a batching mechanism so that
rcu_barrier() can be done for multiple caches at the same time.

Unfortunately, the rcu_barrier() is in a synchronous path which is
called while holding cgroup_mutex, and the batching is too limited to be
actually helpful. Besides, the batching is just a very degenerate form
of the actual RCU callback mechanism.

This patch updates the cache release path so that it simply uses
call_rcu() instead of the synchronous rcu_barrier() + custom batching.
This doesn't cost more while being logically simpler and way more
scalable.

* ->rcu_head is added to kmem_cache structs. It shares storage space
  with ->list.

* slub sysfs removal and release are separated and the former is now
  called from __kmem_cache_shutdown() while the latter is called from
  the release path. There's no reason to defer sysfs removal through
  RCU and this makes it unnecessary to bounce to workqueue from the
  RCU callback.

* release_caches() is removed and shutdown_cache() now either directly
  releases the cache or schedules an RCU callback to do that. This
  makes the cache inaccessible once shutdown_cache() is called and
  makes it impossible for shutdown_memcg_caches() to do memcg-specific
  cleanups afterwards.
Move the memcg-specific part into a helper, unlink_memcg_cache(), and
make shutdown_cache() call it directly.

Signed-off-by: Tejun Heo
Reported-by: Jay Vana
Cc: Vladimir Davydov
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrew Morton
---
 include/linux/slab_def.h |  5 ++-
 include/linux/slub_def.h |  9 --
 mm/slab.h                |  5 ++-
 mm/slab_common.c         | 84
 mm/slub.c                |  9 +-
 5 files changed, 57 insertions(+), 55 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 4ad2c5a..b649629 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -39,7 +39,10 @@ struct kmem_cache {
 /* 4) cache creation/removal */
 	const char *name;
-	struct list_head list;
+	union {
+		struct list_head list;
+		struct rcu_head rcu_head;
+	};
 	int refcount;
 	int object_size;
 	int align;
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 75f56c2..7637b41 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -80,7 +80,10 @@ struct kmem_cache {
 	int align;		/* Alignment */
 	int reserved;		/* Reserved bytes at the end of slabs */
 	const char *name;	/* Name (only for display!) */
-	struct list_head list;	/* List of slab caches */
+	union {
+		struct list_head list;	/* List of slab caches */
+		struct rcu_head rcu_head;
+	};
 	int red_left_pad;	/* Left redzone padding size */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
@@ -113,9 +116,9 @@ struct kmem_cache {

 #ifdef CONFIG_SYSFS
 #define SLAB_SUPPORTS_SYSFS
-void sysfs_slab_remove(struct kmem_cache *);
+void sysfs_slab_release(struct kmem_cache *);
 #else
-static inline void sysfs_slab_remove(struct kmem_cache *s)
+static inline void sysfs_slab_release(struct kmem_cache *s)
 {
 }
 #endif
diff --git a/mm/slab.h b/mm/slab.h
index 4acc644..3fa2d77 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -24,7 +24,10 @@ struct kmem_cache {
 	const char *name;	/* Slab name for sysfs */
 	int refcount;		/* Use counter */
 	void (*ctor)(void *);	/* Called on object slot creation */
-	struct list_head list;	/* List of all slab caches on the system */
+	union {
+		struct list_head list;	/* List of all slab caches on the system */
+		struct rcu_head rcu_head;
+	};
 };

 #endif /* CONFIG_SLOB */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 46ff746..851c75e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -215,6 +215,11 @@ int memcg_update_all_caches(int num_memcgs)
 	mutex_unlock(&slab_mutex);
 	return ret;
 }
+
+static void unlink_memcg_cache(struct kmem_cache *s)
+{
+	list_del(&s->memcg_params.list);
+}
 #else
 static inline int init_memcg_params(struct kmem_cache *s,