Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path

2017-01-17 Thread Tejun Heo
On Tue, Jan 17, 2017 at 08:37:45AM -0800, Tejun Heo wrote:
> The call sequence doesn't matter.  Whether you're using call_rcu() or
> rcu_barrier(), you're just waiting for a grace period to pass before
> continuing.  It doesn't give any other ordering guarantees, so the new
> code should be equivalent to the old one except for being asynchronous.

Oh, I was confusing synchronize_rcu() with rcu_barrier(), so you're
right: the kmem_cache struct needs to stay around until the slab pages
are freed after the RCU grace period.  Will revise the patch
accordingly, thanks.
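
To spell out what I was mixing up (illustrative sketch only, not the
actual slab code; rcu_free_slab stands in for the slab-page callback):

	call_rcu(&page->rcu_head, rcu_free_slab); /* cb dereferences the cache */

	synchronize_rcu(); /* a grace period has elapsed, but rcu_free_slab()
			    * may still be pending */
	rcu_barrier();	   /* all previously queued callbacks have now run */
	/* only at this point is it safe to free the kmem_cache itself */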

Thanks.

-- 
tejun


Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path

2017-01-17 Thread Tejun Heo
Hello, Joonsoo.

On Tue, Jan 17, 2017 at 09:07:54AM +0900, Joonsoo Kim wrote:
> Long time no see! :)

Yeah, happy new year!

> IIUC, rcu_barrier() here prevents the kmem_cache from being destroyed
> until all slab pages in it are freed.  These slab pages are freed
> through call_rcu().

Hmm... why do we need that, though?  SLAB_DESTROY_BY_RCU only needs to
protect the slab pages, not the kmem_cache struct.  I thought this was
because kmem cache destruction is allowed to release pages without RCU
delaying them.

> Your patch changes it to another call_rcu() and, I think, if the
> sequence of executing rcu callbacks is the same as the sequence of
> adding rcu callbacks, it would work.  However, I'm not sure that this
> is guaranteed by the RCU API.  Am I missing something?

The call sequence doesn't matter.  Whether you're using call_rcu() or
rcu_barrier(), you're just waiting for a grace period to pass before
continuing.  It doesn't give any other ordering guarantees, so the new
code should be equivalent to the old one except for being asynchronous.
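
That is, schematically (sketch only; release() / release_cb() are
placeholders):

	/* old, synchronous */
	rcu_barrier();	/* a grace period elapses, pending callbacks run */
	release(s);

	/* new, asynchronous; same guarantee for this cache's release */
	call_rcu(&s->rcu_head, release_cb); /* release_cb() runs after a GP */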

Thanks.

-- 
tejun


Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path

2017-01-16 Thread Joonsoo Kim
On Sat, Jan 14, 2017 at 10:19:21AM -0500, Tejun Heo wrote:
> Hello, Vladimir.
> 
> On Sat, Jan 14, 2017 at 04:19:39PM +0300, Vladimir Davydov wrote:
> > On Sat, Jan 14, 2017 at 12:54:42AM -0500, Tejun Heo wrote:
> > > This patch updates the cache release path so that it simply uses
> > > call_rcu() instead of the synchronous rcu_barrier() + custom batching.
> > > This doesn't cost more while being logically simpler and way more
> > > scalable.
> > 
> > The point of rcu_barrier() is to wait until all rcu calls freeing slabs
> > from the cache being destroyed are over (rcu_free_slab, kmem_rcu_free).
> > I'm not sure if call_rcu() guarantees that for all rcu implementations
> > too. If it did, why would we need rcu_barrier() at all?
> 
> Yeah, I had a similar question and scanned its users briefly.  Looks
> like it's used in combination with ctors so that its users can
> opportunistically dereference objects and e.g. check ids / state /
> whatever without worrying about the objects' lifetimes.

Hello, Tejun.

Long time no see! :)

IIUC, rcu_barrier() here prevents the kmem_cache from being destroyed
until all slab pages in it are freed.  These slab pages are freed
through call_rcu().
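
In other words, the current path is roughly the following (simplified
sketch; kmem_rcu_free is SLAB's slab-page callback):

	/* freeing a slab page of a SLAB_DESTROY_BY_RCU cache */
	call_rcu(&page->rcu_head, kmem_rcu_free); /* reads the kmem_cache */

	/* later, in the cache release path */
	rcu_barrier(); /* every kmem_rcu_free() queued so far has finished */
	/* only then is the kmem_cache itself freed */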

Your patch changes it to another call_rcu() and, I think, if the
sequence of executing rcu callbacks is the same as the sequence of
adding rcu callbacks, it would work.  However, I'm not sure that this
is guaranteed by the RCU API.  Am I missing something?

Thanks.


Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path

2017-01-14 Thread Tejun Heo
Hello, Vladimir.

On Sat, Jan 14, 2017 at 04:19:39PM +0300, Vladimir Davydov wrote:
> On Sat, Jan 14, 2017 at 12:54:42AM -0500, Tejun Heo wrote:
> > This patch updates the cache release path so that it simply uses
> > call_rcu() instead of the synchronous rcu_barrier() + custom batching.
> > This doesn't cost more while being logically simpler and way more
> > scalable.
> 
> The point of rcu_barrier() is to wait until all rcu calls freeing slabs
> from the cache being destroyed are over (rcu_free_slab, kmem_rcu_free).
> I'm not sure if call_rcu() guarantees that for all rcu implementations
> too. If it did, why would we need rcu_barrier() at all?

Yeah, I had a similar question and scanned its users briefly.  Looks
like it's used in combination with ctors so that its users can
opportunistically dereference objects and e.g. check ids / state /
whatever without worrying about the objects' lifetimes.
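
I.e. the usual SLAB_DESTROY_BY_RCU pattern, something like the sketch
below (lookup(), ->id and ->refcnt are made up for illustration):

	rcu_read_lock();
	obj = lookup(id);	/* slot may be freed and reused in parallel */
	/*
	 * The memory stays type-stable, so it's safe to test fields and
	 * try to take a reference; a real user rechecks identity after
	 * the reference is taken.
	 */
	if (obj && obj->id == id && atomic_inc_not_zero(&obj->refcnt))
		found = obj;
	rcu_read_unlock();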

Thanks.

-- 
tejun


Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path

2017-01-14 Thread Vladimir Davydov
Hello Tejun,

Thanks a lot for looking into this issue as it seems to affect a lot of
users!

On Sat, Jan 14, 2017 at 12:54:42AM -0500, Tejun Heo wrote:
> This patch updates the cache release path so that it simply uses
> call_rcu() instead of the synchronous rcu_barrier() + custom batching.
> This doesn't cost more while being logically simpler and way more
> scalable.

The point of rcu_barrier() is to wait until all rcu calls freeing slabs
from the cache being destroyed are over (rcu_free_slab, kmem_rcu_free).
I'm not sure if call_rcu() guarantees that for all rcu implementations
too. If it did, why would we need rcu_barrier() at all?


[PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

SLAB_DESTROY_BY_RCU caches need an RCU grace period before destruction.
Currently, it's done synchronously with rcu_barrier().  As
rcu_barrier() is expensive time-wise, slab implements a batching
mechanism so that rcu_barrier() can be done for multiple caches at the
same time.

Unfortunately, the rcu_barrier() is in a synchronous path which is
called while holding cgroup_mutex, and the batching is too limited to
be actually helpful.  Besides, the batching is just a very degenerate
form of the actual RCU callback mechanism.

This patch updates the cache release path so that it simply uses
call_rcu() instead of the synchronous rcu_barrier() + custom batching.
This doesn't cost more while being logically simpler and way more
scalable.

* ->rcu_head is added to kmem_cache structs.  It shares storage space
  with ->list.

* slub sysfs removal and release are separated and the former is now
  called from __kmem_cache_shutdown() while the latter is called from
  the release path.  There's no reason to defer sysfs removal through
  RCU and this makes it unnecessary to bounce to workqueue from the
  RCU callback.

* release_caches() is removed and shutdown_cache() now either directly
  releases the cache or schedules an RCU callback to do that.  This
  makes the cache inaccessible once shutdown_cache() is called and
  makes it impossible for shutdown_memcg_caches() to do memcg-specific
  cleanups afterwards.  Move the memcg-specific part into a helper,
  unlink_memcg_cache(), and make shutdown_cache() call it directly;
  the resulting flow is sketched below.
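
Roughly, the resulting flow looks like the following (simplified
sketch; function names are approximate and the is_root_cache() test is
illustrative, see the diff below for the actual code):

static void kmem_cache_release_rcu(struct rcu_head *head)
{
	struct kmem_cache *s = container_of(head, struct kmem_cache, rcu_head);

	/* slub sysfs removal already happened in __kmem_cache_shutdown() */
	slab_kmem_cache_release(s);
}

static int shutdown_cache(struct kmem_cache *s)
{
	if (__kmem_cache_shutdown(s) != 0)
		return -EBUSY;

	/* the cache is unreachable once it is off the cache lists */
	list_del(&s->list);
	if (!is_root_cache(s))
		unlink_memcg_cache(s);

	if (s->flags & SLAB_DESTROY_BY_RCU)
		call_rcu(&s->rcu_head, kmem_cache_release_rcu);
	else
		slab_kmem_cache_release(s);

	return 0;
}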

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/slab_def.h |  5 ++-
 include/linux/slub_def.h |  9 --
 mm/slab.h                |  5 ++-
 mm/slab_common.c         | 84
 mm/slub.c                |  9 +-
 5 files changed, 57 insertions(+), 55 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 4ad2c5a..b649629 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -39,7 +39,10 @@ struct kmem_cache {
 
 /* 4) cache creation/removal */
 	const char *name;
-	struct list_head list;
+	union {
+		struct list_head list;
+		struct rcu_head rcu_head;
+	};
 	int refcount;
 	int object_size;
 	int align;
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 75f56c2..7637b41 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -80,7 +80,10 @@ struct kmem_cache {
 	int align;		/* Alignment */
 	int reserved;		/* Reserved bytes at the end of slabs */
 	const char *name;	/* Name (only for display!) */
-	struct list_head list;	/* List of slab caches */
+	union {
+		struct list_head list;	/* List of slab caches */
+		struct rcu_head rcu_head;
+	};
 	int red_left_pad;	/* Left redzone padding size */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
@@ -113,9 +116,9 @@ struct kmem_cache {
 
 #ifdef CONFIG_SYSFS
 #define SLAB_SUPPORTS_SYSFS
-void sysfs_slab_remove(struct kmem_cache *);
+void sysfs_slab_release(struct kmem_cache *);
 #else
-static inline void sysfs_slab_remove(struct kmem_cache *s)
+static inline void sysfs_slab_release(struct kmem_cache *s)
 {
 }
 #endif
diff --git a/mm/slab.h b/mm/slab.h
index 4acc644..3fa2d77 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -24,7 +24,10 @@ struct kmem_cache {
 	const char *name;	/* Slab name for sysfs */
 	int refcount;		/* Use counter */
 	void (*ctor)(void *);	/* Called on object slot creation */
-	struct list_head list;	/* List of all slab caches on the system */
+	union {
+		struct list_head list;	/* List of all slab caches on the system */
+		struct rcu_head rcu_head;
+	};
 };
 
 #endif /* CONFIG_SLOB */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 46ff746..851c75e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -215,6 +215,11 @@ int memcg_update_all_caches(int num_memcgs)
 	mutex_unlock(&slab_mutex);
 	return ret;
 }
+
+static void unlink_memcg_cache(struct kmem_cache *s)
+{
+	list_del(&s->memcg_params.list);
+}
 #else
 static inline int init_memcg_params(struct kmem_cache *s,