Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
On 7/19/19 2:20 AM, Michal Hocko wrote: > On Wed 17-07-19 16:24:12, Waiman Long wrote: >> Currently, a value of '1" is written to /sys/kernel/slab//shrink >> file to shrink the slab by flushing out all the per-cpu slabs and free >> slabs in partial lists. This can be useful to squeeze out a bit more memory >> under extreme condition as well as making the active object counts in >> /proc/slabinfo more accurate. >> >> This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON >> option is usually not enabled and "slub_memcg_sysfs=1" not set. Even >> if memcg sysfs is turned on, it is too cumbersome and impractical to >> manage all those per-memcg sysfs files in a real production system. >> >> So there is no practical way to shrink memcg caches. Fix this by >> enabling a proper write to the shrink sysfs file of the root cache >> to scan all the available memcg caches and shrink them as well. For a >> non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is >> on), only that cache will be shrunk when written. > I would mention that memcg unawareness was an overlook more than > anything else. The interface is intended to shrink all pcp data of the > cache. The fact that we are using per-memcg internal caches is an > implementation detail. > >> On a 2-socket 64-core 256-thread arm64 system with 64k page after >> a parallel kernel build, the the amount of memory occupied by slabs >> before shrinking slabs were: >> >> # grep task_struct /proc/slabinfo >> task_struct53137 53192 4288 614 : tunables00 >> 0 : slabdata872872 0 >> # grep "^S[lRU]" /proc/meminfo >> Slab:3936832 kB >> SReclaimable: 399104 kB >> SUnreclaim: 3537728 kB >> >> After shrinking slabs: >> >> # grep "^S[lRU]" /proc/meminfo >> Slab:1356288 kB >> SReclaimable: 263296 kB >> SUnreclaim: 1092992 kB >> # grep task_struct /proc/slabinfo >> task_struct 2764 6832 4288 614 : tunables00 >> 0 : slabdata112112 0 > Now that you are touching the documentation I would just add a note that > shrinking might be expensive and block other slab operations so it > should be used with some care. > Good point. I will update the patch to include such a note in the documentation. Thanks, Longman
Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
On Wed 17-07-19 16:24:12, Waiman Long wrote: > Currently, a value of '1" is written to /sys/kernel/slab//shrink > file to shrink the slab by flushing out all the per-cpu slabs and free > slabs in partial lists. This can be useful to squeeze out a bit more memory > under extreme condition as well as making the active object counts in > /proc/slabinfo more accurate. > > This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON > option is usually not enabled and "slub_memcg_sysfs=1" not set. Even > if memcg sysfs is turned on, it is too cumbersome and impractical to > manage all those per-memcg sysfs files in a real production system. > > So there is no practical way to shrink memcg caches. Fix this by > enabling a proper write to the shrink sysfs file of the root cache > to scan all the available memcg caches and shrink them as well. For a > non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is > on), only that cache will be shrunk when written. I would mention that memcg unawareness was an overlook more than anything else. The interface is intended to shrink all pcp data of the cache. The fact that we are using per-memcg internal caches is an implementation detail. > On a 2-socket 64-core 256-thread arm64 system with 64k page after > a parallel kernel build, the the amount of memory occupied by slabs > before shrinking slabs were: > > # grep task_struct /proc/slabinfo > task_struct53137 53192 4288 614 : tunables00 > 0 : slabdata872872 0 > # grep "^S[lRU]" /proc/meminfo > Slab:3936832 kB > SReclaimable: 399104 kB > SUnreclaim: 3537728 kB > > After shrinking slabs: > > # grep "^S[lRU]" /proc/meminfo > Slab:1356288 kB > SReclaimable: 263296 kB > SUnreclaim: 1092992 kB > # grep task_struct /proc/slabinfo > task_struct 2764 6832 4288 614 : tunables00 > 0 : slabdata112112 0 Now that you are touching the documentation I would just add a note that shrinking might be expensive and block other slab operations so it should be used with some care. > Signed-off-by: Waiman Long > Acked-by: Roman Gushchin The patch looks good to me. I do not feel qualified to give my ack but it is definitely a change in the good direction. Let's just be careful recommending people to use this as a workaround to over caching and resulting tilted stats. That needs to be addressed separately. Thanks! > --- > Documentation/ABI/testing/sysfs-kernel-slab | 12 --- > mm/slab.h | 1 + > mm/slab_common.c| 37 + > mm/slub.c | 2 +- > 4 files changed, 47 insertions(+), 5 deletions(-) > > diff --git a/Documentation/ABI/testing/sysfs-kernel-slab > b/Documentation/ABI/testing/sysfs-kernel-slab > index 29601d93a1c2..94ffd47fc8d7 100644 > --- a/Documentation/ABI/testing/sysfs-kernel-slab > +++ b/Documentation/ABI/testing/sysfs-kernel-slab > @@ -429,10 +429,14 @@ KernelVersion: 2.6.22 > Contact: Pekka Enberg , > Christoph Lameter > Description: > - The shrink file is written when memory should be reclaimed from > - a cache. Empty partial slabs are freed and the partial list is > - sorted so the slabs with the fewest available objects are used > - first. > + The shrink file is used to enable some unused slab cache > + memory to be reclaimed from a cache. Empty per-cpu > + or partial slabs are freed and the partial list is > + sorted so the slabs with the fewest available objects > + are used first. It only accepts a value of "1" on > + write for shrinking the cache. Other input values are > + considered invalid. If it is a root cache, all the > + child memcg caches will also be shrunk, if available. > > What:/sys/kernel/slab/cache/slab_size > Date:May 2007 > diff --git a/mm/slab.h b/mm/slab.h > index 9057b8056b07..5bf615cb3f99 100644 > --- a/mm/slab.h > +++ b/mm/slab.h > @@ -174,6 +174,7 @@ int __kmem_cache_shrink(struct kmem_cache *); > void __kmemcg_cache_deactivate(struct kmem_cache *s); > void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s); > void slab_kmem_cache_release(struct kmem_cache *); > +void kmem_cache_shrink_all(struct kmem_cache *s); > > struct seq_file; > struct file; > diff --git a/mm/slab_common.c b/mm/slab_common.c > index 807490fe217a..6491c3a41805 100644 > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -981,6 +981,43 @@ int kmem_cache_shrink(struct kmem_cache *cachep) > } > EXPORT_SYMBOL(kmem_cache_shrink); > > +/** > + * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache > + * @s: The cache pointer > + */ > +void kmem_cache_shrink_all(struct kmem_cache *s) > +{ > + struct kmem_
Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
On Thu, Jul 18, 2019 at 11:38:11AM +, Christopher Lameter wrote: > On Wed, 17 Jul 2019, Waiman Long wrote: > > > Currently, a value of '1" is written to /sys/kernel/slab//shrink > > file to shrink the slab by flushing out all the per-cpu slabs and free > > slabs in partial lists. This can be useful to squeeze out a bit more memory > > under extreme condition as well as making the active object counts in > > /proc/slabinfo more accurate. > > Acked-by: Christoph Lameter > > > # grep task_struct /proc/slabinfo > > task_struct53137 53192 4288 614 : tunables00 > > 0 : slabdata872872 0 > > # grep "^S[lRU]" /proc/meminfo > > Slab:3936832 kB > > SReclaimable: 399104 kB > > SUnreclaim: 3537728 kB > > > > After shrinking slabs: > > > > # grep "^S[lRU]" /proc/meminfo > > Slab:1356288 kB > > SReclaimable: 263296 kB > > SUnreclaim: 1092992 kB > > Well another indicator that it may not be a good decision to replicate the > whole set of slabs for each memcg. Migrate the memcg ownership into the > objects may allow the use of the same slab cache. In particular together > with the slab migration patches this may be a viable way to reduce memory > consumption. > Btw I'm working on an alternative solution. It's way too early to present anything, but preliminary results are looking promising: slab memory usage is decreased by 10-40% depending on the workload. Thanks!
Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
On Wed, 17 Jul 2019, Waiman Long wrote: > Currently, a value of '1" is written to /sys/kernel/slab//shrink > file to shrink the slab by flushing out all the per-cpu slabs and free > slabs in partial lists. This can be useful to squeeze out a bit more memory > under extreme condition as well as making the active object counts in > /proc/slabinfo more accurate. Acked-by: Christoph Lameter > # grep task_struct /proc/slabinfo > task_struct53137 53192 4288 614 : tunables00 > 0 : slabdata872872 0 > # grep "^S[lRU]" /proc/meminfo > Slab:3936832 kB > SReclaimable: 399104 kB > SUnreclaim: 3537728 kB > > After shrinking slabs: > > # grep "^S[lRU]" /proc/meminfo > Slab:1356288 kB > SReclaimable: 263296 kB > SUnreclaim: 1092992 kB Well another indicator that it may not be a good decision to replicate the whole set of slabs for each memcg. Migrate the memcg ownership into the objects may allow the use of the same slab cache. In particular together with the slab migration patches this may be a viable way to reduce memory consumption.
[PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
Currently, a value of '1" is written to /sys/kernel/slab//shrink file to shrink the slab by flushing out all the per-cpu slabs and free slabs in partial lists. This can be useful to squeeze out a bit more memory under extreme condition as well as making the active object counts in /proc/slabinfo more accurate. This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON option is usually not enabled and "slub_memcg_sysfs=1" not set. Even if memcg sysfs is turned on, it is too cumbersome and impractical to manage all those per-memcg sysfs files in a real production system. So there is no practical way to shrink memcg caches. Fix this by enabling a proper write to the shrink sysfs file of the root cache to scan all the available memcg caches and shrink them as well. For a non-root memcg cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that cache will be shrunk when written. On a 2-socket 64-core 256-thread arm64 system with 64k page after a parallel kernel build, the the amount of memory occupied by slabs before shrinking slabs were: # grep task_struct /proc/slabinfo task_struct53137 53192 4288 614 : tunables00 0 : slabdata872872 0 # grep "^S[lRU]" /proc/meminfo Slab:3936832 kB SReclaimable: 399104 kB SUnreclaim: 3537728 kB After shrinking slabs: # grep "^S[lRU]" /proc/meminfo Slab:1356288 kB SReclaimable: 263296 kB SUnreclaim: 1092992 kB # grep task_struct /proc/slabinfo task_struct 2764 6832 4288 614 : tunables00 0 : slabdata112112 0 Signed-off-by: Waiman Long Acked-by: Roman Gushchin --- Documentation/ABI/testing/sysfs-kernel-slab | 12 --- mm/slab.h | 1 + mm/slab_common.c| 37 + mm/slub.c | 2 +- 4 files changed, 47 insertions(+), 5 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab index 29601d93a1c2..94ffd47fc8d7 100644 --- a/Documentation/ABI/testing/sysfs-kernel-slab +++ b/Documentation/ABI/testing/sysfs-kernel-slab @@ -429,10 +429,14 @@ KernelVersion:2.6.22 Contact: Pekka Enberg , Christoph Lameter Description: - The shrink file is written when memory should be reclaimed from - a cache. Empty partial slabs are freed and the partial list is - sorted so the slabs with the fewest available objects are used - first. + The shrink file is used to enable some unused slab cache + memory to be reclaimed from a cache. Empty per-cpu + or partial slabs are freed and the partial list is + sorted so the slabs with the fewest available objects + are used first. It only accepts a value of "1" on + write for shrinking the cache. Other input values are + considered invalid. If it is a root cache, all the + child memcg caches will also be shrunk, if available. What: /sys/kernel/slab/cache/slab_size Date: May 2007 diff --git a/mm/slab.h b/mm/slab.h index 9057b8056b07..5bf615cb3f99 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -174,6 +174,7 @@ int __kmem_cache_shrink(struct kmem_cache *); void __kmemcg_cache_deactivate(struct kmem_cache *s); void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s); void slab_kmem_cache_release(struct kmem_cache *); +void kmem_cache_shrink_all(struct kmem_cache *s); struct seq_file; struct file; diff --git a/mm/slab_common.c b/mm/slab_common.c index 807490fe217a..6491c3a41805 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -981,6 +981,43 @@ int kmem_cache_shrink(struct kmem_cache *cachep) } EXPORT_SYMBOL(kmem_cache_shrink); +/** + * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache + * @s: The cache pointer + */ +void kmem_cache_shrink_all(struct kmem_cache *s) +{ + struct kmem_cache *c; + + if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) { + kmem_cache_shrink(s); + return; + } + + get_online_cpus(); + get_online_mems(); + kasan_cache_shrink(s); + __kmem_cache_shrink(s); + + /* +* We have to take the slab_mutex to protect from the memcg list +* modification. +*/ + mutex_lock(&slab_mutex); + for_each_memcg_cache(c, s) { + /* +* Don't need to shrink deactivated memcg caches. +*/ + if (s->flags & SLAB_DEACTIVATED) + continue; + kasan_cache_shrink(c); + __kmem_cache_shrink(c); + } + mutex_unlock(&slab_mutex); + put_online_mems(); + put_online_cpus(); +} + bool slab_is_available(void) { return slab_state >= UP; di