Re: [PATCH v2] IB/mlx5: Reduce max order of memory allocated for xlt update
> On Mar 23, 2021, at 4:13 PM, Jason Gunthorpe wrote:
>
> On Tue, Mar 23, 2021 at 12:41:51PM -0700, Aruna Ramakrishna wrote:
>> There is a far greater possibility of an order-8 allocation failing,
>> esp. with the addition of __GFP_NORETRY, and the code would have to
>> fall back to a lower order allocation more often than not (esp. on a
>> long running system). Unless the performance gains from using order-8
>> pages are significant (and it does not seem that way to me), we can
>> just skip this step and directly go to the lower order allocation.
>
> Do not send HTML mails.

I apologize; I've fixed the setting now.

> Do you have benchmarks that show the performance of the high order
> pages is not relevant? I'm a bit surprised to hear that

I guess my point was more to the effect that an order-8 alloc will fail
more often than not, in this flow. For instance, when we were debugging
the latency spikes here, this was the typical buddyinfo output on that
system:

Node 0, zone    DMA      0      1      1     2     3    0   1   0   1   1   3
Node 0, zone  DMA32      7      7      7     6    10    2   6   7   6   2306
Node 0, zone Normal   3390  51354  17574  6556  1586   26   2   1   0   0   0
Node 1, zone Normal  11519  23315  23306  9738    73    2   0   1   0   0   0

I think this level of fragmentation is pretty normal on long running
systems. Here, in the reg_mr flow, the first try (order-8) alloc will
probably fail 9 times out of 10 (esp. after the addition of the
__GFP_NORETRY flag), and then as fallback, the code tries to allocate a
lower order, and if that too fails, it allocates a page. I think it
makes sense to just avoid trying an order-8 alloc here.

Thanks,
Aruna

> This code really needs some attention to use a proper
> scatter/gather. I understand the chip can do it, just some of the
> software layers need to be stripped away so it can form the right SGL
> in the HW.
>
> Jason
Re: [PATCH 2/5] tracing: Verify if trace array exists before destroying it.
On 08/14/2019 10:55 AM, Divya Indi wrote:
> A trace array can be destroyed from userspace or kernel. Verify if the
> trace exists before proceeding to destroy/remove it.
       ^^ s/trace/trace array/

As you pointed out yourself. :)

All the patches look good to me. For this patchset:

Reviewed-by: Aruna Ramakrishna

Thanks,
Aruna
Re: [PATCH v4 resend] mm/slab: Improve performance of gathering slabinfo stats
On 08/29/2016 05:44 PM, Aruna Ramakrishna wrote:
> On large systems, when some slab caches grow to millions of objects
> (and many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
> seconds. During this time, interrupts are disabled while walking the
> slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
> and this sometimes causes timeouts in other drivers (for instance,
> Infiniband).
>
> This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
> the total number of allocated slabs per node, per cache. This counter
> is updated when a slab is created or destroyed. This enables us to
> skip traversing the slabs_full list while gathering slabinfo
> statistics, and since slabs_full tends to be the biggest list when the
> cache is large, it results in a dramatic performance improvement.
> Getting slabinfo statistics now only requires walking the slabs_free
> and slabs_partial lists, and those lists are usually much smaller than
> slabs_full. We tested this after growing the dentry cache to 70GB, and
> the performance improved from 2s to 5ms.
>
> Signed-off-by: Aruna Ramakrishna <aruna.ramakris...@oracle.com>
> Cc: Mike Kravetz <mike.krav...@oracle.com>
> Cc: Christoph Lameter <c...@linux.com>
> Cc: Pekka Enberg <penb...@kernel.org>
> Cc: David Rientjes <rient...@google.com>
> Cc: Joonsoo Kim <iamjoonsoo@lge.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> ---
> Note: this has been tested only on x86_64.

This patch has spawned off a very interesting discussion in an older
thread, and I guess the latest incarnation of this patch got buried.
I'm resending it for review/approval.

Thanks,
Aruna
[PATCH v4 resend] mm/slab: Improve performance of gathering slabinfo stats
On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
seconds. During this time, interrupts are disabled while walking the
slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
and this sometimes causes timeouts in other drivers (for instance,
Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
the total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large,
it results in a dramatic performance improvement. Getting slabinfo
statistics now only requires walking the slabs_free and slabs_partial
lists, and those lists are usually much smaller than slabs_full. We
tested this after growing the dentry cache to 70GB, and the performance
improved from 2s to 5ms.

Signed-off-by: Aruna Ramakrishna <aruna.ramakris...@oracle.com>
Cc: Mike Kravetz <mike.krav...@oracle.com>
Cc: Christoph Lameter <c...@linux.com>
Cc: Pekka Enberg <penb...@kernel.org>
Cc: David Rientjes <rient...@google.com>
Cc: Joonsoo Kim <iamjoonsoo@lge.com>
Cc: Andrew Morton <a...@linux-foundation.org>
---
Note: this has been tested only on x86_64.
 mm/slab.c | 43 ++++++++++++++++++++++++++++----------------
 mm/slab.h |  1 +
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b672710..042017e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,6 +233,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
+	parent->num_slabs = 0;
 }
 
 #define MAKE_LIST(cachep, listp, slab, nodeid)				\
@@ -1394,24 +1395,27 @@ slab_out_of_memory(struct kmem_cache *cachep, gfp_t gfpflags, int nodeid)
 	for_each_kmem_cache_node(cachep, node, n) {
 		unsigned long active_objs = 0, num_objs = 0, free_objects = 0;
 		unsigned long active_slabs = 0, num_slabs = 0;
+		unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+		unsigned long num_slabs_full;
 
 		spin_lock_irqsave(&n->list_lock, flags);
-		list_for_each_entry(page, &n->slabs_full, lru) {
-			active_objs += cachep->num;
-			active_slabs++;
-		}
+		num_slabs = n->num_slabs;
 		list_for_each_entry(page, &n->slabs_partial, lru) {
 			active_objs += page->active;
-			active_slabs++;
+			num_slabs_partial++;
 		}
 		list_for_each_entry(page, &n->slabs_free, lru)
-			num_slabs++;
+			num_slabs_free++;
 
 		free_objects += n->free_objects;
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
-		num_slabs += active_slabs;
 		num_objs = num_slabs * cachep->num;
+		active_slabs = num_slabs - num_slabs_free;
+		num_slabs_full = num_slabs -
+			(num_slabs_partial + num_slabs_free);
+		active_objs += (num_slabs_full * cachep->num);
+
 		pr_warn("  node %d: slabs: %ld/%ld, objs: %ld/%ld, free: %ld\n",
 			node, active_slabs, num_slabs, active_objs, num_objs,
 			free_objects);
@@ -2326,6 +2330,7 @@ static int drain_freelist(struct kmem_cache *cache,
 		page = list_entry(p, struct page, lru);
 		list_del(&page->lru);
+		n->num_slabs--;
 		/*
 		 * Safe to drop the lock. The slab is no longer linked
 		 * to the cache.
@@ -2764,6 +2769,8 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
 		list_add_tail(&page->lru, &(n->slabs_free));
 	else
 		fixup_slab_list(cachep, n, page, &list);
+
+	n->num_slabs++;
 	STATS_INC_GROWN(cachep);
 	n->free_objects += cachep->num - page->active;
 	spin_unlock(&n->list_lock);
@@ -3455,6 +3462,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 			page = list_last_entry(&n->slabs_free,
 					       struct page, lru);
 			list_move(&page->lru, list);
+			n->num_slabs--;
 		}
 	}
@@ -4111,6 +4119,8 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	unsigned long num_objs;
 	unsigned long active_slabs = 0;
 	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
+	unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+	unsigned long num_slabs_full = 0;
 	const char *name;
 	char *error = NULL;
 	int node;
@@ -4123,33 +4133,34 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	check_irq_on();
Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo stats
On 08/18/2016 04:52 AM, Michal Hocko wrote:
> I am not opposing the patch (to be honest it is quite neat) but this
> has been bugging me for quite some time. Sorry for hijacking this
> email thread but I couldn't resist. Why are we trying to optimize SLAB
> and slowly converge it to SLUB feature-wise? I always thought that
> SLAB should remain the stable and time-challenged solution which works
> reasonably well for many/most workloads, while SLUB is an optimized
> implementation which experiments with slightly different concepts that
> might boost performance considerably but might also surprise from time
> to time. If this is not the case then why do we have both of them in
> the kernel? It is a lot of code, and some features need tweaking in
> both while only one gets testing coverage. So this is mainly a
> question for the maintainers: why do we maintain both, and what is the
> purpose of each?

Michal,

Speaking about this patch specifically: I'm not trying to optimize SLAB
or make it more similar to SLUB. This patch is a bug fix for an issue
where the slowness of 'cat /proc/slabinfo' caused timeouts in other
drivers. While optimizing that flow, it became apparent (as Christoph
pointed out) that one could converge this patch toward SLUB's current
implementation. Though I have not done that in this patch (because that
warrants a separate patch), I think it makes sense to converge where
appropriate, since the two allocators already share some common data
structures and code.

Thanks,
Aruna
Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo stats
On 08/17/2016 12:03 PM, Eric Dumazet wrote:
> On Wed, 2016-08-17 at 11:20 -0700, Aruna Ramakrishna wrote:
>> -	list_for_each_entry(page, &n->slabs_full, lru) {
>> -		if (page->active != cachep->num && !error)
>> -			error = "slabs_full accounting error";
>> -		active_objs += cachep->num;
>> -		active_slabs++;
>> -	}
>
> Since you only removed this loop, you could track only the number of
> full slabs. This would avoid messing with n->num_slabs all over the
> place in the fast path.
>
> Please also update slab_out_of_memory()

Eric,

Right now, n->num_slabs is modified only when a slab is detached from
slabs_free (i.e. in drain_freelist and free_block) or when a new one is
attached in cache_grow_end. None of those three calls are in the fast
path, right?

Tracking just full slabs would involve similar changes: decrement when a
slab moves from full to partial during free_block, and increment when it
moves from partial/free to full after allocation in fixup_slab_list. So
I don't see what the real difference/advantage is.

I will update slab_out_of_memory and remove the slabs_full list
traversal there too.

Thanks,
Aruna
[PATCH v3] mm/slab: Improve performance of gathering slabinfo stats
On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
seconds. During this time, interrupts are disabled while walking the
slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
and this sometimes causes timeouts in other drivers (for instance,
Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
the total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large,
it results in a dramatic performance improvement. Getting slabinfo
statistics now only requires walking the slabs_free and slabs_partial
lists, and those lists are usually much smaller than slabs_full. We
tested this after growing the dentry cache to 70GB, and the performance
improved from 2s to 5ms.

Signed-off-by: Aruna Ramakrishna <aruna.ramakris...@oracle.com>
Cc: Mike Kravetz <mike.krav...@oracle.com>
Cc: Christoph Lameter <c...@linux.com>
Cc: Pekka Enberg <penb...@kernel.org>
Cc: David Rientjes <rient...@google.com>
Cc: Joonsoo Kim <iamjoonsoo@lge.com>
Cc: Andrew Morton <a...@linux-foundation.org>
---
Note: this has been tested only on x86_64.

 mm/slab.c | 26 +++++++++++++++++---------
 mm/slab.h |  1 +
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b672710..3da34fe 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,6 +233,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
+	parent->num_slabs = 0;
 }
 
 #define MAKE_LIST(cachep, listp, slab, nodeid)				\
@@ -2326,6 +2327,7 @@ static int drain_freelist(struct kmem_cache *cache,
 		page = list_entry(p, struct page, lru);
 		list_del(&page->lru);
+		n->num_slabs--;
 		/*
 		 * Safe to drop the lock. The slab is no longer linked
 		 * to the cache.
@@ -2764,6 +2766,8 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
 		list_add_tail(&page->lru, &(n->slabs_free));
 	else
 		fixup_slab_list(cachep, n, page, &list);
+
+	n->num_slabs++;
 	STATS_INC_GROWN(cachep);
 	n->free_objects += cachep->num - page->active;
 	spin_unlock(&n->list_lock);
@@ -3455,6 +3459,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 			page = list_last_entry(&n->slabs_free,
 					       struct page, lru);
 			list_move(&page->lru, list);
+			n->num_slabs--;
 		}
 	}
@@ -4111,6 +4116,8 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	unsigned long num_objs;
 	unsigned long active_slabs = 0;
 	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
+	unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+	unsigned long num_slabs_full = 0;
 	const char *name;
 	char *error = NULL;
 	int node;
@@ -4123,33 +4130,34 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	check_irq_on();
 		spin_lock_irq(&n->list_lock);
-		list_for_each_entry(page, &n->slabs_full, lru) {
-			if (page->active != cachep->num && !error)
-				error = "slabs_full accounting error";
-			active_objs += cachep->num;
-			active_slabs++;
-		}
+		num_slabs += n->num_slabs;
+
 		list_for_each_entry(page, &n->slabs_partial, lru) {
 			if (page->active == cachep->num && !error)
				error = "slabs_partial accounting error";
 			if (!page->active && !error)
 				error = "slabs_partial accounting error";
 			active_objs += page->active;
-			active_slabs++;
+			num_slabs_partial++;
 		}
+
 		list_for_each_entry(page, &n->slabs_free, lru) {
 			if (page->active && !error)
 				error = "slabs_free accounting error";
-			num_slabs++;
+			num_slabs_free++;
 		}
+
 		free_objects += n->free_objects;
 		if (n->shared)
 			shared_avail += n->shared->avail;
 		spin_unlock_irq(&n->list_lock);
 	}
-	num_slabs += active_slabs;
 	num_objs = num_slabs * cachep->num;
+	active_slabs = num_slabs - num_slabs_free;
+	num_slabs_full = num_slabs - (num_slabs_partial + num_slabs_free);
+	active_objs += (num_slabs_full * cachep->num);
Re: [PATCH v2] mm/slab: Improve performance of gathering slabinfo stats
On 08/16/2016 08:52 AM, Christoph Lameter wrote:
> On Tue, 16 Aug 2016, Joonsoo Kim wrote:
>> In SLUB, nr_slabs is manipulated without holding a lock, so an atomic
>> operation should be used.
>
> It could be moved under the node lock.

Christoph, Joonsoo,

I agree that nr_slabs could be common between SLAB and SLUB, but I think that should be a separate patch, since converting nr_slabs to unsigned long for SLUB will cause quite a bit of change in mm/slub.c that is not related to adding counters to SLAB.

I'll send out an updated slab counters patch with Joonsoo's suggested fix tomorrow (nr_slabs will be unsigned long for SLAB only, and there will be a separate definition for SLUB), and once that's in, I'll create a new patch that makes nr_slabs common for SLAB and SLUB, and also converts total_objects to unsigned long. Maybe it can include some more cleanup too.

Does that sound acceptable?

Thanks,
Aruna
Re: [PATCH v2] mm/slab: Improve performance of gathering slabinfo stats
On 08/04/2016 02:06 PM, Andrew Morton wrote:
> On Thu, 4 Aug 2016 12:01:13 -0700 Aruna Ramakrishna <aruna.ramakris...@oracle.com> wrote:
>> On large systems, when some slab caches grow to millions of objects (and
>> many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
>> During this time, interrupts are disabled while walking the slab lists
>> (slabs_full, slabs_partial, and slabs_free) for each node, and this
>> sometimes causes timeouts in other drivers (for instance, Infiniband).
>>
>> This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
>> total number of allocated slabs per node, per cache. This counter is
>> updated when a slab is created or destroyed. This enables us to skip
>> traversing the slabs_full list while gathering slabinfo statistics, and
>> since slabs_full tends to be the biggest list when the cache is large, it
>> results in a dramatic performance improvement. Getting slabinfo
>> statistics now only requires walking the slabs_free and slabs_partial
>> lists, and those lists are usually much smaller than slabs_full. We
>> tested this after growing the dentry cache to 70GB, and the performance
>> improved from 2s to 5ms.
>
> I assume this is tested on both slab and slub? It isn't the smallest of
> patches but given the seriousness of the problem I think I'll tag it for
> -stable backporting.

This was only sanity-checked on slub. The performance tests were only run on slab.

Thanks,
Aruna
[PATCH v2] mm/slab: Improve performance of gathering slabinfo stats
On large systems, when some slab caches grow to millions of objects (and many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds. During this time, interrupts are disabled while walking the slab lists (slabs_full, slabs_partial, and slabs_free) for each node, and this sometimes causes timeouts in other drivers (for instance, Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for total number of allocated slabs per node, per cache. This counter is updated when a slab is created or destroyed. This enables us to skip traversing the slabs_full list while gathering slabinfo statistics, and since slabs_full tends to be the biggest list when the cache is large, it results in a dramatic performance improvement. Getting slabinfo statistics now only requires walking the slabs_free and slabs_partial lists, and those lists are usually much smaller than slabs_full. We tested this after growing the dentry cache to 70GB, and the performance improved from 2s to 5ms.

Signed-off-by: Aruna Ramakrishna <aruna.ramakris...@oracle.com>
Cc: Mike Kravetz <mike.krav...@oracle.com>
Cc: Christoph Lameter <c...@linux.com>
Cc: Pekka Enberg <penb...@kernel.org>
Cc: David Rientjes <rient...@google.com>
Cc: Joonsoo Kim <iamjoonsoo@lge.com>
Cc: Andrew Morton <a...@linux-foundation.org>
---
Note: this has been tested only on x86_64.

 mm/slab.c | 25 -
 mm/slab.h | 15 ++-
 mm/slub.c | 19 +--
 3 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 261147b..d683840 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,6 +233,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
+	atomic_long_set(&parent->nr_slabs, 0);
 }
 
 #define MAKE_LIST(cachep, listp, slab, nodeid)				\
@@ -2333,6 +2334,7 @@ static int drain_freelist(struct kmem_cache *cache,
 		n->free_objects -= cache->num;
 		spin_unlock_irq(&n->list_lock);
 		slab_destroy(cache, page);
+		atomic_long_dec(&n->nr_slabs);
 		nr_freed++;
 	}
 out:
@@ -2736,6 +2738,8 @@ static struct page *cache_grow_begin(struct kmem_cache *cachep,
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
 
+	atomic_long_inc(&n->nr_slabs);
+
 	return page;
 
 opps1:
@@ -3455,6 +3459,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 		page = list_last_entry(&n->slabs_free, struct page, lru);
 		list_move(&page->lru, list);
+		atomic_long_dec(&n->nr_slabs);
 	}
 }
@@ -4111,6 +4116,8 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	unsigned long num_objs;
 	unsigned long active_slabs = 0;
 	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
+	unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+	unsigned long num_slabs_full = 0;
 	const char *name;
 	char *error = NULL;
 	int node;
@@ -4120,36 +4127,36 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	num_slabs = 0;
 	for_each_kmem_cache_node(cachep, node, n) {
+		num_slabs += node_nr_slabs(n);
 		check_irq_on();
 		spin_lock_irq(&n->list_lock);
 
-		list_for_each_entry(page, &n->slabs_full, lru) {
-			if (page->active != cachep->num && !error)
-				error = "slabs_full accounting error";
-			active_objs += cachep->num;
-			active_slabs++;
-		}
 		list_for_each_entry(page, &n->slabs_partial, lru) {
 			if (page->active == cachep->num && !error)
 				error = "slabs_partial accounting error";
 			if (!page->active && !error)
 				error = "slabs_partial accounting error";
 			active_objs += page->active;
-			active_slabs++;
+			num_slabs_partial++;
 		}
+
 		list_for_each_entry(page, &n->slabs_free, lru) {
 			if (page->active && !error)
 				error = "slabs_free accounting error";
-			num_slabs++;
+			num_slabs_free++;
 		}
+
 		free_objects += n->free_objects;
 		if (n->shared)
 			shared_avail += n->shared->avail;
 
 		spin_unlock_irq(&n->list_lock);
 	}
-	num_slabs += active_slabs;
 	num_objs = num_slabs * cachep->num;
+	active_slabs = num_slabs - num_slabs_free;
+	num_slabs_full = num_slabs - (num_slabs_partial + num_slabs_free);
+	active_objs += (num_slabs_full * cachep->num);
Re: [PATCH] mm/slab: Improve performance of gathering slabinfo stats
On 08/02/2016 07:59 AM, Christoph Lameter wrote:
> Hmm. What SLUB does is:
>
> 1. Keep a count of the total number of allocated slab pages per node.
>    This counter only needs to be updated when a slab page is allocated
>    from the page allocator or when it is freed to the page allocator. At
>    that point we already hold the per node lock, page allocator operations
>    are extremely costly anyways and so that is ok.
>
> 2. Keep a count of the number of partially allocated slab pages per node.
>    At that point we have to access the partial list and take a per node
>    lock. Placing the counter into the same cacheline and the
>    increment/decrement into the period when the lock has been taken
>    avoids the overhead.

As Joonsoo mentioned in his previous comment, the partial list is pretty small anyway. And we cannot avoid traversal of the partial list - we have to count the number of active objects in each partial slab:

	active_objs += page->active;

So keeping a count of partially allocated slabs seems unnecessary to me.

> The number of full pages is then total - partial.
>
> If both allocators would use the same scheme here then the code to
> maintain the counter can be moved into mm/slab_common.c. Plus the per
> node structures could be mostly harmonized between both allocators.
> Maybe even the page allocator operations could become common code.
>
> Aruna: Could you work on a solution like that?

Yup, I'll replace the 3 counters with one counter for the number of slabs per node and send out a new patch. I'll try to make the counter management as similar as possible between SLAB and SLUB.

Thanks,
Aruna
Re: [PATCH] mm/slab: Improve performance of gathering slabinfo stats
Hi Joonsoo,

On 08/01/2016 05:55 PM, Joonsoo Kim wrote:
> Your patch updates these counters not only when slabs are created and
> destroyed but also when an object is allocated/freed from the slab. This
> would hurt runtime performance.

The counters are not updated for each object allocation/free - only if that allocation/free results in that slab moving from one list (free/partial/full) to another.

>> slab lists for gathering slabinfo stats, resulting in a dramatic
>> performance improvement. We tested this after growing the dentry cache to
>> 70GB, and the performance improved from 2s to 2ms.
>
> Nice improvement. I can think of an alternative. I guess that the
> improvement of your change comes from skipping the iteration over the
> n->slabs_full list. We can achieve it just by introducing only num_slabs,
> which can be updated when slabs are created and destroyed.

Yes, slabs_full is typically the largest list.

> We can calculate num_slabs_full by the following equation:
>
> num_slabs_full = num_slabs - num_slabs_partial - num_slabs_free
>
> Calculating both num_slabs_partial and num_slabs_free by iterating the
> n->slabs_XXX lists would not take too much time.

Yes, this would work too. We cannot avoid traversal of slabs_partial, and slabs_free is usually a small list, so this should give us similar performance benefits. But having separate counters could also be useful for debugging, like the ones defined under CONFIG_DEBUG_SLAB/STATS. Won't that help?

Thanks,
Aruna
[PATCH] mm/slab: Improve performance of gathering slabinfo stats
On large systems, when some slab caches grow to millions of objects (and many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds. During this time, interrupts are disabled while walking the slab lists (slabs_full, slabs_partial, and slabs_free) for each node, and this sometimes causes timeouts in other drivers (for instance, Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining slab counters to keep track of number of slabs per node, per cache. These counters are updated as slabs are created and destroyed. This avoids having to scan the slab lists for gathering slabinfo stats, resulting in a dramatic performance improvement. We tested this after growing the dentry cache to 70GB, and the performance improved from 2s to 2ms.

Signed-off-by: Aruna Ramakrishna <aruna.ramakris...@oracle.com>
Cc: Mike Kravetz <mike.krav...@oracle.com>
Cc: Christoph Lameter <c...@linux.com>
Cc: Pekka Enberg <penb...@kernel.org>
Cc: David Rientjes <rient...@google.com>
Cc: Joonsoo Kim <iamjoonsoo@lge.com>
Cc: Andrew Morton <a...@linux-foundation.org>
---
Note: this has been tested only on x86_64.

 mm/slab.c | 81 ---
 mm/slab.h |  5
 2 files changed, 67 insertions(+), 19 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 09771ed..c205cd8 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,6 +233,11 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
+	parent->num_slabs_partial = 0;
+	parent->num_slabs_full = 0;
+	parent->num_slabs_free = 0;
+	parent->cache_grown = 0;
+	parent->using_free_slab = 0;
 }
 
 #define MAKE_LIST(cachep, listp, slab, nodeid)				\
@@ -2331,6 +2336,7 @@ static int drain_freelist(struct kmem_cache *cache,
 	 * to the cache.
 	 */
 	n->free_objects -= cache->num;
+	n->num_slabs_free--;
 	spin_unlock_irq(&n->list_lock);
 	slab_destroy(cache, page);
 	nr_freed++;
@@ -2731,6 +2737,14 @@ static struct page *cache_grow_begin(struct kmem_cache *cachep,
 	kasan_poison_slab(page);
 	cache_init_objs(cachep, page);
 
+	spin_lock(&n->list_lock);
+	/*
+	 * Track the cache growth until the slabs lists and counters
+	 * are adjusted.
+	 */
+	n->cache_grown++;
+	spin_unlock(&n->list_lock);
+
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
@@ -2758,12 +2772,14 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
 	n = get_node(cachep, page_to_nid(page));
 	spin_lock(&n->list_lock);
-	if (!page->active)
+	if (!page->active) {
 		list_add_tail(&page->lru, &(n->slabs_free));
-	else
+		n->num_slabs_free++;
+	} else
 		fixup_slab_list(cachep, n, page, &list);
 	STATS_INC_GROWN(cachep);
 	n->free_objects += cachep->num - page->active;
+	n->cache_grown--;
 	spin_unlock(&n->list_lock);
 
 	fixup_objfreelist_debug(cachep, &list);
@@ -2867,8 +2883,26 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
 {
 	/* move slabp to correct slabp list: */
 	list_del(&page->lru);
+	/*
+	 * If the cache was not grown, then this slabp was deleted from a
+	 * slabs_partial or a slabs_free list, and we decrement the counter
+	 * for the appropriate list here. The flag using_free_slab is set
+	 * whenever a slab from the slabs_free list is used for allocation.
+	 * If set, we know that a slab was deleted from slabs_free and will be
+	 * moved to slabs_partial (or slabs_full) later below. Otherwise,
+	 * it was deleted from slabs_partial.
+	 * If the cache was grown, slabp points to a new slab not present in
+	 * any list - so we do not decrement any counters.
+	 */
+	if (!n->cache_grown) {
+		if (n->using_free_slab)
+			n->num_slabs_free--;
+		else
+			n->num_slabs_partial--;
+	}
 	if (page->active == cachep->num) {
 		list_add(&page->lru, &n->slabs_full);
+		n->num_slabs_full++;
 		if (OBJFREELIST_SLAB(cachep)) {
 #if DEBUG
 			/* Poisoning will be done without holding the lock */
@@ -2881,8 +2915,10 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
 #endif
 			page->freelist = NULL;
 		}
-	} else
+	} else {
 		list_add(&page->lru, &n->slabs_partial);
+		n->num_slabs_partial++;
+	}
 }
 
 /* Try to find non-pfmemalloc slab if needed */
@@ -2912,8 +2948,10 @@ static noinline struct page *get_valid_first_slab(struct kmem_cache_node *n,
 		list_add_tail(&page->lru, &n->slabs_partial);
+*/ + n->cache_grown++; + spin_unlock(>list_lock); + if (gfpflags_allow_blocking(local_flags)) local_irq_disable(); @@ -2758,12 +2772,14 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page) n = get_node(cachep, page_to_nid(page)); spin_lock(>list_lock); - if (!page->active) + if (!page->active) { list_add_tail(>lru, &(n->slabs_free)); - else + n->num_slabs_free++; + } else fixup_slab_list(cachep, n, page, ); STATS_INC_GROWN(cachep); n->free_objects += cachep->num - page->active; + n->cache_grown--; spin_unlock(>list_lock); fixup_objfreelist_debug(cachep, ); @@ -2867,8 +2883,26 @@ static inline void fixup_slab_list(struct kmem_cache *cachep, { /* move slabp to correct slabp list: */ list_del(>lru); + /* +* If the cache was not grown, then this slabp was deleted from a +* slabs_partial or a slabs_free list, and we decrement the counter +* for the appropriate list here. The flag using_free_slab is set +* whenever a slab from the slabs_free list is used for allocation. +* If set, we know that a slab was deleted from slabs_free and will be +* moved to slabs_partial (or slabs_full) later below. Otherwise, +* it was deleted from slabs_partial. +* If the cache was grown, slabp points to a new slab not present in +* any list - so we do not decrement any counters. +*/ + if (!n->cache_grown) { + if (n->using_free_slab) + n->num_slabs_free--; + else + n->num_slabs_partial--; + } if (page->active == cachep->num) { list_add(>lru, >slabs_full); + n->num_slabs_full++; if (OBJFREELIST_SLAB(cachep)) { #if DEBUG /* Poisoning will be done without holding the lock */ @@ -2881,8 +2915,10 @@ static inline void fixup_slab_list(struct kmem_cache *cachep, #endif page->freelist = NULL; } - } else + } else { list_add(>lru, >slabs_partial); + n->num_slabs_partial++; + } } /* Try to find non-pfmemalloc slab if needed */ @@ -2912,8 +2948,10 @@ static noinline struct page *get_valid_first_slab(struct kmem_cache_node *n, list_add_tail(>lru, >slabs_partial); list_for_e