Re: [PATCH v2] IB/mlx5: Reduce max order of memory allocated for xlt update

2021-03-23 Thread Aruna Ramakrishna


> On Mar 23, 2021, at 4:13 PM, Jason Gunthorpe  wrote:
> 
> On Tue, Mar 23, 2021 at 12:41:51PM -0700, Aruna Ramakrishna wrote:
>>   There is a far greater possibility of an order-8 allocation failing,
>>   esp. with the addition of __GFP_NORETRY , and the code would have to
>>   fall back to a lower order allocation more often than not (esp. on a
>>   long running system). Unless the performance gains from using order-8
>>   pages are significant (and it does not seem that way to me), we can just
>>   skip this step and directly go to the lower order allocation.
> 
> Do not send HTML mails.

I apologize; I’ve fixed the setting now.

> 
> Do you have benchmarks that show the performance of the high order
> pages is not relevant? I'm a bit surprised to hear that
> 

I guess my point was more to the effect that an order-8 alloc will fail more 
often than not, in this flow. For instance, when we were debugging the latency 
spikes here, this was the typical buddyinfo output on that system:

Node 0, zone      DMA      0      1      1      2      3      0      1      0      1      1      3
Node 0, zone    DMA32      7      7      7      6     10      2      6      7      6   2306
Node 0, zone   Normal   3390  51354  17574   6556   1586     26      2      1      0      0      0
Node 1, zone   Normal  11519  23315  23306   9738     73      2      0      1      0      0      0

I think this level of fragmentation is pretty normal on long-running systems. 
Here, in the reg_mr flow, the first-try (order-8) allocation will probably fail 9 
times out of 10 (especially after the addition of the __GFP_NORETRY flag), and then 
as a fallback the code tries a lower order, and if that too fails, it 
allocates a single page. I think it makes sense to just skip the order-8 attempt 
here.
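
For readers following along, here is a minimal sketch (not the actual mlx5_ib 
code; the function and variable names are made up for illustration) of the 
fail-fast-then-fall-back allocation pattern being discussed, using only standard 
page-allocator calls:

/*
 * Sketch only: try the largest order first without retries/reclaim,
 * then step down until an allocation succeeds or we reach order 0.
 * The caller would free the buffer with free_pages(addr, *out_order).
 */
static void *alloc_xlt_buf_sketch(size_t size, unsigned int *out_order)
{
        unsigned int order = get_order(size);   /* e.g. 8 for a 1 MB buffer */
        unsigned long addr;

        for (;;) {
                /* __GFP_NORETRY makes a fragmented zone fail fast */
                addr = __get_free_pages(GFP_KERNEL | __GFP_NOWARN |
                                        __GFP_NORETRY, order);
                if (addr || !order)
                        break;
                order--;        /* fall back to the next smaller order */
        }

        *out_order = order;
        return (void *)addr;
}

On a system with buddyinfo like the above, the order-8 iteration is almost pure 
overhead, which is the argument for starting the loop at a lower order.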

Thanks,
Aruna


> This code really needs some attention to use a proper
> scatter/gather. I understand the chip can do it, just some of the
> software layers need to be stripped away so it can form the right SGL
> in the HW.
> 
> Jason



Re: [PATCH 2/5] tracing: Verify if trace array exists before destroying it.

2019-08-14 Thread Aruna Ramakrishna




On 08/14/2019 10:55 AM, Divya Indi wrote:

A trace array can be destroyed from userspace or kernel. Verify if the
trace exists before proceeding to destroy/remove it.



^^ s/trace/trace array/

As you pointed out yourself. :)

All the patches look good to me. For this patchset:

Reviewed-by: Aruna Ramakrishna 

Thanks,
Aruna


Re: [PATCH v4 resend] mm/slab: Improve performance of gathering slabinfo stats

2016-08-29 Thread Aruna Ramakrishna

On 08/29/2016 05:44 PM, Aruna Ramakrishna wrote:

On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
During this time, interrupts are disabled while walking the slab lists
(slabs_full, slabs_partial, and slabs_free) for each node, and this
sometimes causes timeouts in other drivers (for instance, Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large, it
results in a dramatic performance improvement. Getting slabinfo statistics
now only requires walking the slabs_free and slabs_partial lists, and
those lists are usually much smaller than slabs_full. We tested this after
growing the dentry cache to 70GB, and the performance improved from 2s to
5ms.

Signed-off-by: Aruna Ramakrishna <aruna.ramakris...@oracle.com>
Cc: Mike Kravetz <mike.krav...@oracle.com>
Cc: Christoph Lameter <c...@linux.com>
Cc: Pekka Enberg <penb...@kernel.org>
Cc: David Rientjes <rient...@google.com>
Cc: Joonsoo Kim <iamjoonsoo@lge.com>
Cc: Andrew Morton <a...@linux-foundation.org>
---
Note: this has been tested only on x86_64.



This patch spawned a very interesting discussion in an older 
thread, and I guess the latest incarnation of the patch got buried. I'm 
resending it for review/approval.


Thanks,
Aruna


[PATCH v4 resend] mm/slab: Improve performance of gathering slabinfo stats

2016-08-29 Thread Aruna Ramakrishna
On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
During this time, interrupts are disabled while walking the slab lists
(slabs_full, slabs_partial, and slabs_free) for each node, and this
sometimes causes timeouts in other drivers (for instance, Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large, it
results in a dramatic performance improvement. Getting slabinfo statistics
now only requires walking the slabs_free and slabs_partial lists, and
those lists are usually much smaller than slabs_full. We tested this after
growing the dentry cache to 70GB, and the performance improved from 2s to
5ms.

Signed-off-by: Aruna Ramakrishna 
Cc: Mike Kravetz 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
Note: this has been tested only on x86_64.

 mm/slab.c | 43 +++
 mm/slab.h |  1 +
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b672710..042017e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,6 +233,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
+	parent->num_slabs = 0;
 }
 
 #define MAKE_LIST(cachep, listp, slab, nodeid)	\
@@ -1394,24 +1395,27 @@ slab_out_of_memory(struct kmem_cache *cachep, gfp_t gfpflags, int nodeid)
 	for_each_kmem_cache_node(cachep, node, n) {
 		unsigned long active_objs = 0, num_objs = 0, free_objects = 0;
 		unsigned long active_slabs = 0, num_slabs = 0;
+		unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+		unsigned long num_slabs_full;
 
 		spin_lock_irqsave(&n->list_lock, flags);
-		list_for_each_entry(page, &n->slabs_full, lru) {
-			active_objs += cachep->num;
-			active_slabs++;
-		}
+		num_slabs = n->num_slabs;
 		list_for_each_entry(page, &n->slabs_partial, lru) {
 			active_objs += page->active;
-			active_slabs++;
+			num_slabs_partial++;
 		}
 		list_for_each_entry(page, &n->slabs_free, lru)
-			num_slabs++;
+			num_slabs_free++;
 
 		free_objects += n->free_objects;
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
-		num_slabs += active_slabs;
 		num_objs = num_slabs * cachep->num;
+		active_slabs = num_slabs - num_slabs_free;
+		num_slabs_full = num_slabs -
+			(num_slabs_partial + num_slabs_free);
+		active_objs += (num_slabs_full * cachep->num);
+
 		pr_warn("  node %d: slabs: %ld/%ld, objs: %ld/%ld, free: %ld\n",
 			node, active_slabs, num_slabs, active_objs, num_objs,
 			free_objects);
@@ -2326,6 +2330,7 @@ static int drain_freelist(struct kmem_cache *cache,
 
 		page = list_entry(p, struct page, lru);
 		list_del(&page->lru);
+		n->num_slabs--;
 		/*
 		 * Safe to drop the lock. The slab is no longer linked
 		 * to the cache.
@@ -2764,6 +2769,8 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
 		list_add_tail(&page->lru, &(n->slabs_free));
 	else
 		fixup_slab_list(cachep, n, page, &list);
+
+	n->num_slabs++;
 	STATS_INC_GROWN(cachep);
 	n->free_objects += cachep->num - page->active;
 	spin_unlock(&n->list_lock);
@@ -3455,6 +3462,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 
 		page = list_last_entry(&n->slabs_free, struct page, lru);
 		list_move(&page->lru, list);
+		n->num_slabs--;
 	}
 }
 
@@ -4111,6 +4119,8 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	unsigned long num_objs;
 	unsigned long active_slabs = 0;
 	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
+	unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+	unsigned long num_slabs_full = 0;
 	const char *name;
 	char *error = NULL;
 	int node;
@@ -4123,33 +4133,34 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	check_irq_on();

Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo stats

2016-08-18 Thread aruna . ramakrishna

On 08/18/2016 04:52 AM, Michal Hocko wrote:

I am not opposing the patch (to be honest it is quite neat) but this
is buggering me for quite some time. Sorry for hijacking this email
thread but I couldn't resist. Why are we trying to optimize SLAB and
slowly converge it to SLUB feature-wise. I always thought that SLAB
should remain stable and time challenged solution which works reasonably
well for many/most workloads, while SLUB is an optimized implementation
which experiment with slightly different concepts that might boost the
performance considerably but might also surprise from time to time. If
this is not the case then why do we have both of them in the kernel. It
is a lot of code and some features need tweaking both while only one
gets testing coverage. So this is mainly a question for maintainers. Why
do we maintain both and what is the purpose of them.


Michal,

Speaking about this patch specifically - I'm not trying to optimize SLAB 
or make it more similar to SLUB. This patch is a bug fix for an issue 
where the slowness of 'cat /proc/slabinfo' caused timeouts in other 
drivers. While optimizing that flow, it became apparent (as Christoph 
pointed out) that one could converge this patch to SLUB's current 
implementation. Though I have not done that in this patch (because that 
warrants a separate patch), I think it makes sense to converge where 
appropriate, since they both do share some common data structures and 
code already.


Thanks,
Aruna


[PATCH v4] mm/slab: Improve performance of gathering slabinfo stats

2016-08-17 Thread Aruna Ramakrishna
On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
During this time, interrupts are disabled while walking the slab lists
(slabs_full, slabs_partial, and slabs_free) for each node, and this
sometimes causes timeouts in other drivers (for instance, Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large, it
results in a dramatic performance improvement. Getting slabinfo statistics
now only requires walking the slabs_free and slabs_partial lists, and
those lists are usually much smaller than slabs_full. We tested this after
growing the dentry cache to 70GB, and the performance improved from 2s to
5ms.

Signed-off-by: Aruna Ramakrishna 
Cc: Mike Kravetz 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
Note: this has been tested only on x86_64.

 mm/slab.c | 43 +++
 mm/slab.h |  1 +
 2 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b672710..042017e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,6 +233,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
+	parent->num_slabs = 0;
 }
 
 #define MAKE_LIST(cachep, listp, slab, nodeid)	\
@@ -1394,24 +1395,27 @@ slab_out_of_memory(struct kmem_cache *cachep, gfp_t gfpflags, int nodeid)
 	for_each_kmem_cache_node(cachep, node, n) {
 		unsigned long active_objs = 0, num_objs = 0, free_objects = 0;
 		unsigned long active_slabs = 0, num_slabs = 0;
+		unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+		unsigned long num_slabs_full;
 
 		spin_lock_irqsave(&n->list_lock, flags);
-		list_for_each_entry(page, &n->slabs_full, lru) {
-			active_objs += cachep->num;
-			active_slabs++;
-		}
+		num_slabs = n->num_slabs;
 		list_for_each_entry(page, &n->slabs_partial, lru) {
 			active_objs += page->active;
-			active_slabs++;
+			num_slabs_partial++;
 		}
 		list_for_each_entry(page, &n->slabs_free, lru)
-			num_slabs++;
+			num_slabs_free++;
 
 		free_objects += n->free_objects;
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
-		num_slabs += active_slabs;
 		num_objs = num_slabs * cachep->num;
+		active_slabs = num_slabs - num_slabs_free;
+		num_slabs_full = num_slabs -
+			(num_slabs_partial + num_slabs_free);
+		active_objs += (num_slabs_full * cachep->num);
+
 		pr_warn("  node %d: slabs: %ld/%ld, objs: %ld/%ld, free: %ld\n",
 			node, active_slabs, num_slabs, active_objs, num_objs,
 			free_objects);
@@ -2326,6 +2330,7 @@ static int drain_freelist(struct kmem_cache *cache,
 
 		page = list_entry(p, struct page, lru);
 		list_del(&page->lru);
+		n->num_slabs--;
 		/*
 		 * Safe to drop the lock. The slab is no longer linked
 		 * to the cache.
@@ -2764,6 +2769,8 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
 		list_add_tail(&page->lru, &(n->slabs_free));
 	else
 		fixup_slab_list(cachep, n, page, &list);
+
+	n->num_slabs++;
 	STATS_INC_GROWN(cachep);
 	n->free_objects += cachep->num - page->active;
 	spin_unlock(&n->list_lock);
@@ -3455,6 +3462,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 
 		page = list_last_entry(&n->slabs_free, struct page, lru);
 		list_move(&page->lru, list);
+		n->num_slabs--;
 	}
 }
 
@@ -4111,6 +4119,8 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	unsigned long num_objs;
 	unsigned long active_slabs = 0;
 	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
+	unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+	unsigned long num_slabs_full = 0;
 	const char *name;
 	char *error = NULL;
 	int node;
@@ -4123,33 +4133,34 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	check_irq_on();

Re: [PATCH v3] mm/slab: Improve performance of gathering slabinfo stats

2016-08-17 Thread Aruna Ramakrishna


On 08/17/2016 12:03 PM, Eric Dumazet wrote:

On Wed, 2016-08-17 at 11:20 -0700, Aruna Ramakrishna wrote:
]

-		list_for_each_entry(page, &n->slabs_full, lru) {
-			if (page->active != cachep->num && !error)
-				error = "slabs_full accounting error";
-			active_objs += cachep->num;
-			active_slabs++;
-		}


Since you only removed this loop, you could track only number of
full_slabs.

This would avoid messing with n->num_slabs all over the places in fast
path.

Please also update slab_out_of_memory()



Eric,

Right now, n->num_slabs is modified only when a slab is detached from 
slabs_free (i.e. in drain_freelist and free_block) or when a new one is 
attached in cache_grow_end. None of those 3 calls are in the fast path, 
right? Tracking just full_slabs would also involve similar changes: 
decrement when a slab moves from full to partial during free_block, and 
increment when it moves from partial/free to full after allocation in 
fixup_slab_list. So I don't see what the real difference/advantage is.


I will update slab_out_of_memory and remove the slabs_full list 
traversal there too.


Thanks,
Aruna




[PATCH v3] mm/slab: Improve performance of gathering slabinfo stats

2016-08-17 Thread Aruna Ramakrishna
On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
During this time, interrupts are disabled while walking the slab lists
(slabs_full, slabs_partial, and slabs_free) for each node, and this
sometimes causes timeouts in other drivers (for instance, Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large, it
results in a dramatic performance improvement. Getting slabinfo statistics
now only requires walking the slabs_free and slabs_partial lists, and
those lists are usually much smaller than slabs_full. We tested this after
growing the dentry cache to 70GB, and the performance improved from 2s to
5ms.

Signed-off-by: Aruna Ramakrishna 
Cc: Mike Kravetz 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
Note: this has been tested only on x86_64.

 mm/slab.c | 26 +-
 mm/slab.h |  1 +
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b672710..3da34fe 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,6 +233,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
+	parent->num_slabs = 0;
 }
 
 #define MAKE_LIST(cachep, listp, slab, nodeid)	\
@@ -2326,6 +2327,7 @@ static int drain_freelist(struct kmem_cache *cache,
 
 		page = list_entry(p, struct page, lru);
 		list_del(&page->lru);
+		n->num_slabs--;
 		/*
 		 * Safe to drop the lock. The slab is no longer linked
 		 * to the cache.
@@ -2764,6 +2766,8 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
 		list_add_tail(&page->lru, &(n->slabs_free));
 	else
 		fixup_slab_list(cachep, n, page, &list);
+
+	n->num_slabs++;
 	STATS_INC_GROWN(cachep);
 	n->free_objects += cachep->num - page->active;
 	spin_unlock(&n->list_lock);
@@ -3455,6 +3459,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 
 		page = list_last_entry(&n->slabs_free, struct page, lru);
 		list_move(&page->lru, list);
+		n->num_slabs--;
 	}
 }
 
@@ -4111,6 +4116,8 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	unsigned long num_objs;
 	unsigned long active_slabs = 0;
 	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
+	unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+	unsigned long num_slabs_full = 0;
 	const char *name;
 	char *error = NULL;
 	int node;
@@ -4123,33 +4130,34 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 		check_irq_on();
 		spin_lock_irq(&n->list_lock);
 
-		list_for_each_entry(page, &n->slabs_full, lru) {
-			if (page->active != cachep->num && !error)
-				error = "slabs_full accounting error";
-			active_objs += cachep->num;
-			active_slabs++;
-		}
+		num_slabs += n->num_slabs;
+
 		list_for_each_entry(page, &n->slabs_partial, lru) {
 			if (page->active == cachep->num && !error)
 				error = "slabs_partial accounting error";
 			if (!page->active && !error)
 				error = "slabs_partial accounting error";
 			active_objs += page->active;
-			active_slabs++;
+			num_slabs_partial++;
 		}
+
 		list_for_each_entry(page, &n->slabs_free, lru) {
 			if (page->active && !error)
 				error = "slabs_free accounting error";
-			num_slabs++;
+			num_slabs_free++;
 		}
+
 		free_objects += n->free_objects;
 		if (n->shared)
 			shared_avail += n->shared->avail;
 
 		spin_unlock_irq(&n->list_lock);
 	}
-	num_slabs += active_slabs;
 	num_objs = num_slabs * cachep->num;
+	active_slabs = num_slabs - num_slabs_free;
+	num_slabs_full = num_slabs - (num_slabs_partial + num_slabs_free);
+	active_objs += (num_slabs_full * cachep->num);

Re: [PATCH v2] mm/slab: Improve performance of gathering slabinfo stats

2016-08-17 Thread aruna . ramakrishna


On 08/16/2016 08:52 AM, Christoph Lameter wrote:


On Tue, 16 Aug 2016, Joonsoo Kim wrote:


In SLUB, nr_slabs is manipulated without holding a lock so atomic
operation should be used.


It could be moved under the node lock.



Christoph, Joonsoo,

I agree that nr_slabs could be common between SLAB and SLUB, but I think 
that should be a separate patch, since converting nr_slabs to unsigned 
long for SLUB will cause quite a bit of change in mm/slub.c that is not 
related to adding counters to SLAB.


I'll send out an updated slab counters patch with Joonsoo's suggested 
fix tomorrow (nr_slabs will be unsigned long for SLAB only, and there 
will be a separate definition for SLUB), and once that's in, I'll create 
a new patch that makes nr_slabs common for SLAB and SLUB, and also 
converts total_objects to unsigned long. Maybe it can include some more 
cleanup too. Does that sound acceptable?
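
To illustrate the split I have in mind, here is a rough sketch (field names and 
placement are illustrative only, not the final layout) of how the per-node 
counter could be declared for the two allocators:

struct kmem_cache_node {
        spinlock_t list_lock;
#ifdef CONFIG_SLAB
        /* SLAB: only ever updated with list_lock held, plain type is enough */
        unsigned long nr_slabs;
        /* ... other SLAB-only fields ... */
#endif
#ifdef CONFIG_SLUB
        /* SLUB: updated without the node lock today, so it stays atomic */
        atomic_long_t nr_slabs;
        /* ... other SLUB-only fields ... */
#endif
};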


Thanks,
Aruna


Re: [PATCH v2] mm/slab: Improve performance of gathering slabinfo stats

2016-08-04 Thread Aruna Ramakrishna



On 08/04/2016 02:06 PM, Andrew Morton wrote:

On Thu,  4 Aug 2016 12:01:13 -0700 Aruna Ramakrishna 
<aruna.ramakris...@oracle.com> wrote:


On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
During this time, interrupts are disabled while walking the slab lists
(slabs_full, slabs_partial, and slabs_free) for each node, and this
sometimes causes timeouts in other drivers (for instance, Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large, it
results in a dramatic performance improvement. Getting slabinfo statistics
now only requires walking the slabs_free and slabs_partial lists, and
those lists are usually much smaller than slabs_full. We tested this after
growing the dentry cache to 70GB, and the performance improved from 2s to
5ms.


I assume this is tested on both slab and slub?

It isn't the smallest of patches but given the seriousness of the
problem I think I'll tag it for -stable backporting.



This was only sanity-checked on slub. The performance tests were only 
run on slab.


Thanks,
Aruna


[PATCH v2] mm/slab: Improve performance of gathering slabinfo stats

2016-08-04 Thread Aruna Ramakrishna
On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
During this time, interrupts are disabled while walking the slab lists
(slabs_full, slabs_partial, and slabs_free) for each node, and this
sometimes causes timeouts in other drivers (for instance, Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large, it
results in a dramatic performance improvement. Getting slabinfo statistics
now only requires walking the slabs_free and slabs_partial lists, and
those lists are usually much smaller than slabs_full. We tested this after
growing the dentry cache to 70GB, and the performance improved from 2s to
5ms.

Signed-off-by: Aruna Ramakrishna 
Cc: Mike Kravetz 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
Note: this has been tested only on x86_64.

 mm/slab.c | 25 -
 mm/slab.h | 15 ++-
 mm/slub.c | 19 +--
 3 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 261147b..d683840 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,6 +233,7 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
+	atomic_long_set(&parent->nr_slabs, 0);
 }
 
 #define MAKE_LIST(cachep, listp, slab, nodeid)	\
@@ -2333,6 +2334,7 @@ static int drain_freelist(struct kmem_cache *cache,
 		n->free_objects -= cache->num;
 		spin_unlock_irq(&n->list_lock);
 		slab_destroy(cache, page);
+		atomic_long_dec(&n->nr_slabs);
 		nr_freed++;
 	}
 out:
@@ -2736,6 +2738,8 @@ static struct page *cache_grow_begin(struct kmem_cache *cachep,
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
 
+	atomic_long_inc(&n->nr_slabs);
+
 	return page;
 
 opps1:
@@ -3455,6 +3459,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 
 		page = list_last_entry(&n->slabs_free, struct page, lru);
 		list_move(&page->lru, list);
+		atomic_long_dec(&n->nr_slabs);
 	}
 }
 
@@ -4111,6 +4116,8 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	unsigned long num_objs;
 	unsigned long active_slabs = 0;
 	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
+	unsigned long num_slabs_partial = 0, num_slabs_free = 0;
+	unsigned long num_slabs_full = 0;
 	const char *name;
 	char *error = NULL;
 	int node;
@@ -4120,36 +4127,36 @@ void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
 	num_slabs = 0;
 	for_each_kmem_cache_node(cachep, node, n) {
 
+		num_slabs += node_nr_slabs(n);
 		check_irq_on();
 		spin_lock_irq(&n->list_lock);
 
-		list_for_each_entry(page, &n->slabs_full, lru) {
-			if (page->active != cachep->num && !error)
-				error = "slabs_full accounting error";
-			active_objs += cachep->num;
-			active_slabs++;
-		}
 		list_for_each_entry(page, &n->slabs_partial, lru) {
 			if (page->active == cachep->num && !error)
 				error = "slabs_partial accounting error";
 			if (!page->active && !error)
 				error = "slabs_partial accounting error";
 			active_objs += page->active;
-			active_slabs++;
+			num_slabs_partial++;
 		}
+
 		list_for_each_entry(page, &n->slabs_free, lru) {
 			if (page->active && !error)
 				error = "slabs_free accounting error";
-			num_slabs++;
+			num_slabs_free++;
 		}
+
 		free_objects += n->free_objects;
 		if (n->shared)
 			shared_avail += n->shared->avail;
 
 		spin_unlock_irq(&n->list_lock);
 	}
-	num_slabs += active_slabs;
 	num_objs = num_slabs * cachep->num;
+	active_slabs = num_slabs - num_slabs_free;
+	num_slabs_full = num_slabs - (num_slabs_partial + num_slabs_free);
+	active_objs += (num_slabs_full * cachep->num);
+

Re: [PATCH] mm/slab: Improve performance of gathering slabinfo stats

2016-08-02 Thread Aruna Ramakrishna


On 08/02/2016 07:59 AM, Christoph Lameter wrote:

Hmm What SLUB does is:

1. Keep a count of the total number of allocated slab pages per node.
This counter only needs to be updated when a slab page is
allocated from the page allocator or when it is freed to the
page allocator. At that point we already hold the per node lock,
page allocator operations are extremely costly anyways and so that
is ok.

2. Keep a count of the number of partially allocated slab pages per node.
At that point we have to access the partial list and take a per
node lock. Placing the counter into the same cacheline and
the increment/decrement into the period when the lock has been taken
avoids the overhead.



As Joonsoo mentioned in his previous comment, the partial list is pretty 
small anyway. And we cannot avoid traversal of the partial list - we 
have to count the number of active objects in each partial slab:


active_objs += page->active;

So keeping a count of partially allocated slabs seems unnecessary to me.


The number of full pages is then

total - partial


If both allocators would use the same scheme here then the code to
maintain the counter can be moved into mm/slab_common.c. Plus the per node
structures could be mostly harmonized between both allocators. Maybe even
the page allocator operations could become common code.

Aruna: Could you work on a solution like that?



Yup, I'll replace the 3 counters with one counter for number of slabs 
per node and send out a new patch. I'll try to make the counter 
management as similar as possible, between SLAB and SLUB.


Thanks,
Aruna


Re: [PATCH] mm/slab: Improve performance of gathering slabinfo stats

2016-08-01 Thread Aruna Ramakrishna

Hi Joonsoo,

On 08/01/2016 05:55 PM, Joonsoo Kim wrote:

Your patch updates these counters not only when a slabs are created and
destroyed but also when object is allocated/freed from the slab. This
would hurt runtime performance.



The counters are not updated for each object allocation/free - only if 
that allocation/free results in that slab moving from one list 
(free/partial/full) to another.



> slab lists for gathering slabinfo stats, resulting in a dramatic
> performance improvement. We tested this after growing the dentry cache to
> 70GB, and the performance improved from 2s to 2ms.

Nice improvement. I can think of an alternative.

I guess that improvement of your change comes from skipping to iterate
n->slabs_full list. We can achieve it just with introducing only num_slabs.
num_slabs can be updated when a slabs are created and destroyed.



Yes, slabs_full is typically the largest list.


We can calculate num_slabs_full by following equation.

num_slabs_full = num_slabs - num_slabs_partial - num_slabs_free

Calculating both num_slabs_partial and num_slabs_free by iterating
n->slabs_XXX list would not take too much time.


Yes, this would work too. We cannot avoid traversal of slabs_partial, 
and slabs_free is usually a small list, so this should give us similar 
performance benefits. But having separate counters could also be useful 
for debugging, like the ones defined under CONFIG_DEBUG_SLAB/STATS. 
Won't that help?
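
To make the arithmetic concrete, here is a minimal standalone sketch (simplified 
names, not the exact kernel code) of the identity described above:

struct node_counts {
        unsigned long num_slabs;          /* total, maintained at slab create/destroy */
        unsigned long num_slabs_partial;  /* counted while walking slabs_partial      */
        unsigned long num_slabs_free;     /* counted while walking slabs_free         */
};

static unsigned long num_slabs_full(const struct node_counts *c)
{
        /* num_slabs_full = num_slabs - num_slabs_partial - num_slabs_free */
        return c->num_slabs - c->num_slabs_partial - c->num_slabs_free;
}

static unsigned long total_active_objs(const struct node_counts *c,
                                       unsigned long partial_active_objs,
                                       unsigned int objs_per_slab)
{
        /*
         * Every object in a full slab is active; objects in partial slabs
         * were already summed one page at a time (page->active) during the
         * slabs_partial walk.
         */
        return partial_active_objs + num_slabs_full(c) * objs_per_slab;
}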


Thanks,
Aruna


[PATCH] mm/slab: Improve performance of gathering slabinfo stats

2016-08-01 Thread Aruna Ramakrishna
On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2 seconds.
During this time, interrupts are disabled while walking the slab lists
(slabs_full, slabs_partial, and slabs_free) for each node, and this
sometimes causes timeouts in other drivers (for instance, Infiniband).

This patch optimizes 'cat /proc/slabinfo' by maintaining slab counters to
keep track of number of slabs per node, per cache. These counters are
updated as slabs are created and destroyed. This avoids having to scan the
slab lists for gathering slabinfo stats, resulting in a dramatic
performance improvement. We tested this after growing the dentry cache to
70GB, and the performance improved from 2s to 2ms.

Signed-off-by: Aruna Ramakrishna 
Cc: Mike Kravetz 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
Note: this has been tested only on x86_64.

 mm/slab.c | 81 ---
 mm/slab.h |  5 
 2 files changed, 67 insertions(+), 19 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 09771ed..c205cd8 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -233,6 +233,11 @@ static void kmem_cache_node_init(struct kmem_cache_node *parent)
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
+	parent->num_slabs_partial = 0;
+	parent->num_slabs_full = 0;
+	parent->num_slabs_free = 0;
+	parent->cache_grown = 0;
+	parent->using_free_slab = 0;
 }
 
 #define MAKE_LIST(cachep, listp, slab, nodeid)	\
@@ -2331,6 +2336,7 @@ static int drain_freelist(struct kmem_cache *cache,
 		 * to the cache.
 		 */
 		n->free_objects -= cache->num;
+		n->num_slabs_free--;
 		spin_unlock_irq(&n->list_lock);
 		slab_destroy(cache, page);
 		nr_freed++;
@@ -2731,6 +2737,14 @@ static struct page *cache_grow_begin(struct kmem_cache *cachep,
 	kasan_poison_slab(page);
 	cache_init_objs(cachep, page);
 
+	spin_lock(&n->list_lock);
+	/*
+	 * Track the cache growth until the slabs lists and counters
+	 * are adjusted.
+	 */
+	n->cache_grown++;
+	spin_unlock(&n->list_lock);
+
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
 
@@ -2758,12 +2772,14 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
 	n = get_node(cachep, page_to_nid(page));
 
 	spin_lock(&n->list_lock);
-	if (!page->active)
+	if (!page->active) {
 		list_add_tail(&page->lru, &(n->slabs_free));
-	else
+		n->num_slabs_free++;
+	} else
 		fixup_slab_list(cachep, n, page, &list);
 	STATS_INC_GROWN(cachep);
 	n->free_objects += cachep->num - page->active;
+	n->cache_grown--;
 	spin_unlock(&n->list_lock);
 
 	fixup_objfreelist_debug(cachep, &list);
@@ -2867,8 +2883,26 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
 {
 	/* move slabp to correct slabp list: */
 	list_del(&page->lru);
+	/*
+	 * If the cache was not grown, then this slabp was deleted from a
+	 * slabs_partial or a slabs_free list, and we decrement the counter
+	 * for the appropriate list here. The flag using_free_slab is set
+	 * whenever a slab from the slabs_free list is used for allocation.
+	 * If set, we know that a slab was deleted from slabs_free and will be
+	 * moved to slabs_partial (or slabs_full) later below. Otherwise,
+	 * it was deleted from slabs_partial.
+	 * If the cache was grown, slabp points to a new slab not present in
+	 * any list - so we do not decrement any counters.
+	 */
+	if (!n->cache_grown) {
+		if (n->using_free_slab)
+			n->num_slabs_free--;
+		else
+			n->num_slabs_partial--;
+	}
 	if (page->active == cachep->num) {
 		list_add(&page->lru, &n->slabs_full);
+		n->num_slabs_full++;
 		if (OBJFREELIST_SLAB(cachep)) {
 #if DEBUG
 			/* Poisoning will be done without holding the lock */
@@ -2881,8 +2915,10 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
 #endif
 			page->freelist = NULL;
 		}
-	} else
+	} else {
 		list_add(&page->lru, &n->slabs_partial);
+		n->num_slabs_partial++;
+	}
 }
 
 /* Try to find non-pfmemalloc slab if needed */
@@ -2912,8 +2948,10 @@ static noinline struct page *get_valid_first_slab(struct kmem_cache_node *n,
 		list_add_tail(&page->lru, &n->slabs_partial);
 
 	list_for_e