Re: [tip:sched/urgent] sched: Fix crash in sched_init_numa()
On 01/19/2016 07:08 PM, tip-bot for Raghavendra K T wrote:

Commit-ID:  9c03ee147193645be4c186d3688232fa438c57c7
Gitweb:     http://git.kernel.org/tip/9c03ee147193645be4c186d3688232fa438c57c7
Author:     Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
AuthorDate: Sat, 16 Jan 2016 00:31:23 +0530
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Tue, 19 Jan 2016 08:42:20 +0100

sched: Fix crash in sched_init_numa()

The following PowerPC commit:

  c118baf80256 ("arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes")

avoids allocating bootmem memory for non-existent nodes. But when DEBUG_PER_CPU_MAPS=y is enabled, my PowerNV system failed to boot, because in sched_init_numa() the cpumask_or() operation was done on unallocated nodes. Fix that by performing the cpumask_or() operation only on existing nodes.

[ Tested with and w/o DEBUG_PER_CPU_MAPS=y on x86 and PowerPC. ]

Reported-by: Jan Stancek <jstan...@redhat.com>
Tested-by: Jan Stancek <jstan...@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Cc: <gk...@linux.vnet.ibm.com>
Cc: <grant.lik...@linaro.org>
Cc: <nik...@linux.vnet.ibm.com>
Cc: <vdavy...@parallels.com>
Cc: <linuxppc-...@lists.ozlabs.org>
Cc: <linux...@kvack.org>
Cc: <pet...@infradead.org>
Cc: <b...@kernel.crashing.org>
Cc: <pau...@samba.org>
Cc: <m...@ellerman.id.au>
Cc: <an...@samba.org>
Link: http://lkml.kernel.org/r/1452884483-11676-1-git-send-email-raghavendra...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44253ad..474658b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6840,7 +6840,7 @@ static void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;
 
-			for (k = 0; k < nr_node_ids; k++) {
+			for_each_node(k) {
 				if (node_distance(j, k) > sched_domains_numa_distance[i])
 					continue;

Hello Greg,

The above commit fixes a debug-kernel crash in the 4.4 kernel (when DEBUG_PER_CPU_MAPS=y, to be precise). This is a regression in 4.4 relative to 4.3 and should ideally be present in 4.4-stable. Could you please pull in this change?
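For readers unfamiliar with the helper: for_each_node() iterates only over node IDs present in node_possible_map, while a plain 0..nr_node_ids-1 loop also visits the holes left by sparse numbering (e.g. nodes numbered 0, 1, 16, 17). A minimal sketch of the distinction, with do_something() standing in for the per-node work (it is not the actual scheduler code):

	/* Visits every index up to the highest node ID, including IDs for
	 * which no node (and hence no per-node allocation) exists: */
	for (k = 0; k < nr_node_ids; k++)
		do_something(k);

	/* for_each_node(k) expands to for_each_node_state(k, N_POSSIBLE)
	 * and therefore skips the holes in a sparse node numbering: */
	for_each_node(k)
		do_something(k);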
[PATCH] blk-mq: Avoid memoryless numa node encoded in hctx numa_node
On architectures like powerpc we can have CPUs without any local memory attached to them (a.k.a. memoryless nodes). In such cases the cpu-to-node mapping can populate the memory allocation hint in block's hctx->numa_node with a node value that has no real memory. Instead use local_memory_node(), which is guaranteed to point at a node with memory. local_memory_node() is a no-op on architectures that do not support memoryless nodes.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 block/blk-mq-cpumap.c | 2 +-
 block/blk-mq.c        | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Validated the patch on a powerpc system with a memoryless node.

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 8764c24..d0634bc 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -113,7 +113,7 @@ int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
 
 	for_each_possible_cpu(i) {
 		if (index == mq_map[i])
-			return cpu_to_node(i);
+			return local_memory_node(cpu_to_node(i));
 	}
 
 	return NUMA_NO_NODE;

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6ada3b4..74be735 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1794,7 +1794,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		 * not, we remain on the home node of the device
 		 */
 		if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
-			hctx->numa_node = cpu_to_node(i);
+			hctx->numa_node = local_memory_node(cpu_to_node(i));
 	}
 }
--
1.7.11.7
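For context, local_memory_node(nid) lives in mm/page_alloc.c; it returns nid itself when that node has memory, and otherwise the nearest node that does. The real helper walks the node's zonelist; the following is only a conceptual sketch of that behaviour, with a distance loop standing in for the zonelist walk:

	/* Conceptual sketch, not the kernel implementation. */
	static int local_memory_node_sketch(int nid)
	{
		int n, best = nid, best_dist = INT_MAX;

		if (node_state(nid, N_MEMORY))
			return nid;		/* node has memory: identity */

		for_each_node_state(n, N_MEMORY) {
			if (node_distance(nid, n) < best_dist) {
				best_dist = node_distance(nid, n);
				best = n;	/* closest node that has memory */
			}
		}
		return best;
	}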
[PATCH] blk-mq: Reuse hardware context cpumask for tags
hctx->cpumask is already populated, so let the tag cpumask simply follow it instead of being rebuilt by a separate loop over every software context.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 block/blk-mq.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

Nish had suggested putting the cpumask_copy() after the WARN_ON (instead of before).

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6d6f8fe..6ada3b4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1854,6 +1854,7 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 
 		hctx->tags = set->tags[i];
 		WARN_ON(!hctx->tags);
+		cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
 
 		/*
 		 * Set the map size to the number of mapped software queues.
 		 * This is more accurate and more efficient than looping
@@ -1867,14 +1868,6 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 		hctx->next_cpu = cpumask_first(hctx->cpumask);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}
-
-	queue_for_each_ctx(q, ctx, i) {
-		if (!cpumask_test_cpu(i, online_mask))
-			continue;
-
-		hctx = q->mq_ops->map_queue(q, i);
-		cpumask_set_cpu(i, hctx->tags->cpumask);
-	}
 }
 
 static void queue_set_hctx_shared(struct request_queue *q, bool shared)
--
1.7.11.7
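Why the single copy is enough, as a sketch (not the literal blk-mq code): earlier in blk_mq_map_swqueue() every online CPU i that maps to this hardware context already had cpumask_set_cpu(i, hctx->cpumask) done for it, so the set that the removed loop rebuilt bit by bit is exactly hctx->cpumask:

	/* removed loop, per online CPU i mapped to hctx:
	 *	cpumask_set_cpu(i, hctx->tags->cpumask);
	 * equivalent single operation once hctx->tags is valid: */
	cpumask_copy(hctx->tags->cpumask, hctx->cpumask);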
Re: BUG: Unable to handle kernel paging request for data at address __percpu_counter_add
On 11/24/2015 02:43 AM, Tejun Heo wrote:

Hello,

On Thu, Nov 19, 2015 at 03:54:35PM +0530, Raghavendra K T wrote:

While I was creating thousands of docker containers on a power8 bare-metal system (config: 4.3.0 kernel, 1TB RAM, 20 cores = 160 cpus), after creating around 5600 containers I hit the problem below. [This looks similar to https://bugzilla.kernel.org/show_bug.cgi?id=101011, but the kernel already had Revert "ext4: remove block_device_ejected" (bdfe0cbd746aa9), since it is the 4.3.0 tagged kernel.] Any hints on how to go about the fix? Please let me know if you think any more information is needed. The docker daemon is device-mapper based (and it took a day to recreate the problem). [By disabling CONFIG_BLK_CGROUP and CONFIG_CGROUP_WRITEBACK I am able to create 10k containers without any problem.]

Could be the same problem that Ilya is trying to fix, i.e. blkdev i_wb pointing to a stale wb. Can you please see whether the following patch resolves the issue? http://lkml.kernel.org/g/1448054554-24138-1-git-send-email-idryo...@gmail.com

Hi Tejun,

Thanks again for the pointer. I was now able to create more than 10k containers without any problem with CGROUP_WRITEBACK on, whereas earlier I had hit this problem a few times at around 5k+ containers. (Also replying on Ilya's thread.)
Re: BUG: Unable to handle kernel paging request for data at address __percpu_counter_add
On 11/24/2015 02:43 AM, Tejun Heo wrote:

Hello,

On Thu, Nov 19, 2015 at 03:54:35PM +0530, Raghavendra K T wrote:

While I was creating thousands of docker containers on a power8 bare-metal system (config: 4.3.0 kernel, 1TB RAM, 20 cores = 160 cpus), after creating around 5600 containers I hit the problem below. [This looks similar to https://bugzilla.kernel.org/show_bug.cgi?id=101011, but the kernel already had Revert "ext4: remove block_device_ejected" (bdfe0cbd746aa9), since it is the 4.3.0 tagged kernel.] Any hints on how to go about the fix? Please let me know if you think any more information is needed. The docker daemon is device-mapper based (and it took a day to recreate the problem). [By disabling CONFIG_BLK_CGROUP and CONFIG_CGROUP_WRITEBACK I am able to create 10k containers without any problem.]

Could be the same problem that Ilya is trying to fix, i.e. blkdev i_wb pointing to a stale wb. Can you please see whether the following patch resolves the issue? http://lkml.kernel.org/g/1448054554-24138-1-git-send-email-idryo...@gmail.com

Thanks, Tejun, for the pointer. Will check whether the patch resolves the issue. (Reproduction takes a long time, so it may take a while to report back.)
BUG: Unable to handle kernel paging request for data at address __percpu_counter_add
Hi, While I was creating thousands of docker container on a power8 baremetal (config: 4.3.0 kernel 1TB RAM, 20core (=160 cpu) system. After creating around 5600 container I have hit below problem. [This is looking similar to https://bugzilla.kernel.org/show_bug.cgi?id=101011, but kernel had Revert "ext4: remove block_device_ejected" (bdfe0cbd746aa9) since it is 4.3.0 tagged kernel] Any hints on how to go about the fix. Please let me know if you think any more information needed. docker daemon is device mapper based. (and it took a day to recreate the problem) [ by disabling CONFIG_BLK_CGROUP and CONFIG_CGROUP_WRITEBACK I am able to create 10k containers without any problem] Nov 14 17:27:00 docker5 kernel: [40161.570029] Unable to handle kernel paging request for data at address 0x3fedfa Nov 14 17:27:00 docker5 kernel: [40161.570125] Faulting instruction address: 0xc056de90 Nov 14 17:27:00 docker5 kernel: [40161.570136] Oops: Kernel access of bad area, sig: 11 [#1] Nov 14 17:27:00 docker5 kernel: [40161.570143] SMP NR_CPUS=256 NUMA PowerNV Nov 14 17:27:00 docker5 kernel: [40161.570177] Modules linked in: veth(E) xt_nat(E) xt_tcpudp(E) xt_addrtype(E) xt_conntrack(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) iptable_filter(E) ip_tables(E) x_tables(E) nf_nat(E) nf_conntrack(E) bridge(E) stp(E) llc(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) libcrc32c(E) uio_pdrv_genirq(E) powernv_rng(E) uio(E) autofs4(E) ses(E) enclosure(E) mlx4_en(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) lpfc(E) mlx4_core(E) scsi_transport_fc(E) ipr(E) Nov 14 17:27:00 docker5 kernel: [40161.570755] CPU: 154 PID: 77177 Comm: docker Tainted: GE 4.3.0+ #34 Nov 14 17:27:00 docker5 kernel: [40161.570830] task: c0eaec7f2780 ti: c0eaa4ac task.ti: c0eaa4ac Nov 14 17:27:00 docker5 kernel: [40161.570904] NIP: c056de90 LR: c02273e0 CTR: Nov 14 17:27:00 docker5 kernel: [40161.570978] REGS: c0eaa4ac3530 TRAP: 0300 Tainted: GE(4.3.0+) Nov 14 17:27:00 docker5 kernel: [40161.571051] MSR: 90019033 CR: 28028428 XER: 2000 Nov 14 17:27:00 docker5 kernel: [40161.571244] CFAR: c0008468 DAR: 003fedfa DSISR: 4000 SOFTE: 0 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR00: c02273e0 c0eaa4ac37b0 c14d6c00 c0f1f7603fb8 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR04: 0001 0040 0001 0001 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR08: 007d 003fedfa 003fedfa Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR12: c03a4700 c7fbb700 c0cff0f8 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR16: c0e790430400 c0e7a7e1a000 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR20: c0e7c9d16800 c176cfc4 0001 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR24: 0009eca9 0001 c0ffcf8cb800 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR28: c0f21a739af0 0001 c1505414 c0f1f7603fb8 Nov 14 17:27:00 docker5 kernel: [40161.572243] NIP [c056de90] __percpu_counter_add+0x30/0x100 Nov 14 17:27:00 docker5 kernel: [40161.572310] LR [c02273e0] account_page_dirtied+0x100/0x250 Nov 14 17:27:00 docker5 kernel: [40161.572373] Call Trace: Nov 14 17:27:00 docker5 kernel: [40161.572401] [c0eaa4ac37b0] [c0eaa4ac37f0] 0xc0eaa4ac37f0 (unreliable) Nov 14 17:27:00 docker5 kernel: [40161.572491] [c0eaa4ac37f0] [c02273e0] account_page_dirtied+0x100/0x250 Nov 14 17:27:00 docker5 kernel: [40161.572580] [c0eaa4ac3840] [c031031c] __set_page_dirty+0x7c/0x130 Nov 14 17:27:00 docker5 kernel: [40161.572656] [c0eaa4ac3890] [c03106f8] mark_buffer_dirty+0x178/0x1c0 Nov 14 17:27:00 docker5 
kernel: [40161.572746] [c0eaa4ac38d0] [c03a5c54] ext4_commit_super+0x1d4/0x340 Nov 14 17:27:00 docker5 kernel: [40161.572835] [c0eaa4ac3970] [c03a8d58] ext4_setup_super+0x118/0x250 Nov 14 17:27:00 docker5 kernel: [40161.572924] [c0eaa4ac3a00] [c03abce4] ext4_fill_super+0x1c04/0x3250 Nov 14 17:27:00 docker5 kernel: [40161.573013] [c0eaa4ac3b50] [c02c9964] mount_bdev+0x234/0x270 Nov 14 17:27:00 docker5 kernel: [40161.573089] [c0eaa4ac3bd0] [c03a3178] ext4_mount+0x48/0x60 Nov 14 17:27:00 docker5 kernel: [40161.573165] [c0eaa4ac3c10] [c02cad9c] mount_fs+0x8c/0x230 Nov 14 17:27:00 docker5 kernel: [40161.573242] [c0eaa4ac3cb0] [c02f0518] vfs_kern_mount+0x78/0x180 Nov 14 17:27:00 docker5 kernel: [40161.573319] [c0eaa4ac3d00] [c02f5150] do_mount+0x2e0/0xf60 Nov 14 17:27:00
Re: [PATCH] block-mq:Fix the null memory access while setting tags cpumask
On 10/13/2015 10:17 PM, Jeff Moyer wrote:

Raghavendra K T <raghavendra...@linux.vnet.ibm.com> writes:

In nr_hw_queues > 1 cases, when a certain number of cpus are onlined or offlined, resulting in a change of the request_queue map in the blk-mq layer, we see the kernel dumping like:

What version is that patch against? This problem should be fixed by the patch set from Akinobu Mita. I tested cpu hot-plugging on my system with an nvme device, and it is fixed there on mainline.

Hi Jeff,

My patch was against (4.3.0-rc2) bcee19f424a. I do see now that commit 1356aae08338 ("blk-mq: avoid setting hctx->tags->cpumask before allocation") has an equivalent fix. Thanks for the information.
[PATCH] block-mq:Fix the null memory access while setting tags cpumask
In nr_hw_queues >1 cases when certain number of cpus are onlined/or offlined, that results change in request_queue map in block-mq layer, we see the kernel dumping like: BUG: unable to handle kernel NULL pointer dereference at 0080 IP: [] cpumask_set_cpu+0x6/0xd PGD 6d957067 PUD 7604c067 PMD 0 Oops: 0002 [#1] SMP Modules linked in: null_blk CPU: 2 PID: 1926 Comm: bash Not tainted 4.3.0-rc2+ #24 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: 8800724cd1c0 ti: 880070a2c000 task.ti: 880070a2c000 RIP: 0010:[] [] cpumask_set_cpu+0x6/0xd RSP: 0018:880070a2fbc8 EFLAGS: 00010203 RAX: 880073eedc00 RBX: 88006cc88000 RCX: 88006c06b000 RDX: 0007 RSI: 0080 RDI: 0008 RBP: 880070a2fbc8 R08: 88006c06ac00 R09: 88006c06ad48 R10: 88004ea8 R11: 88006c069650 R12: 88007378fe28 R13: 0008 R14: e8500200 R15: 81d2a630 FS: 7fa34803b700() GS:88007cc4() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 0080 CR3: 761d2000 CR4: 06e0 Stack: 880070a2fc18 8128edec 880073eedc00 0039 88006cc88000 0007 ffe3 81cef2c0 880070a2fc38 8129049a Call Trace: [] blk_mq_map_swqueue+0x9d/0x206 [] blk_mq_queue_reinit_notify+0xe3/0x144 [] notifier_call_chain+0x37/0x63 [] __raw_notifier_call_chain+0xe/0x10 [] __cpu_notify+0x20/0x32 [] cpu_notify_nofail+0x13/0x1b [] _cpu_down+0x18a/0x264 [] ? path_put+0x1f/0x23 [] cpu_down+0x2d/0x3a [] cpu_subsys_offline+0x14/0x16 [] device_offline+0x65/0x94 [] online_store+0x48/0x68 [] ? kernfs_fop_write+0x6f/0x143 [] dev_attr_store+0x20/0x22 [] sysfs_kf_write+0x3c/0x3e [] kernfs_fop_write+0xed/0x143 [] __vfs_write+0x28/0xa6 [] ? security_file_permission+0x3c/0x44 [] ? percpu_down_read+0x21/0x42 [] ? __sb_start_write+0x24/0x41 [] vfs_write+0x8d/0xd1 [] SyS_write+0x59/0x83 [] entry_SYSCALL_64_fastpath+0x12/0x71 Code: 03 75 06 65 48 ff 0a eb 1a f0 48 83 af 68 07 00 00 01 74 02 eb 0d 48 8d bf 68 07 00 00 ff 90 78 07 00 00 5d c3 55 89 ff 48 89 e5 48 0f ab 3e 5d c3 0f 1f 44 00 00 55 8b 4e 44 31 d2 8b b7 94 RIP [] cpumask_set_cpu+0x6/0xd RSP CR2: 0080 How to reproduce: 1. create 80 vcpu guest with 10 core 8 threads 2. modprobe null_blk submit_queues=64 3. for i in 72 73 74 75 76 77 78 79 ; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done Reason: We try to set freed hwctx->tag->cpumask in blk_mq_map_swqueue(). Introduced during commit f26cdc8536ad ("blk-mq: Shared tag enhancements"). What is happening: When certain number of cpus are onlined/offlined, that results in blk_mq_update_queue_map, we could potentially end up in new mapping to hwctx. Subsequent blk_mq_map_swqueue of request_queue, tries to set the hwctx->tags->cpumask which is already freed by blk_mq_free_rq_map in earlier itearation when it was not mapped. Fix: Set the hwctx->tags->cpumask only after blk_mq_init_rq_map() is done hwctx->tags->cpumask does not follow the hwctx->cpumask after new mapping even in the cases where new mapping does not cause problem. That is also fixed with this change. This problem is originally found in powervm which had 160 cpus (SMT8), 128 nr_hw_queues. The dump was easily reproduced with offlining last core and it has been a blocker issue because cpu hotplug is a common case for DLPAR. 
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4..39a7834 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1811,7 +1811,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 
 		hctx = q->mq_ops->map_queue(q, i);
 		cpumask_set_cpu(i, hctx->cpumask);
-		cpumask_set_cpu(i, hctx->tags->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
@@ -1836,6 +1835,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		if (!set->tags[i])
 			set->tags[i] = blk_mq_init_rq_map(set, i);
 		hctx->tags = set->tags[i];
+		cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
 		WARN_ON(!hctx->tags);
 
 		/*
--
1.7.11.7
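A condensed sketch of the ordering problem the patch addresses (pseudo-flow, not the literal blk_mq_map_swqueue() code; "pass 1" and "pass 2" stand for the two loops in that function):

	/* pass 1: map each online software context to its hardware context */
	cpumask_set_cpu(i, hctx->cpumask);
	cpumask_set_cpu(i, hctx->tags->cpumask);	/* crash: hctx->tags may be NULL
							 * or point at a tag map already
							 * freed by blk_mq_free_rq_map()
							 * when this hctx lost its mapping */

	/* pass 2: (re)allocate tag maps for hardware contexts that are mapped */
	if (!set->tags[i])
		set->tags[i] = blk_mq_init_rq_map(set, i);
	hctx->tags = set->tags[i];

	/* the fix: touch the tag cpumask only after hctx->tags is known valid */
	cpumask_copy(hctx->tags->cpumask, hctx->cpumask);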
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 10/06/2015 03:55 PM, Michael Ellerman wrote:

On Sun, 2015-09-27 at 23:59 +0530, Raghavendra K T wrote:

Problem description: Powerpc has sparse node numbering, i.e. on a 4-node system nodes are (possibly) numbered 0,1,16,17. At a lower level, the chipid obtained from the device tree is naturally (directly) mapped to the nid. A potential side effect of that is: 1) There are several places in the kernel that assume serial node numbering, and memory allocations assume that all the nodes from 0 to the highest nid exist, in turn ending up allocating memory for nodes that do not exist.

Is it several? Or lots? If it's several, ie. more than two but not lots, then we should probably just fix those places. Or is that /really/ hard for some reason?

It is several, and I did attempt to fix them. But the rest of the places (like memcg, workqueue, scheduler and so on) are tricky to fix because the memory allocations are glued to other things, and similar fixes may be expected in future too.

Do we ever get whole nodes hotplugged in under PowerVM? I don't think so, but I don't remember for sure.

Even on PowerVM we do have discontiguous numa nodes. [Adding more to it, we could even end up creating a dummy node 0 just to make the kernel happy.] For example:

available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7
node 7 size: 10240 MB
node 7 free: 8174 MB
node distances:
node   0   7
  0:  10  40
  7:  40  10

Note that node zero has neither any cpu nor memory.

2) For virtualization use cases (such as qemu, libvirt, openstack), mapping the sparse nids of the host system to contiguous nids of the guest (numa affinity, placement) could be a challenge.

Can you elaborate? That's a bit vague.

One example I can think of (though libvirt/openstack people will know more about it): suppose one wishes to bind half of the vcpus to one physical node and the rest of the vcpus to the second numa node; we cannot say whether the second node is 1, 8, or 16, and the same libvirt XML valid on one two-node system may not be valid for another two-node system. [I believe it may cause some migration problems too.]

Possible solutions: 1) Handle the memory allocations in the kernel case by case: though in some cases it is easy to achieve, some cases may be intrusive/non-trivial, and in the end it does not handle side effect (2) above. 2) Map the sparse chipid obtained from the device tree to a serial nid at the kernel level (the idea proposed in this series). Pro: it is more natural to handle at the kernel level than at the lower (OPAL) layer. Con: the chipid in the device tree is no longer the same as the nid in the kernel. 3) Let the lower layer (OPAL) give serial node ids after parsing the chipid and the associativity etc. [either as a separate item in the device tree or by compacting the chipid numbers]. Pro: kernel and device tree stay on the same page, with less change in the kernel. Con: is this the functionality expected of the lower layer?

3) Numactl tests from ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz (in fact there was more breakage before the patch because of the sparse nid and memoryless node cases of powerpc)

This is probably the best argument for your series. ie. userspace is dumb and fixing every broken app that assumes linear node numbering is not feasible. So on the whole I think the concept is good. This series though is a bit confusing because of all the renaming etc. etc. Nish made lots of good comments so I'll wait for a v2 based on those.

Yes, will be sending V2 soon, extending my patch to fix the PowerVM case too.
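To make the proposed compaction concrete, a small illustration with hypothetical values (the chip ids are made up; map_chipid_to_nid() is the helper introduced later in this series, and nids are handed out first-come-first-served as chips are discovered):

	/* device-tree chip ids (sparse):  0, 1, 16, 17
	 * kernel nids after compaction:   0, 1,  2,  3  */
	nid = map_chipid_to_nid(0);	/* -> 0 */
	nid = map_chipid_to_nid(16);	/* -> 2 (first unused nid), not 16 */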
Re: [RFC, 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
On 10/06/2015 03:47 PM, Michael Ellerman wrote: On Sun, 2015-27-09 at 18:29:09 UTC, Raghavendra K T wrote: We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com> Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); Can you rename it better :) Something like cpu_to_nid(). Good name. sure. Although maybe nid is wrong given the rest of the series. May be not. The current plan is to rename (after discussing with Nish) chipid to pnid (physical nid) and nid to vnid (virtual nid) within powerpc numa.c [reasoning chipid is applicable only to OPAL, since we want to handle powerkvm, powervm and baremetal we need a generic name ] But 'nid' naming will be retained which is applicable for generic kernel interactions. diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); I don't see you changing any modular code that uses this, or any macros that might be used by modules, so I don't see why this needs to be exported? I think you just added it because num_cpu_lookup_table was exported? arch/powerpc/kernel/smp.c uses it. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
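For reference, the rename being discussed is purely cosmetic; a sketch of the suggested cpu_to_nid(), under the assumption that it keeps the body of the posted numa_cpu_lookup():

	/* sketch of the suggested rename -- behaviour unchanged */
	int cpu_to_nid(int cpu)
	{
		return numa_cpu_lookup_table[cpu];
	}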
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/30/2015 01:16 AM, Denis Kirjanov wrote: On 9/29/15, Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote: On 09/28/2015 10:34 PM, Nishanth Aravamudan wrote: On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote: On 9/27/15, Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote: Problem description: Powerpc has sparse node numbering, i.e. on a 4 node system nodes are numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid got from device tree is naturally mapped (directly) to nid. Interesting thing to play with, I'll try to test it on my POWER7 box, but it doesn't have the OPAL layer :( Hi Denis, Thanks for your interest. I have pushed the patches to https://github.com/ktraghavendra/linux/tree/serialnuma_v1 if it makes patches easy to grab. Thanks! One sad thing is that I can't test the actual node id mapping now since currently I have an access to machine with only one memory node :/ Can we fake it through qemu? faking the sparse numa ids is possible with Nish's patch for qemu: https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg05826.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mappaing between chipid and nid
On 09/28/2015 11:05 PM, Nishanth Aravamudan wrote:

On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:

Once we have made the distinction between nid and chipid, create a 1:1 mapping between them. This makes compacting the nids easy later. No functionality change.

Signed-off-by: Raghavendra K T
---
 arch/powerpc/mm/numa.c | 36 +++++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f84ed2f..dd2073b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -264,6 +264,17 @@ out:
 	return chipid;
 }
 
+
+ /* Return the nid from associativity */
+static int associativity_to_nid(const __be32 *associativity)
+{
+	int nid;
+
+	nid = associativity_to_chipid(associativity);
+	return nid;
+}

This is ultimately confusing. You are assigning the semantic return value of a chipid to a nid -- is it a nid or a chipid? Shouldn't the variable naming be consistent? :(

Yes, will come up with some consistent naming.
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/28/2015 11:04 PM, Nishanth Aravamudan wrote:

On 27.09.2015 [23:59:08 +0530], Raghavendra K T wrote:

[...]

2) Map the sparse chipid obtained from the device tree to a serial nid at the kernel level (the idea proposed in this series). Pro: it is more natural to handle at the kernel level than at the lower (OPAL) layer. Con: the chipid in the device tree is no longer the same as the nid in the kernel.

Is there any debugging/logging? Looks like not -- so how does a sysadmin map from firmware-provided values to the Linux values? That's going to make debugging of large systems (PowerVM or otherwise) less than pleasant, it seems? Possibly you could put something in sysfs?

I see two things that could be done here: 1) while doing dump_numa_cpu_topology() we can dump nid_to_chipid() as additional information; 2) sysfs: does /sys/devices/system/node/nodeX/*chipid* look good? Maybe we should add it only for powerpc, or otherwise we would need chipid = nid populated for other archs. [I think this change can be done slowly.]
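If option (2) were pursued, it would roughly be a read-only node attribute. A hypothetical sketch (the attribute name, its registration, and the powerpc-only gating are all assumptions, not something in the posted series; nid_to_chipid() is the helper from patch 4/5):

	/* hypothetical /sys/devices/system/node/nodeX/chipid attribute */
	static ssize_t chipid_show(struct device *dev,
				   struct device_attribute *attr, char *buf)
	{
		return sprintf(buf, "%d\n", nid_to_chipid(dev->id));
	}
	static DEVICE_ATTR_RO(chipid);
	/* ...would still need to be registered from the node sysfs init path. */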
Re: [PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
On 09/28/2015 11:02 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:12 +0530], Raghavendra K T wrote: Create arrays that maps serial nids and sparse chipids. Note: My original idea had only two arrays of chipid to nid map. Final code is inspired by driver/acpi/numa.c that maps a proximity node with a logical node by Takayoshi Kochi , and thus uses an additional chipid_map nodemask. The mask helps in first unused nid easily by knowing first unset bit in the mask. No change in functionality. Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 48 +++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index dd2073b..f015cad 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -63,6 +63,11 @@ static int form1_affinity; static int distance_ref_points_depth; static const __be32 *distance_ref_points; static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; +static nodemask_t chipid_map = NODE_MASK_NONE; +static int chipid_to_nid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; Hrm, conceptually there are *more* chips than nodes, right? So what guarantees we won't see > MAX_NUMNODES chips? You are correct that nid <= chipids. and #nids = #chipids when all possible slots are populated. Considering we assume that maximum chip slots are no more than MAX_NUMNODES, how about having #define MAX_CHIPNODES MAX_NUMNODES and chipid_to_nid_map[MAX_CHIPNODES] = { [0 ... MAX_CHIPNODES - 1] = .. +static int nid_to_chipid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; /* * Allocate node_to_cpumask_map based on number of available nodes @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } +int chipid_to_nid(int chipid) +{ + if (chipid < 0) + return NUMA_NO_NODE; Do you really want to support these cases? Or should they be bugs/warnings indicating that you got an unexpected input? Or at least WARN_ON_ONCE? Right. Querying for nid of an invalid chipid should be atleast WARN_ON_ONCE(). But 'll check once if there is any valid scenario before the change. + return chipid_to_nid_map[chipid]; +} + +int nid_to_chipid(int nid) +{ + if (nid < 0) + return NUMA_NO_NODE; + return nid_to_chipid_map[nid]; +} + +static void __map_chipid_to_nid(int chipid, int nid) +{ + if (chipid_to_nid_map[chipid] == NUMA_NO_NODE +|| nid < chipid_to_nid_map[chipid]) + chipid_to_nid_map[chipid] = nid; + if (nid_to_chipid_map[nid] == NUMA_NO_NODE + || chipid < nid_to_chipid_map[nid]) + nid_to_chipid_map[nid] = chipid; +} chip <-> node mapping is a static (physical) concept, right? Should we emit some debugging if for some reason we get a runtime call to remap an already mapped chip to a new node? Good point. Already mapped chipid to a different nid is unexpected whereas mapping chipid to same nid is expected.(because mapping comes from cpus belonging to same node). WARN_ON() should suffice here? + +int map_chipid_to_nid(int chipid) +{ + int nid; + + if (chipid < 0 || chipid >= MAX_NUMNODES) + return NUMA_NO_NODE; + + nid = chipid_to_nid_map[chipid]; + if (nid == NUMA_NO_NODE) { + if (nodes_weight(chipid_map) >= MAX_NUMNODES) + return NUMA_NO_NODE; If you create a KVM guest with a bogus topology, doesn't this just start losing NUMA information for very high-noded guests? 'll try to see if it is possible to hit this case, ideally we should not allow more than MAX_NUMNODES for chipids and we should abort early. 
+ nid = first_unset_node(chipid_map); + __map_chipid_to_nid(chipid, nid); + node_set(nid, chipid_map); + } + return nid; +} + int numa_cpu_lookup(int cpu) { return numa_cpu_lookup_table[cpu]; @@ -264,7 +311,6 @@ out: return chipid; } - stray change? yep, will correct that.
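As a rough illustration of the hardening discussed above (bounding chipids and warning when an invalid chipid is queried or an already-mapped chipid shows up with a different nid), a minimal userspace sketch could look like the following. It only models the proposed checks; MAX_CHIPNODES, the warning messages and main() are assumptions for illustration, not kernel code.

/* Userspace model of the checks discussed above -- not kernel code. */
#include <stdio.h>

#define MAX_CHIPNODES 8            /* assumed stand-in for MAX_NUMNODES */
#define NUMA_NO_NODE  (-1)

static int chipid_to_nid_map[MAX_CHIPNODES] = {
	[0 ... MAX_CHIPNODES - 1] = NUMA_NO_NODE   /* same range initializer style as the patch */
};

static int chipid_to_nid(int chipid)
{
	/* models the suggested WARN_ON_ONCE() for an out-of-range chipid */
	if (chipid < 0 || chipid >= MAX_CHIPNODES) {
		fprintf(stderr, "warning: invalid chipid %d\n", chipid);
		return NUMA_NO_NODE;
	}
	return chipid_to_nid_map[chipid];
}

static void model_map_chipid_to_nid(int chipid, int nid)
{
	int cur = chipid_to_nid(chipid);

	/* models the suggested WARN_ON() when an already-mapped chipid
	 * is asked to map to a different nid */
	if (cur != NUMA_NO_NODE && cur != nid) {
		fprintf(stderr, "warning: chipid %d already mapped to nid %d, ignoring nid %d\n",
			chipid, cur, nid);
		return;
	}
	chipid_to_nid_map[chipid] = nid;
}

int main(void)
{
	model_map_chipid_to_nid(3, 1);   /* first mapping */
	model_map_chipid_to_nid(3, 1);   /* same nid again: expected, stays silent */
	model_map_chipid_to_nid(3, 2);   /* different nid: warns */
	chipid_to_nid(99);               /* out of range: warns */
	printf("chipid 3 -> nid %d\n", chipid_to_nid(3));
	return 0;
}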
Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
On 09/28/2015 10:58 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote: Once we have made the distinction between nid and chipid, create a 1:1 mapping between them. This makes compacting the nids easy later. Didn't the previous patch just do the opposite of... My intent for the split was: 1. rename the functions to state loudly that they return a chipid (and not a nid), 2. then assign nid = chipid so that it is clear we have a 1:1 nid:chipid mapping, and compact the nids later. But maybe I should combine patches 2 and 3.
Re: [PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid
On 09/28/2015 10:57 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:10 +0530], Raghavendra K T wrote: There is no change in the fuctionality Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 42 +- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index d5e6eee..f84ed2f 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid, } } -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa * info is found. */ -static int associativity_to_nid(const __be32 *associativity) +static int associativity_to_chipid(const __be32 *associativity) This is confusing to me. This function is also used by the DLPAR code under PowerVM to indicate what node the CPU is on -- not a chip (which I don't believe is exposed at all under PowerVM). Good point. should I retain the name nid? or any suggestions? instead of chipid -> nid which fits both the cases. or should I rename like nid->vnid something? [...] @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_nid(associativity); + new_nid = associativity_to_chipid(associativity); If you are getting a chipid, shouldn't you be assigning it to a variable called 'new_chipid'? yes perhaps. my splitting idea was 1. change nid name in functions to chipid (without changing nid variable calling that function) 2. rename variables to chipid and assign nid=chipid (1:1 mapping) 3. now let nid = mapped chipid But I see that it isn't consistent in some places. do you think merging step 1 and step 2 is okay? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/28/2015 10:34 PM, Nishanth Aravamudan wrote: On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote: On 9/27/15, Raghavendra K T wrote: Problem description: Powerpc has sparse node numbering, i.e. on a 4 node system nodes are numbered (possibly) as 0,1,16,17. At a lower level, the chipid obtained from the device tree is naturally mapped (directly) to the nid. Interesting thing to play with, I'll try to test it on my POWER7 box, but it doesn't have the OPAL layer :( Hi Denis, Thanks for your interest. I have pushed the patches to https://github.com/ktraghavendra/linux/tree/serialnuma_v1 if that makes the patches easier to grab. Note that it's also interesting to try it under PowerVM, with odd NUMA topologies, and report any issues found :) Thanks Nish, I'll also grab a PowerVM system and test there.
Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
On 09/28/2015 11:05 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote: Once we have made the distinction between nid and chipid create a 1:1 mapping between them. This makes compacting the nids easy later. No functionality change. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 36 +--- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f84ed2f..dd2073b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -264,6 +264,17 @@ out: return chipid; } + + /* Return the nid from associativity */ +static int associativity_to_nid(const __be32 *associativity) +{ + int nid; + + nid = associativity_to_chipid(associativity); + return nid; +} This is ultimately confusing. You are assigning the semantic return value of a chipid to a nid -- is it a nid or a chipid? Shouldn't the variable naming be consistent? :( yes. will come up with some consistent naming. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/28/2015 11:04 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:08 +0530], Raghavendra K T wrote: [...] 2) Map the sparse chipid got from the device tree to a serial nid at kernel level (the idea proposed in this series). Pro: it is more natural to handle this at kernel level than at the lower (OPAL) layer. Con: the chipid in the device tree is no longer the same as the nid in the kernel. Is there any debugging/logging? Looks like not -- so how does a sysadmin map from firmware-provided values to the Linux values? That's going to make debugging of large systems (PowerVM or otherwise) less than pleasant, it seems? Possibly you could put something in sysfs? I see two things that could be done here: 1) while doing dump_numa_cpu_topology() we can dump nid_to_chipid() as additional information; 2) sysfs: would /sys/devices/system/node/nodeX/chipid look good? Maybe we should add it only for powerpc, or otherwise we would need to populate chipid = nid for the other archs. [ I think this change can be done gradually ]
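If such a per-node attribute were added (it does not exist today; the path below is purely hypothetical, taken from the suggestion above), an administrator could recover the nid to chipid mapping from userspace with something as simple as this sketch.

/* Sketch: walk /sys/devices/system/node and read a hypothetical per-node
 * "chipid" attribute, as proposed above. On current kernels the attribute
 * is absent, so the loop simply prints nothing.
 */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	const char *base = "/sys/devices/system/node";
	struct dirent *de;
	DIR *d = opendir(base);

	if (!d) {
		perror(base);
		return 1;
	}
	while ((de = readdir(d)) != NULL) {
		char path[512];
		int nid, chipid;
		FILE *f;

		if (sscanf(de->d_name, "node%d", &nid) != 1)
			continue;
		snprintf(path, sizeof(path), "%s/%s/chipid", base, de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;       /* attribute absent on this kernel */
		if (fscanf(f, "%d", &chipid) == 1)
			printf("nid %d -> chipid %d\n", nid, chipid);
		fclose(f);
	}
	closedir(d);
	return 0;
}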
Re: [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
On 09/27/2015 11:59 PM, Raghavendra K T wrote: We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T Signed-off-by: Raghavendra K T --- Sorry: changelog edit messed it.. :( arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..56fbe9e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. */ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, numa_cpu_lookup(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(numa_cpu_lookup(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(numa_cpu_lookup(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); } @@ -718,8 +718,8 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + set_numa_node(numa_cpu_lookup(cpu)); + set_numa_mem(local_memory_node(numa_cpu_lookup(cpu))); smp_wmb(); notify_cpu_starting(cpu); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { numa_cpu_lookup_table[cpu] = node; } +static void reset_numa_cpu_lookup_table(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + update_numa_cpu_lookup_table(cpu, -1); +} + static void map_cpu_to_node(int cpu, int node) { update_numa_cpu_lookup_table(cpu, node); @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = 
numa_cpu_lookup_table[cpu]; + int node = numa_cpu_lookup(cpu); dbg("removing cpu %lu from node %d\n", cpu, node); @@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + nid = numa_cpu_lookup(lcpu); + if (nid >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { + if (new_nid == numa_cpu_lookup(cpu)) { cpumask_andnot(_a
[PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
Once we have made the distinction between nid and chipid create a 1:1 mapping between them. This makes compacting the nids easy later. No functionality change. Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 36 +--- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f84ed2f..dd2073b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -264,6 +264,17 @@ out: return chipid; } + + /* Return the nid from associativity */ +static int associativity_to_nid(const __be32 *associativity) +{ + int nid; + + nid = associativity_to_chipid(associativity); + return nid; +} + + /* Returns the chipid associated with the given device tree node, * or -1 if not found. */ @@ -278,6 +289,17 @@ static int of_node_to_chipid_single(struct device_node *device) return chipid; } +/* + * Returns nid from the chipid associated with given tree node + */ +static int of_node_to_nid_single(struct device_node *device) +{ + int nid; + + nid = of_node_to_chipid_single(device); + return nid; +} + /* Walk the device tree upwards, looking for an associativity id */ int of_node_to_nid(struct device_node *device) { @@ -286,7 +308,7 @@ int of_node_to_nid(struct device_node *device) of_node_get(device); while (device) { - nid = of_node_to_chipid_single(device); + nid = of_node_to_nid_single(device); if (nid != -1) break; @@ -498,7 +520,7 @@ static int of_get_assoc_arrays(struct device_node *memory, } /* - * This is like of_node_to_chipid_single() for memory represented in the + * This is like of_node_to_nid_single() for memory represented in the * ibm,dynamic-reconfiguration-memory node. */ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem, @@ -557,7 +579,7 @@ static int numa_setup_cpu(unsigned long lcpu) goto out; } - nid = of_node_to_chipid_single(cpu); + nid = of_node_to_nid_single(cpu); out_present: if (nid < 0 || !node_online(nid)) @@ -754,7 +776,7 @@ static int __init parse_numa_properties(void) cpu = of_get_cpu_node(i, NULL); BUG_ON(!cpu); - nid = of_node_to_chipid_single(cpu); + nid = of_node_to_nid_single(cpu); of_node_put(cpu); /* @@ -796,7 +818,7 @@ new_range: * have associativity properties. If none, then * everything goes to default_nid. */ - nid = of_node_to_chipid_single(memory); + nid = of_node_to_nid_single(memory); if (nid < 0) nid = default_nid; @@ -1119,7 +1141,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) if ((scn_addr < start) || (scn_addr >= (start + size))) continue; - nid = of_node_to_chipid_single(memory); + nid = of_node_to_nid_single(memory); break; } @@ -1415,7 +1437,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_chipid(associativity); + new_nid = associativity_to_nid(associativity); if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
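The wrappers above are deliberately trivial: they create a single seam where the chipid to nid translation can later be changed, which patch 5/5 does by swapping the identity step for map_chipid_to_nid(). A small standalone illustration of that staging follows; the names and values are invented for the example and only model the shape of the refactoring.

/* Illustration of the refactoring seam only; values are made up. */
#include <stdio.h>

static int associativity_to_chipid(int assoc_value)
{
	return assoc_value;            /* stands in for the device-tree parse */
}

/* Patch 3/5 stage: identity wrapper, nid == chipid. */
static int associativity_to_nid(int assoc_value)
{
	int chipid = associativity_to_chipid(assoc_value);

	return chipid;                 /* patch 5/5 replaces this line with map_chipid_to_nid(chipid) */
}

int main(void)
{
	printf("nid for associativity 16 = %d\n", associativity_to_nid(16));
	return 0;
}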
[PATCH RFC 0/5] powerpc:numa Add serial nid support
Problem description: Powerpc has sparse node numbering, i.e. on a 4 node system nodes are numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid got from device tree is naturally mapped (directly) to nid. Potential side effect of that is: 1) There are several places in kernel that assumes serial node numbering. and memory allocations assume that all the nodes from 0-(highest nid) exist inturn ending up allocating memory for the nodes that does not exist. 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping sparse nid of the host system to contiguous nids of guest (numa affinity, placement) could be a challenge. Possible Solutions: 1) Handling the memory allocations is kernel case by case: Though in some cases it is easy to achieve, some cases may be intrusive/not trivial. at the end it does not handle side effect (2) above. 2) Map the sparse chipid got from device tree to a serial nid at kernel level (The idea proposed in this series). Pro: It is more natural to handle at kernel level than at lower (OPAL) layer. con: The chipid is in device tree no longer the same as nid in kernel 3) Let the lower layer (OPAL) give the serial node ids after parsing the chipid and the associativity etc [ either as a separate item in device tree or by compacting the chipid numbers ] Pros: kernel, device tree are on same page and less change in kernel Con: is it the functionality expected in lower layer As mentioned above, current patch series tries to map chipid from lower layer to a contiguos nid at kernel level keeping the node distance calculation and so on intact. Result: Before the patch: numactl -H available: 4 nodes (0-1,16-17) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 31665 MB node 0 free: 29836 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 32722 MB node 1 free: 32019 MB node 16 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 16 size: 32571 MB node 16 free: 31222 MB node 17 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 17 size: 0 MB node 17 free: 0 MB node distances: node 0 1 16 17 0: 10 20 40 40 1: 20 10 40 40 16: 40 40 10 20 17: 40 40 20 10 After the patch: numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 31665 MB node 0 free: 30657 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 32722 MB node 1 free: 32566 MB node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 2 size: 32571 MB node 2 free: 32401 MB node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 3 size: 0 MB node 3 free: 0 MB node distances: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 (note that numa distances are intact). Apart from this, The following tests are done with the patched kernel (both baremetal and KVM guest with multiple nodes) to ensure there is no breakage. 
1) offlining and onlining of memory in the /sys/devices/system/node/nodeX path 2) offlining and onlining of cpus in the /sys/devices/system/cpu/ path 3) Numactl tests from ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz (in fact there was more breakage before the patch because of the sparse nid and memoryless node cases on powerpc) 4) Thousands of docker containers were spawned. Please let me know your comments. patch 1-3: cleanup patches patch 4: Adds helper functions to map nid and chipid patch 5: Uses the mapping to get serial nids Raghavendra K T (5): powerpc:numa Add numa_cpu_lookup function to update lookup table powerpc:numa Rename functions referring to nid as chipid powerpc:numa create 1:1 mapping between chipid and nid powerpc:numa Add helper functions to maintain chipid to nid mapping powerpc:numa Use chipid to nid mapping to get serial numa node ids arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 ++-- arch/powerpc/mm/numa.c| 121 +++--- 3 files changed, 105 insertions(+), 28 deletions(-) -- 1.7.11.7
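As a side note on the mechanism: the series compacts the sparse chipids (0,1,16,17 in the topology above) into serial nids (0-3) by handing each newly seen chipid the first unused nid, tracked in the chipid_map nodemask. A minimal userspace sketch of that idea follows; the bitmap helpers, the MAX_NODES value and the demo input are assumptions for illustration only, not the kernel implementation.

/* Userspace model of sparse chipid -> serial nid compaction. */
#include <stdio.h>

#define MAX_NODES 32
#define NUMA_NO_NODE (-1)

static unsigned long nid_used;                  /* models the chipid_map nodemask */
static int chipid_to_nid_map[MAX_NODES] = {
	[0 ... MAX_NODES - 1] = NUMA_NO_NODE
};

static int first_unset_bit(unsigned long mask)  /* models first_unset_node() */
{
	int bit;

	for (bit = 0; bit < MAX_NODES; bit++)
		if (!(mask & (1UL << bit)))
			return bit;
	return NUMA_NO_NODE;
}

static int map_chipid_to_nid(int chipid)
{
	int nid;

	if (chipid < 0 || chipid >= MAX_NODES)
		return NUMA_NO_NODE;

	nid = chipid_to_nid_map[chipid];
	if (nid == NUMA_NO_NODE) {
		nid = first_unset_bit(nid_used);
		if (nid == NUMA_NO_NODE)
			return NUMA_NO_NODE;    /* all nids already in use */
		chipid_to_nid_map[chipid] = nid;
		nid_used |= 1UL << nid;
	}
	return nid;
}

int main(void)
{
	int sparse_chipids[] = { 0, 1, 16, 17 };  /* topology from the cover letter */
	unsigned int i;

	for (i = 0; i < sizeof(sparse_chipids) / sizeof(sparse_chipids[0]); i++)
		printf("chipid %2d -> nid %d\n", sparse_chipids[i],
		       map_chipid_to_nid(sparse_chipids[i]));
	return 0;
}

Running it prints 0, 1, 16, 17 mapped to 0, 1, 2, 3 respectively, mirroring the before/after numactl -H output shown above.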
[PATCH RFC 5/5] powerpc:numa Use chipid to nid mapping to get serial numa node ids
Also properly initialize numa distance table for serial nids. Problem: Powerpc supports sparse nid numbering which could affect 1) memory footprint 2) virtualization use cases Current solution: The patch maps sparse chipid got from the device tree to serial nids. Result before: node 0 1 16 17 0: 10 20 40 40 1: 20 10 40 40 16: 40 40 10 20 17: 40 40 20 10 After: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 Testing: Scenarios tested on baremetal and KVM guest with 4 nodes 1) offlining and onlining memory and cpus 2) Running the tests from numactl source. 3) Creating 1000s of docker containers stressing the system Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f015cad..873ac8c 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -304,7 +304,8 @@ static int associativity_to_chipid(const __be32 *associativity) /* * Skip the length field and send start of associativity array */ - initialize_distance_lookup_table(chipid, associativity + 1); + initialize_distance_lookup_table(chipid_to_nid(chipid), +associativity + 1); } out: @@ -314,9 +315,10 @@ out: /* Return the nid from associativity */ static int associativity_to_nid(const __be32 *associativity) { - int nid; + int chipid, nid; - nid = associativity_to_chipid(associativity); + chipid = associativity_to_chipid(associativity); + nid = map_chipid_to_nid(chipid); return nid; } @@ -340,9 +342,10 @@ static int of_node_to_chipid_single(struct device_node *device) */ static int of_node_to_nid_single(struct device_node *device) { - int nid; + int chipid, nid; - nid = of_node_to_chipid_single(device); + chipid = of_node_to_chipid_single(device); + nid = map_chipid_to_nid(chipid); return nid; } -- 1.7.11.7
[PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T --- arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..56fbe9e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. */ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, numa_cpu_lookup(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(numa_cpu_lookup(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(numa_cpu_lookup(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); } @@ -718,8 +718,8 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + set_numa_node(numa_cpu_lookup(cpu)); + set_numa_mem(local_memory_node(numa_cpu_lookup(cpu))); smp_wmb(); notify_cpu_starting(cpu); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { numa_cpu_lookup_table[cpu] = node; } +static void reset_numa_cpu_lookup_table(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + update_numa_cpu_lookup_table(cpu, -1); +} + static void map_cpu_to_node(int cpu, int node) { update_numa_cpu_lookup_table(cpu, node); @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = numa_cpu_lookup_table[cpu]; + int node = numa_cpu_lookup(cpu); dbg("removing cpu %lu from node %d\n", cpu, node); @@ 
-536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + nid = numa_cpu_lookup(lcpu); + if (nid >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { + if (new_nid == numa_cpu_lookup(cpu)) { cpumask_andnot(_associativity_changes_mask, _associativity_changes_mask, cpu_sibling_mask(cpu));
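The motivation stated above (hide the raw numa_cpu_lookup_table[] behind an accessor so the cpu to node scheme can be changed in one place) is plain encapsulation. A tiny standalone illustration, with invented table contents, might look like this; it only models the pattern, not the kernel's actual table handling.

/* Standalone illustration of the accessor pattern; table contents invented. */
#include <stdio.h>

#define NR_CPUS 8

static int numa_cpu_lookup_table[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* All readers go through this helper, so changing the cpu->node scheme
 * (e.g. routing it through a chipid->nid map) touches only one place. */
static int numa_cpu_lookup(int cpu)
{
	return numa_cpu_lookup_table[cpu];
}

static void update_numa_cpu_lookup_table(int cpu, int node)
{
	numa_cpu_lookup_table[cpu] = node;
}

int main(void)
{
	update_numa_cpu_lookup_table(5, 2);
	printf("cpu 5 is on node %d\n", numa_cpu_lookup(5));
	return 0;
}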
[PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
Create arrays that maps serial nids and sparse chipids. Note: My original idea had only two arrays of chipid to nid map. Final code is inspired by driver/acpi/numa.c that maps a proximity node with a logical node by Takayoshi Kochi , and thus uses an additional chipid_map nodemask. The mask helps in first unused nid easily by knowing first unset bit in the mask. No change in functionality. Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 48 +++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index dd2073b..f015cad 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -63,6 +63,11 @@ static int form1_affinity; static int distance_ref_points_depth; static const __be32 *distance_ref_points; static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; +static nodemask_t chipid_map = NODE_MASK_NONE; +static int chipid_to_nid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; +static int nid_to_chipid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; /* * Allocate node_to_cpumask_map based on number of available nodes @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } +int chipid_to_nid(int chipid) +{ + if (chipid < 0) + return NUMA_NO_NODE; + return chipid_to_nid_map[chipid]; +} + +int nid_to_chipid(int nid) +{ + if (nid < 0) + return NUMA_NO_NODE; + return nid_to_chipid_map[nid]; +} + +static void __map_chipid_to_nid(int chipid, int nid) +{ + if (chipid_to_nid_map[chipid] == NUMA_NO_NODE +|| nid < chipid_to_nid_map[chipid]) + chipid_to_nid_map[chipid] = nid; + if (nid_to_chipid_map[nid] == NUMA_NO_NODE + || chipid < nid_to_chipid_map[nid]) + nid_to_chipid_map[nid] = chipid; +} + +int map_chipid_to_nid(int chipid) +{ + int nid; + + if (chipid < 0 || chipid >= MAX_NUMNODES) + return NUMA_NO_NODE; + + nid = chipid_to_nid_map[chipid]; + if (nid == NUMA_NO_NODE) { + if (nodes_weight(chipid_map) >= MAX_NUMNODES) + return NUMA_NO_NODE; + nid = first_unset_node(chipid_map); + __map_chipid_to_nid(chipid, nid); + node_set(nid, chipid_map); + } + return nid; +} + int numa_cpu_lookup(int cpu) { return numa_cpu_lookup_table[cpu]; @@ -264,7 +311,6 @@ out: return chipid; } - /* Return the nid from associativity */ static int associativity_to_nid(const __be32 *associativity) { -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid
There is no change in the fuctionality Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 42 +- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index d5e6eee..f84ed2f 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid, } } -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa * info is found. */ -static int associativity_to_nid(const __be32 *associativity) +static int associativity_to_chipid(const __be32 *associativity) { - int nid = -1; + int chipid = -1; if (min_common_depth == -1) goto out; if (of_read_number(associativity, 1) >= min_common_depth) - nid = of_read_number([min_common_depth], 1); + chipid = of_read_number([min_common_depth], 1); /* POWER4 LPAR uses 0x as invalid node */ - if (nid == 0x || nid >= MAX_NUMNODES) - nid = -1; + if (chipid == 0x || chipid >= MAX_NUMNODES) + chipid = -1; - if (nid > 0 && + if (chipid > 0 && of_read_number(associativity, 1) >= distance_ref_points_depth) { /* * Skip the length field and send start of associativity array */ - initialize_distance_lookup_table(nid, associativity + 1); + initialize_distance_lookup_table(chipid, associativity + 1); } out: - return nid; + return chipid; } -/* Returns the nid associated with the given device tree node, +/* Returns the chipid associated with the given device tree node, * or -1 if not found. */ -static int of_node_to_nid_single(struct device_node *device) +static int of_node_to_chipid_single(struct device_node *device) { - int nid = -1; + int chipid = -1; const __be32 *tmp; tmp = of_get_associativity(device); if (tmp) - nid = associativity_to_nid(tmp); - return nid; + chipid = associativity_to_chipid(tmp); + return chipid; } /* Walk the device tree upwards, looking for an associativity id */ @@ -286,7 +286,7 @@ int of_node_to_nid(struct device_node *device) of_node_get(device); while (device) { - nid = of_node_to_nid_single(device); + nid = of_node_to_chipid_single(device); if (nid != -1) break; @@ -498,7 +498,7 @@ static int of_get_assoc_arrays(struct device_node *memory, } /* - * This is like of_node_to_nid_single() for memory represented in the + * This is like of_node_to_chipid_single() for memory represented in the * ibm,dynamic-reconfiguration-memory node. */ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem, @@ -557,7 +557,7 @@ static int numa_setup_cpu(unsigned long lcpu) goto out; } - nid = of_node_to_nid_single(cpu); + nid = of_node_to_chipid_single(cpu); out_present: if (nid < 0 || !node_online(nid)) @@ -754,7 +754,7 @@ static int __init parse_numa_properties(void) cpu = of_get_cpu_node(i, NULL); BUG_ON(!cpu); - nid = of_node_to_nid_single(cpu); + nid = of_node_to_chipid_single(cpu); of_node_put(cpu); /* @@ -796,7 +796,7 @@ new_range: * have associativity properties. If none, then * everything goes to default_nid. 
*/ - nid = of_node_to_nid_single(memory); + nid = of_node_to_chipid_single(memory); if (nid < 0) nid = default_nid; @@ -1119,7 +1119,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) if ((scn_addr < start) || (scn_addr >= (start + size))) continue; - nid = of_node_to_nid_single(memory); + nid = of_node_to_chipid_single(memory); break; } @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_nid(associativity); + new_nid = associativity_to_chipid(associativity); if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.
[PATCH RFC 5/5] powerpc:numa Use chipid to nid mapping to get serial numa node ids
Also properly initialize numa distance table for serial nids. Problem: Powerpc supports sparse nid numbering which could affect 1) memory footprint 2) virtualization use cases Current solution: The patch maps sprase chipid got fromn device tree to serail nids. Result before: node 0 1 16 17 0: 10 20 40 40 1: 20 10 40 40 16: 40 40 10 20 17: 40 40 20 10 After: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 Testing: Scenarios tested on baremetal and KVM guest with 4 nodes 1) offlining and onlining memory and cpus 2) Running the tests from numactl source. 3) Creating 1000s of docker containers stressing the system Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f015cad..873ac8c 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -304,7 +304,8 @@ static int associativity_to_chipid(const __be32 *associativity) /* * Skip the length field and send start of associativity array */ - initialize_distance_lookup_table(chipid, associativity + 1); + initialize_distance_lookup_table(chipid_to_nid(chipid), +associativity + 1); } out: @@ -314,9 +315,10 @@ out: /* Return the nid from associativity */ static int associativity_to_nid(const __be32 *associativity) { - int nid; + int chipid, nid; - nid = associativity_to_chipid(associativity); + chipid = associativity_to_chipid(associativity); + nid = map_chipid_to_nid(chipid); return nid; } @@ -340,9 +342,10 @@ static int of_node_to_chipid_single(struct device_node *device) */ static int of_node_to_nid_single(struct device_node *device) { - int nid; + int chipid, nid; - nid = of_node_to_chipid_single(device); + chipid = of_node_to_chipid_single(device); + nid = map_chipid_to_nid(chipid); return nid; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid
There is no change in the fuctionality Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 42 +- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index d5e6eee..f84ed2f 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid, } } -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa * info is found. */ -static int associativity_to_nid(const __be32 *associativity) +static int associativity_to_chipid(const __be32 *associativity) { - int nid = -1; + int chipid = -1; if (min_common_depth == -1) goto out; if (of_read_number(associativity, 1) >= min_common_depth) - nid = of_read_number([min_common_depth], 1); + chipid = of_read_number([min_common_depth], 1); /* POWER4 LPAR uses 0x as invalid node */ - if (nid == 0x || nid >= MAX_NUMNODES) - nid = -1; + if (chipid == 0x || chipid >= MAX_NUMNODES) + chipid = -1; - if (nid > 0 && + if (chipid > 0 && of_read_number(associativity, 1) >= distance_ref_points_depth) { /* * Skip the length field and send start of associativity array */ - initialize_distance_lookup_table(nid, associativity + 1); + initialize_distance_lookup_table(chipid, associativity + 1); } out: - return nid; + return chipid; } -/* Returns the nid associated with the given device tree node, +/* Returns the chipid associated with the given device tree node, * or -1 if not found. */ -static int of_node_to_nid_single(struct device_node *device) +static int of_node_to_chipid_single(struct device_node *device) { - int nid = -1; + int chipid = -1; const __be32 *tmp; tmp = of_get_associativity(device); if (tmp) - nid = associativity_to_nid(tmp); - return nid; + chipid = associativity_to_chipid(tmp); + return chipid; } /* Walk the device tree upwards, looking for an associativity id */ @@ -286,7 +286,7 @@ int of_node_to_nid(struct device_node *device) of_node_get(device); while (device) { - nid = of_node_to_nid_single(device); + nid = of_node_to_chipid_single(device); if (nid != -1) break; @@ -498,7 +498,7 @@ static int of_get_assoc_arrays(struct device_node *memory, } /* - * This is like of_node_to_nid_single() for memory represented in the + * This is like of_node_to_chipid_single() for memory represented in the * ibm,dynamic-reconfiguration-memory node. */ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem, @@ -557,7 +557,7 @@ static int numa_setup_cpu(unsigned long lcpu) goto out; } - nid = of_node_to_nid_single(cpu); + nid = of_node_to_chipid_single(cpu); out_present: if (nid < 0 || !node_online(nid)) @@ -754,7 +754,7 @@ static int __init parse_numa_properties(void) cpu = of_get_cpu_node(i, NULL); BUG_ON(!cpu); - nid = of_node_to_nid_single(cpu); + nid = of_node_to_chipid_single(cpu); of_node_put(cpu); /* @@ -796,7 +796,7 @@ new_range: * have associativity properties. If none, then * everything goes to default_nid. 
*/ - nid = of_node_to_nid_single(memory); + nid = of_node_to_chipid_single(memory); if (nid < 0) nid = default_nid; @@ -1119,7 +1119,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) if ((scn_addr < start) || (scn_addr >= (start + size))) continue; - nid = of_node_to_nid_single(memory); + nid = of_node_to_chipid_single(memory); break; } @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_nid(associativity); + new_nid = associativity_to_chipid(associativity); if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org M
[PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com> --- arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..56fbe9e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. */ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, numa_cpu_lookup(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(numa_cpu_lookup(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(numa_cpu_lookup(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); } @@ -718,8 +718,8 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + set_numa_node(numa_cpu_lookup(cpu)); + set_numa_mem(local_memory_node(numa_cpu_lookup(cpu))); smp_wmb(); notify_cpu_starting(cpu); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { numa_cpu_lookup_table[cpu] = node; } +static void reset_numa_cpu_lookup_table(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + update_numa_cpu_lookup_table(cpu, -1); +} + static void map_cpu_to_node(int cpu, int node) { update_numa_cpu_lookup_table(cpu, node); @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = numa_cpu_lookup_table[cpu]; + int node = numa_cpu_lookup(cpu); dbg("removing cpu 
%lu from node %d\n", cpu, node); @@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + nid = numa_cpu_lookup(lcpu); + if (nid >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { + if (new_nid == numa_cpu_lookup(cpu)) { cpumask_andnot(_associativity_changes_mask, _associativity_changes_mask,
[PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
Create arrays that maps serial nids and sparse chipids. Note: My original idea had only two arrays of chipid to nid map. Final code is inspired by driver/acpi/numa.c that maps a proximity node with a logical node by Takayoshi Kochi <t-ko...@bq.jp.nec.com>, and thus uses an additional chipid_map nodemask. The mask helps in first unused nid easily by knowing first unset bit in the mask. No change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 48 +++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index dd2073b..f015cad 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -63,6 +63,11 @@ static int form1_affinity; static int distance_ref_points_depth; static const __be32 *distance_ref_points; static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; +static nodemask_t chipid_map = NODE_MASK_NONE; +static int chipid_to_nid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; +static int nid_to_chipid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; /* * Allocate node_to_cpumask_map based on number of available nodes @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } +int chipid_to_nid(int chipid) +{ + if (chipid < 0) + return NUMA_NO_NODE; + return chipid_to_nid_map[chipid]; +} + +int nid_to_chipid(int nid) +{ + if (nid < 0) + return NUMA_NO_NODE; + return nid_to_chipid_map[nid]; +} + +static void __map_chipid_to_nid(int chipid, int nid) +{ + if (chipid_to_nid_map[chipid] == NUMA_NO_NODE +|| nid < chipid_to_nid_map[chipid]) + chipid_to_nid_map[chipid] = nid; + if (nid_to_chipid_map[nid] == NUMA_NO_NODE + || chipid < nid_to_chipid_map[nid]) + nid_to_chipid_map[nid] = chipid; +} + +int map_chipid_to_nid(int chipid) +{ + int nid; + + if (chipid < 0 || chipid >= MAX_NUMNODES) + return NUMA_NO_NODE; + + nid = chipid_to_nid_map[chipid]; + if (nid == NUMA_NO_NODE) { + if (nodes_weight(chipid_map) >= MAX_NUMNODES) + return NUMA_NO_NODE; + nid = first_unset_node(chipid_map); + __map_chipid_to_nid(chipid, nid); + node_set(nid, chipid_map); + } + return nid; +} + int numa_cpu_lookup(int cpu) { return numa_cpu_lookup_table[cpu]; @@ -264,7 +311,6 @@ out: return chipid; } - /* Return the nid from associativity */ static int associativity_to_nid(const __be32 *associativity) { -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 0/5] powerpc:numa Add serial nid support
Problem description: Powerpc has sparse node numbering, i.e. on a 4 node system nodes are numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid got from device tree is naturally mapped (directly) to nid. Potential side effect of that is: 1) There are several places in kernel that assumes serial node numbering. and memory allocations assume that all the nodes from 0-(highest nid) exist inturn ending up allocating memory for the nodes that does not exist. 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping sparse nid of the host system to contiguous nids of guest (numa affinity, placement) could be a challenge. Possible Solutions: 1) Handling the memory allocations is kernel case by case: Though in some cases it is easy to achieve, some cases may be intrusive/not trivial. at the end it does not handle side effect (2) above. 2) Map the sparse chipid got from device tree to a serial nid at kernel level (The idea proposed in this series). Pro: It is more natural to handle at kernel level than at lower (OPAL) layer. con: The chipid is in device tree no longer the same as nid in kernel 3) Let the lower layer (OPAL) give the serial node ids after parsing the chipid and the associativity etc [ either as a separate item in device tree or by compacting the chipid numbers ] Pros: kernel, device tree are on same page and less change in kernel Con: is it the functionality expected in lower layer As mentioned above, current patch series tries to map chipid from lower layer to a contiguos nid at kernel level keeping the node distance calculation and so on intact. Result: Before the patch: numactl -H available: 4 nodes (0-1,16-17) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 31665 MB node 0 free: 29836 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 32722 MB node 1 free: 32019 MB node 16 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 16 size: 32571 MB node 16 free: 31222 MB node 17 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 17 size: 0 MB node 17 free: 0 MB node distances: node 0 1 16 17 0: 10 20 40 40 1: 20 10 40 40 16: 40 40 10 20 17: 40 40 20 10 After the patch: numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 31665 MB node 0 free: 30657 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 32722 MB node 1 free: 32566 MB node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 2 size: 32571 MB node 2 free: 32401 MB node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 3 size: 0 MB node 3 free: 0 MB node distances: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 (note that numa distances are intact). Apart from this, The following tests are done with the patched kernel (both baremetal and KVM guest with multiple nodes) to ensure there is no breakage. 
1) offlining and onlining of memory in /sys/devices/system/node/nodeX path
2) offlining and onlining of cpus in /sys/devices/system/cpu/ path
3) Numactl tests from ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz (in fact there was more breakage before the patch because of the sparse nid and memoryless node cases of powerpc)
4) Thousands of docker containers were spawned.

Please let me know your comments.

patch 1-3: cleanup patches
patch 4: Adds helper functions to map nid and chipid
patch 5: Uses the mapping to get serial nid

Raghavendra K T (5):
  powerpc:numa Add numa_cpu_lookup function to update lookup table
  powerpc:numa Rename functions referring to nid as chipid
  powerpc:numa create 1:1 mapping between chipid and nid
  powerpc:numa Add helper functions to maintain chipid to nid mapping
  powerpc:numa Use chipid to nid mapping to get serial numa node ids

 arch/powerpc/include/asm/mmzone.h |   2 +-
 arch/powerpc/kernel/smp.c         |  10 ++--
 arch/powerpc/mm/numa.c            | 121 +++---
 3 files changed, 105 insertions(+), 28 deletions(-)

-- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
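[Editor's illustration] As an aside for readers not familiar with nr_node_ids: it is derived from the highest possible node id plus one, which is what makes the 0,1,16,17 layout above expensive for plain 0..nr_node_ids-1 loops. A toy calculation, with the node ids hard-coded from the example above; nothing here is kernel code.

	/* Which ids does a 0..nr_node_ids-1 loop visit that are not possible nodes? */
	#include <stdio.h>

	int main(void)
	{
		int possible[] = { 0, 1, 16, 17 };
		int npossible = 4, nr_node_ids = 0, wasted = 0;

		for (int i = 0; i < npossible; i++)
			if (possible[i] + 1 > nr_node_ids)
				nr_node_ids = possible[i] + 1;   /* highest id + 1 -> 18 */

		for (int id = 0; id < nr_node_ids; id++) {
			int exists = 0;

			for (int i = 0; i < npossible; i++)
				if (possible[i] == id)
					exists = 1;
			if (!exists)
				wasted++;                        /* would be a needless allocation */
		}
		printf("nr_node_ids=%d, non-existent ids visited=%d\n",
		       nr_node_ids, wasted);
		return 0;
	}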
[PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
Once we have made the distinction between nid and chipid create a 1:1 mapping between them. This makes compacting the nids easy later. No functionality change. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 36 +--- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f84ed2f..dd2073b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -264,6 +264,17 @@ out: return chipid; } + + /* Return the nid from associativity */ +static int associativity_to_nid(const __be32 *associativity) +{ + int nid; + + nid = associativity_to_chipid(associativity); + return nid; +} + + /* Returns the chipid associated with the given device tree node, * or -1 if not found. */ @@ -278,6 +289,17 @@ static int of_node_to_chipid_single(struct device_node *device) return chipid; } +/* + * Returns nid from the chipid associated with given tree node + */ +static int of_node_to_nid_single(struct device_node *device) +{ + int nid; + + nid = of_node_to_chipid_single(device); + return nid; +} + /* Walk the device tree upwards, looking for an associativity id */ int of_node_to_nid(struct device_node *device) { @@ -286,7 +308,7 @@ int of_node_to_nid(struct device_node *device) of_node_get(device); while (device) { - nid = of_node_to_chipid_single(device); + nid = of_node_to_nid_single(device); if (nid != -1) break; @@ -498,7 +520,7 @@ static int of_get_assoc_arrays(struct device_node *memory, } /* - * This is like of_node_to_chipid_single() for memory represented in the + * This is like of_node_to_nid_single() for memory represented in the * ibm,dynamic-reconfiguration-memory node. */ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem, @@ -557,7 +579,7 @@ static int numa_setup_cpu(unsigned long lcpu) goto out; } - nid = of_node_to_chipid_single(cpu); + nid = of_node_to_nid_single(cpu); out_present: if (nid < 0 || !node_online(nid)) @@ -754,7 +776,7 @@ static int __init parse_numa_properties(void) cpu = of_get_cpu_node(i, NULL); BUG_ON(!cpu); - nid = of_node_to_chipid_single(cpu); + nid = of_node_to_nid_single(cpu); of_node_put(cpu); /* @@ -796,7 +818,7 @@ new_range: * have associativity properties. If none, then * everything goes to default_nid. */ - nid = of_node_to_chipid_single(memory); + nid = of_node_to_nid_single(memory); if (nid < 0) nid = default_nid; @@ -1119,7 +1141,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) if ((scn_addr < start) || (scn_addr >= (start + size))) continue; - nid = of_node_to_chipid_single(memory); + nid = of_node_to_nid_single(memory); break; } @@ -1415,7 +1437,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_chipid(associativity); + new_nid = associativity_to_nid(associativity); if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
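[Editor's illustration] The wrappers added above look redundant on purpose: they give the series a single point where the chipid->nid translation can later change. Below is a hedged sketch of that layering; the *_model functions and the main() driver are invented for this example, and per the cover letter it is patch 5 of the series that would turn the identity step into a real mapping such as map_chipid_to_nid().

	#include <stdio.h>

	/* Simplified model: callers ask for a nid, the wrapper asks for a chipid,
	 * and today the two are the same.  A later patch can swap the identity
	 * step for a real translation without touching any caller. */
	static int associativity_to_chipid_model(const unsigned int *assoc)
	{
		return (int)assoc[0];                   /* stand-in for the real parsing */
	}

	static int associativity_to_nid_model(const unsigned int *assoc)
	{
		return associativity_to_chipid_model(assoc);   /* identity, for now */
	}

	int main(void)
	{
		unsigned int assoc[] = { 16 };          /* pretend firmware said chipid 16 */

		printf("nid = %d\n", associativity_to_nid_model(assoc));
		return 0;
	}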
Re: [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
On 09/27/2015 11:59 PM, Raghavendra K T wrote: We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com> Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- Sorry: changelog edit messed it.. :( arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..56fbe9e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. */ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, numa_cpu_lookup(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(numa_cpu_lookup(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(numa_cpu_lookup(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); } @@ -718,8 +718,8 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + set_numa_node(numa_cpu_lookup(cpu)); + set_numa_mem(local_memory_node(numa_cpu_lookup(cpu))); smp_wmb(); notify_cpu_starting(cpu); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { numa_cpu_lookup_table[cpu] = node; } +static void reset_numa_cpu_lookup_table(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + update_numa_cpu_lookup_table(cpu, -1); +} + static void map_cpu_to_node(int cpu, int node) { update_numa_cpu_lookup_table(cpu, node); @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || 
defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = numa_cpu_lookup_table[cpu]; + int node = numa_cpu_lookup(cpu); dbg("removing cpu %lu from node %d\n", cpu, node); @@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + nid = numa_cpu_lookup(lcpu); + if (nid >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { +
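[Editor's illustration] The point of the numa_cpu_lookup() accessor in the patch above is purely structural: keep the table private and route every read through one function, so the representation can change in a single place. A small stand-alone model; NR_CPUS_MODEL and the main() driver are invented for this example and are not kernel code.

	#include <stdio.h>

	#define NR_CPUS_MODEL 8

	/* Keep the table private to this file; everyone else goes through the
	 * accessor, so swapping the representation later only touches here. */
	static int numa_cpu_lookup_table[NR_CPUS_MODEL];

	static int numa_cpu_lookup(int cpu)
	{
		return numa_cpu_lookup_table[cpu];
	}

	static void update_numa_cpu_lookup_table(int cpu, int node)
	{
		numa_cpu_lookup_table[cpu] = node;
	}

	int main(void)
	{
		for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
			update_numa_cpu_lookup_table(cpu, -1);   /* reset, as in the patch */

		update_numa_cpu_lookup_table(3, 1);
		printf("cpu 3 -> node %d\n", numa_cpu_lookup(3));
		return 0;
	}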
Re: [PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
* Michael Ellerman <m...@ellerman.id.au> [2015-09-22 15:29:03]:

> On Tue, 2015-09-15 at 07:38 +0530, Raghavendra K T wrote:
> >
> > ... nothing
>
> Sure this patch looks obvious, but please give me a changelog that proves
> you've thought about it thoroughly.
>
> For example is it OK to use for_each_node() at this point in boot? Is there any
> historical reason why we did it with a hard coded loop? If so what has changed.
> What systems have you tested on? etc. etc.
>
> cheers

Hi Michael, resending the patches with the changelog. Please note that the patch is in the -mm tree already.

---8<---
>From 86ead2520662c362d7b9ebd452ce1c33e156016f Mon Sep 17 00:00:00 2001
From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Date: Sun, 6 Sep 2015 12:54:40 +0530
Subject: [PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes

With the setup_nr_nodes(), we have already initialized node_possible_map. So it is safe to use for_each_node here.

There are many places in the kernel that use a hardcoded 'for' loop with nr_node_ids, because all other architectures have numa nodes populated serially. That should be the reason we had maintained the same for powerpc. But, since sparse numa node ids are possible on powerpc, we unnecessarily allocate memory for non existent numa nodes. For e.g., on a system with 0,1,16,17 as numa nodes, nr_node_ids=18 and we allocate memory for the non existent nodes 2-15. With this patch we allocate memory only for existing numa nodes.

The patch is boot tested on a 4 node tuleta [ confirmed with printks ] that it works as expected.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..8d8a541 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void) setup_nr_node_ids(); /* allocate the map */ - for (node = 0; node < nr_node_ids; node++) + for_each_node(node) alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]); /* cpumask_of_node() will now work */ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
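[Editor's illustration] To see what the one-line change buys on a sparse layout, here is a toy userspace model: the bitmap stands in for node_possible_map and the counters for alloc_bootmem_cpumask_var() calls; it is not kernel code.

	#include <stdio.h>

	#define MAX_NODES_MODEL 32

	/* Possible nodes 0,1,16,17 as a bitmap (stand-in for node_possible_map). */
	static unsigned long possible = (1UL << 0) | (1UL << 1) | (1UL << 16) | (1UL << 17);
	static int nr_node_ids_model = 18;      /* highest possible id + 1 */

	int main(void)
	{
		int allocs_old = 0, allocs_new = 0;

		/* Old loop: one allocation per id below nr_node_ids, holes included. */
		for (int node = 0; node < nr_node_ids_model; node++)
			allocs_old++;

		/* New loop: only ids that are actually set in the possible mask. */
		for (int node = 0; node < MAX_NODES_MODEL; node++)
			if (possible & (1UL << node))
				allocs_new++;

		printf("allocations: old=%d new=%d\n", allocs_old, allocs_new);  /* 18 vs 4 */
		return 0;
	}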
Re: [PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
On 09/22/2015 10:59 AM, Michael Ellerman wrote: On Tue, 2015-09-15 at 07:38 +0530, Raghavendra K T wrote: ... nothing Sure this patch looks obvious, but please give me a changelog that proves you've thought about it thoroughly. For example is it OK to use for_each_node() at this point in boot? Is there any historical reason why we did it with a hard coded loop? If so what has changed. What systems have you tested on? etc. etc. cheers Changelog: With the setup_nr_nodes(), we have already initialized node_possible_map. So it is safe to use for_each_node here. There are many places in the kernel that use hardcoded 'for' loop with nr_node_ids, because all other architectures have numa nodes populated serially. That should be reason we had maintained same for powerpc. But since on power we have sparse numa node ids possible, we unnecessarily allocate memory for non existent numa nodes. For e.g., on a system with 0,1,16,17 as numa nodes nr_node_ids=18 and we allocate memory for nodes 2-14. The patch is boot tested on a 4 node tuleta [ confirming with printks ]. that it works as expected. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..8d8a541 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void) setup_nr_node_ids(); /* allocate the map */ - for (node = 0; node < nr_node_ids; node++) + for_each_node(node) alloc_bootmem_cpumask_var(_to_cpumask_map[node]); /* cpumask_of_node() will now work */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
On 09/14/2015 02:30 PM, Vladimir Davydov wrote: Hi, On Wed, Sep 09, 2015 at 12:01:46AM +0530, Raghavendra K T wrote: The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. Do I understand correctly that node 0 must always be in node_possible_map? I ask, because we currently test lru->node[0].memcg_lrus to determine if the list is memcg aware. Yes, node 0 is always there. So it should not be a problem. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- mm/list_lru.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/mm/list_lru.c b/mm/list_lru.c index 909eca2..5a97f83 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -377,7 +377,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { if (!memcg_aware) lru->node[i].memcg_lrus = NULL; So, we don't explicitly initialize memcg_lrus for nodes that are not in node_possible_map. That's OK, because we allocate lru->node using kzalloc. However, this partial nullifying in case !memcg_aware looks confusing IMO. Let's drop it, I mean something like this: Yes, you are right. and we do not have to have memcg_aware check inside for loop too. Will change as per your suggestion and send V2. Thanks for the review. static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; if (!memcg_aware) return 0; for_each_node(i) { if (memcg_init_list_lru_node(>node[i])) goto fail; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
On 09/14/2015 05:34 PM, Vladimir Davydov wrote: On Mon, Sep 14, 2015 at 05:09:31PM +0530, Raghavendra K T wrote: On 09/14/2015 02:30 PM, Vladimir Davydov wrote: On Wed, Sep 09, 2015 at 12:01:46AM +0530, Raghavendra K T wrote: The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. Do I understand correctly that node 0 must always be in node_possible_map? I ask, because we currently test lru->node[0].memcg_lrus to determine if the list is memcg aware. Yes, node 0 is always there. So it should not be a problem. I think it should be mentioned in the comment to list_lru_memcg_aware then. Something like this: ? static inline bool list_lru_memcg_aware(struct list_lru *lru) { /* * This needs node 0 to be always present, even * in the systems supporting sparse numa ids. */ return !!lru->node[0].memcg_lrus; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 0/2] Replace nr_node_ids for loop with for_each_node
Many places in the kernel use 'for' loop with nr_node_ids. For the architectures which supports sparse numa ids, this will result in some unnecessary allocations for non existing nodes. (for e.g., numa node numbers such as 0,1,16,17 is common in powerpc.) So replace the for loop with for_each_node so that allocations happen only for existing numa nodes. Please note that, though there are many places where nr_node_ids is used, current patchset uses for_each_node only for slowpath to avoid find_next_bit traversal. Changes in V2: - Take memcg_aware check outside for_each loop (Vldimir) - Add comment that node 0 should always be present (Vladimir) Raghavendra K T (2): mm: Replace nr_node_ids for loop with for_each_node in list lru powerpc:numa Do not allocate bootmem memory for non existing nodes arch/powerpc/mm/numa.c | 2 +- mm/list_lru.c | 34 +++--- 2 files changed, 24 insertions(+), 12 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
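[Editor's illustration] On the remark above about keeping for_each_node() to slow paths: for_each_node() roughly expands into repeated next-set-bit searches over node_possible_map, whereas the nr_node_ids loop is a plain increment. A naive userspace model of that traversal follows; next_set_bit() here is deliberately simple, not the kernel's optimized find_next_bit().

	#include <stdio.h>

	#define MAX_NODES_MODEL 32

	/* Each step of the for_each_node()-style loop is a bit search. */
	static int next_set_bit(unsigned long mask, int start)
	{
		for (int i = start; i < MAX_NODES_MODEL; i++)
			if (mask & (1UL << i))
				return i;
		return MAX_NODES_MODEL;
	}

	#define for_each_node_model(node, mask) \
		for ((node) = next_set_bit((mask), 0); \
		     (node) < MAX_NODES_MODEL;         \
		     (node) = next_set_bit((mask), (node) + 1))

	int main(void)
	{
		unsigned long possible = (1UL << 0) | (1UL << 1) | (1UL << 16) | (1UL << 17);
		int node;

		for_each_node_model(node, possible)
			printf("visiting node %d\n", node);      /* 0, 1, 16, 17 */
		return 0;
	}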
[PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..8d8a541 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void) setup_nr_node_ids(); /* allocate the map */ - for (node = 0; node < nr_node_ids; node++) + for_each_node(node) alloc_bootmem_cpumask_var(_to_cpumask_map[node]); /* cpumask_of_node() will now work */ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
On 09/15/2015 07:38 AM, Raghavendra K T wrote: The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. [ Take memcg_aware check outside for_each loop: Vladimir] Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- Sorry that I had misspelled Vladimir above. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. [ Take memcg_aware check outside for_each loop: Vldimir] Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- mm/list_lru.c | 34 +++--- 1 file changed, 23 insertions(+), 11 deletions(-) Changes in V2: - Take memcg_aware check outside for_each loop (Vldimir) - Add comment that node 0 should always be present (Vladimir) diff --git a/mm/list_lru.c b/mm/list_lru.c index 909eca2..60cd6dd 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -42,6 +42,10 @@ static void list_lru_unregister(struct list_lru *lru) #ifdef CONFIG_MEMCG_KMEM static inline bool list_lru_memcg_aware(struct list_lru *lru) { + /* +* This needs node 0 to be always present, even +* in the systems supporting sparse numa ids. +*/ return !!lru->node[0].memcg_lrus; } @@ -377,16 +381,20 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; - for (i = 0; i < nr_node_ids; i++) { - if (!memcg_aware) - lru->node[i].memcg_lrus = NULL; - else if (memcg_init_list_lru_node(>node[i])) + if (!memcg_aware) + return 0; + + for_each_node(i) { + if (memcg_init_list_lru_node(>node[i])) goto fail; } return 0; fail: - for (i = i - 1; i >= 0; i--) + for (i = i - 1; i >= 0; i--) { + if (!lru->node[i].memcg_lrus) + continue; memcg_destroy_list_lru_node(>node[i]); + } return -ENOMEM; } @@ -397,7 +405,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru) if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_destroy_list_lru_node(>node[i]); } @@ -409,16 +417,20 @@ static int memcg_update_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return 0; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { if (memcg_update_list_lru_node(>node[i], old_size, new_size)) goto fail; } return 0; fail: - for (i = i - 1; i >= 0; i--) + for (i = i - 1; i >= 0; i--) { + if (!lru->node[i].memcg_lrus) + continue; + memcg_cancel_update_list_lru_node(>node[i], old_size, new_size); + } return -ENOMEM; } @@ -430,7 +442,7 @@ static void memcg_cancel_update_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_cancel_update_list_lru_node(>node[i], old_size, new_size); } @@ -485,7 +497,7 @@ static void memcg_drain_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_drain_list_lru_node(>node[i], src_idx, dst_idx); } @@ -522,7 +534,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, if (!lru->node) goto out; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { spin_lock_init(>node[i].lock); if (key) lockdep_set_class(>node[i].lock, key); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
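[Editor's illustration] One detail of the error path above is worth spelling out: because the init loop now uses for_each_node(), slots for impossible node ids are never initialized (they stay NULL from the kzalloc of lru->node), so the unwind loop has to skip them. A minimal userspace model of that init/unwind pattern; the possible[] mask, calloc() and the simulated failure are stand-ins, not the list_lru code.

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define MAX_NODES_MODEL 8

	struct node_state {
		int *memcg_lrus;        /* NULL means "never initialized" */
	};

	static struct node_state nodes[MAX_NODES_MODEL];
	/* Possible nodes 0,1,6,7: the holes in between are never touched. */
	static const int possible[MAX_NODES_MODEL] = { 1, 1, 0, 0, 0, 0, 1, 1 };

	static int init_all(int fail_at)
	{
		for (int i = 0; i < MAX_NODES_MODEL; i++) {
			if (!possible[i])
				continue;                   /* for_each_node() skips holes */
			if (i == fail_at)
				goto fail;                  /* simulate -ENOMEM */
			nodes[i].memcg_lrus = calloc(4, sizeof(int));
		}
		return 0;
	fail:
		for (int i = fail_at - 1; i >= 0; i--) {
			if (!nodes[i].memcg_lrus)
				continue;                   /* hole, never initialized: skip */
			free(nodes[i].memcg_lrus);
			nodes[i].memcg_lrus = NULL;
		}
		return -1;
	}

	int main(void)
	{
		memset(nodes, 0, sizeof(nodes));            /* kzalloc()-like zeroing */
		printf("init_all(fail_at=6) -> %d\n", init_all(6));
		return 0;
	}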
[PATCH 0/2] Replace nr_node_ids for loop with for_each_node
Many places in the kernel use 'for' loop with nr_node_ids. For the architectures which supports sparse numa ids, this will result in some unnecessary allocations for non existing nodes. (for e.g., numa node numbers such as 0,1,16,17 is common in powerpc.) So replace the for loop with for_each_node so that allocations happen only for existing numa nodes. Please note that, though there are many places where nr_node_ids is used, current patchset uses for_each_node only for slowpath to avoid find_next_bit traversal. Raghavendra K T (2): mm: Replace nr_node_ids for loop with for_each_node in list lru powerpc:numa Do not allocate bootmem memory for non existing nodes arch/powerpc/mm/numa.c | 2 +- mm/list_lru.c | 23 +++ 2 files changed, 16 insertions(+), 9 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..8d8a541 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void) setup_nr_node_ids(); /* allocate the map */ - for (node = 0; node < nr_node_ids; node++) + for_each_node(node) alloc_bootmem_cpumask_var(_to_cpumask_map[node]); /* cpumask_of_node() will now work */ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- mm/list_lru.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/mm/list_lru.c b/mm/list_lru.c index 909eca2..5a97f83 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -377,7 +377,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { if (!memcg_aware) lru->node[i].memcg_lrus = NULL; else if (memcg_init_list_lru_node(>node[i])) @@ -385,8 +385,11 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) } return 0; fail: - for (i = i - 1; i >= 0; i--) + for (i = i - 1; i >= 0; i--) { + if (!lru->node[i].memcg_lrus) + continue; memcg_destroy_list_lru_node(>node[i]); + } return -ENOMEM; } @@ -397,7 +400,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru) if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_destroy_list_lru_node(>node[i]); } @@ -409,16 +412,20 @@ static int memcg_update_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return 0; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { if (memcg_update_list_lru_node(>node[i], old_size, new_size)) goto fail; } return 0; fail: - for (i = i - 1; i >= 0; i--) + for (i = i - 1; i >= 0; i--) { + if (!lru->node[i].memcg_lrus) + continue; + memcg_cancel_update_list_lru_node(>node[i], old_size, new_size); + } return -ENOMEM; } @@ -430,7 +437,7 @@ static void memcg_cancel_update_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_cancel_update_list_lru_node(>node[i], old_size, new_size); } @@ -485,7 +492,7 @@ static void memcg_drain_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_drain_list_lru_node(>node[i], src_idx, dst_idx); } @@ -522,7 +529,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, if (!lru->node) goto out; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { spin_lock_init(>node[i].lock); if (key) lockdep_set_class(>node[i].lock, key); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V4 1/2] net: Introduce helper functions to get the per cpu data
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 include/net/ip.h | 10 ++
 net/ipv4/af_inet.c | 41 +++--
 2 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h index d5fe9f2..93bf12e 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -202,10 +202,20 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, #define NET_ADD_STATS_BH(net, field, adnd) SNMP_ADD_STATS_BH((net)->mib.net_statistics, field, adnd) #define NET_ADD_STATS_USER(net, field, adnd) SNMP_ADD_STATS_USER((net)->mib.net_statistics, field, adnd) +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offct); unsigned long snmp_fold_field(void __percpu *mib, int offt); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, +size_t syncp_offset); u64 snmp_fold_field64(void __percpu *mib, int offt, size_t sync_off); #else +static inline u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, + size_t syncp_offset) +{ + return snmp_get_cpu_field(mib, cpu, offct); + +} + static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_off) { return snmp_fold_field(mib, offt); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 9532ee8..302e36b 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1448,38 +1448,51 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family, } EXPORT_SYMBOL_GPL(inet_ctl_sock_create); +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt) +{ + return *(((unsigned long *)per_cpu_ptr(mib, cpu)) + offt); +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field); + unsigned long snmp_fold_field(void __percpu *mib, int offt) { unsigned long res = 0; int i; for_each_possible_cpu(i) - res += *(((unsigned long *) per_cpu_ptr(mib, i)) + offt); + res += snmp_get_cpu_field(mib, i, offt); return res; } EXPORT_SYMBOL_GPL(snmp_fold_field); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offt, +size_t syncp_offset) +{ + void *bhptr; + struct u64_stats_sync *syncp; + u64 v; + unsigned int start; + + bhptr = per_cpu_ptr(mib, cpu); + syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); + do { + start = u64_stats_fetch_begin_irq(syncp); + v = *(((u64 *)bhptr) + offt); + } while (u64_stats_fetch_retry_irq(syncp, start)); + + return v; +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field64); + u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset) { u64 res = 0; int cpu; for_each_possible_cpu(cpu) { - void *bhptr; - struct u64_stats_sync *syncp; - u64 v; - unsigned int start; - - bhptr = per_cpu_ptr(mib, cpu); - syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); - do { - start = u64_stats_fetch_begin_irq(syncp); - v = *(((u64 *) bhptr) + offt); - } while (u64_stats_fetch_retry_irq(syncp, start)); - - res += v; + res += snmp_get_cpu_field64(mib, cpu, offt, syncp_offset); } return res; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
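[Editor's illustration] The helpers above address a field by its word offset into each CPU's copy of the mib structure. Below is a small stand-alone model of that access pattern: plain array indexing replaces per_cpu_ptr(), there is no u64_stats syncp handling, and the struct and field names are invented for the example.

	#include <stdio.h>

	#define NCPUS_MODEL 4

	struct mib_model {
		unsigned long in_pkts;      /* field offset 0 */
		unsigned long out_pkts;     /* field offset 1 */
		unsigned long in_errors;    /* field offset 2 */
	};

	static struct mib_model percpu_mib[NCPUS_MODEL] = {
		{ 10, 1, 0 }, { 20, 2, 1 }, { 30, 3, 0 }, { 40, 4, 2 },
	};

	/* Same idea as snmp_get_cpu_field(): treat one CPU's struct as an array
	 * of unsigned long and pick the field by its word offset (the kernel
	 * mibs rely on exactly this layout assumption). */
	static unsigned long get_cpu_field(int cpu, int offt)
	{
		return *(((unsigned long *)&percpu_mib[cpu]) + offt);
	}

	static unsigned long fold_field(int offt)
	{
		unsigned long res = 0;

		for (int cpu = 0; cpu < NCPUS_MODEL; cpu++)
			res += get_cpu_field(cpu, offt);
		return res;
	}

	int main(void)
	{
		printf("sum(in_pkts)  = %lu\n", fold_field(0));   /* 100 */
		printf("sum(out_pkts) = %lu\n", fold_field(1));   /* 10 */
		return 0;
	}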
[PATCH RFC V4 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field. reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data of an item (iteratively for around 36 items). idea: This patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Docker creation got faster by more than 2x after the patch. Result: Before After Docker creation time 6.836s 3.25s cache miss 2.7% 1.41% perf before: 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock perf after: 10.57% docker docker[.] scanblock 8.37% swapper [kernel.kallsyms] [k] snooze_loop 6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field 6.67% docker [kernel.kallsyms] [k] veth_stats_one changes/ideas suggested: Using buffer in stack (Eric), Usage of memset (David), Using memcpy in place of unaligned_put (Joe). Signed-off-by: Raghavendra K T --- net/ipv6/addrconf.c | 26 -- 1 file changed, 16 insertions(+), 10 deletions(-) Changes in V4: - remove 'item' variable and use IPSTATS_MIB_MAX to avoid sparse warning (Eric) also remove 'item' parameter (Joe) - add missing memset of padding. Changes in V3: - use memset to initialize temp buffer in leaf function. (David) - use memcpy to copy the buffer data to stat instead of unalign_pu (Joe) - Move buffer definition to leaf function __snmp6_fill_stats64() (Eric) - Changes in V2: - Allocate the stat calculation buffer in stack. (Eric) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 21c2c81..4abc10c 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4624,18 +4624,24 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib, } static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, - int items, int bytes, size_t syncpoff) + int bytes, size_t syncpoff) { - int i; - int pad = bytes - sizeof(u64) * items; + int i, c; + u64 buff[IPSTATS_MIB_MAX]; + int pad = bytes - sizeof(u64) * IPSTATS_MIB_MAX; + BUG_ON(pad < 0); - /* Use put_unaligned() because stats may not be aligned for u64. */ - put_unaligned(items, [0]); - for (i = 1; i < items; i++) - put_unaligned(snmp_fold_field64(mib, i, syncpoff), [i]); + memset(buff, 0, sizeof(buff)); + buff[0] = IPSTATS_MIB_MAX; - memset([items], 0, pad); + for_each_possible_cpu(c) { + for (i = 1; i < IPSTATS_MIB_MAX; i++) + buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); + } + + memcpy(stats, buff, IPSTATS_MIB_MAX * sizeof(u64)); + memset([IPSTATS_MIB_MAX], 0, pad); } static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, @@ -4643,8 +4649,8 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, { switch (attrtype) { case IFLA_INET6_STATS: - __snmp6_fill_stats64(stats, idev->stats.ipv6, -IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp)); + __snmp6_fill_stats64(stats, idev->stats.ipv6, bytes, +offsetof(struct ipstats_mib, syncp)); break; case IFLA_INET6_ICMP6STATS: __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
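[Editor's illustration] The heart of this patch is a loop interchange: instead of folding one field across every CPU and then starting over for the next field (roughly 36 passes over the per-cpu data), walk each CPU once and accumulate all of its fields into a stack buffer. A userspace model of the two orders follows; the cache effect is only described in comments, and both orders produce identical sums.

	#include <stdio.h>
	#include <string.h>

	#define NCPUS_MODEL  4
	#define NITEMS_MODEL 8

	static unsigned long percpu_stats[NCPUS_MODEL][NITEMS_MODEL];

	/* Old order: for every item, touch every CPU's data, so each CPU's cache
	 * lines are pulled in NITEMS_MODEL times. */
	static void fold_item_major(unsigned long *out)
	{
		for (int item = 0; item < NITEMS_MODEL; item++) {
			out[item] = 0;
			for (int cpu = 0; cpu < NCPUS_MODEL; cpu++)
				out[item] += percpu_stats[cpu][item];
		}
	}

	/* New order: walk each CPU once and accumulate all items into a local
	 * buffer, so each CPU's lines are pulled in roughly once. */
	static void fold_cpu_major(unsigned long *out)
	{
		unsigned long buff[NITEMS_MODEL];

		memset(buff, 0, sizeof(buff));
		for (int cpu = 0; cpu < NCPUS_MODEL; cpu++)
			for (int item = 0; item < NITEMS_MODEL; item++)
				buff[item] += percpu_stats[cpu][item];
		memcpy(out, buff, sizeof(buff));
	}

	int main(void)
	{
		unsigned long a[NITEMS_MODEL], b[NITEMS_MODEL];

		for (int cpu = 0; cpu < NCPUS_MODEL; cpu++)
			for (int item = 0; item < NITEMS_MODEL; item++)
				percpu_stats[cpu][item] = cpu * 100 + item;

		fold_item_major(a);
		fold_cpu_major(b);
		printf("results match: %s\n", memcmp(a, b, sizeof(a)) ? "no" : "yes");
		return 0;
	}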
[PATCH RFC V4 0/2] Optimize the snmp stat aggregation for large cpus
While creating 1000 containers, perf is showing lot of time spent in snmp_fold_field on a large cpu system. The current patch tries to improve by reordering the statistics gathering. Please note that similar overhead was also reported while creating veth pairs https://lkml.org/lkml/2013/3/19/556 Changes in V4: - remove 'item' variable and use IPSTATS_MIB_MAX to avoid sparse warning (Eric) also remove 'item' parameter (Joe) - add missing memset of padding. Changes in V3: - use memset to initialize temp buffer in leaf function. (David) - use memcpy to copy the buffer data to stat instead of unalign_pu (Joe) - Move buffer definition to leaf function __snmp6_fill_stats64() (Eric) - Changes in V2: - Allocate the stat calculation buffer in stack. (Eric) Setup: 160 cpu (20 core) baremetal powerpc system with 1TB memory 1000 docker containers was created with command docker run -itd ubuntu:15.04 /bin/bash in loop observation: Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) perf data showed, creating veth interfaces resulting in the below code path was taking more time. rtnl_fill_ifinfo -> inet6_fill_link_af -> inet6_fill_ifla6_attrs -> snmp_fold_field proposed idea: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data to of an item (iteratively for around 36 items). The patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Performance of docker creation improved by around more than 2x after the patch. before the patch: #time docker run -itd ubuntu:15.04 /bin/bash 3f45ba571a42e925c4ec4aaee0e48d7610a9ed82a4c931f83324d41822cf6617 real0m6.836s user0m0.095s sys 0m0.011s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 32K of event 'cycles' # Event count (approx.): 24688700190 # Overhead Command Shared Object Symbol # ... .. 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock 1.37% docker docker [.] backtrace_qsort 1.31% docker docker [.] strings.FieldsFunc cache-misses: 2.7% after the patch: = #time docker run -itd ubuntu:15.04 /bin/bash 9178273e9df399c8290b6c196e4aef9273be2876225f63b14a60cf97eacfafb5 real0m3.249s user0m0.088s sys 0m0.020s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 18K of event 'cycles' # Event count (approx.): 13958958797 # Overhead Command Shared Object Symbol # ... .. 10.57% docker docker[.] scanblock 8.37% swapper [kernel.kallsyms] [k] snooze_loop 6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field 6.67% docker [kernel.kallsyms] [k] veth_stats_one 3.96% docker docker[.] runtime_MSpan_Sweep 2.47% docker docker[.] strings.FieldsFunc cache-misses: 1.41 % Please let me know if you have suggestions/comments. Thanks Eric, Joe and David for the comments. Raghavendra K T (2): net: Introduce helper functions to get the per cpu data net: Optimize
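For contrast, a hedged sketch of the access pattern behind the original snmp_fold_field()-per-item loop, reusing the illustrative declarations from the sketch after the V4 2/2 patch above. Each field triggers a fresh pass over every CPU's per-cpu block, so with roughly 36 fields and 160 CPUs that is on the order of 5700 remote per-cpu reads per interface, and every step of the inner loop lands on a different CPU's cache lines:

/*
 * Old shape: one full walk over all CPUs per field.  The same remote
 * cache lines are re-fetched once per field, which is the cost that
 * dominates the "before" profile above.
 */
static void fill_stats_per_field(uint64_t *out)
{
        int field, cpu;

        for (field = 1; field < NR_FIELDS; field++) {
                uint64_t sum = 0;

                for (cpu = 0; cpu < nr_cpus; cpu++)
                        sum += per_cpu_data[cpu][field];
                out[field] = sum;
        }
}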
Re: [PATCH RFC V3 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
On 08/29/2015 08:51 PM, Joe Perches wrote: On Sat, 2015-08-29 at 07:32 -0700, Eric Dumazet wrote: On Sat, 2015-08-29 at 14:37 +0530, Raghavendra K T wrote: static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, - int items, int bytes, size_t syncpoff) + int items, int bytes, size_t syncpoff) { - int i; + int i, c; int pad = bytes - sizeof(u64) * items; + u64 buff[items]; + One last comment : using variable length arrays is confusing for the reader, and sparse as well. $ make C=2 net/ipv6/addrconf.o ... CHECK net/ipv6/addrconf.c net/ipv6/addrconf.c:4733:18: warning: Variable length array is used. net/ipv6/addrconf.c:4737:25: error: cannot size expression I suggest you remove 'items' parameter and replace it by IPSTATS_MIB_MAX, as __snmp6_fill_stats64() is called exactly once. If you respin, I suggest: o remove "items" from the __snmp6_fill_stats64 arguments and use IPSTATS_MIB_MAX in the function instead Yes, as also suggested by Eric. o add braces around the for_each_possible_cpu loop for_each_possible_cpu(c) { for (i = 1; i < items; i++) buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); } Sure. It makes it more readable. will respin V4 with these changes (+ memset 0 for pad which I realized). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
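The sparse complaint comes down to the variable-length array; a hedged sketch of the V3-to-V4 change being asked for, with FIELD_MAX as an illustrative stand-in for IPSTATS_MIB_MAX and fill_v3()/fill_v4() as made-up names:

#include <stdint.h>
#include <string.h>

/*
 * V3 shape: a VLA whose size is only known at run time.  sparse warns
 * ("Variable length array is used") and cannot size the expression, and
 * readers cannot see the stack cost at a glance.
 */
void fill_v3(int items)
{
        uint64_t buff[items];
        memset(buff, 0, sizeof(buff));  /* sizeof() on a VLA is evaluated at run time */
}

/*
 * V4 shape: a fixed upper bound known at compile time.  The function has
 * exactly one caller, which always passes the maximum, so the 'items'
 * parameter can simply be dropped.
 */
#define FIELD_MAX 36                    /* stand-in for IPSTATS_MIB_MAX */

void fill_v4(void)
{
        uint64_t buff[FIELD_MAX];
        memset(buff, 0, sizeof(buff));  /* constant-size clear */
}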
[PATCH RFC V3 1/2] net: Introduce helper functions to get the per cpu data
Signed-off-by: Raghavendra K T --- include/net/ip.h | 10 ++ net/ipv4/af_inet.c | 41 +++-- 2 files changed, 37 insertions(+), 14 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index d5fe9f2..93bf12e 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -202,10 +202,20 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, #define NET_ADD_STATS_BH(net, field, adnd) SNMP_ADD_STATS_BH((net)->mib.net_statistics, field, adnd) #define NET_ADD_STATS_USER(net, field, adnd) SNMP_ADD_STATS_USER((net)->mib.net_statistics, field, adnd) +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offct); unsigned long snmp_fold_field(void __percpu *mib, int offt); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, +size_t syncp_offset); u64 snmp_fold_field64(void __percpu *mib, int offt, size_t sync_off); #else +static inline u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, + size_t syncp_offset) +{ + return snmp_get_cpu_field(mib, cpu, offct); + +} + static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_off) { return snmp_fold_field(mib, offt); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 9532ee8..302e36b 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1448,38 +1448,51 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family, } EXPORT_SYMBOL_GPL(inet_ctl_sock_create); +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt) +{ + return *(((unsigned long *)per_cpu_ptr(mib, cpu)) + offt); +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field); + unsigned long snmp_fold_field(void __percpu *mib, int offt) { unsigned long res = 0; int i; for_each_possible_cpu(i) - res += *(((unsigned long *) per_cpu_ptr(mib, i)) + offt); + res += snmp_get_cpu_field(mib, i, offt); return res; } EXPORT_SYMBOL_GPL(snmp_fold_field); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, +size_t syncp_offset) +{ + void *bhptr; + struct u64_stats_sync *syncp; + u64 v; + unsigned int start; + + bhptr = per_cpu_ptr(mib, cpu); + syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); + do { + start = u64_stats_fetch_begin_irq(syncp); + v = *(((u64 *)bhptr) + offt); + } while (u64_stats_fetch_retry_irq(syncp, start)); + + return v; +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field64); + u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset) { u64 res = 0; int cpu; for_each_possible_cpu(cpu) { - void *bhptr; - struct u64_stats_sync *syncp; - u64 v; - unsigned int start; - - bhptr = per_cpu_ptr(mib, cpu); - syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); - do { - start = u64_stats_fetch_begin_irq(syncp); - v = *(((u64 *) bhptr) + offt); - } while (u64_stats_fetch_retry_irq(syncp, start)); - - res += v; + res += snmp_get_cpu_field(mib, cpu, offct, syncp_offset); } return res; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
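Why the 32-bit variant of the new helper is more involved than the 64-bit one: a 32-bit CPU cannot load a u64 atomically, so each per-cpu read is protected by a sequence counter and retried if a writer raced with it. The kernel code uses the u64_stats_sync machinery (u64_stats_fetch_begin_irq() / u64_stats_fetch_retry_irq()) with proper barriers; the following is only a simplified, hedged userspace sketch of the retry idea, with made-up names and barriers omitted:

#include <stdint.h>

/*
 * A writer makes 'seq' odd, updates 'value', then makes 'seq' even again.
 * A reader retries until it sees the same even sequence before and after
 * the read, proving the 64-bit value was not torn.
 */
struct stat64 {
        unsigned int seq;
        uint64_t value;
};

static uint64_t read_stat64(const volatile struct stat64 *s)
{
        unsigned int start;
        uint64_t v;

        do {
                start = s->seq;
                v = s->value;
        } while ((start & 1) || start != s->seq);

        return v;
}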
[PATCH RFC V3 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field. reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data of an item (iteratively for around 36 items). idea: This patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Docker creation got faster by more than 2x after the patch. Result: Before After Docker creation time 6.836s 3.25s cache miss 2.7% 1.41% perf before: 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock perf after: 10.57% docker docker[.] scanblock 8.37% swapper [kernel.kallsyms] [k] snooze_loop 6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field 6.67% docker [kernel.kallsyms] [k] veth_stats_one changes/ideas suggested: Using buffer in stack (Eric), Usage of memset (David), Using memcpy in place of unaligned_put (Joe). Signed-off-by: Raghavendra K T --- net/ipv6/addrconf.c | 22 +- 1 file changed, 13 insertions(+), 9 deletions(-) Changes in V3: - use memset to initialize temp buffer in leaf function. (David) - use memcpy to copy the buffer data to stat instead of unalign_pu (Joe) - Move buffer definition to leaf function __snmp6_fill_stats64() (Eric) - Changes in V2: - Allocate the stat calculation buffer in stack. (Eric) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 21c2c81..379619a 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4624,18 +4624,22 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib, } static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, - int items, int bytes, size_t syncpoff) + int items, int bytes, size_t syncpoff) { - int i; + int i, c; int pad = bytes - sizeof(u64) * items; + u64 buff[items]; + BUG_ON(pad < 0); - /* Use put_unaligned() because stats may not be aligned for u64. */ - put_unaligned(items, [0]); - for (i = 1; i < items; i++) - put_unaligned(snmp_fold_field64(mib, i, syncpoff), [i]); + memset(buff, 0, sizeof(buff)); + buff[0] = items; - memset([items], 0, pad); + for_each_possible_cpu(c) + for (i = 1; i < items; i++) + buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); + + memcpy(stats, buff, items * sizeof(u64)); } static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, @@ -4643,8 +4647,8 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, { switch (attrtype) { case IFLA_INET6_STATS: - __snmp6_fill_stats64(stats, idev->stats.ipv6, -IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp)); + __snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, +bytes, offsetof(struct ipstats_mib, syncp)); break; case IFLA_INET6_ICMP6STATS: __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V3 0/2] Optimize the snmp stat aggregation for large cpus
While creating 1000 containers, perf is showing lot of time spent in snmp_fold_field on a large cpu system. The current patch tries to improve by reordering the statistics gathering. Please note that similar overhead was also reported while creating veth pairs https://lkml.org/lkml/2013/3/19/556 Changes in V3: - use memset to initialize temp buffer in leaf function. (David) - use memcpy to copy the buffer data to stat instead of unalign_pu (Joe) - Move buffer definition to leaf function __snmp6_fill_stats64() (Eric) - Changes in V2: - Allocate the stat calculation buffer in stack. (Eric) Setup: 160 cpu (20 core) baremetal powerpc system with 1TB memory 1000 docker containers was created with command docker run -itd ubuntu:15.04 /bin/bash in loop observation: Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) perf data showed, creating veth interfaces resulting in the below code path was taking more time. rtnl_fill_ifinfo -> inet6_fill_link_af -> inet6_fill_ifla6_attrs -> snmp_fold_field proposed idea: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data to of an item (iteratively for around 36 items). The patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Performance of docker creation improved by around more than 2x after the patch. before the patch: #time docker run -itd ubuntu:15.04 /bin/bash 3f45ba571a42e925c4ec4aaee0e48d7610a9ed82a4c931f83324d41822cf6617 real0m6.836s user0m0.095s sys 0m0.011s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 32K of event 'cycles' # Event count (approx.): 24688700190 # Overhead Command Shared Object Symbol # ... .. 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock 1.37% docker docker [.] backtrace_qsort 1.31% docker docker [.] strings.FieldsFunc cache-misses: 2.7% after the patch: = #time docker run -itd ubuntu:15.04 /bin/bash 9178273e9df399c8290b6c196e4aef9273be2876225f63b14a60cf97eacfafb5 real0m3.249s user0m0.088s sys 0m0.020s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 18K of event 'cycles' # Event count (approx.): 13958958797 # Overhead Command Shared Object Symbol # ... .. 10.57% docker docker[.] scanblock 8.37% swapper [kernel.kallsyms] [k] snooze_loop 6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field 6.67% docker [kernel.kallsyms] [k] veth_stats_one 3.96% docker docker[.] runtime_MSpan_Sweep 2.47% docker docker[.] strings.FieldsFunc cache-misses: 1.41 % Please let me know if you have suggestions/comments. Thanks Eric, Joe and David for comments on V1 and V2. Raghavendra K T (2): net: Introduce helper functions to get the per cpu data net: Optimize snmp stat aggregation by walking all the percpu data at once include/net/ip.h| 10 ++ net/ipv4/af_ine
Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
On 08/29/2015 10:41 AM, David Miller wrote: From: Raghavendra K T Date: Sat, 29 Aug 2015 08:27:15 +0530 resending the patch with memset. Please let me know if you want to resend all the patches. Do not post patches as replies to existing discussion threads. Instead, make a new, fresh, patch posting, updating the Subject line as needed. Sure. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
On 08/29/2015 08:56 AM, Eric Dumazet wrote: On Sat, 2015-08-29 at 08:27 +0530, Raghavendra K T wrote: /* Use put_unaligned() because stats may not be aligned for u64. */ put_unaligned(items, &stats[0]); for (i = 1; i < items; i++) - put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]); + put_unaligned(buff[i], &stats[i]); I believe Joe suggested the following code instead: buff[0] = items; memcpy(stats, buff, items * sizeof(u64)); Thanks. Sure, will use this. (I missed that. I thought it was applicable only when we have aligned data, and for Power, put_unaligned was not a no-op, unlike Intel.) Also please move the buff[] array into __snmp6_fill_stats64() to make it clear it is used in a 'leaf' function. Correct. (Even if calling memcpy()/memset() makes it not a leaf function.) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
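The alignment point behind this exchange: the netlink destination buffer has no 8-byte alignment guarantee, so either every store goes through put_unaligned() or the whole aligned scratch buffer is copied with memcpy(), which is defined byte-wise and therefore safe for any destination address. A hedged userspace illustration (copy_counters() is a made-up name):

#include <stdint.h>
#include <string.h>

/*
 * 'dst' may point anywhere inside a packed netlink attribute, e.g. at an
 * address ending in 0x4.  Copying the whole buffer with memcpy() is
 * correct for any alignment of dst.
 */
static void copy_counters(void *dst, const uint64_t *buff, size_t n)
{
        memcpy(dst, buff, n * sizeof(*buff));

        /*
         * By contrast, storing through a u64 pointer would be undefined
         * behaviour on a misaligned dst and can trap on strict-alignment
         * machines:
         *
         *     uint64_t *p = dst;
         *     for (size_t i = 0; i < n; i++)
         *             p[i] = buff[i];
         */
}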
Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
* David Miller [2015-08-28 11:24:13]: > From: Raghavendra K T > Date: Fri, 28 Aug 2015 12:09:52 +0530 > > > On 08/28/2015 12:08 AM, David Miller wrote: > >> From: Raghavendra K T > >> Date: Wed, 26 Aug 2015 23:07:33 +0530 > >> > >>> @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 > >>> *stats, void __percpu *mib, > >>> static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int > >>> attrtype, > >>>int bytes) > >>> { > >>> + u64 buff[IPSTATS_MIB_MAX] = {0,}; > >>> + > ... > > hope you wanted to know the overhead than to change the current > > patch. please let me know.. > > I want you to change that variable initializer to an explicit memset(). > > The compiler is emitting a memset() or similar _anyways_. > > Not because it will have any impact at all upon performance, but because > of how it looks to people trying to read and understand the code. > > Hi David, resending the patch with memset. Please let me know if you want to resend all the patches. 8< From: Raghavendra K T Subject: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field. reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data of an item (iteratively for around 90 items). idea: This patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Docker creation got faster by more than 2x after the patch. Result: Before After Docker creation time 6.836s 3.357s cache miss 2.7% 1.38% perf before: 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock perf after: 10.56% swapper [kernel.kallsyms] [k] snooze_loop 8.72% docker [kernel.kallsyms] [k] snmp_get_cpu_field 7.59% docker [kernel.kallsyms] [k] veth_stats_one 3.65% swapper [kernel.kallsyms] [k] _raw_spin_lock Signed-off-by: Raghavendra K T --- net/ipv6/addrconf.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) Change in V2: - Allocate stat calculation buffer in stack (Eric) - Use memset to zero temp buffer (David) Thanks David and Eric for coments on V1 and as both of them pointed, unfortunately we cannot get rid of buffer for calculation without avoiding unaligned op. diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 21c2c81..9bdfba3 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4624,16 +4624,22 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib, } static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, - int items, int bytes, size_t syncpoff) + int items, int bytes, size_t syncpoff, + u64 *buff) { - int i; + int i, c; int pad = bytes - sizeof(u64) * items; BUG_ON(pad < 0); /* Use put_unaligned() because stats may not be aligned for u64. 
*/ put_unaligned(items, [0]); + + for_each_possible_cpu(c) + for (i = 1; i < items; i++) + buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); + for (i = 1; i < items; i++) - put_unaligned(snmp_fold_field64(mib, i, syncpoff), [i]); + put_unaligned(buff[i], [i]); memset([items], 0, pad); } @@ -4641,10 +4647,13 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, int bytes) { + u64 buff[IPSTATS_MIB_MAX]; + switch (attrtype) { case IFLA_INET6_STATS: - __snmp6_fill_stats64(stats, idev->stats.ipv6, -IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp)); + memset(buff, 0, sizeof(buff)); + __snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, bytes, +offsetof(struct ipstats_mib, syncp), buff); break; case IFLA_INET6_ICMP6STATS: __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes); -- 1.7.11.7 -- To unsub
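The initializer request above is purely about making the cost visible: a "= {0,}" initializer and an explicit memset() compile to the same clearing of the array, but the memset() shows a reader that every call pays for a full clear of the scratch buffer. A small sketch with FIELD_MAX as an illustrative stand-in for IPSTATS_MIB_MAX:

#include <stdint.h>
#include <string.h>

#define FIELD_MAX 36    /* stand-in for IPSTATS_MIB_MAX */

void with_initializer(void)
{
        uint64_t buff[FIELD_MAX] = { 0 };   /* compiler emits a memset-like clear */
        (void)buff;
}

void with_memset(void)
{
        uint64_t buff[FIELD_MAX];

        memset(buff, 0, sizeof(buff));      /* same cost, but the clear is explicit */
        (void)buff;
}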
Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
On 08/28/2015 12:08 AM, David Miller wrote: From: Raghavendra K T Date: Wed, 26 Aug 2015 23:07:33 +0530 @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, int bytes) { + u64 buff[IPSTATS_MIB_MAX] = {0,}; + switch (attrtype) { case IFLA_INET6_STATS: - __snmp6_fill_stats64(stats, idev->stats.ipv6, I would suggest using an explicit memset() here, it makes the overhead incurred by this scheme clearer. I changed the code to look like below to measure fill_stat overhead: container creation now took: 3.012s it was: without patch : 6.86sec with current patch: 3.34sec and perf did not show the snmp6_fill_stats() parent traces. changed code: snmp6_fill_stats(...) { switch (attrtype) { case IFLA_INET6_STATS: put_unaligned(IPSTATS_MIB_MAX, [0]); memset([1], 0, IPSTATS_MIB_MAX-1); //__snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, bytes, // offsetof(struct ipstats_mib, syncp), buff); . } So in summary: The current patch amounts to reduction in major overhead in fill_stat, though there is still percpu walk overhead (0.33sec difference). [ percpu walk overead grows when create for e.g. 3k containers]. cache miss: there was no major difference (around 1.4%) w.r.t patch Hi David, hope you wanted to know the overhead than to change the current patch. please let me know.. Eric, does V2 patch look good now.. please add your ack/review Details: time = time docker run -itd ubuntu:15.04 /bin/bash b6670c321b5957f004e281cbb14512deafd0c0be6a39707c2f3dc95649bbc394 real0m3.012s user0m0.093s sys 0m0.009s perf: == # Samples: 18K of event 'cycles' # Event count (approx.): 12838752009 # Overhead Command Shared Object Symbol # ... . # 15.29% swapper [kernel.kallsyms] [k] snooze_loop 9.37% docker docker [.] scanblock 6.47% docker [kernel.kallsyms] [k] veth_stats_one 3.87% swapper [kernel.kallsyms] [k] _raw_spin_lock 2.71% docker docker [.] -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V2 1/2] net: Introduce helper functions to get the per cpu data
Signed-off-by: Raghavendra K T --- include/net/ip.h | 10 ++ net/ipv4/af_inet.c | 41 +++-- 2 files changed, 37 insertions(+), 14 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index d5fe9f2..93bf12e 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -202,10 +202,20 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, #define NET_ADD_STATS_BH(net, field, adnd) SNMP_ADD_STATS_BH((net)->mib.net_statistics, field, adnd) #define NET_ADD_STATS_USER(net, field, adnd) SNMP_ADD_STATS_USER((net)->mib.net_statistics, field, adnd) +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offct); unsigned long snmp_fold_field(void __percpu *mib, int offt); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, +size_t syncp_offset); u64 snmp_fold_field64(void __percpu *mib, int offt, size_t sync_off); #else +static inline u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, + size_t syncp_offset) +{ + return snmp_get_cpu_field(mib, cpu, offct); + +} + static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_off) { return snmp_fold_field(mib, offt); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 9532ee8..302e36b 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1448,38 +1448,51 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family, } EXPORT_SYMBOL_GPL(inet_ctl_sock_create); +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt) +{ + return *(((unsigned long *)per_cpu_ptr(mib, cpu)) + offt); +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field); + unsigned long snmp_fold_field(void __percpu *mib, int offt) { unsigned long res = 0; int i; for_each_possible_cpu(i) - res += *(((unsigned long *) per_cpu_ptr(mib, i)) + offt); + res += snmp_get_cpu_field(mib, i, offt); return res; } EXPORT_SYMBOL_GPL(snmp_fold_field); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, +size_t syncp_offset) +{ + void *bhptr; + struct u64_stats_sync *syncp; + u64 v; + unsigned int start; + + bhptr = per_cpu_ptr(mib, cpu); + syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); + do { + start = u64_stats_fetch_begin_irq(syncp); + v = *(((u64 *)bhptr) + offt); + } while (u64_stats_fetch_retry_irq(syncp, start)); + + return v; +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field64); + u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset) { u64 res = 0; int cpu; for_each_possible_cpu(cpu) { - void *bhptr; - struct u64_stats_sync *syncp; - u64 v; - unsigned int start; - - bhptr = per_cpu_ptr(mib, cpu); - syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); - do { - start = u64_stats_fetch_begin_irq(syncp); - v = *(((u64 *) bhptr) + offt); - } while (u64_stats_fetch_retry_irq(syncp, start)); - - res += v; + res += snmp_get_cpu_field(mib, cpu, offct, syncp_offset); } return res; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field. reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data of an item (iteratively for around 90 items). idea: This patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Docker creation got faster by more than 2x after the patch. Result: Before After Docker creation time 6.836s 3.357s cache miss 2.7% 1.38% perf before: 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock perf after: 10.56% swapper [kernel.kallsyms] [k] snooze_loop 8.72% docker [kernel.kallsyms] [k] snmp_get_cpu_field 7.59% docker [kernel.kallsyms] [k] veth_stats_one 3.65% swapper [kernel.kallsyms] [k] _raw_spin_lock Signed-off-by: Raghavendra K T --- net/ipv6/addrconf.c | 18 +- 1 file changed, 13 insertions(+), 5 deletions(-) Change in V2: - Allocate stat calculation buffer in stack (Eric) Thanks David and Eric for coments on V1 and as both of them pointed, unfortunately we cannot get rid of buffer for calculation without avoiding unaligned op. diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 21c2c81..0f6c7a5 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4624,16 +4624,22 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib, } static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, - int items, int bytes, size_t syncpoff) + int items, int bytes, size_t syncpoff, + u64 *buff) { - int i; + int i, c; int pad = bytes - sizeof(u64) * items; BUG_ON(pad < 0); /* Use put_unaligned() because stats may not be aligned for u64. */ put_unaligned(items, [0]); + + for_each_possible_cpu(c) + for (i = 1; i < items; i++) + buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); + for (i = 1; i < items; i++) - put_unaligned(snmp_fold_field64(mib, i, syncpoff), [i]); + put_unaligned(buff[i], [i]); memset([items], 0, pad); } @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, int bytes) { + u64 buff[IPSTATS_MIB_MAX] = {0,}; + switch (attrtype) { case IFLA_INET6_STATS: - __snmp6_fill_stats64(stats, idev->stats.ipv6, -IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp)); + __snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, bytes, +offsetof(struct ipstats_mib, syncp), buff); break; case IFLA_INET6_ICMP6STATS: __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V2 0/2] Optimize the snmp stat aggregation for large cpus
While creating 1000 containers, perf is showing lot of time spent in snmp_fold_field on a large cpu system. The current patch tries to improve by reordering the statistics gathering. Please note that similar overhead was also reported while creating veth pairs https://lkml.org/lkml/2013/3/19/556 Changes in V2: - Allocate the stat calculation buffer in stack. (Eric) Setup: 160 cpu (20 core) baremetal powerpc system with 1TB memory 1000 docker containers was created with command docker run -itd ubuntu:15.04 /bin/bash in loop observation: Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) perf data showed, creating veth interfaces resulting in the below code path was taking more time. rtnl_fill_ifinfo -> inet6_fill_link_af -> inet6_fill_ifla6_attrs -> snmp_fold_field proposed idea: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data to of an item (iteratively for around 90 items). The patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Performance of docker creation improved by around more than 2x after the patch. before the patch: time docker run -itd ubuntu:15.04 /bin/bash 3f45ba571a42e925c4ec4aaee0e48d7610a9ed82a4c931f83324d41822cf6617 real0m6.836s user0m0.095s sys 0m0.011s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 32K of event 'cycles' # Event count (approx.): 24688700190 # Overhead Command Shared Object Symbol # ... .. 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock 1.37% docker docker [.] backtrace_qsort 1.31% docker docker [.] strings.FieldsFunc cache-misses: 2.7% after the patch: = time docker run -itd ubuntu:15.04 /bin/bash 4e0619421332990bdea413fe455ab187607ed63d33d5c37aa5291bc2f5b35857 real0m3.357s user0m0.092s sys 0m0.010s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 15K of event 'cycles' # Event count (approx.): 11471830714 # Overhead Command Shared Object Symbol # ... . 10.56% swapper [kernel.kallsyms] [k] snooze_loop 8.72% docker [kernel.kallsyms] [k] snmp_get_cpu_field 7.59% docker [kernel.kallsyms] [k] veth_stats_one 3.65% swapper [kernel.kallsyms] [k] _raw_spin_lock 3.06% docker docker[.] strings.FieldsFunc 2.96% docker docker[.] backtrace_qsort cache-misses: 1.38 % Please let me know if you have suggestions/comments. Thanks Eric and David for comments on V1. Raghavendra K T (2): net: Introduce helper functions to get the per cpu data net: Optimize snmp stat aggregation by walking all the percpu data at once include/net/ip.h| 10 ++ net/ipv4/af_inet.c | 41 +++-- net/ipv6/addrconf.c | 18 +- 3 files changed, 50 insertions(+), 19 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line "uns
Re: [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
On 08/26/2015 07:39 PM, Eric Dumazet wrote:
On Wed, 2015-08-26 at 15:55 +0530, Raghavendra K T wrote:
On 08/26/2015 04:37 AM, David Miller wrote:
From: Raghavendra K T
Date: Tue, 25 Aug 2015 13:24:24 +0530

Please let me know if you have suggestions/comments.

Like Eric Dumazet said, the idea is good but needs some adjustments. You might want to see whether a per-cpu work buffer works for this.

Sure. Let me know if I understood correctly: we allocate the temp buffer, we have an "add_this_cpu_data" function, and we do

for_each_online_cpu(cpu)
	smp_call_function_single(cpu, add_this_cpu_data, buffer, 1);

If not, could you please point to an example you had in mind?

Sorry, I do not think it is a good idea. Sending an IPI is way more expensive and intrusive than reading 4 or 5 cache lines from memory (per cpu). Definitely not something we want.

Okay. Another problem I thought of here was that we could only loop over online cpus.

It's extremely unfortunate that we can't depend upon the destination buffer being properly aligned, because we wouldn't need a temporary scratch area if it were aligned properly.

True, but I think for 64 bit cpus, when (pad == 0) we can go ahead and use the stats array directly and get rid of put_unaligned(). Is that correct?

Nope. We have no alignment guarantee. It could be a 0x04 pointer value (i.e. not a multiple of 8).

(My internal initial patch had this version, but I thought it was ugly to have an #ifdef BITS_PER_LONG==64.)

This has nothing to do with the arch having 64-bit longs. It is about the alignment of a u64.

Okay. I'll send a V2 declaring the tmp buffer on the stack.
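For reference, the IPI-based variant discussed (and rejected) above would look roughly like the sketch below. This is hypothetical; nothing here was posted for merging, and everything except smp_call_function_single(), for_each_online_cpu(), smp_processor_id() and the snmp_get_cpu_field() helper from this series is made up for illustration.

#include <linux/types.h>
#include <linux/smp.h>
#include <linux/percpu.h>

struct fold_req {			/* hypothetical request structure */
	void __percpu *mib;
	u64 *buff;
	int items;
};

/* Runs on the target CPU via IPI and adds that CPU's counters into buff.
 * Because the caller below waits for each IPI, only one CPU writes buff
 * at a time.
 */
static void add_this_cpu_data(void *arg)
{
	struct fold_req *req = arg;
	int i;

	for (i = 0; i < req->items; i++)
		req->buff[i] += snmp_get_cpu_field(req->mib, smp_processor_id(), i);
}

static void fold_by_ipi(struct fold_req *req)
{
	int cpu;

	/* Interrupts every online CPU and waits for each in turn: far more
	 * expensive and intrusive than the caller simply reading a few cache
	 * lines of per-cpu data, and it also misses counts accumulated on
	 * CPUs that have since gone offline - both points raised above.
	 */
	for_each_online_cpu(cpu)
		smp_call_function_single(cpu, add_this_cpu_data, req, 1);
}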
Re: [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
On 08/25/2015 09:30 PM, Eric Dumazet wrote:
On Tue, 2015-08-25 at 21:17 +0530, Raghavendra K T wrote:
On 08/25/2015 07:58 PM, Eric Dumazet wrote:

This is a great idea, but kcalloc()/kmalloc() can fail and you'll crash the whole kernel at this point.

Good catch, and my bad. Though the system would already be in a bad memory condition, and the stats fill is not critical for the system, do you think silently returning from here is a good idea? Or should we propagate -ENOMEM all the way up?

Hmm... presumably these 288 bytes could be allocated in the inet6_fill_ifla6_attrs() stack frame.

Correct. Since we only need room for IPSTATS_MIB_MAX entries, we could even do this in the snmp6_fill_stats() stack frame.
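The trade-off being weighed here is mainly whether a failure path exists at all. The fragment below is only a sketch of the two options; neither piece is taken verbatim from a posted patch, and the gfp flag in the allocating variant is an arbitrary placeholder.

/* Option A (roughly what V1 implied): a runtime allocation that can fail
 * under memory pressure, forcing an error path in a purely informational
 * stats fill.
 */
u64 *buff = kcalloc(IPSTATS_MIB_MAX, sizeof(u64), GFP_ATOMIC);
if (!buff)
	return;			/* or propagate -ENOMEM up the call chain */
/* ... aggregate into buff, fill stats, then kfree(buff) ... */

/* Option B (what V2 adopts): IPSTATS_MIB_MAX * sizeof(u64) is about
 * 288 bytes, small enough for the snmp6_fill_stats() stack frame, so
 * there is no failure path at all.
 */
u64 buff[IPSTATS_MIB_MAX] = {0,};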
Re: [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
On 08/26/2015 04:37 AM, David Miller wrote:
From: Raghavendra K T
Date: Tue, 25 Aug 2015 13:24:24 +0530

Please let me know if you have suggestions/comments.

Like Eric Dumazet said, the idea is good but needs some adjustments. You might want to see whether a per-cpu work buffer works for this.

Sure. Let me know if I understood correctly: we allocate the temp buffer, we have an "add_this_cpu_data" function, and we do

for_each_online_cpu(cpu)
	smp_call_function_single(cpu, add_this_cpu_data, buffer, 1);

If not, could you please point to an example you had in mind?

It's extremely unfortunate that we can't depend upon the destination buffer being properly aligned, because we wouldn't need a temporary scratch area if it were aligned properly.

True, but I think for 64 bit cpus, when (pad == 0) we can go ahead and use the stats array directly and get rid of put_unaligned(). Is that correct? (My internal initial patch had this version, but I thought it was ugly to have an #ifdef BITS_PER_LONG==64.)
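The alignment constraint being discussed can be made concrete with a small sketch; the address value is only an example and the fragment is illustrative, not code from the series.

/* Why put_unaligned() stays even on 64-bit kernels: 'stats' points into a
 * netlink attribute payload, which only guarantees 4-byte alignment, so an
 * address ending in ...4 is perfectly legal (Eric's 0x04 example).  A plain
 * 'stats[i] = v' would then be a misaligned 8-byte store, which some
 * architectures trap on; note that pad == 0 says nothing about alignment.
 */
u64 v = buff[i];
put_unaligned(v, &stats[i]);	/* byte-safe store at any alignment */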
[PATCH RFC V2 1/2] net: Introduce helper functions to get the per cpu data
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 include/net/ip.h   | 10 ++
 net/ipv4/af_inet.c | 41 +++--
 2 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index d5fe9f2..93bf12e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -202,10 +202,20 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 #define NET_ADD_STATS_BH(net, field, adnd)	SNMP_ADD_STATS_BH((net)->mib.net_statistics, field, adnd)
 #define NET_ADD_STATS_USER(net, field, adnd)	SNMP_ADD_STATS_USER((net)->mib.net_statistics, field, adnd)

+u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offct);
 unsigned long snmp_fold_field(void __percpu *mib, int offt);
 #if BITS_PER_LONG==32
+u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct,
+			 size_t syncp_offset);
 u64 snmp_fold_field64(void __percpu *mib, int offt, size_t sync_off);
 #else
+static inline u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct,
+				       size_t syncp_offset)
+{
+	return snmp_get_cpu_field(mib, cpu, offct);
+}
+
 static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_off)
 {
 	return snmp_fold_field(mib, offt);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9532ee8..302e36b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1448,38 +1448,51 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 }
 EXPORT_SYMBOL_GPL(inet_ctl_sock_create);

+u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt)
+{
+	return *(((unsigned long *)per_cpu_ptr(mib, cpu)) + offt);
+}
+EXPORT_SYMBOL_GPL(snmp_get_cpu_field);
+
 unsigned long snmp_fold_field(void __percpu *mib, int offt)
 {
 	unsigned long res = 0;
 	int i;

 	for_each_possible_cpu(i)
-		res += *(((unsigned long *) per_cpu_ptr(mib, i)) + offt);
+		res += snmp_get_cpu_field(mib, i, offt);
 	return res;
 }
 EXPORT_SYMBOL_GPL(snmp_fold_field);

 #if BITS_PER_LONG==32
+u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offt,
+			 size_t syncp_offset)
+{
+	void *bhptr;
+	struct u64_stats_sync *syncp;
+	u64 v;
+	unsigned int start;
+
+	bhptr = per_cpu_ptr(mib, cpu);
+	syncp = (struct u64_stats_sync *)(bhptr + syncp_offset);
+	do {
+		start = u64_stats_fetch_begin_irq(syncp);
+		v = *(((u64 *)bhptr) + offt);
+	} while (u64_stats_fetch_retry_irq(syncp, start));
+
+	return v;
+}
+EXPORT_SYMBOL_GPL(snmp_get_cpu_field64);
+
 u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset)
 {
 	u64 res = 0;
 	int cpu;

 	for_each_possible_cpu(cpu) {
-		void *bhptr;
-		struct u64_stats_sync *syncp;
-		u64 v;
-		unsigned int start;
-
-		bhptr = per_cpu_ptr(mib, cpu);
-		syncp = (struct u64_stats_sync *)(bhptr + syncp_offset);
-		do {
-			start = u64_stats_fetch_begin_irq(syncp);
-			v = *(((u64 *) bhptr) + offt);
-		} while (u64_stats_fetch_retry_irq(syncp, start));
-
-		res += v;
+		res += snmp_get_cpu_field64(mib, cpu, offt, syncp_offset);
 	}
 	return res;
 }
--
1.7.11.7
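A usage sketch for the new helpers, in the cpu-major style that patch 2/2 adopts; it is illustrative only, and mib, items, syncpoff and buff stand for whatever the calling site provides.

/* Fold every item of a 64-bit mib into buff, one CPU at a time.
 * snmp_get_cpu_field64() hides the u64_stats_sync sequence handling on
 * 32-bit kernels and collapses to a plain per-cpu load on 64-bit kernels.
 * (The ipv6 caller in patch 2/2 starts at i = 1 because slot 0 carries
 * the item count.)
 */
int c, i;

for_each_possible_cpu(c)
	for (i = 0; i < items; i++)
		buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);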