Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs

2016-07-12 Thread Waiman Long

On 07/12/2016 02:57 PM, Tejun Heo wrote:

Hello,

On Tue, Jul 12, 2016 at 02:51:31PM -0400, Waiman Long wrote:

The last 2 RFC patches were created in response to Andi's comment to have
coarser granularity than per-cpu. In this particular use case, I don't think
global list traversals are frequent enough to have any noticeable
performance impact, so I don't have any benchmark numbers to support this
change. However, that may not be true for other future use cases.

These 2 patches were created to gauge whether using a per-subnode API for
this use case is a good idea or not. I am perfectly happy to keep it as
per-cpu and scrap the last 2 RFC patches. My main goal is to make this
patchset acceptable enough to move forward instead of staying in limbo.

I see.  I don't think it makes sense to add a whole new API, without
any backing data, for a use case which doesn't really need it.  It
would probably be best to revisit this when we're dealing with an
actually problematic case.

Thanks.



I am fine with that. BTW, do you think patches 1-5 are good enough to be
merged in a future release, or are there further improvements that need
to be made?


Thanks,
Longman


Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs

2016-07-12 Thread Tejun Heo
Hello,

On Tue, Jul 12, 2016 at 02:51:31PM -0400, Waiman Long wrote:
> The last 2 RFC patches were created in response to Andi's comment to have
> coarser granularity than per-cpu. In this particular use case, I don't think
> global list traversals are frequent enough to have any noticeable
> performance impact, so I don't have any benchmark numbers to support this
> change. However, that may not be true for other future use cases.
> 
> These 2 patches were created to gauge whether using a per-subnode API for
> this use case is a good idea or not. I am perfectly happy to keep it as
> per-cpu and scrap the last 2 RFC patches. My main goal is to make this
> patchset acceptable enough to move forward instead of staying in limbo.

I see.  I don't think it makes sense to add a whole new API, without
any backing data, for a use case which doesn't really need it.  It
would probably be best to revisit this when we're dealing with an
actually problematic case.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs

2016-07-12 Thread Waiman Long

On 07/12/2016 10:27 AM, Tejun Heo wrote:

Hello,

On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:

The percpu APIs are extensively used in the Linux kernel to reduce
cacheline contention and improve performance. For some use cases, the
percpu APIs may be too fine-grained for distributed resources, whereas
a per-node based allocation may be too coarse, as we can have dozens
of CPUs in a NUMA node in some high-end systems.

This patch introduces simple per-subnode APIs where each of the
distributed resources is shared by only a handful of CPUs within
a NUMA node. The per-subnode APIs are built on top of the percpu APIs
and hence require the same amount of memory as if the percpu APIs
were used. However, they help to reduce the total number of separate
resources that need to be managed. As a result, they can speed up code
that needs to iterate over all the resources compared with using the
percpu APIs. Cacheline contention, however, will increase slightly as
each resource is shared by more than one CPU. As long as the number of
CPUs in each subnode is small, the performance impact won't be
significant.

In this patch, at most 2 sibling groups can be put into a subnode. For
an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
and 2 when it is not.

I understand that there's a trade-off between local access and global
traversing and you're trying to find a sweet spot between the two, but
this seems pretty arbitrary.  What's the use case?  What are the
numbers?  Why are global traversals frequent enough to matter so much?


The last 2 RFC patches were created in response to Andi's comment to have
coarser granularity than per-cpu. In this particular use case, I don't think
global list traversals are frequent enough to have any noticeable
performance impact, so I don't have any benchmark numbers to support this
change. However, that may not be true for other future use cases.


These 2 patches were created to gauge whether using a per-subnode API for
this use case is a good idea or not. I am perfectly happy to keep it as
per-cpu and scrap the last 2 RFC patches. My main goal is to make this
patchset acceptable enough to move forward instead of staying in limbo.


Cheers,
Longman


Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs

2016-07-12 Thread Waiman Long

On 07/11/2016 11:14 PM, Boqun Feng wrote:

On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:

+/*
+ * Initialize the subnodes
+ *
+ * All the sibling CPUs will be in the same subnode. On top of that, we will
+ * put at most 2 sibling groups into the same subnode. The percpu
+ * topology_sibling_cpumask() and topology_core_cpumask() are used for
+ * grouping CPUs into subnodes. The subnode ID is the CPU number of the
+ * first CPU in the subnode.
+ */
+static int __init subnode_init(void)
+{
+   int cpu;
+   int nr_subnodes = 0;
+   const int subnode_nr_cpus = 2;
+
+   /*
+* Some of the bits in the subnode_mask will be cleared as we proceed.
+*/
+   for_each_cpu(cpu, subnode_mask) {
+   int ccpu, scpu;
+   int cpucnt = 0;
+
+   cpumask_var_t core_mask = topology_core_cpumask(cpu);
+   cpumask_var_t sibling_mask;
+
+   /*
+* Put subnode_nr_cpus of CPUs and their siblings into each
+* subnode.
+*/
+   for_each_cpu_from(cpu, ccpu, core_mask) {
+   sibling_mask = topology_sibling_cpumask(ccpu);
+   for_each_cpu_from(ccpu, scpu, sibling_mask) {
+   /*
+* Clear the bits of the higher CPUs.
+*/
+   if (scpu > cpu)
+   cpumask_clear_cpu(scpu, subnode_mask);

Do we also need to clear the 'core_mask' here? Consider a core consisting
of two sibling groups, with each sibling group consisting of two cpus. At
the beginning of the outer loop (for_each_cpu_from(cpu, ccpu, core_mask)):

'core_mask' is 0b1111

so at the first iteration of the inner loop:

'ccpu' is 0, therefore 'sibling_mask' is 0b1100; in this loop we set the
'cpu_subnode_id' of cpu 0 and 1 to 0.

At the second iteration of the inner loop:

'ccpu' is 1, because we don't clear cpu 1 from 'core_mask'. Therefore
'sibling_mask' is still 0b1100, so in this loop we set the
'cpu_subnode_id' of cpu 0 and 1 again.

Am I missing something here?



You are right. The current code works in my test because the 2 sibling CPUs 
occupy a lower and a higher number, like (0, 72), on a 72-core system. 
It may not work for other sibling CPU assignments.


The core_mask, however, is a global data variable and we cannot modify 
it. I will make the following change instead:


diff --git a/lib/persubnode.c b/lib/persubnode.c
index 9febe7c..d1c8c29 100644
--- a/lib/persubnode.c
+++ b/lib/persubnode.c
@@ -94,6 +94,8 @@ static int __init subnode_init(void)
 * subnode.
 */
for_each_cpu_from(cpu, ccpu, core_mask) {
+   if (!cpumask_test_cpu(ccpu, subnode_mask))
+   continue;   /* Skip allocated CPU */
sibling_mask = topology_sibling_cpumask(ccpu);
for_each_cpu_from(ccpu, scpu, sibling_mask) {
/*

Thanks for catching this bug.
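
For illustration, here is a standalone userspace sketch (not part of the
patch) of the grouping walk with that skip applied. The 8-CPU topology
(two cores, each split into two 2-CPU sibling groups) is hypothetical,
chosen to hit exactly the case you described:

#include <stdio.h>

#define NR_CPUS 8

/* Hypothetical topology: CPUs {0,1,2,3} share a core, as do {4,5,6,7};
 * the sibling groups are {0,1}, {2,3}, {4,5} and {6,7}. */
static unsigned core_mask(int cpu)    { return cpu < 4 ? 0x0fu : 0xf0u; }
static unsigned sibling_mask(int cpu) { return 0x3u << (cpu & ~1); }

int main(void)
{
	unsigned subnode_mask = 0xffu;	/* all CPUs start unassigned */
	int cpu_subnode_id[NR_CPUS];
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		int ccpu, scpu, groups = 0;

		if (!(subnode_mask & (1u << cpu)))
			continue;	/* already placed in a subnode */

		/* cpu becomes the subnode ID; take up to 2 sibling groups */
		for (ccpu = cpu; ccpu < NR_CPUS && groups < 2; ccpu++) {
			if (!(core_mask(cpu) & (1u << ccpu)))
				continue;
			if (!(subnode_mask & (1u << ccpu)))
				continue;	/* the fix: skip assigned CPUs */
			for (scpu = ccpu; scpu < NR_CPUS; scpu++) {
				if (!(sibling_mask(ccpu) & (1u << scpu)))
					continue;
				cpu_subnode_id[scpu] = cpu;
				if (scpu > cpu)
					subnode_mask &= ~(1u << scpu);
			}
			groups++;
		}
	}

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu %d -> subnode %d\n", cpu, cpu_subnode_id[cpu]);
	return 0;
}

This prints subnode 0 for CPUs 0-3 and subnode 4 for CPUs 4-7. Without
the extra subnode_mask test, ccpu = 1 would re-take sibling group {0,1}
as the second group and CPUs 2-3 would end up in their own subnode,
which is the double assignment you described.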

Cheers,
Longman


Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs

2016-07-12 Thread Tejun Heo
Hello,

On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
> The percpu APIs are extensively used in the Linux kernel to reduce
> cacheline contention and improve performance. For some use cases, the
> percpu APIs may be too fine-grained for distributed resources, whereas
> a per-node based allocation may be too coarse, as we can have dozens
> of CPUs in a NUMA node in some high-end systems.
> 
> This patch introduces simple per-subnode APIs where each of the
> distributed resources is shared by only a handful of CPUs within
> a NUMA node. The per-subnode APIs are built on top of the percpu APIs
> and hence require the same amount of memory as if the percpu APIs
> were used. However, they help to reduce the total number of separate
> resources that need to be managed. As a result, they can speed up code
> that needs to iterate over all the resources compared with using the
> percpu APIs. Cacheline contention, however, will increase slightly as
> each resource is shared by more than one CPU. As long as the number of
> CPUs in each subnode is small, the performance impact won't be
> significant.
> 
> In this patch, at most 2 sibling groups can be put into a subnode. For
> an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
> and 2 when it is not.

I understand that there's a trade-off between local access and global
traversing and you're trying to find a sweet spot between the two, but
this seems pretty arbitrary.  What's the use case?  What are the
numbers?  Why are global traversals frequent enough to matter so much?

Thanks.

-- 
tejun


Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs

2016-07-11 Thread Boqun Feng
On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
> The percpu APIs are extensively used in the Linux kernel to reduce
> cacheline contention and improve performance. For some use cases, the
> percpu APIs may be too fine-grained for distributed resources, whereas
> a per-node based allocation may be too coarse, as we can have dozens
> of CPUs in a NUMA node in some high-end systems.
> 
> This patch introduces simple per-subnode APIs where each of the
> distributed resources is shared by only a handful of CPUs within
> a NUMA node. The per-subnode APIs are built on top of the percpu APIs
> and hence require the same amount of memory as if the percpu APIs
> were used. However, they help to reduce the total number of separate
> resources that need to be managed. As a result, they can speed up code
> that needs to iterate over all the resources compared with using the
> percpu APIs. Cacheline contention, however, will increase slightly as
> each resource is shared by more than one CPU. As long as the number of
> CPUs in each subnode is small, the performance impact won't be
> significant.
> 
> In this patch, at most 2 sibling groups can be put into a subnode. For
> an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
> and 2 when it is not.
> 
> Signed-off-by: Waiman Long 
> ---
>  include/linux/persubnode.h |   80 +
>  init/main.c|2 +
>  lib/Makefile   |2 +
>  lib/persubnode.c   |  119 
> 
>  4 files changed, 203 insertions(+), 0 deletions(-)
>  create mode 100644 include/linux/persubnode.h
>  create mode 100644 lib/persubnode.c
> 
> diff --git a/include/linux/persubnode.h b/include/linux/persubnode.h
> new file mode 100644
> index 000..b777daa
> --- /dev/null
> +++ b/include/linux/persubnode.h
> @@ -0,0 +1,80 @@
> +/*
> + * Per-subnode definitions
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
> + *
> + * Authors: Waiman Long 
> + */
> +#ifndef __LINUX_PERSUBNODE_H
> +#define __LINUX_PERSUBNODE_H
> +
> +#include 
> +#include 
> +
> +/*
> + * Per-subnode APIs
> + */
> +#define __persubnode __percpu
> +#define nr_subnode_ids   nr_cpu_ids
> +#define alloc_persubnode(type)   alloc_percpu(type)
> +#define free_persubnode(var) free_percpu(var)
> +#define for_each_subnode(snode)  for_each_cpu(snode, subnode_mask)
> +#define per_subnode_ptr(ptr, subnode)per_cpu_ptr(ptr, subnode)
> +#define per_subnode(var, subnode)per_cpu(var, subnode)
> +
> +#ifdef CONFIG_SMP
> +
> +extern struct cpumask __subnode_mask __read_mostly;
> +DECLARE_PER_CPU_READ_MOSTLY(int, cpu_subnode_id);
> +
> +#define subnode_mask (&__subnode_mask)
> +
> +static inline int this_cpu_to_subnode(void)
> +{
> + return *this_cpu_ptr(&cpu_subnode_id);
> +}
> +
> +/*
> + * For safety, preemption should be disabled before using this_subnode_ptr().
> + */
> +#define this_subnode_ptr(ptr)\
> +({   \
> + int _snid = this_cpu_to_subnode();  \
> + per_cpu_ptr(ptr, _snid);\
> +})
> +
> +#define get_subnode_ptr(ptr) \
> +({   \
> + preempt_disable();  \
> + this_subnode_ptr(ptr);  \
> +})
> +
> +#define put_subnode_ptr(ptr) \
> +do { \
> + (void)(ptr);\
> + preempt_enable();   \
> +} while (0)
> +
> +extern void __init subnode_early_init(void);
> +
> +#else /* CONFIG_SMP */
> +
> +#define subnode_mask cpu_possible_mask
> +#define this_subnode_ptr(ptr)this_cpu_ptr(ptr)
> +#define get_subnode_ptr(ptr) get_cpu_ptr(ptr)
> +#define put_subnode_ptr(ptr) put_cpu_ptr(ptr)
> +
> +static inline void subnode_early_init(void) { }
> +
> +#endif /* CONFIG_SMP */
> +#endif /* __LINUX_PERSUBNODE_H */
> diff --git a/init/main.c b/init/main.c
> index 4c17fda..28e4425 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -81,6 +81,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/persubnode.h>
>  
>  #include 
>  #include 
> @@ -524,6 +525,7 @@ asmlinkage __visible void __init start_kernel(void)
>  NULL, set_init_arg);
>  
>   jump_label_init();
> + subnode_early_init();


[RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs

2016-07-11 Thread Waiman Long
The percpu APIs are extensively used in the Linux kernel to reduce
cacheline contention and improve performance. For some use cases, the
percpu APIs may be too fine-grained for distributed resources, whereas
a per-node based allocation may be too coarse, as we can have dozens
of CPUs in a NUMA node in some high-end systems.

This patch introduces simple per-subnode APIs where each of the
distributed resources is shared by only a handful of CPUs within
a NUMA node. The per-subnode APIs are built on top of the percpu APIs
and hence require the same amount of memory as if the percpu APIs
were used. However, they help to reduce the total number of separate
resources that need to be managed. As a result, they can speed up code
that needs to iterate over all the resources compared with using the
percpu APIs. Cacheline contention, however, will increase slightly as
each resource is shared by more than one CPU. As long as the number of
CPUs in each subnode is small, the performance impact won't be
significant.

In this patch, at most 2 sibling groups can be put into a subnode. For
an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
and 2 when it is not.

Signed-off-by: Waiman Long 
---
 include/linux/persubnode.h |   80 +
 init/main.c|2 +
 lib/Makefile   |2 +
 lib/persubnode.c   |  119 
 4 files changed, 203 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/persubnode.h
 create mode 100644 lib/persubnode.c

diff --git a/include/linux/persubnode.h b/include/linux/persubnode.h
new file mode 100644
index 000..b777daa
--- /dev/null
+++ b/include/linux/persubnode.h
@@ -0,0 +1,80 @@
+/*
+ * Per-subnode definitions
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long 
+ */
+#ifndef __LINUX_PERSUBNODE_H
+#define __LINUX_PERSUBNODE_H
+
+#include 
+#include 
+
+/*
+ * Per-subnode APIs
+ */
+#define __persubnode   __percpu
+#define nr_subnode_ids nr_cpu_ids
+#define alloc_persubnode(type) alloc_percpu(type)
+#define free_persubnode(var)   free_percpu(var)
+#define for_each_subnode(snode)for_each_cpu(snode, subnode_mask)
+#define per_subnode_ptr(ptr, subnode)  per_cpu_ptr(ptr, subnode)
+#define per_subnode(var, subnode)  per_cpu(var, subnode)
+
+#ifdef CONFIG_SMP
+
+extern struct cpumask __subnode_mask __read_mostly;
+DECLARE_PER_CPU_READ_MOSTLY(int, cpu_subnode_id);
+
+#define subnode_mask   (&__subnode_mask)
+
+static inline int this_cpu_to_subnode(void)
+{
+   return *this_cpu_ptr(&cpu_subnode_id);
+}
+
+/*
+ * For safety, preemption should be disabled before using this_subnode_ptr().
+ */
+#define this_subnode_ptr(ptr)  \
+({ \
+   int _snid = this_cpu_to_subnode();  \
+   per_cpu_ptr(ptr, _snid);\
+})
+
+#define get_subnode_ptr(ptr)   \
+({ \
+   preempt_disable();  \
+   this_subnode_ptr(ptr);  \
+})
+
+#define put_subnode_ptr(ptr)   \
+do {   \
+   (void)(ptr);\
+   preempt_enable();   \
+} while (0)
+
+extern void __init subnode_early_init(void);
+
+#else /* CONFIG_SMP */
+
+#define subnode_mask   cpu_possible_mask
+#define this_subnode_ptr(ptr)  this_cpu_ptr(ptr)
+#define get_subnode_ptr(ptr)   get_cpu_ptr(ptr)
+#define put_subnode_ptr(ptr)   put_cpu_ptr(ptr)
+
+static inline void subnode_early_init(void) { }
+
+#endif /* CONFIG_SMP */
+#endif /* __LINUX_PERSUBNODE_H */
diff --git a/init/main.c b/init/main.c
index 4c17fda..28e4425 100644
--- a/init/main.c
+++ b/init/main.c
@@ -81,6 +81,7 @@
 #include 
 #include 
 #include 
+#include <linux/persubnode.h>
 
 #include 
 #include 
@@ -524,6 +525,7 @@ asmlinkage __visible void __init start_kernel(void)
   NULL, set_init_arg);
 
jump_label_init();
+   subnode_early_init();
 
/*
 * These use large bootmem allocations and must precede
diff --git a/lib/Makefile b/lib/Makefile
index 92e8c38..440152c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -232,3 +232,5 @@ obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
 obj-$(CONFIG_UBSAN) += ubsan.o
 
 UBSAN_SANITIZE_ubsan.o := n
+
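
To make the intended usage concrete, here is a minimal sketch of a caller
built on the macros above. It is illustrative only: the structure and
function names (snode_list, my_lists and so on) are made up and not part
of this patch. Each subnode keeps one list, so a global scan walks
nr_subnode_ids lists instead of one list per CPU:

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/persubnode.h>
#include <linux/spinlock.h>

struct snode_list {
	spinlock_t lock;
	struct list_head head;
};

static struct snode_list __persubnode *my_lists;

static int __init my_lists_init(void)
{
	int snid;

	my_lists = alloc_persubnode(struct snode_list);
	if (!my_lists)
		return -ENOMEM;

	/* Initialize one lock and list head per subnode. */
	for_each_subnode(snid) {
		struct snode_list *sl = per_subnode_ptr(my_lists, snid);

		spin_lock_init(&sl->lock);
		INIT_LIST_HEAD(&sl->head);
	}
	return 0;
}

/*
 * Add an entry to this CPU's subnode list. The get/put pair disables
 * preemption so that the CPU-to-subnode lookup stays stable while the
 * pointer is in use.
 */
static void my_list_add(struct list_head *entry)
{
	struct snode_list *sl = get_subnode_ptr(my_lists);

	spin_lock(&sl->lock);
	list_add(entry, &sl->head);
	spin_unlock(&sl->lock);
	put_subnode_ptr(sl);
}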