Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-28 Thread Peter Zijlstra
On Mon, Aug 28, 2017 at 01:19:21PM +0800, Huang, Ying wrote:
> > What do you think about this version?
> >
> 
> Ping.

Thanks, yes that got lost in the inbox :-(

I'll queue it, thanks !


Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-27 Thread Huang, Ying
"Huang, Ying"  writes:

> Hi, Peter,
>
> "Huang, Ying"  writes:
>
>> Peter Zijlstra  writes:
>>
>>> On Sat, Aug 05, 2017 at 08:47:02AM +0800, Huang, Ying wrote:
 Yes.  That looks good.  So you will prepare the final patch?  Or you
 hope me to do that?
>>>
>>> I was hoping you'd do it ;-)
>>
>> Thanks!  Here is the updated patch
>>
>> Best Regards,
>> Huang, Ying
>>
>> -->8--
>> From 957735e9ff3922368286540dab852986fc7b23b5 Mon Sep 17 00:00:00 2001
>> From: Huang Ying 
>> Date: Mon, 7 Aug 2017 16:55:33 +0800
>> Subject: [PATCH -v3] IPI: Avoid to use 2 cache lines for one
>>  call_single_data
>>
>> struct call_single_data is used in IPIs to transfer information between
>> CPUs.  Its size is bigger than sizeof(unsigned long) and smaller than
>> the cache line size.  Currently, it is allocated with no explicit
>> alignment requirement, so an allocated call_single_data may cross 2
>> cache lines, which doubles the number of cache lines that need to be
>> transferred among CPUs.
>>
>> This is resolved by requiring call_single_data to be aligned to the
>> size of call_single_data, which is currently a power of 2.  If new
>> fields are added to call_single_data, padding may have to be added to
>> keep the size of the new definition a power of 2.  Fortunately, this
>> is enforced by gcc, which reports an error for a non-power-of-2
>> alignment requirement.
>>
>> To set the alignment requirement of call_single_data to the size of
>> call_single_data, a struct definition and a typedef are used.
>>
>> To test the effect of the patch, we use the vm-scalability multiple
>> thread swap test case (swap-w-seq-mt).  The test creates multiple
>> threads, and each thread eats memory until all RAM and part of swap
>> are used, so that a huge number of IPIs are triggered when unmapping
>> memory.  In the test, memory write throughput improves by ~5%
>> compared with misaligned call_single_data because of faster IPIs.
>
> What do you think about this version?
>

Ping.

Best Regards,
Huang, Ying


Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-13 Thread Huang, Ying
Hi, Peter,

"Huang, Ying"  writes:

> Peter Zijlstra  writes:
>
>> On Sat, Aug 05, 2017 at 08:47:02AM +0800, Huang, Ying wrote:
>>> Yes.  That looks good.  So you will prepare the final patch?  Or you
>>> hope me to do that?
>>
>> I was hoping you'd do it ;-)
>
> Thanks!  Here is the updated patch
>
> Best Regards,
> Huang, Ying
>
> -->8--
> From 957735e9ff3922368286540dab852986fc7b23b5 Mon Sep 17 00:00:00 2001
> From: Huang Ying 
> Date: Mon, 7 Aug 2017 16:55:33 +0800
> Subject: [PATCH -v3] IPI: Avoid to use 2 cache lines for one
>  call_single_data
>
> struct call_single_data is used in IPIs to transfer information between
> CPUs.  Its size is bigger than sizeof(unsigned long) and smaller than
> the cache line size.  Currently, it is allocated with no explicit
> alignment requirement, so an allocated call_single_data may cross 2
> cache lines, which doubles the number of cache lines that need to be
> transferred among CPUs.
>
> This is resolved by requiring call_single_data to be aligned to the
> size of call_single_data, which is currently a power of 2.  If new
> fields are added to call_single_data, padding may have to be added to
> keep the size of the new definition a power of 2.  Fortunately, this
> is enforced by gcc, which reports an error for a non-power-of-2
> alignment requirement.
>
> To set the alignment requirement of call_single_data to the size of
> call_single_data, a struct definition and a typedef are used.
>
> To test the effect of the patch, we use the vm-scalability multiple
> thread swap test case (swap-w-seq-mt).  The test creates multiple
> threads, and each thread eats memory until all RAM and part of swap
> are used, so that a huge number of IPIs are triggered when unmapping
> memory.  In the test, memory write throughput improves by ~5%
> compared with misaligned call_single_data because of faster IPIs.

What do you think about this version?

Best Regards,
Huang, Ying

> [Add call_single_data_t and align with size of call_single_data]
> Suggested-by: Peter Zijlstra 
> Signed-off-by: "Huang, Ying" 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Michael Ellerman 
> Cc: Borislav Petkov 
> Cc: Thomas Gleixner 
> Cc: Juergen Gross 
> Cc: Aaron Lu 
> ---
>  arch/mips/kernel/smp.c |  6 ++--
>  block/blk-softirq.c|  2 +-
>  drivers/block/null_blk.c   |  2 +-
>  drivers/cpuidle/coupled.c  | 10 +++
>  drivers/net/ethernet/cavium/liquidio/lio_main.c|  2 +-
>  drivers/net/ethernet/cavium/liquidio/octeon_droq.h |  2 +-
>  include/linux/blkdev.h |  2 +-
>  include/linux/netdevice.h  |  2 +-
>  include/linux/smp.h|  8 --
>  kernel/sched/sched.h   |  2 +-
>  kernel/smp.c   | 32 
> --
>  kernel/up.c|  2 +-
>  12 files changed, 39 insertions(+), 33 deletions(-)
>
> diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c
> index 770d4d1516cb..bd8ba5472bca 100644
> --- a/arch/mips/kernel/smp.c
> +++ b/arch/mips/kernel/smp.c
> @@ -648,12 +648,12 @@ EXPORT_SYMBOL(flush_tlb_one);
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
>  
>  static DEFINE_PER_CPU(atomic_t, tick_broadcast_count);
> -static DEFINE_PER_CPU(struct call_single_data, tick_broadcast_csd);
> +static DEFINE_PER_CPU(call_single_data_t, tick_broadcast_csd);
>  
>  void tick_broadcast(const struct cpumask *mask)
>  {
>   atomic_t *count;
> - struct call_single_data *csd;
> + call_single_data_t *csd;
>   int cpu;
>  
>   for_each_cpu(cpu, mask) {
> @@ -674,7 +674,7 @@ static void tick_broadcast_callee(void *info)
>  
>  static int __init tick_broadcast_init(void)
>  {
> - struct call_single_data *csd;
> + call_single_data_t *csd;
>   int cpu;
>  
>   for (cpu = 0; cpu < NR_CPUS; cpu++) {
> diff --git a/block/blk-softirq.c b/block/blk-softirq.c
> index 87b7df4851bf..07125e7941f4 100644
> --- a/block/blk-softirq.c
> +++ b/block/blk-softirq.c
> @@ -60,7 +60,7 @@ static void trigger_softirq(void *data)
>  static int raise_blk_irq(int cpu, struct request *rq)
>  {
>   if (cpu_online(cpu)) {
> - struct call_single_data *data = &rq->csd;
> + call_single_data_t *data = &rq->csd;
>  
>   data->func = trigger_softirq;
>   data->info = rq;
> diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
> index 85c24cace973..81142ce781da 100644
> --- a/drivers/block/null_blk.c
> +++ b/drivers/block/null_blk.c
> @@ -13,7 +13,7 @@
>  struct nullb_cmd {

Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-07 Thread Huang, Ying
Peter Zijlstra  writes:

> On Sat, Aug 05, 2017 at 08:47:02AM +0800, Huang, Ying wrote:
>> Yes.  That looks good.  So you will prepare the final patch?  Or you
>> hope me to do that?
>
> I was hoping you'd do it ;-)

Thanks!  Here is the updated patch

Best Regards,
Huang, Ying

-->8--
From 957735e9ff3922368286540dab852986fc7b23b5 Mon Sep 17 00:00:00 2001
From: Huang Ying 
Date: Mon, 7 Aug 2017 16:55:33 +0800
Subject: [PATCH -v3] IPI: Avoid to use 2 cache lines for one
 call_single_data

struct call_single_data is used in IPIs to transfer information between
CPUs.  Its size is bigger than sizeof(unsigned long) and smaller than
the cache line size.  Currently, it is allocated with no explicit
alignment requirement, so an allocated call_single_data may cross 2
cache lines, which doubles the number of cache lines that need to be
transferred among CPUs.

This is resolved by requiring call_single_data to be aligned to the
size of call_single_data, which is currently a power of 2.  If new
fields are added to call_single_data, padding may have to be added to
keep the size of the new definition a power of 2.  Fortunately, this
is enforced by gcc, which reports an error for a non-power-of-2
alignment requirement.

To set the alignment requirement of call_single_data to the size of
call_single_data, a struct definition and a typedef are used.

To test the effect of the patch, we use the vm-scalability multiple
thread swap test case (swap-w-seq-mt).  The test creates multiple
threads, and each thread eats memory until all RAM and part of swap
are used, so that a huge number of IPIs are triggered when unmapping
memory.  In the test, memory write throughput improves by ~5%
compared with misaligned call_single_data because of faster IPIs.

[Add call_single_data_t and align with size of call_single_data]
Suggested-by: Peter Zijlstra 
Signed-off-by: "Huang, Ying" 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Michael Ellerman 
Cc: Borislav Petkov 
Cc: Thomas Gleixner 
Cc: Juergen Gross 
Cc: Aaron Lu 
---
 arch/mips/kernel/smp.c |  6 ++--
 block/blk-softirq.c|  2 +-
 drivers/block/null_blk.c   |  2 +-
 drivers/cpuidle/coupled.c  | 10 +++
 drivers/net/ethernet/cavium/liquidio/lio_main.c|  2 +-
 drivers/net/ethernet/cavium/liquidio/octeon_droq.h |  2 +-
 include/linux/blkdev.h |  2 +-
 include/linux/netdevice.h  |  2 +-
 include/linux/smp.h|  8 --
 kernel/sched/sched.h   |  2 +-
 kernel/smp.c   | 32 --
 kernel/up.c|  2 +-
 12 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c
index 770d4d1516cb..bd8ba5472bca 100644
--- a/arch/mips/kernel/smp.c
+++ b/arch/mips/kernel/smp.c
@@ -648,12 +648,12 @@ EXPORT_SYMBOL(flush_tlb_one);
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 
 static DEFINE_PER_CPU(atomic_t, tick_broadcast_count);
-static DEFINE_PER_CPU(struct call_single_data, tick_broadcast_csd);
+static DEFINE_PER_CPU(call_single_data_t, tick_broadcast_csd);
 
 void tick_broadcast(const struct cpumask *mask)
 {
atomic_t *count;
-   struct call_single_data *csd;
+   call_single_data_t *csd;
int cpu;
 
for_each_cpu(cpu, mask) {
@@ -674,7 +674,7 @@ static void tick_broadcast_callee(void *info)
 
 static int __init tick_broadcast_init(void)
 {
-   struct call_single_data *csd;
+   call_single_data_t *csd;
int cpu;
 
for (cpu = 0; cpu < NR_CPUS; cpu++) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 87b7df4851bf..07125e7941f4 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -60,7 +60,7 @@ static void trigger_softirq(void *data)
 static int raise_blk_irq(int cpu, struct request *rq)
 {
if (cpu_online(cpu)) {
-   struct call_single_data *data = &rq->csd;
+   call_single_data_t *data = &rq->csd;
 
data->func = trigger_softirq;
data->info = rq;
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 85c24cace973..81142ce781da 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -13,7 +13,7 @@
 struct nullb_cmd {
struct list_head list;
struct llist_node ll_list;
-   struct call_single_data csd;
+   call_single_data_t csd;
struct request *rq;
struct bio *bio;
unsigned int tag;
diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
index 71e586d7df71..147f38ea0fcd 100644
--- a/drivers/cpuidle/coupled.c
+++ b/drivers/cpuidle/coupled.c
@@ -119,13 +119,13 @@ struct cpuidle_coupled {
 
 #define CPUIDLE_COUPLED_NOT_IDLE   (-1)
 
-static DEFINE_PER_CPU(struct call_single_data, 

Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-07 Thread Peter Zijlstra
On Sat, Aug 05, 2017 at 08:47:02AM +0800, Huang, Ying wrote:
> Yes.  That looks good.  So you will prepare the final patch?  Or you
> hope me to do that?

I was hoping you'd do it ;-)


Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-04 Thread Huang, Ying
Peter Zijlstra  writes:

> On Fri, Aug 04, 2017 at 10:05:55AM +0800, Huang, Ying wrote:
>> "Huang, Ying"  writes:
>> > Peter Zijlstra  writes:
>
>> >> +struct __call_single_data {
>> >>   struct llist_node llist;
>> >>   smp_call_func_t func;
>> >>   void *info;
>> >>   unsigned int flags;
>> >>  };
>> >>  
>> >> +typedef struct __call_single_data call_single_data_t
>> >> + __aligned(sizeof(struct __call_single_data));
>> >> +
>> >
>> > Another requirement of the alignment is that it should be the power of
>> > 2.  Otherwise, for example, if someone adds a field to struct, so that
>> > the size becomes 40 on x86_64.  The alignment should be 64 instead of
>> > 40.
>> 
>> Thanks Aaron, he reminded me that there is a roundup_pow_of_two().  So
>> the typedef could be,
>> 
>> typedef struct __call_single_data call_single_data_t
>>  __aligned(roundup_pow_of_two(sizeof(struct __call_single_data));
>> 
>
> Yes, that would take away the requirement to play padding games with the
> struct. Then again, maybe it's a good thing to have to be explicit about
> it.
>
> If you see:
>
> struct __call_single_data {
>   struct llist_node llist;
>   smp_call_func_t func;
>   void *info;
>   int flags;
>   void *extra_field;
>
>   unsigned long __padding[3]; /* make align work */
> };
>
> that makes it very clear what is going on. In any case, we can delay
> this part because the current structure is a power-of-2 for both ILP32
> and LP64. So only the person growing this will have to deal with it ;-)

Yes.  That looks good.  So you will prepare the final patch?  Or you
hope me to do that?

Best Regards,
Huang, Ying


Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-04 Thread Peter Zijlstra
On Fri, Aug 04, 2017 at 10:05:55AM +0800, Huang, Ying wrote:
> "Huang, Ying"  writes:
> > Peter Zijlstra  writes:

> >> +struct __call_single_data {
> >>struct llist_node llist;
> >>smp_call_func_t func;
> >>void *info;
> >>unsigned int flags;
> >>  };
> >>  
> >> +typedef struct __call_single_data call_single_data_t
> >> +  __aligned(sizeof(struct __call_single_data));
> >> +
> >
> > Another requirement of the alignment is that it should be the power of
> > 2.  Otherwise, for example, if someone adds a field to struct, so that
> > the size becomes 40 on x86_64.  The alignment should be 64 instead of
> > 40.
> 
> Thanks Aaron, he reminded me that there is a roundup_pow_of_two().  So
> the typedef could be,
> 
> typedef struct __call_single_data call_single_data_t
>   __aligned(roundup_pow_of_two(sizeof(struct __call_single_data));
> 

Yes, that would take away the requirement to play padding games with the
struct. Then again, maybe it's a good thing to have to be explicit about
it.

If you see:

struct __call_single_data {
struct llist_node llist;
smp_call_func_t func;
void *info;
int flags;
void *extra_field;

unsigned long __padding[3]; /* make align work */
};

that makes it very clear what is going on. In any case, we can delay
this part because the current structure is a power-of-2 for both ILP32
and LP64. So only the person growing this will have to deal with it ;-)


Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-04 Thread Peter Zijlstra
On Fri, Aug 04, 2017 at 09:28:17AM +0800, Huang, Ying wrote:
> Peter Zijlstra  writes:
> [snip]
> > diff --git a/include/linux/smp.h b/include/linux/smp.h
> > index 68123c1fe549..8d817cb80a38 100644
> > --- a/include/linux/smp.h
> > +++ b/include/linux/smp.h
> > @@ -14,13 +14,16 @@
> >  #include 
> >  
> >  typedef void (*smp_call_func_t)(void *info);
> > -struct call_single_data {
> > +struct __call_single_data {
> > struct llist_node llist;
> > smp_call_func_t func;
> > void *info;
> > unsigned int flags;
> >  };
> >  
> > +typedef struct __call_single_data call_single_data_t
> > +   __aligned(sizeof(struct __call_single_data));
> > +
> 
> Another requirement of the alignment is that it should be the power of
> 2.  Otherwise, for example, if someone adds a field to struct, so that
> the size becomes 40 on x86_64.  The alignment should be 64 instead of
> 40.

Yes I know. This generates a compiler error if sizeof() isn't a
power of 2. That's similar to the BUILD_BUG_ON() you added.


Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-03 Thread Huang, Ying
"Huang, Ying"  writes:

> Peter Zijlstra  writes:
> [snip]
>> diff --git a/include/linux/smp.h b/include/linux/smp.h
>> index 68123c1fe549..8d817cb80a38 100644
>> --- a/include/linux/smp.h
>> +++ b/include/linux/smp.h
>> @@ -14,13 +14,16 @@
>>  #include 
>>  
>>  typedef void (*smp_call_func_t)(void *info);
>> -struct call_single_data {
>> +struct __call_single_data {
>>  struct llist_node llist;
>>  smp_call_func_t func;
>>  void *info;
>>  unsigned int flags;
>>  };
>>  
>> +typedef struct __call_single_data call_single_data_t
>> +__aligned(sizeof(struct __call_single_data));
>> +
>
> Another requirement is that the alignment should be a power of 2.
> Otherwise, for example, if someone adds a field to the struct so that
> the size becomes 40 on x86_64, the alignment should be 64 instead of
> 40.

Thanks to Aaron, who reminded me that there is a roundup_pow_of_two().  So
the typedef could be,

typedef struct __call_single_data call_single_data_t
__aligned(roundup_pow_of_two(sizeof(struct __call_single_data)));

Best Regards,
Huang, Ying

> Best Regards,
> Huang, Ying
>
>>  /* total number of cpus in this system (may exceed NR_CPUS) */
>>  extern unsigned int total_cpus;
>>  
> [snip]


Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-03 Thread Huang, Ying
Peter Zijlstra  writes:
[snip]
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 68123c1fe549..8d817cb80a38 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -14,13 +14,16 @@
>  #include 
>  
>  typedef void (*smp_call_func_t)(void *info);
> -struct call_single_data {
> +struct __call_single_data {
>   struct llist_node llist;
>   smp_call_func_t func;
>   void *info;
>   unsigned int flags;
>  };
>  
> +typedef struct __call_single_data call_single_data_t
> + __aligned(sizeof(struct __call_single_data));
> +

Another requirement is that the alignment should be a power of 2.
Otherwise, for example, if someone adds a field to the struct so that
the size becomes 40 on x86_64, the alignment should be 64 instead of
40.

Best Regards,
Huang, Ying

>  /* total number of cpus in this system (may exceed NR_CPUS) */
>  extern unsigned int total_cpus;
>  
[snip]


Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-03 Thread Peter Zijlstra
On Thu, Aug 03, 2017 at 04:35:21PM +0800, Huang, Ying wrote:
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 68123c1fe549..4d3b372d50b0 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -13,13 +13,22 @@
>  #include 
>  #include 
>  
> +#define CSD_ALIGNMENT  (4 * sizeof(void *))
> +
>  typedef void (*smp_call_func_t)(void *info);
>  struct call_single_data {
>   struct llist_node llist;
>   smp_call_func_t func;
>   void *info;
>   unsigned int flags;
> -};
> +} __aligned(CSD_ALIGNMENT);
> +
> +/* To avoid allocate csd across 2 cache lines */
> +static inline void check_alignment_of_csd(void)
> +{
> + BUILD_BUG_ON((CSD_ALIGNMENT & (CSD_ALIGNMENT - 1)) != 0);
> + BUILD_BUG_ON(sizeof(struct call_single_data) > CSD_ALIGNMENT);
> +}
>  
>  /* total number of cpus in this system (may exceed NR_CPUS) */
>  extern unsigned int total_cpus;

Bah, C sucks... a much larger but possibly nicer patch:

---
diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c
index 770d4d1516cb..bd8ba5472bca 100644
--- a/arch/mips/kernel/smp.c
+++ b/arch/mips/kernel/smp.c
@@ -648,12 +648,12 @@ EXPORT_SYMBOL(flush_tlb_one);
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 
 static DEFINE_PER_CPU(atomic_t, tick_broadcast_count);
-static DEFINE_PER_CPU(struct call_single_data, tick_broadcast_csd);
+static DEFINE_PER_CPU(call_single_data_t, tick_broadcast_csd);
 
 void tick_broadcast(const struct cpumask *mask)
 {
atomic_t *count;
-   struct call_single_data *csd;
+   call_single_data_t *csd;
int cpu;
 
for_each_cpu(cpu, mask) {
@@ -674,7 +674,7 @@ static void tick_broadcast_callee(void *info)
 
 static int __init tick_broadcast_init(void)
 {
-   struct call_single_data *csd;
+   call_single_data_t *csd;
int cpu;
 
for (cpu = 0; cpu < NR_CPUS; cpu++) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 87b7df4851bf..07125e7941f4 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -60,7 +60,7 @@ static void trigger_softirq(void *data)
 static int raise_blk_irq(int cpu, struct request *rq)
 {
if (cpu_online(cpu)) {
-   struct call_single_data *data = &rq->csd;
+   call_single_data_t *data = &rq->csd;
 
data->func = trigger_softirq;
data->info = rq;
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 85c24cace973..81142ce781da 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -13,7 +13,7 @@
 struct nullb_cmd {
struct list_head list;
struct llist_node ll_list;
-   struct call_single_data csd;
+   call_single_data_t csd;
struct request *rq;
struct bio *bio;
unsigned int tag;
diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
index 71e586d7df71..e54be79b2084 100644
--- a/drivers/cpuidle/coupled.c
+++ b/drivers/cpuidle/coupled.c
@@ -119,7 +119,7 @@ struct cpuidle_coupled {
 
 #define CPUIDLE_COUPLED_NOT_IDLE   (-1)
 
-static DEFINE_PER_CPU(struct call_single_data, cpuidle_coupled_poke_cb);
+static DEFINE_PER_CPU(call_single_data_t, cpuidle_coupled_poke_cb);
 
 /*
  * The cpuidle_coupled_poke_pending mask is used to avoid calling
@@ -339,7 +339,7 @@ static void cpuidle_coupled_handle_poke(void *info)
  */
 static void cpuidle_coupled_poke(int cpu)
 {
-   struct call_single_data *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);
+   call_single_data_t *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);
 
if (!cpumask_test_and_set_cpu(cpu, &cpuidle_coupled_poke_pending))
smp_call_function_single_async(cpu, csd);
@@ -651,7 +651,7 @@ int cpuidle_coupled_register_device(struct cpuidle_device *dev)
 {
int cpu;
struct cpuidle_device *other_dev;
-   struct call_single_data *csd;
+   call_single_data_t *csd;
struct cpuidle_coupled *coupled;
 
if (cpumask_empty(&dev->coupled_cpus))
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 51583ae4b1eb..120b6e537b28 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -2468,7 +2468,7 @@ static void liquidio_napi_drv_callback(void *arg)
if (OCTEON_CN23XX_PF(oct) || droq->cpu_id == this_cpu) {
napi_schedule_irqoff(&droq->napi);
} else {
-   struct call_single_data *csd = &droq->csd;
+   call_single_data_t *csd = &droq->csd;
 
csd->func = napi_schedule_wrapper;
csd->info = &droq->napi;
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_droq.h b/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
index 6efd139b894d..f91bc84d1719 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
@@ -328,7 +328,7 @@ struct octeon_droq {
 
u32 cpu_id;
 
-   struct call_single_data csd;
+   call_single_data_t 

Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-03 Thread Huang, Ying
Eric Dumazet  writes:

> On Wed, 2017-08-02 at 16:52 +0800, Huang, Ying wrote:
>> From: Huang Ying 
>> 
>> struct call_single_data is used in IPI to transfer information between
>> CPUs.  Its size is bigger than sizeof(unsigned long) and less than
>> cache line size.  Now, it is allocated with no any alignment
>> requirement.  This makes it possible for allocated call_single_data to
>> cross 2 cache lines.  So that double the number of the cache lines
>> that need to be transferred among CPUs.  This is resolved by aligning
>> the allocated call_single_data with cache line size.
>> 
>> To test the effect of the patch, we use the vm-scalability multiple
>> thread swap test case (swap-w-seq-mt).  The test will create multiple
>> threads and each thread will eat memory until all RAM and part of swap
>> is used, so that huge number of IPI will be triggered when unmapping
>> memory.  In the test, the throughput of memory writing improves ~5%
>> compared with misaligned call_single_data because of faster IPI.
>> 
>> Signed-off-by: "Huang, Ying" 
>> Cc: Peter Zijlstra 
>> Cc: Ingo Molnar 
>> Cc: Michael Ellerman 
>> Cc: Borislav Petkov 
>> Cc: Thomas Gleixner 
>> Cc: Juergen Gross 
>> Cc: Aaron Lu 
>> ---
>  kernel/smp.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>> 
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index 3061483cb3ad..81d9ae08eb6e 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -51,7 +51,7 @@ int smpcfd_prepare_cpu(unsigned int cpu)
>>  free_cpumask_var(cfd->cpumask);
>>  return -ENOMEM;
>>  }
>> -cfd->csd = alloc_percpu(struct call_single_data);
>> +cfd->csd = alloc_percpu_aligned(struct call_single_data);
>
> I do not believe allocating 64 bytes (per cpu) for this structure is
> needed. That would be an increase of cache lines.
>
> What we can do instead is to force an alignment on 4*sizeof(void *).
> (32 bytes on 64bit, 16 bytes on 32bit arches)
>
> Maybe something like this :
>
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 68123c1fe54918c051292eb5ba3427df09f31c2f..f7072bf173c5456e38e958d6af85a4793bced96e 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -19,7 +19,7 @@ struct call_single_data {
>   smp_call_func_t func;
>   void *info;
>   unsigned int flags;
> -};
> +} __attribute__((aligned(4 * sizeof(void *;
>  
>  /* total number of cpus in this system (may exceed NR_CPUS) */
>  extern unsigned int total_cpus;

OK.  And if sizeof(struct call_single_data) changes, we need to
change the alignment accordingly too.  So I added some BUILD_BUG_ON()
checks for that.

Best Regards,
Huang, Ying

-->8--
From 2c400e9b1793f1c1d33bc278f5bc066e32ca4fee Mon Sep 17 00:00:00 2001
From: Huang Ying 
Date: Thu, 27 Jul 2017 16:43:20 +0800
Subject: [PATCH -v2] IPI: Avoid to use 2 cache lines for one call_single_data

struct call_single_data is used in IPI to transfer information between
CPUs.  Its size is bigger than sizeof(unsigned long) and less than
cache line size.  Now, it is allocated with no explicit alignment
requirement.  This makes it possible for an allocated call_single_data to
cross 2 cache lines, which doubles the number of cache lines
that need to be transferred among CPUs.

This is resolved by aligning the allocated call_single_data with 4 *
sizeof(void *).  If the size of struct call_single_data is changed in
the future, the alignment should be changed accordingly.  It should be
at least sizeof(struct call_single_data) and a power of 2.

To test the effect of the patch, we use the vm-scalability multiple
thread swap test case (swap-w-seq-mt).  The test will create multiple
threads and each thread will eat memory until all RAM and part of swap
is used, so that a huge number of IPIs will be triggered when unmapping
memory.  In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data because of faster IPIs.

[Align with 4 * sizeof(void*) instead of cache line size]
Suggested-by: Eric Dumazet 
Signed-off-by: "Huang, Ying" 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Michael Ellerman 
Cc: Borislav Petkov 
Cc: Thomas Gleixner 
Cc: Juergen Gross 
Cc: Aaron Lu 
---
 include/linux/smp.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 68123c1fe549..4d3b372d50b0 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -13,13 +13,22 @@
 #include 
 #include 
 
+#define CSD_ALIGNMENT  (4 * sizeof(void *))
+
 typedef void (*smp_call_func_t)(void *info);
 struct call_single_data {
struct llist_node llist;
smp_call_func_t func;
void *info;
unsigned int flags;
-};
+} __aligned(CSD_ALIGNMENT);
+
+/* To avoid allocate csd across 2 cache lines */
+static inline void check_alignment_of_csd(void)
+{
+   BUILD_BUG_ON((CSD_ALIGNMENT & (CSD_ALIGNMENT - 1)) != 0);
+   BUILD_BUG_ON(sizeof(struct call_single_data) > CSD_ALIGNMENT);

Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-02 Thread Peter Zijlstra
On Wed, Aug 02, 2017 at 03:18:58AM -0700, Eric Dumazet wrote:
> What we can do instead is to force an alignment on 4*sizeof(void *).
> (32 bytes on 64bit, 16 bytes on 32bit arches)
> 
> Maybe something like this :
> 
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 68123c1fe54918c051292eb5ba3427df09f31c2f..f7072bf173c5456e38e958d6af85a4793bced96e 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -19,7 +19,7 @@ struct call_single_data {
>   smp_call_func_t func;
>   void *info;
>   unsigned int flags;
> -};
> +} __attribute__((aligned(4 * sizeof(void *;

Agreed.


Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-02 Thread Eric Dumazet
On Wed, 2017-08-02 at 16:52 +0800, Huang, Ying wrote:
> From: Huang Ying 
> 
> struct call_single_data is used in IPI to transfer information between
> CPUs.  Its size is bigger than sizeof(unsigned long) and less than
> cache line size.  Now, it is allocated with no any alignment
> requirement.  This makes it possible for allocated call_single_data to
> cross 2 cache lines.  So that double the number of the cache lines
> that need to be transferred among CPUs.  This is resolved by aligning
> the allocated call_single_data with cache line size.
> 
> To test the effect of the patch, we use the vm-scalability multiple
> thread swap test case (swap-w-seq-mt).  The test will create multiple
> threads and each thread will eat memory until all RAM and part of swap
> is used, so that huge number of IPI will be triggered when unmapping
> memory.  In the test, the throughput of memory writing improves ~5%
> compared with misaligned call_single_data because of faster IPI.
> 
> Signed-off-by: "Huang, Ying" 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Michael Ellerman 
> Cc: Borislav Petkov 
> Cc: Thomas Gleixner 
> Cc: Juergen Gross 
> Cc: Aaron Lu 
> ---
>  kernel/smp.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 3061483cb3ad..81d9ae08eb6e 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -51,7 +51,7 @@ int smpcfd_prepare_cpu(unsigned int cpu)
>   free_cpumask_var(cfd->cpumask);
>   return -ENOMEM;
>   }
> - cfd->csd = alloc_percpu(struct call_single_data);
> + cfd->csd = alloc_percpu_aligned(struct call_single_data);

I do not believe allocating 64 bytes (per cpu) for this structure is
needed. That would be an increase of cache lines.

What we can do instead is to force an alignment on 4*sizeof(void *).
(32 bytes on 64bit, 16 bytes on 32bit arches)

Maybe something like this :

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 68123c1fe54918c051292eb5ba3427df09f31c2f..f7072bf173c5456e38e958d6af85a4793bced96e 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -19,7 +19,7 @@ struct call_single_data {
smp_call_func_t func;
void *info;
unsigned int flags;
-};
+} __attribute__((aligned(4 * sizeof(void *;
 
 /* total number of cpus in this system (may exceed NR_CPUS) */
 extern unsigned int total_cpus;




[PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

2017-08-02 Thread Huang, Ying
From: Huang Ying 

struct call_single_data is used in IPI to transfer information between
CPUs.  Its size is bigger than sizeof(unsigned long) and less than
cache line size.  Now, it is allocated with no explicit alignment
requirement.  This makes it possible for an allocated call_single_data to
cross 2 cache lines, which doubles the number of cache lines
that need to be transferred among CPUs.  This is resolved by aligning
the allocated call_single_data with the cache line size.

To test the effect of the patch, we use the vm-scalability multiple
thread swap test case (swap-w-seq-mt).  The test will create multiple
threads and each thread will eat memory until all RAM and part of swap
is used, so that a huge number of IPIs will be triggered when unmapping
memory.  In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data because of faster IPIs.

Signed-off-by: "Huang, Ying" 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Michael Ellerman 
Cc: Borislav Petkov 
Cc: Thomas Gleixner 
Cc: Juergen Gross 
Cc: Aaron Lu 
---
 kernel/smp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 3061483cb3ad..81d9ae08eb6e 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -51,7 +51,7 @@ int smpcfd_prepare_cpu(unsigned int cpu)
free_cpumask_var(cfd->cpumask);
return -ENOMEM;
}
-   cfd->csd = alloc_percpu(struct call_single_data);
+   cfd->csd = alloc_percpu_aligned(struct call_single_data);
if (!cfd->csd) {
free_cpumask_var(cfd->cpumask);
free_cpumask_var(cfd->cpumask_ipi);
@@ -269,7 +269,9 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 int wait)
 {
struct call_single_data *csd;
-   struct call_single_data csd_stack = { .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS };
+   struct call_single_data csd_stack cacheline_aligned = {
+   .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS
+   };
int this_cpu;
int err;
 
-- 
2.13.2


