Re: [PATCH v3 1/2] sched: smart wake-affine foundation

2013-07-09 Thread Michael Wang
On 07/10/2013 09:52 AM, Sam Ben wrote:
> On 07/08/2013 10:36 AM, Michael Wang wrote:
>> Hi, Sam
>>
>> On 07/07/2013 09:31 AM, Sam Ben wrote:
>>> On 07/04/2013 12:55 PM, Michael Wang wrote:
 wake-affine always tries to pull the wakee close to the waker; in
 theory, this will bring a benefit if the waker's CPU has cached hot
 data for the wakee, or in the extreme ping-pong case.
>>> What's the meaning of the ping-pong case?
>> PeterZ explained it well here:
>>
>> https://lkml.org/lkml/2013/3/7/332
>>
>> And you could try to compare:
>> taskset 1 perf bench sched pipe
>> with
>> perf bench sched pipe
> 
> Why is sched pipe special?

I think the link already explains the reason well; or you can read the
code of that pipe benchmark, and you will find there is a high chance
of hitting the ping-pong case :)
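
For illustration, here is a minimal user-space sketch of the ping-pong
pattern (an assumption of what the pipe benchmark boils down to, not
its actual code): the two tasks bounce one byte over a pair of pipes,
so each task always wakes the same partner, last_wakee never changes,
and nr_wakee_switch stays low.

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	int p2c[2], c2p[2];	/* parent->child and child->parent pipes */
	char buf = 'x';
	int i, loops = 100000;

	if (pipe(p2c) || pipe(c2p))
		exit(1);

	if (fork() == 0) {
		/* child: read from the parent, echo straight back */
		for (i = 0; i < loops; i++) {
			if (read(p2c[0], &buf, 1) != 1 ||
			    write(c2p[1], &buf, 1) != 1)
				exit(1);
		}
		exit(0);
	}

	/* parent: send a byte, then block until the echo arrives */
	for (i = 0; i < loops; i++) {
		if (write(p2c[1], &buf, 1) != 1 ||
		    read(c2p[0], &buf, 1) != 1)
			exit(1);
	}
	return 0;
}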

Regards,
Michael Wang

> 
>>
>> to confirm it ;-)
>>
>> Regards,
>> Michael Wang
>>
 And testing shows it could benefit hackbench by 15% at most.

 However, the whole thing is done somewhat blindly and is
 time-consuming, so some workloads suffer.

 And testing shows it could damage pgbench by 50% at most.

 Thus, the wake-affine logic should be smarter and realise when
 to stop its thankless effort.

 This patch introduces 'nr_wakee_switch', which is increased each
 time a task switches its wakee.

 So a high 'nr_wakee_switch' means the task has more than one wakee;
 the bigger the number, the higher the wakeup frequency.

 Now, when deciding whether to pull or not, pay attention to a wakee
 with a high 'nr_wakee_switch': pulling such a task may benefit the
 wakee, but it also implies that the waker will face fierce
 competition later; it could be fierce or brief depending on the
 story behind 'nr_wakee_switch', but either way the waker suffers.

 Furthermore, if the waker also has a high 'nr_wakee_switch', then
 multiple tasks rely on it, and the waker's higher latency will
 damage all of them; pulling the wakee then looks like a bad deal.

 Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' becomes
 higher and higher, the deal looks worse and worse.

 The patch therefore makes wake-affine stop its work when:

  wakee->nr_wakee_switch > factor &&
  waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

 The factor here is the node size of the current CPU, so a bigger
 node will lead to more pulling, since the criterion for stopping
 becomes harder to meet.

 After applying the patch, pgbench shows a 40% improvement at most.

 Test:
  Tested on a 12-CPU x86 server with tip 3.10.0-rc7.

  pgbench               base       smart

  | db_size | clients |  tps  |   |  tps  |
  +---------+---------+-------+   +-------+
  | 22 MB   |       1 | 10598 |   | 10796 |
  | 22 MB   |       2 | 21257 |   | 21336 |
  | 22 MB   |       4 | 41386 |   | 41622 |
  | 22 MB   |       8 | 51253 |   | 57932 |
  | 22 MB   |      12 | 48570 |   | 54000 |
  | 22 MB   |      16 | 46748 |   | 55982 | +19.75%
  | 22 MB   |      24 | 44346 |   | 55847 | +25.93%
  | 22 MB   |      32 | 43460 |   | 54614 | +25.66%
  | 7484 MB |       1 |  8951 |   |  9193 |
  | 7484 MB |       2 | 19233 |   | 19240 |
  | 7484 MB |       4 | 37239 |   | 37302 |
  | 7484 MB |       8 | 46087 |   | 50018 |
  | 7484 MB |      12 | 42054 |   | 48763 |
  | 7484 MB |      16 | 40765 |   | 51633 | +26.66%
  | 7484 MB |      24 | 37651 |   | 52377 | +39.11%
  | 7484 MB |      32 | 37056 |   | 51108 | +37.92%
  | 15 GB   |       1 |  8845 |   |  9104 |
  | 15 GB   |       2 | 19094 |   | 19162 |
  | 15 GB   |       4 | 36979 |   | 36983 |
  | 15 GB   |       8 | 46087 |   | 49977 |
  | 15 GB   |      12 | 41901 |   | 48591 |
  | 15 GB   |      16 | 40147 |   | 50651 | +26.16%
  | 15 GB   |      24 | 37250 |   | 52365 | +40.58%
  | 15 GB   |      32 | 36470 |   | 50015 | +37.14%

 CC: Ingo Molnar 
 CC: Peter Zijlstra 
 CC: Mike Galbraith 
 Signed-off-by: Michael Wang 
 ---
include/linux/sched.h |    3 +++
kernel/sched/fair.c   |   47 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 0 deletions(-)

 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index 178a8d9..1c996c7 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1041,6 +1041,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+	struct task_struct *last_wakee;
+	unsigned long nr_wakee_switch;
+	unsigned long last_switch_decay;
 #endif
 	int on_rq;

Re: [PATCH v3 1/2] sched: smart wake-affine foundation

2013-07-09 Thread Sam Ben

On 07/08/2013 10:36 AM, Michael Wang wrote:

Hi, Sam

On 07/07/2013 09:31 AM, Sam Ben wrote:

On 07/04/2013 12:55 PM, Michael Wang wrote:

wake-affine always tries to pull the wakee close to the waker; in theory,
this will bring a benefit if the waker's CPU has cached hot data for the
wakee, or in the extreme ping-pong case.

What's the meaning of the ping-pong case?

PeterZ explained it well here:

https://lkml.org/lkml/2013/3/7/332

And you could try to compare:
taskset 1 perf bench sched pipe
with
perf bench sched pipe


Why is sched pipe special?



to confirm it ;-)

Regards,
Michael Wang


And testing shows it could benefit hackbench by 15% at most.

However, the whole thing is done somewhat blindly and is
time-consuming, so some workloads suffer.

And testing shows it could damage pgbench by 50% at most.

Thus, the wake-affine logic should be smarter and realise when
to stop its thankless effort.

This patch introduces 'nr_wakee_switch', which is increased each
time a task switches its wakee.

So a high 'nr_wakee_switch' means the task has more than one wakee;
the bigger the number, the higher the wakeup frequency.

Now, when deciding whether to pull or not, pay attention to a wakee
with a high 'nr_wakee_switch': pulling such a task may benefit the
wakee, but it also implies that the waker will face fierce
competition later; it could be fierce or brief depending on the
story behind 'nr_wakee_switch', but either way the waker suffers.

Furthermore, if the waker also has a high 'nr_wakee_switch', then
multiple tasks rely on it, and the waker's higher latency will
damage all of them; pulling the wakee then looks like a bad deal.

Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' becomes
higher and higher, the deal looks worse and worse.

The patch therefore makes wake-affine stop its work when:

 wakee->nr_wakee_switch > factor &&
 waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

The factor here is the node size of the current CPU, so a bigger
node will lead to more pulling, since the criterion for stopping
becomes harder to meet.

After applying the patch, pgbench shows a 40% improvement at most.

Test:
 Tested on a 12-CPU x86 server with tip 3.10.0-rc7.

 pgbench               base       smart

 | db_size | clients |  tps  |   |  tps  |
 +---------+---------+-------+   +-------+
 | 22 MB   |       1 | 10598 |   | 10796 |
 | 22 MB   |       2 | 21257 |   | 21336 |
 | 22 MB   |       4 | 41386 |   | 41622 |
 | 22 MB   |       8 | 51253 |   | 57932 |
 | 22 MB   |      12 | 48570 |   | 54000 |
 | 22 MB   |      16 | 46748 |   | 55982 | +19.75%
 | 22 MB   |      24 | 44346 |   | 55847 | +25.93%
 | 22 MB   |      32 | 43460 |   | 54614 | +25.66%
 | 7484 MB |       1 |  8951 |   |  9193 |
 | 7484 MB |       2 | 19233 |   | 19240 |
 | 7484 MB |       4 | 37239 |   | 37302 |
 | 7484 MB |       8 | 46087 |   | 50018 |
 | 7484 MB |      12 | 42054 |   | 48763 |
 | 7484 MB |      16 | 40765 |   | 51633 | +26.66%
 | 7484 MB |      24 | 37651 |   | 52377 | +39.11%
 | 7484 MB |      32 | 37056 |   | 51108 | +37.92%
 | 15 GB   |       1 |  8845 |   |  9104 |
 | 15 GB   |       2 | 19094 |   | 19162 |
 | 15 GB   |       4 | 36979 |   | 36983 |
 | 15 GB   |       8 | 46087 |   | 49977 |
 | 15 GB   |      12 | 41901 |   | 48591 |
 | 15 GB   |      16 | 40147 |   | 50651 | +26.16%
 | 15 GB   |      24 | 37250 |   | 52365 | +40.58%
 | 15 GB   |      32 | 36470 |   | 50015 | +37.14%

CC: Ingo Molnar 
CC: Peter Zijlstra 
CC: Mike Galbraith 
Signed-off-by: Michael Wang 
---
   include/linux/sched.h |    3 +++
   kernel/sched/fair.c   |   47 +++++++++++++++++++++++++++++++++++++++++++++++
   2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..1c996c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1041,6 +1041,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+	struct task_struct *last_wakee;
+	unsigned long nr_wakee_switch;
+	unsigned long last_switch_decay;
 #endif
 	int on_rq;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..a4ddbf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2971,6 +2971,23 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	return 0;
 }
 
+static void record_wakee(struct task_struct *p)
+{
+	/*
+	 * Rough decay (wiping) for cost saving; don't worry
+	 * about the boundary, a really active task won't mind
+	 * the loss.
+	 */
+	if (jiffies > current->last_switch_decay + HZ) {
+		current->nr_wakee_switch = 0;
+		current->last_switch_decay = jiffies;
+	}
+
+	if (current->last_wakee != p) {
+		current->last_wakee = p;
+		current->nr_wakee_switch++;
+	}
+}
 
 static void task_waking_fair(struct task_struct *p)
 {
@@ -2991,6 +3008,7 @@ 

Re: [PATCH v3 1/2] sched: smart wake-affine foundation

2013-07-07 Thread Michael Wang
Hi, Sam

On 07/07/2013 09:31 AM, Sam Ben wrote:
> On 07/04/2013 12:55 PM, Michael Wang wrote:
>> wake-affine always tries to pull the wakee close to the waker; in
>> theory, this will bring a benefit if the waker's CPU has cached hot
>> data for the wakee, or in the extreme ping-pong case.
> 
> What's the meaning of the ping-pong case?

PeterZ explained it well here:

https://lkml.org/lkml/2013/3/7/332

And you could try to compare:
taskset 1 perf bench sched pipe
with
perf bench sched pipe

to confirm it ;-)
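
(Here 'taskset 1' runs the benchmark with CPU affinity mask 0x1, i.e.
pinned to CPU 0, so both tasks are forced to share one CPU; comparing
the pinned and unpinned runs shows how much the ping-pong pattern
gains from wake affinity.)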

Regards,
Michael Wang

> 
>>
>> And testing shows it could benefit hackbench by 15% at most.
>>
>> However, the whole thing is done somewhat blindly and is
>> time-consuming, so some workloads suffer.
>>
>> And testing shows it could damage pgbench by 50% at most.
>>
>> Thus, the wake-affine logic should be smarter and realise when
>> to stop its thankless effort.
>>
>> This patch introduces 'nr_wakee_switch', which is increased each
>> time a task switches its wakee.
>>
>> So a high 'nr_wakee_switch' means the task has more than one wakee;
>> the bigger the number, the higher the wakeup frequency.
>>
>> Now, when deciding whether to pull or not, pay attention to a wakee
>> with a high 'nr_wakee_switch': pulling such a task may benefit the
>> wakee, but it also implies that the waker will face fierce
>> competition later; it could be fierce or brief depending on the
>> story behind 'nr_wakee_switch', but either way the waker suffers.
>>
>> Furthermore, if the waker also has a high 'nr_wakee_switch', then
>> multiple tasks rely on it, and the waker's higher latency will
>> damage all of them; pulling the wakee then looks like a bad deal.
>>
>> Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' becomes
>> higher and higher, the deal looks worse and worse.
>>
>> The patch therefore makes wake-affine stop its work when:
>>
>> wakee->nr_wakee_switch > factor &&
>> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>>
>> The factor here is the node size of the current CPU, so a bigger
>> node will lead to more pulling, since the criterion for stopping
>> becomes harder to meet.
>>
>> After applying the patch, pgbench shows a 40% improvement at most.
>>
>> Test:
>> Tested on a 12-CPU x86 server with tip 3.10.0-rc7.
>>
>> pgbench               base       smart
>>
>> | db_size | clients |  tps  |   |  tps  |
>> +---------+---------+-------+   +-------+
>> | 22 MB   |       1 | 10598 |   | 10796 |
>> | 22 MB   |       2 | 21257 |   | 21336 |
>> | 22 MB   |       4 | 41386 |   | 41622 |
>> | 22 MB   |       8 | 51253 |   | 57932 |
>> | 22 MB   |      12 | 48570 |   | 54000 |
>> | 22 MB   |      16 | 46748 |   | 55982 | +19.75%
>> | 22 MB   |      24 | 44346 |   | 55847 | +25.93%
>> | 22 MB   |      32 | 43460 |   | 54614 | +25.66%
>> | 7484 MB |       1 |  8951 |   |  9193 |
>> | 7484 MB |       2 | 19233 |   | 19240 |
>> | 7484 MB |       4 | 37239 |   | 37302 |
>> | 7484 MB |       8 | 46087 |   | 50018 |
>> | 7484 MB |      12 | 42054 |   | 48763 |
>> | 7484 MB |      16 | 40765 |   | 51633 | +26.66%
>> | 7484 MB |      24 | 37651 |   | 52377 | +39.11%
>> | 7484 MB |      32 | 37056 |   | 51108 | +37.92%
>> | 15 GB   |       1 |  8845 |   |  9104 |
>> | 15 GB   |       2 | 19094 |   | 19162 |
>> | 15 GB   |       4 | 36979 |   | 36983 |
>> | 15 GB   |       8 | 46087 |   | 49977 |
>> | 15 GB   |      12 | 41901 |   | 48591 |
>> | 15 GB   |      16 | 40147 |   | 50651 | +26.16%
>> | 15 GB   |      24 | 37250 |   | 52365 | +40.58%
>> | 15 GB   |      32 | 36470 |   | 50015 | +37.14%
>>
>> CC: Ingo Molnar 
>> CC: Peter Zijlstra 
>> CC: Mike Galbraith 
>> Signed-off-by: Michael Wang 
>> ---
>>   include/linux/sched.h |    3 +++
>>   kernel/sched/fair.c   |   47 +++++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 50 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 178a8d9..1c996c7 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1041,6 +1041,9 @@ struct task_struct {
>>  #ifdef CONFIG_SMP
>>  	struct llist_node wake_entry;
>>  	int on_cpu;
>> +	struct task_struct *last_wakee;
>> +	unsigned long nr_wakee_switch;
>> +	unsigned long last_switch_decay;
>>  #endif
>>  	int on_rq;
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c61a614..a4ddbf5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2971,6 +2971,23 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>>  	return 0;
>>  }
>>  
>> +static void record_wakee(struct task_struct *p)
>> +{
>> +	/*
>> +	 * Rough decay (wiping) for cost saving; don't worry
>> +	 * about the boundary, a really active task won't mind
>> +	 * the loss.
>> +	 */
>> +	if (jiffies > current->last_switch_decay + HZ) {
>> +		current->nr_wakee_switch = 0;
>> +

Re: [PATCH v3 1/2] sched: smart wake-affine foundation

2013-07-06 Thread Sam Ben

On 07/04/2013 12:55 PM, Michael Wang wrote:

wake-affine always tries to pull the wakee close to the waker; in theory,
this will bring a benefit if the waker's CPU has cached hot data for the
wakee, or in the extreme ping-pong case.


What's the meaning of the ping-pong case?



And testing shows it could benefit hackbench by 15% at most.

However, the whole thing is done somewhat blindly and is
time-consuming, so some workloads suffer.

And testing shows it could damage pgbench by 50% at most.

Thus, the wake-affine logic should be smarter and realise when
to stop its thankless effort.

This patch introduces 'nr_wakee_switch', which is increased each
time a task switches its wakee.

So a high 'nr_wakee_switch' means the task has more than one wakee;
the bigger the number, the higher the wakeup frequency.

Now, when deciding whether to pull or not, pay attention to a wakee
with a high 'nr_wakee_switch': pulling such a task may benefit the
wakee, but it also implies that the waker will face fierce
competition later; it could be fierce or brief depending on the
story behind 'nr_wakee_switch', but either way the waker suffers.

Furthermore, if the waker also has a high 'nr_wakee_switch', then
multiple tasks rely on it, and the waker's higher latency will
damage all of them; pulling the wakee then looks like a bad deal.

Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' becomes
higher and higher, the deal looks worse and worse.

The patch therefore makes wake-affine stop its work when:

 wakee->nr_wakee_switch > factor &&
 waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

The factor here is the node size of the current CPU, so a bigger
node will lead to more pulling, since the criterion for stopping
becomes harder to meet.

After applying the patch, pgbench shows a 40% improvement at most.

Test:
Tested on a 12-CPU x86 server with tip 3.10.0-rc7.

pgbench               base       smart

| db_size | clients |  tps  |   |  tps  |
+---------+---------+-------+   +-------+
| 22 MB   |       1 | 10598 |   | 10796 |
| 22 MB   |       2 | 21257 |   | 21336 |
| 22 MB   |       4 | 41386 |   | 41622 |
| 22 MB   |       8 | 51253 |   | 57932 |
| 22 MB   |      12 | 48570 |   | 54000 |
| 22 MB   |      16 | 46748 |   | 55982 | +19.75%
| 22 MB   |      24 | 44346 |   | 55847 | +25.93%
| 22 MB   |      32 | 43460 |   | 54614 | +25.66%
| 7484 MB |       1 |  8951 |   |  9193 |
| 7484 MB |       2 | 19233 |   | 19240 |
| 7484 MB |       4 | 37239 |   | 37302 |
| 7484 MB |       8 | 46087 |   | 50018 |
| 7484 MB |      12 | 42054 |   | 48763 |
| 7484 MB |      16 | 40765 |   | 51633 | +26.66%
| 7484 MB |      24 | 37651 |   | 52377 | +39.11%
| 7484 MB |      32 | 37056 |   | 51108 | +37.92%
| 15 GB   |       1 |  8845 |   |  9104 |
| 15 GB   |       2 | 19094 |   | 19162 |
| 15 GB   |       4 | 36979 |   | 36983 |
| 15 GB   |       8 | 46087 |   | 49977 |
| 15 GB   |      12 | 41901 |   | 48591 |
| 15 GB   |      16 | 40147 |   | 50651 | +26.16%
| 15 GB   |      24 | 37250 |   | 52365 | +40.58%
| 15 GB   |      32 | 36470 |   | 50015 | +37.14%

CC: Ingo Molnar 
CC: Peter Zijlstra 
CC: Mike Galbraith 
Signed-off-by: Michael Wang 
---
  include/linux/sched.h |    3 +++
  kernel/sched/fair.c   |   47 +++++++++++++++++++++++++++++++++++++++++++++++
  2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..1c996c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1041,6 +1041,9 @@ struct task_struct {
  #ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+   struct task_struct *last_wakee;
+   unsigned long nr_wakee_switch;
+   unsigned long last_switch_decay;
  #endif
int on_rq;
  
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

index c61a614..a4ddbf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2971,6 +2971,23 @@ static unsigned long cpu_avg_load_per_task(int cpu)
return 0;
  }
  
+static void record_wakee(struct task_struct *p)
+{
+	/*
+	 * Rough decay (wiping) for cost saving; don't worry
+	 * about the boundary, a really active task won't mind
+	 * the loss.
+	 */
+	if (jiffies > current->last_switch_decay + HZ) {
+		current->nr_wakee_switch = 0;
+		current->last_switch_decay = jiffies;
+	}
+
+	if (current->last_wakee != p) {
+		current->last_wakee = p;
+		current->nr_wakee_switch++;
+	}
+}
  
  static void task_waking_fair(struct task_struct *p)
  {
@@ -2991,6 +3008,7 @@ static void task_waking_fair(struct task_struct *p)
  #endif
  
  	se->vruntime -= min_vruntime;
+	record_wakee(p);
  }
  
  #ifdef CONFIG_FAIR_GROUP_SCHED

@@ -3109,6 +3127,28 @@ static 

[PATCH v3 1/2] sched: smart wake-affine foundation

2013-07-03 Thread Michael Wang
wake-affine always tries to pull the wakee close to the waker; in theory,
this will bring a benefit if the waker's CPU has cached hot data for the
wakee, or in the extreme ping-pong case.

And testing shows it could benefit hackbench by 15% at most.

However, the whole thing is done somewhat blindly and is
time-consuming, so some workloads suffer.

And testing shows it could damage pgbench by 50% at most.

Thus, the wake-affine logic should be smarter and realise when
to stop its thankless effort.

This patch introduces 'nr_wakee_switch', which is increased each
time a task switches its wakee.

So a high 'nr_wakee_switch' means the task has more than one wakee;
the bigger the number, the higher the wakeup frequency.

Now, when deciding whether to pull or not, pay attention to a wakee
with a high 'nr_wakee_switch': pulling such a task may benefit the
wakee, but it also implies that the waker will face fierce
competition later; it could be fierce or brief depending on the
story behind 'nr_wakee_switch', but either way the waker suffers.

Furthermore, if the waker also has a high 'nr_wakee_switch', then
multiple tasks rely on it, and the waker's higher latency will
damage all of them; pulling the wakee then looks like a bad deal.

Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' becomes
higher and higher, the deal looks worse and worse.

The patch therefore makes wake-affine stop its work when:

 wakee->nr_wakee_switch > factor &&
 waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

The factor here is the node size of the current CPU, so a bigger
node will lead to more pulling, since the criterion for stopping
becomes harder to meet.

After applying the patch, pgbench shows a 40% improvement at most.

Test:
Tested on a 12-CPU x86 server with tip 3.10.0-rc7.

pgbench               base       smart

| db_size | clients |  tps  |   |  tps  |
+---------+---------+-------+   +-------+
| 22 MB   |       1 | 10598 |   | 10796 |
| 22 MB   |       2 | 21257 |   | 21336 |
| 22 MB   |       4 | 41386 |   | 41622 |
| 22 MB   |       8 | 51253 |   | 57932 |
| 22 MB   |      12 | 48570 |   | 54000 |
| 22 MB   |      16 | 46748 |   | 55982 | +19.75%
| 22 MB   |      24 | 44346 |   | 55847 | +25.93%
| 22 MB   |      32 | 43460 |   | 54614 | +25.66%
| 7484 MB |       1 |  8951 |   |  9193 |
| 7484 MB |       2 | 19233 |   | 19240 |
| 7484 MB |       4 | 37239 |   | 37302 |
| 7484 MB |       8 | 46087 |   | 50018 |
| 7484 MB |      12 | 42054 |   | 48763 |
| 7484 MB |      16 | 40765 |   | 51633 | +26.66%
| 7484 MB |      24 | 37651 |   | 52377 | +39.11%
| 7484 MB |      32 | 37056 |   | 51108 | +37.92%
| 15 GB   |       1 |  8845 |   |  9104 |
| 15 GB   |       2 | 19094 |   | 19162 |
| 15 GB   |       4 | 36979 |   | 36983 |
| 15 GB   |       8 | 46087 |   | 49977 |
| 15 GB   |      12 | 41901 |   | 48591 |
| 15 GB   |      16 | 40147 |   | 50651 | +26.16%
| 15 GB   |      24 | 37250 |   | 52365 | +40.58%
| 15 GB   |      32 | 36470 |   | 50015 | +37.14%

CC: Ingo Molnar 
CC: Peter Zijlstra 
CC: Mike Galbraith 
Signed-off-by: Michael Wang 
---
 include/linux/sched.h |    3 +++
 kernel/sched/fair.c   |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..1c996c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1041,6 +1041,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+   struct task_struct *last_wakee;
+   unsigned long nr_wakee_switch;
+   unsigned long last_switch_decay;
 #endif
int on_rq;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..a4ddbf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2971,6 +2971,23 @@ static unsigned long cpu_avg_load_per_task(int cpu)
return 0;
 }
 
+static void record_wakee(struct task_struct *p)
+{
+	/*
+	 * Rough decay (wiping) for cost saving; don't worry
+	 * about the boundary, a really active task won't mind
+	 * the loss.
+	 */
+	if (jiffies > current->last_switch_decay + HZ) {
+		current->nr_wakee_switch = 0;
+		current->last_switch_decay = jiffies;
+	}
+
+	if (current->last_wakee != p) {
+		current->last_wakee = p;
+		current->nr_wakee_switch++;
+	}
+}
 
 static void task_waking_fair(struct task_struct *p)
 {
@@ -2991,6 +3008,7 @@ static void task_waking_fair(struct task_struct *p)
 #endif
 
se->vruntime -= min_vruntime;
+   record_wakee(p);
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3109,6 +3127,28 @@ static inline unsigned long effective_load(struct 
task_group *tg, int cpu,
 
 #endif
 
+static int 
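
(The archive truncates the diff at this point. The stop criterion
described above would look roughly like the sketch below; the function
name wake_affine_throttled and the node-size lookup are illustrative
assumptions, not necessarily the exact body of the truncated hunk.)

static int wake_affine_throttled(struct task_struct *waker,
				 struct task_struct *wakee)
{
	/* factor: the number of CPUs in the current CPU's node */
	int factor = cpumask_weight(cpumask_of_node(cpu_to_node(smp_processor_id())));

	if (wakee->nr_wakee_switch > factor &&
	    waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch))
		return 1;	/* give up the pull: the deal looks bad */

	return 0;
}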
