Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake

2013-02-15 Thread Peter Zijlstra
On Thu, 2013-02-14 at 13:42 +0530, Preeti U Murthy wrote:
> Hi Peter, Alex,
> If the eligible cpus happen to be all the cpus, then iterating over
> all the cpus for the idlest one would be much worse than iterating
> over sched domains, right?

Depends, doing a domain walk generally gets you 2n cpus visited --
geometric series and such. A simple scan of the top-most domain mask
that's eligible will limit that to n. 
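
(Spelling out the arithmetic, assuming a branching factor of roughly two
per level: the walk scans all n cpus at the top level, then the chosen
group's n/2, then n/4, and so on down the tree, i.e.

	n + n/2 + n/4 + ... + 1 = n * (1 + 1/2 + 1/4 + ...) ~= 2n

by the geometric series, whereas a single flat scan of the eligible
top-level mask touches each of the n cpus exactly once.)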

> I am also wondering how important it is to bias the balancing of a
> forked/woken-up task onto the idlest cpu at every iteration.

Yeah, I don't know; it seems overkill to me. That code is from before
my time, and so far it has survived.

> If biasing towards the idlest_cpu at every iteration is not really
> the criterion, then we could cut down on the iterations in
> fork/exec/wake balancing. The problem then boils down to the choice
> between biasing our search towards the idlest_cpu or the idlest_group.
> If we are not really concerned about balancing load across groups,
> but about ensuring we find the idlest cpu to run the task on, then
> Alex's patch seems to have covered the criteria.
> 
> However, if the concern is to distribute the load uniformly across
> groups, then I have the following patch, which might reduce the
> overhead of the search for an eligible cpu for a forked/exec'd/woken-up
> task.

Nah, so I think the whole bias thing was mostly done to avoid
over-balancing, and possibly to compensate for some approximations in
the whole weight/load measurement stuff.


Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake

2013-02-14 Thread Alex Shi
On 02/14/2013 04:12 PM, Preeti U Murthy wrote:
> **START PATCH*
> 
> sched:Improve balancing for fork/exec/wake tasks

Can you test the patch with hackbench, and aim7 with 2000 threads, plus
the patch in the tip:core/locking tree:

  3a15e0e0cdda rwsem: Implement writer lock-stealing for better scalability

If it works, you may find some improvement.

-- 
Thanks
Alex


Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake

2013-02-14 Thread Preeti U Murthy
Hi everyone,

On 02/12/2013 03:52 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> Searching for the cpu from bottom to top in the domain tree comes
>> from commit 3dbd5342074a1e (sched: multilevel sbe sbf); the purpose
>> is balancing tasks over all domain levels.
>>
>> This balancing costs too much if there are many domains/groups in a
>> large system.
>>
>> If we remove this code, we get quick fork/exec/wake with a similar
>> balancing result across the whole system.
>>
>> This patch improves hackbench performance by 10+% on my 4-socket SNB
>> machines and by about 3% on 2-socket servers.
>>
>>
> Numbers be groovy.. still I'd like a little more on the behavioural
> change. Expand on what exactly is lost by this change so that if we
> later find a regression we have a better idea of what and how.
> 
> For instance, note how find_idlest_group() isn't symmetric wrt
> local_group. So by not doing the domain iteration we change things.
> 
> Now, it might well be that all this is somewhat overkill as it is, but
> should we then not replace all of it with a simple min search over all
> eligible cpus; that would be a real clean up.

Hi Peter, Alex,
If the eligible cpus happen to be all the cpus, then iterating over all
the cpus for the idlest one would be much worse than iterating over
sched domains, right?
I am also wondering how important it is to bias the balancing of a
forked/woken-up task onto the idlest cpu at every iteration.

If biasing towards the idlest_cpu at every iteration is not really the
criterion, then we could cut down on the iterations in fork/exec/wake
balancing. The problem then boils down to the choice between biasing
our search towards the idlest_cpu or the idlest_group. If we are not
really concerned about balancing load across groups, but about ensuring
we find the idlest cpu to run the task on, then Alex's patch seems to
have covered the criteria.

However, if the concern is to distribute the load uniformly across
groups, then I have the following patch, which might reduce the
overhead of the search for an eligible cpu for a forked/exec'd/woken-up
task.

Alex, if your patch does not show favourable behavioural changes, you
could try the one below and check the same.

**START PATCH*

sched: Improve balancing for fork/exec/wake tasks

As I see it, currently we first find the idlest group, then the idlest
cpu within it. However, the current code does not seem convinced that
the cpu selected in the current iteration is in the correct child
group, and it iterates over all the groups yet again,
*taking the selected cpu as the reference to point to the next level
sched domain.*

Why then find the idlest cpu at every iteration, if the concern at each
iteration is primarily the idlest group? Why not find the idlest group
in every iteration and the idlest cpu only in the final iteration? As a
result:

1. We save the time spent going over all the cpus of a sched group in
find_idlest_cpu() at every iteration.
2. Functionality remains the same. We find the idlest group at every
level, and take a cpu of the idlest group as the reference to get the
next lower level sched domain, so as to find the idlest group there.
However, instead of taking the idlest cpu as the reference, we take the
first cpu as the reference. The resulting next level sched domain
remains the same.

*However, this completely removes the bias towards the idlest cpu at
every level.*

This patch therefore tries to bias its search towards the right group
to put the task on, and not the idlest cpu.
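
To make the intended flow concrete, here is a rough sketch of the
select path (illustrative only, not the patch below; it assumes the
helper names of kernel/sched/fair.c from that era, with
find_idlest_group() taking the reduced argument list introduced here):

	struct sched_group *group = NULL;
	struct sched_domain *tmp;
	int new_cpu = cpu;

	while (sd) {
		int weight = sd->span_weight;

		group = find_idlest_group(sd, p, cpu);
		if (!group)
			break;

		/*
		 * First cpu of the idlest group, used only as a reference
		 * point to locate the next lower level sched domain.
		 */
		cpu = cpumask_first(sched_group_cpus(group));

		sd = NULL;
		for_each_domain(cpu, tmp) {
			if (weight <= tmp->span_weight)
				break;
			sd = tmp;
		}
	}
	if (group)
		/* the idlest cpu is looked up once, in the final group */
		new_cpu = find_idlest_cpu(group, p, cpu);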
---
 kernel/sched/fair.c |   53 ++-
 1 file changed, 15 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8691b0d..90855dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3190,16 +3190,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int load_idx)
+ int this_cpu)
 {
struct sched_group *idlest = NULL, *group = sd->groups;
-   struct sched_group *this_group = NULL;
-   u64 min_load = ~0ULL, this_load = 0;
-   int imbalance = 100 + (sd->imbalance_pct-100)/2;
+   struct sched_group;
+   u64 min_load = ~0ULL;
 
do {
u64 load, avg_load;
-   int local_group;
int i;
 
/* Skip over this group if it has no CPUs allowed */
@@ -3207,18 +3205,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
tsk_cpus_allowed(p)))
continue;
 
-   local_group = cpumask_test_cpu(this_cpu,
-  sched_group_cpus(group));
-
		/* Tally up the load of all CPUs in the group */
		avg_load = 0;

Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake

2013-02-13 Thread Alex Shi
On 02/12/2013 06:22 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> Searching for the cpu from bottom to top in the domain tree comes
>> from commit 3dbd5342074a1e (sched: multilevel sbe sbf); the purpose
>> is balancing tasks over all domain levels.
>>
>> This balancing costs too much if there are many domains/groups in a
>> large system.
>>
>> If we remove this code, we get quick fork/exec/wake with a similar
>> balancing result across the whole system.
>>
>> This patch improves hackbench performance by 10+% on my 4-socket SNB
>> machines and by about 3% on 2-socket servers.
>>
>>
> Numbers be groovy.. still I'd like a little more on the behavioural
> change. Expand on what exactly is lost by this change so that if we
> later find a regression we have a better idea of what and how.
> 
> For instance, note how find_idlest_group() isn't symmetric wrt
> local_group. So by not doing the domain iteration we change things.
> 
> Now, it might well be that all this is somewhat overkill as it is, but
> should we then not replace all of it with a simple min search over all
> eligible cpus; that would be a real clean up.
>  

Um, will think about this again..
> 


-- 
Thanks
Alex


Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake

2013-02-12 Thread Peter Zijlstra
On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> Searching for the cpu from bottom to top in the domain tree comes
> from commit 3dbd5342074a1e (sched: multilevel sbe sbf); the purpose
> is balancing tasks over all domain levels.
> 
> This balancing costs too much if there are many domains/groups in a
> large system.
> 
> If we remove this code, we get quick fork/exec/wake with a similar
> balancing result across the whole system.
> 
> This patch improves hackbench performance by 10+% on my 4-socket SNB
> machines and by about 3% on 2-socket servers.
> 
> 
Numbers be groovy.. still I'd like a little more on the behavioural
change. Expand on what exactly is lost by this change so that if we
later find a regression we have a better idea of what and how.

For instance, note how find_idlest_group() isn't symmetric wrt
local_group. So by not doing the domain iteration we change things.

Now, it might well be that all this is somewhat overkill as it is, but
should we then not replace all of it with a simple min search over all
eligible cpus; that would be a real clean up.
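
To see the asymmetry concretely: the tail of find_idlest_group() in
mainline at the time went roughly like this (trimmed and paraphrased
from the v3.8 source, so take it as a sketch rather than a verbatim
quote):

	do {
		/* ... tally avg_load for this group ... */
		if (local_group) {
			this_load = avg_load;
		} else if (avg_load < min_load) {
			min_load = avg_load;
			idlest = group;
		}
	} while (group = group->next, group != sd->groups);

	/* prefer the local group unless a remote one is clearly idler */
	if (!idlest || 100*this_load < imbalance*min_load)
		return NULL;
	return idlest;

That is, the local group is never returned as idlest; returning NULL
means "stay local", and a remote group only wins by beating the local
load by the imbalance_pct margin. A flat min-search over all eligible
cpus would drop exactly that bias.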
 



[patch v4 05/18] sched: quicker balancing on fork/exec/wake

2013-01-23 Thread Alex Shi
Searching for the cpu from bottom to top in the domain tree comes from
commit 3dbd5342074a1e (sched: multilevel sbe sbf); the purpose is
balancing tasks over all domain levels.

This balancing costs too much if there are many domains/groups in a
large system.

If we remove this code, we get quick fork/exec/wake with a similar
balancing result across the whole system.

This patch improves hackbench performance by 10+% on my 4-socket SNB
machines and by about 3% on 2-socket servers.

Signed-off-by: Alex Shi <alex@intel.com>
---
 kernel/sched/fair.c | 20 +---
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecfbf8e..895a3f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
goto unlock;
}
 
-   while (sd) {
+   if (sd) {
int load_idx = sd->forkexec_idx;
struct sched_group *group;
-   int weight;
-
-   if (!(sd->flags & sd_flag)) {
-   sd = sd->child;
-   continue;
-   }
 
if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;
@@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
goto unlock;
 
new_cpu = find_idlest_cpu(group, p, cpu);
-
-   /* Now try balancing at a lower domain level of new_cpu */
-   cpu = new_cpu;
-   weight = sd->span_weight;
-   sd = NULL;
-   for_each_domain(cpu, tmp) {
-   if (weight <= tmp->span_weight)
-   break;
-   if (tmp->flags & sd_flag)
-   sd = tmp;
-   }
-   /* while loop will break here if sd == NULL */
}
 unlock:
rcu_read_unlock();
-- 
1.7.12
