Re: [patch 3/5] sched: multilevel sbe and sbf

2005-04-06 Thread Nick Piggin
Ingo Molnar wrote:
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
note that no matter how much scheduler logic, in the end 
cross-scheduling of tasks between nodes on NUMA will always have a 
permanent penalty (i.e. the 'migration cost' is 'infinity' in the long 
run), so the primary focus _hast to be_ on 'get it right initially' When 
tasks must spill over to other nodes will always remain a special case.  
So balance-on-fork/exec/[clone] definitely needs to be aware of the full 
domain tree picture.

Yes, well put. I imagine this will only become more important as
there becomes more push towards multiprocessing machines, and the
need for higher memory bandwidth and lower latency to CPUs.
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/5] sched: multilevel sbe and sbf

2005-04-06 Thread Nick Piggin
Ingo Molnar wrote:
Acked-by: Ingo Molnar [EMAIL PROTECTED]
note that no matter how much scheduler logic, in the end 
cross-scheduling of tasks between nodes on NUMA will always have a 
permanent penalty (i.e. the 'migration cost' is 'infinity' in the long 
run), so the primary focus _hast to be_ on 'get it right initially' When 
tasks must spill over to other nodes will always remain a special case.  
So balance-on-fork/exec/[clone] definitely needs to be aware of the full 
domain tree picture.

Yes, well put. I imagine this will only become more important as
there becomes more push towards multiprocessing machines, and the
need for higher memory bandwidth and lower latency to CPUs.
--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/5] sched: multilevel sbe and sbf

2005-04-05 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> 3/5

> The fundamental problem that Suresh has with balance on exec and fork
> is that it only tries to balance the top level domain with the flag
> set.
> 
> This was worked around by removing degenerate domains, but is still a
> problem if people want to start using more complex sched-domains, especially
> multilevel NUMA that ia64 is already using.
> 
> This patch makes balance on fork and exec try balancing over not just the
> top most domain with the flag set, but all the way down the domain tree.
> 
> Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

note that no matter how much scheduler logic, in the end 
cross-scheduling of tasks between nodes on NUMA will always have a 
permanent penalty (i.e. the 'migration cost' is 'infinity' in the long 
run), so the primary focus _hast to be_ on 'get it right initially' When 
tasks must spill over to other nodes will always remain a special case.  
So balance-on-fork/exec/[clone] definitely needs to be aware of the full 
domain tree picture.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 3/5] sched: multilevel sbe and sbf

2005-04-05 Thread Nick Piggin
3/5
The fundamental problem that Suresh has with balance on exec and fork
is that it only tries to balance the top level domain with the flag
set.

This was worked around by removing degenerate domains, but is still a
problem if people want to start using more complex sched-domains, especially
multilevel NUMA that ia64 is already using.

This patch makes balance on fork and exec try balancing over not just the
top most domain with the flag set, but all the way down the domain tree.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2005-04-05 16:38:53.0 +1000
+++ linux-2.6/kernel/sched.c2005-04-05 18:39:07.0 +1000
@@ -1320,21 +1320,24 @@ void fastcall wake_up_new_task(task_t * 
sd = tmp;
 
if (sd) {
+   cpumask_t span;
int new_cpu;
struct sched_group *group;
 
+again:
schedstat_inc(sd, sbf_cnt);
+   span = sd->span;
cpu = task_cpu(p);
group = find_idlest_group(sd, p, cpu);
if (!group) {
schedstat_inc(sd, sbf_balanced);
-   goto no_forkbalance;
+   goto nextlevel;
}
 
new_cpu = find_idlest_cpu(group, cpu);
if (new_cpu == -1 || new_cpu == cpu) {
schedstat_inc(sd, sbf_balanced);
-   goto no_forkbalance;
+   goto nextlevel;
}
 
if (cpu_isset(new_cpu, p->cpus_allowed)) {
@@ -1344,9 +1347,21 @@ void fastcall wake_up_new_task(task_t * 
rq = task_rq_lock(p, );
cpu = task_cpu(p);
}
+
+   /* Now try balancing at a lower domain level */
+nextlevel:
+   sd = NULL;
+   for_each_domain(cpu, tmp) {
+   if (cpus_subset(span, tmp->span))
+   break;
+   if (tmp->flags & SD_BALANCE_FORK)
+   sd = tmp;
+   }
+
+   if (sd)
+   goto again;
}
 
-no_forkbalance:
 #endif
/*
 * We decrease the sleep average of forking parents
@@ -1712,25 +1727,41 @@ void sched_exec(void)
sd = tmp;
 
if (sd) {
+   cpumask_t span;
struct sched_group *group;
+again:
schedstat_inc(sd, sbe_cnt);
+   span = sd->span;
group = find_idlest_group(sd, current, this_cpu);
if (!group) {
schedstat_inc(sd, sbe_balanced);
-   goto out;
+   goto nextlevel;
}
new_cpu = find_idlest_cpu(group, this_cpu);
if (new_cpu == -1 || new_cpu == this_cpu) {
schedstat_inc(sd, sbe_balanced);
-   goto out;
+   goto nextlevel;
}
 
schedstat_inc(sd, sbe_pushed);
put_cpu();
sched_migrate_task(current, new_cpu);
-   return;
+   
+   /* Now try balancing at a lower domain level */
+   this_cpu = get_cpu();
+nextlevel:
+   sd = NULL;
+   for_each_domain(this_cpu, tmp) {
+   if (cpus_subset(span, tmp->span))
+   break;
+   if (tmp->flags & SD_BALANCE_EXEC)
+   sd = tmp;
+   }
+
+   if (sd)
+   goto again;
}
-out:
+
put_cpu();
 }
 


[patch 3/5] sched: multilevel sbe and sbf

2005-04-05 Thread Nick Piggin
3/5
The fundamental problem that Suresh has with balance on exec and fork
is that it only tries to balance the top level domain with the flag
set.

This was worked around by removing degenerate domains, but is still a
problem if people want to start using more complex sched-domains, especially
multilevel NUMA that ia64 is already using.

This patch makes balance on fork and exec try balancing over not just the
top most domain with the flag set, but all the way down the domain tree.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2005-04-05 16:38:53.0 +1000
+++ linux-2.6/kernel/sched.c2005-04-05 18:39:07.0 +1000
@@ -1320,21 +1320,24 @@ void fastcall wake_up_new_task(task_t * 
sd = tmp;
 
if (sd) {
+   cpumask_t span;
int new_cpu;
struct sched_group *group;
 
+again:
schedstat_inc(sd, sbf_cnt);
+   span = sd-span;
cpu = task_cpu(p);
group = find_idlest_group(sd, p, cpu);
if (!group) {
schedstat_inc(sd, sbf_balanced);
-   goto no_forkbalance;
+   goto nextlevel;
}
 
new_cpu = find_idlest_cpu(group, cpu);
if (new_cpu == -1 || new_cpu == cpu) {
schedstat_inc(sd, sbf_balanced);
-   goto no_forkbalance;
+   goto nextlevel;
}
 
if (cpu_isset(new_cpu, p-cpus_allowed)) {
@@ -1344,9 +1347,21 @@ void fastcall wake_up_new_task(task_t * 
rq = task_rq_lock(p, flags);
cpu = task_cpu(p);
}
+
+   /* Now try balancing at a lower domain level */
+nextlevel:
+   sd = NULL;
+   for_each_domain(cpu, tmp) {
+   if (cpus_subset(span, tmp-span))
+   break;
+   if (tmp-flags  SD_BALANCE_FORK)
+   sd = tmp;
+   }
+
+   if (sd)
+   goto again;
}
 
-no_forkbalance:
 #endif
/*
 * We decrease the sleep average of forking parents
@@ -1712,25 +1727,41 @@ void sched_exec(void)
sd = tmp;
 
if (sd) {
+   cpumask_t span;
struct sched_group *group;
+again:
schedstat_inc(sd, sbe_cnt);
+   span = sd-span;
group = find_idlest_group(sd, current, this_cpu);
if (!group) {
schedstat_inc(sd, sbe_balanced);
-   goto out;
+   goto nextlevel;
}
new_cpu = find_idlest_cpu(group, this_cpu);
if (new_cpu == -1 || new_cpu == this_cpu) {
schedstat_inc(sd, sbe_balanced);
-   goto out;
+   goto nextlevel;
}
 
schedstat_inc(sd, sbe_pushed);
put_cpu();
sched_migrate_task(current, new_cpu);
-   return;
+   
+   /* Now try balancing at a lower domain level */
+   this_cpu = get_cpu();
+nextlevel:
+   sd = NULL;
+   for_each_domain(this_cpu, tmp) {
+   if (cpus_subset(span, tmp-span))
+   break;
+   if (tmp-flags  SD_BALANCE_EXEC)
+   sd = tmp;
+   }
+
+   if (sd)
+   goto again;
}
-out:
+
put_cpu();
 }
 


Re: [patch 3/5] sched: multilevel sbe and sbf

2005-04-05 Thread Ingo Molnar

* Nick Piggin [EMAIL PROTECTED] wrote:

 3/5

 The fundamental problem that Suresh has with balance on exec and fork
 is that it only tries to balance the top level domain with the flag
 set.
 
 This was worked around by removing degenerate domains, but is still a
 problem if people want to start using more complex sched-domains, especially
 multilevel NUMA that ia64 is already using.
 
 This patch makes balance on fork and exec try balancing over not just the
 top most domain with the flag set, but all the way down the domain tree.
 
 Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Acked-by: Ingo Molnar [EMAIL PROTECTED]

note that no matter how much scheduler logic, in the end 
cross-scheduling of tasks between nodes on NUMA will always have a 
permanent penalty (i.e. the 'migration cost' is 'infinity' in the long 
run), so the primary focus _hast to be_ on 'get it right initially' When 
tasks must spill over to other nodes will always remain a special case.  
So balance-on-fork/exec/[clone] definitely needs to be aware of the full 
domain tree picture.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/