Re: [patch 1/5] sched: remove degenerate domains

2005-04-07 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> [...] Although I'd imagine it may be something distros may want. For 
> example, a generic x86-64 kernel for both AMD and Intel systems could 
> easily have SMT and NUMA turned on.

yes, that's true - in fact reducing the number of separate kernel 
packages is of utmost importance to all distributions. (I'm not sure we 
are there yet with CONFIG_NUMA, but small steps won't hurt.)

> I agree with the downside of exercising less code paths though.

if we make CONFIG_NUMA good enough on small boxes so that distributors 
can turn it on then in the long run the loss could be offset by the win 
the extra QA gives.

> >is there any case where we'd want to simplify the domain tree? One more 
> >domain level is just one (and very minor) aspect of CONFIG_NUMA - i'd 
> >not want to run a CONFIG_NUMA kernel on a non-NUMA box, even if the 
> >domain tree got optimized. Hm?
> 
> I guess there is the SMT issue too, and even booting an SMP kernel on 
> a UP system. Also small ia64 NUMA systems will probably have one 
> redundant NUMA level.

i think most factors of not running an SMP kernel on a UP box are not 
due to scheduler overhead: the biggest cost is spinlock overhead. Someone 
should try a little prototype: use the 'alternate instructions' 
framework to patch out calls to spinlock functions to NOPs, and 
benchmark the resulting kernel against UP. If it's "good enough", 
distros will use it. Having just a single binary kernel RPM that 
supports everything from NUMA through SMP to UP is the holy grail of 
distros. (especially the ones that offer commercial support and 
services.)

this is probably not possible on x86 - e.g. it would probably be 
expensive (in terms of runtime cost) to make the PAE/non-PAE decision 
at runtime (the distro boot kernel needs to be non-PAE). But for newer 
arches like x64 it should be easier.

> If/when topologies get more complex (for example, the recent Altix 
> discussions we had with Paul), it will be generally easier to set up 
> all levels in a generic way, then weed them out using something like 
> this, rather than put the logic in the domain setup code.

ok. That should also make it easier to put more of the arch domain setup 
code into sched.c. E.g. i'm still uneasy about it having so much 
scheduler code in arch/ia64/kernel/domain.c, and all the ripple effects. 
(the #ifdefs, include file impact, etc.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/5] sched: remove degenerate domains

2005-04-06 Thread Nick Piggin

Ingo Molnar wrote:
> * Siddha, Suresh B <[EMAIL PROTECTED]> wrote:
>
> > Similarly I am working on adding a new core domain for dual-core
> > systems! All these domains are unnecessary and cause performance
> > issues on non Multi-threading/Multi-core capable cpus! Agreed that
> > performance impact will be minor but still...
>
> ok, let's keep it then. It may in fact simplify the domain setup code: we
> could generate the 'most generic' layout for a given arch all the time,
> and then optimize it automatically. I.e. in theory we could have just a
> single domain-setup routine, which would e.g. generate the NUMA domains
> on SMP too, which would then be optimized away.

Yep, exactly. Even so, Andrew: please ignore this patch series
and I'll redo it for you when we all agree on everything.
Thanks.
--
SUSE Labs, Novell Inc.


Re: [patch 1/5] sched: remove degenerate domains

2005-04-06 Thread Nick Piggin

Ingo Molnar wrote:
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > This is Suresh's patch with some modifications.
>
> > Remove degenerate scheduler domains during the sched-domain init.
>
> actually, i'd suggest to not do this patch. The point of booting with a
> CONFIG_NUMA kernel on a non-NUMA box is mostly for testing, and the
> 'degenerate' toplevel domain exposed conceptual bugs in the
> sched-domains code. In that sense removing such 'unnecessary' domains
> inhibits debuggability to a certain degree. If we had this patch earlier
> we'd not have experienced the wrong decisions taken by the scheduler,
> only on the much rarer 'really NUMA' boxes.

True. Although I'd imagine it may be something distros may want.
For example, a generic x86-64 kernel for both AMD and Intel systems
could easily have SMT and NUMA turned on.

I agree with the downside of exercising less code paths though.
What about putting it in as a (default to off for 2.6) config option in
the config embedded menu?

> is there any case where we'd want to simplify the domain tree? One more
> domain level is just one (and very minor) aspect of CONFIG_NUMA - i'd
> not want to run a CONFIG_NUMA kernel on a non-NUMA box, even if the
> domain tree got optimized. Hm?

I guess there is the SMT issue too, and even booting an SMP kernel
on a UP system. Also small ia64 NUMA systems will probably have one
redundant NUMA level.

If/when topologies get more complex (for example, the recent Altix
discussions we had with Paul), it will be generally easier to set
up all levels in a generic way, then weed them out using something
like this, rather than put the logic in the domain setup code.

Nick
--
SUSE Labs, Novell Inc.


Re: [patch 1/5] sched: remove degenerate domains

2005-04-06 Thread Ingo Molnar

* Siddha, Suresh B <[EMAIL PROTECTED]> wrote:

> Similarly I am working on adding a new core domain for dual-core 
> systems! All these domains are unnecessary and cause performance 
> issues on non Multi-threading/Multi-core capable cpus! Agreed that 
> performance impact will be minor but still...

ok, let's keep it then. It may in fact simplify the domain setup code: we 
could generate the 'most generic' layout for a given arch all the time, 
and then optimize it automatically. I.e. in theory we could have just a 
single domain-setup routine, which would e.g. generate the NUMA domains 
on SMP too, which would then be optimized away.

Ingo


Re: [patch 1/5] sched: remove degenerate domains

2005-04-06 Thread Siddha, Suresh B

On Wed, Apr 06, 2005 at 07:44:12AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > This is Suresh's patch with some modifications.
> 
> > Remove degenerate scheduler domains during the sched-domain init.
> 
> actually, i'd suggest to not do this patch. The point of booting with a 
> CONFIG_NUMA kernel on a non-NUMA box is mostly for testing, and the 

Not really. All of the x86_64 kernels are NUMA enabled and most Intel x86_64
systems today are non NUMA.

> 'degenerate' toplevel domain exposed conceptual bugs in the 
> sched-domains code. In that sense removing such 'unnecessary' domains 
> inhibits debuggability to a certain degree. If we had this patch earlier 
> we'd not have experienced the wrong decisions taken by the scheduler, 
> only on the much rarer 'really NUMA' boxes.
> 
> is there any case where we'd want to simplify the domain tree? One more 
> domain level is just one (and very minor) aspect of CONFIG_NUMA - i'd 
> not want to run a CONFIG_NUMA kernel on a non-NUMA box, even if the 
> domain tree got optimized. Hm?
> 

Ingo, pardon me! Actually I used the NUMA domain as an excuse to push the
domain degenerate patch. As I mentioned earlier, we should remove the SMT
domain on a non-HT capable system.

Similarly I am working on adding a new core domain for dual-core systems!
All these domains are unnecessary and cause performance issues on
non Multi-threading/Multi-core capable cpus! Agreed that performance
impact will be minor but still...

thanks,
suresh


Re: [patch 1/5] sched: remove degenerate domains

2005-04-05 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> This is Suresh's patch with some modifications.

> Remove degenerate scheduler domains during the sched-domain init.

actually, i'd suggest to not do this patch. The point of booting with a 
CONFIG_NUMA kernel on a non-NUMA box is mostly for testing, and the 
'degenerate' toplevel domain exposed conceptual bugs in the 
sched-domains code. In that sense removing such 'unnecessary' domains 
inhibits debuggability to a certain degree. If we had this patch earlier 
we'd not have experienced the wrong decisions taken by the scheduler, 
only on the much rarer 'really NUMA' boxes.

is there any case where we'd want to simplify the domain tree? One more 
domain level is just one (and very minor) aspect of CONFIG_NUMA - i'd 
not want to run a CONFIG_NUMA kernel on a non-NUMA box, even if the 
domain tree got optimized. Hm?

Ingo


[patch 1/5] sched: remove degenerate domains

2005-04-05 Thread Nick Piggin

This is Suresh's patch with some modifications.

--
SUSE Labs, Novell Inc.

Remove degenerate scheduler domains during the sched-domain init.

For example on x86_64, we always have NUMA configured in. On Intel EM64T
systems, the topmost sched domain will be the NUMA domain, with only one
sched_group in it.

With the fork/exec balancing (Nick's recent fixes in the -mm tree), we always
end up taking wrong decisions because of this topmost domain (as it contains
only one group and find_idlest_group always returns NULL). We will end up
loading the HT package completely first, letting active load balancing kick in
and correct it.

In general, this patch also makes sense without Nick's recent fixes
in -mm.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>

Modified to account for more than just sched_groups when scanning for
degenerate domains by Nick Piggin. Allow a runqueue's sd to go NULL, which
required small changes to the smtnice code.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>


Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c	2005-04-05 16:38:21.0 +1000
+++ linux-2.6/kernel/sched.c	2005-04-05 18:39:09.0 +1000
@@ -2583,11 +2583,15 @@ out:
 #ifdef CONFIG_SCHED_SMT
 static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq)
 {
-	struct sched_domain *sd = this_rq->sd;
+	struct sched_domain *tmp, *sd = NULL;
 	cpumask_t sibling_map;
 	int i;
+
+	for_each_domain(this_cpu, tmp)
+		if (tmp->flags & SD_SHARE_CPUPOWER)
+			sd = tmp;
 
-	if (!(sd->flags & SD_SHARE_CPUPOWER))
+	if (!sd)
 		return;
 
 	/*
@@ -2628,13 +2632,17 @@ static inline void wake_sleeping_depende
 
 static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq)
 {
-	struct sched_domain *sd = this_rq->sd;
+	struct sched_domain *tmp, *sd = NULL;
 	cpumask_t sibling_map;
 	prio_array_t *array;
 	int ret = 0, i;
 	task_t *p;
 
-	if (!(sd->flags & SD_SHARE_CPUPOWER))
+	for_each_domain(this_cpu, tmp)
+		if (tmp->flags & SD_SHARE_CPUPOWER)
+			sd = tmp;
+
+	if (!sd)
 		return 0;
 
 	/*
@@ -4604,6 +4612,11 @@ static void sched_domain_debug(struct sc
 {
 	int level = 0;
 
+	if (!sd) {
+		printk(KERN_DEBUG "CPU%d attaching NULL sched-domain.\n", cpu);
+		return;
+	}
+
 	printk(KERN_DEBUG "CPU%d attaching sched-domain:\n", cpu);
 
 	do {
@@ -4809,6 +4822,50 @@ static void init_sched_domain_sysctl(voi
 }
 #endif
 
+static int __devinit sd_degenerate(struct sched_domain *sd)
+{
+	if (cpus_weight(sd->span) == 1)
+		return 1;
+
+	/* Following flags need at least 2 groups */
+	if (sd->flags & (SD_LOAD_BALANCE |
+			 SD_BALANCE_NEWIDLE |
+			 SD_BALANCE_FORK |
+			 SD_BALANCE_EXEC)) {
+		if (sd->groups != sd->groups->next)
+			return 0;
+	}
+
+	/* Following flags don't use groups */
+	if (sd->flags & (SD_WAKE_IDLE |
+			 SD_WAKE_AFFINE |
+			 SD_WAKE_BALANCE))
+		return 0;
+
+	return 1;
+}
+
+static int __devinit sd_parent_degenerate(struct sched_domain *sd,
+						struct sched_domain *parent)
+{
+	unsigned long cflags = sd->flags, pflags = parent->flags;
+
+	if (sd_degenerate(parent))
+		return 1;
+
+	if (!cpus_equal(sd->span, parent->span))
+		return 0;
+
+	/* Does parent contain flags not in child? */
+	/* WAKE_BALANCE is a subset of WAKE_AFFINE */
+	if (cflags & SD_WAKE_AFFINE)
+		pflags &= ~SD_WAKE_BALANCE;
+	if ((~sd->flags) & parent->flags)
+		return 0;
+
+	return 1;
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain.  Callers must
  * hold the hotplug lock.
@@ -4819,6 +4876,19 @@ void __devinit cpu_attach_domain(struct 
 	unsigned long flags;
 	runqueue_t *rq = cpu_rq(cpu);
 	int local = 1;
+	struct sched_domain *tmp;
+
+	/* Remove the sched domains which do not contribute to scheduling. */
+	for (tmp = sd; tmp; tmp = tmp->parent) {
+		struct sched_domain *parent = tmp->parent;
+		if (!parent)
+			break;
+		if (sd_parent_degenerate(tmp, parent))
+			tmp->parent = parent->parent;
+	}
+
+	if (sd_degenerate(sd))
+		sd = sd->parent;
 
 	sched_domain_debug(sd, cpu);
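
For readers following the thread, the pruning idea in the patch above ("set up all levels in a generic way, then weed them out") can be modeled in a few lines of userspace C. This is only a rough sketch under stated assumptions, not the kernel code: struct domain, span_weight(), domain_degenerate(), parent_degenerate(), prune_domains() and domain_depth() are hypothetical names, a plain bitmask stands in for cpumask_t, a group count stands in for the circular sd->groups list, and the flag values are made up. Two deliberate differences from the posted hunk: the final flag comparison uses the adjusted cflags/pflags copies (the posted hunk computes them but then tests sd->flags and parent->flags directly), and the loop re-checks the new parent after unlinking one, where the posted loop advances past it.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits; the values are not the kernel's. */
enum {
	SD_LOAD_BALANCE = 1 << 0,
	SD_BALANCE_EXEC = 1 << 1,
	SD_WAKE_AFFINE  = 1 << 2,
	SD_WAKE_BALANCE = 1 << 3,
};

/* Hypothetical simplified domain: a CPU bitmask and a group count
 * stand in for cpumask_t and the circular sd->groups list. */
struct domain {
	struct domain *parent;
	unsigned long span;
	int nr_groups;
	unsigned long flags;
};

/* Count the CPUs set in a span bitmask. */
int span_weight(unsigned long mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)	/* clear the lowest set bit */
		n++;
	return n;
}

/* A domain does no useful work if it spans a single CPU, or if none of
 * its flags can take effect with the groups it has. */
int domain_degenerate(const struct domain *sd)
{
	if (span_weight(sd->span) == 1)
		return 1;
	/* Balancing flags need at least two groups. */
	if ((sd->flags & (SD_LOAD_BALANCE | SD_BALANCE_EXEC)) &&
	    sd->nr_groups > 1)
		return 0;
	/* Wake flags do not use groups at all. */
	if (sd->flags & (SD_WAKE_AFFINE | SD_WAKE_BALANCE))
		return 0;
	return 1;
}

/* A parent is redundant if it is degenerate itself, or if it covers the
 * same CPUs as the child and adds no flag the child lacks. */
int parent_degenerate(const struct domain *sd, const struct domain *parent)
{
	unsigned long cflags, pflags;

	assert(sd != NULL && parent != NULL);
	cflags = sd->flags;
	pflags = parent->flags;

	if (domain_degenerate(parent))
		return 1;
	if (sd->span != parent->span)
		return 0;
	/* WAKE_BALANCE is treated as a subset of WAKE_AFFINE. */
	if (cflags & SD_WAKE_AFFINE)
		pflags &= ~SD_WAKE_BALANCE;
	return (~cflags & pflags) == 0;
}

/* Unlink redundant parents, then drop a degenerate base domain.
 * Returns NULL when no useful level is left (e.g. a one-CPU chain). */
struct domain *prune_domains(struct domain *sd)
{
	struct domain *tmp = sd;

	while (tmp && tmp->parent) {
		if (parent_degenerate(tmp, tmp->parent))
			tmp->parent = tmp->parent->parent;
		else
			tmp = tmp->parent;
	}
	if (sd && domain_degenerate(sd))
		sd = sd->parent;
	return sd;
}

/* Number of levels reachable from sd; 0 for NULL. */
int domain_depth(const struct domain *sd)
{
	int n = 0;

	for (; sd; sd = sd->parent)
		n++;
	return n;
}
```

With a two-CPU box that has neither HT nor NUMA modeled as an SMT -> SMP -> NUMA chain, prune_domains() hands back the SMP level with no parent. For a one-CPU chain it returns NULL, which corresponds to the case the patch accommodates by allowing a runqueue's sd to be NULL.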
 

