Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
Hi Peter~ I will make a v2 patch-set following your comments and resend it soon. :) Thanks. :) On 09/25/2012 07:50 PM, Peter Zijlstra wrote: On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote: Let's have an example here. sched_init_numa() { ... // A loop sets sched_domains_numa_levels to level. <-- (1) // I set sched_domains_numa_levels to 0. sched_domains_numa_levels = 0; <-- (2) Ah, ok, so this 'idiot in a hurry' (aka. me) placed the hunk wrong and thought it was at the start of the function. // A loop allocating memory for sched_domains_numa_masks[][] for (i = 0; i < level; i++) { .. // Allocate memory for sched_domains_numa_masks[i] <-- (3) .. } .. // I reset sched_domains_numa_levels to level. sched_domains_numa_levels = level; <-- (4) } Yes this makes sense, it might be best to have this as a separate patch. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote: > Let's have an example here. > > sched_init_numa() > { > ... > // A loop sets sched_domains_numa_levels to level. <-- (1) > > // I set sched_domains_numa_levels to 0. > sched_domains_numa_levels = 0; <-- (2) Ah, ok, so this 'idiot in a hurry' (aka. me) placed the hunk wrong and thought it was at the start of the function. > // A loop allocating memory for sched_domains_numa_masks[][] > for (i = 0; i < level; i++) { > .. > // Allocate memory for sched_domains_numa_masks[i] <-- (3) > .. > } > .. > > // I reset sched_domains_numa_levels to level. > sched_domains_numa_levels = level; <-- (4) > } Yes this makes sense, it might be best to have this as a separate patch. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
Hi Peter~ Sorry about the confusing log, and thanks for your patience. :) Here, I want to say something more about sched_domains_numa_levels to make myself clearer. :) Let's have an example here. sched_init_numa() { ... // A loop sets sched_domains_numa_levels to level. <-- (1) // I set sched_domains_numa_levels to 0. sched_domains_numa_levels = 0; <-- (2) // A loop allocating memory for sched_domains_numa_masks[][] for (i = 0; i < level; i++) { .. // Allocate memory for sched_domains_numa_masks[i] <-- (3) .. } .. // I reset sched_domains_numa_levels to level. sched_domains_numa_levels = level; <-- (4) } // A new function I added. static void sched_domains_numa_masks_clear(int cpu) { int i, j; for (i = 0; i < sched_domains_numa_levels; i++) <-- (5) for (j = 0; j < nr_node_ids; j++) cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]); } Suppose level is 10, and at step 1, sched_domains_numa_levels is 10. If I didn't set sched_domains_numa_levels to 0 at step 2, it would be 10 all the time. If memory allocation at step 3 fails when i = 5, the array sched_domains_numa_masks[][] will only have 5 members, but sched_domains_numa_levels is still 10. As you see, I added 2 functions that use sched_domains_numa_levels to iterate over sched_domains_numa_masks[][], such as at step 5. In this case, the iteration will run past the 5 allocated members once i reaches 5. This could be dangerous. So, I set sched_domains_numa_levels to 0 at step 2. This way, even if any memory allocation at step 3 fails and sched_init_numa() returns, anyone who uses sched_domains_numa_levels (which is 0) won't go wrong. I'm not sure if this is the best way to settle this problem. If you have a better idea, please tell me. Thanks. :) On 09/25/2012 06:33 PM, Peter Zijlstra wrote: On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote: @@ -6765,11 +6773,64 @@ static void sched_init_numa(void) } sched_domain_topology = tl; + +sched_domains_numa_levels = level; And I set it to level here again. But it's already set there.. it's set every time we find a new level. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
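The guard Tang describes above is a general pattern: publish a global count only after everything it indexes has been allocated, so any iterator keyed off the count can never walk into unallocated entries. A minimal, self-contained user-space sketch of the same idea (illustrative only; the names and MAX_LEVELS are made up, this is not the kernel code):

#include <stdlib.h>

#define MAX_LEVELS 16

static int levels;                      /* consumers only trust entries [0, levels) */
static unsigned long *masks[MAX_LEVELS];

static void init_levels(int discovered)
{
	int i;

	if (discovered > MAX_LEVELS)
		discovered = MAX_LEVELS;

	levels = 0;                     /* step 2: nothing is published yet */

	for (i = 0; i < discovered; i++) {
		masks[i] = calloc(1, sizeof(unsigned long));
		if (!masks[i])
			return;         /* step 3 failed: levels stays 0 */
	}

	levels = discovered;            /* step 4: publish only after full success */
}

With the count published last, a loop like the one at step 5 sees either the fully allocated array or nothing, never a partially allocated one.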
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote: > > We do this because nr_node_ids changed, right? This means the entire > > distance table grew/shrunk, which means we should do the level scan > > again. > > It seems that nr_node_ids will not change once the system is up. > I'm not quite sure. If I am wrong, please tell me. :) Ah, right you are.. So the problem is that cpumask_of_node() doesn't contain offline cpus, and we might not boot the machine with all cpus present. In that case I guess the suggested hotplug hooks are fine. It's just that your changelog was confusing. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
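To make Peter's point concrete: at boot, sched_init_numa() builds each per-level, per-node mask by OR-ing in cpumask_of_node() for every node within the level's distance, and cpumask_of_node() only reports CPUs that are online at that moment. The following is a rough paraphrase of that loop (from memory of the v3.6-era code, declarations omitted; not the verbatim source), showing why a CPU onlined later is absent from every mask and has to be added by a hotplug callback:

	for (i = 0; i < level; i++) {
		for (j = 0; j < nr_node_ids; j++) {
			struct cpumask *mask = sched_domains_numa_masks[i][j];

			for (k = 0; k < nr_node_ids; k++) {
				if (node_distance(j, k) > sched_domains_numa_distance[i])
					continue;

				/* only CPUs currently online show up in cpumask_of_node(k) */
				cpumask_or(mask, mask, cpumask_of_node(k));
			}
		}
	}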
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote: > >> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void) > >> } > >> > >> sched_domain_topology = tl; > >> + > >> +sched_domains_numa_levels = level; > > And I set it to level here again. > But it's already set there.. it's set every time we find a new level. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
On 09/24/2012 05:57 PM, Srivatsa S. Bhat wrote: On 09/24/2012 03:08 PM, Peter Zijlstra wrote: + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE); hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE); hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE); OK, so you really want your notifier to run before cpuset_cpu_active because otherwise you get that crash, yet you fail with the whole order thing.. You should not _ever_ rely on registration order. IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE priority to ensure that the ordering of callbacks is right, isn't it? Yes, that's right. Thanks. :) Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
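For readers following the priority argument: hotcpu_notifier(fn, pri) registers fn with a notifier_block whose .priority is pri, and the CPU notifier chain is kept sorted by priority, so a higher-priority callback runs first on CPU_ONLINE no matter when it was registered. Below is a paraphrased sketch of the chain insertion (from memory of kernel/notifier.c of that era, not quoted verbatim); in the same era CPU_PRI_SCHED_ACTIVE was INT_MAX and CPU_PRI_CPUSET_ACTIVE was INT_MAX - 1, which is what guarantees the new mask-update notifier runs before cpuset_cpu_active():

	/* Entries are kept in descending .priority order, so callback order
	 * depends only on priority, never on registration order. */
	static int notifier_chain_register(struct notifier_block **nl,
					   struct notifier_block *n)
	{
		while ((*nl) != NULL) {
			if (n->priority > (*nl)->priority)
				break;
			nl = &((*nl)->next);
		}
		n->next = *nl;
		rcu_assign_pointer(*nl, n);
		return 0;
	}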
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
Hi Peter: Thanks for commenting this patch. :) On 09/24/2012 05:38 PM, Peter Zijlstra wrote: Why are you cc'ing x86 and numa folks but not a single scheduler person when you're patching scheduler stuff? First of all, I'm sorry for this. I thought it was a NUMA or memory related problem. And I got your address from the get_maintainer.pl script. On Tue, 2012-09-18 at 18:12 +0800, Tang Chen wrote: Once array sched_domains_numa_masks is defined, it is never updated. When a new cpu on a new node is onlined, Hmm, so there's hardware where you can boot with smaller nr_node_ids than possible.. I guess that makes sense. Yes. nr_node_ids represents the max number of nodes the system supports. And usually, we don't have that many. the coincident member in sched_domains_numa_masks is not initialized, and all the masks are 0. As a result, the build_overlap_sched_groups() will initialize a NULL sched_group for the new cpu on the new node, which will lead to kernel panic. This patch registers a new notifier for cpu hotplug notify chain, and updates sched_domains_numa_masks every time a new cpu is onlined or offlined. Urgh, more hotplug notifiers.. ah well. Signed-off-by: Tang Chen --- kernel/sched/core.c | 62 +++ 1 files changed, 62 insertions(+), 0 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fbf1fd0..66b36ab 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6711,6 +6711,14 @@ static void sched_init_numa(void) * numbers. */ + /* +* Since sched_domains_numa_levels is also used in other functions as +* an index for sched_domains_numa_masks[][], we should reset it here in +* case sched_domains_numa_masks[][] fails to be initialized. And set it +* to 'level' when sched_domains_numa_masks[][] is fully initialized. +*/ + sched_domains_numa_levels = 0; This isn't strictly needed for this patch right? I don't see anybody calling sched_init_numa() a second time (although they should).. Indeed, sched_init_numa() won't be called by anyone else. But I clear it here in case the following memory allocation fails. Actually, I want to use sched_domains_numa_levels to iterate all the numa levels, such as in sched_domains_numa_masks_set(). Suppose sched_domains_numa_levels is 10. If allocating memory for sched_domains_numa_masks[5] fails in sched_init_numa(), it will just return, but sched_domains_numa_masks[] only has 4 members. It could be dangerous. So I set sched_domains_numa_levels to 0, and reset it to level at the end of sched_init_numa(), see below. sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL); if (!sched_domains_numa_masks) return; @@ -6765,11 +6773,64 @@ static void sched_init_numa(void) } sched_domain_topology = tl; + + sched_domains_numa_levels = level; And I set it to level here again. +} + +static void sched_domains_numa_masks_set(int cpu) +{ + int i, j; + int node = cpu_to_node(cpu); + + for (i = 0; i < sched_domains_numa_levels; i++) + for (j = 0; j < nr_node_ids; j++) + if (node_distance(j, node) <= sched_domains_numa_distance[i]) + cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]); +} + +static void sched_domains_numa_masks_clear(int cpu) +{ + int i, j; + for (i = 0; i < sched_domains_numa_levels; i++) + for (j = 0; j < nr_node_ids; j++) + cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]); +} Aside from the coding style nit of wanting braces over multi-line statements even though not strictly required, I really don't see how this could possibly be right.. Thanks for telling me that. I'll fix the coding style. :)
We do this because nr_node_ids changed, right? This means the entire distance table grew/shrunk, which means we should do the level scan again. It seems that nr_node_ids will not change once the system is up. I'm not quite sure. If I am wrong, please tell me. :) I found that nr_node_ids is defined as MAX_NUMNODES in mm/page_alloc.c: int nr_node_ids __read_mostly = MAX_NUMNODES; And all the functions that change this value are __init functions. So I think nr_node_ids is just the number of possible nodes in the system, and it won't be changed after system initialization has finished. @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void) mutex_unlock(&sched_domains_mutex); put_online_cpus(); + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE); hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE); hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE); OK, so you really want your notifier to run before cpuset_cpu_active because otherwise you get that crash, yet you fail with the whole order thing.. You should not _ever_ rely on registration order.
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
On 09/24/2012 03:42 PM, Peter Zijlstra wrote: > On Mon, 2012-09-24 at 15:27 +0530, Srivatsa S. Bhat wrote: >> On 09/24/2012 03:08 PM, Peter Zijlstra wrote: + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE); hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE); hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE); >>> >>> OK, so you really want your notifier to run before cpuset_cpu_active >>> because otherwise you get that crash, yet you fail with the whole order >>> thing.. You should not _ever_ rely on registration order. >>> >> >> IMHO he isn't relying on registration order.. He uses the >> CPU_PRI_SCHED_ACTIVE >> priority to ensure that the ordering of callbacks is right, isn't it? > > Oh argh indeed. I can't read :/ > ;-) Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
On Mon, 2012-09-24 at 15:27 +0530, Srivatsa S. Bhat wrote: > On 09/24/2012 03:08 PM, Peter Zijlstra wrote: > >> + hotcpu_notifier(sched_domains_numa_masks_update, > >> CPU_PRI_SCHED_ACTIVE); > >> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE); > >> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE); > > > > OK, so you really want your notifier to run before cpuset_cpu_active > > because otherwise you get that crash, yet you fail with the whole order > > thing.. You should not _ever_ rely on registration order. > > > > IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE > priority to ensure that the ordering of callbacks is right, isn't it? Oh argh indeed. I can't read :/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
On 09/24/2012 03:08 PM, Peter Zijlstra wrote: >> + hotcpu_notifier(sched_domains_numa_masks_update, >> CPU_PRI_SCHED_ACTIVE); >> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE); >> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE); > > OK, so you really want your notifier to run before cpuset_cpu_active > because otherwise you get that crash, yet you fail with the whole order > thing.. You should not _ever_ rely on registration order. > IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE priority to ensure that the ordering of callbacks is right, isn't it? Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
Why are you cc'ing x86 and numa folks but not a single scheduler person when you're patching scheduler stuff? On Tue, 2012-09-18 at 18:12 +0800, Tang Chen wrote: > Once array sched_domains_numa_masks is defined, it is never updated. > When a new cpu on a new node is onlined, Hmm, so there's hardware where you can boot with smaller nr_node_ids than possible.. I guess that makes sense. > the coincident member in > sched_domains_numa_masks is not initialized, and all the masks are 0. > As a result, the build_overlap_sched_groups() will initialize a NULL > sched_group for the new cpu on the new node, which will lead to kernel panic. > This patch registers a new notifier for cpu hotplug notify chain, and > updates sched_domains_numa_masks every time a new cpu is onlined or offlined. Urgh, more hotplug notifiers.. ah well. > Signed-off-by: Tang Chen > --- > kernel/sched/core.c | 62 > +++ > 1 files changed, 62 insertions(+), 0 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index fbf1fd0..66b36ab 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -6711,6 +6711,14 @@ static void sched_init_numa(void) >* numbers. >*/ > > + /* > + * Since sched_domains_numa_levels is also used in other functions as > + * an index for sched_domains_numa_masks[][], we should reset it here in > + * case sched_domains_numa_masks[][] fails to be initialized. And set it > + * to 'level' when sched_domains_numa_masks[][] is fully initialized. > + */ > + sched_domains_numa_levels = 0; This isn't strictly needed for this patch right? I don't see anybody calling sched_init_numa() a second time (although they should).. > sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL); > if (!sched_domains_numa_masks) > return; > @@ -6765,11 +6773,64 @@ static void sched_init_numa(void) > } > > sched_domain_topology = tl; > + > + sched_domains_numa_levels = level; > +} > + > +static void sched_domains_numa_masks_set(int cpu) > +{ > + int i, j; > + int node = cpu_to_node(cpu); > + > + for (i = 0; i < sched_domains_numa_levels; i++) > + for (j = 0; j < nr_node_ids; j++) > + if (node_distance(j, node) <= > sched_domains_numa_distance[i]) > + cpumask_set_cpu(cpu, > sched_domains_numa_masks[i][j]); > +} > + > +static void sched_domains_numa_masks_clear(int cpu) > +{ > + int i, j; > + for (i = 0; i < sched_domains_numa_levels; i++) > + for (j = 0; j < nr_node_ids; j++) > + cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]); > +} Aside from the coding style nit of wanting braces over multi-line statements even though not strictly required, I really don't see how this could possibly be right.. We do this because nr_node_ids changed, right? This means the entire distance table grew/shrunk, which means we should do the level scan again. > @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void) > mutex_unlock(&sched_domains_mutex); > put_online_cpus(); > > + hotcpu_notifier(sched_domains_numa_masks_update, > CPU_PRI_SCHED_ACTIVE); > hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE); > hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE); OK, so you really want your notifier to run before cpuset_cpu_active because otherwise you get that crash, yet you fail with the whole order thing.. You should not _ever_ rely on registration order. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
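On the braces nit above: kernel style asks for braces whenever a loop or branch body spans more than one line, even if it is syntactically a single statement. One way the clear helper could look with that applied (illustrative only; the inner single-statement loop may stay unbraced):

	static void sched_domains_numa_masks_clear(int cpu)
	{
		int i, j;

		/* outer loop body is multi-line, so it gets braces */
		for (i = 0; i < sched_domains_numa_levels; i++) {
			for (j = 0; j < nr_node_ids; j++)
				cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
		}
	}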
Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
Hi, Would you please help to review this patch? Thanks. :) On 09/18/2012 06:12 PM, Tang Chen wrote: Once array sched_domains_numa_masks is defined, it is never updated. When a new cpu on a new node is onlined, the coincident member in sched_domains_numa_masks is not initialized, and all the masks are 0. As a result, the build_overlap_sched_groups() will initialize a NULL sched_group for the new cpu on the new node, which will lead to kernel panic. [ 3189.403280] Call Trace: [ 3189.403286] [] warn_slowpath_common+0x7f/0xc0 [ 3189.403289] [] warn_slowpath_null+0x1a/0x20 [ 3189.403292] [] build_sched_domains+0x467/0x470 [ 3189.403296] [] partition_sched_domains+0x307/0x510 [ 3189.403299] [] ? partition_sched_domains+0x142/0x510 [ 3189.403305] [] cpuset_update_active_cpus+0x83/0x90 [ 3189.403308] [] cpuset_cpu_active+0x38/0x70 [ 3189.403316] [] notifier_call_chain+0x67/0x150 [ 3189.403320] [] ? native_cpu_up+0x18a/0x1b5 [ 3189.403328] [] __raw_notifier_call_chain+0xe/0x10 [ 3189.40] [] __cpu_notify+0x20/0x40 [ 3189.403337] [] _cpu_up+0xe9/0x131 [ 3189.403340] [] cpu_up+0xdb/0xee [ 3189.403348] [] store_online+0x9c/0xd0 [ 3189.403355] [] dev_attr_store+0x20/0x30 [ 3189.403361] [] sysfs_write_file+0xa3/0x100 [ 3189.403368] [] vfs_write+0xd0/0x1a0 [ 3189.403371] [] sys_write+0x54/0xa0 [ 3189.403375] [] system_call_fastpath+0x16/0x1b [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]--- [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0018 This patch registers a new notifier for cpu hotplug notify chain, and updates sched_domains_numa_masks every time a new cpu is onlined or offlined. Signed-off-by: Tang Chen --- kernel/sched/core.c | 62 +++ 1 files changed, 62 insertions(+), 0 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fbf1fd0..66b36ab 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6711,6 +6711,14 @@ static void sched_init_numa(void) * numbers. */ + /* +* Since sched_domains_numa_levels is also used in other functions as +* an index for sched_domains_numa_masks[][], we should reset it here in +* case sched_domains_numa_masks[][] fails to be initialized. And set it +* to 'level' when sched_domains_numa_masks[][] is fully initialized. +*/ + sched_domains_numa_levels = 0; + sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL); if (!sched_domains_numa_masks) return; @@ -6765,11 +6773,64 @@ static void sched_init_numa(void) } sched_domain_topology = tl; + + sched_domains_numa_levels = level; +} + +static void sched_domains_numa_masks_set(int cpu) +{ + int i, j; + int node = cpu_to_node(cpu); + + for (i = 0; i < sched_domains_numa_levels; i++) + for (j = 0; j < nr_node_ids; j++) + if (node_distance(j, node) <= sched_domains_numa_distance[i]) + cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]); +} + +static void sched_domains_numa_masks_clear(int cpu) +{ + int i, j; + for (i = 0; i < sched_domains_numa_levels; i++) + for (j = 0; j < nr_node_ids; j++) + cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]); +} + +/* + * Update sched_domains_numa_masks[level][node] array when new cpus + * are onlined. 
+ */ +static int sched_domains_numa_masks_update(struct notifier_block *nfb, + unsigned long action, + void *hcpu) +{ + int cpu = (int)hcpu; + + switch (action & ~CPU_TASKS_FROZEN) { + case CPU_ONLINE: + sched_domains_numa_masks_set(cpu); + break; + + case CPU_DEAD: + sched_domains_numa_masks_clear(cpu); + break; + + default: + return NOTIFY_DONE; + } + return NOTIFY_OK; } #else static inline void sched_init_numa(void) { } + +static int sched_domains_numa_masks_update(struct notifier_block *nfb, + unsigned long action, + void *hcpu) +{ + return 0; +} #endif /* CONFIG_NUMA */ static int __sdt_alloc(const struct cpumask *cpu_map) @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void) mutex_unlock(&sched_domains_mutex); put_online_cpus(); + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE); hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE); hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Update sched_domains_numa_masks when new cpus are onlined.
Once array sched_domains_numa_masks is defined, it is never updated. When a new cpu on a new node is onlined, the coincident member in sched_domains_numa_masks is not initialized, and all the masks are 0. As a result, the build_overlap_sched_groups() will initialize a NULL sched_group for the new cpu on the new node, which will lead to kernel panic. [ 3189.403280] Call Trace: [ 3189.403286] [] warn_slowpath_common+0x7f/0xc0 [ 3189.403289] [] warn_slowpath_null+0x1a/0x20 [ 3189.403292] [] build_sched_domains+0x467/0x470 [ 3189.403296] [] partition_sched_domains+0x307/0x510 [ 3189.403299] [] ? partition_sched_domains+0x142/0x510 [ 3189.403305] [] cpuset_update_active_cpus+0x83/0x90 [ 3189.403308] [] cpuset_cpu_active+0x38/0x70 [ 3189.403316] [] notifier_call_chain+0x67/0x150 [ 3189.403320] [] ? native_cpu_up+0x18a/0x1b5 [ 3189.403328] [] __raw_notifier_call_chain+0xe/0x10 [ 3189.40] [] __cpu_notify+0x20/0x40 [ 3189.403337] [] _cpu_up+0xe9/0x131 [ 3189.403340] [] cpu_up+0xdb/0xee [ 3189.403348] [] store_online+0x9c/0xd0 [ 3189.403355] [] dev_attr_store+0x20/0x30 [ 3189.403361] [] sysfs_write_file+0xa3/0x100 [ 3189.403368] [] vfs_write+0xd0/0x1a0 [ 3189.403371] [] sys_write+0x54/0xa0 [ 3189.403375] [] system_call_fastpath+0x16/0x1b [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]--- [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0018 This patch registers a new notifier for cpu hotplug notify chain, and updates sched_domains_numa_masks every time a new cpu is onlined or offlined. Signed-off-by: Tang Chen --- kernel/sched/core.c | 62 +++ 1 files changed, 62 insertions(+), 0 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fbf1fd0..66b36ab 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6711,6 +6711,14 @@ static void sched_init_numa(void) * numbers. */ + /* +* Since sched_domains_numa_levels is also used in other functions as +* an index for sched_domains_numa_masks[][], we should reset it here in +* case sched_domains_numa_masks[][] fails to be initialized. And set it +* to 'level' when sched_domains_numa_masks[][] is fully initialized. +*/ + sched_domains_numa_levels = 0; + sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL); if (!sched_domains_numa_masks) return; @@ -6765,11 +6773,64 @@ static void sched_init_numa(void) } sched_domain_topology = tl; + + sched_domains_numa_levels = level; +} + +static void sched_domains_numa_masks_set(int cpu) +{ + int i, j; + int node = cpu_to_node(cpu); + + for (i = 0; i < sched_domains_numa_levels; i++) + for (j = 0; j < nr_node_ids; j++) + if (node_distance(j, node) <= sched_domains_numa_distance[i]) + cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]); +} + +static void sched_domains_numa_masks_clear(int cpu) +{ + int i, j; + for (i = 0; i < sched_domains_numa_levels; i++) + for (j = 0; j < nr_node_ids; j++) + cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]); +} + +/* + * Update sched_domains_numa_masks[level][node] array when new cpus + * are onlined. 
+ */ +static int sched_domains_numa_masks_update(struct notifier_block *nfb, + unsigned long action, + void *hcpu) +{ + int cpu = (int)hcpu; + + switch (action & ~CPU_TASKS_FROZEN) { + case CPU_ONLINE: + sched_domains_numa_masks_set(cpu); + break; + + case CPU_DEAD: + sched_domains_numa_masks_clear(cpu); + break; + + default: + return NOTIFY_DONE; + } + return NOTIFY_OK; } #else static inline void sched_init_numa(void) { } + +static int sched_domains_numa_masks_update(struct notifier_block *nfb, + unsigned long action, + void *hcpu) +{ + return 0; +} #endif /* CONFIG_NUMA */ static int __sdt_alloc(const struct cpumask *cpu_map) @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void) mutex_unlock(&sched_domains_mutex); put_online_cpus(); + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE); hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE); hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE); -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/