Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-25 Thread Tang Chen

Hi Peter~

I will make a v2 patch-set following your comments
and resend it soon. :)

Thanks. :)

On 09/25/2012 07:50 PM, Peter Zijlstra wrote:

On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote:



Let's have an example here.

sched_init_numa()
{
...
// A loop sets sched_domains_numa_levels to level.   ---1

// I set sched_domains_numa_levels to 0.
sched_domains_numa_levels = 0;   ---2


Ah, ok, so this 'idiot in a hurry' (aka. me) placed the hunk wrong and
thought it was at the start of the function.


// A loop allocating memory for sched_domains_numa_masks[][]
for (i = 0; i < level; i++) {
..
// Allocate memory for sched_domains_numa_masks[i]   ---3
..
}
..

// I reset sched_domains_numa_levels to level.
sched_domains_numa_levels = level;   ---4
}


Yes this makes sense, it might be best to have this as a separate patch.


Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote:


> Let's have an example here.
> 
> sched_init_numa()
> {
>   ...
> // A loop sets sched_domains_numa_levels to level.   ---1
> 
>   // I set sched_domains_numa_levels to 0.
>   sched_domains_numa_levels = 0;   ---2

Ah, ok, so this 'idiot in a hurry' (aka. me) placed the hunk wrong and
thought it was at the start of the function.

>   // A loop allocating memory for sched_domains_numa_masks[][]
>   for (i = 0; i < level; i++) {
>   ..
>   // Allocate memory for sched_domains_numa_masks[i]   ---3
>   ..
>   }
>   ..
> 
>   // I reset sched_domains_numa_levels to level.
>   sched_domains_numa_levels = level;   ---4
> }

Yes this makes sense, it might be best to have this as a separate patch.


Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-25 Thread Tang Chen

Hi Peter~

Sorry about the confusing log, and thanks for your patience. :)

Here, I want to say something more about sched_domains_numa_levels
to make myself clearer. :)


Let's have an example here.

sched_init_numa()
{
...
// A loop sets sched_domains_numa_levels to level.   ---1

// I set sched_domains_numa_levels to 0.
sched_domains_numa_levels = 0;   ---2

// A loop allocating memory for sched_domains_numa_masks[][]
for (i = 0; i < level; i++) {
..
// Allocate memory for sched_domains_numa_masks[i]   ---3
..
}
..

// I reset sched_domains_numa_levels to level.
sched_domains_numa_levels = level;   ---4
}

// A new function I added.
static void sched_domains_numa_masks_clear(int cpu)
{
int i, j;
for (i = 0; i < sched_domains_numa_levels; i++)   ---5
for (j = 0; j < nr_node_ids; j++)
cpumask_clear_cpu(cpu, 
sched_domains_numa_masks[i][j]);

}


Suppose level is 10, and at step 1, sched_domains_numa_levels is 10.

If I didn't set sched_domains_numa_levels to 0 at step 2, it would stay 10
all the time.

If memory allocation at step 3 fails when i = 5, the array
sched_domains_numa_masks[][] will only have 5 members, and
sched_domains_numa_levels is 10.

As you see, I added 2 functions that use sched_domains_numa_levels to
iterate over sched_domains_numa_masks[][], such as at step 5.
In this case, the iteration will run into the unallocated members once i reaches 5.

This could be dangerous.

So, I set sched_domains_numa_levels to 0 at step 2. This way, even if
any memory allocation at step 3 fails and sched_init_numa() returns,
anyone who uses sched_domains_numa_levels (which is 0) won't go wrong.

I'm not sure if this is the best way to solve this problem.
If you have a better idea, please tell me. Thanks. :)
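
(For illustration only, here is a minimal userspace sketch of the same guard
pattern. The toy_* names are made up and this is not the scheduler code; the
point is just that the level count is published only after every row has been
allocated, so an iterator bounded by it can never touch an unallocated row.)

#include <stdlib.h>

static int   toy_numa_levels;   /* stays 0 until init fully succeeds */
static int **toy_numa_masks;
static int   toy_nr_nodes;

static void toy_init_numa(int level, int nr_nodes)
{
    int i;

    toy_numa_levels = 0;        /* step 2: guard against partial init */
    toy_nr_nodes = nr_nodes;

    toy_numa_masks = calloc(level, sizeof(*toy_numa_masks));
    if (!toy_numa_masks)
        return;

    for (i = 0; i < level; i++) {       /* step 3 */
        toy_numa_masks[i] = calloc(nr_nodes, sizeof(**toy_numa_masks));
        if (!toy_numa_masks[i])
            return;     /* count is still 0, so iterators see nothing */
    }

    toy_numa_levels = level;    /* step 4: publish only after full success */
}

static void toy_masks_clear(int cpu)    /* like step 5 */
{
    int i, j;

    /* Bounded by the published count, so this never dereferences
       a row that failed to allocate. */
    for (i = 0; i < toy_numa_levels; i++)
        for (j = 0; j < toy_nr_nodes; j++)
            toy_numa_masks[i][j] &= ~(1 << cpu);
}

This only shows the shape of the guard; the real code allocates cpumasks
rather than int bitmaps.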


On 09/25/2012 06:33 PM, Peter Zijlstra wrote:

On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:

@@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
  }

  sched_domain_topology = tl;
+
+sched_domains_numa_levels = level;


And I set it to level here again.


But it's already set there.. it's set every time we find a new level.




Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
> > We do this because nr_node_ids changed, right? This means the entire
> > distance table grew/shrunk, which means we should do the level scan
> > again.
> 
> It seems that nr_node_ids will not change once the system is up.
> I'm not quite sure. If I am wrong, please tell me. :) 

Ah, right you are..

So the problem is that cpumask_of_node() doesn't contain offline cpus,
and we might not boot the machine with all cpus present.

In that case I guess the suggested hotplug hooks are fine. It's just that
your changelog was confusing.
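
(A toy userspace model of that gap, with made-up names rather than kernel code:
a mask snapshotted at "boot" from the cpus online at that time misses a cpu
that comes online later, unless an online hook updates it afterwards, which is
what the proposed notifier does.)

#include <stdio.h>

static unsigned int node_online_cpus[2];    /* bit c set = cpu c online on that node */
static unsigned int numa_mask[2];           /* what "init" captured */

static void toy_init_numa(void)
{
    int n;

    /* boot-time snapshot of whatever cpus happen to be online */
    for (n = 0; n < 2; n++)
        numa_mask[n] = node_online_cpus[n];
}

static void toy_cpu_online(int node, int cpu)
{
    node_online_cpus[node] |= 1u << cpu;
    numa_mask[node] |= 1u << cpu;   /* the hook; drop this and the snapshot goes stale */
}

int main(void)
{
    node_online_cpus[0] = 0x3;      /* only cpus 0 and 1 present at boot, on node 0 */
    toy_init_numa();

    toy_cpu_online(1, 4);           /* cpu 4 hot-added later on node 1 */

    printf("node 1 mask: %#x\n", numa_mask[1]);     /* 0x10 only because of the hook */
    return 0;
}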


Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-25 Thread Peter Zijlstra
On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
> >> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
> >>  }
> >>
> >>  sched_domain_topology = tl;
> >> +
> >> +sched_domains_numa_levels = level;
> 
> And I set it to level here again.
> 
But it's already set there.. it's set every time we find a new level.


Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-24 Thread Tang Chen

On 09/24/2012 05:57 PM, Srivatsa S. Bhat wrote:

On 09/24/2012 03:08 PM, Peter Zijlstra wrote:

+   hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
 hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
 hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);


OK, so you really want your notifier to run before cpuset_cpu_active
because otherwise you get that crash, yet you fail with the whole order
thing.. You should not _ever_ rely on registration order.



IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
priority to ensure that the ordering of callbacks is right, isn't it?


Yes, that's right. Thanks. :)



Regards,
Srivatsa S. Bhat



Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-24 Thread Tang Chen

Hi Peter:

Thanks for commenting on this patch. :)

On 09/24/2012 05:38 PM, Peter Zijlstra wrote:

Why are you cc'ing x86 and numa folks but not a single scheduler person
when you're patching scheduler stuff?


First of all, I'm sorry for this. I thought it was a NUMA or memory
related problem. And I got your address from the get_maintainer.pl
script.



On Tue, 2012-09-18 at 18:12 +0800, Tang Chen wrote:

Once array sched_domains_numa_masks is defined, it is never updated.
When a new cpu on a new node is onlined,


Hmm, so there's hardware where you can boot with smaller nr_node_ids
than possible.. I guess that makes sense.


Yes. nr_node_ids represents the max number of nodes the system supports.
And usually, we don't have that many.




  the corresponding member in
sched_domains_numa_masks is not initialized, and all the masks are 0.
As a result, the build_overlap_sched_groups() will initialize a NULL
sched_group for the new cpu on the new node, which will lead to kernel panic.





This patch registers a new notifier for cpu hotplug notify chain, and
updates sched_domains_numa_masks every time a new cpu is onlined or offlined.


Urgh, more hotplug notifiers.. a well.


Signed-off-by: Tang Chen
---
  kernel/sched/core.c |   62 +++
  1 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..66b36ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
 * numbers.
 */

+   /*
+ * Since sched_domains_numa_levels is also used in other functions as
+ * an index for sched_domains_numa_masks[][], we should reset it here in
+ * case sched_domains_numa_masks[][] fails to be initialized. And set it
+ * to 'level' when sched_domains_numa_masks[][] is fully initialized.
+ */
+   sched_domains_numa_levels = 0;


This isn't strictly needed for this patch right? I don't see anybody
calling sched_init_numa() a second time (although they should)..


Indeed, sched_init_numa() won't be called by anyone else. But I clear it
here in case the following memory allocation fails.

Actually, I want to use sched_domains_numa_levels to iterate all the
numa levels, such as in sched_domains_numa_masks_set().

Suppose sched_domains_numa_levels is 10. If allocating memory for
sched_domains_numa_masks[5] fails in sched_init_numa(), it will just
return, but sched_domains_numa_masks[] will only have 5 members. That
could be dangerous.

So I set sched_domains_numa_levels to 0, and reset it to level at the
end of sched_init_numa(), see below.




sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
if (!sched_domains_numa_masks)
return;
@@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
}

sched_domain_topology = tl;
+
+   sched_domains_numa_levels = level;


And I set it to level here again.


+}
+
+static void sched_domains_numa_masks_set(int cpu)
+{
+   int i, j;
+   int node = cpu_to_node(cpu);
+
+   for (i = 0; i < sched_domains_numa_levels; i++)
+   for (j = 0; j < nr_node_ids; j++)
+   if (node_distance(j, node) <= 
sched_domains_numa_distance[i])
+   cpumask_set_cpu(cpu, 
sched_domains_numa_masks[i][j]);
+}
+
+static void sched_domains_numa_masks_clear(int cpu)
+{
+   int i, j;
+   for (i = 0; i < sched_domains_numa_levels; i++)
+   for (j = 0; j < nr_node_ids; j++)
+   cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+}


Aside from the coding style nit of wanting braces over multi-line
statements even though not strictly required, I really don't see how
this could possibly be right..


Thanks for telling me that. I'll fix the coding style. :)



We do this because nr_node_ids changed, right? This means the entire
distance table grew/shrunk, which means we should do the level scan
again.


It seems that nr_node_ids will not change once the system is up.
I'm not quite sure. If I am wrong, please tell me. :)

I found that nr_node_ids is defined as MAX_NUMNODES in mm/page_alloc.c

int nr_node_ids __read_mostly = MAX_NUMNODES;

And all the functions that change this value are __init functions. So I think
nr_node_ids is just the number of possible nodes in the system, and it won't be
changed after system initialization has finished.




@@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
 mutex_unlock(&sched_domains_mutex);
 put_online_cpus();

+   hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
 hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
 hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);


OK, so you really want your notifier to run before cpuset_cpu_active
because otherwise you get that crash, yet you fail with the whole order
thing.. You should not _ever_ rely on registration order.

Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-24 Thread Srivatsa S. Bhat
On 09/24/2012 03:42 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 15:27 +0530, Srivatsa S. Bhat wrote:
>> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:  
 +   hotcpu_notifier(sched_domains_numa_masks_update, 
 CPU_PRI_SCHED_ACTIVE);
 hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
 hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>>>
>>> OK, so you really want your notifier to run before cpuset_cpu_active
>>> because otherwise you get that crash, yet you fail with the whole order
>>> thing.. You should not _ever_ rely on registration order.
>>>
>>
>> IMHO he isn't relying on registration order.. He uses the 
>> CPU_PRI_SCHED_ACTIVE
>> priority to ensure that the ordering of callbacks is right, isn't it?
> 
> Oh argh indeed. I can't read :/
> 

;-)
 
Regards,
Srivatsa S. Bhat



Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-24 Thread Peter Zijlstra
On Mon, 2012-09-24 at 15:27 +0530, Srivatsa S. Bhat wrote:
> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:  
> >> +   hotcpu_notifier(sched_domains_numa_masks_update, 
> >> CPU_PRI_SCHED_ACTIVE);
> >> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
> >> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
> > 
> > OK, so you really want your notifier to run before cpuset_cpu_active
> > because otherwise you get that crash, yet you fail with the whole order
> > thing.. You should not _ever_ rely on registration order.
> > 
> 
> IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
> priority to ensure that the ordering of callbacks is right, isn't it?

Oh argh indeed. I can't read :/




Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-24 Thread Srivatsa S. Bhat
On 09/24/2012 03:08 PM, Peter Zijlstra wrote:  
>> +   hotcpu_notifier(sched_domains_numa_masks_update, 
>> CPU_PRI_SCHED_ACTIVE);
>> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
> 
> OK, so you really want your notifier to run before cpuset_cpu_active
> because otherwise you get that crash, yet you fail with the whole order
> thing.. You should not _ever_ rely on registration order.
> 

IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
priority to ensure that the ordering of callbacks is right, isn't it?
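
(As an aside, a minimal userspace sketch of why a priority field makes the call
order independent of registration order; the toy_* names are made up and this
is not the kernel's notifier implementation.)

#include <stdio.h>

struct toy_notifier {
    int (*call)(void *data);
    int priority;                   /* higher priority runs earlier */
    struct toy_notifier *next;
};

static struct toy_notifier *toy_chain;

static void toy_register(struct toy_notifier *nb)
{
    struct toy_notifier **p = &toy_chain;

    /* insert before the first entry with a lower priority, so the chain
       is always sorted no matter what order callers register in */
    while (*p && (*p)->priority >= nb->priority)
        p = &(*p)->next;
    nb->next = *p;
    *p = nb;
}

static void toy_call_chain(void *data)
{
    struct toy_notifier *nb;

    for (nb = toy_chain; nb; nb = nb->next)
        nb->call(data);
}

static int masks_update(void *data)  { puts("numa masks update"); return 0; }
static int cpuset_active(void *data) { puts("cpuset_cpu_active"); return 0; }

int main(void)
{
    /* registered in the "wrong" order on purpose */
    struct toy_notifier a = { cpuset_active, 10, NULL };    /* lower priority */
    struct toy_notifier b = { masks_update, 20, NULL };     /* higher priority */

    toy_register(&a);
    toy_register(&b);

    toy_call_chain(NULL);   /* prints "numa masks update" before "cpuset_cpu_active" */
    return 0;
}

In the kernel the same effect comes from the priority argument, so the
CPU_PRI_* values decide the order, not which hotcpu_notifier() call happens
to run first.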

Regards,
Srivatsa S. Bhat 



Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-24 Thread Peter Zijlstra
Why are you cc'ing x86 and numa folks but not a single scheduler person
when you're patching scheduler stuff?

On Tue, 2012-09-18 at 18:12 +0800, Tang Chen wrote:
> Once array sched_domains_numa_masks is defined, it is never updated.
> When a new cpu on a new node is onlined,

Hmm, so there's hardware where you can boot with smaller nr_node_ids
than possible.. I guess that makes sense.

>  the corresponding member in
> sched_domains_numa_masks is not initialized, and all the masks are 0.
> As a result, the build_overlap_sched_groups() will initialize a NULL
> sched_group for the new cpu on the new node, which will lead to kernel panic.



> This patch registers a new notifier for cpu hotplug notify chain, and
> updates sched_domains_numa_masks every time a new cpu is onlined or offlined.

Urgh, more hotplug notifiers.. a well.

> Signed-off-by: Tang Chen 
> ---
>  kernel/sched/core.c |   62 
> +++
>  1 files changed, 62 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..66b36ab 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
>* numbers.
>*/
>  
> + /*
> +  * Since sched_domains_numa_levels is also used in other functions as
> +  * an index for sched_domains_numa_masks[][], we should reset it here in
> +  * case sched_domains_numa_masks[][] fails to be initialized. And set it
> +  * to 'level' when sched_domains_numa_masks[][] is fully initialized.
> +  */
> + sched_domains_numa_levels = 0;

This isn't strictly needed for this patch right? I don't see anybody
calling sched_init_numa() a second time (although they should)..

>   sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
>   if (!sched_domains_numa_masks)
>   return;
> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
>   }
>  
>   sched_domain_topology = tl;
> +
> + sched_domains_numa_levels = level;
> +}
> +
> +static void sched_domains_numa_masks_set(int cpu)
> +{
> + int i, j;
> + int node = cpu_to_node(cpu);
> +
> + for (i = 0; i < sched_domains_numa_levels; i++)
> + for (j = 0; j < nr_node_ids; j++)
> + if (node_distance(j, node) <= 
> sched_domains_numa_distance[i])
> + cpumask_set_cpu(cpu, 
> sched_domains_numa_masks[i][j]);
> +}
> +
> +static void sched_domains_numa_masks_clear(int cpu)
> +{
> + int i, j;
> + for (i = 0; i < sched_domains_numa_levels; i++)
> + for (j = 0; j < nr_node_ids; j++)
> + cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
> +}

Aside from the coding style nit of wanting braces over multi-line
statements even though not strictly required, I really don't see how
this could possibly be right.. 

We do this because nr_node_ids changed, right? This means the entire
distance table grew/shrunk, which means we should do the level scan
again.

> @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
> mutex_unlock(&sched_domains_mutex);
> put_online_cpus();
>  
> +   hotcpu_notifier(sched_domains_numa_masks_update, 
> CPU_PRI_SCHED_ACTIVE);
> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);

OK, so you really want your notifier to run before cpuset_cpu_active
because otherwise you get that crash, yet you fail with the whole order
thing.. You should not _ever_ rely on registration order.



Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-23 Thread Tang Chen

Hi,

Would you please help to review this patch?

Thanks. :)

On 09/18/2012 06:12 PM, Tang Chen wrote:

Once array sched_domains_numa_masks is defined, it is never updated.
When a new cpu on a new node is onlined, the corresponding member in
sched_domains_numa_masks is not initialized, and all the masks are 0.
As a result, the build_overlap_sched_groups() will initialize a NULL
sched_group for the new cpu on the new node, which will lead to kernel panic.

[ 3189.403280] Call Trace:
[ 3189.403286]  [] warn_slowpath_common+0x7f/0xc0
[ 3189.403289]  [] warn_slowpath_null+0x1a/0x20
[ 3189.403292]  [] build_sched_domains+0x467/0x470
[ 3189.403296]  [] partition_sched_domains+0x307/0x510
[ 3189.403299]  [] ? partition_sched_domains+0x142/0x510
[ 3189.403305]  [] cpuset_update_active_cpus+0x83/0x90
[ 3189.403308]  [] cpuset_cpu_active+0x38/0x70
[ 3189.403316]  [] notifier_call_chain+0x67/0x150
[ 3189.403320]  [] ? native_cpu_up+0x18a/0x1b5
[ 3189.403328]  [] __raw_notifier_call_chain+0xe/0x10
[ 3189.40]  [] __cpu_notify+0x20/0x40
[ 3189.403337]  [] _cpu_up+0xe9/0x131
[ 3189.403340]  [] cpu_up+0xdb/0xee
[ 3189.403348]  [] store_online+0x9c/0xd0
[ 3189.403355]  [] dev_attr_store+0x20/0x30
[ 3189.403361]  [] sysfs_write_file+0xa3/0x100
[ 3189.403368]  [] vfs_write+0xd0/0x1a0
[ 3189.403371]  [] sys_write+0x54/0xa0
[ 3189.403375]  [] system_call_fastpath+0x16/0x1b
[ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
[ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 
0018

This patch registers a new notifier for cpu hotplug notify chain, and
updates sched_domains_numa_masks every time a new cpu is onlined or offlined.

Signed-off-by: Tang Chen
---
  kernel/sched/core.c |   62 +++
  1 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..66b36ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
 * numbers.
 */

+   /*
+ * Since sched_domains_numa_levels is also used in other functions as
+ * an index for sched_domains_numa_masks[][], we should reset it here in
+ * case sched_domains_numa_masks[][] fails to be initialized. And set it
+ * to 'level' when sched_domains_numa_masks[][] is fully initialized.
+ */
+   sched_domains_numa_levels = 0;
+
sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
if (!sched_domains_numa_masks)
return;
@@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
}

sched_domain_topology = tl;
+
+   sched_domains_numa_levels = level;
+}
+
+static void sched_domains_numa_masks_set(int cpu)
+{
+   int i, j;
+   int node = cpu_to_node(cpu);
+
+   for (i = 0; i < sched_domains_numa_levels; i++)
+   for (j = 0; j < nr_node_ids; j++)
+   if (node_distance(j, node) <= 
sched_domains_numa_distance[i])
+   cpumask_set_cpu(cpu, 
sched_domains_numa_masks[i][j]);
+}
+
+static void sched_domains_numa_masks_clear(int cpu)
+{
+   int i, j;
+   for (i = 0; i < sched_domains_numa_levels; i++)
+   for (j = 0; j < nr_node_ids; j++)
+   cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+}
+
+/*
+ * Update sched_domains_numa_masks[level][node] array when new cpus
+ * are onlined.
+ */
+static int sched_domains_numa_masks_update(struct notifier_block *nfb,
+  unsigned long action,
+  void *hcpu)
+{
+   int cpu = (int)hcpu;
+
+   switch (action & ~CPU_TASKS_FROZEN) {
+   case CPU_ONLINE:
+   sched_domains_numa_masks_set(cpu);
+   break;
+
+   case CPU_DEAD:
+   sched_domains_numa_masks_clear(cpu);
+   break;
+
+   default:
+   return NOTIFY_DONE;
+   }
+   return NOTIFY_OK;
  }
  #else
  static inline void sched_init_numa(void)
  {
  }
+
+static int sched_domains_numa_masks_update(struct notifier_block *nfb,
+   unsigned long action,
+   void *hcpu)
+{
+   return 0;
+}
  #endif /* CONFIG_NUMA */

  static int __sdt_alloc(const struct cpumask *cpu_map)
@@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
mutex_unlock(&sched_domains_mutex);
put_online_cpus();

+   hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);




[PATCH] Update sched_domains_numa_masks when new cpus are onlined.

2012-09-18 Thread Tang Chen
Once array sched_domains_numa_masks is defined, it is never updated.
When a new cpu on a new node is onlined, the corresponding member in
sched_domains_numa_masks is not initialized, and all the masks are 0.
As a result, the build_overlap_sched_groups() will initialize a NULL
sched_group for the new cpu on the new node, which will lead to kernel panic.

[ 3189.403280] Call Trace:
[ 3189.403286]  [] warn_slowpath_common+0x7f/0xc0
[ 3189.403289]  [] warn_slowpath_null+0x1a/0x20
[ 3189.403292]  [] build_sched_domains+0x467/0x470
[ 3189.403296]  [] partition_sched_domains+0x307/0x510
[ 3189.403299]  [] ? partition_sched_domains+0x142/0x510
[ 3189.403305]  [] cpuset_update_active_cpus+0x83/0x90
[ 3189.403308]  [] cpuset_cpu_active+0x38/0x70
[ 3189.403316]  [] notifier_call_chain+0x67/0x150
[ 3189.403320]  [] ? native_cpu_up+0x18a/0x1b5
[ 3189.403328]  [] __raw_notifier_call_chain+0xe/0x10
[ 3189.40]  [] __cpu_notify+0x20/0x40
[ 3189.403337]  [] _cpu_up+0xe9/0x131
[ 3189.403340]  [] cpu_up+0xdb/0xee
[ 3189.403348]  [] store_online+0x9c/0xd0
[ 3189.403355]  [] dev_attr_store+0x20/0x30
[ 3189.403361]  [] sysfs_write_file+0xa3/0x100
[ 3189.403368]  [] vfs_write+0xd0/0x1a0
[ 3189.403371]  [] sys_write+0x54/0xa0
[ 3189.403375]  [] system_call_fastpath+0x16/0x1b
[ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
[ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 
0018

This patch registers a new notifier for cpu hotplug notify chain, and
updates sched_domains_numa_masks every time a new cpu is onlined or offlined.

Signed-off-by: Tang Chen 
---
 kernel/sched/core.c |   62 +++
 1 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..66b36ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
 * numbers.
 */
 
+   /*
+ * Since sched_domains_numa_levels is also used in other functions as
+ * an index for sched_domains_numa_masks[][], we should reset it here in
+ * case sched_domains_numa_masks[][] fails to be initialized. And set it
+ * to 'level' when sched_domains_numa_masks[][] is fully initialized.
+ */
+   sched_domains_numa_levels = 0;
+
sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
if (!sched_domains_numa_masks)
return;
@@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
}
 
sched_domain_topology = tl;
+
+   sched_domains_numa_levels = level;
+}
+
+static void sched_domains_numa_masks_set(int cpu)
+{
+   int i, j;
+   int node = cpu_to_node(cpu);
+
+   for (i = 0; i < sched_domains_numa_levels; i++)
+   for (j = 0; j < nr_node_ids; j++)
+   if (node_distance(j, node) <= 
sched_domains_numa_distance[i])
+   cpumask_set_cpu(cpu, 
sched_domains_numa_masks[i][j]);
+}
+
+static void sched_domains_numa_masks_clear(int cpu)
+{
+   int i, j;
+   for (i = 0; i < sched_domains_numa_levels; i++)
+   for (j = 0; j < nr_node_ids; j++)
+   cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+}
+
+/*
+ * Update sched_domains_numa_masks[level][node] array when new cpus
+ * are onlined.
+ */
+static int sched_domains_numa_masks_update(struct notifier_block *nfb,
+  unsigned long action,
+  void *hcpu)
+{
+   int cpu = (int)hcpu;
+
+   switch (action & ~CPU_TASKS_FROZEN) {
+   case CPU_ONLINE:
+   sched_domains_numa_masks_set(cpu);
+   break;
+
+   case CPU_DEAD:
+   sched_domains_numa_masks_clear(cpu);
+   break;
+
+   default:
+   return NOTIFY_DONE;
+   }
+   return NOTIFY_OK;
 }
 #else
 static inline void sched_init_numa(void)
 {
 }
+
+static int sched_domains_numa_masks_update(struct notifier_block *nfb,
+   unsigned long action,
+   void *hcpu)
+{
+   return 0;
+}
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
@@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
mutex_unlock(&sched_domains_mutex);
put_online_cpus();
 
+   hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
 
-- 
1.7.1
