Re: 3.10.16 cgroup css_set_lock deadlock

2013-11-15 Thread Don Morris
On 11/15/2013 03:19 AM, Tejun Heo wrote:
> On Thu, Nov 14, 2013 at 05:25:29PM -0600, Shawn Bohrer wrote:
>> In trying to reproduce the cgroup_mutex deadlock I reported earlier
>> in https://lkml.org/lkml/2013/11/11/574 I believe I encountered a
>> different issue that I'm also unable to understand. This machine
>> started out reporting some soft lockups that look to me like they are
>> on a read_lock(css_set_lock):
>>
> ...
>> RIP: 0010:[]  [] 
>> cgroup_attach_task+0xdc/0x7a0
> ...
>>  [] attach_task_by_pid+0x167/0x1a0
>>  [] cgroup_tasks_write+0x13/0x20

I've been getting this hang intermittently with the numad daemon
running on CentOS/Fedora while running numa balancing tests. Started
around 3.9 or so.

> 
> Most likely the bug fixed by ea84753c98a7 ("cgroup: fix to break the
> while loop in cgroup_attach_task() correctly").  3.10.19 contains the
> backported fix.
> 
> Thanks.
> 

Yes, that definitely looks like the right change -- and I ran
post-3.12-rc6 for over a week without hitting the issue again.
I'm willing to call that verified, since previously I couldn't
go more than 2 days without encountering the bug.

Ok, stupid question time since I stared at that loop several
times while trying to figure out how things got stuck there.
Apologies in advance if I'm just thick today -- but I'd
really like to grok this bug.

Are we getting some other thread from while_each_thread()
repeatedly keeping us in the loop? Or is there something
else going on? My gut instinct is that calling something
like while_each_thread() on an exiting thread would either
reliably give other threads in the group or quit [if the
thread is the only one left in the group or if an exiting
thread is no longer part of the group], but since that would
make the continue work, obviously I'm missing something.

Mel, I don't know how much time you've given to this since the
last email, but this clears it up. Thanks for your time.

Don Morris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: BUG: soft lockup - CPU#8 stuck for 22s!

2013-11-04 Thread Don Morris
On 11/04/2013 12:04 PM, Mel Gorman wrote:
> On Tue, Oct 22, 2013 at 01:29:22PM -0400, Don Morris wrote:
>> Greetings, all.
>>
>> Just wanted to drop this out there to see if it rang any bells.
>> I've been getting a soft lockup (numad thread stuck on a cpu
>> while attempting to attach a task to a cgroup) for a while now,
>> but I thought it was only happening when I applied Mel Gorman's
>> set of AutoNUMA patches.
> 
> This maybe?

Certainly would make sense. My appreciation for taking a look
at it.

I happen to be on the road today, however -- and away from the
reproduction environment. I'll give it a shot tomorrow morning
and either let you know if it fixes things or report the sysrq-t
output you requested.

Again, my thanks!
Don Morris

> 
> ---8<---
> mm: memcontrol: Release css_set_lock when aborting an OOM scan
> 
> css_task_iter_start acquires the css_set_lock and it must be released with
> a call to css_task_iter_end. Commit 9cbb78bb ("mm, memcg: introduce own
> oom handler to iterate only over its own threads") introduced a loop that
> was not guaranteed to call css_task_iter_end.
> 
> Cc: stable 
> Signed-off-by: Mel Gorman 
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5ef8929..941f67d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1795,6 +1795,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup 
> *memcg, gfp_t gfp_mask,
>   mem_cgroup_iter_break(memcg, iter);
>   if (chosen)
>   put_task_struct(chosen);
> + css_task_iter_end(&it);
>   return;
>   case OOM_SCAN_OK:
>   break;
> .
> 


-- 
kernel, n:
A part of an operating system that preserves the medieval traditions
of sorcery and black art.



Re: [PATCH] sched, numa: Use {cpu, pid} to create task groups for shared faults

2013-07-31 Thread Don Morris
On 07/31/2013 11:07 AM, Peter Zijlstra wrote:
> 
> New version that includes a final put for the numa_group struct and a
> few other modifications.
> 
> The new task_numa_free() completely blows though, far too expensive.
> Good ideas needed.
> 
> ---
> Subject: sched, numa: Use {cpu, pid} to create task groups for shared faults
> From: Peter Zijlstra 
> Date: Tue Jul 30 10:40:20 CEST 2013
> 
> A very simple/straight forward shared fault task grouping
> implementation.
> 
> Concerns are that grouping on a single shared fault might be too
> aggressive -- this only works because Mel is excluding DSOs for faults,
> otherwise we'd have the world in a single group.
> 
> Future work could explore more complex means of picking groups. We
> could for example track one group for the entire scan (using something
> like PDM) and join it at the end of the scan if we deem it shared a
> sufficient amount of memory.
> 
> Another avenue to explore is that to do with tasks where private faults
> are predominant. Should we exclude them from the group or treat them as
> secondary, creating a graded group that tries hardest to collate shared
> tasks but also tries to move private tasks near when possible.
> 
> Also, the grouping information is completely unused; it's up to future
> patches to do this.
> 
> Signed-off-by: Peter Zijlstra 
> ---
>  include/linux/sched.h |4 +
>  kernel/sched/core.c   |4 +
>  kernel/sched/fair.c   |  177 
> +++---
>  kernel/sched/sched.h  |5 -
>  4 files changed, 176 insertions(+), 14 deletions(-)

> +
> +static void task_numa_free(struct task_struct *p)
> +{
> +	kfree(p->numa_faults);
> +	if (p->numa_group) {
> +		struct numa_group *grp = p->numa_group;

See below.

> +		int i;
> +
> +		for (i = 0; i < 2*nr_node_ids; i++)
> +			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
> +
> +		spin_lock(&p->numa_lock);
> +		spin_lock(&group->lock);
> +		list_del(&p->numa_entry);
> +		spin_unlock(&group->lock);
> +		rcu_assign_pointer(p->numa_group, NULL);
> +		put_numa_group(grp);

So is the local variable group or grp here? Got to be one or the
other to compile...

Don

> +	}
> +}
> +
>  /*
>   * Got a PROT_NONE fault for a page on @node.
>   */




Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-27 Thread Don Morris
On 02/27/2013 12:11 AM, Yinghai Lu wrote:
> On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu
>> isimatu.yasu...@jp.fujitsu.com wrote:
>> 2013/02/27 13:04, Yinghai Lu wrote:
>>>
>>> On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu
>>>> isimatu.yasu...@jp.fujitsu.com wrote:
>>>>
>>>> 2013/02/27 11:30, Yinghai Lu wrote:
>>>>>
>>>>> Do you mean you can not boot one socket system with 1G ram ?
>>>>> Assume socket 0 does not support hotplug, other 31 sockets support hot
>>>>> plug.
>>>>>
>>>>> So we could boot system only with socket0, and later one by one hot
>>>>> add other cpus.
>>>>
>>>>
>>>>
>>>> In this case, system can boot. But other cpus with bunch of ram hot
>>>> plug may fails, since system does not have enough memory for cover
>>>> hot added memory. When hot adding memory device, kernel object for the
>>>> memory is allocated from 1G ram since hot added memory has not been
>>>> enabled.
>>>>
>>>
>>> yes, it may fail, if the one node memory need page table and vmemmap
>>> is more than 1g ...
>>>
>>
>>> for hot add memory we need to
>>> 1. add another wrapper for init_memory_mapping, just like
>>> init_mem_mapping() for booting path.
>>> 2. we need make memblock more generic, so we can use it with hot add
>>> memory during runtime.
>>> 3. with that we can initialize page table for hot added node with ram.
>>> a. initial page table for 2M near node top is from node0 ( that does
>>> not support hot plug).
>>> b. then will use 2M for memory below node top...
>>> c. with that we will make sure page table stay on local node.
>>>   alloc_low_pages need to be updated to support that.
>>> 4. need to make sure vmemmap on local node too.
>>
>>
>> I think so too. By this, memory hot plug becomes more useful.
>>
>>>
>>> so hot-remove node will work too later.
>>>
>>> In the long run, we should make booting path and hot adding more
>>> similar and share at most code.
>>> That will make code get more test coverage.
> 
> Tang,  Yasuaki, Andrew,
> 
> Please check if you are ok with attached reverting patch.
> 
> Tim, Don,
> Can you try if attached reverting patch fix all the problems for you ?

I'm sure from the discussion on how to keep memory hotplug working
that it won't be just a clean reversion, but as a data point -- yes,
this patch does remove the problem as expected (and I don't see
any new ones at first glance... though obviously I'm not trying
hotplug yet).

Thanks,
Don Morris





Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!

2013-02-25 Thread Don Morris
On 02/25/2013 10:32 AM, Tim Gardner wrote:
> On 02/25/2013 08:02 AM, Tim Gardner wrote:
>> Is this an expected warning ? I'll boot a vanilla kernel just to be sure.
>>
>> rebased against ab7826595e9ec51a51f622c5fc91e2f59440481a in Linus' repo:
>>
> 
> Same with a vanilla kernel, so it doesn't appear that any Ubuntu cruft
> is having an impact:

Reproduced on a HP z620 workstation (E5-2620 instead of E5-2680, but
still Sandy Bridge, though I don't think that matters).

Bisection leads to:
# bad: [e8d1955258091e4c92d5a975ebd7fd8a98f5d30f] acpi, memory-hotplug:
parse SRAT before memblock is ready

Nothing terribly obvious leaps out as to *why* that reshuffling messes
up the cpu<-->node bindings, but I wanted to put this out there while
I poke around further. [Note that the SRAT: PXM -> APIC -> Node printouts
during boot are the same either way -- if you look at the APIC
numbers of the processors (from /proc/cpuinfo), the processors should
be assigned to the correct node, but they aren't.] cc'ing Tang Chen
in case this is obvious to him or he's already fixed it somewhere not
on Linus's tree yet.

Don Morris

> 
> [0.170435] [ cut here ]
> [0.170450] WARNING: at arch/x86/kernel/smpboot.c:324
> topology_sane.isra.2+0x71/0x84()
> [0.170452] Hardware name: S2600CP
> [0.170454] sched: CPU #1's llc-sibling CPU #0 is not on the same
> node! [node: 1 != 0]. Ignoring dependency.
> [0.156000] smpboot: Booting Node   1, Processors  #1
> [0.170455] Modules linked in:
> [0.170460] Pid: 0, comm: swapper/1 Not tainted 3.8.0+ #1
> [0.170461] Call Trace:
> [0.170466]  [] warn_slowpath_common+0x7f/0xc0
> [0.170473]  [] warn_slowpath_fmt+0x46/0x50
> [0.170477]  [] topology_sane.isra.2+0x71/0x84
> [0.170482]  [] set_cpu_sibling_map+0x23f/0x436
> [0.170487]  [] start_secondary+0x137/0x201
> [0.170502] ---[ end trace 09222f596307ca1d ]---
> 
> rtg
> 




Re: [PATCH 20/31] sched, numa, mm/mpol: Make mempolicy home-node aware

2012-11-01 Thread Don Morris
On 11/01/2012 06:58 AM, Mel Gorman wrote:
> On Thu, Oct 25, 2012 at 02:16:37PM +0200, Peter Zijlstra wrote:
>> Add another layer of fallback policy to make the home node concept
>> useful from a memory allocation PoV.
>>
>> This changes the mpol order to:
>>
>>  - vma->vm_ops->get_policy   [if applicable]
>>  - vma->vm_policy[if applicable]
>>  - task->mempolicy
>>  - tsk_home_node() preferred [NEW]
>>  - default_policy
>>
>> Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
>> facilitate efficient on-demand memory migration.
>>
> 
> Makes sense and it looks like a VMA policy, if set, will still override
> the home_node policy as you'd expect. At some point this may need to cope
> with node hot-remove. Also, at some point this must be dealing with the
> case where mbind() is called but the home_node is not in the nodemask.
> Does that happen somewhere else in the series? (maybe I'll see it later)
> 

I'd expect one of the first things to be done in the sequence of
hot-removing a node would be to take the cpus offline (at least
out of being schedulable). Hence the tasks would be migrated
to other nodes/processors, which should result in a home node
update the same as if the scheduler had simply chosen a better
home for them anyway. The memory would then migrate either
via the home node change by the tasks themselves or via
migration to evacuate the outgoing node (with the preferred
migration target using the new home node).

As long as no one wants to do something crazy like offline
a node before taking the resources away from the scheduler
and memory management, it should all work out.

Don Morris



Re: [3.6.2] oops @ opteron server: mgag200 Fatal error during GPU init

2012-10-19 Thread Don Morris
On 10/19/2012 04:53 AM, Paweł Sikora wrote:
> Hi,
> 
> on the new opteron server i'm observing an oops during matrox video 
> initialization.
> here's the dmesg from pure 3.6.2 kernel:

I haven't owned a G200 based Matrox in years, but based on code
analysis and your output, it looks to me like the VRAM init failure
results in our taking the unload/cleanup backout path well before
the call to drm_mode_config_init() in mgag200_driver_load().
drm_mode_config_cleanup() doesn't handle that situation.

So I would think either drm_mode_config_cleanup() itself needs
revision to handle being called with an uninitialized data set
(better general solution, but that may violate expectations and
I'd think the maintainers would want to chime in on how to signify
that) or we have the driver use some common sense and clean up what
it really did.

I've generated a patch for the latter, does it solve your immediate
problem? It won't solve the VRAM init failure, I know.
I've built it, but without a G200, haven't tested myself.

Don Morris
HP Mission Critical Linux

> 
> [   20.598985] [drm] Initialized drm 1.1.0 20060810
> [   20.642302] [drm:mga_vram_init] *ERROR* can't reserve VRAM
> [   20.642307] mgag200 :01:04.0: Fatal error during GPU init: -6
> [   20.642319] BUG: unable to handle kernel NULL pointer dereference at   
> (null)
> [   20.664413] IP: [] drm_mode_config_cleanup+0x1f/0x1c0 
> [drm]
> [   20.675905] PGD 40869b067 PUD 4086a4067 PMD 0 
> [   20.687362] Oops:  [#1] SMP 
> [   20.698748] Modules linked in: igb(+) usb_storage(+) mgag200(+) ttm 
> crc32c_intel ghash_clmulni_intel drm_kms_helper drm aesni_intel usb_libusual 
> dca ablk_helper uas i2c_algo_bit sysimgblt cryptd sysfillrect syscopyarea ptp 
> aes_x86_64 pps_core evdev joydev pcspkr aes_generic hid_generic 
> fam15h_power(+) i2c_piix4(+) atiixp(+) k10temp i2c_core microcode ide_core 
> amd64_edac_mod edac_core hwmon edac_mce_amd processor button uhci_hcd ext3 
> jbd mbcache raid1 md_mod usbhid hid ohci_hcd ehci_hcd usbcore usb_common 
> uvesafb sd_mod crc_t10dif ahci libahci libata scsi_mod
> [   20.750381] CPU 12 
> [   20.750478] Pid: 463, comm: udevd Not tainted 3.6.2 #4 Supermicro 
> H8DGU/H8DGU
> [   20.776696] RIP: 0010:[]  [] 
> drm_mode_config_cleanup+0x1f/0x1c0 [drm]
> [   20.790249] RSP: 0018:8804086a3a88  EFLAGS: 00010296
> [   20.803729] RAX:  RBX: 881007f41000 RCX: 
> 0043
> [   20.817409] RDX:  RSI: 0046 RDI: 
> 881008d83000
> [   20.831003] RBP: 8804086a3aa8 R08: 000a R09: 
> 03ff
> [   20.844580] R10:  R11: 03fe R12: 
> 881008d83000
> [   20.858085] R13: 881008d83460 R14: 881007f41000 R15: 
> 881008d833a0
> [   20.871607] FS:  7fc87267c800() GS:88101ec0() 
> knlGS:
> [   20.885316] CS:  0010 DS:  ES:  CR0: 80050033
> [   20.899017] CR2:  CR3: 00040869a000 CR4: 
> 000407e0
> [   20.912916] DR0:  DR1:  DR2: 
> 
> [   20.926724] DR3:  DR6: 0ff0 DR7: 
> 0400
> [   20.940450] Process udevd (pid: 463, threadinfo 8804086a2000, task 
> 88040846ee00)
> [   20.942880] Probing IDE interface ide1...
> [   20.968028] Stack:
> [   20.981616]  881007f41000 881007f41000 881008d83000 
> a029a8e0
> [   20.995514]  8804086a3ac8 a02942c7 fffa 
> 881008ddd000
> [   21.009470]  8804086a3b58 a029462e 8804086a3af8 
> a03c11a1
> [   21.023443] Call Trace:
> [   21.037295]  [] mgag200_driver_unload+0x37/0x70 [mgag200]
> [   21.051493]  [] mgag200_driver_load+0x32e/0x4b0 [mgag200]
> [   21.065600]  [] ? drm_sysfs_device_add+0x81/0xb0 [drm]
> [   21.079699]  [] ? drm_get_minor+0x259/0x2f0 [drm]
> [   21.093733]  [] drm_get_pci_dev+0x17e/0x2c0 [drm]
> [   21.107675]  [] mga_pci_probe+0xb1/0xb9 [mgag200]
> [   21.121582]  [] local_pci_probe+0x74/0x100
> [   21.135386]  [] pci_device_probe+0x111/0x120
> [   21.149106]  [] driver_probe_device+0x76/0x240
> [   21.162801]  [] __driver_attach+0x9b/0xa0
> [   21.176411]  [] ? driver_probe_device+0x240/0x240
> [   21.190062]  [] bus_for_each_dev+0x4d/0x90
> [   21.203724]  [] driver_attach+0x19/0x20
> [   21.217443]  [] bus_add_driver+0x190/0x260
> [   21.231260]  [] ? 0xa02c4fff
> [   21.245155]  [] ? 0xa02c4fff
> [   21.259047]  [] driver_register+0x72/0x170
> [   21.272998]  [] ? 0xa02c4fff
> [   21.286900]  [] __pci_register_driver+0x59/0xd0
> [   21.300840]  [] ? 0xa02c4fff
> [   21.314682]  [] drm_pci_init+0x11a/0x130 [drm]
> [   21.328540]  [

Re: [3.6.2] oops @ opteron server: mgag200 Fatal error during GPU init

2012-10-19 Thread Don Morris
On 10/19/2012 04:53 AM, Paweł Sikora wrote:
 Hi,
 
 on the new opteron server i'm observing an oops during matrox video 
 initialization.
 here's the dmesg from pure 3.6.2 kernel:

I haven't owned a G200 based Matrox in years, but based on code
analysis and your output, it looks to me like the VRAM init failure
results in our taking the unload/cleanup backout path well before
the call to drm_mode_config_init() in mgag200_driver_load().
drm_mode_config_cleanup() doesn't handle that situation.

So I would think either drm_mode_config_cleanup() itself needs
revision to handle being called with an uninitialized data set
(better general solution, but that may violate expectations and
I'd think the maintainers would want to chime in on how to signify
that) or we have the driver use some common sense and clean up what
it really did.

I've generated a patch for the latter; does it solve your immediate
problem? It won't solve the VRAM init failure itself, I know.
I've built it, but without a G200, haven't tested myself.

Don Morris
HP Mission Critical Linux

 
 [   20.598985] [drm] Initialized drm 1.1.0 20060810
 [   20.642302] [drm:mga_vram_init] *ERROR* can't reserve VRAM
 [   20.642307] mgag200 :01:04.0: Fatal error during GPU init: -6
 [   20.642319] BUG: unable to handle kernel NULL pointer dereference at   
 (null)
 [   20.664413] IP: [a03c364f] drm_mode_config_cleanup+0x1f/0x1c0 
 [drm]
 [   20.675905] PGD 40869b067 PUD 4086a4067 PMD 0 
 [   20.687362] Oops:  [#1] SMP 
 [   20.698748] Modules linked in: igb(+) usb_storage(+) mgag200(+) ttm 
 crc32c_intel ghash_clmulni_intel drm_kms_helper drm aesni_intel usb_libusual 
 dca ablk_helper uas i2c_algo_bit sysimgblt cryptd sysfillrect syscopyarea ptp 
 aes_x86_64 pps_core evdev joydev pcspkr aes_generic hid_generic 
 fam15h_power(+) i2c_piix4(+) atiixp(+) k10temp i2c_core microcode ide_core 
 amd64_edac_mod edac_core hwmon edac_mce_amd processor button uhci_hcd ext3 
 jbd mbcache raid1 md_mod usbhid hid ohci_hcd ehci_hcd usbcore usb_common 
 uvesafb sd_mod crc_t10dif ahci libahci libata scsi_mod
 [   20.750381] CPU 12 
 [   20.750478] Pid: 463, comm: udevd Not tainted 3.6.2 #4 Supermicro 
 H8DGU/H8DGU
 [   20.776696] RIP: 0010:[a03c364f]  [a03c364f] 
 drm_mode_config_cleanup+0x1f/0x1c0 [drm]
 [   20.790249] RSP: 0018:8804086a3a88  EFLAGS: 00010296
 [   20.803729] RAX:  RBX: 881007f41000 RCX: 
 0043
 [   20.817409] RDX:  RSI: 0046 RDI: 
 881008d83000
 [   20.831003] RBP: 8804086a3aa8 R08: 000a R09: 
 03ff
 [   20.844580] R10:  R11: 03fe R12: 
 881008d83000
 [   20.858085] R13: 881008d83460 R14: 881007f41000 R15: 
 881008d833a0
 [   20.871607] FS:  7fc87267c800() GS:88101ec0() 
 knlGS:
 [   20.885316] CS:  0010 DS:  ES:  CR0: 80050033
 [   20.899017] CR2:  CR3: 00040869a000 CR4: 
 000407e0
 [   20.912916] DR0:  DR1:  DR2: 
 
 [   20.926724] DR3:  DR6: 0ff0 DR7: 
 0400
 [   20.940450] Process udevd (pid: 463, threadinfo 8804086a2000, task 
 88040846ee00)
 [   20.942880] Probing IDE interface ide1...
 [   20.968028] Stack:
 [   20.981616]  881007f41000 881007f41000 881008d83000 
 a029a8e0
 [   20.995514]  8804086a3ac8 a02942c7 fffa 
 881008ddd000
 [   21.009470]  8804086a3b58 a029462e 8804086a3af8 
 a03c11a1
 [   21.023443] Call Trace:
 [   21.037295]  [a02942c7] mgag200_driver_unload+0x37/0x70 [mgag200]
 [   21.051493]  [a029462e] mgag200_driver_load+0x32e/0x4b0 [mgag200]
 [   21.065600]  [a03c11a1] ? drm_sysfs_device_add+0x81/0xb0 [drm]
 [   21.079699]  [a03bd469] ? drm_get_minor+0x259/0x2f0 [drm]
 [   21.093733]  [a03bfaae] drm_get_pci_dev+0x17e/0x2c0 [drm]
 [   21.107675]  [a0299405] mga_pci_probe+0xb1/0xb9 [mgag200]
 [   21.121582]  [8127f854] local_pci_probe+0x74/0x100
 [   21.135386]  [8127f9f1] pci_device_probe+0x111/0x120
 [   21.149106]  [813319e6] driver_probe_device+0x76/0x240
 [   21.162801]  [81331c4b] __driver_attach+0x9b/0xa0
 [   21.176411]  [81331bb0] ? driver_probe_device+0x240/0x240
 [   21.190062]  [8132fd4d] bus_for_each_dev+0x4d/0x90
 [   21.203724]  [81331509] driver_attach+0x19/0x20
 [   21.217443]  [81331100] bus_add_driver+0x190/0x260
 [   21.231260]  [a02c5000] ? 0xa02c4fff
 [   21.245155]  [a02c5000] ? 0xa02c4fff
 [   21.259047]  [813322d2] driver_register+0x72/0x170
 [   21.272998]  [a02c5000] ? 0xa02c4fff
 [   21.286900]  [8127e6c9] __pci_register_driver+0x59/0xd0
 [   21.300840]  [a02c5000] ? 0xa02c4fff
 [   21.314682

Re: [PATCH 00/33] AutoNUMA27

2012-10-08 Thread Don Morris
On 10/05/2012 05:11 PM, Andi Kleen wrote:
> Tim Chen  writes:
>>>
>>
>> I remembered that 3 months ago when Alex tested the numa/sched patches
>> there were 20% regression on SpecJbb2005 due to the numa balancer.
> 
> 20% on anything sounds like a show stopper to me.
> 
> -Andi
> 

Much worse than that on an 8-way machine for a multi-node multi-threaded
process, from what I can tell. (Andrea's AutoNUMA microbenchmark is a
simple version of that). The contention on the page table lock
( &(&mm->page_table_lock)->rlock ) goes through the roof, with threads
constantly fighting to invalidate translations and re-fault them.

This is on a DL980 with Xeon E7-2870s @ 2.4 GHz, btw.

Running linux-next with no tweaks other than
kernel.sched_migration_cost_ns = 50 gives:
numa01
8325.78
numa01_HARD_BIND
488.98

(The Hard Bind being a case where the threads are pre-bound to the
node set with their memory, so what should be a fairly "best case" for
comparison).

If the SchedNUMA scanning period is upped to 25000 ms (to keep repeated
invalidations from being triggered while the contention for the first
invalidation pass is still being fought over):
numa01
4272.93
numa01_HARD_BIND
498.98

Since this is a "big" process in the current SchedNUMA code and hence
much more likely to trip invalidations, forcing task_numa_big() to
always return false in order to avoid the frequent invalidations gives:
numa01
429.07
numa01_HARD_BIND
466.67

Finally, with SchedNUMA entirely disabled but the rest of linux-next
left intact:
numa01
1075.31
numa01_HARD_BIND
484.20

I didn't write down the lock contentions for comparison, but yes -
the contention does decrease similarly to the time decreases.

There are other microbenchmarks, but those suffice to show the
regression pattern. I mentioned this to the RedHat folks last
week, so I expect this is already being worked on. It seemed pertinent
to bring up given the discussion about the current state of linux-next
though, just so folks know. From where I'm sitting, it looks to
me like the scan period is way too aggressive and there's too much
work potentially attempted during a "scan" (by which I mean the
hard tick driven choice to invalidate in order to set up potential
migration faults). The current code walks/invalidates the entire
virtual address space, skipping few vmas. For a very large 64-bit
process, that's going to be a *lot* of translations (or even vmas
if the address space is fragmented) to walk. That's a seriously
long path coming from the timer code. I would think capping the
number of translations to process per visit would help.

Hope this helps the discussion,
Don Morris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [PATCH 16/19] sched, numa: NUMA home-node selection code

2012-09-26 Thread Don Morris
Re-sending to LKML due to mailer picking up an incorrect
address. (Sorry for the dupe).

On 09/26/2012 07:26 AM, Don Morris wrote:
> Peter --
> 
> You may have / probably have already seen this, and if so I
> apologize in advance (can't find any sign of a fix via any
> searches...).
> 
> I picked up your August sched/numa patch set and have been
> working on it with a 2-node and a 8-node configuration. Got
> a very intermittent crash on the 2-node which of course
> hasn't reproduced since I got the crash/kdump configured.
> (I suspect it is related, however).
> 
> On the 8-node, however, I very reliably got a hard lockup
> NMI after several minutes. This occurs when running Andrea's
> autonuma-benchmark
> (git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git) reliably
> with the first test (two processes, one
> thread per core/vcore, each loops over a single malloc space).
> I'll attach the full stack set from that crash.
> 
> Since the NMI output seemed really consistent that the hard
> lockup stemmed from waiting for a spinlock that never seemed
> to be picked up, I turned on Lock debugging in the .config and
> got a very clear, very consistent circular dependency warning (just
> below).
> 
> As far as I can tell, the warning is correct and is consistent
> with the actual NMI crash output (variant in that the "pidof"
> process on cpu 52 is going through task_sched_runtime() to do
> the task_rq_lock() operation on the numa01 process which
> results in it getting the pi_lock and waiting for
> the rq->lock when numa01 (back on CPU 0) had the rq->lock
> from scheduler_tick() and is going for the pi_lock via
> task_work_add()... ).
> 
> I'm nowhere near confident enough in my knowledge of the
> nuances of run queue locking during the tick update to try
> to hack a workaround - so sorry no proposed patch fix here,
> just a bug report.
> 
> On another minor note, while looking over this and of course
> noticing that most other cpus were tied up waiting for the
> page lock on one of the huge pages (THP was of course on)
> while one of them busied itself invalidating across the other
> CPUs -- the question comes to mind if that's really needed.
> Yes, it certainly is needed in the true PROT_NONE case you're
> building off of as you certainly can't allow access to a
> translation which is now supposed to be locked out, but you
> could allow transitory minor faults when going from PROT_NONE
> back to access as the fault would clear the TLB anyway (at
> least on x86, any architecture which doesn't do that would have
> to have an explicit TLB invalidation for cases where the translation
> is detected as updated anyway, so that should be okay). In your
> case, I would think the transitory faults on what's really a
> hint to the system would probably be much better than tying up
> N-1 other CPUs to do the other flush on a process that spans
> the system -- especially if the other processors are in a scenario
> where they're running that process but working on a different page
> (and hence may never even touch the page changing access anyway).
> Even in the case where you're adding the hint (access to NONE)
> you could be willing to miss an access in favor of letting the
> next context switch invalidate the TLB for you (again, there
> may be architectures where you'll never invalidate unless it is
> explicitly, I think IPF was that way but it has been a while)
> given you really need a non-trivial run time to merit doing this
> work and have a good chance of settling out to a good access
> pattern.
> 
> Just a thought.
> 
> Thanks for your work,
> Don Morris
> 
> ==
> [ INFO: possible circular locking dependency detected ]
> 3.6.0-rc4 #28 Not tainted
> ---
> numa01/35386 is trying to acquire lock:
>  (&p->pi_lock){-.-.-.}, at: [] task_work_add+0x38/0xa0
> 
> but task is already holding lock:
>  (&rq->lock){-.-.-.}, at: [] scheduler_tick+0x53/0x150
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&rq->lock){-.-.-.}:
>[] validate_chain+0x633/0x730
>[] __lock_acquire+0x3f2/0x490
>[] lock_acquire+0xe9/0x120
>[] _raw_spin_lock+0x36/0x70
>[] wake_up_new_task+0xd1/0x190
>[] do_fork+0x1f2/0x280
>[] kernel_thread+0x76/0x80
>[] rest_init+0x26/0xc0
>[] start_kernel+0x3c6/0x3d3
>[] x86_64_start_reservations+0x131/0x136
>[] x86_64_start_kernel+0x101/0x110
> 
> -> #0 (&p->pi_lock){-.-.-.}:
>[] check_prev_add+0x11f/0x4e0
>[] validate_chain+0x633/0x730
>[] __lock_acquire+0x3f2/0x490
>[] lock_acquire+0xe9/0x120
>[] _raw_spin_lock_irqsave+0x55/0xa0


Re: [RFC][PATCH 14/26] sched, numa: Numa balancer

2012-07-13 Thread Don Morris
On 07/12/2012 03:02 PM, Rik van Riel wrote:
> On 03/16/2012 10:40 AM, Peter Zijlstra wrote:
> 
> At LSF/MM, there was a presentation comparing Peter's
> NUMA code with Andrea's NUMA code. I believe this is
> the main reason why Andrea's code performed better in
> that particular test...
> 
>> +if (sched_feat(NUMA_BALANCE_FILTER)) {
>> +/*
>> + * Avoid moving ne's when we create a larger imbalance
>> + * on the other end.
>> + */
>> +if ((imb->type & NUMA_BALANCE_CPU) &&
>> +imb->cpu - cpu_moved < ne_cpu / 2)
>> +goto next;
>> +
>> +/*
>> + * Avoid migrating ne's when we'll know we'll push our
>> + * node over the memory limit.
>> + */
>> +if (max_mem_load &&
>> +imb->mem_load + mem_moved + ne_mem > max_mem_load)
>> +goto next;
>> +}
> 
> IIRC the test consisted of a 16GB NUMA system with two 8GB nodes.
> It was running 3 KVM guests, two guests of 3GB memory each, and
> one guest of 6GB each.

How many cpus per guest (host threads) and how many physical/logical
cpus per node on the host? Any comparisons with a situation where
the memory would fit within nodes but the scheduling load would
be too high?

Don

> 
> With autonuma, the 6GB guest ended up on one node, and the
> 3GB guests on the other.
> 
> With sched numa, each node had a 3GB guest, and part of the 6GB guest.
> 
> There is a fundamental difference in the balancing between autonuma
> and sched numa.
> 
> In sched numa, a process is moved over to the current node only if
> the current node has space for it.
> 
> Autonuma, on the other hand, operates more of a "hostage exchange"
> policy, where a thread on one node is exchanged with a thread on
> another node, if it looks like that will reduce the overall number
> of cross-node NUMA faults in the system.
> 
> I am not sure how to do a "hostage exchange" algorithm with
> sched numa, but it would seem like it could be necessary in order
> for some workloads to converge on a sane configuration.
> 
> After all, with only about 2GB free on each node, you will never
> get to move either a 3GB guest, or parts of a 6GB guest...
> 
> Any ideas?
> 
> -- 
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org .
> 



