Re: 3.10.16 cgroup css_set_lock deadlock
On 11/15/2013 03:19 AM, Tejun Heo wrote:
> On Thu, Nov 14, 2013 at 05:25:29PM -0600, Shawn Bohrer wrote:
>> In trying to reproduce the cgroup_mutex deadlock I reported earlier
>> in https://lkml.org/lkml/2013/11/11/574 I believe I encountered a
>> different issue that I'm also unable to understand. This machine
>> started out reporting some soft lockups that look to me like they are
>> on a read_lock(css_set_lock):
> ...
>> RIP: 0010:[] [] cgroup_attach_task+0xdc/0x7a0
> ...
>> [] attach_task_by_pid+0x167/0x1a0
>> [] cgroup_tasks_write+0x13/0x20

I've been getting this hang intermittently with the numad daemon running
on CentOS/Fedora while running numa balancing tests. Started around 3.9
or so.

> Most likely the bug fixed by ea84753c98a7 ("cgroup: fix to break the
> while loop in cgroup_attach_task() correctly"). 3.10.19 contains the
> backported fix.
>
> Thanks.

Yes, that definitely looks like the right change -- and I ran
post-3.12-rc6 for over a week without hitting the issue again. I'm
willing to call that verified, since previously I couldn't go more than
two days without encountering the bug.

Ok, stupid question time, since I stared at that loop several times
while trying to figure out how things got stuck there. Apologies in
advance if I'm just being thick today -- but I'd really like to grok
this bug. Are we getting some other thread from while_each_thread()
repeatedly keeping us in the loop? Or is there something else going on?
My gut instinct is that calling something like while_each_thread() on an
exiting thread would either reliably give other threads in the group or
quit [if the thread is the only one left in the group, or if an exiting
thread is no longer part of the group], but since that would make the
continue work, obviously I'm missing something.

Mel, I don't know how much time you've given to this since the last
email, but this clears it up. Thanks for your time.
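For what it's worth, here is my reading of the failure mode, as a
userspace toy model -- my own sketch with invented names, not the kernel
code. The key is that while_each_thread() advances the cursor inside the
while() condition, so a `continue` in the loop body jumps straight to
that test and never reaches the trailing `if (!threadgroup) break;`:

```c
#include <assert.h>

/*
 * Toy model of the cgroup_attach_task() loop shape. 'skip' stands in
 * for "exiting, or already in the target cgroup", both of which the
 * pre-fix code handled with 'continue'.
 */
struct task {
	int skip;                  /* exiting, or already in the cgroup */
	struct task *next_thread;  /* circular thread-group list */
};

/* Returns how many iterations the loop ran before exiting (capped). */
static int attach_iterations(struct task *tsk, int use_goto)
{
	struct task *leader = tsk;
	int threadgroup = 0;       /* attaching a single task */
	int iters = 0;

	do {
		if (++iters >= 1000)
			return iters;      /* buggy path would spin forever */
		if (tsk->skip) {
			if (use_goto)
				goto next;     /* fixed: still reach the break */
			continue;          /* buggy: skips the break below */
		}
		/* ... record the task for migration here ... */
next:
		if (!threadgroup)
			break;             /* single task: one pass and out */
	} while ((tsk = tsk->next_thread) != leader);

	return iters;
}
```

If the single task being attached is exiting and has already been
unhashed from the group (so the ring never walks back to it), the buggy
variant spins until the cap, while the `goto next` variant breaks after
one pass -- which would match a soft lockup under read_lock(css_set_lock).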
Don Morris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: BUG: soft lockup - CPU#8 stuck for 22s!
On 11/04/2013 12:04 PM, Mel Gorman wrote:
> On Tue, Oct 22, 2013 at 01:29:22PM -0400, Don Morris wrote:
>> Greetings, all.
>>
>> Just wanted to drop this out there to see if it rang any bells.
>> I've been getting a soft lockup (numad thread stuck on a cpu
>> while attempting to attach a task to a cgroup) for a while now,
>> but I thought it was only happening when I applied Mel Gorman's
>> set of AutoNUMA patches.
>
> This maybe?

Certainly would make sense. My appreciation for taking a look at it.
I happen to be on the road today, however -- and away from the
reproduction environment. I'll give it a shot tomorrow morning and
either let you know if it fixes things or report the sysrq-t output
you requested.

Again, my thanks!
Don Morris

> ---8<---
> mm: memcontrol: Release css_set_lock when aborting an OOM scan
>
> css_task_iter_start acquires the css_set_lock and it must be released
> with a call to css_task_iter_end. Commit 9cbb78bb ("mm, memcg:
> introduce own oom handler to iterate only over its own threads")
> introduced a loop that was not guaranteed to call css_task_iter_end.
>
> Cc: stable
> Signed-off-by: Mel Gorman
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5ef8929..941f67d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1795,6 +1795,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> 			mem_cgroup_iter_break(memcg, iter);
> 			if (chosen)
> 				put_task_struct(chosen);
> +			css_task_iter_end(&it);
> 			return;
> 		case OOM_SCAN_OK:
> 			break;
> .
> --
> kernel, n:
> 	A part of an operating system that preserves the medieval
> 	traditions of sorcery and black art.
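The patch reads more easily with the lock pairing spelled out. Below is a
userspace analogy (entirely my own sketch -- a pthread mutex standing in
for css_set_lock, invented names throughout): _start takes the lock,
_end releases it, so every early return inside the scan has to call _end
first or the lock leaks, exactly the shape the one-liner above fixes.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t set_lock = PTHREAD_MUTEX_INITIALIZER;

struct task_iter { int pos, nr; };

static void task_iter_start(struct task_iter *it, int nr)
{
	pthread_mutex_lock(&set_lock);   /* analogous to css_set_lock */
	it->pos = 0;
	it->nr = nr;
}

static int task_iter_next(struct task_iter *it)
{
	return it->pos < it->nr ? it->pos++ : -1;
}

static void task_iter_end(struct task_iter *it)
{
	(void)it;
	pthread_mutex_unlock(&set_lock);
}

/* Scan that may abort early; 'fixed' decides whether the abort path
 * releases the lock, as the patch above adds. */
static void scan(int abort_at, int fixed)
{
	struct task_iter it;
	int t;

	task_iter_start(&it, 10);
	while ((t = task_iter_next(&it)) >= 0) {
		if (t == abort_at) {
			if (fixed)
				task_iter_end(&it);  /* the missing _end() call */
			return;                  /* buggy path leaks the lock */
		}
	}
	task_iter_end(&it);
}

/* Returns 0 if the lock is free afterwards, 1 if it was leaked. */
static int lock_leaked(void)
{
	if (pthread_mutex_trylock(&set_lock) == 0) {
		pthread_mutex_unlock(&set_lock);
		return 0;
	}
	return 1;
}
```

Any later read_lock(css_set_lock) then waits forever on the leaked lock,
which is consistent with the soft lockup I reported.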
Re: [PATCH] sched, numa: Use {cpu, pid} to create task groups for shared faults
On 07/31/2013 11:07 AM, Peter Zijlstra wrote:
>
> New version that includes a final put for the numa_group struct and a
> few other modifications.
>
> The new task_numa_free() completely blows though, far too expensive.
> Good ideas needed.
>
> ---
> Subject: sched, numa: Use {cpu, pid} to create task groups for shared faults
> From: Peter Zijlstra
> Date: Tue Jul 30 10:40:20 CEST 2013
>
> A very simple/straight forward shared fault task grouping
> implementation.
>
> Concerns are that grouping on a single shared fault might be too
> aggressive -- this only works because Mel is excluding DSOs for faults,
> otherwise we'd have the world in a single group.
>
> Future work could explore more complex means of picking groups. We
> could for example track one group for the entire scan (using something
> like PDM) and join it at the end of the scan if we deem it shared a
> sufficient amount of memory.
>
> Another avenue to explore is that to do with tasks where private faults
> are predominant. Should we exclude them from the group or treat them as
> secondary, creating a graded group that tries hardest to collate shared
> tasks but also tries to move private tasks near when possible.
>
> Also, the grouping information is completely unused, it's up to future
> patches to do this.
>
> Signed-off-by: Peter Zijlstra
> ---
>  include/linux/sched.h |    4 +
>  kernel/sched/core.c   |    4 +
>  kernel/sched/fair.c   |  177 +++---
>  kernel/sched/sched.h  |    5 -
>  4 files changed, 176 insertions(+), 14 deletions(-)

> +
> +static void task_numa_free(struct task_struct *p)
> +{
> +	kfree(p->numa_faults);
> +	if (p->numa_group) {
> +		struct numa_group *grp = p->numa_group;

See below.

> +		int i;
> +
> +		for (i = 0; i < 2*nr_node_ids; i++)
> +			atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
> +
> +		spin_lock(&p->numa_lock);
> +		spin_lock(&group->lock);
> +		list_del(&p->numa_entry);
> +		spin_unlock(&group->lock);
> +		rcu_assign_pointer(p->numa_group, NULL);
> +		put_numa_group(grp);

So is the local variable group or grp here? Got to be one or the other
to compile...

Don

> +	}
> +}
> +
>  /*
>   * Got a PROT_NONE fault for a page on @node.
>   */
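The naming aside, the shape of the teardown is: subtract this task's
per-node fault counts from the group totals, unlink the task, drop the
reference. A userspace sketch of that shape with the local consistently
named grp (purely illustrative -- my own simplified types, no locking,
a plain counter standing in for list_del()):

```c
#include <assert.h>
#include <stdlib.h>

#define NR_STATS 4

struct numa_group {
	int refcount;
	long faults[NR_STATS];
	int nr_tasks;
};

struct task {
	long numa_faults[NR_STATS];
	struct numa_group *numa_group;
};

static void put_numa_group(struct numa_group *grp)
{
	if (--grp->refcount == 0)
		free(grp);
}

static void task_numa_free(struct task *p)
{
	struct numa_group *grp = p->numa_group;
	int i;

	if (grp) {
		/* fold this task's contribution back out of the group */
		for (i = 0; i < NR_STATS; i++)
			grp->faults[i] -= p->numa_faults[i];
		grp->nr_tasks--;       /* stands in for list_del() */
		p->numa_group = NULL;
		put_numa_group(grp);   /* drop the task's reference */
	}
}
```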
Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!
On 02/27/2013 12:11 AM, Yinghai Lu wrote:
> On Tue, Feb 26, 2013 at 8:43 PM, Yasuaki Ishimatsu wrote:
>> 2013/02/27 13:04, Yinghai Lu wrote:
>>> On Tue, Feb 26, 2013 at 7:38 PM, Yasuaki Ishimatsu wrote:
>>>> 2013/02/27 11:30, Yinghai Lu wrote:
>>>>> Do you mean you can not boot one socket system with 1G ram ?
>>>>> Assume socket 0 does not support hotplug, other 31 sockets support
>>>>> hot plug.
>>>>>
>>>>> So we could boot system only with socket0, and later one by one hot
>>>>> add other cpus.
>>>>
>>>> In this case, system can boot. But other cpus with bunch of ram hot
>>>> plug may fails, since system does not have enough memory for cover
>>>> hot added memory. When hot adding memory device, kernel object for
>>>> the memory is allocated from 1G ram since hot added memory has not
>>>> been enabled.
>>>
>>> yes, it may fail, if the one node memory need page table and vmemmap
>>> is more than 1g ...
>>>
>>> for hot add memory we need to
>>> 1. add another wrapper for init_memory_mapping, just like
>>>    init_mem_mapping() for booting path.
>>> 2. we need make memblock more generic, so we can use it with hot add
>>>    memory during runtime.
>>> 3. with that we can initialize page table for hot added node with ram.
>>>    a. initial page table for 2M near node top is from node0 (that does
>>>       not support hot plug).
>>>    b. then will use 2M for memory below node top...
>>>    c. with that we will make sure page table stay on local node.
>>>       alloc_low_pages need to be updated to support that.
>>> 4. need to make sure vmemmap on local node too.
>>
>> I think so too. By this, memory hot plug becomes more useful.
>>
>>> so hot-remove node will work too later.
>>>
>>> In the long run, we should make booting path and hot adding more
>>> similar and share at most code. That will make code get more test
>>> coverage.
>
> Tang, Yasuaki, Andrew,
>
> Please check if you are ok with attached reverting patch.
> Tim, Don,
> Can you try if attached reverting patch fix all the problems for you ?

I'm sure from the discussion on how to keep memory hotplug in that it
likely won't be just a clean reversion, but as a data point -- yes, this
patch does remove the problem as expected (and I don't see any new ones
at first glance... though I'm not trying hotplug yet, obviously).

Thanks,
Don Morris
Re: sched: CPU #1's llc-sibling CPU #0 is not on the same node!
On 02/25/2013 10:32 AM, Tim Gardner wrote:
> On 02/25/2013 08:02 AM, Tim Gardner wrote:
>> Is this an expected warning ? I'll boot a vanilla kernel just to be
>> sure.
>>
>> rebased against ab7826595e9ec51a51f622c5fc91e2f59440481a in Linus'
>> repo:
>
> Same with a vanilla kernel, so it doesn't appear that any Ubuntu cruft
> is having an impact:

Reproduced on a HP z620 workstation (E5-2620 instead of E5-2680, but
still Sandy Bridge, though I don't think that matters).

Bisection leads to:

# bad: [e8d1955258091e4c92d5a975ebd7fd8a98f5d30f] acpi, memory-hotplug:
       parse SRAT before memblock is ready

Nothing terribly obvious leaps out as to *why* that reshuffling messes
up the cpu<-->node bindings, but I wanted to put this out there while I
poke around further. [Note that the SRAT: PXM -> APIC -> Node printouts
during boot are the same either way -- if you look at the APIC numbers
of the processors (from /proc/cpuinfo), the processors should be
assigned to the correct node, but they aren't.]

cc'ing Tang Chen in case this is obvious to him or he's already fixed it
somewhere not in Linus's tree yet.

Don Morris

> [0.170435] [ cut here ]
> [0.170450] WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x71/0x84()
> [0.170452] Hardware name: S2600CP
> [0.170454] sched: CPU #1's llc-sibling CPU #0 is not on the same
> node! [node: 1 != 0]. Ignoring dependency.
> [0.156000] smpboot: Booting Node 1, Processors #1
> [0.170455] Modules linked in:
> [0.170460] Pid: 0, comm: swapper/1 Not tainted 3.8.0+ #1
> [0.170461] Call Trace:
> [0.170466] [] warn_slowpath_common+0x7f/0xc0
> [0.170473] [] warn_slowpath_fmt+0x46/0x50
> [0.170477] [] topology_sane.isra.2+0x71/0x84
> [0.170482] [] set_cpu_sibling_map+0x23f/0x436
> [0.170487] [] start_secondary+0x137/0x201
> [0.170502] ---[ end trace 09222f596307ca1d ]---
>
> rtg
Re: [PATCH 20/31] sched, numa, mm/mpol: Make mempolicy home-node aware
On 11/01/2012 06:58 AM, Mel Gorman wrote:
> On Thu, Oct 25, 2012 at 02:16:37PM +0200, Peter Zijlstra wrote:
>> Add another layer of fallback policy to make the home node concept
>> useful from a memory allocation PoV.
>>
>> This changes the mpol order to:
>>
>> - vma->vm_ops->get_policy	[if applicable]
>> - vma->vm_policy		[if applicable]
>> - task->mempolicy
>> - tsk_home_node() preferred	[NEW]
>> - default_policy
>>
>> Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
>> facilitate efficient on-demand memory migration.
>
> Makes sense and it looks like a VMA policy, if set, will still override
> the home_node policy as you'd expect. At some point this may need to
> cope with node hot-remove. Also, at some point this must be dealing
> with the case where mbind() is called but the home_node is not in the
> nodemask. Does that happen somewhere else in the series? (maybe I'll
> see it later)

I'd expect one of the first things to be done in the sequence of
hot-removing a node would be to take the cpus offline (at least out of
being schedulable). Hence the tasks would be migrated to other
nodes/processors, which should result in a home node update the same as
if the scheduler had simply chosen a better home for them anyway. The
memory would then migrate either via the home node change by the tasks
themselves or via migration to evacuate the outgoing node (with the
preferred migration target using the new home node).

As long as no one wants to do something crazy like offline a node before
taking the resources away from the scheduler and memory management, it
should all work out.

Don Morris
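The mpol order above is a first-non-NULL fallback chain, which is maybe
easier to see in code. A hedged userspace sketch (all types and names
invented for illustration, not the actual mm/mempolicy.c interfaces):

```c
#include <assert.h>
#include <stddef.h>

enum pol_kind { POL_DEFAULT, POL_VMA_OPS, POL_VMA, POL_TASK, POL_HOME_NODE };

struct mempolicy { enum pol_kind kind; };

struct vma {
	struct mempolicy *(*get_policy)(struct vma *vma);
	struct mempolicy *vm_policy;
};

struct task {
	struct mempolicy *mempolicy;
	int home_node;                  /* -1 when unset */
};

static struct mempolicy default_policy = { POL_DEFAULT };
static struct mempolicy home_node_policy = { POL_HOME_NODE };

/* First non-NULL policy in the order the patch lists wins. */
static struct mempolicy *lookup_policy(struct task *tsk, struct vma *vma)
{
	struct mempolicy *pol;

	if (vma && vma->get_policy && (pol = vma->get_policy(vma)))
		return pol;                 /* vma->vm_ops->get_policy */
	if (vma && vma->vm_policy)
		return vma->vm_policy;      /* vma->vm_policy */
	if (tsk->mempolicy)
		return tsk->mempolicy;      /* task->mempolicy */
	if (tsk->home_node >= 0)
		return &home_node_policy;   /* NEW: prefer the home node */
	return &default_policy;
}
```

Seen this way, Mel's observation falls out directly: a VMA policy always
shadows the home-node preference, and the new layer only kicks in for
tasks with no explicit policy at all.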
Re: [3.6.2] oops @ opteron server: mgag200 Fatal error during GPU init
On 10/19/2012 04:53 AM, Paweł Sikora wrote:
> Hi,
>
> on the new opteron server i'm observing an oops during matrox video
> initialization. here's the dmesg from pure 3.6.2 kernel:

I haven't owned a G200-based Matrox in years, but based on code analysis
and your output, it looks to me like the VRAM init failure results in
our taking the unload/cleanup backout path well before the call to
drm_mode_config_init() in mgag200_driver_load(), and
drm_mode_config_cleanup() doesn't handle that situation.

So I would think either drm_mode_config_cleanup() itself needs revision
to handle being called with an uninitialized data set (the better
general solution, but that may violate expectations, and I'd think the
maintainers would want to chime in on how to signify that), or we have
the driver use some common sense and clean up only what it really did.
I've generated a patch for the latter -- does it solve your immediate
problem? It won't solve the VRAM init failure, I know. I've built it,
but without a G200, haven't tested it myself.
Don Morris
HP Mission Critical Linux

> [ 20.598985] [drm] Initialized drm 1.1.0 20060810
> [ 20.642302] [drm:mga_vram_init] *ERROR* can't reserve VRAM
> [ 20.642307] mgag200 :01:04.0: Fatal error during GPU init: -6
> [ 20.642319] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 20.664413] IP: [] drm_mode_config_cleanup+0x1f/0x1c0 [drm]
> [ 20.675905] PGD 40869b067 PUD 4086a4067 PMD 0
> [ 20.687362] Oops: [#1] SMP
> [ 20.698748] Modules linked in: igb(+) usb_storage(+) mgag200(+) ttm crc32c_intel ghash_clmulni_intel drm_kms_helper drm aesni_intel usb_libusual dca ablk_helper uas i2c_algo_bit sysimgblt cryptd sysfillrect syscopyarea ptp aes_x86_64 pps_core evdev joydev pcspkr aes_generic hid_generic fam15h_power(+) i2c_piix4(+) atiixp(+) k10temp i2c_core microcode ide_core amd64_edac_mod edac_core hwmon edac_mce_amd processor button uhci_hcd ext3 jbd mbcache raid1 md_mod usbhid hid ohci_hcd ehci_hcd usbcore usb_common uvesafb sd_mod crc_t10dif ahci libahci libata scsi_mod
> [ 20.750381] CPU 12
> [ 20.750478] Pid: 463, comm: udevd Not tainted 3.6.2 #4 Supermicro H8DGU/H8DGU
> [ 20.776696] RIP: 0010:[] [] drm_mode_config_cleanup+0x1f/0x1c0 [drm]
> [ 20.790249] RSP: 0018:8804086a3a88 EFLAGS: 00010296
> [ 20.803729] RAX: RBX: 881007f41000 RCX: 0043
> [ 20.817409] RDX: RSI: 0046 RDI: 881008d83000
> [ 20.831003] RBP: 8804086a3aa8 R08: 000a R09: 03ff
> [ 20.844580] R10: R11: 03fe R12: 881008d83000
> [ 20.858085] R13: 881008d83460 R14: 881007f41000 R15: 881008d833a0
> [ 20.871607] FS: 7fc87267c800() GS:88101ec0() knlGS:
> [ 20.885316] CS: 0010 DS: ES: CR0: 80050033
> [ 20.899017] CR2: CR3: 00040869a000 CR4: 000407e0
> [ 20.912916] DR0: DR1: DR2:
> [ 20.926724] DR3: DR6: 0ff0 DR7: 0400
> [ 20.940450] Process udevd (pid: 463, threadinfo 8804086a2000, task 88040846ee00)
> [ 20.942880] Probing IDE interface ide1...
> [ 20.968028] Stack:
> [ 20.981616] 881007f41000 881007f41000 881008d83000 a029a8e0
> [ 20.995514] 8804086a3ac8 a02942c7 fffa 881008ddd000
> [ 21.009470] 8804086a3b58 a029462e 8804086a3af8 a03c11a1
> [ 21.023443] Call Trace:
> [ 21.037295] [] mgag200_driver_unload+0x37/0x70 [mgag200]
> [ 21.051493] [] mgag200_driver_load+0x32e/0x4b0 [mgag200]
> [ 21.065600] [] ? drm_sysfs_device_add+0x81/0xb0 [drm]
> [ 21.079699] [] ? drm_get_minor+0x259/0x2f0 [drm]
> [ 21.093733] [] drm_get_pci_dev+0x17e/0x2c0 [drm]
> [ 21.107675] [] mga_pci_probe+0xb1/0xb9 [mgag200]
> [ 21.121582] [] local_pci_probe+0x74/0x100
> [ 21.135386] [] pci_device_probe+0x111/0x120
> [ 21.149106] [] driver_probe_device+0x76/0x240
> [ 21.162801] [] __driver_attach+0x9b/0xa0
> [ 21.176411] [] ? driver_probe_device+0x240/0x240
> [ 21.190062] [] bus_for_each_dev+0x4d/0x90
> [ 21.203724] [] driver_attach+0x19/0x20
> [ 21.217443] [] bus_add_driver+0x190/0x260
> [ 21.231260] [] ? 0xa02c4fff
> [ 21.245155] [] ? 0xa02c4fff
> [ 21.259047] [] driver_register+0x72/0x170
> [ 21.272998] [] ? 0xa02c4fff
> [ 21.286900] [] __pci_register_driver+0x59/0xd0
> [ 21.300840] [] ? 0xa02c4fff
> [ 21.314682] [] drm_pci_init+0x11a/0x130 [drm]
> [ 21.328540] [
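To make the two options I described concrete, here's a userspace sketch
(hypothetical structure and names -- not the drm code): option 1 makes
the common cleanup tolerate a never-initialized config, option 2 has the
driver's unload path only tear down the steps load actually completed.

```c
#include <assert.h>
#include <stdlib.h>

struct mode_config {
	int initialized;
	int *objects;       /* stands in for the mode object lists */
};

static void mode_config_init(struct mode_config *cfg)
{
	cfg->objects = calloc(16, sizeof(*cfg->objects));
	cfg->initialized = 1;
}

/* Option 1: cleanup itself tolerates a never-initialized config. */
static void mode_config_cleanup(struct mode_config *cfg)
{
	if (!cfg->initialized)
		return;         /* VRAM init failed: we never got this far */
	free(cfg->objects);
	cfg->objects = NULL;
	cfg->initialized = 0;
}

/* Option 2: the driver's unload path tracks how far load got. */
static void driver_unload(struct mode_config *cfg, int load_progress)
{
	if (load_progress >= 2)         /* 2 == mode config was set up */
		mode_config_cleanup(cfg);
	/* ... undo the earlier load steps here ... */
}
```

Either way, the NULL dereference in the backout path goes away; the
question for the maintainers is just where the check belongs.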
Re: [3.6.2] oops @ opteron server: mgag200 Fatal error during GPU init
On 10/19/2012 04:53 AM, Paweł Sikora wrote: Hi, on the new opteron server i'm observing an oops during matrox video initialization. here's the dmesg from pure 3.6.2 kernel: I haven't owned a G200 based Matrox in years, but based on code analysis and your output, it looks to me like the VRAM init failure results in our taking the unload/cleanup backout path well before the call to drm_mode_config_init() in mgag200_driver_load(). drm_mode_config_cleanup() doesn't handle that situation. So I would think either drm_mode_config_cleanup() itself needs revision to handle being called with an uninitialized data set (better general solution, but that may violate expectations and I'd think the maintainers would want to chime in on how to signify that) or we have the driver use some common sense and clean up what it really did. I've generated a patch for the latter, does it solve your immediate problem? It won't solve the VRAM init failure, I know. I've built it, but without a G200, haven't tested myself. 
Don Morris HP Mission Critical Linux [ 20.598985] [drm] Initialized drm 1.1.0 20060810 [ 20.642302] [drm:mga_vram_init] *ERROR* can't reserve VRAM [ 20.642307] mgag200 :01:04.0: Fatal error during GPU init: -6 [ 20.642319] BUG: unable to handle kernel NULL pointer dereference at (null) [ 20.664413] IP: [a03c364f] drm_mode_config_cleanup+0x1f/0x1c0 [drm] [ 20.675905] PGD 40869b067 PUD 4086a4067 PMD 0 [ 20.687362] Oops: [#1] SMP [ 20.698748] Modules linked in: igb(+) usb_storage(+) mgag200(+) ttm crc32c_intel ghash_clmulni_intel drm_kms_helper drm aesni_intel usb_libusual dca ablk_helper uas i2c_algo_bit sysimgblt cryptd sysfillrect syscopyarea ptp aes_x86_64 pps_core evdev joydev pcspkr aes_generic hid_generic fam15h_power(+) i2c_piix4(+) atiixp(+) k10temp i2c_core microcode ide_core amd64_edac_mod edac_core hwmon edac_mce_amd processor button uhci_hcd ext3 jbd mbcache raid1 md_mod usbhid hid ohci_hcd ehci_hcd usbcore usb_common uvesafb sd_mod crc_t10dif ahci libahci libata scsi_mod [ 20.750381] CPU 12 [ 20.750478] Pid: 463, comm: udevd Not tainted 3.6.2 #4 Supermicro H8DGU/H8DGU [ 20.776696] RIP: 0010:[a03c364f] [a03c364f] drm_mode_config_cleanup+0x1f/0x1c0 [drm] [ 20.790249] RSP: 0018:8804086a3a88 EFLAGS: 00010296 [ 20.803729] RAX: RBX: 881007f41000 RCX: 0043 [ 20.817409] RDX: RSI: 0046 RDI: 881008d83000 [ 20.831003] RBP: 8804086a3aa8 R08: 000a R09: 03ff [ 20.844580] R10: R11: 03fe R12: 881008d83000 [ 20.858085] R13: 881008d83460 R14: 881007f41000 R15: 881008d833a0 [ 20.871607] FS: 7fc87267c800() GS:88101ec0() knlGS: [ 20.885316] CS: 0010 DS: ES: CR0: 80050033 [ 20.899017] CR2: CR3: 00040869a000 CR4: 000407e0 [ 20.912916] DR0: DR1: DR2: [ 20.926724] DR3: DR6: 0ff0 DR7: 0400 [ 20.940450] Process udevd (pid: 463, threadinfo 8804086a2000, task 88040846ee00) [ 20.942880] Probing IDE interface ide1... 
[   20.968028] Stack:
[   20.981616]  881007f41000 881007f41000 881008d83000 a029a8e0
[   20.995514]  8804086a3ac8 a02942c7 fffa 881008ddd000
[   21.009470]  8804086a3b58 a029462e 8804086a3af8 a03c11a1
[   21.023443] Call Trace:
[   21.037295]  [a02942c7] mgag200_driver_unload+0x37/0x70 [mgag200]
[   21.051493]  [a029462e] mgag200_driver_load+0x32e/0x4b0 [mgag200]
[   21.065600]  [a03c11a1] ? drm_sysfs_device_add+0x81/0xb0 [drm]
[   21.079699]  [a03bd469] ? drm_get_minor+0x259/0x2f0 [drm]
[   21.093733]  [a03bfaae] drm_get_pci_dev+0x17e/0x2c0 [drm]
[   21.107675]  [a0299405] mga_pci_probe+0xb1/0xb9 [mgag200]
[   21.121582]  [8127f854] local_pci_probe+0x74/0x100
[   21.135386]  [8127f9f1] pci_device_probe+0x111/0x120
[   21.149106]  [813319e6] driver_probe_device+0x76/0x240
[   21.162801]  [81331c4b] __driver_attach+0x9b/0xa0
[   21.176411]  [81331bb0] ? driver_probe_device+0x240/0x240
[   21.190062]  [8132fd4d] bus_for_each_dev+0x4d/0x90
[   21.203724]  [81331509] driver_attach+0x19/0x20
[   21.217443]  [81331100] bus_add_driver+0x190/0x260
[   21.231260]  [a02c5000] ? 0xa02c4fff
[   21.245155]  [a02c5000] ? 0xa02c4fff
[   21.259047]  [813322d2] driver_register+0x72/0x170
[   21.272998]  [a02c5000] ? 0xa02c4fff
[   21.286900]  [8127e6c9] __pci_register_driver+0x59/0xd0
[   21.300840]  [a02c5000] ? 0xa02c4fff
[   21.314682
Re: [PATCH 00/33] AutoNUMA27
On 10/05/2012 05:11 PM, Andi Kleen wrote:
> Tim Chen <tim.c.c...@linux.intel.com> writes:
>>
>> I remembered that 3 months ago when Alex tested the numa/sched patches
>> there were 20% regression on SpecJbb2005 due to the numa balancer.
>
> 20% on anything sounds like a show stopper to me.
>
> -Andi

Much worse than that on an 8-way machine for a multi-node multi-threaded
process, from what I can tell. (Andrea's AutoNUMA microbenchmark is a
simple version of that). The contention on the page table lock
( &(&mm->page_table_lock)->rlock ) goes through the roof, with threads
constantly fighting to invalidate translations and re-fault them.

This is on a DL980 with Xeon E7-2870s @ 2.4 GHz, btw. Running linux-next
with no tweaks other than kernel.sched_migration_cost_ns = 50 gives:

numa01              8325.78
numa01_HARD_BIND     488.98

(The Hard Bind being a case where the threads are pre-bound to the node
set with their memory, so what should be a fairly "best case" for
comparison).

If the SchedNUMA scanning period is upped to 25000 ms (to keep repeated
invalidations from being triggered while the contention for the first
invalidation pass is still being fought over):

numa01              4272.93
numa01_HARD_BIND     498.98

Since this is a "big" process in the current SchedNUMA code and hence much
more likely to trip invalidations, forcing task_numa_big() to always
return false in order to avoid the frequent invalidations gives:

numa01               429.07
numa01_HARD_BIND     466.67

Finally, with SchedNUMA entirely disabled but the rest of linux-next left
intact:

numa01              1075.31
numa01_HARD_BIND     484.20

I didn't write down the lock contentions for comparison, but yes, the
contention does decrease similarly to the time decreases. There are other
microbenchmarks, but those suffice to show the regression pattern.

I mentioned this to the Red Hat folks last week, so I expect this is
already being worked on. It seemed pertinent to bring up given the
discussion about the current state of linux-next, though, just so folks
know.
From where I'm sitting, it looks to me like the scan period is way too
aggressive and there's too much work potentially attempted during a "scan"
(by which I mean the hard-tick-driven choice to invalidate in order to set
up potential migration faults). The current code walks/invalidates the
entire virtual address space, skipping few vmas. For a very large 64-bit
process, that's going to be a *lot* of translations (or even vmas, if the
address space is fragmented) to walk. That's a seriously long path coming
from the timer code. I would think capping the number of translations to
process per visit would help.

Hope this helps the discussion,
Don Morris
Re: [PATCH 16/19] sched, numa: NUMA home-node selection code
Re-sending to LKML due to mailer picking up an incorrect address. (Sorry
for the dupe).

On 09/26/2012 07:26 AM, Don Morris wrote:
> Peter --
>
> You may have / probably have already seen this, and if so I apologize
> in advance (can't find any sign of a fix via any searches...).
>
> I picked up your August sched/numa patch set and have been working on
> it with a 2-node and an 8-node configuration. Got a very intermittent
> crash on the 2-node which of course hasn't reproduced since I got the
> crash/kdump configured. (I suspect it is related, however.)
>
> On the 8-node, however, I very reliably got a hard lockup NMI after
> several minutes. This occurs reliably when running the first test of
> Andrea's autonuma-benchmark
> (git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git)
> (two processes, one thread per core/vcore, each looping over a single
> malloc space). I'll attach the full stack set from that crash.
>
> Since the NMI output seemed really consistent that the hard lockup
> stemmed from waiting for a spinlock that never seemed to be picked up,
> I turned on lock debugging in the .config and got a very clear, very
> consistent circular dependency warning (just below).
>
> As far as I can tell, the warning is correct and is consistent with
> the actual NMI crash output (a variant in that the "pidof" process on
> cpu 52 is going through task_sched_runtime() to do the task_rq_lock()
> operation on the numa01 process, which results in it getting the
> pi_lock and waiting for the rq->lock, when numa01 (back on CPU 0) had
> the rq->lock from scheduler_tick() and is going for the pi_lock via
> task_work_add()...).
>
> I'm nowhere near confident enough in my knowledge of the nuances of
> run queue locking during the tick update to try to hack a workaround -
> so sorry, no proposed patch fix here, just a bug report.
>
> On another minor note: while looking over this, and of course noticing
> that most other cpus were tied up waiting for the page lock on one of
> the huge pages (THP was of course on) while one of them busied itself
> invalidating across the other CPUs, the question comes to mind whether
> that's really needed.
>
> Yes, it certainly is needed in the true PROT_NONE case you're building
> off of, as you certainly can't allow access to a translation which is
> now supposed to be locked out. But you could allow transitory minor
> faults when going from PROT_NONE back to access, as the fault would
> clear the TLB anyway (at least on x86; any architecture which doesn't
> do that would have to have an explicit TLB invalidation for cases where
> the translation is detected as updated anyway, so that should be okay).
>
> In your case, I would think the transitory faults on what's really a
> hint to the system would probably be much better than tying up N-1
> other CPUs to do the flush on a process that spans the system,
> especially if the other processors are in a scenario where they're
> running that process but working on a different page (and hence may
> never even touch the page changing access anyway).
>
> Even in the case where you're adding the hint (access to NONE), you
> could be willing to miss an access in favor of letting the next context
> switch invalidate the TLB for you (again, there may be architectures
> where you'll never invalidate unless it is explicit; I think IPF was
> that way, but it has been a while), given you really need a non-trivial
> run time to merit doing this work and have a good chance of settling
> out to a good access pattern.
>
> Just a thought.
>
> Thanks for your work,
> Don Morris
>
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.6.0-rc4 #28 Not tainted
> -------------------------------------------------------
> numa01/35386 is trying to acquire lock:
>  (&p->pi_lock){-.-.-.}, at: [<81073e68>] task_work_add+0x38/0xa0
>
> but task is already holding lock:
>  (&rq->lock){-.-.-.}, at: [<81085d83>] scheduler_tick+0x53/0x150
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&rq->lock){-.-.-.}:
>        [<810b52e3>] validate_chain+0x633/0x730
>        [<810b57d2>] __lock_acquire+0x3f2/0x490
>        [<810b5959>] lock_acquire+0xe9/0x120
>        [<8152e306>] _raw_spin_lock+0x36/0x70
>        [<8108c1f1>] wake_up_new_task+0xd1/0x190
>        [<810513f2>] do_fork+0x1f2/0x280
>        [<8101bcd6>] kernel_thread+0x76/0x80
>        [<81513976>] rest_init+0x26/0xc0
>        [<81cdfeff>] start_kernel+0x3c6/0x3d3
>        [<81cdf356>] x86_64_start_reservations+0x131/0x136
>        [<81cdf45c>] x86_64_start_kernel+0x101/0x110
>
> -> #0 (&p->pi_lock){-.-.-.}:
>        [<810b48ef>] check_prev_add+0x11f/0x4e0
>        [<810b52e3>] validate_chain+0x633/0x730
>        [<810b57d2>] __lock_acquire+0x3f2/0x490
>        [<810b5959>] lock_acquire+0xe9/0x120
>        [<8152e4b5>] _raw_spin_lock_irqsave+0x55/0xa0
Re: [RFC][PATCH 14/26] sched, numa: Numa balancer
On 07/12/2012 03:02 PM, Rik van Riel wrote:
> On 03/16/2012 10:40 AM, Peter Zijlstra wrote:
>
> At LSF/MM, there was a presentation comparing Peter's NUMA code with
> Andrea's NUMA code. I believe this is the main reason why Andrea's code
> performed better in that particular test...
>
>> +    if (sched_feat(NUMA_BALANCE_FILTER)) {
>> +        /*
>> +         * Avoid moving ne's when we create a larger imbalance
>> +         * on the other end.
>> +         */
>> +        if ((imb->type & NUMA_BALANCE_CPU) &&
>> +            imb->cpu - cpu_moved < ne_cpu / 2)
>> +            goto next;
>> +
>> +        /*
>> +         * Avoid migrating ne's when we know we'll push our
>> +         * node over the memory limit.
>> +         */
>> +        if (max_mem_load &&
>> +            imb->mem_load + mem_moved + ne_mem > max_mem_load)
>> +            goto next;
>> +    }
>
> IIRC the test consisted of a 16GB NUMA system with two 8GB nodes. It
> was running 3 KVM guests: two guests of 3GB memory each, and one guest
> of 6GB.

How many cpus per guest (host threads) and how many physical/logical cpus
per node on the host? Any comparisons with a situation where the memory
would fit within nodes but the scheduling load would be too high?

Don

> With autonuma, the 6GB guest ended up on one node, and the 3GB guests
> on the other.
>
> With sched numa, each node had a 3GB guest, and part of the 6GB guest.
>
> There is a fundamental difference in the balancing between autonuma and
> sched numa.
>
> In sched numa, a process is moved over to the current node only if the
> current node has space for it.
>
> Autonuma, on the other hand, operates more of a "hostage exchange"
> policy, where a thread on one node is exchanged with a thread on
> another node, if it looks like that will reduce the overall number of
> cross-node NUMA faults in the system.
>
> I am not sure how to do a "hostage exchange" algorithm with sched numa,
> but it would seem like it could be necessary in order for some
> workloads to converge on a sane configuration.
>
> After all, with only about 2GB free on each node, you will never get to
> move either a 3GB guest, or parts of a 6GB guest...
>
> Any ideas?