Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
* Peter Zijlstra [2013-07-31 17:09:23]:

> On Tue, Jul 30, 2013 at 03:16:50PM +0530, Srikar Dronamraju wrote:
> > I am not against faults, and fault-based handling is very much needed.
> > I have stated that this approach is complementary to the numa faults
> > work that Mel is proposing.
> >
> > Right now I think if we can first get the tasks to consolidate on nodes
> > and then use the numa faults to place the tasks, then we would be able
> > to have a very good solution.
> >
> > Plain fault information actually causes confusion in enough cases,
> > especially if the initial set of pages is all consolidated into a few
> > nodes. With plain fault information, "memory follows cpu" and "cpu
> > follows memory" conflict with each other: memory wants to move to the
> > nodes where the tasks are currently running, while the tasks are trying
> > to move to the nodes where their current memory sits.
>
> Since task weights are a completely random measure the above story
> completely fails to make any sense. If you can collate on an arbitrary
> number, why can't you collate on faults?

Task weights contribute to cpu load, and we want to keep the loads
balanced and make sure that we don't do excessive consolidation where
we end up imbalanced across cpus/nodes. For example, in the numa02 case
(a single process running across all nodes), we don't want tasks to
consolidate or make the system imbalanced. So I thought task weights
would give me hints for when we should consolidate and when we should
back off from consolidation. How do I derive hints to stop
consolidation based on numa faults?

> The fact that the placement policies so far have not had inter-task
> relations doesn't mean it's not possible.

Do you have ideas that I could try out that would help build these
inter-task relations?

> > Also most of the consolidation that I have proposed is pretty
> > conservative or done at idle balance time. This would not affect
> > the numa faulting in any way. When I run with my patches (along with
> > some debug code), the consolidation happens pretty quickly.
> > Once consolidation has happened, numa faults would be of immense value.
>
> And also completely broken in various 'fun' ways. You're far too fond of
> nr_running for one.

Yeah, I too feel I am too attached to nr_running.

> Also, afaict it never does anything if the machine is overloaded and we
> never hit the !nr_running case in rebalance_domains.

Actually no: in most of my testing, cpu utilization is close to 100%,
and I have the find_numa_queue and preferred_node logic that should
kick in. My idea is that we could achieve consolidation much more
easily in an overloaded case, since we don't actually have to do active
migration. Further, there are hints at task wake-up time. If we can
make the load balancer intelligent enough that it schedules the right
task on the right cpu/node, will we still need to migrate tasks across
cpus on faults? Aren't we making the code complicated by introducing
too many more points where we do pseudo load balancing?

> > Here is how I am looking at the solution.
> >
> > 1. Till the initial scan delay, allow tasks to consolidate.
>
> I would really want to not change regular balance behaviour for now;
> you're also adding far too many atomic operations to the scheduler fast
> path, that's going to make people terribly unhappy.
>
> > 2. After the first scan delay to the next scan delay, account numa
> >    faults and allow memory to move, but don't use numa faults as yet
> >    to drive scheduling decisions. Here also tasks continue to
> >    consolidate.
> >
> >    This will lead to tasks and memory moving to specific nodes,
> >    leading to consolidation.
>
> This is just plain silly, once you have fault information you'd better
> use it to move tasks towards where the memory is, doing anything else
> is, like said, silly.
>
> > 3. After the second scan delay, continue to account numa faults and
> >    allow numa faults to drive scheduling decisions.
> >
> > Should we also use task weights at stage 3, or just numa faults, or
> > which one should get more preference, is something that I am not
> > clear about at this time. Right now, I would think we need to factor
> > in both of them.
> >
> > I think this approach would mean tasks get consolidated, but the
> > inter-process, inter-task relations that you are looking for also
> > remain strong.
> >
> > Is this an acceptable solution?
>
> No, again, task weight is a completely random number unrelated to
> anything we want to do. Furthermore we simply cannot add mm wide atomics
> to the scheduler hot paths.

How do I maintain per-mm, per-node data then?

--
Thanks and Regards
Srikar Dronamraju
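For illustration, here is one possible way to keep per-mm, per-node
weights without mm-wide atomics on the fast path: shard the counters
per cpu, so enqueue/dequeue only touches the local cpu's slot, and fold
the shards only on the (infrequent) balancing path. This is a minimal
userspace sketch with made-up names (mm_numa_stats, account_numa_weight
and the cpu-to-node map are all assumptions for the example), not the
patchset's actual code:

#include <stdio.h>

#define NR_CPUS      8
#define MAX_NUMNODES 4

/* which node each cpu belongs to: an assumption for this example */
static const int cpu_to_node[NR_CPUS] = { 0, 0, 1, 1, 2, 2, 3, 3 };

/*
 * Hypothetical per-mm stats: one slot per cpu, so the enqueue/dequeue
 * fast path only ever writes the local cpu's slot and never needs an
 * mm-wide atomic.
 */
struct mm_numa_stats {
        long weight[NR_CPUS];
};

/* fast path: called on enqueue (+w) and dequeue (-w) of a task */
static void account_numa_weight(struct mm_numa_stats *s, int cpu, long w)
{
        s->weight[cpu] += w;    /* cpu-local, no shared atomic needed */
}

/* slow path (idle/NUMA balance): fold the shards for one node */
static long mm_weight_on_node(const struct mm_numa_stats *s, int node)
{
        long sum = 0;

        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (cpu_to_node[cpu] == node)
                        sum += s->weight[cpu];
        return sum;
}

int main(void)
{
        struct mm_numa_stats s = { { 0 } };

        account_numa_weight(&s, 0, 1024);       /* task enqueued on cpu0 (node0) */
        account_numa_weight(&s, 2, 1024);       /* task enqueued on cpu2 (node1) */
        account_numa_weight(&s, 2, -1024);      /* and dequeued again            */

        printf("node0 weight: %ld\n", mm_weight_on_node(&s, 0));  /* 1024 */
        printf("node1 weight: %ld\n", mm_weight_on_node(&s, 1));  /* 0    */
        return 0;
}

The trade-off is that reads become more expensive, which may be
acceptable if they only happen on the idle/NUMA balancing path.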
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
* Peter Zijlstra [2013-07-30 11:33:21]:

> On Tue, Jul 30, 2013 at 02:45:43PM +0530, Srikar Dronamraju wrote:
> > Can you please suggest workloads that I could try which might showcase
> > why you hate the pure process based approach?
>
> 2 processes, 1 sysvshm segment. I know there's multi-process MPI
> libraries out there.
>
> Something like: perf bench numa mem -p 2 -G 4096 -0 -z --no-data_rand_walk -Z

The above dumped core; it looks like -T is a must with -G. I tried

	perf bench numa mem -p 2 -T 32 -G 4096 -0 -z --no-data_rand_walk -Z

It still didn't seem to do anything on my 4 node box (almost 2 hours
and nothing happened). Finally I ran "perf bench numa mem -a" (both
with ht disabled and enabled).

Convergence-wise my patchset did really well. Bandwidth looks like a
mixed bag: though there are improvements, we also see degradations. I
am not sure how to quantify which was the best among the three. The
nx1 tests were the ones where this patchset was negative, but it was
positive for all the others.

Is this what you were looking for? Or was it something else?

(Lower is better)
testcase                   3.9.0     Mels v5   this_patchset   Units
--------------------------------------------------------------------
1x3-convergence            0.320     100.060   100.204         secs
1x4-convergence            100.139   100.162   100.155         secs
1x6-convergence            100.455   100.179   1.078           secs
2x3-convergence            100.261   100.339   9.743           secs
3x3-convergence            100.213   100.168   10.073          secs
4x4-convergence            100.307   100.201   19.686          secs
4x4-convergence-NOTHP      100.229   100.221   3.189           secs
4x6-convergence            101.441   100.632   6.204           secs
4x8-convergence            100.680   100.588   5.275           secs
8x4-convergence            100.335   100.365   34.069          secs
8x4-convergence-NOTHP      100.331   100.412   100.478         secs
3x1-convergence            1.227     1.536     0.576           secs
4x1-convergence            1.224     1.063     1.390           secs
8x1-convergence            1.713     2.437     1.704           secs
16x1-convergence           2.750     2.677     1.856           secs
32x1-convergence           1.985     1.795     1.391           secs

(Higher is better)
testcase                   3.9.0     Mels v5   this_patchset   Units
--------------------------------------------------------------------
RAM-bw-local               3.341     3.340     3.325           GB/sec
RAM-bw-local-NOTHP         3.308     3.307     3.290           GB/sec
RAM-bw-remote              1.815     1.815     1.815           GB/sec
RAM-bw-local-2x            6.410     6.413     6.412           GB/sec
RAM-bw-remote-2x           3.020     3.041     3.027           GB/sec
RAM-bw-cross               4.397     3.425     4.374           GB/sec
2x1-bw-process             3.481     3.442     3.492           GB/sec
3x1-bw-process             5.423     7.547     5.445           GB/sec
4x1-bw-process             5.108     11.009    5.118           GB/sec
8x1-bw-process             8.929     10.935    8.825           GB/sec
8x1-bw-process-NOTHP       12.754    11.442    22.889          GB/sec
16x1-bw-process            12.886    12.685    13.546          GB/sec
4x1-bw-thread              19.147    17.964    9.622           GB/sec
8x1-bw-thread              26.342    30.171    14.679          GB/sec
16x1-bw-thread             41.527    36.363    40.070          GB/sec
32x1-bw-thread             45.005    40.950    49.846          GB/sec
2x3-bw-thread              9.493     14.444    8.145           GB/sec
4x4-bw-thread              18.309    16.382    45.384          GB/sec
4x6-bw-thread              14.524    18.502    17.058          GB/sec
4x8-bw-thread              13.315    16.852    33.693          GB/sec
4x8-bw-thread-NOTHP        12.273    12.226    24.887          GB/sec
3x3-bw-thread              17.614    11.960    16.119          GB/sec
5x5-bw-thread              13.415    17.585    24.251          GB/sec
2x16-bw-thread             11.718    11.174    17.971          GB/sec
1x32-bw-thread             11.360    10.902    14.330          GB/sec
numa02-bw                  48.999    44.173    54.795          GB/sec
numa02-bw-NOTHP            47.655    42.600    53.445          GB/sec
numa01-bw-thread           36.983    39.692    45.254          GB/sec
numa01-bw-thread-NOTHP     38.486    35.208    44.118          GB/sec

With HT ON

(Lower is better)
testcase                   3.9.0
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
* Andrew Theurer [2013-07-31 08:33:44]:

> VM-node00 |  49153(006%)   673792(083%)   51712(006%)   36352(004%)
>
> I think the consolidation is a nice concept, but it needs a much
> tighter integration with numa balancing. The action to clump tasks on
> the same node's runqueues should be triggered by detecting that they
> also access the same memory.

Thanks Andrew for testing and reporting your results and analysis.
Will try to focus on getting consolidation plus a tighter integration
with numa balancing.

--
Thanks and Regards
Srikar Dronamraju
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
On Tue, Jul 30, 2013 at 03:16:50PM +0530, Srikar Dronamraju wrote:
> I am not against faults, and fault-based handling is very much needed.
> I have stated that this approach is complementary to the numa faults
> work that Mel is proposing.
>
> Right now I think if we can first get the tasks to consolidate on nodes
> and then use the numa faults to place the tasks, then we would be able
> to have a very good solution.
>
> Plain fault information actually causes confusion in enough cases,
> especially if the initial set of pages is all consolidated into a few
> nodes. With plain fault information, "memory follows cpu" and "cpu
> follows memory" conflict with each other: memory wants to move to the
> nodes where the tasks are currently running, while the tasks are trying
> to move to the nodes where their current memory sits.

Since task weights are a completely random measure the above story
completely fails to make any sense. If you can collate on an arbitrary
number, why can't you collate on faults?

The fact that the placement policies so far have not had inter-task
relations doesn't mean it's not possible.

> Also most of the consolidation that I have proposed is pretty
> conservative or done at idle balance time. This would not affect
> the numa faulting in any way. When I run with my patches (along with
> some debug code), the consolidation happens pretty quickly.
> Once consolidation has happened, numa faults would be of immense value.

And also completely broken in various 'fun' ways. You're far too fond of
nr_running for one.

Also, afaict it never does anything if the machine is overloaded and we
never hit the !nr_running case in rebalance_domains.

> Here is how I am looking at the solution.
>
> 1. Till the initial scan delay, allow tasks to consolidate.

I would really want to not change regular balance behaviour for now;
you're also adding far too many atomic operations to the scheduler fast
path, that's going to make people terribly unhappy.

> 2. After the first scan delay to the next scan delay, account numa
>    faults and allow memory to move, but don't use numa faults as yet
>    to drive scheduling decisions. Here also tasks continue to
>    consolidate.
>
>    This will lead to tasks and memory moving to specific nodes,
>    leading to consolidation.

This is just plain silly, once you have fault information you'd better
use it to move tasks towards where the memory is, doing anything else
is, like said, silly.

> 3. After the second scan delay, continue to account numa faults and
>    allow numa faults to drive scheduling decisions.
>
> Should we also use task weights at stage 3, or just numa faults, or
> which one should get more preference, is something that I am not
> clear about at this time. Right now, I would think we need to factor
> in both of them.
>
> I think this approach would mean tasks get consolidated, but the
> inter-process, inter-task relations that you are looking for also
> remain strong.
>
> Is this an acceptable solution?

No, again, task weight is a completely random number unrelated to
anything we want to do. Furthermore we simply cannot add mm wide atomics
to the scheduler hot paths.
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
On Tue, 2013-07-30 at 13:18 +0530, Srikar Dronamraju wrote:
> Here is an approach that looks to consolidate workloads across nodes.
> This results in much improved performance. Again I would assume this
> work is complementary to Mel's work with numa faulting.
>
> Here are the advantages of this approach.
> 1. Provides excellent consolidation of tasks.
>    From my experiments, I have found that the better the task
>    consolidation, the better the memory layout we achieve, which
>    results in better performance.
>
> 2. Provides good improvement in most cases, but there are some
>    regressions.
>
> 3. Looks to extend the load balancer, especially when the cpus are
>    idling.
>
> Here is the outline of the approach.
>
> - Every process has a per node array where we store the weight of all
>   its tasks running on that node. This array gets updated on task
>   enqueue/dequeue.
>
> - Added a 2 pass mechanism (somewhat taken from numacore but not
>   exactly) while choosing tasks to move across nodes.
>
>   In the first pass, choose only tasks that are ideal to be moved.
>   While choosing a task, look at the per node process arrays to see
>   if moving the task helps. If the first pass fails to move a task,
>   any task can be chosen on the second pass.
>
> - If the regular load balancer (rebalance_domain()) fails to balance
>   the load (or finds no imbalance) and there is an idle cpu, use that
>   cpu to consolidate tasks to the nodes by using the information in
>   the per node process arrays.
>
>   Every idle cpu, if it doesn't have tasks queued after load balance,
>   - will walk through the cpus in its node and check if there are
>     buddy tasks that are not part of the node but should ideally have
>     been part of this node.
>   - To make sure that we don't pull all buddy tasks and create an
>     imbalance, we look at the load on the node, pinned tasks and the
>     process's contribution to the load for this node.
>   - Each cpu looks at the node which has the least number of buddy
>     tasks running and tries to pull tasks from such nodes.
>
> - Once it finds the cpu from which to pull the tasks, it triggers
>   active_balancing. This type of active balancing triggers just one
>   pass, i.e. it only fetches tasks that increase numa locality.
>
> Here are results of specjbb run on a 2 node machine.

Here's a comparison with 4 KVM VMs running dbench on a 4 socket, 40
core, 80 thread host:

kernel                          total dbench throughput
3.9-numabal-on                  21242
3.9-numabal-off                 20455
3.9-numabal-on-consolidate      22541
3.9-numabal-off-consolidate     21632
3.9-numabal-off-node-pinning    26450
3.9-numabal-on-node-pinning     25265

Based on the node pinning results, we have a long way to go, with
either numa-balancing and/or consolidation.

One thing the consolidation helps is actually getting the sibling
tasks running in the same node:

% CPU usage by node for 1st VM
node00   node01   node02   node03
094%     002%     001%     001%

However, the node which was chosen to consolidate tasks is not the
same node where most of the memory for the tasks is located:

% memory per node for 1st VM
            host-node00    host-node01    host-node02   host-node03
VM-node00   295937(034%)   550400(064%)   6144(000%)    0(000%)

By comparison, the same stats for numa-balancing on and no
consolidation:

% CPU usage by node for 1st VM
node00   node01   node02   node03
028%     027%     020%     023%    <- CPU usage spread across whole system

% memory per node for 1st VM
            host-node00    host-node01    host-node02   host-node03
VM-node00   49153(006%)    673792(083%)   51712(006%)   36352(004%)

I think the consolidation is a nice concept, but it needs a much
tighter integration with numa balancing. The action to clump tasks on
the same node's runqueues should be triggered by detecting that they
also access the same memory.

> Specjbb was run on 3 vms.
> In the fit case, one vm was big enough to fit one node size.
> In the no-fit case, one vm was bigger than the node size.
>
> -------------------------------------------------------------------------------------
> |kernel        |             nofit             |              fit              |     |
> |kernel        |     noksm     |      ksm      |     noksm     |      ksm      |     |
> |kernel        |  nothp|    thp|  nothp|    thp|  nothp|    thp|  nothp|    thp|   vm|
> -------------------------------------------------------------------------------------
> |v3.9          | 136056| 189423| 135359| 186722| 136983| 191669| 136728| 184253| vm_1|
> |v3.9          |  66041|  84779|  64564|  86645|  67426|  84427|  63657|  85043| vm_2|
> |v3.9          |  67322|  83301|  63731|  85394|  65015|  85156|  63838|  84199| vm_3|
> -------------------------------------------------------------------------------------
> |v3.9 + Mel(v5)| 133170| 177883|
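For illustration of the two-pass pick described in the quoted cover
letter above, here is a minimal, self-contained sketch; the types and
helpers (task, proc_node_weight, move_improves_locality) are made up
for the example and this is not the actual patch code:

#include <stdbool.h>
#include <stdio.h>

/* toy stand-ins for the patchset's structures */
struct task {
        struct task *next;
        const char *name;
        const long *proc_node_weight;   /* the process's per-node weights */
};

/*
 * Pass-1 filter: moving from src node to dst node is "ideal" when the
 * task's process already carries more weight on the destination node.
 */
static bool move_improves_locality(const struct task *p, int src, int dst)
{
        return p->proc_node_weight[dst] > p->proc_node_weight[src];
}

static struct task *pick_numa_task(struct task *queue, int src, int dst)
{
        /* pass 1: only tasks whose move increases consolidation */
        for (struct task *p = queue; p; p = p->next)
                if (move_improves_locality(p, src, dst))
                        return p;

        /* pass 2: the first pass found nothing, any task may be moved */
        return queue;
}

int main(void)
{
        /* process A is heavier on node1, process B on node0 */
        static const long wa[2] = { 100, 900 };
        static const long wb[2] = { 900, 100 };

        struct task t2 = { NULL, "B-task", wb };
        struct task t1 = { &t2, "A-task", wa };

        /* pulling from node0 to node1: the A-task is the ideal pick */
        printf("picked: %s\n", pick_numa_task(&t1, 0, 1)->name);
        return 0;
}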
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
* Peter Zijlstra [2013-07-30 11:10:21]:

> On Tue, Jul 30, 2013 at 02:33:45PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra [2013-07-30 10:20:01]:
> >
> > > On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > > > Here is an approach that looks to consolidate workloads across
> > > > > nodes. This results in much improved performance. Again I would
> > > > > assume this work is complementary to Mel's work with numa
> > > > > faulting.
> > > >
> > > > I highly dislike the use of task weights here. It seems completely
> > > > unrelated to the problem at hand.
> > >
> > > I also don't particularly like the fact that it's purely process
> > > based. The faults information we have gives much richer task
> > > relations.
> >
> > With just a pure fault information based approach, I am not seeing
> > any major improvement in task/memory consolidation. I still see
> > memory spread across different nodes and tasks getting ping-ponged
> > to different nodes. And if there are multiple unrelated processes,
> > then we see a mix of tasks of different processes in each of the
> > nodes.
>
> The fault thing isn't finished. Mel explicitly said it doesn't yet have
> inter-task relations. And you run everything in a VM which is like a big
> nasty mangler for anything sane.

I am not against faults, and fault-based handling is very much needed.
I have stated that this approach is complementary to the numa faults
work that Mel is proposing.

Right now I think if we can first get the tasks to consolidate on nodes
and then use the numa faults to place the tasks, then we would be able
to have a very good solution.

Plain fault information actually causes confusion in enough cases,
especially if the initial set of pages is all consolidated into a few
nodes. With plain fault information, "memory follows cpu" and "cpu
follows memory" conflict with each other: memory wants to move to the
nodes where the tasks are currently running, while the tasks are trying
to move to the nodes where their current memory sits.

Also most of the consolidation that I have proposed is pretty
conservative or done at idle balance time. This would not affect the
numa faulting in any way. When I run with my patches (along with some
debug code), the consolidation happens pretty quickly. Once
consolidation has happened, numa faults would be of immense value.

Here is how I am looking at the solution.

1. Till the initial scan delay, allow tasks to consolidate.

2. After the first scan delay to the next scan delay, account numa
   faults and allow memory to move, but don't use numa faults as yet to
   drive scheduling decisions. Here also tasks continue to consolidate.

   This will lead to tasks and memory moving to specific nodes, leading
   to consolidation.

3. After the second scan delay, continue to account numa faults and
   allow numa faults to drive scheduling decisions.

Should we also use task weights at stage 3, or just numa faults, or
which one should get more preference, is something that I am not clear
about at this time. Right now, I would think we need to factor in both
of them.

I think this approach would mean tasks get consolidated, but the
inter-process, inter-task relations that you are looking for also
remain strong.

Is this an acceptable solution?

--
Thanks and Regards
Srikar Dronamraju
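The staged hand-off described above could be gated off the per-mm scan
clock. Here is a minimal sketch of that gating; the names
(numa_stage_of, mm_start, scan_delay) are illustrative assumptions, not
the patchset's actual symbols:

#include <stdio.h>

/* stages of the proposed hand-off from consolidation to faults */
enum numa_stage {
        STAGE_CONSOLIDATE,      /* 1: task consolidation only           */
        STAGE_ACCOUNT_FAULTS,   /* 2: faults accounted, memory may move */
        STAGE_FAULT_PLACEMENT,  /* 3: faults also drive task placement  */
};

static enum numa_stage numa_stage_of(unsigned long now,
                                     unsigned long mm_start,
                                     unsigned long scan_delay)
{
        if (now - mm_start < scan_delay)
                return STAGE_CONSOLIDATE;
        if (now - mm_start < 2 * scan_delay)
                return STAGE_ACCOUNT_FAULTS;
        return STAGE_FAULT_PLACEMENT;
}

int main(void)
{
        /* with a scan delay of 1000 ticks, starting at tick 0 */
        printf("%d\n", numa_stage_of(500, 0, 1000));    /* 0: consolidate  */
        printf("%d\n", numa_stage_of(1500, 0, 1000));   /* 1: account only */
        printf("%d\n", numa_stage_of(2500, 0, 1000));   /* 2: fault driven */
        return 0;
}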
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
On Tue, Jul 30, 2013 at 02:45:43PM +0530, Srikar Dronamraju wrote:
> Can you please suggest workloads that I could try which might showcase
> why you hate the pure process based approach?

2 processes, 1 sysvshm segment. I know there's multi-process MPI
libraries out there.

Something like:

	perf bench numa mem -p 2 -G 4096 -0 -z --no-data_rand_walk -Z
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
On Tue, Jul 30, 2013 at 11:10:21AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 30, 2013 at 02:33:45PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra [2013-07-30 10:20:01]:
> >
> > > On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > > > Here is an approach that looks to consolidate workloads across
> > > > > nodes. This results in much improved performance. Again I would
> > > > > assume this work is complementary to Mel's work with numa
> > > > > faulting.
> > > >
> > > > I highly dislike the use of task weights here. It seems completely
> > > > unrelated to the problem at hand.
> > >
> > > I also don't particularly like the fact that it's purely process
> > > based. The faults information we have gives much richer task
> > > relations.
> >
> > With just a pure fault information based approach, I am not seeing
> > any major improvement in task/memory consolidation. I still see
> > memory spread across different nodes and tasks getting ping-ponged
> > to different nodes. And if there are multiple unrelated processes,
> > then we see a mix of tasks of different processes in each of the
> > nodes.
>
> The fault thing isn't finished. Mel explicitly said it doesn't yet have
> inter-task relations. And you run everything in a VM which is like a big
> nasty mangler for anything sane.

Also, the last time you posted this, I already said that if you'd use
the faults data to do grouping you'd get similar results. Task weight
is a completely unrelated and random measure. I think you even conceded
this. So I really don't get why you're still using task weight for
this.

Also, Ingo already showed that you can get task grouping from the fault
information itself, no need to use mm information to do this.
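As a toy illustration of grouping tasks purely from fault statistics
(this is not Ingo's actual code; the structure and the overlap metric
are assumptions for the example): each task keeps a per-node fault
count, and two tasks are considered related when their faults land
mostly on the same nodes.

#include <stdio.h>

#define MAX_NUMNODES 4

/* hypothetical per-task NUMA fault counts, one slot per node */
struct task_faults {
        long faults[MAX_NUMNODES];
};

/*
 * Fraction (in 1/1024 units) of the two tasks' combined faults that
 * fall on nodes where both tasks fault: a crude relatedness measure.
 */
static long fault_overlap(const struct task_faults *a,
                          const struct task_faults *b)
{
        long shared = 0, total = 0;

        for (int n = 0; n < MAX_NUMNODES; n++) {
                long fa = a->faults[n], fb = b->faults[n];

                total += fa + fb;
                if (fa && fb)
                        shared += fa + fb;
        }
        return total ? (shared * 1024) / total : 0;
}

int main(void)
{
        struct task_faults t1 = { { 900, 100, 0, 0 } };
        struct task_faults t2 = { { 800, 0, 200, 0 } };
        struct task_faults t3 = { { 0, 0, 50, 950 } };

        /* t1 and t2 mostly fault on node0: strongly related (870/1024) */
        printf("t1-t2 overlap: %ld/1024\n", fault_overlap(&t1, &t2));
        /* t1 and t3 share no hot node: unrelated (0/1024) */
        printf("t1-t3 overlap: %ld/1024\n", fault_overlap(&t1, &t3));
        return 0;
}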
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
* Peter Zijlstra [2013-07-30 10:20:01]:

> On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > Here is an approach that looks to consolidate workloads across
> > > nodes. This results in much improved performance. Again I would
> > > assume this work is complementary to Mel's work with numa faulting.
> >
> > I highly dislike the use of task weights here. It seems completely
> > unrelated to the problem at hand.
>
> I also don't particularly like the fact that it's purely process based.
> The faults information we have gives much richer task relations.

Peter,

Can you please suggest workloads that I could try which might showcase
why you hate the pure process based approach? I know numa02_SMT does
regress with my patches, but I think that is mostly my implementation's
fault and not an approach issue.

--
Thanks and Regards
Srikar Dronamraju
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
On Tue, Jul 30, 2013 at 02:33:45PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra [2013-07-30 10:20:01]:
>
> > On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > > Here is an approach that looks to consolidate workloads across
> > > > nodes. This results in much improved performance. Again I would
> > > > assume this work is complementary to Mel's work with numa
> > > > faulting.
> > >
> > > I highly dislike the use of task weights here. It seems completely
> > > unrelated to the problem at hand.
> >
> > I also don't particularly like the fact that it's purely process
> > based. The faults information we have gives much richer task
> > relations.
>
> With just a pure fault information based approach, I am not seeing any
> major improvement in task/memory consolidation. I still see memory
> spread across different nodes and tasks getting ping-ponged to
> different nodes. And if there are multiple unrelated processes, then
> we see a mix of tasks of different processes in each of the nodes.

The fault thing isn't finished. Mel explicitly said it doesn't yet have
inter-task relations. And you run everything in a VM which is like a big
nasty mangler for anything sane.
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
* Peter Zijlstra [2013-07-30 10:20:01]:

> On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > Here is an approach that looks to consolidate workloads across
> > > nodes. This results in much improved performance. Again I would
> > > assume this work is complementary to Mel's work with numa faulting.
> >
> > I highly dislike the use of task weights here. It seems completely
> > unrelated to the problem at hand.
>
> I also don't particularly like the fact that it's purely process based.
> The faults information we have gives much richer task relations.

With just a pure fault information based approach, I am not seeing any
major improvement in task/memory consolidation. I still see memory
spread across different nodes and tasks getting ping-ponged to
different nodes. And if there are multiple unrelated processes, then we
see a mix of tasks of different processes in each of the nodes.

This spreading of load, as per my observation, isn't helping
performance. This is especially true with bigger boxes, and I would
take this as a hint that we need to consolidate tasks for better
performance.

Now I could just use the number of tasks rather than task weights as I
do with the current patchset, but I don't think that would be ideal
either; especially, it wouldn't work with fair share scheduling. For
example: let's say there are 2 vms running similar loads on a 2 node
machine. We would get the best performance if we could cleanly
segregate the load. I know all problems cannot be generalized into just
this set; my thinking is to get at least this set of problems solved.

Do you see any alternatives other than numa faults/task weights that we
could use to better consolidate tasks?

--
Thanks and Regards
Srikar Dronamraju
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > Here is an approach that looks to consolidate workloads across
> > nodes. This results in much improved performance. Again I would
> > assume this work is complementary to Mel's work with numa faulting.
>
> I highly dislike the use of task weights here. It seems completely
> unrelated to the problem at hand.

I also don't particularly like the fact that it's purely process based.
The faults information we have gives much richer task relations.
Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> Here is an approach that looks to consolidate workloads across nodes.
> This results in much improved performance. Again I would assume this
> work is complementary to Mel's work with numa faulting.

I highly dislike the use of task weights here. It seems completely
unrelated to the problem at hand.