Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-31 Thread Srikar Dronamraju
* Peter Zijlstra  [2013-07-31 17:09:23]:

> On Tue, Jul 30, 2013 at 03:16:50PM +0530, Srikar Dronamraju wrote:
> > I am not against faults, and fault-based handling is very much needed.
> > I have listed that this approach is complementary to the numa faults
> > that Mel is proposing.
> > 
> > Right now I think that if we can first get the tasks to consolidate on
> > nodes and then use the numa faults to place the tasks, we would be able
> > to have a very good solution.
> > 
> > Plain fault information actually causes confusion in a fair number of
> > cases, especially if the initial set of pages is all consolidated into
> > a small set of nodes. With plain fault information, "memory follows
> > cpu" and "cpu follows memory" conflict with each other: memory wants to
> > move to the nodes where the tasks are currently running, while the
> > tasks plan to move to the nodes where the memory currently is.
> 
> Since task weights are a completely random measure the above story
> completely fails to make any sense. If you can collate on an arbitrary
> number, why can't you collate on faults?

Task weights contribute to cpu load, and we want to keep the loads
balanced and make sure that we don't do excessive consolidation where we
end up imbalanced across cpus/nodes. For example, in the numa02 case
(a single process running across all nodes), we don't want tasks to
consolidate and make the system imbalanced. So I thought task weights
would give me hints about whether we should keep consolidating or back
off from consolidation. How do I derive hints to stop consolidation
based on numa faults?
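
To make the question concrete, this is the kind of heuristic I have in
mind -- the names and the threshold are made up, it is only a sketch of
backing off once the faults look evenly spread across nodes:

#include <stdbool.h>

/*
 * Sketch: keep consolidating a process only while its recent NUMA
 * faults are concentrated.  If no single node accounts for a clear
 * majority of the faults, the working set probably does not fit on
 * one node and we back off.  The threshold is made up.
 */
static bool keep_consolidating(const unsigned long *faults_per_node,
			       int nr_nodes)
{
	unsigned long total = 0, max = 0;
	int nid;

	for (nid = 0; nid < nr_nodes; nid++) {
		total += faults_per_node[nid];
		if (faults_per_node[nid] > max)
			max = faults_per_node[nid];
	}

	if (!total)
		return true;		/* no fault data yet */

	return max * 2 >= total;	/* back off once faults spread out */
}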

> 
> The fact that the placement policies so far have not had inter-task
> relations doesn't mean it's not possible.
> 

Do you have ideas that I can try out that could help build these
inter-task relations?

> > Also, most of the consolidation that I have proposed is either pretty
> > conservative or done at idle balance time. This would not affect the
> > numa faulting in any way. When I run with my patches (along with some
> > debug code), the consolidation happens pretty quickly. Once
> > consolidation has happened, numa faults would be of immense value.
> 
> And also completely broken in various 'fun' ways. You're far too fond of
> nr_running for one.

Yeah, I too feel I am too attached to nr_running.
> 
> Also, afaict it never does anything if the machine is overloaded and we
> never hit the !nr_running case in rebalance_domains.

Actually no; in most of my testing, cpu utilization is close to 100%.
And I have find_numa_queue and preferred_node logic that should kick in.
My idea is that we could achieve consolidation much more easily in an
overloaded case, since we don't actually have to do active migration.
Further, there are hints at task wake-up time.

If we can make the load balancer intelligent enough that it schedules
the right task on the right cpu/node, will we still need to migrate
tasks on faults? Aren't we making the code complicated by introducing
too many more points where we do pseudo load balancing?


> 
> > Here is how I am looking at the solution.
> > 
> > 1. Till the initial scan delay, allow tasks to consolidate
> 
> I would really want to not change regular balance behaviour for now;
> you're also adding far too many atomic operations to the scheduler fast
> path, and that's going to make people terribly unhappy.
> 
> > 2. From the first scan delay to the next scan delay, account numa
> >    faults and allow memory to move, but don't use numa faults yet to
> >    drive scheduling decisions. Here also tasks continue to consolidate.
> > 
> >    This will lead to tasks and memory moving to specific nodes, and
> >    thus to consolidation.
> 
> This is just plain silly; once you have fault information you'd better
> use it to move tasks towards where the memory is. Doing anything else
> is, like I said, silly.
> 
> > 3. After the second scan delay, continue to account numa faults and
> >    allow numa faults to drive scheduling decisions.
> > 
> > Whether we should also use task weights at stage 3 or just numa faults,
> > and which one should get more preference, is something that I am not
> > clear about at this time. For now, I would think we would need to
> > factor in both of them.
> > 
> > I think this approach would mean tasks get consolidated while the
> > inter-process and inter-task relations that you are looking for also
> > remain strong.
> > 
> > Is this an acceptable solution?
> 
> No, again, task weight is a completely random number unrelated to
> anything we want to do. Furthermore, we simply cannot add mm-wide
> atomics to the scheduler hot paths.
> 

How do I maintain per-mm per-node data?
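
For example, something along these lines is what I am thinking of -- all
names here are hypothetical; the point is only that the hot
enqueue/dequeue path touches cpu-local deltas, and the mm-wide array is
only updated from slower paths:

#include <stdatomic.h>

#define MAX_NUMNODES	8	/* illustration only */

/* Hypothetical per-mm NUMA stats, one slot per node. */
struct mm_numa_stats {
	atomic_long node_weight[MAX_NUMNODES];
};

/* Hypothetical per-cpu buffer: the hot enqueue/dequeue path only
 * touches this cpu-local delta, never an mm-wide atomic. */
struct mm_numa_percpu {
	long delta[MAX_NUMNODES];
};

static void account_enqueue(struct mm_numa_percpu *pcpu, int nid, long weight)
{
	pcpu->delta[nid] += weight;	/* cpu-local, no atomics */
}

static void account_dequeue(struct mm_numa_percpu *pcpu, int nid, long weight)
{
	pcpu->delta[nid] -= weight;
}

/* Folded occasionally (e.g. from idle or periodic balance), not on
 * every enqueue/dequeue, so the shared array sees far fewer atomics. */
static void fold_numa_stats(struct mm_numa_stats *stats,
			    struct mm_numa_percpu *pcpu)
{
	int nid;

	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		if (!pcpu->delta[nid])
			continue;
		atomic_fetch_add(&stats->node_weight[nid], pcpu->delta[nid]);
		pcpu->delta[nid] = 0;
	}
}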

-- 
Thanks and Regards
Srikar Dronamraju



Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-31 Thread Srikar Dronamraju
* Peter Zijlstra  [2013-07-30 11:33:21]:

> On Tue, Jul 30, 2013 at 02:45:43PM +0530, Srikar Dronamraju wrote:
> 
> > Can you please suggest workloads that I could try which might showcase
> > why you hate pure process based approach?
> 
> 2 processes, 1 sysvshm segment. I know there's multi-process MPI
> libraries out there.
> 
> Something like: perf bench numa mem -p 2 -G 4096 -0 -z --no-data_rand_walk -Z
> 

The above dumped core; looks like -T is a must with -G.

I tried "perf bench numa mem -p 2 -T 32 -G 4096 -0 -z --no-data_rand_walk -Z".
It still didn't seem to do anything on my 4-node box (almost 2 hours
and nothing happened).

Finally I ran "perf bench numa mem -a"
(both with ht disabled and enabled)

Convergence-wise, my patchset did really well.

Bandwidth looks like a mixed bag: though there are improvements, we also
see degradations. I am not sure how to quantify which of the three was
best. The nx1 tests were the ones where this patchset was negative, but
it was positive for all the others.

Is this what you were looking for? Or was it something else?

(Lower is better)
testcase                3.9.0     Mel's v5   this_patchset   Units
-------------------------------------------------------------------
1x3-convergence         0.320     100.060    100.204         secs
1x4-convergence         100.139   100.162    100.155         secs
1x6-convergence         100.455   100.179    1.078           secs
2x3-convergence         100.261   100.339    9.743           secs
3x3-convergence         100.213   100.168    10.073          secs
4x4-convergence         100.307   100.201    19.686          secs
4x4-convergence-NOTHP   100.229   100.221    3.189           secs
4x6-convergence         101.441   100.632    6.204           secs
4x8-convergence         100.680   100.588    5.275           secs
8x4-convergence         100.335   100.365    34.069          secs
8x4-convergence-NOTHP   100.331   100.412    100.478         secs
3x1-convergence         1.227     1.536      0.576           secs
4x1-convergence         1.224     1.063      1.390           secs
8x1-convergence         1.713     2.437      1.704           secs
16x1-convergence        2.750     2.677      1.856           secs
32x1-convergence        1.985     1.795      1.391           secs


(Higher is better)
testcase                3.9.0     Mel's v5   this_patchset   Units
-------------------------------------------------------------------
RAM-bw-local            3.341     3.340      3.325           GB/sec
RAM-bw-local-NOTHP      3.308     3.307      3.290           GB/sec
RAM-bw-remote           1.815     1.815      1.815           GB/sec
RAM-bw-local-2x         6.410     6.413      6.412           GB/sec
RAM-bw-remote-2x        3.020     3.041      3.027           GB/sec
RAM-bw-cross            4.397     3.425      4.374           GB/sec
2x1-bw-process          3.481     3.442      3.492           GB/sec
3x1-bw-process          5.423     7.547      5.445           GB/sec
4x1-bw-process          5.108     11.009     5.118           GB/sec
8x1-bw-process          8.929     10.935     8.825           GB/sec
8x1-bw-process-NOTHP    12.754    11.442     22.889          GB/sec
16x1-bw-process         12.886    12.685     13.546          GB/sec
4x1-bw-thread           19.147    17.964     9.622           GB/sec
8x1-bw-thread           26.342    30.171     14.679          GB/sec
16x1-bw-thread          41.527    36.363     40.070          GB/sec
32x1-bw-thread          45.005    40.950     49.846          GB/sec
2x3-bw-thread           9.493     14.444     8.145           GB/sec
4x4-bw-thread           18.309    16.382     45.384          GB/sec
4x6-bw-thread           14.524    18.502     17.058          GB/sec
4x8-bw-thread           13.315    16.852     33.693          GB/sec
4x8-bw-thread-NOTHP     12.273    12.226     24.887          GB/sec
3x3-bw-thread           17.614    11.960     16.119          GB/sec
5x5-bw-thread           13.415    17.585     24.251          GB/sec
2x16-bw-thread          11.718    11.174     17.971          GB/sec
1x32-bw-thread          11.360    10.902     14.330          GB/sec
numa02-bw               48.999    44.173     54.795          GB/sec
numa02-bw-NOTHP         47.655    42.600     53.445          GB/sec
numa01-bw-thread        36.983    39.692     45.254          GB/sec
numa01-bw-thread-NOTHP  38.486    35.208     44.118          GB/sec



With HT ON

(Lower is better)
testcase                3.9.0

Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-31 Thread Srikar Dronamraju
* Andrew Theurer  [2013-07-31 08:33:44]:
>  -----------     -----------     -----------    -----------
>  VM-node00   49153(006%)     673792(083%)    51712(006%)    36352(004%)
> 
> I think the consolidation is a nice concept, but it needs a much tighter
> integration with numa balancing.  The action to clump tasks on the same
> node's runqueues should be triggered by detecting that they also access
> the same memory.
> 

Thanks Andrew for testing and reporting your results and analysis.
Will try to focus on getting consolidation + tighter integration with
numa balancing.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-31 Thread Peter Zijlstra
On Tue, Jul 30, 2013 at 03:16:50PM +0530, Srikar Dronamraju wrote:
> I am not against faults, and fault-based handling is very much needed.
> I have listed that this approach is complementary to the numa faults
> that Mel is proposing.
> 
> Right now I think that if we can first get the tasks to consolidate on
> nodes and then use the numa faults to place the tasks, we would be able
> to have a very good solution.
> 
> Plain fault information actually causes confusion in a fair number of
> cases, especially if the initial set of pages is all consolidated into
> a small set of nodes. With plain fault information, "memory follows
> cpu" and "cpu follows memory" conflict with each other: memory wants to
> move to the nodes where the tasks are currently running, while the
> tasks plan to move to the nodes where the memory currently is.

Since task weights are a completely random measure the above story
completely fails to make any sense. If you can collate on an arbitrary
number, why can't you collate on faults?

The fact that the placement policies so far have not had inter-task
relations doesn't mean it's not possible.

> Also, most of the consolidation that I have proposed is either pretty
> conservative or done at idle balance time. This would not affect the
> numa faulting in any way. When I run with my patches (along with some
> debug code), the consolidation happens pretty quickly. Once
> consolidation has happened, numa faults would be of immense value.

And also completely broken in various 'fun' ways. You're far too fond of
nr_running for one.

Also, afaict it never does anything if the machine is overloaded and we
never hit the !nr_running case in rebalance_domains.

> Here is how I am looking at the solution.
> 
> 1. Till the initial scan delay, allow tasks to consolidate

I would really want to not change regular balance behaviour for now;
you're also adding far too many atomic operations to the scheduler fast
path, and that's going to make people terribly unhappy.

> 2. From the first scan delay to the next scan delay, account numa
>    faults and allow memory to move, but don't use numa faults yet to
>    drive scheduling decisions. Here also tasks continue to consolidate.
> 
>    This will lead to tasks and memory moving to specific nodes, and
>    thus to consolidation.

This is just plain silly; once you have fault information you'd better
use it to move tasks towards where the memory is. Doing anything else
is, like I said, silly.

> 3. After the second scan delay, continue to account numa faults and
>    allow numa faults to drive scheduling decisions.
> 
> Whether we should also use task weights at stage 3 or just numa faults,
> and which one should get more preference, is something that I am not
> clear about at this time. For now, I would think we would need to
> factor in both of them.
> 
> I think this approach would mean tasks get consolidated while the
> inter-process and inter-task relations that you are looking for also
> remain strong.
> 
> Is this an acceptable solution?

No, again, task weight is a completely random number unrelated to
anything we want to do. Furthermore, we simply cannot add mm-wide
atomics to the scheduler hot paths.


Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-31 Thread Andrew Theurer
On Tue, 2013-07-30 at 13:18 +0530, Srikar Dronamraju wrote:
> Here is an approach that looks to consolidate workloads across nodes.
> This results in much improved performance. Again I would assume this work
> is complementary to Mel's work with numa faulting.
> 
> Here are the advantages of this approach.
> 1. Provides excellent consolidation of tasks.
>  From my experiments, I have found that the better the task
>  consolidation, the better the memory layout we achieve, which results
>  in better performance.
> 
> 2. Provides good improvement in most cases, but there are some regressions.
> 
> 3. Looks to extend the load balancer, especially when the cpus are idling.
> 
> Here is the outline of the approach.
> 
> - Every process has a per node array where we store the weight of all
>   its tasks running on that node. This array gets updated on task
>   enqueue/dequeue.
> 
> - Added a 2 pass mechanism (somewhat taken from numacore but not
>   exactly) while choosing tasks to move across nodes.
> 
>   In the first pass, choose only tasks that are ideal to be moved.
>   While choosing a task, look at the per node process arrays to see if
>   moving the task helps.
>   If the first pass fails to move a task, any task can be chosen on the
>   second pass.
> 
> - If the regular load balancer (rebalance_domain()) fails to balance the
>   load (or finds no imbalance) and there is an idle cpu, use that cpu to
>   consolidate tasks to the nodes by using the information in the per
>   node process arrays.
> 
>   Every idle cpu, if it doesn't have tasks queued after load balance,
>   - will walk through the cpus in its node and check if there are buddy
> tasks that are not part of the node but ideally should have been
> part of this node.
>   - To make sure that we don't pull all buddy tasks and create an
> imbalance, we look at the load on the node, pinned tasks and the
> process's contribution to the load for this node.
>   - Each cpu looks at the node which has the least number of buddy tasks
> running and tries to pull the tasks from such nodes.
> 
>   - Once it finds the cpu from which to pull the tasks, it triggers
> active_balancing. This type of active balancing triggers just one
> pass, i.e. it only fetches tasks that increase numa locality.
> 
> Here are results of specjbb run on a 2 node machine.

Here's a comparison with 4 KVM VMs running dbench on a 4 socket, 40
core, 80 thread host.

kernel                          total dbench throughput

3.9-numabal-on                  21242
3.9-numabal-off                 20455
3.9-numabal-on-consolidate      22541
3.9-numabal-off-consolidate     21632
3.9-numabal-off-node-pinning    26450
3.9-numabal-on-node-pinning     25265

Based on the node pinning results, we have a long way to go, with either
numa-balancing and/or consolidation.  One thing the consolidation helps
is actually getting the sibling tasks running in the same node:

% CPU usage by node for 1st VM
node00   node01   node02   node03
  094%     002%     001%     001%

However, the node which was chosen to consolidate tasks is
not the same node where most of the memory for the tasks is located:

% memory per node for 1st VM
             host-node00     host-node01     host-node02    host-node03
             -----------     -----------     -----------    -----------
 VM-node00   295937(034%)    550400(064%)    6144(000%)     0(000%)


By comparison, same stats for numa-balancing on and no consolidation:

% CPU usage by node for 1st VM
node00   node01   node02   node03
  028%     027%     020%     023%   <- CPU usage spread across whole system

% memory per node for 1st VM
             host-node00     host-node01     host-node02    host-node03
             -----------     -----------     -----------    -----------
 VM-node00   49153(006%)     673792(083%)    51712(006%)    36352(004%)

I think the consolidation is a nice concept, but it needs a much tighter
integration with numa balancing.  The action to clump tasks on the same
node's runqueues should be triggered by detecting that they also access
the same memory.

> Specjbb was run on 3 vms.
> In the fit case, one vm was sized to fit within one node.
> In the no-fit case, one vm was bigger than the node size.
> 
> ------------------------------------------------------------------------------
> |kernel|             nofit             |              fit              |  vm |
> |kernel|     noksm     |      ksm      |     noksm     |      ksm      |  vm |
> |kernel|  nothp|    thp|  nothp|    thp|  nothp|    thp|  nothp|    thp|  vm |
> ------------------------------------------------------------------------------
> |v3.9  | 136056| 189423| 135359| 186722| 136983| 191669| 136728| 184253| vm_1|
> |v3.9  |  66041|  84779|  64564|  86645|  67426|  84427|  63657|  85043| vm_2|
> |v3.9  |  67322|  83301|  63731|  85394|  65015|  85156|  63838|  84199| vm_3|
> ------------------------------------------------------------------------------
> |v3.9 + Mel(v5)| 133170| 177883|
> 

Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-30 Thread Srikar Dronamraju
* Peter Zijlstra  [2013-07-30 11:10:21]:

> On Tue, Jul 30, 2013 at 02:33:45PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra  [2013-07-30 10:20:01]:
> > 
> > > On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > > > Here is an approach that looks to consolidate workloads across nodes.
> > > > > This results in much improved performance. Again I would assume this 
> > > > > work
> > > > > is complementary to Mel's work with numa faulting.
> > > > 
> > > > I highly dislike the use of task weights here. It seems completely
> > > > unrelated to the problem at hand.
> > > 
> > > I also don't particularly like the fact that it's purely process based.
> > > The faults information we have gives much richer task relations.
> > > 
> > 
> > With just a pure fault-information-based approach, I am not seeing any
> > major improvement in task/memory consolidation. I still see memory
> > spread across different nodes and tasks getting ping-ponged to different
> > nodes. And if there are multiple unrelated processes, then we see a mix
> > of tasks of different processes on each of the nodes.
> 
> The fault thing isn't finished. Mel explicitly said it doesn't yet have
> inter-task relations. And you run everything in a VM which is like a big
> nasty mangler for anything sane.
> 

I am not against faults, and fault-based handling is very much needed.
I have listed that this approach is complementary to the numa faults
that Mel is proposing.

Right now I think that if we can first get the tasks to consolidate on
nodes and then use the numa faults to place the tasks, we would be able
to have a very good solution.

Plain fault information actually causes confusion in a fair number of
cases, especially if the initial set of pages is all consolidated into
a small set of nodes. With plain fault information, "memory follows
cpu" and "cpu follows memory" conflict with each other: memory wants to
move to the nodes where the tasks are currently running, while the
tasks plan to move to the nodes where the memory currently is.

Also, most of the consolidation that I have proposed is either pretty
conservative or done at idle balance time. This would not affect the
numa faulting in any way. When I run with my patches (along with some
debug code), the consolidation happens pretty quickly. Once
consolidation has happened, numa faults would be of immense value.

Here is how I am looking at the solution.

1. Till the initial scan delay, allow tasks to consolidate

2. From the first scan delay to the next scan delay, account numa
   faults and allow memory to move, but don't use numa faults yet to
   drive scheduling decisions. Here also tasks continue to consolidate.

   This will lead to tasks and memory moving to specific nodes, and
   thus to consolidation.

3. After the second scan delay, continue to account numa faults and
allow numa faults to drive scheduling decisions.
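
As a rough illustration of the staging I have in mind (the struct
fields and the helper below are made up for illustration, they are not
from the patchset):

/* Hypothetical per-mm state: when NUMA scanning started and the scan
 * delay, both in jiffies-like time units. */
struct mm_numa_state {
	unsigned long scan_start;
	unsigned long scan_delay;
};

enum numa_stage {
	NUMA_CONSOLIDATE_ONLY,	/* stage 1: consolidate, ignore faults     */
	NUMA_ACCOUNT_FAULTS,	/* stage 2: account faults, move memory    */
	NUMA_FAULTS_DRIVE,	/* stage 3: faults drive task placement    */
};

static enum numa_stage current_numa_stage(const struct mm_numa_state *s,
					  unsigned long now)
{
	if (now < s->scan_start + s->scan_delay)
		return NUMA_CONSOLIDATE_ONLY;
	if (now < s->scan_start + 2 * s->scan_delay)
		return NUMA_ACCOUNT_FAULTS;
	return NUMA_FAULTS_DRIVE;
}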

Whether we should also use task weights at stage 3 or just numa faults,
and which one should get more preference, is something that I am not
clear about at this time. For now, I would think we would need to
factor in both of them.

I think this approach would mean tasks get consolidated while the
inter-process and inter-task relations that you are looking for also
remain strong.

Is this an acceptable solution?

-- 
Thanks and Regards
Srikar Dronamraju



Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-30 Thread Peter Zijlstra
On Tue, Jul 30, 2013 at 02:45:43PM +0530, Srikar Dronamraju wrote:

> Can you please suggest workloads that I could try which might showcase
> why you hate a pure process-based approach?

2 processes, 1 sysvshm segment. I know there's multi-process MPI
libraries out there.

Something like: perf bench numa mem -p 2 -G 4096 -0 -z --no-data_rand_walk -Z
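
A minimal sketch of the shape of that workload, just to show what a
purely per-process policy cannot see (segment size and loop counts are
arbitrary): two separate processes, i.e. separate mm's, repeatedly
touching one SysV shared-memory segment.

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_SIZE (256UL << 20)		/* 256 MB, arbitrary */

int main(void)
{
	int id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
	if (id < 0) { perror("shmget"); return 1; }

	pid_t child = fork();
	if (child < 0) { perror("fork"); return 1; }

	/* Parent and child are separate processes but keep touching
	 * the very same shared segment. */
	char *p = shmat(id, NULL, 0);
	if (p == (void *)-1) { perror("shmat"); return 1; }

	for (int iter = 0; iter < 100; iter++)
		memset(p, iter, SEG_SIZE);

	shmdt(p);
	if (child) {			/* parent reaps and removes */
		waitpid(child, NULL, 0);
		shmctl(id, IPC_RMID, NULL);
	}
	return 0;
}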


Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-30 Thread Peter Zijlstra
On Tue, Jul 30, 2013 at 11:10:21AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 30, 2013 at 02:33:45PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra  [2013-07-30 10:20:01]:
> > 
> > > On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > > > Here is an approach that looks to consolidate workloads across nodes.
> > > > > This results in much improved performance. Again I would assume this 
> > > > > work
> > > > > is complementary to Mel's work with numa faulting.
> > > > 
> > > > I highly dislike the use of task weights here. It seems completely
> > > > unrelated to the problem at hand.
> > > 
> > > I also don't particularly like the fact that it's purely process based.
> > > The faults information we have gives much richer task relations.
> > > 
> > 
> > With just a pure fault-information-based approach, I am not seeing any
> > major improvement in task/memory consolidation. I still see memory
> > spread across different nodes and tasks getting ping-ponged to different
> > nodes. And if there are multiple unrelated processes, then we see a mix
> > of tasks of different processes on each of the nodes.
> 
> The fault thing isn't finished. Mel explicitly said it doesn't yet have
> inter-task relations. And you run everything in a VM which is like a big
> nasty mangler for anything sane.

Also, the last time you posted this, I already said that if you'd use
the faults data to do grouping you'd get similar results. Task weight
is a completely unrelated and random measure. I think you even conceded
this.

So I really don't get why you're still using task weight for this.

Also, Ingo already showed that you can get task grouping from the fault
information itself, no need to use mm information to do this.
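
Roughly along these lines -- a purely illustrative sketch, not Ingo's
actual code; the per-task per-node fault arrays and the threshold are
made up:

#include <stdbool.h>

#define NR_NODES	4	/* illustration only */

/*
 * Sketch: treat two tasks as a group when their per-node fault
 * histograms overlap strongly.  a[] and b[] hold faults per node for
 * each task; the 25% threshold is made up.
 */
static bool tasks_look_related(const unsigned long *a, const unsigned long *b)
{
	unsigned long overlap = 0, total = 0;
	int nid;

	for (nid = 0; nid < NR_NODES; nid++) {
		overlap += (a[nid] < b[nid]) ? a[nid] : b[nid];
		total   += a[nid] + b[nid];
	}

	return total && overlap * 4 >= total;
}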


Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-30 Thread Srikar Dronamraju
* Peter Zijlstra  [2013-07-30 10:20:01]:

> On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > Here is an approach that looks to consolidate workloads across nodes.
> > > This results in much improved performance. Again I would assume this work
> > > is complementary to Mel's work with numa faulting.
> > 
> > I highly dislike the use of task weights here. It seems completely
> > unrelated to the problem at hand.
> 
> I also don't particularly like the fact that it's purely process based.
> The faults information we have gives much richer task relations.
> 

Peter, 

Can you please suggest workloads that I could try which might showcase
why you hate a pure process-based approach?

I know numa02_SMT does regress with my patches, but I think that is
mostly my implementation's fault and not an issue with the approach.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-30 Thread Peter Zijlstra
On Tue, Jul 30, 2013 at 02:33:45PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra  [2013-07-30 10:20:01]:
> 
> > On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > > Here is an approach that looks to consolidate workloads across nodes.
> > > > This results in much improved performance. Again I would assume this 
> > > > work
> > > > is complementary to Mel's work with numa faulting.
> > > 
> > > I highly dislike the use of task weights here. It seems completely
> > > unrelated to the problem at hand.
> > 
> > I also don't particularly like the fact that it's purely process based.
> > The faults information we have gives much richer task relations.
> > 
> 
> > With just a pure fault-information-based approach, I am not seeing any
> > major improvement in task/memory consolidation. I still see memory
> > spread across different nodes and tasks getting ping-ponged to different
> > nodes. And if there are multiple unrelated processes, then we see a mix
> > of tasks of different processes on each of the nodes.

The fault thing isn't finished. Mel explicitly said it doesn't yet have
inter-task relations. And you run everything in a VM which is like a big
nasty mangler for anything sane.


Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-30 Thread Srikar Dronamraju
* Peter Zijlstra  [2013-07-30 10:20:01]:

> On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > Here is an approach that looks to consolidate workloads across nodes.
> > > This results in much improved performance. Again I would assume this work
> > > is complementary to Mel's work with numa faulting.
> > 
> > I highly dislike the use of task weights here. It seems completely
> > unrelated to the problem at hand.
> 
> I also don't particularly like the fact that it's purely process based.
> The faults information we have gives much richer task relations.
> 

With just a pure fault-information-based approach, I am not seeing any
major improvement in task/memory consolidation. I still see memory
spread across different nodes and tasks getting ping-ponged to different
nodes. And if there are multiple unrelated processes, then we see a mix
of tasks of different processes on each of the nodes.

This spreading of load, as per my observation, isn't helping
performance. This is especially true with bigger boxes, and I would take
this as a hint that we need to consolidate tasks for better performance.

Now I could just use the number of tasks rather than task weights as I
do with the current patchset, but I don't think that would be ideal
either. Especially since this wouldn't work with fair-share scheduling.
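
To make the fair-share point concrete, a made-up example (the weights
roughly follow the kernel's nice-to-weight table; everything else here
is hypothetical): the same nr_running can mean very different load.

#include <stdio.h>

struct task { int weight; };	/* e.g. 1024 for a nice-0 task */

static int node_load(const struct task *t, int nr)
{
	int load = 0;
	for (int i = 0; i < nr; i++)
		load += t[i].weight;
	return load;
}

int main(void)
{
	struct task node0[] = { {1024}, {1024} };  /* two nice-0 tasks   */
	struct task node1[] = { {15},   {15}   };  /* two nice +19 tasks */

	/* Same nr_running on both nodes, wildly different load. */
	printf("node0: nr=2 load=%d\n", node_load(node0, 2));
	printf("node1: nr=2 load=%d\n", node_load(node1, 2));
	return 0;
}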

For example: let's say there are 2 vms running similar loads on a 2-node
machine. We would get the best performance if we could easily segregate
the load. I know all problems cannot be generalized into just this set;
my thinking is to get at least this set of problems solved.

Do you see any alternatives other than numa faults/task weights that we
could use to better consolidate tasks?

-- 
Thanks and Regards
Srikar Dronamraju



Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-30 Thread Peter Zijlstra
On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > Here is an approach that looks to consolidate workloads across nodes.
> > This results in much improved performance. Again I would assume this work
> > is complementary to Mel's work with numa faulting.
> 
> I highly dislike the use of task weights here. It seems completely
> unrelated to the problem at hand.

I also don't particularly like the fact that it's purely process based.
The faults information we have gives much richer task relations.


Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

2013-07-30 Thread Peter Zijlstra
On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> Here is an approach that looks to consolidate workloads across nodes.
> This results in much improved performance. Again I would assume this work
> is complementary to Mel's work with numa faulting.

I highly dislike the use of task weights here. It seems completely
unrelated to the problem at hand.



