[RFC PATCH 0/4] pseudo-interleaving NUMA placement

2013-11-26 Thread riel
This patch set attempts to implement a pseudo-interleaving policy for workloads that do not fit in one NUMA node. For each NUMA group, we track the NUMA nodes on which the workload is actively running, and try to concentrate the memory on those NUMA nodes. Unfortunately, the scheduler appears to

[RFC PATCH 4/4] use active_nodes nodemask to decide on numa migrations

2013-11-26 Thread riel
From: Rik van Riel r...@redhat.com Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive
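
The preview lists the placement criteria; they lend themselves to a simple gate on the destination node. Below is a minimal userspace sketch of the idea, assuming a per-group bitmask of active nodes; the names (numa_group_sketch, should_migrate) are illustrative and do not mirror kernel/sched/fair.c.

/*
 * Illustrative userspace sketch: gate NUMA page migrations on a
 * per-group "active nodes" bitmask.  Policy details are simplified
 * and do not mirror kernel/sched/fair.c.
 */
#include <stdbool.h>
#include <stdio.h>

struct numa_group_sketch {
        unsigned long active_nodes;     /* bit n set => workload runs on node n */
};

static bool node_is_active(const struct numa_group_sketch *ng, int nid)
{
        return ng->active_nodes & (1UL << nid);
}

/* Only pull a page toward a node the workload actively runs on. */
static bool should_migrate(const struct numa_group_sketch *ng,
                           int src_nid, int dst_nid)
{
        if (!node_is_active(ng, dst_nid))
                return false;   /* never pull memory to an idle node */
        if (node_is_active(ng, src_nid))
                return false;   /* both nodes in use: leave the page where it is */
        return true;            /* consolidate from an inactive to an active node */
}

int main(void)
{
        struct numa_group_sketch ng = { .active_nodes = 0x3 }; /* nodes 0 and 1 */

        printf("2->1: %d, 0->1: %d, 0->3: %d\n",
               should_migrate(&ng, 2, 1),   /* 1: pull into the active set */
               should_migrate(&ng, 0, 1),   /* 0: already spread over active nodes */
               should_migrate(&ng, 0, 3));  /* 0: node 3 is not used */
        return 0;
}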

[RFC PATCH 1/4] remove p->numa_migrate_deferred

2013-11-26 Thread riel
From: Rik van Riel r...@redhat.com Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance

[RFC PATCH 2/4] track from which nodes NUMA faults are triggered

2013-11-26 Thread riel
From: Rik van Riel r...@redhat.com Track which nodes NUMA faults are triggered from. This uses a similar mechanism to what is used to track the memory involved in numa faults. This is used, in the next patch, to build up a bitmap of which nodes a workload is actively running on. Signed-off

[RFC PATCH 3/4] build per numa_group active node mask from faults_from statistics

2013-11-26 Thread riel
From: Rik van Riel r...@redhat.com The faults_from statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Signed-off-by: Rik van Riel r...@redhat.com --- kernel/sched/fair.c | 33
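
A rough sketch of how such a mask could be derived from per-node fault counts; the 1/3-of-maximum cutoff is an arbitrary example, not the threshold used in the posted patch.

/*
 * Illustrative sketch: mark a node "active" when the NUMA faults
 * triggered from its CPUs are a sizable fraction of the busiest
 * node's count.  The 1/3-of-max cutoff is an example only.
 */
#include <stdio.h>

#define MAX_NODES 4

static unsigned long build_active_nodes(const unsigned long faults_from[MAX_NODES])
{
        unsigned long max = 0, mask = 0;
        int nid;

        for (nid = 0; nid < MAX_NODES; nid++)
                if (faults_from[nid] > max)
                        max = faults_from[nid];

        for (nid = 0; nid < MAX_NODES; nid++)
                if (max && faults_from[nid] * 3 > max)  /* more than max/3 */
                        mask |= 1UL << nid;

        return mask;
}

int main(void)
{
        unsigned long faults_from[MAX_NODES] = { 900, 850, 40, 0 };

        printf("active_nodes mask: 0x%lx\n", build_active_nodes(faults_from)); /* 0x3 */
        return 0;
}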

[PATCH -mm 2/3] mm,numa: reorganize change_pmd_range

2014-02-18 Thread riel
From: Rik van Riel r...@redhat.com Reorganize the order of ifs in change_pmd_range a little, in preparation for the next patch. Signed-off-by: Rik van Riel r...@redhat.com Cc: Peter Zijlstra pet...@infradead.org Cc: Andrea Arcangeli aarca...@redhat.com Reported-by: Xing Gang gang.x...@hp.com

[PATCH -mm 0/3] fix numa vs kvm scalability issue

2014-02-18 Thread riel
The NUMA scanning code can end up iterating over many gigabytes of unpopulated memory, especially in the case of a freshly started KVM guest with lots of memory. This results in the mmu notifier code being called even when there are no mapped pages in a virtual address range. The amount of time

[PATCH -mm 1/3] sched,numa: add cond_resched to task_numa_work

2014-02-18 Thread riel
From: Rik van Riel r...@redhat.com Normally task_numa_work scans over a fairly small amount of memory, but it is possible to run into a large unpopulated part of virtual memory, with no pages mapped. In that case, task_numa_work can run for a while, and it may make sense to reschedule as required
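
The change described here amounts to a periodic yield point inside the scan loop. A minimal userspace sketch, with sched_yield() standing in for the kernel's cond_resched():

/*
 * Illustrative userspace sketch: a long address-range scan with a
 * periodic yield point, in the spirit of calling cond_resched() from
 * task_numa_work.  sched_yield() is only a stand-in here.
 */
#include <sched.h>
#include <stdio.h>

#define PAGE_SIZE_SK    4096UL
#define YIELD_EVERY     (1UL << 16)     /* yield every 64K pages scanned (example) */

static void scan_range(unsigned long start, unsigned long end)
{
        unsigned long addr, scanned = 0;

        for (addr = start; addr < end; addr += PAGE_SIZE_SK) {
                /* ... inspect the mapping at 'addr' here ... */
                if (++scanned % YIELD_EVERY == 0)
                        sched_yield();  /* kernel code would use cond_resched() */
        }
        printf("scanned %lu pages\n", scanned);
}

int main(void)
{
        scan_range(0, 1UL << 30);       /* pretend to walk 1GB of address space */
        return 0;
}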

[PATCH -mm 3/3] move mmu notifier call from change_protection to change_pmd_range

2014-02-18 Thread riel
From: Rik van Riel r...@redhat.com The NUMA scanning code can end up iterating over many gigabytes of unpopulated memory, especially in the case of a freshly started KVM guest with lots of memory. This results in the mmu notifier code being called even when there are no mapped pages in a virtual
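
The gist is to fire the invalidation callback only for sub-ranges that actually contain mapped pages. A hedged userspace sketch, with a plain function call standing in for the kernel's mmu notifier machinery:

/*
 * Illustrative sketch: fire the invalidation callback only for
 * sub-ranges that actually contain mapped pages, instead of once for
 * the whole (possibly unpopulated) range.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CHUNKS 8

static const bool chunk_populated[NR_CHUNKS] = {
        false, false, true, true, false, false, false, true
};

static void invalidate_range(int chunk)
{
        printf("notifier called for chunk %d\n", chunk);
}

static void change_protection_sketch(void)
{
        for (int c = 0; c < NR_CHUNKS; c++) {
                if (!chunk_populated[c])
                        continue;       /* nothing mapped: skip the notifier */
                invalidate_range(c);
                /* ... change protections on the mapped pages in this chunk ... */
        }
}

int main(void)
{
        change_protection_sketch();     /* fires for chunks 2, 3 and 7 only */
        return 0;
}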

[PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies

2014-05-08 Thread riel
From: Rik van Riel r...@redhat.com Workloads that span multiple NUMA nodes benefit greatly from being placed on nearby nodes. There are two common configurations on 8 node NUMA systems. One has four islands of 2 tightly coupled nodes, another has two islands of 4 tightly coupled nodes. When
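
A sketch of distance-weighted scoring under assumed values; the 4-node distance table and the weighting formula are examples only, not the patch's actual code.

/*
 * Illustrative sketch: adjust a node's placement score by the scores
 * of nearby nodes, weighted down with distance.  The 4-node distance
 * table and the weighting formula are examples only.
 */
#include <stdio.h>

#define NR_NODES        4
#define LOCAL_DIST      10

static const int distance[NR_NODES][NR_NODES] = {
        { 10, 16, 32, 32 },
        { 16, 10, 32, 32 },
        { 32, 32, 10, 16 },
        { 32, 32, 16, 10 },
};

static long weighted_score(const long score[NR_NODES], int nid, int max_dist)
{
        long total = score[nid];

        for (int other = 0; other < NR_NODES; other++) {
                if (other == nid)
                        continue;
                /* closer nodes contribute a larger share of their score */
                total += score[other] * (max_dist - distance[nid][other]) /
                         (max_dist - LOCAL_DIST);
        }
        return total;
}

int main(void)
{
        long score[NR_NODES] = { 100, 80, 10, 5 };

        for (int nid = 0; nid < NR_NODES; nid++)
                printf("node %d: raw %ld, weighted %ld\n",
                       nid, score[nid], weighted_score(score, nid, 32));
        return 0;
}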

[PATCH 4/4] sched,numa: pull workloads towards their preferred nodes

2014-05-08 Thread riel
From: Rik van Riel r...@redhat.com Give a bonus to nodes near a workload's preferred node. This will pull workloads towards their preferred node. For workloads that span multiple NUMA nodes, pseudo-interleaving will even out the memory use between nodes over time, causing the preferred node

[PATCH 1/4] numa,x86: store maximum numa node distance

2014-05-08 Thread riel
From: Rik van Riel r...@redhat.com Store the maximum node distance, so the numa placement code can do better placement on systems with complex numa topology. The function max_node_distance will return LOCAL_DISTANCE if the system has simple NUMA topology, with only a single level of remote

[PATCH 3/4] sched,numa: store numa_group's preferred nid

2014-05-08 Thread riel
From: Rik van Riel r...@redhat.com Store a numa_group's preferred nid. Used by the next patch to pull workloads towards their preferred nodes. Signed-off-by: Rik van Riel r...@redhat.com Tested-by: Chegu Vinod chegu_vi...@hp.com --- kernel/sched/fair.c | 3 +++ 1 file changed, 3 insertions

[PATCH 0/4] sched,numa: task placement for complex NUMA topologies

2014-05-08 Thread riel
This patch series adds code for placement of tasks on a NUMA system with complex NUMA topology. The code is fairly well isolated, and does not impact things on systems with directly connected NUMA topology. The strategy is to adjust the score of each node, by the score of nearby NUMA nodes,

[PATCH 0/3] sched,numa: further numa balancing fixes

2014-06-14 Thread riel
A few more bug fixes that seem to improve convergence of perf bench numa mem -m -0 -P 1000 -p X -t Y for various values of X and Y, on both 4 and 8 node systems. This does not address the issue I highlighted Friday: https://lkml.org/lkml/2014/6/13/529 I have an idea on how to fix that issue,

[PATCH 1/3] sched,numa: use group's max nid as task's preferred nid

2014-06-14 Thread riel
From: Rik van Riel r...@redhat.com From task_numa_placement, always try to consolidate the tasks in a group on the group's top nid. In case this task is part of a group that is interleaved over multiple nodes, task_numa_migrate will set the task's preferred nid to the best node it could find

[PATCH 3/3] sched,numa: use effective_load to balance NUMA loads

2014-06-14 Thread riel
From: Rik van Riel r...@redhat.com When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places on a CPU is determined by the group the task is in. This is conveniently calculated for us by effective_load(), which task_numa_compare should use. The active groups on the source

[PATCH 2/3] sched,numa: move power adjustment into load_too_imbalanced

2014-06-14 Thread riel
From: Rik van Riel r...@redhat.com Currently the NUMA code scales the load on each node with the amount of CPU power available on that node, but it does not apply any adjustment to the load of the task that is being moved over. On systems with SMT/HT, this results in a task being weighed much
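
A rough sketch of an imbalance check that scales both sides by node capacity; the cross-multiplication and the 25% tolerance are illustrative assumptions, not the kernel's constants.

/*
 * Illustrative sketch: compare per-capacity load on two nodes by
 * cross-multiplying with the other node's capacity, so SMT siblings
 * (lower capacity) weigh a moved task appropriately.  The 25%
 * tolerance is an example, not the kernel's constant.
 */
#include <stdbool.h>
#include <stdio.h>

struct node_state {
        long load;              /* summed task weights on the node */
        long capacity;          /* e.g. 1024 per full core, less per SMT sibling */
};

static bool load_too_imbalanced(long src_load, long dst_load,
                                const struct node_state *src,
                                const struct node_state *dst)
{
        /* comparing src_load/src_cap with dst_load/dst_cap without dividing: */
        long a = src_load * dst->capacity;
        long b = dst_load * src->capacity;

        return a < b && (b - a) * 4 > b;        /* dst ends up >25% "fuller" */
}

int main(void)
{
        struct node_state src = { .load = 4096, .capacity = 2048 };
        struct node_state dst = { .load = 0,    .capacity = 1024 };
        long task_load = 2048;

        /* would moving this heavy task overload the low-capacity node? */
        printf("too imbalanced after move: %d\n",
               load_too_imbalanced(src.load - task_load,
                                   dst.load + task_load, &src, &dst));  /* 1 */
        return 0;
}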

[PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing

2014-01-20 Thread riel
The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being

[PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics

2014-01-20 Thread riel
From: Rik van Riel r...@redhat.com The faults_from statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Cc: Peter Zijlstra pet...@infradead.org Cc: Mel Gorman mgor...@suse.de Cc: Ingo Molnar mi...@redhat.com Cc

[PATCH 6/6] numa,sched: do statistics calculation using local variables only

2014-01-20 Thread riel
From: Rik van Riel r...@redhat.com The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. The NUMA balancing code looks at those per-process variables, and having other

[PATCH 4/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations

2014-01-20 Thread riel
From: Rik van Riel r...@redhat.com Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive

[PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred

2014-01-20 Thread riel
From: Rik van Riel r...@redhat.com Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now

[PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered

2014-01-20 Thread riel
From: Rik van Riel r...@redhat.com Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes

[PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use

2014-01-20 Thread riel
From: Rik van Riel r...@redhat.com The tracepoint has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitude more memory than the threads that do all the active work
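
A small sketch of the weighting idea: scale each thread's fault counts by its recent CPU use before summing them into the group statistics. All numbers and field names are illustrative.

/*
 * Illustrative sketch: scale each thread's per-node fault counts by
 * how much CPU it recently used, so a mostly-idle thread that touches
 * a lot of memory (e.g. a garbage collector) does not dominate the
 * group statistics.  All values and field names are examples.
 */
#include <stdio.h>

#define NR_NODES        2
#define NR_THREADS      2
#define SCALE           1000

struct thread_stats {
        unsigned long faults_from[NR_NODES];    /* NUMA faults per faulting node */
        unsigned long cpu_use_permille;         /* recent CPU use, 0..1000 */
};

int main(void)
{
        /* thread 0: busy worker; thread 1: GC that faults a lot but rarely runs */
        struct thread_stats t[NR_THREADS] = {
                { { 200,   10 }, 950 },
                { {  50, 5000 },  30 },
        };
        unsigned long group_score[NR_NODES] = { 0, 0 };

        for (int i = 0; i < NR_THREADS; i++)
                for (int n = 0; n < NR_NODES; n++)
                        group_score[n] += t[i].faults_from[n] *
                                          t[i].cpu_use_permille / SCALE;

        for (int n = 0; n < NR_NODES; n++)
                printf("node %d weighted score: %lu\n", n, group_score[n]);
        /* node 0 wins even though the GC's raw fault count favours node 1 */
        return 0;
}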

[PATCH v4 0/9] pseudo-interleaving for automatic NUMA balancing

2014-01-21 Thread riel
The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being

[PATCH 1/9] numa,sched,mm: remove p->numa_migrate_deferred

2014-01-21 Thread riel
From: Rik van Riel r...@redhat.com Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now

[PATCH 8/9] numa,sched: rename variables in task_numa_fault

2014-01-21 Thread riel
From: Rik van Riel r...@redhat.com We track both the node of the memory after a NUMA fault, and the node of the CPU on which the fault happened. Rename the local variables in task_numa_fault to make things more explicit. Suggested-by: Mel Gorman mgor...@suse.de Signed-off-by: Rik van Riel r

[PATCH 4/9] numa,sched: build per numa_group active node mask from numa_faults_cpu statistics

2014-01-21 Thread riel
From: Rik van Riel r...@redhat.com The numa_faults_cpu statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Cc: Peter Zijlstra pet...@infradead.org Cc: Mel Gorman mgor...@suse.de Cc: Ingo Molnar mi...@redhat.com

[PATCH 9/9] numa,sched: define some magic numbers

2014-01-21 Thread riel
From: Rik van Riel r...@redhat.com Cleanup suggested by Mel Gorman. Now the code contains some more hints on what statistics go where. Suggested-by: Mel Gorman mgor...@suse.de Signed-off-by: Rik van Riel r...@redhat.com --- kernel/sched/fair.c | 34 +- 1 file

[PATCH 3/9] numa,sched: track from which nodes NUMA faults are triggered

2014-01-21 Thread riel
From: Rik van Riel r...@redhat.com Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes

[PATCH 5/9] numa,sched,mm: use active_nodes nodemask to limit numa migrations

2014-01-21 Thread riel
From: Rik van Riel r...@redhat.com Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive

[PATCH 2/9] rename p->numa_faults to numa_faults_memory

2014-01-21 Thread riel
From: Rik van Riel r...@redhat.com In order to get a more consistent naming scheme, making it clear which fault statistics track memory locality, and which track CPU locality, rename the memory fault statistics. Suggested-by: Mel Gorman mgor...@suse.de Signed-off-by: Rik van Riel r...@redhat.com

[PATCH 6/9] numa,sched: normalize faults_cpu stats and weigh by CPU use

2014-01-21 Thread riel
From: Rik van Riel r...@redhat.com Tracing the code that decides the active nodes has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitude more memory than the threads

[PATCH 7/9] numa,sched: do statistics calculation using local variables only

2014-01-21 Thread riel
From: Rik van Riel r...@redhat.com The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. The NUMA balancing code looks at those per-process variables, and having other

[PATCH 5/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations

2014-01-16 Thread riel
From: Rik van Riel r...@surriel.com Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid

[PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use

2014-01-16 Thread riel
From: Rik van Riel r...@surriel.com The tracepoint has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitude more memory than the threads that do all the active work

[PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics

2014-01-16 Thread riel
From: Rik van Riel r...@surriel.com The faults_from statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Cc: Peter Zijlstra pet...@infradead.org Cc: Mel Gorman mgor...@suse.de Cc: Ingo Molnar mi...@redhat.com Cc

[PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes

2014-01-16 Thread riel
From: Rik van Riel r...@surriel.com Being able to see how the active nodemask changes over time, and why, can be quite useful. Cc: Peter Zijlstra pet...@infradead.org Cc: Mel Gorman mgor...@suse.de Cc: Ingo Molnar mi...@redhat.com Cc: Chegu Vinod chegu_vi...@hp.com Signed-off-by: Rik van Riel r

[PATCH 0/6] pseudo-interleaving for automatic NUMA balancing

2014-01-16 Thread riel
The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being

[PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred

2014-01-16 Thread riel
From: Rik van Riel r...@surriel.com Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance

[PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered

2014-01-16 Thread riel
From: Rik van Riel r...@surriel.com Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which

[PATCH v2 0/7] pseudo-interleaving for automatic NUMA balancing

2014-01-17 Thread riel
The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being

[PATCH 1/7] numa,sched,mm: remove p->numa_migrate_deferred

2014-01-17 Thread riel
From: Rik van Riel r...@redhat.com Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now

[PATCH 2/7] numa,sched: track from which nodes NUMA faults are triggered

2014-01-17 Thread riel
From: Rik van Riel r...@redhat.com Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes

[PATCH 3/7] numa,sched: build per numa_group active node mask from faults_from statistics

2014-01-17 Thread riel
From: Rik van Riel r...@redhat.com The faults_from statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Cc: Peter Zijlstra pet...@infradead.org Cc: Mel Gorman mgor...@suse.de Cc: Ingo Molnar mi...@redhat.com Cc

[PATCH 4/7] numa,sched: tracepoints for NUMA balancing active nodemask changes

2014-01-17 Thread riel
From: Rik van Riel r...@redhat.com Being able to see how the active nodemask changes over time, and why, can be quite useful. Cc: Peter Zijlstra pet...@infradead.org Cc: Mel Gorman mgor...@suse.de Cc: Ingo Molnar mi...@redhat.com Cc: Chegu Vinod chegu_vi...@hp.com Signed-off-by: Rik van Riel r

[PATCH 5/7] numa,sched,mm: use active_nodes nodemask to limit numa migrations

2014-01-17 Thread riel
From: Rik van Riel r...@redhat.com Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive

[PATCH 6/7] numa,sched: normalize faults_from stats and weigh by CPU use

2014-01-17 Thread riel
From: Rik van Riel r...@redhat.com The tracepoint has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitude more memory than the threads that do all the active work

[PATCH 7/7] numa,sched: do statistics calculation using local variables only

2014-01-17 Thread riel
From: Rik van Riel r...@redhat.com The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. The NUMA balancing code looks at those per-process variables, and having other

[PATCH 6/9] numa,sched: normalize faults_cpu stats and weigh by CPU use

2014-01-27 Thread riel
From: Rik van Riel r...@redhat.com Tracing the code that decides the active nodes has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitude more memory than the threads

[PATCH 5/9] numa,sched,mm: use active_nodes nodemask to limit numa migrations

2014-01-27 Thread riel
From: Rik van Riel r...@redhat.com Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive

[PATCH 1/9] numa,sched,mm: remove p->numa_migrate_deferred

2014-01-27 Thread riel
From: Rik van Riel r...@redhat.com Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now

[PATCH 3/9] numa,sched: track from which nodes NUMA faults are triggered

2014-01-27 Thread riel
From: Rik van Riel r...@redhat.com Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes

[PATCH 7/9] numa,sched: do statistics calculation using local variables only

2014-01-27 Thread riel
From: Rik van Riel r...@redhat.com The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. The NUMA balancing code looks at those per-process variables, and having other

[PATCH 9/9] numa,sched: turn some magic numbers into defines

2014-01-27 Thread riel
From: Rik van Riel r...@redhat.com Cleanup suggested by Mel Gorman. Now the code contains some more hints on what statistics go where. Cc: Peter Zijlstra pet...@infradead.org Cc: Mel Gorman mgor...@suse.de Cc: Ingo Molnar mi...@redhat.com Cc: Chegu Vinod chegu_vi...@hp.com Suggested-by: Mel

[PATCH 8/9] numa,sched: rename variables in task_numa_fault

2014-01-27 Thread riel
From: Rik van Riel r...@redhat.com We track both the node of the memory after a NUMA fault, and the node of the CPU on which the fault happened. Rename the local variables in task_numa_fault to make things more explicit. Cc: Peter Zijlstra pet...@infradead.org Cc: Mel Gorman mgor...@suse.de Cc

[PATCH 2/9] rename p->numa_faults to numa_faults_memory

2014-01-27 Thread riel
From: Rik van Riel r...@redhat.com In order to get a more consistent naming scheme, making it clear which fault statistics track memory locality, and which track CPU locality, rename the memory fault statistics. Cc: Peter Zijlstra pet...@infradead.org Cc: Mel Gorman mgor...@suse.de Cc: Ingo

[PATCH v5 0/9] numa,sched,mm: pseudo-interleaving for automatic NUMA balancing

2014-01-27 Thread riel
The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being

[PATCH 4/9] numa,sched: build per numa_group active node mask from numa_faults_cpu statistics

2014-01-27 Thread riel
From: Rik van Riel r...@redhat.com The numa_faults_cpu statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Cc: Peter Zijlstra pet...@infradead.org Cc: Mel Gorman mgor...@suse.de Cc: Ingo Molnar mi...@redhat.com

[PATCH 0/2] fixed sysrq rcu patches

2014-04-29 Thread riel
Andrew, these patches contain all the fixes from the threads. They seem to compile on normal x86 and UML now. Thanks to Paul, Randy, and everybody else.

[PATCH 2/2] sysrq,rcu: suppress RCU stall warnings while sysrq runs

2014-04-29 Thread riel
From: Rik van Riel r...@redhat.com Some sysrq handlers can run for a long time, because they dump a lot of data onto a serial console. Having RCU stall warnings pop up in the middle of them only makes the problem worse. This patch temporarily disables RCU stall warnings while a sysrq request

[PATCH 1/2] sysrq: rcu-ify __handle_sysrq

2014-04-29 Thread riel
From: Rik van Riel r...@redhat.com Echoing values into /proc/sysrq-trigger seems to be a popular way to get information out of the kernel. However, dumping information about thousands of processes, or hundreds of CPUs to serial console can result in IRQs being blocked for minutes, resulting

[PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node

2014-04-11 Thread riel
From: Rik van Riel r...@redhat.com Setting the numa_preferred_node for a task in task_numa_migrate does nothing on a 2-node system. Either we migrate to the node that already was our preferred node, or we stay where we were. On a 4-node system, it can slightly decrease overhead, by not calling

[PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving

2014-04-11 Thread riel
The pseudo-interleaving code deals fairly well with the placement of tasks that are part of workloads that span multiple NUMA nodes, but the code has a number of corner cases left that can result in higher than desired overhead. This patch series reduces the overhead slightly, mostly visible

[PATCH 2/3] sched,numa: retry placement more frequently when misplaced

2014-04-11 Thread riel
From: Rik van Riel r...@redhat.com When tasks have not converged on their preferred nodes yet, we want to retry fairly often, to make sure we do not migrate a task's memory to an undesirable location, only to have to move it again later. This patch reduces the interval at which migration

[PATCH 1/3] sched,numa: count pages on active node as local

2014-04-11 Thread riel
From: Rik van Riel r...@redhat.com The NUMA code is smart enough to distribute the memory of workloads that span multiple NUMA nodes across those NUMA nodes. However, it still has a pretty high scan rate for such workloads, because any memory that is left on a node other than the node of the CPU

[PATCH 1/7] sched,numa: use group's max nid as task's preferred nid

2014-06-23 Thread riel
From: Rik van Riel r...@redhat.com From task_numa_placement, always try to consolidate the tasks in a group on the group's top nid. In case this task is part of a group that is interleaved over multiple nodes, task_numa_migrate will set the task's preferred nid to the best node it could find

[PATCH 5/7] sched,numa: examine a task move when examining a task swap

2014-06-23 Thread riel
From: Rik van Riel r...@redhat.com Running perf bench numa mem -0 -m -P 1000 -p 8 -t 20 on a 4 node system results in 160 runnable threads on a system with 80 CPU threads. Once a process has nearly converged, with 39 threads on one node and 1 thread on another node, the remaining thread

[PATCH 4/7] sched,numa: simplify task_numa_compare

2014-06-23 Thread riel
From: Rik van Riel r...@redhat.com When a task is part of a numa_group, the comparison should always use the group weight, in order to make workloads converge. Signed-off-by: Rik van Riel r...@redhat.com --- kernel/sched/fair.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff

[PATCH 2/7] sched,numa: move power adjustment into load_too_imbalanced

2014-06-23 Thread riel
From: Rik van Riel r...@redhat.com Currently the NUMA code scales the load on each node with the amount of CPU power available on that node, but it does not apply any adjustment to the load of the task that is being moved over. On systems with SMT/HT, this results in a task being weighed much

[PATCH 3/7] sched,numa: use effective_load to balance NUMA loads

2014-06-23 Thread riel
From: Rik van Riel r...@redhat.com When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places on a CPU is determined by the group the task is in. This is conveniently calculated for us by effective_load(), which task_numa_compare should use. The active groups on the source

[PATCH 0/7] sched,numa: improve NUMA convergence times

2014-06-23 Thread riel
Running things like the below pointed out a number of situations in which the current NUMA code has extremely slow task convergence, and even some situations in which tasks do not converge at all. ### # 160 tasks will execute (on 4 nodes, 80 CPUs): # -1x 0MB global shared mem

[PATCH 6/7] sched,numa: rework best node setting in task_numa_migrate

2014-06-23 Thread riel
From: Rik van Riel r...@redhat.com Fix up the best node setting in task_numa_migrate to deal with a task in a pseudo-interleaved NUMA group, which is already running in the best location. Set the task's preferred nid to the current nid, so task migration is not retried at a high rate. Signed

[PATCH 3/7] sched,numa: use effective_load to balance NUMA loads

2014-06-23 Thread riel
From: Rik van Riel r...@redhat.com When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places on a CPU is determined by the group the task is in. This is conveniently calculated for us by effective_load(), which task_numa_compare should use. The active groups on the source

[PATCH 7/7] sched,numa: change scan period code to match intent

2014-06-23 Thread riel
From: Rik van Riel r...@redhat.com Reading through the scan period code and comment, it appears the intent was to slow down NUMA scanning when a majority of accesses are on the local node, specifically a local:remote ratio of 3:1. However, the code actually tests local / (local + remote
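
A worked example of the distinction: at a 3:1 local:remote fault ratio the local fraction is 3/4, so a threshold meant as a ratio has to be rewritten before comparing it with local / (local + remote).

/*
 * Worked example: a 3:1 local:remote fault ratio corresponds to a
 * local fraction of 3/4, not 3.
 */
#include <stdio.h>

int main(void)
{
        unsigned long local = 300, remote = 100;        /* exactly 3:1 */

        printf("local:remote ratio    = %lu:1\n", local / remote);                 /* 3:1 */
        printf("local share of faults = %lu%%\n", 100 * local / (local + remote)); /* 75% */

        /* intent: slow down scanning once at least 3/4 of faults are local */
        int slow_scanning = 4 * local >= 3 * (local + remote);
        printf("slow down scanning: %d\n", slow_scanning);                          /* 1 */
        return 0;
}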

[PATCH 4/7] sched,numa: simplify task_numa_compare

2014-06-23 Thread riel
From: Rik van Riel r...@redhat.com When a task is part of a numa_group, the comparison should always use the group weight, in order to make workloads converge. Signed-off-by: Rik van Riel r...@redhat.com --- kernel/sched/fair.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff

[PATCH 5/7] sched,numa: examine a task move when examining a task swap

2014-06-23 Thread riel
From: Rik van Riel r...@redhat.com Running perf bench numa mem -0 -m -P 1000 -p 8 -t 20 on a 4 node system results in 160 runnable threads on a system with 80 CPU threads. Once a process has nearly converged, with 39 threads on one node and 1 thread on another node, the remaining thread

[PATCH 1/2] sched: fix and clean up calculate_imbalance

2014-07-28 Thread riel
From: Rik van Riel r...@redhat.com There are several ways in which update_sd_pick_busiest can end up picking an sd as busiest that has a below-average per-cpu load. All of those could use the same correction that was previously only applied when the selected group has a group imbalance

[PATCH 2/2] sched: make update_sd_pick_busiest return true on a busier sd

2014-07-28 Thread riel
From: Rik van Riel r...@redhat.com Currently update_sd_pick_busiest only identifies the busiest sd that is either overloaded, or has a group imbalance. When no sd is imbalanced or overloaded, the load balancer fails to find the busiest domain. This breaks load balancing between domains

[PATCH 0/2] load balancing fixes

2014-07-28 Thread riel
Currently update_sd_pick_busiest only identifies the busiest sd that is either overloaded, or has a group imbalance. When no sd is imbalanced or overloaded, the load balancer fails to find the busiest domain. This breaks load balancing between domains that are not overloaded, in the

[PATCH 1/2] sched,numa: fix off-by-one in capacity check

2014-08-04 Thread riel
From: Rik van Riel r...@redhat.com Commit a43455a1d572daf7b730fe12eb747d1e17411365 ensures that task_numa_migrate will call task_numa_compare on the preferred node all the time, even when the preferred node has no free capacity. This could lead to a performance regression if nr_running

[PATCH 2/2] sched,numa: fix numa capacity computation

2014-08-04 Thread riel
From: Rik van Riel r...@redhat.com Commit c61037e9 fixes the phenomenon of 'phantom' cores due to N*frac(smt_power) >= 1 by limiting the capacity to the actual number of cores in the load balancing code. This patch applies the same correction to the NUMA balancing code. Signed-off-by: Rik van
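
A sketch of the capacity clamp described above, under an assumed power scale and rounding; it is not the actual kernel formula.

/*
 * Illustrative sketch: summing per-CPU "power" over SMT siblings can
 * round up to one more core than physically exists; clamping to the
 * real core count avoids the phantom core.  The scale constant and
 * rounding are examples, not the kernel's exact formula.
 */
#include <stdio.h>

#define POWER_SCALE 1024

static int node_capacity(int nr_cpus, int per_cpu_power, int nr_cores)
{
        int cap = (nr_cpus * per_cpu_power + POWER_SCALE / 2) / POWER_SCALE;

        return cap > nr_cores ? nr_cores : cap; /* never more cores than exist */
}

int main(void)
{
        /* 16 SMT siblings at ~589/1024 power each: 8 real cores, not 9 */
        printf("capacity: %d\n", node_capacity(16, 589, 8));
        return 0;
}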

[PATCH 0/2] node capacity fixes for NUMA balancing

2014-08-04 Thread riel
The NUMA balancing code has a few issues with determining the capacity of nodes, and using it when doing a task move. First the NUMA balancing code does not have the equivalent of c61037e9 to fix the phantom cores phenomenon in the presence of SMT. Secondly, the NUMA balancing code will happily

[PATCH 0/3] lockless sys_times and posix_cpu_clock_get

2014-08-15 Thread riel
and Andrew have one sitting around. /* Based on the test case from the following bug report, but changed to measure utime on a per thread basis. (Rik van Riel) https://lkml.org/lkml/2009/11/3/522 From: Spencer Candland Subject: utime/stime decreasing on thread exit I am seeing a problem

[PATCH 2/3] time,signal: protect resource use statistics with seqlock

2014-08-15 Thread riel
From: Rik van Riel r...@redhat.com Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability issues on large systems, due to both functions being serialized with a lock. The lock protects against reporting a wrong value, due to a thread in the task group exiting, its statistics
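
A minimal userspace model of the seqcount pattern this work relies on, using C11 atomics in place of the kernel's seqlock primitives; memory-ordering details and writer-side locking are deliberately omitted.

/*
 * Illustrative userspace model of the seqcount pattern: readers copy
 * the statistics locklessly and retry if a writer bumped the sequence
 * count meanwhile.  Memory-ordering details are omitted for brevity.
 */
#include <stdatomic.h>
#include <stdio.h>

struct group_stats {
        atomic_uint seq;                /* even: stable, odd: write in progress */
        unsigned long long utime;
        unsigned long long stime;
};

static void read_stats(struct group_stats *g,
                       unsigned long long *ut, unsigned long long *st)
{
        unsigned int start;

        do {
                while ((start = atomic_load(&g->seq)) & 1)
                        ;                               /* writer active: wait */
                *ut = g->utime;
                *st = g->stime;
        } while (atomic_load(&g->seq) != start);        /* retry if it moved */
}

static void write_stats(struct group_stats *g,
                        unsigned long long dut, unsigned long long dst)
{
        atomic_fetch_add(&g->seq, 1);   /* make the count odd: readers retry */
        g->utime += dut;
        g->stime += dst;
        atomic_fetch_add(&g->seq, 1);   /* even again: snapshot is stable */
}

int main(void)
{
        struct group_stats g = { .seq = 0, .utime = 100, .stime = 50 };
        unsigned long long ut, st;

        write_stats(&g, 10, 5);
        read_stats(&g, &ut, &st);
        printf("utime=%llu stime=%llu\n", ut, st);      /* 110 55 */
        return 0;
}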

[PATCH 3/3] sched,time: atomically increment stime utime

2014-08-15 Thread riel
From: Rik van Riel r...@redhat.com The functions task_cputime_adjusted and thread_group_cputime_adjusted can be called locklessly, as well as concurrently on many different CPUs. This can occasionally lead to the utime and stime reported by times(), and other syscalls like it, going backward
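
A small sketch of the companion idea: accumulate the adjusted times with atomic adds so concurrent lockless updates cannot tear the totals. C11 atomics stand in for the kernel's atomic64/cputime types; all names are illustrative.

/*
 * Illustrative sketch: accumulate utime/stime with atomic adds so
 * concurrent updates from several CPUs cannot produce torn totals.
 */
#include <stdatomic.h>
#include <stdio.h>

struct prev_cputime_sketch {
        atomic_ullong utime;
        atomic_ullong stime;
};

static void account(struct prev_cputime_sketch *p,
                    unsigned long long dut, unsigned long long dst)
{
        atomic_fetch_add(&p->utime, dut);
        atomic_fetch_add(&p->stime, dst);
}

int main(void)
{
        struct prev_cputime_sketch p = { .utime = 0, .stime = 0 };

        account(&p, 40, 10);
        account(&p, 25, 5);
        printf("utime=%llu stime=%llu\n",
               (unsigned long long)atomic_load(&p.utime),
               (unsigned long long)atomic_load(&p.stime));      /* 65 15 */
        return 0;
}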

[PATCH 1/3] exit: always reap resource stats in __exit_signal

2014-08-15 Thread riel
From: Rik van Riel r...@redhat.com Oleg pointed out that wait_task_zombie adds a task's usage statistics to the parent's signal struct, but the task's own signal struct should also propagate the statistics at exit time. This allows thread_group_cputime(reaped_zombie) to get the statistics after

[PATCH 5/6] sched,numa: find the preferred nid with complex NUMA topology

2014-10-17 Thread riel
From: Rik van Riel r...@redhat.com On systems with complex NUMA topologies, the node scoring is adjusted to allow workloads to converge on nodes that are near each other. The way a task group's preferred nid is determined needs to be adjusted, in order for the preferred_nid to be consistent

[PATCH 1/6] sched,numa: export info needed for NUMA balancing on complex topologies

2014-10-17 Thread riel
From: Rik van Riel r...@redhat.com Export some information that is necessary to do placement of tasks on systems with multi-level NUMA topologies. Signed-off-by: Rik van Riel r...@redhat.com --- kernel/sched/core.c | 4 +++- kernel/sched/sched.h | 2 ++ 2 files changed, 5 insertions(+), 1

[PATCH 6/6] sched,numa: check all nodes when placing a pseudo-interleaved group

2014-10-17 Thread riel
From: Rik van Riel r...@redhat.com In pseudo-interleaved numa_groups, all tasks try to relocate to the group's preferred_nid. When a group is spread across multiple NUMA nodes, this can lead to tasks swapping their location with other tasks inside the same group, instead of having the group

[PATCH 3/6] sched,numa: preparations for complex topology placement

2014-10-17 Thread riel
From: Rik van Riel r...@redhat.com Preparatory patch for adding NUMA placement on systems with complex NUMA topology. Also fix a potential divide by zero in group_weight() Signed-off-by: Rik van Riel r...@redhat.com Tested-by: Chegu Vinod chegu_vi...@hp.com --- kernel/sched/fair.c | 57

[PATCH 0/6] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies (v2)

2014-10-17 Thread riel
This patch set integrates two algorithms I have previously tested, one for glueless mesh NUMA topologies, where NUMA nodes communicate with far-away nodes through intermediary nodes, and backplane topologies, where communication with far-away NUMA nodes happens through backplane controllers (which

[PATCH 2/6] sched,numa: classify the NUMA topology of a system

2014-10-17 Thread riel
From: Rik van Riel r...@redhat.com Smaller NUMA systems tend to have all NUMA nodes directly connected to each other. This includes the degenerate case of a system with just one node, ie. a non-NUMA system. Larger systems can have two kinds of NUMA topology, which affects how tasks and memory

[PATCH 4/6] sched,numa: calculate node scores in complex NUMA topologies

2014-10-17 Thread riel
From: Rik van Riel r...@redhat.com In order to do task placement on systems with complex NUMA topologies, it is necessary to count the faults on nodes nearby the node that is being examined for a potential move. In case of a system with a backplane interconnect, we are dealing with groups

[PATCH -tip 1/2] seqlock: add irqsave variant of read_seqbegin_or_lock

2014-09-12 Thread riel
From: Rik van Riel r...@redhat.com There are cases where read_seqbegin_or_lock needs to block irqs, because the seqlock in question nests inside a lock that is also taken from irq context. Add read_seqbegin_or_lock_irqsave and done_seqretry_irqrestore, which are almost identical

[PATCH -tip 2/2] sched,time: fix lock inversion in thread_group_cputime

2014-09-12 Thread riel
From: Rik van Riel r...@redhat.com The sig->stats_lock nests inside the tasklist_lock and the sighand->siglock in __exit_signal and wait_task_zombie. However, both of those locks can be taken from irq context, which means we need to use the interrupt safe variant of read_seqbegin_or_lock

[PATCH -tip 0/2] fix lock inversion in lockless sys_times()

2014-09-12 Thread riel
The sig->stats_lock nests inside the tasklist_lock and the sighand->siglock in __exit_signal and wait_task_zombie. However, both of those locks can be taken from irq context, which means we need to use the interrupt safe variant of read_seqbegin_or_lock. This blocks interrupts when the lock branch

[PATCH RFC 1/5] sched,numa: build table of node hop distance

2014-10-08 Thread riel
From: Rik van Riel r...@redhat.com In order to more efficiently figure out where to place workloads that span multiple NUMA nodes, it makes sense to estimate how many hops away nodes are from each other. Also add some comments to sched_init_numa. Signed-off-by: Rik van Riel r...@redhat.com

[PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies

2014-10-08 Thread riel
This patch set integrates two algorithms I have previously tested, one for glueless mesh NUMA topologies, where NUMA nodes communicate with far-away nodes through intermediary nodes, and backplane topologies, where communication with far-away NUMA nodes happens through backplane controllers (which
