[PATCH v2 0/2] x86,kvm: move qemu/guest FPU switching out to kvm_arch_vcpu_ioctl_run

2017-11-14 Thread riel
This code moves FPU handling from the non-preemptible part of running a VCPU out to the KVM_RUN ioctl handling. That way there is no need to continuously save and load the qemu userspace FPU context every time a VCPU context switches or goes to sleep in the host kernel. v2: - move FPU s
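
For readers skimming the archive, here is a minimal standalone C sketch of the structure this series moves to; the types and helper names are stand-ins, not the real KVM/x86 FPU code. The point it illustrates: the qemu/guest FPU swap happens once per KVM_RUN ioctl instead of in the preempt-notifier path on every context switch.

/* Standalone sketch; types and helpers are stand-ins, not real KVM code. */
#include <stdio.h>

struct fpu_state { int dummy; };

struct vcpu {
    struct fpu_state user_fpu;   /* qemu's user-space FPU state */
    struct fpu_state guest_fpu;  /* guest FPU state */
};

/* stand-ins for the real save/restore primitives (e.g. XSAVE/XRSTOR) */
static void fpu_save(struct fpu_state *s)    { (void)s; }
static void fpu_restore(struct fpu_state *s) { (void)s; }

static void run_vcpu_loop(struct vcpu *v)
{
    /* inner loop: VMENTER/VMEXIT, may be preempted; with the swap
     * hoisted out, being scheduled out here no longer forces an extra
     * save/load of the qemu user FPU state on every switch. */
    (void)v;
}

/* rough shape of the KVM_RUN handler after the change: swap the FPU
 * contexts once per ioctl, not once per context switch */
static void kvm_run_ioctl(struct vcpu *v)
{
    fpu_save(&v->user_fpu);       /* save qemu's FPU state once */
    fpu_restore(&v->guest_fpu);   /* load guest FPU state once */

    run_vcpu_loop(v);

    fpu_save(&v->guest_fpu);      /* save guest state back */
    fpu_restore(&v->user_fpu);    /* restore qemu's state for userspace */
}

int main(void)
{
    struct vcpu v = { {0}, {0} };
    kvm_run_ioctl(&v);
    puts("FPU swap done once per KVM_RUN, not per context switch");
    return 0;
}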

[PATCH 2/2] x86,kvm: remove KVM emulator get_fpu / put_fpu

2017-11-14 Thread riel
From: Rik van Riel Now that get_fpu and put_fpu do nothing, because the scheduler will automatically load and restore the guest FPU context for us while we are in this code (deep inside the vcpu_run main loop), we can get rid of the get_fpu and put_fpu hooks. Signed-off-by: Rik van Riel

[PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run

2017-11-14 Thread riel
From: Rik van Riel Currently, every time a VCPU is scheduled out, the host kernel will first save the guest FPU/xstate context, then load the qemu userspace FPU context, only to then immediately save the qemu userspace FPU context back to memory. When scheduling in a VCPU, the same extraneous

[RFC PATCH 11/11] (BROKEN) x86,fpu: broken signal handler stack setup

2015-01-11 Thread riel
From: Rik van Riel The previous patches result in situations where the FPU state for a task is not present in the FPU registers, when using eager fpu mode. The signal frame setup and restore code needs to be adjusted to deal with that situation. Without this patch, the signal handler stack

[RFC PATCH 04/11] x86,fpu: defer FPU restore until return to userspace

2015-01-11 Thread riel
From: Rik van Riel Defer restoring the FPU state, if so desired, until the task returns to userspace. In case of kernel threads, KVM VCPU threads, and tasks performing longer running operations in kernel space, this could mean skipping the FPU state restore entirely for several context switches
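
A standalone sketch of the deferral mechanism described here; the thread-flag name and helpers are illustrative, not the series' actual identifiers. The context switch only marks the task, and the restore happens once on the exit-to-user path.

#include <stdbool.h>
#include <stdio.h>

struct task {
    bool tif_load_fpu;   /* "restore my FPU state before userspace" */
    bool fpu_loaded;     /* is this task's state in the registers? */
};

static void fpu_restore_from_memory(struct task *t)
{
    t->fpu_loaded = true;
    puts("XRSTOR from task's FPU save area");
}

/* context switch: do not touch the FPU registers, just note that the
 * incoming task needs its state before it re-enters userspace */
static void switch_to(struct task *next)
{
    next->tif_load_fpu = true;
}

/* exit-to-user path: restore only if the task actually goes back to
 * userspace; kernel threads and tasks that sleep again skip it */
static void prepare_exit_to_user(struct task *t)
{
    if (t->tif_load_fpu) {
        fpu_restore_from_memory(t);
        t->tif_load_fpu = false;
    }
}

int main(void)
{
    struct task vcpu_thread = { false, false };
    switch_to(&vcpu_thread);             /* scheduled in... */
    switch_to(&vcpu_thread);             /* ...and again: still no FPU work */
    prepare_exit_to_user(&vcpu_thread);  /* one restore, on the way out */
    return 0;
}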

[RFC PATCH 07/11] x86,fpu: store current fpu pointer, instead of fpu_owner_task

2015-01-11 Thread riel
From: Rik van Riel This change has no impact on normal tasks, but it allows a task with multiple FPU states (like a KVM vcpu thread) to check whether its other FPU state is still loaded. Exported so KVM can use it. Signed-off-by: Rik van Riel --- arch/x86/include/asm/fpu-internal.h | 15

[RFC PATCH 05/11] x86,fpu: ensure FPU state is reloaded from memory if task is traced

2015-01-11 Thread riel
From: Rik van Riel If the old task is in a state where its FPU state could be changed by a debugger, ensure the FPU state is always restored from memory on the next context switch. Currently the system only skips FPU reloads when !eager_fpu_mode() and the task's FPU state is still loaded o

[RFC PATCH 02/11] x86,fpu: replace fpu_switch_t with a thread flag

2015-01-11 Thread riel
From: Rik van Riel Replace fpu_switch_t with a thread flag, in preparation for only restoring the FPU state on return to user space. I have left the code around fpu_lazy_restore intact, even though there appears to be no protection against races with eg. ptrace, and the optimization appears

[RFC PATCH 01/11] x86,fpu: document the data structures a little

2015-01-11 Thread riel
From: Rik van Riel Add some documentation to data structures used for FPU context switching. Signed-off-by: Rik van Riel --- arch/x86/include/asm/processor.h | 9 +++-- arch/x86/kernel/cpu/common.c | 1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/include

[RFC PATCH 06/11] x86,fpu: lazily skip fpu restore with eager fpu mode, too

2015-01-11 Thread riel
From: Rik van Riel If the next task still has its FPU state present in the FPU registers, there is no need to restore it from memory. This is no big deal on bare metal, where XSAVEOPT / XRSTOR are heavily optimized, but those optimizations do not carry across VMENTER / VMEXIT. Skipping the

[RFC PATCH 09/11] x86,fpu,kvm: keep vcpu FPU active as long as it is resident

2015-01-11 Thread riel
From: Rik van Riel Currently KVM always deactivates the FPU on VCPU unload, only to reactivate it next time the guest uses it. This can make using the FPU inside a KVM guest fairly expensive. On the other hand, restoring the FPU state for a KVM guest is also significantly more involved (and

[RFC PATCH 03/11] x86,fpu: move __thread_fpu_begin to when the task has the fpu

2015-01-11 Thread riel
From: Rik van Riel Move the call to __thread_fpu_begin, which in turn calls __thread_set_has_fpu, to a spot where the task actually has the FPU. This is in preparation for the next patch. This changeset introduces an extraneous clts() call when switching from one FPU-using task to another FPU

[RFC PATCH 08/11] x86,fpu: restore user FPU state lazily after __kernel_fpu_end

2015-01-11 Thread riel
From: Rik van Riel Tasks may have multiple invocations of kernel_fpu_start and kernel_fpu_end in sequence without ever hitting userspace in-between. Delaying the restore of the user FPU state until the task returns to userspace means the kernel only has to save the user FPU state on the first

[RFC PATCH 0/11 BROKEN] move FPU context loading to userspace switch

2015-01-11 Thread riel
Currently the kernel will always load the FPU context, even when switching to a kernel thread, or to an idle thread. In the case of a task on a KVM VCPU going idle for a bit, and waking up again later, this creates a vastly inefficient chain of FPU context saves & loads: 1) save task FPU context,

[RFC PATCH 10/11] x86,fpu: fix fpu_copy to deal with not-loaded fpu

2015-01-11 Thread riel
From: Rik van Riel It is possible to hit fpu_copy in eager fpu mode, but without the current task's FPU context actually loaded into the CPU. In that case, we should copy the FPU context from memory, not save it from registers. Signed-off-by: Rik van Riel --- arch/x86/include/as

[PATCH 1/2] show isolated cpus in sysfs

2015-04-24 Thread riel
From: Rik van Riel After system bootup, there is no totally reliable way to see which CPUs are isolated, because the kernel may modify the CPUs specified on the isolcpus= kernel command line option. Export the CPU list that actually got isolated in sysfs, specifically in the file /sys/devices

[PATCH 0/2 resend] show isolated & nohz_full cpus in sysfs

2015-04-24 Thread riel
Currently there is no good way to get the isolated and nohz_full CPUs at runtime, because the kernel may have changed the CPUs specified on the commandline (when specifying all CPUs as isolated, or CPUs that do not exist, ...) This series adds two files to /sys/devices/system/cpu, which can be use

[PATCH 2/2] show nohz_full cpus in sysfs

2015-04-24 Thread riel
From: Rik van Riel Currently there is no way to query which CPUs are in nohz_full mode from userspace. Export the CPU list running in nohz_full mode in sysfs, specifically in the file /sys/devices/system/cpu/nohz_full This can be used by system management tools like libvirt, openstack, and
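
A small user-space reader for the two cpulist files this pair of patches exports; the nohz_full path is taken from the changelog above, and the isolated file is assumed to sit alongside it under /sys/devices/system/cpu/.

#include <stdio.h>

static void print_cpulist(const char *path)
{
    char buf[256];
    FILE *f = fopen(path, "r");

    if (!f) {
        printf("%s: not available on this kernel\n", path);
        return;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("%s: %s", path, buf);   /* e.g. "1-7" or an empty line */
    fclose(f);
}

int main(void)
{
    print_cpulist("/sys/devices/system/cpu/isolated");
    print_cpulist("/sys/devices/system/cpu/nohz_full");
    return 0;
}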

[PATCH 0/3] reduce nohz_full syscall overhead by 10%

2015-04-30 Thread riel
Profiling reveals that a lot of the overhead from the nohz_full accounting seems to come not from the accounting itself, but from disabling and re-enabling interrupts. This patch series removes the interrupt disabling & re-enabling from __acct_update_integrals, which is called on both syscall entr

[PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-04-30 Thread riel
From: Rik van Riel On syscall entry with nohz_full on, we enable interrupts, call user_exit, disable interrupts, do something, re-enable interrupts, and go on our merry way. Profiling shows that a large amount of the nohz_full overhead comes from the extraneous disabling and re-enabling of

[PATCH 2/3] remove local_irq_save from __acct_update_integrals

2015-04-30 Thread riel
From: Rik van Riel The function __acct_update_integrals() is called both from irq context and task context. This creates a race where irq context can advance tsk->acct_timexpd to a value larger than time, leading to a negative value, which causes a divide error. See commit 6d5b5acca9e5 (&

[PATCH 1/3] reduce indentation in __acct_update_integrals

2015-04-30 Thread riel
From: Peter Zijlstra Reduce indentation in __acct_update_integrals. Cc: Andy Lutomirski Cc: Frederic Weisbecker Cc: Peter Zijlstra Cc: Heiko Carstens Cc: Thomas Gleixner Signed-off-by: Peter Zijlstra Signed-off-by: Rik van Riel --- kernel/tsacct.c | 34

[PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing

2015-05-27 Thread riel
A previous attempt to resolve a major conflict between load balancing and NUMA balancing, changeset 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced"), introduced its own problems. Revert that changeset, and introduce a new fix, which actually seems to resolve the issues

[PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")

2015-05-27 Thread riel
From: Rik van Riel Commit 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") broke convergence of workloads with just one runnable thread, by making it impossible for the one runnable thread on the system to move from one NUMA node to another. Instead,

[PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination

2015-05-27 Thread riel
From: Rik van Riel Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the preferred node") fixes an issue where workloads would never converge on a fully loaded (or overloaded) system. However, it introduces a regression on less than fully loaded systems, where
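
A standalone sketch of the destination filter this patch describes, with made-up node loads and illustrative helper names: a task only considers nodes that are less busy than the node it is leaving, so NUMA balancing stops pushing work onto nodes that are already busier.

#include <stdio.h>

#define NR_NODES 4

static int node_load[NR_NODES] = { 8, 3, 9, 2 };  /* made-up runnable counts */

static int pick_numa_destination(int src_node, int preferred_node)
{
    int best = -1, best_load = node_load[src_node];
    int nid;

    for (nid = 0; nid < NR_NODES; nid++) {
        if (nid == src_node)
            continue;
        /* key rule from the changelog: skip nodes that are busier
         * than where we already are */
        if (node_load[nid] >= node_load[src_node])
            continue;
        /* among the remaining candidates, take the preferred node if
         * possible, otherwise the least loaded one */
        if (nid == preferred_node)
            return nid;
        if (node_load[nid] < best_load) {
            best = nid;
            best_load = node_load[nid];
        }
    }
    return best;   /* -1 means: stay put */
}

int main(void)
{
    /* node 2 is preferred but busier than node 0, so it is skipped */
    printf("destination for a task on node 0: %d\n",
           pick_numa_destination(0, 2));
    return 0;
}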

[RFC PATCH 11/11] nohz,kvm,time: teach account_process_tick about guest time

2015-06-24 Thread riel
From: Rik van Riel When tick based accounting is run from a remote CPU, it is actually possible to encounter a task with PF_VCPU set. Make sure to account those as guest time. Signed-off-by: Rik van Riel --- kernel/sched/cputime.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff

[RFC PATCH 08/11] nohz,timer: have housekeeper call account_process_tick for nohz cpus

2015-06-24 Thread riel
From: Rik van Riel Have the housekeeper CPU call account_process_tick to do tick based accounting for remote nohz_full CPUs. Signed-off-by: Rik van Riel --- kernel/time/timer.c | 28 1 file changed, 28 insertions(+) diff --git a/kernel/time/timer.c b/kernel/time

[RFC PATCH 03/11] time,nohz: add cpu parameter to irqtime_account_process_tick

2015-06-24 Thread riel
From: Rik van Riel Add a cpu parameter to irqtime_account_process_tick, to specify what cpu to run the statistics for. In order for this to actually work on a different cpu, all the functions called by irqtime_account_process_tick need to be able to handle working for another CPU. Signed-off-by

[RFC PATCH 01/11] nohz,time: make account_process_tick work on the task's CPU

2015-06-24 Thread riel
From: Rik van Riel Teach account_process_tick to work on the CPU of the task specified in the function argument. This allows us to do remote tick based sampling of a nohz_full cpu from a housekeeping CPU. Signed-off-by: Rik van Riel --- kernel/sched/cputime.c | 8 +++- 1 file changed, 7
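
A standalone sketch of the remote sampling idea, with stand-in data structures and names: the housekeeping CPU's own tick walks the nohz_full CPUs and charges one tick to whatever task is current on each of them.

#include <stdio.h>

#define NR_CPUS 4

struct task { const char *comm; unsigned long utime_ticks; };

static struct task idle   = { "idle",   0 };
static struct task worker = { "worker", 0 };

/* what each CPU is currently running (CPU 0 is the housekeeper) */
static struct task *cpu_curr[NR_CPUS] = { &idle, &worker, &worker, &idle };
static int nohz_full_cpu[NR_CPUS] = { 0, 1, 1, 1 };

/* account_process_tick() taught to take the CPU it is accounting for */
static void account_process_tick(struct task *t, int cpu)
{
    (void)cpu;
    t->utime_ticks++;
}

/* called from the housekeeping CPU's own tick */
static void housekeeping_tick(void)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (nohz_full_cpu[cpu])
            account_process_tick(cpu_curr[cpu], cpu);
}

int main(void)
{
    housekeeping_tick();
    printf("worker ticks: %lu, idle ticks: %lu\n",
           worker.utime_ticks, idle.utime_ticks);
    return 0;
}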

[RFC INCOMPLETE] tick based timekeeping from a housekeeping CPU

2015-06-24 Thread riel
This series seems to make basic tick based time sampling from a housekeeping CPU work, allowing us to have tick based accounting on a nohz_full CPU, and no longer doing vtime accounting on those CPUs. It still needs a major cleanup, and steal time accounting and irq accounting are still missing.

[RFC PATCH 04/11] time,nohz: add cpu parameter to steal_account_process_tick

2015-06-24 Thread riel
From: Rik van Riel Add a cpu parameter to steal_account_process_tick, so it can be used to do CPU time accounting for another CPU. Signed-off-by: Rik van Riel --- kernel/sched/cputime.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/kernel/sched/cputime.c b

[RFC PATCH 09/11] nohz,time: add tick_accounting_remote macro

2015-06-24 Thread riel
From: Rik van Riel With the introduction of remote tick based sampling, we now have three ways of gathering time statistics: - local tick based sampling - vtime accounting (used natively on some architectures) - remote tick based sampling On a system with remote tick based sampling, the

[RFC PATCH 02/11] time,nohz: rename vtime_accounting_enabled to tick_accounting_disabled

2015-06-24 Thread riel
From: Rik van Riel Rename vtime_accounting_enabled to tick_accounting_disabled, because it can mean either that vtime accounting is enabled, or that the system is doing tick based sampling from a housekeeping CPU for nohz_full CPUs. Signed-off-by: Rik van Riel --- include/linux

[RFC PATCH 05/11] time,nohz: add cpu parameter to account_steal_time

2015-06-24 Thread riel
From: Rik van Riel Simple transformation to allow tick based sampling from a remote cpu. Additional changes may be needed to actually acquire the steal time info for remote cpus from the host/hypervisor. Signed-off-by: Rik van Riel --- include/linux/kernel_stat.h | 2 +- kernel/sched

[RFC PATCH 06/11] time,nohz: add cpu parameter to account_idle_time

2015-06-24 Thread riel
From: Rik van Riel Simple transformation to allow account_idle_time to account the idle time for another CPU. Signed-off-by: Rik van Riel --- arch/ia64/kernel/time.c | 2 +- arch/powerpc/kernel/time.c | 2 +- arch/s390/kernel/idle.c | 2 +- include/linux/kernel_stat.h | 2

[RFC PATCH 07/11] nohz,timer: designate timer housekeeping cpu

2015-06-24 Thread riel
From: Rik van Riel The timer housekeeping CPU can do tick based sampling for remote CPUs. For now this is the first CPU in the housekeeping_mask. Eventually we could move to having one timer housekeeping cpu per socket, if needed. Signed-off-by: Rik van Riel --- include/linux/tick.h | 9

[RFC PATCH 10/11] nohz,kvm,time: skip vtime accounting at kernel entry & exit

2015-06-24 Thread riel
From: Rik van Riel When timer statistics are sampled from a remote CPU, vtime calculations at the kernel/user and kernel/guest boundary are no longer necessary. Skip them. Signed-off-by: Rik van Riel --- include/linux/context_tracking.h | 4 ++-- kernel/context_tracking.c| 6 -- 2

[PATCH -mm 2/3] mm,numa: reorganize change_pmd_range

2014-02-18 Thread riel
From: Rik van Riel Reorganize the order of ifs in change_pmd_range a little, in preparation for the next patch. Signed-off-by: Rik van Riel Cc: Peter Zijlstra Cc: Andrea Arcangeli Reported-by: Xing Gang Tested-by: Chegu Vinod --- mm/mprotect.c | 7 --- 1 file changed, 4 insertions

[PATCH -mm 0/3] fix numa vs kvm scalability issue

2014-02-18 Thread riel
The NUMA scanning code can end up iterating over many gigabytes of unpopulated memory, especially in the case of a freshly started KVM guest with lots of memory. This results in the mmu notifier code being called even when there are no mapped pages in a virtual address range. The amount of time wa

[PATCH -mm 1/3] sched,numa: add cond_resched to task_numa_work

2014-02-18 Thread riel
From: Rik van Riel Normally task_numa_work scans over a fairly small amount of memory, but it is possible to run into a large unpopulated part of virtual memory, with no pages mapped. In that case, task_numa_work can run for a while, and it may make sense to reschedule as required. Signed-off
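
A standalone illustration of the fix, with cond_resched() and the page-table walk stubbed so the example compiles on its own: the long scan loop offers to reschedule as it goes, so a large unpopulated range cannot monopolize the CPU.

#include <stdio.h>

static void cond_resched(void)
{
    /* in the kernel this is a real scheduling point; stubbed here */
}

static int page_mapped_at(unsigned long addr)
{
    (void)addr;
    return 0;   /* pretend the whole range is unpopulated */
}

static void task_numa_work_scan(unsigned long start, unsigned long end)
{
    unsigned long addr;

    for (addr = start; addr < end; addr += 4096) {
        if (page_mapped_at(addr)) {
            /* mark the page for NUMA hinting faults ... */
        }
        /* even when nothing is mapped, give the scheduler a chance */
        cond_resched();
    }
}

int main(void)
{
    task_numa_work_scan(0, 1 << 20);   /* 1 MB worth of pages */
    puts("scan finished without hogging the CPU");
    return 0;
}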

[PATCH -mm 3/3] move mmu notifier call from change_protection to change_pmd_range

2014-02-18 Thread riel
From: Rik van Riel The NUMA scanning code can end up iterating over many gigabytes of unpopulated memory, especially in the case of a freshly started KVM guest with lots of memory. This results in the mmu notifier code being called even when there are no mapped pages in a virtual address range
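
A standalone sketch of the restructuring, with stand-ins for the real mm/mmu-notifier code: the invalidate notifier is only fired around a range that actually contains populated PMDs, instead of unconditionally around every change_protection() call.

#include <stdio.h>

#define PMD_SIZE (2UL << 20)

static int pmd_populated(unsigned long addr)
{
    /* pretend only the first PMD of the range has pages mapped */
    return addr == 0;
}

static void mmu_notifier_invalidate_range_start(unsigned long s, unsigned long e)
{
    printf("notify start %lx-%lx\n", s, e);
}

static void mmu_notifier_invalidate_range_end(unsigned long s, unsigned long e)
{
    printf("notify end   %lx-%lx\n", s, e);
}

static void change_pmd_range(unsigned long start, unsigned long end)
{
    unsigned long addr;
    int notified = 0;

    for (addr = start; addr < end; addr += PMD_SIZE) {
        if (!pmd_populated(addr))
            continue;            /* empty PMD: no notifier call at all */
        if (!notified) {
            mmu_notifier_invalidate_range_start(start, end);
            notified = 1;
        }
        /* ... update protections for this PMD ... */
    }
    if (notified)
        mmu_notifier_invalidate_range_end(start, end);
}

int main(void)
{
    change_pmd_range(0, 8 * PMD_SIZE);   /* mostly-unpopulated range */
    return 0;
}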

[PATCH v4 0/9] pseudo-interleaving for automatic NUMA balancing

2014-01-21 Thread riel
The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being availab

[PATCH 1/9] numa,sched,mm: remove p->numa_migrate_deferred

2014-01-21 Thread riel
From: Rik van Riel Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now that

[PATCH 8/9] numa,sched: rename variables in task_numa_fault

2014-01-21 Thread riel
From: Rik van Riel We track both the node of the memory after a NUMA fault, and the node of the CPU on which the fault happened. Rename the local variables in task_numa_fault to make things more explicit. Suggested-by: Mel Gorman Signed-off-by: Rik van Riel --- kernel/sched/fair.c | 8

[PATCH 4/9] numa,sched: build per numa_group active node mask from numa_faults_cpu statistics

2014-01-21 Thread riel
From: Rik van Riel The numa_faults_cpu statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- kernel/sched

[PATCH 9/9] numa,sched: define some magic numbers

2014-01-21 Thread riel
From: Rik van Riel Cleanup suggested by Mel Gorman. Now the code contains some more hints on what statistics go where. Suggested-by: Mel Gorman Signed-off-by: Rik van Riel --- kernel/sched/fair.c | 34 +- 1 file changed, 25 insertions(+), 9 deletions(-) diff

[PATCH 3/9] numa,sched: track from which nodes NUMA faults are triggered

2014-01-21 Thread riel
From: Rik van Riel Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes a workload is

[PATCH 5/9] numa,sched,mm: use active_nodes nodemask to limit numa migrations

2014-01-21 Thread riel
From: Rik van Riel Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive NUMA migration
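
A standalone sketch of the two steps described in this and the previous patch, with illustrative data structures: build an active_nodes mask from the per-node faults_cpu counters, then only migrate pages toward nodes in that mask. The half-of-maximum cutoff is only a stand-in for the threshold the patch actually uses.

#include <stdio.h>

#define NR_NODES 4

struct numa_group {
    unsigned long faults_cpu[NR_NODES]; /* faults triggered from each node */
    unsigned long active_nodes;         /* bitmap of nodes the group runs on */
};

static void update_active_nodes(struct numa_group *ng)
{
    unsigned long max = 0;
    int nid;

    for (nid = 0; nid < NR_NODES; nid++)
        if (ng->faults_cpu[nid] > max)
            max = ng->faults_cpu[nid];

    ng->active_nodes = 0;
    for (nid = 0; nid < NR_NODES; nid++)
        if (ng->faults_cpu[nid] > max / 2)      /* illustrative cutoff */
            ng->active_nodes |= 1UL << nid;
}

static int should_migrate_page(struct numa_group *ng, int dst_nid)
{
    /* pages are only pulled toward nodes the workload actively runs on */
    return (ng->active_nodes >> dst_nid) & 1;
}

int main(void)
{
    struct numa_group ng = { { 900, 850, 40, 0 }, 0 };

    update_active_nodes(&ng);
    printf("migrate to node 1? %d   to node 2? %d\n",
           should_migrate_page(&ng, 1), should_migrate_page(&ng, 2));
    return 0;
}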

[PATCH 2/9] rename p->numa_faults to numa_faults_memory

2014-01-21 Thread riel
From: Rik van Riel In order to get a more consistent naming scheme, making it clear which fault statistics track memory locality, and which track CPU locality, rename the memory fault statistics. Suggested-by: Mel Gorman Signed-off-by: Rik van Riel --- include/linux/sched.h | 8

[PATCH 6/9] numa,sched: normalize faults_cpu stats and weigh by CPU use

2014-01-21 Thread riel
From: Rik van Riel Tracing the code that decides the active nodes has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitude more memory than the threads that do all the
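
A standalone sketch of the weighting idea, using a simplified stand-in formula rather than the one in the patch: each task's fault counts are scaled by its share of the group's CPU time, so a memory-hungry but mostly idle garbage collector no longer dominates the node statistics.

#include <stdio.h>

struct task_stats {
    const char   *comm;
    unsigned long faults;    /* NUMA hinting faults from this task */
    unsigned long runtime;   /* CPU time consumed, arbitrary units */
};

static unsigned long weighted_faults(const struct task_stats *t,
                                     unsigned long total_runtime)
{
    /* scale faults by the task's share of the group's CPU use */
    return t->faults * t->runtime / total_runtime;
}

int main(void)
{
    struct task_stats gc     = { "gc",     10000,  5 };
    struct task_stats worker = { "worker",   800, 95 };
    unsigned long total = gc.runtime + worker.runtime;

    printf("gc: raw %lu -> weighted %lu\n",
           gc.faults, weighted_faults(&gc, total));
    printf("worker: raw %lu -> weighted %lu\n",
           worker.faults, weighted_faults(&worker, total));
    return 0;
}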

[PATCH 7/9] numa,sched: do statistics calculation using local variables only

2014-01-21 Thread riel
From: Rik van Riel The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. The NUMA balancing code looks at those per-process variables, and having other tasks

[RFC PATCH 0/4] pseudo-interleaving NUMA placement

2013-11-26 Thread riel
This patch set attempts to implement a pseudo-interleaving policy for workloads that do not fit in one NUMA node. For each NUMA group, we track the NUMA nodes on which the workload is actively running, and try to concentrate the memory on those NUMA nodes. Unfortunately, the scheduler appears to

[RFC PATCH 4/4] use active_nodes nodemask to decide on numa migrations

2013-11-26 Thread riel
From: Rik van Riel Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive NUMA migration

[RFC PATCH 1/4] remove p->numa_migrate_deferred

2013-11-26 Thread riel
From: Rik van Riel Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. It is time to

[RFC PATCH 2/4] track from which nodes NUMA faults are triggered

2013-11-26 Thread riel
From: Rik van Riel Track which nodes NUMA faults are triggered from. This uses a similar mechanism to what is used to track the memory involved in numa faults. This is used, in the next patch, to build up a bitmap of which nodes a workload is actively running on. Signed-off-by: Rik van Riel

[RFC PATCH 3/4] build per numa_group active node mask from faults_from statistics

2013-11-26 Thread riel
From: Rik van Riel The faults_from statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Signed-off-by: Rik van Riel --- kernel/sched/fair.c | 33 + 1 file changed, 33 insertions

[PATCH 5/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations

2014-01-16 Thread riel
From: Rik van Riel Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive NUMA migration

[PATCH 6/6] numa,sched: normalize faults_from stats and weigh by CPU use

2014-01-16 Thread riel
From: Rik van Riel The tracepoint has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitude more memory than the threads that do all the active work. This resulted in the

[PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics

2014-01-16 Thread riel
From: Rik van Riel The faults_from statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel Signed-off-by: Rik van

[PATCH 4/6] numa,sched: tracepoints for NUMA balancing active nodemask changes

2014-01-16 Thread riel
From: Rik van Riel Being able to see how the active nodemask changes over time, and why, can be quite useful. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel Signed-off-by: Rik van Riel --- include/trace/events/sched.h | 34

[PATCH 0/6] pseudo-interleaving for automatic NUMA balancing

2014-01-16 Thread riel
The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being availab

[PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred

2014-01-16 Thread riel
From: Rik van Riel Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now that

[PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered

2014-01-16 Thread riel
From: Rik van Riel Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes a workload is

[PATCH v2 0/7] pseudo-interleaving for automatic NUMA balancing

2014-01-17 Thread riel
The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being availab

[PATCH 1/7] numa,sched,mm: remove p->numa_migrate_deferred

2014-01-17 Thread riel
From: Rik van Riel Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now that

[PATCH 2/7] numa,sched: track from which nodes NUMA faults are triggered

2014-01-17 Thread riel
From: Rik van Riel Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes a workload is

[PATCH 3/7] numa,sched: build per numa_group active node mask from faults_from statistics

2014-01-17 Thread riel
From: Rik van Riel The faults_from statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- kernel/sched/fair.c

[PATCH 4/7] numa,sched: tracepoints for NUMA balancing active nodemask changes

2014-01-17 Thread riel
From: Rik van Riel Being able to see how the active nodemask changes over time, and why, can be quite useful. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- include/trace/events/sched.h | 34 ++ kernel

[PATCH 5/7] numa,sched,mm: use active_nodes nodemask to limit numa migrations

2014-01-17 Thread riel
From: Rik van Riel Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive NUMA migration

[PATCH 6/7] numa,sched: normalize faults_from stats and weigh by CPU use

2014-01-17 Thread riel
From: Rik van Riel The tracepoint has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitude more memory than the threads that do all the active work. This resulted in the

[PATCH 7/7] numa,sched: do statistics calculation using local variables only

2014-01-17 Thread riel
From: Rik van Riel The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. The NUMA balancing code looks at those per-process variables, and having other tasks

[PATCH v3 0/6] pseudo-interleaving for automatic NUMA balancing

2014-01-20 Thread riel
The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being availab

[PATCH 3/6] numa,sched: build per numa_group active node mask from faults_from statistics

2014-01-20 Thread riel
From: Rik van Riel The faults_from statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- kernel/sched/fair.c

[PATCH 6/6] numa,sched: do statistics calculation using local variables only

2014-01-20 Thread riel
From: Rik van Riel The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. The NUMA balancing code looks at those per-process variables, and having other tasks

[PATCH 4/6] numa,sched,mm: use active_nodes nodemask to limit numa migrations

2014-01-20 Thread riel
From: Rik van Riel Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive NUMA migration

[PATCH 1/6] numa,sched,mm: remove p->numa_migrate_deferred

2014-01-20 Thread riel
From: Rik van Riel Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now that

[PATCH 2/6] numa,sched: track from which nodes NUMA faults are triggered

2014-01-20 Thread riel
From: Rik van Riel Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes a workload is

[PATCH 5/6] numa,sched: normalize faults_from stats and weigh by CPU use

2014-01-20 Thread riel
From: Rik van Riel The tracepoint has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitude more memory than the threads that do all the active work. This resulted in the

[PATCH -tip 1/2] seqlock: add irqsave variant of read_seqbegin_or_lock

2014-09-12 Thread riel
From: Rik van Riel There are cases where read_seqbegin_or_lock needs to block irqs, because the seqlock in question nests inside a lock that is also be taken from irq context. Add read_seqbegin_or_lock_irqsave and done_seqretry_irqrestore, which are almost identical to read_seqbegin_or_lock and

[PATCH -tip 2/2] sched,time: fix lock inversion in thread_group_cputime

2014-09-12 Thread riel
From: Rik van Riel The sig->stats_lock nests inside the tasklist_lock and the sighand->siglock in __exit_signal and wait_task_zombie. However, both of those locks can be taken from irq context, which means we need to use the interrupt safe variant of read_seqbegin_or_lock. This

[PATCH -tip 0/2] fix lock inversion in lockless sys_times()

2014-09-12 Thread riel
The sig->stats_lock nests inside the tasklist_lock and the sighand->siglock in __exit_signal and wait_task_zombie. However, both of those locks can be taken from irq context, which means we need to use the interrupt safe variant of read_seqbegin_or_lock. This blocks interrupts when the "lock" bran

[PATCH 0/2] fixed sysrq & rcu patches

2014-04-29 Thread riel
Andrew, these patches contain all the fixes from the threads. They seem to compile on normal x86 and UML now. Thanks to Paul, Randy, and everybody else. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo

[PATCH 2/2] sysrq,rcu: suppress RCU stall warnings while sysrq runs

2014-04-29 Thread riel
From: Rik van Riel Some sysrq handlers can run for a long time, because they dump a lot of data onto a serial console. Having RCU stall warnings pop up in the middle of them only makes the problem worse. This patch temporarily disables RCU stall warnings while a sysrq request is handled

[PATCH 1/2] sysrq: rcu-ify __handle_sysrq

2014-04-29 Thread riel
From: Rik van Riel Echoing values into /proc/sysrq-trigger seems to be a popular way to get information out of the kernel. However, dumping information about thousands of processes, or hundreds of CPUs to serial console can result in IRQs being blocked for minutes, resulting in various kinds of
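
A standalone sketch of the locking change, with rcu_read_lock()/rcu_read_unlock() stubbed so it compiles on its own: the sysrq handler is looked up and run under an RCU read-side critical section instead of a spinlock held with interrupts disabled for the whole, possibly very long, dump.

#include <stdio.h>

struct sysrq_key_op {
    void (*handler)(int key);
    const char *help_msg;
};

static void rcu_read_lock(void)   { }  /* stub for the real primitive */
static void rcu_read_unlock(void) { }  /* stub for the real primitive */

static void sysrq_show_tasks(int key)
{
    (void)key;
    puts("dumping thousands of tasks to the serial console...");
}

static struct sysrq_key_op show_tasks_op = { sysrq_show_tasks, "show-task-states(t)" };
static struct sysrq_key_op *sysrq_key_table[1] = { &show_tasks_op };

static void handle_sysrq(int key)
{
    struct sysrq_key_op *op;

    rcu_read_lock();              /* readers no longer block IRQs */
    op = sysrq_key_table[0];      /* table updates would use RCU publish */
    if (op)
        op->handler(key);
    rcu_read_unlock();
}

int main(void)
{
    handle_sysrq('t');
    return 0;
}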

[PATCH 2/4] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies

2014-05-08 Thread riel
From: Rik van Riel Workloads that span multiple NUMA nodes benefit greatly from being placed on nearby nodes. There are two common configurations on 8 node NUMA systems. One has four "islands" of 2 tightly coupled nodes, another has two "islands" of 4 tightly coupled nodes. W
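
A standalone sketch of the scoring adjustment, with a made-up distance table and an illustrative weighting formula rather than the one in the patch: a node's placement score is boosted by the scores of nearby nodes, weighted by how close they are, so tightly coupled islands of nodes attract a workload as a unit.

#include <stdio.h>

#define NR_NODES 4
#define LOCAL_DISTANCE 10

/* made-up SLIT-style table: nodes {0,1} and {2,3} form two islands */
static int node_distance[NR_NODES][NR_NODES] = {
    { 10, 16, 32, 32 },
    { 16, 10, 32, 32 },
    { 32, 32, 10, 16 },
    { 32, 32, 16, 10 },
};

static long raw_score[NR_NODES] = { 100, 80, 10, 5 };  /* e.g. faults per node */

static long placement_score(int nid, int max_dist)
{
    long score = 0;
    int other;

    for (other = 0; other < NR_NODES; other++) {
        int dist = node_distance[nid][other];
        /* closer nodes contribute more of their raw score */
        score += raw_score[other] * (max_dist - dist)
                 / (max_dist - LOCAL_DISTANCE);
    }
    return score;
}

int main(void)
{
    int nid;

    for (nid = 0; nid < NR_NODES; nid++)
        printf("node %d: raw %ld adjusted %ld\n",
               nid, raw_score[nid], placement_score(nid, 32));
    return 0;
}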

[PATCH 4/4] sched,numa: pull workloads towards their preferred nodes

2014-05-08 Thread riel
From: Rik van Riel Give a bonus to nodes near a workload's preferred node. This will pull workloads towards their preferred node. For workloads that span multiple NUMA nodes, pseudo-interleaving will even out the memory use between nodes over time, causing the preferred node to move around

[PATCH 1/4] numa,x86: store maximum numa node distance

2014-05-08 Thread riel
From: Rik van Riel Store the maximum node distance, so the numa placement code can do better placement on systems with complex numa topology. The function max_node_distance will return LOCAL_DISTANCE if the system has simple NUMA topology, with only a single level of remote distance. Signed

[PATCH 3/4] sched,numa: store numa_group's preferred nid

2014-05-08 Thread riel
From: Rik van Riel Store a numa_group's preferred nid. Used by the next patch to pull workloads towards their preferred nodes. Signed-off-by: Rik van Riel Tested-by: Chegu Vinod --- kernel/sched/fair.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/

[PATCH 0/4] sched,numa: task placement for complex NUMA topologies

2014-05-08 Thread riel
This patch series adds code for placement of tasks on a NUMA system with complex NUMA topology. The code is fairly well isolated, and does not impact things on systems with directly connected NUMA topology. The strategy is to adjust the score of each node, by the score of nearby NUMA nodes, weighe

[PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node

2014-04-11 Thread riel
From: Rik van Riel Setting the numa_preferred_node for a task in task_numa_migrate does nothing on a 2-node system. Either we migrate to the node that already was our preferred node, or we stay where we were. On a 4-node system, it can slightly decrease overhead, by not calling the NUMA code as

[PATCH 0/3] sched,numa: reduce page migrations with pseudo-interleaving

2014-04-11 Thread riel
The pseudo-interleaving code deals fairly well with the placement of tasks that are part of workloads that span multiple NUMA nodes, but the code has a number of corner cases left that can result in higher than desired overhead. This patch series reduces the overhead slightly, mostly visible throu

[PATCH 2/3] sched,numa: retry placement more frequently when misplaced

2014-04-11 Thread riel
From: Rik van Riel When tasks have not converged on their preferred nodes yet, we want to retry fairly often, to make sure we do not migrate a task's memory to an undesirable location, only to have to move it again later. This patch reduces the interval at which migration is retried, whe

[PATCH 1/3] sched,numa: count pages on active node as local

2014-04-11 Thread riel
From: Rik van Riel The NUMA code is smart enough to distribute the memory of workloads that span multiple NUMA nodes across those NUMA nodes. However, it still has a pretty high scan rate for such workloads, because any memory that is left on a node other than the node of the CPU that faulted

[PATCH 1/2] sched: fix and clean up calculate_imbalance

2014-07-28 Thread riel
From: Rik van Riel There are several ways in which update_sd_pick_busiest can end up picking an sd as "busiest" that has a below-average per-cpu load. All of those could use the same correction that was previously only applied when the selected group has a group imbalance. Additio

[PATCH 2/2] sched: make update_sd_pick_busiest return true on a busier sd

2014-07-28 Thread riel
From: Rik van Riel Currently update_sd_pick_busiest only identifies the busiest sd that is either overloaded, or has a group imbalance. When no sd is imbalanced or overloaded, the load balancer fails to find the busiest domain. This breaks load balancing between domains that are not overloaded
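
A standalone sketch of the comparison described here, with simplified stand-ins for the scheduler's per-group statistics: when no group is overloaded or imbalanced, a plainly busier group now wins the "busiest" selection instead of being ignored.

#include <stdbool.h>
#include <stdio.h>

struct sg_stats {
    unsigned long avg_load;
    bool overloaded;
    bool imbalanced;
};

static bool update_sd_pick_busiest(const struct sg_stats *sg,
                                   const struct sg_stats *busiest)
{
    /* overloaded or imbalanced groups still take precedence ... */
    if (sg->overloaded != busiest->overloaded)
        return sg->overloaded;
    if (sg->imbalanced != busiest->imbalanced)
        return sg->imbalanced;
    /* ... but when neither side is special, a busier group now wins */
    return sg->avg_load > busiest->avg_load;
}

int main(void)
{
    struct sg_stats busiest   = { 40, false, false };
    struct sg_stats candidate = { 70, false, false };

    printf("pick candidate as busiest? %d\n",
           update_sd_pick_busiest(&candidate, &busiest));
    return 0;
}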

[PATCH 0/2] load balancing fixes

2014-07-28 Thread riel
Currently update_sd_pick_busiest only identifies the busiest sd that is either overloaded, or has a group imbalance. When no sd is imbalanced or overloaded, the load balancer fails to find the busiest domain. This breaks load balancing between domains that are not overloaded, in the !SD_ASYM_PACKI

[PATCH 0/3] sched,numa: further numa balancing fixes

2014-06-14 Thread riel
A few more bug fixes that seem to improve convergence of "perf bench numa mem -m -0 -P 1000 -p X -t Y" for various values of X and Y, on both 4 and 8 node systems. This does not address the issue I highlighted Friday: https://lkml.org/lkml/2014/6/13/529 I have an idea on how to fix that issue, b

[PATCH 1/3] sched,numa: use group's max nid as task's preferred nid

2014-06-14 Thread riel
From: Rik van Riel From task_numa_placement, always try to consolidate the tasks in a group on the group's top nid. In case this task is part of a group that is interleaved over multiple nodes, task_numa_migrate will set the task's preferred nid to the best node it could find for
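
A standalone sketch of the consolidation rule, with illustrative data: the task's preferred nid is set to the node where the whole numa_group has the most faults, even if the task's own faults would favour a different node.

#include <stdio.h>

#define NR_NODES 4

static int group_max_nid(const unsigned long group_faults[NR_NODES])
{
    int nid, best = 0;

    for (nid = 1; nid < NR_NODES; nid++)
        if (group_faults[nid] > group_faults[best])
            best = nid;
    return best;
}

int main(void)
{
    /* the task's own faults might favour node 2, but the group as a
     * whole is concentrated on node 1 */
    unsigned long group_faults[NR_NODES] = { 120, 900, 300, 10 };

    printf("task's preferred nid set to %d\n", group_max_nid(group_faults));
    return 0;
}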

[PATCH 3/3] sched,numa: use effective_load to balance NUMA loads

2014-06-14 Thread riel
From: Rik van Riel When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places on a CPU is determined by the group the task is in. This is conveniently calculated for us by effective_load(), which task_numa_compare should use. The active groups on the source and destination CPU can be

[PATCH 2/3] sched,numa: move power adjustment into load_too_imbalanced

2014-06-14 Thread riel
From: Rik van Riel Currently the NUMA code scales the load on each node with the amount of CPU power available on that node, but it does not apply any adjustment to the load of the task that is being moved over. On systems with SMT/HT, this results in a task being weighed much more heavily than
