This code moves FPU handling from the non-preemptible part of running
a VCPU, way further out into the KVM_RUN ioctl handling. That way there
is no need to continuously save and load the qemu userspace FPU context
every time a VCPU context switches, or goes to sleep in the host kernel.
v2:
- move FPU s
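As a rough sketch of where the save/restore ends up (the helper names follow arch/x86/kvm/x86.c, but treat them and the simplified signature as illustrative; the real KVM_RUN handler takes more arguments and does more work):

static int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
    int r;

    kvm_load_guest_fpu(vcpu);   /* save qemu user FPU state, load guest FPU */
    r = vcpu_run(vcpu);         /* may be preempted or sleep; no FPU swapping here */
    kvm_put_guest_fpu(vcpu);    /* restore qemu user FPU state before returning */
    return r;
}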
From: Rik van Riel
Now that get_fpu and put_fpu do nothing, because the scheduler will
automatically load and restore the guest FPU context for us while we
are in this code (deep inside the vcpu_run main loop), we can get rid
of the get_fpu and put_fpu hooks.
Signed-off-by: Rik van Riel
From: Rik van Riel
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
From: Rik van Riel
The previous patches result in situations where the FPU state
for a task is not present in the FPU registers, when using eager
fpu mode. The signal frame setup and restore code needs to be
adjusted to deal with that situation.
Without this patch, the signal handler stack
From: Rik van Riel
Defer restoring the FPU state, if so desired, until the task returns to
userspace.
In case of kernel threads, KVM VCPU threads, and tasks performing longer
running operations in kernel space, this could mean skipping the FPU state
restore entirely for several context switches
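The mechanism can be summarized with a small sketch; the flag and helper names below are placeholders for illustration (the series uses its own), and math_state_restore() stands in for whatever actually reloads the state from memory:

/* On context switch, don't touch the FPU registers for the incoming
 * task; just note that a restore is owed.  Flag name is illustrative. */
static void defer_fpu_restore(struct task_struct *next)
{
    if (tsk_used_math(next))
        set_tsk_thread_flag(next, TIF_LOAD_FPU);
}

/* On the way back out to userspace, do the restore only if it is still
 * owed.  Kernel threads, and tasks that stay in the kernel across
 * several context switches, never reach this point in between. */
static void restore_fpu_before_return_to_user(void)
{
    if (test_and_clear_thread_flag(TIF_LOAD_FPU))
        math_state_restore();   /* reload user FPU state from memory */
}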
From: Rik van Riel
This change has no impact on normal tasks, but it allows tasks
with multiple FPU states (like a KVM VCPU thread) to check
whether their other FPU state is still loaded.
Exported so KVM can use it.
Signed-off-by: Rik van Riel
---
arch/x86/include/asm/fpu-internal.h | 15
From: Rik van Riel
If the old task is in a state where its FPU state could be changed by
a debugger, ensure the FPU state is always restored from memory on the
next context switch.
Currently the system only skips FPU reloads when !eager_fpu_mode()
and the task's FPU state is still loaded o
From: Rik van Riel
Replace fpu_switch_t with a thread flag, in preparation for only
restoring the FPU state on return to user space.
I have left the code around fpu_lazy_restore intact, even though
there appears to be no protection against races with e.g. ptrace,
and the optimization appears
From: Rik van Riel
Add some documentation to data structures used for FPU context
switching.
Signed-off-by: Rik van Riel
---
arch/x86/include/asm/processor.h | 9 +++--
arch/x86/kernel/cpu/common.c | 1 +
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include
From: Rik van Riel
If the next task still has its FPU state present in the FPU registers,
there is no need to restore it from memory.
This is no big deal on bare metal, where XSAVEOPT / XRSTOR are heavily
optimized, but those optimizations do not carry across VMENTER / VMEXIT.
Skipping the
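The check behind this is small; roughly the fpu_lazy_restore() test from that era's fpu-internal.h (field names as of the 3.x kernels this series targets):

static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
{
    /* The FPU registers still hold 'new's state if this CPU loaded
     * them last and no other task has owned the FPU here since. */
    return new == this_cpu_read_stable(fpu_owner_task) &&
           cpu == new->thread.fpu.last_cpu;
}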
From: Rik van Riel
Currently KVM always deactivates the FPU on VCPU unload, only to
reactivate it next time the guest uses it. This can make using the
FPU inside a KVM guest fairly expensive.
On the other hand, restoring the FPU state for a KVM guest is also
significantly more involved (and
From: Rik van Riel
Move the call to __thread_fpu_begin, which in turn calls
__thread_set_has_fpu, to a spot where the task actually has
the FPU.
This is in preparation for the next patch.
This changeset introduces an extraneous clts() call when
switching from one FPU-using task to another FPU
From: Rik van Riel
Tasks may have multiple invocations of kernel_fpu_begin and kernel_fpu_end
in sequence without ever hitting userspace in-between.
Delaying the restore of the user FPU state until the task returns to
userspace means the kernel only has to save the user FPU state on the
first
Currently the kernel will always load the FPU context, even
when switching to a kernel thread, or to an idle thread. In
the case of a task on a KVM VCPU going idle for a bit, and
waking up again later, this creates a vastly inefficient
chain of FPU context saves & loads:
1) save task FPU context,
From: Rik van Riel
It is possible to hit fpu_copy in eager fpu mode, but without
the current task's FPU context actually loaded into the CPU.
In that case, we should copy the FPU context from memory, not
save it from registers.
Signed-off-by: Rik van Riel
---
arch/x86/include/as
From: Rik van Riel
After system bootup, there is no totally reliable way to see
which CPUs are isolated, because the kernel may modify the
CPUs specified on the isolcpus= kernel command line option.
Export the CPU list that actually got isolated in sysfs,
specifically in the file /sys/devices
Currently there is no good way to get the isolated and nohz_full
CPUs at runtime, because the kernel may have changed the CPUs
specified on the commandline (when specifying all CPUs as
isolated, or CPUs that do not exist, ...)
This series adds two files to /sys/devices/system/cpu, which can
be use
From: Rik van Riel
Currently there is no way to query which CPUs are in nohz_full
mode from userspace.
Export the CPU list running in nohz_full mode in sysfs,
specifically in the file /sys/devices/system/cpu/nohz_full
This can be used by system management tools like libvirt,
openstack, and
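A minimal sketch of how such a file can be exposed, assuming a read-only attribute under the existing cpu subsystem that prints the runtime mask as a list (names are illustrative, not necessarily those used in the patch):

#include <linux/cpumask.h>
#include <linux/device.h>
#include <linux/tick.h>

static ssize_t print_cpus_nohz_full(struct device *dev,
                                    struct device_attribute *attr, char *buf)
{
    /* print the CPUs actually running in nohz_full mode, e.g. "1-7" */
    return cpumap_print_to_pagebuf(true, buf, tick_nohz_full_mask);
}
static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);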
Profiling reveals that a lot of the overhead from the nohz_full
accounting seems to come not from the accounting itself, but from
disabling and re-enabling interrupts.
This patch series removes the interrupt disabling & re-enabling
from __acct_update_integrals, which is called on both syscall
entr
From: Rik van Riel
On syscall entry with nohz_full on, we enable interrupts, call user_exit,
disable interrupts, do something, re-enable interrupts, and go on our
merry way.
Profiling shows that a large amount of the nohz_full overhead comes
from the extraneous disabling and re-enabling of
From: Rik van Riel
The function __acct_update_integrals() is called both from irq context
and task context. This creates a race where irq context can advance
tsk->acct_timexpd to a value larger than time, leading to a negative
value, which causes a divide error. See commit 6d5b5acca9e5
(&
From: Peter Zijlstra
Reduce indentation in __acct_update_integrals.
Cc: Andy Lutomirski
Cc: Frederic Weisbecker
Cc: Peter Zijlstra
Cc: Heiko Carstens
Cc: Thomas Gleixner
Signed-off-by: Peter Zijlstra
Signed-off-by: Rik van Riel
---
kernel/tsacct.c | 34
A previous attempt to resolve a major conflict between load balancing and
NUMA balancing, changeset 095bebf61a46 ("sched/numa: Do not move past the
balance point if unbalanced"), introduced its own problems.
Revert that changeset, and introduce a new fix, which actually seems to
resolve the issues
From: Rik van Riel
Commit 095bebf61a46 ("sched/numa: Do not move past the balance point
if unbalanced") broke convergence of workloads with just one runnable
thread, by making it impossible for the one runnable thread on the
system to move from one NUMA node to another.
Instead,
From: Rik van Riel
Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
preferred node") fixes an issue where workloads would never converge
on a fully loaded (or overloaded) system.
However, it introduces a regression on less than fully loaded systems,
where
From: Rik van Riel
When tick based accounting is run from a remote CPU, it is actually
possible to encounter a task with PF_VCPU set. Make sure to account
those as guest time.
Signed-off-by: Rik van Riel
---
kernel/sched/cputime.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff
From: Rik van Riel
Have the housekeeper CPU call account_process_tick to do tick based
accounting for remote nohz_full CPUs.
Signed-off-by: Rik van Riel
---
kernel/time/timer.c | 28
1 file changed, 28 insertions(+)
diff --git a/kernel/time/timer.c b/kernel/time
From: Rik van Riel
Add a cpu parameter to irqtime_account_process_tick, to specify what
cpu to run the statistics for.
In order for this to actually work on a different cpu, all the functions
called by irqtime_account_process_tick need to be able to handle working
for another CPU.
Signed-off-by
From: Rik van Riel
Teach account_process_tick to work on the CPU of the task
specified in the function argument. This allows us to do
remote tick based sampling of a nohz_full cpu from a
housekeeping CPU.
Signed-off-by: Rik van Riel
---
kernel/sched/cputime.c | 8 +++-
1 file changed, 7
This series seems to make basic tick based time sampling from a
housekeeping CPU work, allowing us to have tick based accounting
on a nohz_full CPU, and no longer doing vtime accounting on those
CPUs.
It still needs a major cleanup, and steal time accounting and irq
accounting are still missing.
From: Rik van Riel
Add a cpu parameter to steal_account_process_tick, so it can
be used to do CPU time accounting for another CPU.
Signed-off-by: Rik van Riel
---
kernel/sched/cputime.c | 12 ++--
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/cputime.c b
From: Rik van Riel
With the introduction of remote tick based sampling, we now have
three ways of gathering time statistics:
- local tick based sampling
- vtime accounting (used natively on some architectures)
- remote tick based sampling
On a system with remote tick based sampling, the
From: Rik van Riel
Rename vtime_accounting_enabled to tick_accounting_disabled, because it
can mean either that vtime accounting is enabled, or that the system
is doing tick based sampling from a housekeeping CPU for nohz_full CPUs.
Signed-off-by: Rik van Riel
---
include/linux
From: Rik van Riel
Simple transformation to allow tick based sampling from a remote
cpu. Additional changes may be needed to actually acquire the
steal time info for remote cpus from the host/hypervisor.
Signed-off-by: Rik van Riel
---
include/linux/kernel_stat.h | 2 +-
kernel/sched
From: Rik van Riel
Simple transformation to allow account_idle_time to account the
idle time for another CPU.
Signed-off-by: Rik van Riel
---
arch/ia64/kernel/time.c | 2 +-
arch/powerpc/kernel/time.c | 2 +-
arch/s390/kernel/idle.c | 2 +-
include/linux/kernel_stat.h | 2
From: Rik van Riel
The timer housekeeping CPU can do tick based sampling for remote
CPUs. For now this is the first CPU in the housekeeping_mask.
Eventually we could move to having one timer housekeeping cpu per
socket, if needed.
Signed-off-by: Rik van Riel
---
include/linux/tick.h | 9
From: Rik van Riel
When timer statistics are sampled from a remote CPU, vtime calculations
at the kernel/user and kernel/guest boundary are no longer necessary.
Skip them.
Signed-off-by: Rik van Riel
---
include/linux/context_tracking.h | 4 ++--
kernel/context_tracking.c | 6 --
2
From: Rik van Riel
Reorganize the order of ifs in change_pmd_range a little, in
preparation for the next patch.
Signed-off-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Reported-by: Xing Gang
Tested-by: Chegu Vinod
---
mm/mprotect.c | 7 ---
1 file changed, 4 insertions
The NUMA scanning code can end up iterating over many gigabytes
of unpopulated memory, especially in the case of a freshly started
KVM guest with lots of memory.
This results in the mmu notifier code being called even when
there are no mapped pages in a virtual address range. The amount
of time wa
From: Rik van Riel
Normally task_numa_work scans over a fairly small amount of memory,
but it is possible to run into a large unpopulated part of virtual
memory, with no pages mapped. In that case, task_numa_work can run
for a while, and it may make sense to reschedule as required.
Signed-off
From: Rik van Riel
The NUMA scanning code can end up iterating over many gigabytes
of unpopulated memory, especially in the case of a freshly started
KVM guest with lots of memory.
This results in the mmu notifier code being called even when
there are no mapped pages in a virtual address range
The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being availab
From: Rik van Riel
Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.
Now that
From: Rik van Riel
We track both the node of the memory after a NUMA fault, and the node
of the CPU on which the fault happened. Rename the local variables in
task_numa_fault to make things more explicit.
Suggested-by: Mel Gorman
Signed-off-by: Rik van Riel
---
kernel/sched/fair.c | 8
From: Rik van Riel
The numa_faults_cpu statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.
Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: Chegu Vinod
Signed-off-by: Rik van Riel
---
kernel/sched
From: Rik van Riel
Cleanup suggested by Mel Gorman. Now the code contains some more
hints on what statistics go where.
Suggested-by: Mel Gorman
Signed-off-by: Rik van Riel
---
kernel/sched/fair.c | 34 +-
1 file changed, 25 insertions(+), 9 deletions(-)
diff
From: Rik van Riel
Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.
The next patches use this to build up a bitmap of which nodes a
workload is
From: Rik van Riel
Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration
From: Rik van Riel
In order to get a more consistent naming scheme, making it clear
which fault statistics track memory locality, and which track
CPU locality, rename the memory fault statistics.
Suggested-by: Mel Gorman
Signed-off-by: Rik van Riel
---
include/linux/sched.h | 8
From: Rik van Riel
Tracing the code that decides the active nodes has made it abundantly clear
that the naive implementation of the faults_from code has issues.
Specifically, the garbage collector in some workloads will access orders
of magnitude more memory than the threads that do all the
From: Rik van Riel
The current code in task_numa_placement calculates the difference
between the old and the new value, but also temporarily stores half
of the old value in the per-process variables.
The NUMA balancing code looks at those per-process variables, and
having other tasks
This patch set attempts to implement a pseudo-interleaving
policy for workloads that do not fit in one NUMA node.
For each NUMA group, we track the NUMA nodes on which the
workload is actively running, and try to concentrate the
memory on those NUMA nodes.
Unfortunately, the scheduler appears to
From: Rik van Riel
Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration
From: Rik van Riel
Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.
It is time to
From: Rik van Riel
Track which nodes NUMA faults are triggered from. This uses a similar
mechanism to what is used to track the memory involved in numa faults.
This is used, in the next patch, to build up a bitmap of which nodes
a workload is actively running on.
Signed-off-by: Rik van Riel
From: Rik van Riel
The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.
Signed-off-by: Rik van Riel
---
kernel/sched/fair.c | 33 +
1 file changed, 33 insertions
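The gist can be sketched as follows; group_faults_cpu() and the active_nodes nodemask are assumed to be what this series adds to struct numa_group, and the cutoff used here is purely illustrative, not the one from the patch:

static void update_numa_active_node_mask(struct numa_group *numa_group)
{
    unsigned long faults, max_faults = 0;
    int nid;

    /* find the node receiving the most CPU-side faults for this group */
    for_each_online_node(nid) {
        faults = group_faults_cpu(numa_group, nid);
        if (faults > max_faults)
            max_faults = faults;
    }

    /* mark nodes with a reasonable share of those faults as active */
    for_each_online_node(nid) {
        faults = group_faults_cpu(numa_group, nid);
        if (faults * 3 > max_faults)        /* illustrative cutoff */
            node_set(nid, numa_group->active_nodes);
        else
            node_clear(nid, numa_group->active_nodes);
    }
}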
From: Rik van Riel
Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration
From: Rik van Riel
The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.
Specifically, the garbage collector in some workloads will
access orders of magnitude more memory than the threads
that do all the active work. This resulted in the
From: Rik van Riel
The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.
Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: Chegu Vinod
Signed-off-by: Rik van Riel
Signed-off-by: Rik van
From: Rik van Riel
Being able to see how the active nodemask changes over time, and why,
can be quite useful.
Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: Chegu Vinod
Signed-off-by: Rik van Riel
Signed-off-by: Rik van Riel
---
include/trace/events/sched.h | 34
The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being availab
From: Rik van Riel
Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.
Now that
From: Rik van Riel
Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.
The next patches use this to build up a bitmap of which nodes a
workload is
The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being availab
From: Rik van Riel
Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.
Now that
From: Rik van Riel
Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.
The next patches use this to build up a bitmap of which nodes a
workload is
From: Rik van Riel
The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.
Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: Chegu Vinod
Signed-off-by: Rik van Riel
---
kernel/sched/fair.c
From: Rik van Riel
Being able to see how the active nodemask changes over time, and why,
can be quite useful.
Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: Chegu Vinod
Signed-off-by: Rik van Riel
---
include/trace/events/sched.h | 34 ++
kernel
From: Rik van Riel
Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration
From: Rik van Riel
The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.
Specifically, the garbage collector in some workloads will
access orders of magnitude more memory than the threads
that do all the active work. This resulted in the
From: Rik van Riel
The current code in task_numa_placement calculates the difference
between the old and the new value, but also temporarily stores half
of the old value in the per-process variables.
The NUMA balancing code looks at those per-process variables, and
having other tasks
The current automatic NUMA balancing code base has issues with
workloads that do not fit on one NUMA node. Page migration is
slowed down, but memory distribution between the nodes where
the workload runs is essentially random, often resulting in a
suboptimal amount of memory bandwidth being availab
From: Rik van Riel
The faults_from statistics are used to maintain an active_nodes nodemask
per numa_group. This allows us to be smarter about when to do numa migrations.
Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: Chegu Vinod
Signed-off-by: Rik van Riel
---
kernel/sched/fair.c
From: Rik van Riel
The current code in task_numa_placement calculates the difference
between the old and the new value, but also temporarily stores half
of the old value in the per-process variables.
The NUMA balancing code looks at those per-process variables, and
having other tasks
From: Rik van Riel
Use the active_nodes nodemask to make smarter decisions on NUMA migrations.
In order to maximize performance of workloads that do not fit in one NUMA
node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration
From: Rik van Riel
Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.
Now that
From: Rik van Riel
Track which nodes NUMA faults are triggered from, in other words
the CPUs on which the NUMA faults happened. This uses a similar
mechanism to what is used to track the memory involved in numa faults.
The next patches use this to build up a bitmap of which nodes a
workload is
From: Rik van Riel
The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.
Specifically, the garbage collector in some workloads will
access orders of magnitude more memory than the threads
that do all the active work. This resulted in the
From: Rik van Riel
There are cases where read_seqbegin_or_lock needs to block irqs,
because the seqlock in question nests inside a lock that can also
be taken from irq context.
Add read_seqbegin_or_lock_irqsave and done_seqretry_irqrestore, which
are almost identical to read_seqbegin_or_lock and
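The usage pattern is the same retry loop as read_seqbegin_or_lock()/need_seqretry()/done_seqretry(), with interrupts blocked across the locked branch; a sketch against sig->stats_lock, the lock the follow-up patch converts:

unsigned long flags;
int seq, nextseq = 0;

do {
    seq = nextseq;
    flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
    /* ... read the statistics protected by stats_lock ... */
    nextseq = 1;    /* if the lockless pass raced, take the lock for real */
} while (need_seqretry(&sig->stats_lock, seq));
done_seqretry_irqrestore(&sig->stats_lock, seq, flags);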
From: Rik van Riel
The sig->stats_lock nests inside the tasklist_lock and the
sighand->siglock in __exit_signal and wait_task_zombie.
However, both of those locks can be taken from irq context,
which means we need to use the interrupt safe variant of
read_seqbegin_or_lock. This
The sig->stats_lock nests inside the tasklist_lock and the
sighand->siglock in __exit_signal and wait_task_zombie.
However, both of those locks can be taken from irq context,
which means we need to use the interrupt safe variant of
read_seqbegin_or_lock. This blocks interrupts when the "lock"
bran
Andrew, these patches contain all the fixes from the threads. They
seem to compile on normal x86 and UML now.
Thanks to Paul, Randy, and everybody else.
From: Rik van Riel
Some sysrq handlers can run for a long time, because they dump a lot
of data onto a serial console. Having RCU stall warnings pop up in
the middle of them only makes the problem worse.
This patch temporarily disables RCU stall warnings while a sysrq
request is handled
From: Rik van Riel
Echoing values into /proc/sysrq-trigger seems to be a popular way to
get information out of the kernel. However, dumping information about
thousands of processes, or hundreds of CPUs to serial console can
result in IRQs being blocked for minutes, resulting in various kinds
of
From: Rik van Riel
Workloads that span multiple NUMA nodes benefit greatly from being placed
on nearby nodes. There are two common configurations on 8 node NUMA systems.
One has four "islands" of 2 tightly coupled nodes, another has two "islands"
of 4 tightly coupled nodes.
W
From: Rik van Riel
Give a bonus to nodes near a workload's preferred node. This will pull
workloads towards their preferred node.
For workloads that span multiple NUMA nodes, pseudo-interleaving will
even out the memory use between nodes over time, causing the preferred
node to move around
From: Rik van Riel
Store the maximum node distance, so the numa placement code can do
better placement on systems with complex numa topology.
The function max_node_distance will return LOCAL_DISTANCE if the
system has simple NUMA topology, with only a single level of
remote distance.
Signed
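A minimal sketch of what such a helper can look like, built on the existing node_distance()/LOCAL_DISTANCE interfaces (the actual patch may compute and cache the value once at boot):

int max_node_distance(void)
{
    int a, b, dist, max_dist = LOCAL_DISTANCE;

    /* scan all node pairs for the largest SLIT distance */
    for_each_online_node(a) {
        for_each_online_node(b) {
            dist = node_distance(a, b);
            if (dist > max_dist)
                max_dist = dist;
        }
    }
    return max_dist;    /* LOCAL_DISTANCE means simple NUMA topology */
}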
From: Rik van Riel
Store a numa_group's preferred nid. Used by the next patch to pull
workloads towards their preferred nodes.
Signed-off-by: Rik van Riel
Tested-by: Chegu Vinod
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/
This patch series adds code for placement of tasks on a NUMA system
with complex NUMA topology. The code is fairly well isolated, and
does not impact things on systems with directly connected NUMA
topology.
The strategy is to adjust the score of each node, by the score of
nearby NUMA nodes, weighe
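An illustrative sketch of that scoring idea, assuming per-node fault counts are available via task_faults() as in kernel/sched/fair.c; the weighting below is an example, not the series' exact formula:

static unsigned long score_with_nearby_nodes(struct task_struct *p, int nid,
                                             int max_dist)
{
    unsigned long score = task_faults(p, nid);
    int node, dist;

    for_each_online_node(node) {
        if (node == nid)
            continue;
        dist = node_distance(nid, node);
        if (dist >= max_dist)
            continue;
        /* closer nodes contribute a larger share of their faults */
        score += task_faults(p, node) * (max_dist - dist) /
                 (max_dist - LOCAL_DISTANCE);
    }
    return score;
}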
From: Rik van Riel
Setting the numa_preferred_node for a task in task_numa_migrate
does nothing on a 2-node system. Either we migrate to the node
that already was our preferred node, or we stay where we were.
On a 4-node system, it can slightly decrease overhead, by not
calling the NUMA code as
The pseudo-interleaving code deals fairly well with the placement
of tasks that are part of workloads that span multiple NUMA nodes,
but the code has a number of corner cases left that can result in
higher than desired overhead.
This patch series reduces the overhead slightly, mostly visible
throu
From: Rik van Riel
When tasks have not converged on their preferred nodes yet, we want
to retry fairly often, to make sure we do not migrate a task's memory
to an undesirable location, only to have to move it again later.
This patch reduces the interval at which migration is retried,
whe
From: Rik van Riel
The NUMA code is smart enough to distribute the memory of workloads
that span multiple NUMA nodes across those NUMA nodes.
However, it still has a pretty high scan rate for such workloads,
because any memory that is left on a node other than the node of
the CPU that faulted
From: Rik van Riel
There are several ways in which update_sd_pick_busiest can end up
picking an sd as "busiest" that has a below-average per-cpu load.
All of those could use the same correction that was previously only
applied when the selected group has a group imbalance.
Additio
From: Rik van Riel
Currently update_sd_pick_busiest only identifies the busiest sd
that is either overloaded, or has a group imbalance. When no
sd is imbalanced or overloaded, the load balancer fails to find
the busiest domain.
This breaks load balancing between domains that are not overloaded
Currently update_sd_pick_busiest only identifies the busiest sd
that is either overloaded, or has a group imbalance. When no
sd is imbalanced or overloaded, the load balancer fails to find
the busiest domain.
This breaks load balancing between domains that are not overloaded,
in the !SD_ASYM_PACKI
A few more bug fixes that seem to improve convergence of
"perf bench numa mem -m -0 -P 1000 -p X -t Y" for various
values of X and Y, on both 4 and 8 node systems.
This does not address the issue I highlighted Friday:
https://lkml.org/lkml/2014/6/13/529
I have an idea on how to fix that issue, b
From: Rik van Riel
From task_numa_placement, always try to consolidate the tasks
in a group on the group's top nid.
In case this task is part of a group that is interleaved over
multiple nodes, task_numa_migrate will set the task's preferred
nid to the best node it could find for
From: Rik van Riel
When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places
on a CPU is determined by the group the task is in. This is conveniently
calculated for us by effective_load(), which task_numa_compare should
use.
The active groups on the source and destination CPU can be
From: Rik van Riel
Currently the NUMA code scales the load on each node with the
amount of CPU power available on that node, but it does not
apply any adjustment to the load of the task that is being
moved over.
On systems with SMT/HT, this results in a task being weighed
much more heavily than