Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-22 Thread Peter Zijlstra
On Fri, May 22, 2020 at 02:28:54PM +0100, Mel Gorman wrote:

> Is something like this on top of your patch what you had in mind?

All under the assumption that it makes it go faster of course ;-)

> ---8<---

static inline bool ttwu_queue_cond(int cpu, int wake_flags)
{
	/*
	 * If the CPU does not share cache, then queue the task on the
	 * remote rqs wakelist to avoid accessing remote data.
	 */
	if (!cpus_share_cache(smp_processor_id(), cpu))
		return true;

	/*
	 * If the task is descheduling and the only running task on the
	 * CPU, then use the wakelist to offload the task activation to
	 * the soon-to-be-idle CPU so the waker can continue.
	 */
	if ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)
		return true;

	return false;
}

> -static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>  {
> - if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> - sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> - __ttwu_queue_remote(p, cpu, wake_flags);
> - return true;
> + if (sched_feat(TTWU_QUEUE)) {
> + /*
> +  * If CPU does not share cache then queue the task on the remote
> +  * rqs wakelist to avoid accessing remote data. Alternatively,
> +  * if the task is descheduling and the only running task on the
> +  * CPU then use the wakelist to offload the task activation to
> +  * the CPU that will soon be idle so the waker can continue.
> +  * nr_running is checked to avoid unnecessary task stacking.
> +  */
> + if (!cpus_share_cache(smp_processor_id(), cpu) ||
> + ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)) {
> + sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> + __ttwu_queue_wakelist(p, cpu, wake_flags);
> + return true;
> + }

if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
__ttwu_queue_remote(p, cpu, wake_flags);
return true;

>   }
>  
>   return false;


might be easier to read...


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-22 Thread Mel Gorman
On Thu, May 21, 2020 at 01:41:32PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 11:38:16AM +0100, Mel Gorman wrote:
> > IIUC, this patch front-loads as much work as possible before checking if
> > the task is on_rq; if the waker/wakee do not share a cache, the task is
> > queued on the wake_list and otherwise a direct wakeup is done.
> > 
> > The advantage is that spinning is avoided on p->on_rq when p does not
> > share a cache. The disadvantage is that it may result in tasks being
> > stacked but this should only happen when the domain is overloaded and
> > select_task_rq() is unlikely to find an idle CPU. The load balancer would
> > soon correct the situation anyway.
> > 
> > In terms of netperf for my testing, the benefit is marginal because the
> > wakeups are primarily between tasks that share cache. It does trigger as
> > perf indicates that some time is spent in ttwu_queue_remote with this
> > patch, it's just that the overall time spent spinning on p->on_rq is
> > very similar. I'm still waiting on other workloads to complete to see
> > what the impact is.
> 
> So it might make sense to play with the exact conditions under which
> we'll attempt this remote queue, if we see a large 'local' p->on_cpu
> spin time, it might make sense to attempt the queue even in this case.
> 
> We could for example change it to:
> 
>   if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags | WF_ON_CPU))
>   goto unlock;
> 
> and then use that in ttwu_queue_remote() to differentiate between these
> two cases.
> 

>  #endif /* CONFIG_SMP */
>  
>   ttwu_queue(p, cpu, wake_flags);

Is something like this on top of your patch what you had in mind?

---8<---

---
 kernel/sched/core.c  | 35 ++-
 kernel/sched/sched.h |  3 ++-
 2 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 987b8ecf2ee9..435ecf5820ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2330,13 +2330,19 @@ void scheduler_ipi(void)
irq_exit();
 }
 
-static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+/*
+ * Queue a task on the target CPUs wake_list and wake the CPU via IPI if
+ * necessary. The wakee CPU on receipt of the IPI will queue the task
+ * via sched_ttwu_wakeup() for activation instead of the waking task
+ * activating and queueing the wakee.
+ */
+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
struct rq *rq = cpu_rq(cpu);
 
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
-   if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
+   if (llist_add(&p->wake_entry, &rq->wake_list)) {
if (!set_nr_if_polling(rq->idle))
smp_send_reschedule(cpu);
else
@@ -2373,12 +2379,23 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
-static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-   __ttwu_queue_remote(p, cpu, wake_flags);
-   return true;
+   if (sched_feat(TTWU_QUEUE)) {
+   /*
+* If CPU does not share cache then queue the task on the remote
+* rqs wakelist to avoid accessing remote data. Alternatively,
+* if the task is descheduling and the only running task on the
+* CPU then use the wakelist to offload the task activation to
+* the CPU that will soon be idle so the waker can continue.
+* nr_running is checked to avoid unnecessary task stacking.
+*/
+   if (!cpus_share_cache(smp_processor_id(), cpu) ||
+   ((wake_flags & WF_ON_RQ) && cpu_rq(cpu)->nr_running <= 1)) {
+   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+   __ttwu_queue_wakelist(p, cpu, wake_flags);
+   return true;
+   }
}
 
return false;
@@ -2391,7 +2408,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-   if (ttwu_queue_remote(p, cpu, wake_flags))
+   if (ttwu_queue_wakelist(p, cpu, wake_flags))
return;
 #endif
 
@@ -2611,7 +2628,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 * let the waker make forward progress. This is safe because IRQs are
 * disabled and the IPI will deliver after on_cpu is cleared.
 */
-   if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+   if (READ_ONCE(p->on_cpu) && 

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-22 Thread Mel Gorman
On Fri, May 22, 2020 at 09:09:50AM +0800, Hillf Danton wrote:
> 
> On Thu, 21 May 2020 17:04:08 +0100 Mel Gorman wrote:
> > On Thu, May 21, 2020 at 10:09:31PM +0800, Hillf Danton wrote:
> > > > I'm ignoring the coding style of c++ comments but minimally that should
> > > > be fixed. More importantly, node_type can be one of node_overloaded,
> > > > node_has_spare or node_fully_busy and this is checking if there is a
> > > > mismatch. However, it's not taking into account whether the dst_node
> > > > is more or less loaded than the src and does not appear to be actually
> > > > checking spare capacity like the comment suggests.
> > > > 
> > >
> > > Type other than node_has_spare is not considered because node is not
> > > possible to be so idle that two communicating tasks would better
> > > "stay local."
> > > 
> > 
> > You hardcode an imbalance of 2 at the start without computing any
> > imbalance.
> 
> Same result comes up if it's a bool.
> 

Then the magic number is simply confusing. The patch needs to be
a lot clearer about what the intent is if my patch that adds a "if
(env->src_stats.node_type == node_has_spare)" check is not what you were
aiming for.

> > Then if the source is fully_busy or overloaded while the
> > dst is idle, a task can move but that is based on an imaginary hard-coded
> > imbalance of 2.
> 
> This is the potpit I walked around by checking spare capacity first. As
> for overloaded, I see it as a signal that a glitch is not idle somewhere
> else, and I prefer to push it in to ICU before it's too late.
> 

The domain could be overloaded simply due to CPU bindings. I cannot
parse the ICU comment (Intensive Care Unit?!?) :(

> > Finally, it appears that the load balancer and
> > NUMA balancer are again using separate logic when one part of the
> > objective of the series is that the load balancer and NUMA balancer would
> > not override each other.
> 
> This explains 80% of why it is a choppy road ahead.
> 

And I don't think we should go back to the load balancer and NUMA balancer
taking different actions. It ends up doing useless CPU migrations and
can lead to higher NUMA scanning activity. It's partially why I changed
task_numa_compare to prefer swapping with tasks that move to their
preferred node when using an idle CPU would cause an imbalance.

> > As the imbalance is never computed, the patch
> > can create one which the load balancer then overrides. What am I missing?
> 
> LB would give a green light if things move in the direction in favor of
> cutting imb.
> 

Load balancing primarily cares about balancing the number of idle CPUs
between domains when there is spare capacity. While it tries to avoid
balancing by moving tasks from their preferred node, it will do so if
there are no other candidates.
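
For reference, the "balancing the number of idle CPUs" part is the
spare-capacity case of calculate_imbalance(); condensed from the hunk
quoted later in this thread (5.7), it is roughly:

	env->migration_type = migrate_task;
	env->imbalance = max_t(long, 0,
			       (local->idle_cpus - busiest->idle_cpus) >> 1);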

> > > > 
> > > >
> > > > Then there is this part
> > > > 
> > > > +   imbalance = adjust_numa_imbalance(imbalance,
> > > > +   
> > > > env->src_stats.nr_running);
> > > > +
> > > > +   //Do nothing without imbalance
> > > > +   if (!imbalance) {
> > > > +   imbalance = 2;
> > > > +   goto check_imb;
> > > > +   }
> > > > 
> > > > So... if there is no imbalance, assume there is an imbalance of 2, jump 
> > > > to
> > > > a branch that will always be false and fall through to code that ignores
> > > > the value of imbalance .. it's hard to see exactly why that code 
> > > > flow
> > > > is ideal.
> > > > 
> > > With spare capacity be certain, then lets see how idle the node is.
> > 
> > But you no longer check how idle it is or what the relative idleness of
> > the destination is compared to the source. adjust_numa_imbalance does
> > not calculate an imbalance, it only decides whether it should be
> > ignored.
> 
> Then the idle CPU is no longer so tempting.
> 

True. While it's perfectly possible to ignore imbalance and use an idle
CPU if it exists, the load balancer will simply override it later and we
go back to the NUMA balancer and load balancer fighting each other with
the NUMA balancer retrying migrations based on p->numa_migrate_retry.
Whatever logic is used to allow imbalances (be they based on communicating
tasks or preferred locality), it needs to be the same in both balancers.
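
For reference, the piece that is currently shared between the two in 5.7
is adjust_numa_imbalance(); a condensed sketch of its two call sites,
taken from the hunks quoted elsewhere in this thread:

	/* NUMA balancer side, task_numa_find_cpu() (condensed) */
	src_running = env->src_stats.nr_running - 1;
	dst_running = env->dst_stats.nr_running + 1;
	imbalance = max(0, dst_running - src_running);
	imbalance = adjust_numa_imbalance(imbalance, src_running);
	if (!imbalance)
		maymove = true;	/* an idle destination CPU may be used */

	/* Load balancer side, calculate_imbalance() (condensed) */
	if (env->sd->flags & SD_NUMA)
		env->imbalance = adjust_numa_imbalance(env->imbalance,
						busiest->sum_nr_running);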

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-21 Thread Mel Gorman
On Thu, May 21, 2020 at 10:09:31PM +0800, Hillf Danton wrote:
> > I'm ignoring the coding style of c++ comments but minimally that should
> > be fixed. More importantly, node_type can be one of node_overloaded,
> > node_has_spare or node_fully_busy and this is checking if there is a
> > mismatch. However, it's not taking into account whether the dst_node
> > is more or less loaded than the src and does not appear to be actually
> > checking spare capacity like the comment suggests.
> > 
>
> Type other than node_has_spare is not considered because node is not
> possible to be so idle that two communicating tasks would better
> "stay local."
> 

You hardcode an imbalance of 2 at the start without computing any
imbalance. Then if the source is fully_busy or overloaded while the
dst is idle, a task can move but that is based on an imaginary hard-coded
imbalance of 2. Finally, it appears that the load balancer and
NUMA balancer are again using separate logic when one part of the
objective of the series is that the load balancer and NUMA balancer would
not override each other. As the imbalance is never computed, the patch
can create one which the load balancer then overrides. What am I missing?

That said, it does make a certain amount of sense that if the dst has
spare capacity and the src is fully busy or overloaded then allow the
migration regardless of imbalance but that would be a different patch --
something like this but I didn't think too deeply or test it.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..97c0e090e161 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1916,19 +1916,27 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 * imbalance that would be overruled by the load balancer.
 */
if (env->dst_stats.node_type == node_has_spare) {
-   unsigned int imbalance;
+   unsigned int imbalance = 0;
int src_running, dst_running;
 
/*
-* Would movement cause an imbalance? Note that if src has
-* more running tasks that the imbalance is ignored as the
-* move improves the imbalance from the perspective of the
-* CPU load balancer.
-* */
-   src_running = env->src_stats.nr_running - 1;
-   dst_running = env->dst_stats.nr_running + 1;
-   imbalance = max(0, dst_running - src_running);
-   imbalance = adjust_numa_imbalance(imbalance, src_running);
+* Check the imbalance if both src and dst have spare
+* capacity. If src is fully_busy or overloaded then
+* allow the task to move as it's both improving locality
+* and reducing an imbalance.
+*/
+   if (env->src_stats.node_type == node_has_spare) {
+   /*
+* Would movement cause an imbalance? Note that if src
+* has more running tasks that the imbalance is
+* ignored as the move improves the imbalance from the
+* perspective of the CPU load balancer.
+*/
+   src_running = env->src_stats.nr_running - 1;
+   dst_running = env->dst_stats.nr_running + 1;
+   imbalance = max(0, dst_running - src_running);
+   imbalance = adjust_numa_imbalance(imbalance, src_running);
+   }
 
/* Use idle CPU if there is no imbalance */
if (!imbalance) {

> > Then there is this part
> > 
> > +   imbalance = adjust_numa_imbalance(imbalance,
> > +   env->src_stats.nr_running);
> > +
> > +   //Do nothing without imbalance
> > +   if (!imbalance) {
> > +   imbalance = 2;
> > +   goto check_imb;
> > +   }
> > 
> > So... if there is no imbalance, assume there is an imbalance of 2, jump to
> > a branch that will always be false and fall through to code that ignores
> > the value of imbalance .. it's hard to see exactly why that code flow
> > is ideal.
> > 
> With spare capacity be certain, then lets see how idle the node is.

But you no longer check how idle it is or what the relative idleness of
the destination is compared to the source. adjust_numa_imbalance does
not calculate an imbalance, it only decides whether it should be
ignored.
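
For reference, the 5.7 helper itself (its body is also visible in the
disabling patch later in this thread) only zeroes a small imbalance when
the source is almost idle; it never computes one:

static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
	unsigned int imbalance_min;

	/*
	 * Allow a small imbalance based on a simple pair of communicating
	 * tasks that remain local when the source domain is almost idle.
	 */
	imbalance_min = 2;
	if (src_nr_running <= imbalance_min)
		return 0;

	return imbalance;
}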

> And I
> can not do that without the tool you created, He bless you. Although not
> sure they're two tasks talking to each other. This is another topic in the
> coming days.
> 

There is insufficient context in this path to determine if two tasks are
communicating. In some cases it may be inferred from last_wakee but that
only works for two tasks, it doesn't work for 1:n waker:wakees such as
might 

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-21 Thread Peter Zijlstra
On Thu, May 21, 2020 at 11:38:16AM +0100, Mel Gorman wrote:
> IIUC, this patch front-loads as much work as possible before checking if
> the task is on_rq; if the waker/wakee do not share a cache, the task is
> queued on the wake_list and otherwise a direct wakeup is done.
> 
> The advantage is that spinning is avoided on p->on_rq when p does not
> share a cache. The disadvantage is that it may result in tasks being
> stacked but this should only happen when the domain is overloaded and
> select_task_rq() is unlikely to find an idle CPU. The load balancer would
> soon correct the situation anyway.
> 
> In terms of netperf for my testing, the benefit is marginal because the
> wakeups are primarily between tasks that share cache. It does trigger as
> perf indicates that some time is spent in ttwu_queue_remote with this
> patch, it's just that the overall time spent spinning on p->on_rq is
> very similar. I'm still waiting on other workloads to complete to see
> what the impact is.

So it might make sense to play with the exact conditions under which
we'll attempt this remote queue, if we see a large 'local' p->on_cpu
spin time, it might make sense to attempt the queue even in this case.

We could for example change it to:

if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags | WF_ON_CPU))
goto unlock;

and then use that in ttwu_queue_remote() to differentiate between these
two cases.

Anyway, if it's a wash (atomic op vs spinning) then it's probably not
worth it.

Another optimization might be to forgo the IPI entirely in this case and
instead stick a sched_ttwu_pending() at the end of __schedule() or
something like that.

> However, intuitively at least, it makes sense to avoid spinning on p->on_rq
> when it's unnecessary and the other changes appear to be safe.  Even if
> wake_list should be used in some cases for local wakeups, it would make
> sense to put that on top of this patch. Do you want to slap a changelog
> around this and update the comments or do you want me to do it? I should
> have more results in a few hours even if they are limited to one machine
> but ideally Rik would test his workload too.

I've written you a Changelog, but please carry it in your set to
evaluate if it's actually worth it.

---
Subject: sched: Optimize ttwu() spinning on p->on_cpu
From: Peter Zijlstra 
Date: Fri, 15 May 2020 16:24:44 +0200

Both Rik and Mel reported seeing ttwu() spend significant time on:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

Attempt to avoid this by queueing the wakeup on the CPU that owns the
p->on_cpu value. This will then allow the ttwu() to complete without
further waiting.

Since we run schedule() with interrupts disabled, the IPI is
guaranteed to happen after p->on_cpu is cleared, this is what makes it
safe to queue early.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c |   45 -
 1 file changed, 24 insertions(+), 21 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,7 +2312,7 @@ static void wake_csd_func(void *info)
sched_ttwu_pending();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
struct rq *rq = cpu_rq(cpu);
 
@@ -2354,6 +2354,17 @@ bool cpus_share_cache(int this_cpu, int
 {
return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+   __ttwu_queue_remote(p, cpu, wake_flags);
+   return true;
+   }
+
+   return false;
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2362,11 +2373,8 @@ static void ttwu_queue(struct task_struc
struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-   ttwu_queue_remote(p, cpu, wake_flags);
+   if (ttwu_queue_remote(p, cpu, wake_flags))
return;
-   }
 #endif
 
rq_lock(rq, &rf);
@@ -2550,7 +2558,15 @@ try_to_wake_up(struct task_struct *p, un
if (p->on_rq && ttwu_remote(p, wake_flags))
goto unlock;
 
+   if (p->in_iowait) {
+   delayacct_blkio_end(p);
+   atomic_dec(&task_rq(p)->nr_iowait);
+   }
+
 #ifdef CONFIG_SMP
+   p->sched_contributes_to_load = !!task_contributes_to_load(p);
+   p->state = TASK_WAKING;
+
/*
 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 * possible to, falsely, observe p->on_cpu == 0.
@@ -2581,15 +2597,10 @@ try_to_wake_up(struct task_struct *p, un
 * 

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-21 Thread Mel Gorman
On Wed, May 20, 2020 at 03:58:01PM +0200, Jirka Hladky wrote:
> Hi Hillf, Mel and all,
> 
> thanks for the patch! It has produced really GOOD results.
> 
> 1) It has fixed performance problems with 5.7 vanilla kernel for
> single-tenant workload and low system load scenarios, without
> performance degradation for the multi-tenant tasks. It's producing the
> same results as the previous proof-of-concept patch where
> adjust_numa_imbalance function was modified to be a no-op (returning
> the same value of imbalance as it gets on the input).
> 

I worry that what it's doing is sort of reverting the patch, but in a
relatively obscure way. We already know that a side-effect of having a
pair of tasks sharing cache is that the wakeup paths can be more expensive
and the difference in the wakeup path exceeds the advantage of using
local memory. However, I don't think the right approach long term is to
keep tasks artificially apart so that a different wakeup path is used as
a side-effect.

The patch also needs a changelog and better comments to follow exactly
what the rationale is. Take this part


+   //No imbalance computed without spare capacity
+   if (env->dst_stats.node_type != env->src_stats.node_type)
+   goto check_imb;

I'm ignoring the coding style of c++ comments but minimally that should
be fixed. More importantly, node_type can be one of node_overloaded,
node_has_spare or node_fully_busy and this is checking if there is a
mismatch. However, it's not taking into account whether the dst_node
is more or less loaded than the src and does not appear to be actually
checking spare capacity like the comment suggests.

Then there is this part

+   imbalance = adjust_numa_imbalance(imbalance,
+   env->src_stats.nr_running);
+
+   //Do nothing without imbalance
+   if (!imbalance) {
+   imbalance = 2;
+   goto check_imb;
+   }

So... if there is no imbalance, assume there is an imbalance of 2, jump to
a branch that will always be false and fall through to code that ignores
the value of imbalance .. it's hard to see exactly why that code flow
is ideal.

> 2) We have also added Mel's netperf-cstate-small-cross-socket test to
> our test battery:
> https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket
> 
> Mel told me that he had seen significant performance improvements with
> 5.7 over 5.6 for the netperf-cstate-small-cross-socket scenario.
> 
> Out of 6 different patches we have tested, your patch has performed
> the best for this scenario. Compared to vanilla, we see minimal
> performance degradation of 2.5% for the udp stream throughput and 0.6%
> for the tcp throughput.

Which implies that we are once again using remote memory for netperf and
possibly getting some interference from NUMA balancing hinting faults.

> The testing was done on a dual-socket system
> with Gold 6132 CPU.
> 
> @Mel - could you please test Hillf's patch with your full testing
> suite? So far, it looks very promising, but I would like to check the
> patch thoroughly to make sure it does not hurt performance in other
> areas.
> 

I can't at the moment due to a backlog on my test grid, which isn't helped
by the fact that I lost two days of development time due to a thrashed work
machine. That aside, I'm finding it hard to see exactly why the patch
is suitable. What I've seen so far is that there are two outstanding
problems after the rewritten load balancer and the reconcilation between
load balancer and NUMA balancer.

The first is that 57abff067a08 ("sched/fair: Rework find_idlest_group()")
has shown up some problems with LKP reporting it here
https://lore.kernel.org/lkml/20200514141526.GA30976@xsang-OptiPlex-9020/
but I've also seen numerous workloads internally bisect to the same
commit. This patch is meant to ensure that finding the busiest group uses
similar logic to finding the idlest group but something is hiding in there.

The second is that waking two tasks sharing a cache can be more expensive
than waking two remote tasks but it's preferable to fix the wakeup logic
than keep related tasks on separate caches just because it happens to
generate good numbers in some cases.

I feel that it makes sense to try and get both of those issues resolved
before making adjust_numa_imbalance a tunable or reverting it but I
really think it makes sense for communicating tasks to be sharing cache
if possible unless a node is overloaded or limited by memory bandwidth.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-21 Thread Mel Gorman
On Fri, May 15, 2020 at 04:24:44PM +0200, Peter Zijlstra wrote:
> On Fri, May 15, 2020 at 01:17:32PM +0200, Peter Zijlstra wrote:
> > On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> > 
> > > However, the wakeups are so rapid that the wakeup
> > > happens while the server is descheduling. That forces the waker to spin
> > > on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> > > add the task to the rq->wake_list even if that potentially requires an 
> > > IPI.
> > 
> > Right, I think Rik ran into that as well at some point. He wanted to
> > make ->on_cpu do a hand-off, but simply queueing the wakeup on the prev
> > cpu (which is currently in the middle of schedule()) should be an easier
> > proposition.
> > 
> > Maybe something like this untested thing... could explode most mighty,
> > didn't think too hard.
> > 
> 
> Mel pointed out that that patch got mutilated somewhere (my own .Sent
> copy was fine), let me try again.
> 

Sorry for the slow response. My primary work machine suffered a
catastrophic failure on Sunday night which is a fantastic way to start
a Monday morning so I'm playing catchup.

IIUC, this patch front-loads as much work as possible before checking if
the task is on_rq; if the waker/wakee do not share a cache, the task is
queued on the wake_list and otherwise a direct wakeup is done.

The advantage is that spinning is avoided on p->on_rq when p does not
share a cache. The disadvantage is that it may result in tasks being
stacked but this should only happen when the domain is overloaded and
select_task_rq() is unlikely to find an idle CPU. The load balancer would
soon correct the situation anyway.

In terms of netperf for my testing, the benefit is marginal because the
wakeups are primarily between tasks that share cache. It does trigger as
perf indicates that some time is spent in ttwu_queue_remote with this
patch, it's just that the overall time spent spinning on p->on_rq is
very similar. I'm still waiting on other workloads to complete to see
what the impact is.

However, intuitively at least, it makes sense to avoid spinning on p->on_rq
when it's unnecessary and the other changes appear to be safe.  Even if
wake_list should be used in some cases for local wakeups, it would make
sense to put that on top of this patch. Do you want to slap a changelog
around this and update the comments or do you want me to do it? I should
have more results in a few hours even if they are limited to one machine
but ideally Rik would test his workload too.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-20 Thread Jirka Hladky
I have an update on netperf-cstate-small-cross-socket results.

The reported performance degradation of 2.5% for UDP stream throughput
and 0.6% for TCP throughput is for a message size of 16kB. For
smaller message sizes, the performance drop is higher - up to 5% for
UDP throughput at a message size of 64B. See the numbers below [1].

We still think that it's acceptable given the gains in other
situations (this is again compared to 5.7 vanilla) :

* solved the performance drop of up to 20% with single instance
SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
Rome systems) => this performance drop was INCREASING with higher
thread counts (10% for 16 threads and 20% for 32 threads)
* solved the performance drop of up to 50% for low load scenarios
(SPECjvm2008 and NAS)

[1]
Hillf's patch compared to 5.7 (rc4) vanilla:

TCP throughput
Message size (B)
64      -2.6%
128     -2.3%
256     -2.6%
1024    -2.7%
2048    -2.2%
3312    -2.4%
4096    -1.1%
8192    -0.4%
16384   -0.6%

UDP throughput
Message size (B)
64      -5.0%
128     -3.0%
256     -3.0%
1024    -3.1%
2048    -3.3%
3312    -3.5%
4096    -4.0%
8192    -3.3%
16384   -2.6%


On Wed, May 20, 2020 at 3:58 PM Jirka Hladky  wrote:
>
> Hi Hillf, Mel and all,
>
> thanks for the patch! It has produced really GOOD results.
>
> 1) It has fixed performance problems with 5.7 vanilla kernel for
> single-tenant workload and low system load scenarios, without
> performance degradation for the multi-tenant tasks. It's producing the
> same results as the previous proof-of-concept patch where
> adjust_numa_imbalance function was modified to be a no-op (returning
> the same value of imbalance as it gets on the input).
>
> 2) We have also added Mel's netperf-cstate-small-cross-socket test to
> our test battery:
> https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket
>
> Mel told me that he had seen significant performance improvements with
> 5.7 over 5.6 for the netperf-cstate-small-cross-socket scenario.
>
> Out of 6 different patches we have tested, your patch has performed
> the best for this scenario. Compared to vanilla, we see minimal
> performance degradation of 2.5% for the udp stream throughput and 0.6%
> for the tcp throughput. The testing was done on a dual-socket system
> with Gold 6132 CPU.
>
> @Mel - could you please test Hillf's patch with your full testing
> suite? So far, it looks very promising, but I would like to check the
> patch thoroughly to make sure it does not hurt performance in other
> areas.
>
> Thanks a lot!
> Jirka
>
>
>
>
>
>
>
>
>
>
>
>
> On Tue, May 19, 2020 at 6:32 AM Hillf Danton  wrote:
> >
> >
> > Hi Jirka
> >
> > On Mon, 18 May 2020 16:52:52 +0200 Jirka Hladky wrote:
> > >
> > > We have compared it against kernel with adjust_numa_imbalance disabled
> > > [1], and both kernels perform at the same level for the single-tenant
> > > jobs, but the proposed patch is bad for the multitenancy mode. The
> > > kernel with adjust_numa_imbalance disabled is a clear winner here.
> >
> > Double thanks to you for the tests!
> >
> > > We would be very interested in what others think about disabling
> > adjust_numa_imbalance function. The patch is below. It would be great
> >
> > A minute...
> >
> > > to collect performance results for different scenarios to make sure
> > > the results are objective.
> >
> > I don't have another test case but a diff trying to confine the tool
> > in question back to the hard-coded 2's field.
> >
> > It's used in the first hunk below to detect imbalance before migrating
> > a task, and a small churn of code is added at another call site when
> > balancing idle CPUs.
> >
> > Thanks
> > Hillf
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1916,20 +1916,26 @@ static void task_numa_find_cpu(struct ta
> >  * imbalance that would be overruled by the load balancer.
> >  */
> > if (env->dst_stats.node_type == node_has_spare) {
> > -   unsigned int imbalance;
> > -   int src_running, dst_running;
> > +   unsigned int imbalance = 2;
> >
> > -   /*
> > -* Would movement cause an imbalance? Note that if src has
> > -* more running tasks that the imbalance is ignored as the
> > -* move improves the imbalance from the perspective of the
> > -* CPU load balancer.
> > -* */
> > -   src_running = env->src_stats.nr_running - 1;
> > -   dst_running = env->dst_stats.nr_running + 1;
> > -   imbalance = max(0, dst_running - src_running);
> > -   imbalance = adjust_numa_imbalance(imbalance, src_running);
> > +   //No imbalance computed without spare capacity
> > +   if (env->dst_stats.node_type != env->src_stats.node_type)
> > +   goto check_imb;
> > +
> > +   

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-20 Thread Jirka Hladky
Hi Hillf, Mel and all,

thanks for the patch! It has produced really GOOD results.

1) It has fixed performance problems with 5.7 vanilla kernel for
single-tenant workload and low system load scenarios, without
performance degradation for the multi-tenant tasks. It's producing the
same results as the previous proof-of-concept patch where
adjust_numa_imbalance function was modified to be a no-op (returning
the same value of imbalance as it gets on the input).

2) We have also added Mel's netperf-cstate-small-cross-socket test to
our test battery:
https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket

Mel told me that he had seen significant performance improvements with
5.7 over 5.6 for the netperf-cstate-small-cross-socket scenario.

Out of 6 different patches we have tested, your patch has performed
the best for this scenario. Compared to vanilla, we see minimal
performance degradation of 2.5% for the udp stream throughput and 0.6%
for the tcp throughput. The testing was done on a dual-socket system
with Gold 6132 CPU.

@Mel - could you please test Hillf's patch with your full testing
suite? So far, it looks very promising, but I would like to check the
patch thoroughly to make sure it does not hurt performance in other
areas.

Thanks a lot!
Jirka












On Tue, May 19, 2020 at 6:32 AM Hillf Danton  wrote:
>
>
> Hi Jirka
>
> On Mon, 18 May 2020 16:52:52 +0200 Jirka Hladky wrote:
> >
> > We have compared it against kernel with adjust_numa_imbalance disabled
> > [1], and both kernels perform at the same level for the single-tenant
> > jobs, but the proposed patch is bad for the multitenancy mode. The
> > kernel with adjust_numa_imbalance disabled is a clear winner here.
>
> Double thanks to you for the tests!
>
> > We would be very interested in what others think about disabling
> > adjust_numa_imbalance function. The patch is below. It would be great
>
> A minute...
>
> > to collect performance results for different scenarios to make sure
> > the results are objective.
>
> I don't have another test case but a diff trying to confine the tool
> in question back to the hard-coded 2's field.
>
> It's used in the first hunk below to detect imbalance before migrating
> a task, and a small churn of code is added at another call site when
> balancing idle CPUs.
>
> Thanks
> Hillf
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1916,20 +1916,26 @@ static void task_numa_find_cpu(struct ta
>  * imbalance that would be overruled by the load balancer.
>  */
> if (env->dst_stats.node_type == node_has_spare) {
> -   unsigned int imbalance;
> -   int src_running, dst_running;
> +   unsigned int imbalance = 2;
>
> -   /*
> -* Would movement cause an imbalance? Note that if src has
> -* more running tasks that the imbalance is ignored as the
> -* move improves the imbalance from the perspective of the
> -* CPU load balancer.
> -* */
> -   src_running = env->src_stats.nr_running - 1;
> -   dst_running = env->dst_stats.nr_running + 1;
> -   imbalance = max(0, dst_running - src_running);
> -   imbalance = adjust_numa_imbalance(imbalance, src_running);
> +   //No imbalance computed without spare capacity
> +   if (env->dst_stats.node_type != env->src_stats.node_type)
> +   goto check_imb;
> +
> +   imbalance = adjust_numa_imbalance(imbalance,
> +   env->src_stats.nr_running);
> +
> +   //Do nothing without imbalance
> +   if (!imbalance) {
> +   imbalance = 2;
> +   goto check_imb;
> +   }
> +
> +   //Migrate task if it's likely to grow balance
> +   if (env->dst_stats.nr_running + 1 < env->src_stats.nr_running)
> +   imbalance = 0;
>
> +check_imb:
> /* Use idle CPU if there is no imbalance */
> if (!imbalance) {
> maymove = true;
> @@ -9011,12 +9017,13 @@ static inline void calculate_imbalance(s
> env->migration_type = migrate_task;
> env->imbalance = max_t(long, 0, (local->idle_cpus -
>  busiest->idle_cpus) >> 1);
> -   }
>
> -   /* Consider allowing a small imbalance between NUMA groups */
> -   if (env->sd->flags & SD_NUMA)
> -   env->imbalance = adjust_numa_imbalance(env->imbalance,
> -   busiest->sum_nr_running);
> +   /* Consider allowing a small imbalance between NUMA groups */
> +   if (env->sd->flags & SD_NUMA &&
> +   

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-18 Thread Jirka Hladky
Hi Hillf,

thanks a lot for your patch!

Compared to 5.7 rc4 vanilla, we observe the following:
  * Single-tenant jobs show improvement up to 15% for SPECjbb2005 and
up to 100% for NAS in low thread mode. In other words, it fixes all
the problems we have reported in this thread.
  * Multitenancy jobs show performance degradation up to 30% for SPECjbb2005

While it fixes problems with single-tenant jobs and with a performance
at low system load, it breaks multi-tenant tasks.

We have compared it against kernel with adjust_numa_imbalance disabled
[1], and both kernels perform at the same level for the single-tenant
jobs, but the proposed patch is bad for the multitenancy mode. The
kernel with adjust_numa_imbalance disabled is a clear winner here.

We would be very interested in what others think about disabling
adjust_numa_imbalance function. The patch is below. It would be great
to collect performance results for different scenarios to make sure
the results are objective.

Thanks a lot!
Jirka

[1] Patch to test kernel with adjust_numa_imbalance disabled:
===
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b..8c43d29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8907,14 +8907,6 @@ static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
   unsigned int imbalance_min;

-   /*
-* Allow a small imbalance based on a simple pair of communicating
-* tasks that remain local when the source domain is almost idle.
-*/
-   imbalance_min = 2;
-   if (src_nr_running <= imbalance_min)
-   return 0;
-
   return imbalance;
}
===





On Fri, May 8, 2020 at 5:47 AM Hillf Danton  wrote:
>
>
> On Thu, 7 May 2020 13:49:34 Phil Auld wrote:
> >
> > On Thu, May 07, 2020 at 06:29:44PM +0200 Jirka Hladky wrote:
> > > Hi Mel,
> > >
> > > we are not targeting just OMP applications. We see the performance
> > > degradation also for other workloads, like SPECjbb2005 and
> > > SPECjvm2008. Even worse, it also affects a higher number of threads.
> > > For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> > > server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> > > threads (the system has 64 CPUs in total). We observe this degradation
> > > only when we run a single SPECjbb binary. When running 4 SPECjbb
> > > binaries in parallel, there is no change in performance between 5.6
> > > and 5.7.
> > >
> > > That's why we are asking for the kernel tunable, which we would add to
> > > the tuned profile. We don't expect users to change this frequently but
> > > rather to set the performance profile once based on the purpose of the
> > > server.
> > >
> > > If you could prepare a patch for us, we would be more than happy to
> > > test it extensively. Based on the results, we can then evaluate if
> > > it's the way to go. Thoughts?
> > >
> >
> > I'm happy to spin up a patch once I'm sure what exactly the tuning would
> > effect. At an initial glance I'm thinking it would be the imbalance_min
> > which is currently hardcoded to 2. But there may be something else...
>
> hrm... try to restore the old behavior by skipping task migrate in favor
> of the local node if there is no imbalance.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1928,18 +1928,16 @@ static void task_numa_find_cpu(struct ta
> src_running = env->src_stats.nr_running - 1;
> dst_running = env->dst_stats.nr_running + 1;
> imbalance = max(0, dst_running - src_running);
> -   imbalance = adjust_numa_imbalance(imbalance, src_running);
> +   imbalance = adjust_numa_imbalance(imbalance, src_running +1);
>
> -   /* Use idle CPU if there is no imbalance */
> +   /* No task migrate without imbalance */
> if (!imbalance) {
> -   maymove = true;
> -   if (env->dst_stats.idle_cpu >= 0) {
> -   env->dst_cpu = env->dst_stats.idle_cpu;
> -   task_numa_assign(env, NULL, 0);
> -   return;
> -   }
> +   env->best_cpu = -1;
> +   return;
> }
> -   } else {
> +   }
> +
> +   do {
> long src_load, dst_load, load;
> /*
>  * If the improvement from just moving env->p direction is better
> @@ -1949,7 +1947,7 @@ static void task_numa_find_cpu(struct ta
> dst_load = env->dst_stats.load + load;
> src_load = env->src_stats.load - load;
> maymove = !load_too_imbalanced(src_load, dst_load, env);
> -   }
> +   } while (0);
>
> for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
> /* Skip this CPU if the source task 

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Jirka Hladky
> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> struct rq_flags rf;
>  #if defined(CONFIG_SMP)
> -   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> +   if (sched_feat(TTWU_QUEUE)) {
> sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> ttwu_queue_remote(p, cpu, wake_flags);
> return;

Hi Mel,

we have performance results for the proposed patch above ^^^.
Unfortunately, it hasn't helped the performance.

Jirka


On Wed, May 13, 2020 at 5:30 PM Mel Gorman  wrote:
>
> On Wed, May 13, 2020 at 04:57:15PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > we have tried the kernel with adjust_numa_imbalance() crippled to just
> > return the imbalance it's given.
> >
> > It has solved all the performance problems I have reported.
> > Performance is the same as with 5.6 kernel (before the patch was
> > applied).
> >
> > * solved the performance drop of up to 20% with single instance
> > SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
> > Rome systems) => this performance drop was INCREASING with higher
> > thread counts (10% for 16 threads and 20% for 32 threads)
> > * solved the performance drop for low load scenarios (SPECjvm2008 and NAS)
> >
> > Any suggestions on how to proceed? One approach is to turn
> > "imbalance_min" into the kernel tunable. Any other ideas?
> >
> > https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914
> >
>
> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> struct rq_flags rf;
>
>  #if defined(CONFIG_SMP)
> -   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> +   if (sched_feat(TTWU_QUEUE)) {
> sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> ttwu_queue_remote(p, cpu, wake_flags);
> return;
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Peter Zijlstra
On Fri, May 15, 2020 at 01:17:32PM +0200, Peter Zijlstra wrote:
> On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> 
> > However, the wakeups are so rapid that the wakeup
> > happens while the server is descheduling. That forces the waker to spin
> > on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> > add the task to the rq->wake_list even if that potentially requires an IPI.
> 
> Right, I think Rik ran into that as well at some point. He wanted to
> make ->on_cpu do a hand-off, but simply queueing the wakeup on the prev
> cpu (which is currently in the middle of schedule()) should be an easier
> proposition.
> 
> Maybe something like this untested thing... could explode most mighty,
> didn't think too hard.
> 

Mel pointed out that that patch got mutilated somewhere (my own .Sent
copy was fine), let me try again.


---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b64ffd6c728..df588ac75bf0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2330,7 +2330,7 @@ void scheduler_ipi(void)
irq_exit();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
struct rq *rq = cpu_rq(cpu);
 
@@ -2372,6 +2372,17 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 {
return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+   __ttwu_queue_remote(p, cpu, wake_flags);
+   return true;
+   }
+
+   return false;
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2380,11 +2391,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-   ttwu_queue_remote(p, cpu, wake_flags);
+   if (ttwu_queue_remote(p, cpu, wake_flags))
return;
-   }
 #endif
 
rq_lock(rq, &rf);
@@ -2568,7 +2576,15 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
if (p->on_rq && ttwu_remote(p, wake_flags))
goto unlock;
 
+   if (p->in_iowait) {
+   delayacct_blkio_end(p);
+   atomic_dec(&task_rq(p)->nr_iowait);
+   }
+
 #ifdef CONFIG_SMP
+   p->sched_contributes_to_load = !!task_contributes_to_load(p);
+   p->state = TASK_WAKING;
+
/*
 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 * possible to, falsely, observe p->on_cpu == 0.
@@ -2599,15 +2615,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 * This ensures that tasks getting woken will be fully ordered against
 * their previous state and preserve Program Order.
 */
-   smp_cond_load_acquire(&p->on_cpu, !VAL);
-
-   p->sched_contributes_to_load = !!task_contributes_to_load(p);
-   p->state = TASK_WAKING;
+   if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+   goto unlock;
 
-   if (p->in_iowait) {
-   delayacct_blkio_end(p);
-   atomic_dec(&task_rq(p)->nr_iowait);
-   }
+   smp_cond_load_acquire(&p->on_cpu, !VAL);
 
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
@@ -2615,14 +2626,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
-
-#else /* CONFIG_SMP */
-
-   if (p->in_iowait) {
-   delayacct_blkio_end(p);
-   atomic_dec(&task_rq(p)->nr_iowait);
-   }
-
 #endif /* CONFIG_SMP */
 
ttwu_queue(p, cpu, wake_flags);


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Peter Zijlstra


> > In that case, we potentially avoid spinning on on_rq for wakeups between
> > tasks that do not share CPU but it's not clear why it would be specific to
> > remote tasks.
> 
> The thinking was that we can avoid spinning on ->on_cpu, and let the CPU
> get on with things. Rik had a workload where that spinning was
> significant, and I thought to have understood you saw the same.
> 
> By sticking the task on the wake_list of the CPU that's in charge of
> clearing ->on_cpu we ensure ->on_cpu is 0 by the time we get to doing
> the actual enqueue.
> 

Of course, aside from sending an obviously broken patch, I also forgot
to Cc Rik.

This one compiled, booted and built a kernel, so it must be perfect :-)

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b64ffd6c728..df588ac75bf0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2330,7 +2330,7 @@ void scheduler_ipi(void)
irq_exit();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
struct rq *rq = cpu_rq(cpu);
 
@@ -2372,6 +2372,17 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 {
return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+   __ttwu_queue_remote(p, cpu, wake_flags);
+   return true;
+   }
+
+   return false;
+}
#endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2380,11 +2391,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-   ttwu_queue_remote(p, cpu, wake_flags);
+   if (ttwu_queue_remote(p, cpu, wake_flags))
return;
-   }
 #endif
 
rq_lock(rq, &rf);
@@ -2568,7 +2576,15 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
if (p->on_rq && ttwu_remote(p, wake_flags))
goto unlock;
 
+   if (p->in_iowait) {
+   delayacct_blkio_end(p);
+   atomic_dec(&task_rq(p)->nr_iowait);
+   }
+
 #ifdef CONFIG_SMP
+   p->sched_contributes_to_load = !!task_contributes_to_load(p);
+   p->state = TASK_WAKING;
+
/*
 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 * possible to, falsely, observe p->on_cpu == 0.
@@ -2599,15 +2615,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 * This ensures that tasks getting woken will be fully ordered against
 * their previous state and preserve Program Order.
 */
-   smp_cond_load_acquire(&p->on_cpu, !VAL);
-
-   p->sched_contributes_to_load = !!task_contributes_to_load(p);
-   p->state = TASK_WAKING;
+   if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+   goto unlock;
 
-   if (p->in_iowait) {
-   delayacct_blkio_end(p);
-   atomic_dec(&task_rq(p)->nr_iowait);
-   }
+   smp_cond_load_acquire(&p->on_cpu, !VAL);
 
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
@@ -2615,14 +2626,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
-
-#else /* CONFIG_SMP */
-
-   if (p->in_iowait) {
-   delayacct_blkio_end(p);
-   atomic_dec(_rq(p)->nr_iowait);
-   }
-
 #endif /* CONFIG_SMP */
 
ttwu_queue(p, cpu, wake_flags);


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Peter Zijlstra
On Fri, May 15, 2020 at 02:03:46PM +0100, Mel Gorman wrote:
> On Fri, May 15, 2020 at 01:17:32PM +0200, Peter Zijlstra wrote:
> > On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> > +static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> > +{
> > +   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> > +   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> > +   __ttwu_queue_remote(p, cpu, wake_flags);
> > +   return true;
> > +   }
> > +
> > +   return false;
> > +}

> > +   if (READ_ONCE(p->on_cpu) && __ttwu_queue_remote(p, cpu, wake_flags))
> > +   goto unlock;

> I don't see a problem with moving the updating of p->state to the other
> side of the barrier but I'm relying on the comment that the barrier is
> only related to on_rq and on_cpu.

Yeah, I went with that too, like I said, didn't think too hard.

> However, I'm less sure about what exactly you intended to do.
> __ttwu_queue_remote is void so maybe you meant to use ttwu_queue_remote.

That!

> In that case, we potentially avoid spinning on on_rq for wakeups between
> tasks that do not share CPU but it's not clear why it would be specific to
> remote tasks.

The thinking was that we can avoid spinning on ->on_cpu, and let the CPU
get on with things. Rik had a workload where that spinning was
significant, and I thought to have understood you saw the same.

By sticking the task on the wake_list of the CPU that's in charge of
clearing ->on_cpu we ensure ->on_cpu is 0 by the time we get to doing
the actual enqueue.



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Mel Gorman
On Fri, May 15, 2020 at 01:17:32PM +0200, Peter Zijlstra wrote:
> On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> 
> > However, the wakeups are so rapid that the wakeup
> > happens while the server is descheduling. That forces the waker to spin
> > on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> > add the task to the rq->wake_list even if that potentially requires an IPI.
> 
> Right, I think Rik ran into that as well at some point. He wanted to
> make ->on_cpu do a hand-off, but simply queueing the wakeup on the prev
> cpu (which is currently in the middle of schedule()) should be an easier
> proposition.
> 
> Maybe something like this untested thing... could explode most mighty,
> didn't think too hard.
> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fa6c19d38e82..c07b92a0ee5d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2312,7 +2312,7 @@ static void wake_csd_func(void *info)
>   sched_ttwu_pending();
>  }
>  
> -static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> +static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>  {
>   struct rq *rq = cpu_rq(cpu);
>  
> @@ -2354,6 +2354,17 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
>  {
>   return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
>  }
> +
> +static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> +{
> + if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> + sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> + __ttwu_queue_remote(p, cpu, wake_flags);
> + return true;
> + }
> +
> + return false;
> +}
>  #endif /* CONFIG_SMP */
>  
>  static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> @@ -2362,11 +2373,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   struct rq_flags rf;
>  
>  #if defined(CONFIG_SMP)
> - if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> - sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> - ttwu_queue_remote(p, cpu, wake_flags);
> + if (ttwu_queue_remote(p, cpu, wake_flags))
>   return;
> - }
>  #endif
>  
>   rq_lock(rq, &rf);
> @@ -2550,7 +2558,15 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   if (p->on_rq && ttwu_remote(p, wake_flags))
>   goto unlock;
>  
> + if (p->in_iowait) {
> + delayacct_blkio_end(p);
> + atomic_dec(&task_rq(p)->nr_iowait);
> + }
> +
>  #ifdef CONFIG_SMP
> + p->sched_contributes_to_load = !!task_contributes_to_load(p);
> + p->state = TASK_WAKING;
> +
>   /*
>* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
>* possible to, falsely, observe p->on_cpu == 0.
> @@ -2581,15 +2597,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>* This ensures that tasks getting woken will be fully ordered against
>* their previous state and preserve Program Order.
>*/
> - smp_cond_load_acquire(&p->on_cpu, !VAL);
> -
> - p->sched_contributes_to_load = !!task_contributes_to_load(p);
> - p->state = TASK_WAKING;
> + if (READ_ONCE(p->on_cpu) && __ttwu_queue_remote(p, cpu, wake_flags))
> + goto unlock;
>  
> - if (p->in_iowait) {
> - delayacct_blkio_end(p);
> - atomic_dec(&task_rq(p)->nr_iowait);
> - }
> + smp_cond_load_acquire(&p->on_cpu, !VAL);
>  
>   cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>   if (task_cpu(p) != cpu) {

I don't see a problem with moving the updating of p->state to the other
side of the barrier but I'm relying on the comment that the barrier is
only related to on_rq and on_cpu.

However, I'm less sure about what exactly you intended to do.
__ttwu_queue_remote is void so maybe you meant to use ttwu_queue_remote.
In that case, we potentially avoid spinning on on_cpu for wakeups between
tasks that do not share a cache but it's not clear why it would be specific to
remote tasks. If you meant to call __ttwu_queue_remote unconditionally,
it's not clear why that's now safe when smp_cond_load_acquire()
cared about on_cpu being 0 before queueing a task for wakeup or directly
waking it up.

Also because __ttwu_queue_remote() now happens before select_task_rq(), is
there not a risk that in some cases we end up stacking tasks unnecessarily?

> @@ -2597,14 +2608,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   psi_ttwu_dequeue(p);
>   set_task_cpu(p, cpu);
>   }
> -
> -#else /* CONFIG_SMP */
> -
> - if (p->in_iowait) {
> - delayacct_blkio_end(p);
> - atomic_dec(&task_rq(p)->nr_iowait);
> - }
> -
>  #endif /* CONFIG_SMP */
>  
>   ttwu_queue(p, cpu, wake_flags);

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Peter Zijlstra
On Fri, May 15, 2020 at 01:22:31PM +0100, Mel Gorman wrote:
> On Fri, May 15, 2020 at 01:28:15PM +0200, Peter Zijlstra wrote:

> > > + if (val & _TIF_POLLING_NRFLAG)
> > > + goto activate;
> > 
> > I'm completely confused... the result here is that if you're polling you
> > do _NOT_ queue on the wake_list, but instead immediately enqueue.
> > 
> > (which kinda makes sense, since if the remote CPU is idle, it doesn't
> > have these lines in its cache anyway)
> > 
> 
> Crap, I rushed this and severely confused myself about what is going

Hehe, and here I thought I was confused :-)

> on. It is definitely the case that flipping this check does not give
> any benefit. The patch shows a benefit but I'm failing to understand
> exactly why. How I ended up here was perf indicating a lot of time spent
> on smp_cond_load_acquire() which made me look closely at ttwu_remote()
> and looking at function graphs to compare the different types of wakeups
> and their timings.

So the reason we did this remote wakeup thing in the first place was
that Oracle was having very heavy rq->lock cache-line contention. By
farming off the enqueue to the CPU that was going to run the task
anyway, the rq->lock (and the other runqueue structure lines) could stay
in the CPU that was using them (hard). Less cacheline ping-pong, more
win.

The observation here is that if a CPU is idle, its rq will not be
contended.
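
To make the contrast concrete, here is a paraphrased sketch of the waker side
of the queued path (not a verbatim copy of ttwu_queue_remote(); names follow
the ~5.7-rc tree this thread is against):

static void queued_wakeup_sketch(struct task_struct *p, int cpu, int wake_flags)
{
	struct rq *rq = cpu_rq(cpu);

	/* Lockless push onto the target CPU's wake_list; no rq->lock taken here. */
	if (llist_add(&p->wake_entry, &rq->wake_list)) {
		/*
		 * Only the first entry needs to notify the target.  If the target
		 * advertises TIF_POLLING_NRFLAG, setting NEED_RESCHED is enough and
		 * the IPI is skipped; either way the actual enqueue happens on the
		 * target CPU via wake_csd_func() -> sched_ttwu_pending(), so the
		 * rq cachelines stay on the CPU that uses them.
		 */
		if (!set_nr_if_polling(rq->idle))
			smp_call_function_single_async(cpu, &rq->wake_csd);
	}
}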


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Mel Gorman
On Fri, May 15, 2020 at 01:28:15PM +0200, Peter Zijlstra wrote:
> On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> > sched: Wake cache-local tasks via wake_list if wakee CPU is polling
> > 
> > There are two separate methods for waking a task from idle, one for
> > tasks running on CPUs that share a cache and one for when the caches are
> > separate. The methods can loosely be called local and remote even though
> > this is not directly related to NUMA and instead is due to the expected
> > cost of accessing data that is cache-hot on another CPU. The "local" costs
> > are expected to be relatively cheap as they share the LLC in comparison to
> > a remote IPI that is potentially required when using the "remote" wakeup.
> > The problem is that the local wakeup is not always cheaper and it appears
> > to have degraded even further around the 4.19 mark.
> > 
> > There appears to be a couple of reasons why it can be slower.
> > 
> > The first is specific to pairs of tasks where one or both rapidly enters
> > idle. For example, on netperf UDP_STREAM, the client sends a bunch of
> > packets, wakes the server, the server processes some packets and goes
> > back to sleep. There is a relationship between the tasks but it's not
> > strictly synchronous. The timing is different if the client/server are on
> > separate NUMA nodes and netserver is more likely to enter idle (measured
> > as server entering idle 10% more times when tasks are local vs remote
> > but machine-specific). However, the wakeups are so rapid that the wakeup
> > happens while the server is descheduling. That forces the waker to spin
> > on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> > add the task to the rq->wake_list even if that potentially requires an IPI.
> > 
> > The second is that the local wakeup path is simply not always
> > that fast. Using ftrace, the cost of the locks, update_rq_clock and
> > ttwu_do_activate was measured as roughly 4.5 microseconds. While it's
> > a single instance, the cost of the "remote" wakeup for try_to_wake_up
> > was roughly 2.5 microseconds versus 6.2 microseconds for the "fast" local
> > wakeup. When there are tens of thousands of wakeups per second, these costs
> > accumulate and manifest as a throughput regression in netperf UDP_STREAM.
> > 
> > The essential difference in costs comes down to whether the CPU is fully
> > idle, a task is descheduling or polling in poll_idle(). This patch
> > special cases ttwu_queue() to use the "remote" method if the CPU's
> > task is polling as it's generally cheaper to use the wake_list in that
> > case than the local method because an IPI should not be required. As it is
> > race-prone, a reschedule IPI may still be sent but in that case the local
> > wakeup would also have to send a reschedule IPI so it should be harmless.
> 
> We don't in fact send a wakeup IPI when polling.

I know when it's polling that an IPI can be skipped as set_nr_if_polling
will succeed.
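
For reference, this is roughly what that helper does; a paraphrased sketch
rather than a verbatim copy of set_nr_if_polling():

static bool set_nr_if_polling_sketch(struct task_struct *p)
{
	struct thread_info *ti = task_thread_info(p);
	typeof(ti->flags) old, val = READ_ONCE(ti->flags);

	for (;;) {
		/* Not polling: the caller has to send a real IPI. */
		if (!(val & _TIF_POLLING_NRFLAG))
			return false;
		/* Already marked for resched: nothing left to do. */
		if (val & _TIF_NEED_RESCHED)
			return true;
		old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
		if (old == val)
			break;
		val = old;
	}
	return true;	/* flag set; the polling CPU notices it on its own */
}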

> So this might end up
> with an extra IPI.
> 

Yes, it might. Between the check for polling and set_nr_and_not_polling,
polling may have stopped and the CPU entered a c-state, in which case an
IPI is sent; but then either path is likely to result in some sort
of IPI.

> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9a2fbf98fd6f..59077c7c6660 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2380,13 +2380,32 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> > struct rq_flags rf;
> >  
> >  #if defined(CONFIG_SMP)
> > -   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> > +   if (sched_feat(TTWU_QUEUE)) {
> > +   /*
> > +* A remote wakeup is often expensive as can require
> > +* an IPI and the wakeup path is slow. However, in
> > +* the specific case where the target CPU is idle
> > +* and polling, the CPU is active and rapidly checking
> > +* if a reschedule is needed.
> 
> Not strictly true; MWAIT can be very deep idle, it's just that with
> POLLING we indicate we do not have to send an IPI to wake up. Just
> setting the TIF_NEED_RESCHED flag is sufficient to wake up -- the
> monitor part of monitor-wait.
> 

The terminology is unhelpful. In this case, when I said "idle and polling",
I was thinking of the idle task running poll_idle(), where the CPU should
not be in any c-state below C0, but it doesn't really matter given your
own comments.
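
For anyone not familiar with it, the poll_idle() case boils down to something
like the sketch below -- paraphrased from drivers/cpuidle/poll_state.c, with
the spin bound reduced to an illustrative constant rather than the real
time-based limit:

static int poll_idle_sketch(void)
{
	unsigned int spins = 0;
	const unsigned int spin_limit = 64 * 1024;	/* illustrative bound only */

	/* The CPU stays in C0 with interrupts on and TIF_POLLING_NRFLAG set. */
	local_irq_enable();
	while (!need_resched()) {
		cpu_relax();
		if (++spins > spin_limit)
			break;		/* give up, let cpuidle pick a deeper state */
	}
	return 0;
}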

> > In this case, the idle
> > +* task just needs to be marked for resched and p
> > +* will rapidly be requeued which is less expensive
> > +* than the direct wakeup path.
> > +*/
> > +   if (cpus_share_cache(smp_processor_id(), cpu)) {
> > +   struct thread_info *ti = task_thread_info(p);
> > +   typeof(ti->flags) val = READ_ONCE(ti->flags);

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Peter Zijlstra
On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:
> sched: Wake cache-local tasks via wake_list if wakee CPU is polling
> 
> There are two separate methods for waking a task from idle, one for
> tasks running on CPUs that share a cache and one for when the caches are
> separate. The methods can loosely be called local and remote even though
> this is not directly related to NUMA and instead is due to the expected
> cost of accessing data that is cache-hot on another CPU. The "local" costs
> are expected to be relatively cheap as they share the LLC in comparison to
> a remote IPI that is potentially required when using the "remote" wakeup.
> The problem is that the local wakeup is not always cheaper and it appears
> to have degraded even further around the 4.19 mark.
> 
> There appears to be a couple of reasons why it can be slower.
> 
> The first is specific to pairs of tasks where one or both rapidly enters
> idle. For example, on netperf UDP_STREAM, the client sends a bunch of
> packets, wakes the server, the server processes some packets and goes
> back to sleep. There is a relationship between the tasks but it's not
> strictly synchronous. The timing is different if the client/server are on
> separate NUMA nodes and netserver is more likely to enter idle (measured
> as server entering idle 10% more times when tasks are local vs remote
> but machine-specific). However, the wakeups are so rapid that the wakeup
> happens while the server is descheduling. That forces the waker to spin
> on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> add the task to the rq->wake_list even if that potentially requires an IPI.
> 
> The second is that the local wakeup path is simply not always
> that fast. Using ftrace, the cost of the locks, update_rq_clock and
> ttwu_do_activate was measured as roughly 4.5 microseconds. While it's
> a single instance, the cost of the "remote" wakeup for try_to_wake_up
> was roughly 2.5 microseconds versus 6.2 microseconds for the "fast" local
> wakeup. When there are tens of thousands of wakeups per second, these costs
> accumulate and manifest as a throughput regression in netperf UDP_STREAM.
> 
> The essential difference in costs comes down to whether the CPU is fully
> idle, a task is descheduling or polling in poll_idle(). This patch
> special cases ttwu_queue() to use the "remote" method if the CPU's
> task is polling as it's generally cheaper to use the wake_list in that
> case than the local method because an IPI should not be required. As it is
> race-prone, a reschedule IPI may still be sent but in that case the local
> wakeup would also have to send a reschedule IPI so it should be harmless.

We don't in fact send a wakeup IPI when polling. So this might end up
with an extra IPI.


> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9a2fbf98fd6f..59077c7c6660 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2380,13 +2380,32 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   struct rq_flags rf;
>  
>  #if defined(CONFIG_SMP)
> - if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> + if (sched_feat(TTWU_QUEUE)) {
> + /*
> +  * A remote wakeup is often expensive as can require
> +  * an IPI and the wakeup path is slow. However, in
> +  * the specific case where the target CPU is idle
> +  * and polling, the CPU is active and rapidly checking
> +  * if a reschedule is needed.

Not strictly true; MWAIT can be very deep idle, it's just that with
POLLING we indicate we do not have to send an IPI to wake up. Just
setting the TIF_NEED_RESCHED flag is sufficient to wake up -- the
monitor part of monitor-wait.

> In this case, the idle
> +  * task just needs to be marked for resched and p
> +  * will rapidly be requeued which is less expensive
> +  * than the direct wakeup path.
> +  */
> + if (cpus_share_cache(smp_processor_id(), cpu)) {
> + struct thread_info *ti = task_thread_info(p);
> + typeof(ti->flags) val = READ_ONCE(ti->flags);
> +
> + if (val & _TIF_POLLING_NRFLAG)
> + goto activate;

I'm completely confused... the result here is that if you're polling you
do _NOT_ queue on the wake_list, but instead immediately enqueue.

(which kinda makes sense, since if the remote CPU is idle, it doesn't
have these lines in its cache anyway)

But the subject and comments all seem to suggest the opposite !?

Also, this will fail compilation when !defined(TIF_POLLING_NRFLAG).

> + }
> +
>   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
>   ttwu_queue_remote(p, cpu, wake_flags);
>   return;
>   }
>  #endif
>  
> +activate:

The label wants to go 
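
Presumably the point is that the goto sits inside the CONFIG_SMP block while
the label sits after the #endif, so !SMP builds would warn about an unused
label, and the _TIF_POLLING_NRFLAG test needs the same guard as the flag
itself. A minimal sketch of one way to arrange it, under that reading (a
guess, not a confirmed fix):

#if defined(CONFIG_SMP)
	if (sched_feat(TTWU_QUEUE)) {
#ifdef TIF_POLLING_NRFLAG	/* guard the test for archs without the flag */
		if (cpus_share_cache(smp_processor_id(), cpu) &&
		    (READ_ONCE(task_thread_info(p)->flags) & _TIF_POLLING_NRFLAG))
			goto activate;
#endif
		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
		ttwu_queue_remote(p, cpu, wake_flags);
		return;
	}
#ifdef TIF_POLLING_NRFLAG
activate:			/* same guard as the goto, so it is never unused */
#endif
#endif
	rq_lock(rq, &rf);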

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Peter Zijlstra
On Fri, May 15, 2020 at 09:47:40AM +0100, Mel Gorman wrote:

> However, the wakeups are so rapid that the wakeup
> happens while the server is descheduling. That forces the waker to spin
> on smp_cond_load_acquire for longer. In this case, it can be cheaper to
> add the task to the rq->wake_list even if that potentially requires an IPI.

Right, I think Rik ran into that as well at some point. He wanted to
make ->on_cpu do a hand-off, but simply queueing the wakeup on the prev
cpu (which is currently in the middle of schedule()) should be an easier
proposition.

Maybe something like this untested thing... could explode most mighty,
didn't think too hard.

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa6c19d38e82..c07b92a0ee5d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,7 +2312,7 @@ static void wake_csd_func(void *info)
sched_ttwu_pending();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
struct rq *rq = cpu_rq(cpu);
 
@@ -2354,6 +2354,17 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 {
return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+   __ttwu_queue_remote(p, cpu, wake_flags);
+   return true;
+   }
+
+   return false;
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2362,11 +2373,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-   ttwu_queue_remote(p, cpu, wake_flags);
+   if (ttwu_queue_remote(p, cpu, wake_flags))
return;
-   }
 #endif
 
rq_lock(rq, &rf);
@@ -2550,7 +2558,15 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
if (p->on_rq && ttwu_remote(p, wake_flags))
goto unlock;
 
+   if (p->in_iowait) {
+   delayacct_blkio_end(p);
+   atomic_dec(&task_rq(p)->nr_iowait);
+   }
+
 #ifdef CONFIG_SMP
+   p->sched_contributes_to_load = !!task_contributes_to_load(p);
+   p->state = TASK_WAKING;
+
/*
 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 * possible to, falsely, observe p->on_cpu == 0.
@@ -2581,15 +2597,10 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 * This ensures that tasks getting woken will be fully ordered against
 * their previous state and preserve Program Order.
 */
-   smp_cond_load_acquire(&p->on_cpu, !VAL);
-
-   p->sched_contributes_to_load = !!task_contributes_to_load(p);
-   p->state = TASK_WAKING;
+   if (READ_ONCE(p->on_cpu) && __ttwu_queue_remote(p, cpu, wake_flags))
+   goto unlock;
 
-   if (p->in_iowait) {
-   delayacct_blkio_end(p);
-   atomic_dec(&task_rq(p)->nr_iowait);
-   }
+   smp_cond_load_acquire(&p->on_cpu, !VAL);
 
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
@@ -2597,14 +2608,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
-
-#else /* CONFIG_SMP */
-
-   if (p->in_iowait) {
-   delayacct_blkio_end(p);
-   atomic_dec(&task_rq(p)->nr_iowait);
-   }
-
 #endif /* CONFIG_SMP */
 
ttwu_queue(p, cpu, wake_flags);


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Mel Gorman
On Thu, May 14, 2020 at 05:31:22PM +0200, Peter Zijlstra wrote:
> On Wed, May 13, 2020 at 04:30:23PM +0100, Mel Gorman wrote:
> > Complete shot in the dark but restore adjust_numa_imbalance() and try
> > this
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a9983da4408..0b31f4468d5b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> > struct rq_flags rf;
> >  
> >  #if defined(CONFIG_SMP)
> > -   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> > +   if (sched_feat(TTWU_QUEUE)) {
> 
> just saying that this has the risk of regressing other workloads, see:
> 
>   518cd6234178 ("sched: Only queue remote wakeups when crossing cache 
> boundaries")
> 

Sure, I didn't really think that this was appropriate but it was the best I
had at the time. This can be a lot more targeted but it took me a while
to put together a picture as it was difficult for me to analyse. I was
going to post this as a separate RFC but given this thread, what do you
think of this?

---8<---
sched: Wake cache-local tasks via wake_list if wakee CPU is polling

There are two separate methods for waking a task from idle, one for
tasks running on CPUs that share a cache and one for when the caches are
separate. The methods can loosely be called local and remote even though
this is not directly related to NUMA and instead is due to the expected
cost of accessing data that is cache-hot on another CPU. The "local" costs
are expected to be relatively cheap as they share the LLC in comparison to
a remote IPI that is potentially required when using the "remote" wakeup.
The problem is that the local wakeup is not always cheaper and it appears
to have degraded even further around the 4.19 mark.

There appears to be a couple of reasons why it can be slower.

The first is specific to pairs of tasks where one or both rapidly enters
idle. For example, on netperf UDP_STREAM, the client sends a bunch of
packets, wakes the server, the server processes some packets and goes
back to sleep. There is a relationship between the tasks but it's not
strictly synchronous. The timing is different if the client/server are on
separate NUMA nodes and netserver is more likely to enter idle (measured
as server entering idle 10% more times when tasks are local vs remote
but machine-specific). However, the wakeups are so rapid that the wakeup
happens while the server is descheduling. That forces the waker to spin
on smp_cond_load_acquire for longer. In this case, it can be cheaper to
add the task to the rq->wake_list even if that potentially requires an IPI.

The second is that the local wakeup path is simply not always
that fast. Using ftrace, the cost of the locks, update_rq_clock and
ttwu_do_activate was measured as roughly 4.5 microseconds. While it's
a single instance, the cost of the "remote" wakeup for try_to_wake_up
was roughly 2.5 microseconds versus 6.2 microseconds for the "fast" local
wakeup. When there are tens of thousands of wakeups per second, these costs
accumulate and manifest as a throughput regression in netperf UDP_STREAM.

The essential difference in costs comes down to whether the CPU is fully
idle, a task is descheduling or polling in poll_idle(). This patch
special cases ttwu_queue() to use the "remote" method if the CPU's
task is polling as it's generally cheaper to use the wake_list in that
case than the local method because an IPI should not be required. As it is
race-prone, a reschedule IPI may still be sent but in that case the local
wakeup would also have to send a reschedule IPI so it should be harmless.

The benefit is not universal as it requires the specific pattern of wakeups
that happen when the wakee is likely to be descheduling. netperf UDP_STREAM
is one of the more obvious examples and it is more obvious since the
load balancer was reconciled and the client and server are more likely to
be running on the same NUMA node which is why it was missed for so long.

netperf-udp
                                     5.7.0-rc5              5.7.0-rc5
                                       vanilla          wakelist-v1r1
Hmean     send-64        211.25 (   0.00%)       228.61 *   8.22%*
Hmean     send-128       413.49 (   0.00%)       458.00 *  10.77%*
Hmean     send-256       786.31 (   0.00%)       879.30 *  11.83%*
Hmean     send-1024     3055.75 (   0.00%)      3429.45 *  12.23%*
Hmean     send-2048     6052.79 (   0.00%)      6878.99 *  13.65%*
Hmean     send-3312     9208.09 (   0.00%)     10832.19 *  17.64%*
Hmean     send-4096    11268.45 (   0.00%)     12968.39 *  15.09%*
Hmean     send-8192    17943.12 (   0.00%)     19886.07 *  10.83%*
Hmean     send-16384   29732.94 (   0.00%)     32297.44 *   8.63%*
Hmean     recv-64        211.25 (   0.00%)       228.61 *   8.22%*
Hmean     recv-128       413.49 (   0.00%)       458.00 *  10.77%*
Hmean     recv-256

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-14 Thread Peter Zijlstra
On Wed, May 13, 2020 at 04:30:23PM +0100, Mel Gorman wrote:
> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   struct rq_flags rf;
>  
>  #if defined(CONFIG_SMP)
> - if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> + if (sched_feat(TTWU_QUEUE)) {

just saying that this has the risk of regressing other workloads, see:

  518cd6234178 ("sched: Only queue remote wakeups when crossing cache 
boundaries")

>   sched_clock_cpu(cpu); /* Sync clocks across CPUs */
>   ttwu_queue_remote(p, cpu, wake_flags);
>   return;
> 
> -- 
> Mel Gorman
> SUSE Labs


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-14 Thread Jirka Hladky
THANK YOU!


On Thu, May 14, 2020 at 1:50 PM Mel Gorman  wrote:
>
> On Thu, May 14, 2020 at 12:22:05PM +0200, Jirka Hladky wrote:
> > Thanks!
> >
> > Do you have a link? I cannot find it on github
> > (https://github.com/gormanm/mmtests, searched for
> > config-network-netperf-cstate-small-cross-socket)
> >
>
> https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-14 Thread Mel Gorman
On Thu, May 14, 2020 at 12:22:05PM +0200, Jirka Hladky wrote:
> Thanks!
> 
> Do you have a link? I cannot find it on github
> (https://github.com/gormanm/mmtests, searched for
> config-network-netperf-cstate-small-cross-socket)
> 

https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-14 Thread Jirka Hladky
Thanks!

Do you have a link? I cannot find it on github
(https://github.com/gormanm/mmtests, searched for
config-network-netperf-cstate-small-cross-socket)


On Thu, May 14, 2020 at 12:08 PM Mel Gorman  wrote:
>
> On Thu, May 14, 2020 at 11:58:36AM +0200, Jirka Hladky wrote:
> > Thank you, Mel!
> >
> > We are using netperf as well, but AFAIK it's running on two different
> > hosts. I will check with colleagues, if they can
> > add network-netperf-unbound run on the localhost.
> >
> > Is this the right config?
> > https://github.com/gormanm/mmtests/blob/345f82bee77cbf09ba57f470a1cfc1ae413c97df/bin/generate-generic-configs
> > sed -e 's/NETPERF_BUFFER_SIZES=.*/NETPERF_BUFFER_SIZES=64/'
> > config-network-netperf-unbound > config-network-netperf-unbound-small
> >
>
> That's one I was using at the moment to have a quick test after
> the reconciliation series was completed. It has since changed to
> config-network-netperf-cstate-small-cross-socket to limit cstates, bind
the client and server to two local CPUs and use one buffer size. It
> was necessary to get an ftrace function graph of the wakeup path that
> was readable and not too noisy due to migrations, cpuidle exit costs etc.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-14 Thread Mel Gorman
On Thu, May 14, 2020 at 11:58:36AM +0200, Jirka Hladky wrote:
> Thank you, Mel!
> 
> We are using netperf as well, but AFAIK it's running on two different
> hosts. I will check with colleagues, if they can
> add network-netperf-unbound run on the localhost.
> 
> Is this the right config?
> https://github.com/gormanm/mmtests/blob/345f82bee77cbf09ba57f470a1cfc1ae413c97df/bin/generate-generic-configs
> sed -e 's/NETPERF_BUFFER_SIZES=.*/NETPERF_BUFFER_SIZES=64/'
> config-network-netperf-unbound > config-network-netperf-unbound-small
> 

That's one I was using at the moment to have a quick test after
the reconciliation series was completed. It has since changed to
config-network-netperf-cstate-small-cross-socket to limit cstates, bind
the client and server to two local CPUs and use one buffer size. It
was necessary to get an ftrace function graph of the wakeup path that
was readable and not too noisy due to migrations, cpuidle exit costs etc.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-14 Thread Mel Gorman
On Wed, May 13, 2020 at 06:20:53PM +0200, Jirka Hladky wrote:
> Thank you, Mel!
> 
> I think I have to make sure we cover the scenario you have targeted
> when developing adjust_numa_imbalance:
> 
> ===
> https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8910
> 
> /*
> * Allow a small imbalance based on a simple pair of communicating
> * tasks that remain local when the source domain is almost idle.
> */
> ===
> 
> Could you point me to a benchmark for this scenario? I have checked
> https://github.com/gormanm/mmtests
> and we use lots of the same benchmarks but I'm not sure if we cover
> this particular scenario.
> 

The NUMA imbalance part showed up as part of the general effort to
reconcile NUMA balancing with Load balancing. It's been known for years
that the two balancers disagreed to the extent that NUMA balancing retries
migrations multiple times just to keep things local leading to excessive
migrations. The full battery of tests that were used when I was trying
to reconcile the balancers and later working on Vincent's version is
as follows

scheduler-unbound
scheduler-forkintensive
scheduler-perfpipe
scheduler-perfpipe-cpufreq
scheduler-schbench
db-pgbench-timed-ro-small-xfs
hpc-nas-c-class-mpi-full-xfs
hpc-nas-c-class-mpi-half-xfs
hpc-nas-c-class-omp-full
hpc-nas-c-class-omp-half
hpc-nas-d-class-mpi-full-xfs
hpc-nas-d-class-mpi-half-xfs
hpc-nas-d-class-omp-full
hpc-nas-d-class-omp-half
io-dbench4-async-ext4
io-dbench4-async-xfs
jvm-specjbb2005-multi
jvm-specjbb2005-single
network-netperf-cstate
network-netperf-rr-cstate
network-netperf-rr-unbound
network-netperf-unbound
network-tbench
numa-autonumabench
workload-kerndevel-xfs
workload-shellscripts-xfs

Where there is -ext4 or -xfs, just remove the filesystem to get the base
configuration. i.e. io-dbench4-async-ext4 basic configuration is
io-dbench4-async. Both filesystems are sometimes tested because they
interact differently with the scheduler due to ext4 using a journal
thread and xfs using workqueues.

The imbalance one is most obvious with network-netperf-unbound running
on localhost. If the client/server are on separate nodes, it's obvious
from mpstat that two nodes are busy and it's migrating quite a bit. The
second effect is that NUMA balancing is active, trapping hinting faults
and migrating pages.

The biggest problem I have right now is that the wakeup path between tasks
that are local is slower than doing a remote wakeup via wake_list that
potentially sends an IPI which is ridiculous. The slower wakeup manifests
as a loss of throughput for netperf even though all the accesses are
local. At least that's what I'm looking at whenever I get the chance.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-13 Thread Jirka Hladky
Thank you, Mel!

I think I have to make sure we cover the scenario you have targeted
when developing adjust_numa_imbalance:

===
https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8910

/*
* Allow a small imbalance based on a simple pair of communicating
* tasks that remain local when the source domain is almost idle.
*/
===

Could you point me to a benchmark for this scenario? I have checked
https://github.com/gormanm/mmtests
and we use lots of the same benchmarks but I'm not sure if we cover
this particular scenario.

Jirka


On Wed, May 13, 2020 at 5:30 PM Mel Gorman  wrote:
>
> On Wed, May 13, 2020 at 04:57:15PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > we have tried the kernel with adjust_numa_imbalance() crippled to just
> > return the imbalance it's given.
> >
> > It has solved all the performance problems I have reported.
> > Performance is the same as with 5.6 kernel (before the patch was
> > applied).
> >
> > * solved the performance drop of up to 20% with single instance
> > SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
> > Rome systems) => this performance drop was INCREASING with higher
> > threads counts (10% for 16 threads and 20 % for 32 threads)
> > * solved the performance drop for low load scenarios (SPECjvm2008 and NAS)
> >
> > Any suggestions on how to proceed? One approach is to turn
> > "imbalance_min" into the kernel tunable. Any other ideas?
> >
> > https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914
> >
>
> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> struct rq_flags rf;
>
>  #if defined(CONFIG_SMP)
> -   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> +   if (sched_feat(TTWU_QUEUE)) {
> sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> ttwu_queue_remote(p, cpu, wake_flags);
> return;
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-13 Thread Mel Gorman
On Wed, May 13, 2020 at 04:57:15PM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we have tried the kernel with adjust_numa_imbalance() crippled to just
> return the imbalance it's given.
> 
> It has solved all the performance problems I have reported.
> Performance is the same as with 5.6 kernel (before the patch was
> applied).
> 
> * solved the performance drop of up to 20% with single instance
> SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
> Rome systems) => this performance drop was INCREASING with higher
> threads counts (10% for 16 threads and 20 % for 32 threads)
> * solved the performance drop for low load scenarios (SPECjvm2008 and NAS)
> 
> Any suggestions on how to proceed? One approach is to turn
> "imbalance_min" into the kernel tunable. Any other ideas?
> 
> https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914
> 

Complete shot in the dark but restore adjust_numa_imbalance() and try
this

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a9983da4408..0b31f4468d5b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+   if (sched_feat(TTWU_QUEUE)) {
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
ttwu_queue_remote(p, cpu, wake_flags);
return;

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-13 Thread Jirka Hladky
Hi Mel,

we have tried the kernel with adjust_numa_imbalance() crippled to just
return the imbalance it's given.

It has solved all the performance problems I have reported.
Performance is the same as with 5.6 kernel (before the patch was
applied).

* solved the performance drop of up to 20% with single instance
SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
Rome systems) => this performance drop was INCREASING with higher
threads counts (10% for 16 threads and 20 % for 32 threads)
* solved the performance drop for low load scenarios (SPECjvm2008 and NAS)

Any suggestions on how to proceed? One approach is to turn
"imbalance_min" into the kernel tunable. Any other ideas?

https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914
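
For context, the function behind that link is small; a paraphrased sketch of
its v5.7 shape (not a verbatim copy of kernel/sched/fair.c) is below.
"Crippling" it, as in the test above, means returning the imbalance unchanged,
and the tunable idea amounts to replacing the hardcoded threshold:

static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
	unsigned int imbalance_min = 2;	/* the hardcoded value in question */

	/*
	 * Allow a small imbalance based on a simple pair of communicating
	 * tasks that remain local when the source domain is almost idle.
	 */
	if (src_nr_running <= imbalance_min)
		return 0;

	return imbalance;
}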

Thanks a lot!
Jirka






On Fri, May 8, 2020 at 12:40 PM Jirka Hladky  wrote:
>
> Hi Mel,
>
> thanks for hints! We will try it.
>
> @Phil - could you please prepare a kernel build for me to test?
>
> Thank you!
> Jirka
>
> On Fri, May 8, 2020 at 11:22 AM Mel Gorman  
> wrote:
>>
>> On Thu, May 07, 2020 at 06:29:44PM +0200, Jirka Hladky wrote:
>> > Hi Mel,
>> >
>> > we are not targeting just OMP applications. We see the performance
>> > degradation also for other workloads, like SPECjbb2005 and
>> > SPECjvm2008. Even worse, it also affects a higher number of threads.
>> > For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
>> > server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
>> > threads (the system has 64 CPUs in total). We observe this degradation
>> > only when we run a single SPECjbb binary. When running 4 SPECjbb
>> > binaries in parallel, there is no change in performance between 5.6
>> > and 5.7.
>> >
>>
>> Minimally I suggest confirming that it's really due to
>> adjust_numa_imbalance() by making the function a no-op and retesting.
>> I have found odd artifacts with it but I'm unsure how to proceed without
>> causing problems elsewhere.
>>
>> For example, netperf on localhost in some cases reported a regression
>> when the client and server were running on the same node. The problem
>> appears to be that netserver completes its work faster when running
>> local and goes idle more regularly. The cost of going idle and waking up
>> builds up and a lower throughput is reported but I'm not sure if gaming
>> an artifact like that is a good idea.
>>
>> > That's why we are asking for the kernel tunable, which we would add to
>> > the tuned profile. We don't expect users to change this frequently but
>> > rather to set the performance profile once based on the purpose of the
>> > server.
>> >
>> > If you could prepare a patch for us, we would be more than happy to
>> > test it extensively. Based on the results, we can then evaluate if
>> > it's the way to go. Thoughts?
>> >
>>
>> I would suggest simply disabling that function first to ensure that is
>> really what is causing problems for you.
>>
>> --
>> Mel Gorman
>> SUSE Labs
>>
>
>
> --
> -Jirka



-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-08 Thread Jirka Hladky
Hi Mel,

thanks for hints! We will try it.

@Phil - could you please prepare a kernel build for me to test?

Thank you!
Jirka

On Fri, May 8, 2020 at 11:22 AM Mel Gorman  wrote:
>
> On Thu, May 07, 2020 at 06:29:44PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > we are not targeting just OMP applications. We see the performance
> > degradation also for other workloads, like SPECjbb2005 and
> > SPECjvm2008. Even worse, it also affects a higher number of threads.
> > For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> > server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> > threads (the system has 64 CPUs in total). We observe this degradation
> > only when we run a single SPECjbb binary. When running 4 SPECjbb
> > binaries in parallel, there is no change in performance between 5.6
> > and 5.7.
> >
>
> Minimally I suggest confirming that it's really due to
> adjust_numa_imbalance() by making the function a no-op and retesting.
> I have found odd artifacts with it but I'm unsure how to proceed without
> causing problems elsewhere.
>
> For example, netperf on localhost in some cases reported a regression
> when the client and server were running on the same node. The problem
> appears to be that netserver completes its work faster when running
> local and goes idle more regularly. The cost of going idle and waking up
> builds up and a lower throughput is reported but I'm not sure if gaming
> an artifact like that is a good idea.
>
> > That's why we are asking for the kernel tunable, which we would add to
> > the tuned profile. We don't expect users to change this frequently but
> > rather to set the performance profile once based on the purpose of the
> > server.
> >
> > If you could prepare a patch for us, we would be more than happy to
> > test it extensively. Based on the results, we can then evaluate if
> > it's the way to go. Thoughts?
> >
>
> I would suggest simply disabling that function first to ensure that is
> really what is causing problems for you.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-08 Thread Mel Gorman
On Thu, May 07, 2020 at 06:29:44PM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we are not targeting just OMP applications. We see the performance
> degradation also for other workloads, like SPECjbb2005 and
> SPECjvm2008. Even worse, it also affects a higher number of threads.
> For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> threads (the system has 64 CPUs in total). We observe this degradation
> only when we run a single SPECjbb binary. When running 4 SPECjbb
> binaries in parallel, there is no change in performance between 5.6
> and 5.7.
> 

Minimally I suggest confirming that it's really due to
adjust_numa_imbalance() by making the function a no-op and retesting.
I have found odd artifacts with it but I'm unsure how to proceed without
causing problems elsewhere.

For example, netperf on localhost in some cases reported a regression
when the client and server were running on the same node. The problem
appears to be that netserver completes its work faster when running
local and goes idle more regularly. The cost of going idle and waking up
builds up and a lower throughput is reported but I'm not sure if gaming
an artifact like that is a good idea.

> That's why we are asking for the kernel tunable, which we would add to
> the tuned profile. We don't expect users to change this frequently but
> rather to set the performance profile once based on the purpose of the
> server.
> 
> If you could prepare a patch for us, we would be more than happy to
> test it extensively. Based on the results, we can then evaluate if
> it's the way to go. Thoughts?
> 

I would suggest simply disabling that function first to ensure that is
really what is causing problems for you.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-07 Thread Phil Auld
On Thu, May 07, 2020 at 06:29:44PM +0200 Jirka Hladky wrote:
> Hi Mel,
> 
> we are not targeting just OMP applications. We see the performance
> degradation also for other workloads, like SPECjbb2005 and
> SPECjvm2008. Even worse, it also affects a higher number of threads.
> For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> threads (the system has 64 CPUs in total). We observe this degradation
> only when we run a single SPECjbb binary. When running 4 SPECjbb
> binaries in parallel, there is no change in performance between 5.6
> and 5.7.
> 
> That's why we are asking for the kernel tunable, which we would add to
> the tuned profile. We don't expect users to change this frequently but
> rather to set the performance profile once based on the purpose of the
> server.
> 
> If you could prepare a patch for us, we would be more than happy to
> test it extensively. Based on the results, we can then evaluate if
> it's the way to go. Thoughts?
>

I'm happy to spin up a patch once I'm sure what exactly the tuning would
effect. At an initial glance I'm thinking it would be the imbalance_min
which is currently hardcoded to 2. But there may be something else...
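
Purely as an illustration of the sort of knob being discussed, and not an
actual patch -- the sysctl name and wiring below are hypothetical:

/* Hypothetical tunable replacing the hardcoded imbalance_min = 2. */
unsigned int sysctl_sched_numa_imbalance_min = 2;

/* adjust_numa_imbalance() would then compare src_nr_running against it. */
static struct ctl_table sched_numa_imbalance_table[] = {
	{
		.procname	= "sched_numa_imbalance_min",	/* made-up name */
		.data		= &sysctl_sched_numa_imbalance_min,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_douintvec,
	},
	{ }
};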


Cheers,
Phil


> Thanks a lot!
> Jirka
> 
> On Thu, May 7, 2020 at 5:54 PM Mel Gorman  wrote:
> >
> > On Thu, May 07, 2020 at 05:24:17PM +0200, Jirka Hladky wrote:
> > > Hi Mel,
> > >
> > > > > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > > > > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > > > > servers).
> > > >
> > > > Ok, so we know it's within the imbalance threshold where a NUMA node can
> > > > be left idle.
> > >
> > > we have discussed today with my colleagues the performance drop for
> > > some workloads for low threads counts (roughly up to 2x number of NUMA
> > > nodes). We are worried that it can be a severe issue for some use
> > > cases, which require a full memory bandwidth even when only part of
> > > CPUs is used.
> > >
> > > We understand that scheduler cannot distinguish this type of workload
> > > from others automatically. However, there was an idea for a * new
> > > kernel tunable to control the imbalance threshold *. Based on the
> > > purpose of the server, users could set this tunable. See the tuned
> > > project, which allows creating performance profiles [1].
> > >
> >
> > I'm not completely opposed to it but given that the setting is global,
> > I imagine it could have other consequences if two applications ran
> > at different times have different requirements. Given that it's OMP,
> > I would have imagined that an application that really cared about this
> > would specify what was needed using OMP_PLACES. Why would someone prefer
> > kernel tuning or a tuned profile over OMP_PLACES? After all, it requires
> > specific knowledge of the application even to know that a particular
> > tuned profile is needed.
> >
> > --
> > Mel Gorman
> > SUSE Labs
> >
> 
> 
> -- 
> -Jirka
> 

-- 



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-07 Thread Jirka Hladky
Hi Mel,

we are not targeting just OMP applications. We see the performance
degradation also for other workloads, like SPECjbb2005 and
SPECjvm2008. Even worse, it also affects a higher number of threads.
For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
threads (the system has 64 CPUs in total). We observe this degradation
only when we run a single SPECjbb binary. When running 4 SPECjbb
binaries in parallel, there is no change in performance between 5.6
and 5.7.

That's why we are asking for the kernel tunable, which we would add to
the tuned profile. We don't expect users to change this frequently but
rather to set the performance profile once based on the purpose of the
server.

If you could prepare a patch for us, we would be more than happy to
test it extensively. Based on the results, we can then evaluate if
it's the way to go. Thoughts?

Thanks a lot!
Jirka

On Thu, May 7, 2020 at 5:54 PM Mel Gorman  wrote:
>
> On Thu, May 07, 2020 at 05:24:17PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > > > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > > > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > > > servers).
> > >
> > > Ok, so we know it's within the imbalance threshold where a NUMA node can
> > > be left idle.
> >
> > we have discussed today with my colleagues the performance drop for
> > some workloads for low threads counts (roughly up to 2x number of NUMA
> > nodes). We are worried that it can be a severe issue for some use
> > cases, which require a full memory bandwidth even when only part of
> > CPUs is used.
> >
> > We understand that scheduler cannot distinguish this type of workload
> > from others automatically. However, there was an idea for a * new
> > kernel tunable to control the imbalance threshold *. Based on the
> > purpose of the server, users could set this tunable. See the tuned
> > project, which allows creating performance profiles [1].
> >
>
> I'm not completely opposed to it but given that the setting is global,
> I imagine it could have other consequences if two applications ran
> at different times have different requirements. Given that it's OMP,
> I would have imagined that an application that really cared about this
> would specify what was needed using OMP_PLACES. Why would someone prefer
> kernel tuning or a tuned profile over OMP_PLACES? After all, it requires
> specific knowledge of the application even to know that a particular
> tuned profile is needed.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-07 Thread Mel Gorman
On Thu, May 07, 2020 at 05:24:17PM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> > > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > > servers).
> >
> > Ok, so we know it's within the imbalance threshold where a NUMA node can
> > be left idle.
> 
> we have discussed today with my colleagues the performance drop for
> some workloads for low threads counts (roughly up to 2x number of NUMA
> nodes). We are worried that it can be a severe issue for some use
> cases, which require a full memory bandwidth even when only part of
> CPUs is used.
> 
> We understand that scheduler cannot distinguish this type of workload
> from others automatically. However, there was an idea for a * new
> kernel tunable to control the imbalance threshold *. Based on the
> purpose of the server, users could set this tunable. See the tuned
> project, which allows creating performance profiles [1].
> 

I'm not completely opposed to it but given that the setting is global,
I imagine it could have other consequences if two applications ran
at different times have different requirements. Given that it's OMP,
I would have imagined that an application that really cared about this
would specify what was needed using OMP_PLACES. Why would someone prefer
kernel tuning or a tuned profile over OMP_PLACES? After all, it requires
specific knowledge of the application even to know that a particular
tuned profile is needed.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-07 Thread Jirka Hladky
Hi Mel,

> > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > servers).
>
> Ok, so we know it's within the imbalance threshold where a NUMA node can
> be left idle.

we have discussed today with my colleagues the performance drop for
some workloads for low threads counts (roughly up to 2x number of NUMA
nodes). We are worried that it can be a severe issue for some use
cases, which require a full memory bandwidth even when only part of
CPUs is used.

We understand that scheduler cannot distinguish this type of workload
from others automatically. However, there was an idea for a * new
kernel tunable to control the imbalance threshold *. Based on the
purpose of the server, users could set this tunable. See the tuned
project, which allows creating performance profiles [1].

What do you think about this approach?

Thanks a lot!
Jirka

[1] https://tuned-project.org


On Fri, Mar 20, 2020 at 5:38 PM Mel Gorman  wrote:
>
> On Fri, Mar 20, 2020 at 04:30:08PM +0100, Jirka Hladky wrote:
> > >
> > > MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> > > gain on an 4-node machine for bt_C and a 3.88% regression on 8-nodes. I
> > > think it must be OMP you are using because I found I had to disable UA
> > > for MPI at some point in the past for reasons I no longer remember.
> >
> >
> > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > servers).
> >
>
> Ok, so we know it's within the imbalance threshold where a NUMA node can
> be left idle.
>
> > One possibility would be to spread wide always at clone time and assume
> > > wake_affine will pull related tasks but it's fragile because it breaks
> > > if the cloned task execs and then allocates memory from a remote node
> > > only to migrate to a local node immediately.
> >
> >
> > I think the only way to find out how it performs is to test it. If you
> > could prepare a patch like that, I'm more than happy to give it a try!
> >
>
> When the initial spreading was prevented, it was for pipelines mainly --
> even basic shell scripts. In that case it was observed that a shell would
> fork/exec two tasks connected via pipe that started on separate nodes and
> had allocated remote data before being pulled close. The processes were
> typically too short lived for NUMA balancing to fix it up and by exec time
> the information on where the fork happened was lost.  See 2c83362734da
> ("sched/fair: Consider SD_NUMA when selecting the most idle group to
> schedule on"). Now the logic has probably been partially broken since
> because of how SD_NUMA is now treated but the concern about spreading
> wide prematurely remains.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka