Re: [PATCH 2/2] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine

2018-02-12 Thread Mel Gorman
On Mon, Feb 12, 2018 at 06:37:43PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 12, 2018 at 05:11:31PM +, Mel Gorman wrote:
> > +static void
> > +update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
> > +{
> > +   unsigned long interval;
> > +
> > +   if (!static_branch_likely(&sched_numa_balancing))
> > +   return;
> > +
> > +   /* If balancing has no preference then accept the target */
> > +   if (p->numa_preferred_nid == -1)
> > +   return;
> > +
> > +   /* If the wakeup is not affecting locality then accept the target */
> > +   if (cpus_share_cache(prev_cpu, target))
> > +   return;
> 
> Both the above comments speak of 'accepting' the target, but it's a void
> function, there's nothing they can do about it. It cannot not accept the
> placement.
> 

It's stale phrasing from an initial prototype that tried altering the
placement, which failed miserably. I'll fix it.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/2] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine

2018-02-12 Thread Mel Gorman
On Mon, Feb 12, 2018 at 06:34:49PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 12, 2018 at 05:11:31PM +, Mel Gorman wrote:
> 
> > However, the benefit in other cases is large. This is the result for NAS
> > with the D class sizing on a 4-socket machine
> > 
> >   4.15.0 4.15.0
> > sdnuma-v1r23   delayretry-v1r23
> > Time cg.D  557.00 (   0.00%)  431.82 (  22.47%)
> > Time ep.D   77.83 (   0.00%)   79.01 (  -1.52%)
> > Time is.D   26.46 (   0.00%)   26.64 (  -0.68%)
> > Time lu.D  727.14 (   0.00%)  597.94 (  17.77%)
> > Time mg.D  191.35 (   0.00%)  146.85 (  23.26%)
> 
> Last time I checked, we were some ~25% from OMP_PROC_BIND with NAS, this
> seems to close that hole significantly. Do you happen to have
> OMP_PROC_BIND numbers handy to see how far away we are from manual
> affinity?
> 

OMP_PROC_BIND implies OpenMP, and this particular test was using MPI for
parallelisation. I'll look into doing an OpenMP comparison and see what it
looks like with OMP_PROC_BIND set.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/2] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine

2018-02-12 Thread Peter Zijlstra
On Mon, Feb 12, 2018 at 05:11:31PM +, Mel Gorman wrote:
> +static void
> +update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
> +{
> + unsigned long interval;
> +
> + if (!static_branch_likely(&sched_numa_balancing))
> + return;
> +
> + /* If balancing has no preference then accept the target */
> + if (p->numa_preferred_nid == -1)
> + return;
> +
> + /* If the wakeup is not affecting locality then accept the target */
> + if (cpus_share_cache(prev_cpu, target))
> + return;

Both the above comments speak of 'accepting' the target, but it's a void
function, there's nothing they can do about it. It cannot not accept the
placement.

> +
> +	/*
> +	 * Temporarily prevent NUMA balancing trying to place waker/wakee after
> +	 * wakee has been moved by wake_affine. This will potentially allow
> +	 * related tasks to converge and update their data placement. The
> +	 * 4 * numa_scan_period is to allow the two-pass filter to migrate
> +	 * hot data to the waker's node.
> +	 */
> + interval = max(sysctl_numa_balancing_scan_delay,
> +  p->numa_scan_period << 2);
> + p->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
> +
> + interval = max(sysctl_numa_balancing_scan_delay,
> +  current->numa_scan_period << 2);
> + current->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
> +}

Otherwise that makes sense.


Re: [PATCH 2/2] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine

2018-02-12 Thread Peter Zijlstra
On Mon, Feb 12, 2018 at 05:11:31PM +, Mel Gorman wrote:

> However, the benefit in other cases is large. This is the result for NAS
> with the D class sizing on a 4-socket machine
> 
>   4.15.0 4.15.0
> sdnuma-v1r23   delayretry-v1r23
> Time cg.D  557.00 (   0.00%)  431.82 (  22.47%)
> Time ep.D   77.83 (   0.00%)   79.01 (  -1.52%)
> Time is.D   26.46 (   0.00%)   26.64 (  -0.68%)
> Time lu.D  727.14 (   0.00%)  597.94 (  17.77%)
> Time mg.D  191.35 (   0.00%)  146.85 (  23.26%)

Last time I checked, we were some ~25% from OMP_PROC_BIND with NAS, this
seems to close that hole significantly. Do you happen to have
OMP_PROC_BIND numbers handy to see how far away we are from manual
affinity?



[PATCH 2/2] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine

2018-02-12 Thread Mel Gorman
If wake_affine pulls a task to another node for any reason and the node is
no longer preferred then temporarily stop automatic NUMA balancing pulling
the task back. Otherwise, tasks with a strong waker/wakee relationship
may constantly fight automatic NUMA balancing over where a task should
be placed.

Once again, netperf is interesting here. The performance barely changes,
but the automatic NUMA balancing behaviour does.

Hmean     send-64        354.67 (   0.00%)      352.15 (  -0.71%)
Hmean     send-128       702.91 (   0.00%)      693.84 (  -1.29%)
Hmean     send-256      1350.07 (   0.00%)     1344.19 (  -0.44%)
Hmean     send-1024     5124.38 (   0.00%)     4941.24 (  -3.57%)
Hmean     send-2048     9687.44 (   0.00%)     9624.45 (  -0.65%)
Hmean     send-3312    14577.64 (   0.00%)    14514.35 (  -0.43%)
Hmean     send-4096    16393.62 (   0.00%)    16488.30 (   0.58%)
Hmean     send-8192    26877.26 (   0.00%)    26431.63 (  -1.66%)
Hmean     send-16384   38683.43 (   0.00%)    38264.91 (  -1.08%)
Hmean     recv-64        354.67 (   0.00%)      352.15 (  -0.71%)
Hmean     recv-128       702.91 (   0.00%)      693.84 (  -1.29%)
Hmean     recv-256      1350.07 (   0.00%)     1344.19 (  -0.44%)
Hmean     recv-1024     5124.38 (   0.00%)     4941.24 (  -3.57%)
Hmean     recv-2048     9687.43 (   0.00%)     9624.45 (  -0.65%)
Hmean     recv-3312    14577.59 (   0.00%)    14514.35 (  -0.43%)
Hmean     recv-4096    16393.55 (   0.00%)    16488.20 (   0.58%)
Hmean     recv-8192    26876.96 (   0.00%)    26431.29 (  -1.66%)
Hmean     recv-16384   38682.41 (   0.00%)    38263.94 (  -1.08%)

NUMA alloc hit                 1465986     1423090
NUMA alloc miss                      0           0
NUMA interleave hit                  0           0
NUMA alloc local               1465897     1423003
NUMA base PTE updates             1473        1420
NUMA huge PMD updates                0           0
NUMA page range updates           1473        1420
NUMA hint faults                  1383        1312
NUMA hint local faults             451         124
NUMA hint local percent             32           9

There is a slight degradation in performance but slightly fewer NUMA
faults. There is a large drop in the percentage of local faults, but the
bulk of migrations for netperf are in small shared libraries, so it
reflects the fact that automatic NUMA balancing has backed off. This is
a case where, despite wake_affine and automatic NUMA balancing fighting
over placement, there is a marginal benefit to rescheduling close to
local data quickly. However, it should be noted that wake_affine and
automatic NUMA balancing fighting each other constantly is undesirable.

However, the benefit in other cases is large. This is the result for NAS
with the D class sizing on a 4-socket machine

  4.15.0 4.15.0
sdnuma-v1r23   delayretry-v1r23
Time cg.D  557.00 (   0.00%)  431.82 (  22.47%)
Time ep.D   77.83 (   0.00%)   79.01 (  -1.52%)
Time is.D   26.46 (   0.00%)   26.64 (  -0.68%)
Time lu.D  727.14 (   0.00%)  597.94 (  17.77%)
Time mg.D  191.35 (   0.00%)  146.85 (  23.26%)

  4.15.0  4.15.0
sdnuma-v1r23delayretry-v1r23
User75665.2070413.30
System  20321.59 8861.67
Elapsed   766.13  634.92

Minor Faults              16528502     7127941
Major Faults                  4553        5068
NUMA alloc local           6963197     6749135
NUMA base PTE updates    366409093   107491434
NUMA huge PMD updates       687556      198880
NUMA page range updates  718437765   209317994
NUMA hint faults          13643410     4601187
NUMA hint local faults     9212593     3063996
NUMA hint local percent         67          66

Note the massive reduction in system CPU usage even though the percentage
of local faults is barely affected. There is a similarly large drop in the
number of PTE updates, showing that automatic NUMA balancing has backed off.
A critical observation is also the steep fall in minor faults, which is
due to far fewer NUMA hinting faults being trapped.

Other workloads like hackbench, tbench, dbench and schbench are barely
affected. dbench shows a mix of gains and losses depending on the machine
although in general, the results are more stable.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 54 -
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0192448e43a2..396d95f06f35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1869,6 +1869,7 @@ static int task_numa_migrate(struct task_struct *p)
 static void numa_migrate_preferred(struct task_struct *p)
 {
unsigned long interval = HZ;
+   unsigned long numa_migrate_retry;
 
/* This task has no NUMA fault statistics yet */