Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-06-09 Thread Chris Mason

On 06/06/2017 05:21 AM, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:
> > On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
> > > On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> > > > 
> > > > Please test..
> > > 
> > > Results are still coming in but things do look better with your patch
> > > applied.
> > > 
> > > It does look like there's a regression when running hackbench in
> > > process mode and when the CPUs are not fully utilised, e.g. check this
> > > out:
> > 
> > This turned out to be a false positive; your patch improves things as
> > far as I can see.
> 
> Hooray, I'll move it to a part of the queue intended for merging.


It's a little late, but Roman Gushchin helped get some runs of this with
our production workload.  The patch is ever so slightly better.


Thanks!

-chris



Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-06-06 Thread Peter Zijlstra
On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:
> On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
> > On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> > > 
> > > Please test..
> > 
> > Results are still coming in but things do look better with your patch
> > applied.
> > 
> > It does look like there's a regression when running hackbench in
> > process mode and when the CPUs are not fully utilised, e.g. check this
> > out:
> 
> This turned out to be a false positive; your patch improves things as
> far as I can see.

Hooray, I'll move it to a part of the queue intended for merging.

Thanks!


Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-06-05 Thread Matt Fleming
On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
> On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> > 
> > Please test..
> 
> Results are still coming in but things do look better with your patch
> applied.
> 
> It does look like there's a regression when running hackbench in
> process mode and when the CPUs are not fully utilised, e.g. check this
> out:

This turned out to be a false positive; your patch improves things as
far as I can see.


Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-05-19 Thread Matt Fleming
On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> 
> Please test..

Results are still coming in but things do look better with your patch
applied.

It does look like there's a regression when running hackbench in
process mode and when the CPUs are not fully utilised, e.g. check this
out:

hackbench-process-pipes
                                4.4.68                      4.4.68                      4.4.68                      4.4.68
                            sles12-sp3  select-idle-cpu-aggressive       for-each-cpu-wrap-fix        latest-hackbench-fix
Amean    1            0.8853 (  0.00%)            1.2160 (-37.35%)            1.0350 (-16.91%)            1.1853 (-33.89%)

This machine has 80 CPUs and that's a 40 process workload.

Here's the key:

select-idle-cpu-aggressive: 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
for-each-cpu-wrap-fix:      c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()")
latest-hackbench-fix:       this patch

But those results definitely look to be an exception. Here's the same
machine running the same number of tasks but with pthreads,

hackbench-thread-pipes
                                4.4.68                      4.4.68                      4.4.68                      4.4.68
                            sles12-sp3  select-idle-cpu-aggressive       for-each-cpu-wrap-fix        latest-hackbench-fix
Amean    1            0.7427 (  0.00%)            0.9760 (-31.42%)            1.1907 (-60.32%)            0.7643 ( -2.92%)

Nice win.


Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-05-17 Thread Chris Mason

On 05/17/2017 06:53 AM, Peter Zijlstra wrote:
> On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
> > sched/fair, cpumask: Export for_each_cpu_wrap()
> 
> > -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
> > -{
> 
> > -   next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
> 
> > -}
> 
> OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
> implementation. The above 'n+1' should be 'n', and the effect is that
> it'll skip over CPUs, potentially resulting in an iteration that only
> sees every other CPU (for a fully contiguous mask).
> 
> This in turn causes hackbench to further suffer from the regression
> introduced by commit:
> 
>   4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
> 
> So it's well past time to fix this.
> 
> Where the old scheme was a cliff-edge throttle on idle scanning, this
> introduces a more gradual approach. Instead of stopping to scan
> entirely, we limit how many CPUs we scan.
> 
> Initial benchmarks show that it mostly recovers hackbench while not
> hurting anything else, except Mason's schbench, but not as bad as the
> old thing.
> 
> It also appears to recover the tbench high-end, which also suffered like
> hackbench.
> 
> I'm also hoping it will fix/preserve kitsunyan's interactivity issue.
> 
> Please test..


We'll get some tests going here too.

-chris


Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-05-17 Thread Matt Fleming
On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
> > sched/fair, cpumask: Export for_each_cpu_wrap()
> 
> > -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
> > -{
> 
> > -   next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
> 
> > -}
> 
> OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
> implementation. The above 'n+1' should be 'n', and the effect is that
> it'll skip over CPUs, potentially resulting in an iteration that only
> sees every other CPU (for a fully contiguous mask).
> 
> This in turn causes hackbench to further suffer from the regression
> introduced by commit:
> 
>   4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
> 
> So it's well past time to fix this.
> 
> Where the old scheme was a cliff-edge throttle on idle scanning, this
> introduces a more gradual approach. Instead of stopping to scan
> entirely, we limit how many CPUs we scan.
> 
> Initial benchmarks show that it mostly recovers hackbench while not
> hurting anything else, except Mason's schbench, but not as bad as the
> old thing.
> 
> It also appears to recover the tbench high-end, which also suffered like
> hackbench.
> 
> I'm also hoping it will fix/preserve kitsunyan's interactivity issue.
> 
> Please test..

Tests queued up at SUSE. I'll let you know the results as soon as
they're ready -- it'll be at least a couple of days.


hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-05-17 Thread Peter Zijlstra
On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
> sched/fair, cpumask: Export for_each_cpu_wrap()

> -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
> -{

> - next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);

> -}

OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
implementation. The above 'n+1' should be 'n', and the effect is that
it'll skip over CPUs, potentially resulting in an iteration that only
sees every other CPU (for a fully contiguous mask).
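
As an aside, the skipping effect is easy to reproduce in a few lines of
plain C. The snippet below is only an illustration, not the kernel helper
or macro (find_next_set(), NR_CPUS and the loop shape are made up); it
shows how searching from 'n+1' when the caller has already stepped past
the last visited CPU ends up touching only every other CPU of a contiguous
mask (the wrap-around part of the real iterator is left out for brevity):

#include <stdio.h>

#define NR_CPUS 8

/* Return the first set bit >= idx in 'mask', or NR_CPUS if there is none. */
static int find_next_set(unsigned int mask, int idx)
{
        for (; idx < NR_CPUS; idx++)
                if (mask & (1u << idx))
                        return idx;
        return NR_CPUS;
}

int main(void)
{
        unsigned int mask = 0xff;       /* contiguous mask: CPUs 0-7 all set */
        int i, cpu = -1;

        for (i = 0; i < NR_CPUS; i++) {
                cpu = find_next_set(mask, cpu + 1);     /* the 'n+1' search */
                if (cpu >= NR_CPUS)
                        break;
                printf("visited CPU %d\n", cpu);        /* prints 0, 2, 4, 6 */
                cpu++;  /* caller advances as well, so the next search skips one */
        }
        return 0;
}

Dropping either advance (search from 'n', or remove the caller-side
increment) makes the loop visit every CPU, which is what the exported
for_each_cpu_wrap() does.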

This in turn causes hackbench to further suffer from the regression
introduced by commit:

  4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")

So it's well past time to fix this.

Where the old scheme was a cliff-edge throttle on idle scanning, this
introduces a more gradual approach. Instead of stopping to scan
entirely, we limit how many CPUs we scan.
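
To make the scaling concrete, here is a small stand-alone sketch of the cap
the patch below computes. Only the formula comes from the patch; the function
name, the userspace typedef and the example numbers are made up for
illustration, and the times are assumed to be in nanoseconds as returned by
local_clock():

#include <stdint.h>

typedef uint64_t u64;   /* stand-in for the kernel type */

/*
 * Number of CPUs select_idle_cpu() is allowed to scan: proportional to how
 * idle this CPU has been, relative to the average cost of scanning one CPU,
 * with a floor of 4.
 */
static int sis_prop_scan_limit(u64 avg_idle_ns, u64 avg_scan_cost_ns,
                               unsigned int span_weight)
{
        u64 avg_idle = avg_idle_ns / 512;       /* large fuzz factor */
        u64 avg_cost = avg_scan_cost_ns + 1;    /* avoid a zero divisor */
        u64 span_avg = span_weight * avg_idle;

        if (span_avg > 4 * avg_cost)
                return (int)(span_avg / avg_cost);
        return 4;                               /* always scan a handful */
}

/*
 * Example with made-up numbers, 80-CPU LLC: avg_idle = 200000ns and
 * avg_scan_cost = 400ns give (200000/512) * 80 / 401 ~= 77, i.e. nearly the
 * whole domain; if avg_idle drops to 20000ns the cap falls to ~7 CPUs.
 */

Since SIS_PROP is an ordinary scheduler feature bit, on CONFIG_SCHED_DEBUG
kernels it should also be togglable at runtime via
/sys/kernel/debug/sched_features (write NO_SIS_PROP to disable it) when
comparing before/after numbers.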

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

I'm also hoping it will fix/preserve kitsunyan's interactivity issue.

Please test..

Cc: xiaolong...@intel.com
Cc: kitsunyan 
Cc: Chris Mason 
Cc: Mike Galbraith 
Cc: Matt Fleming 
---
 kernel/sched/fair.c     | 21 ++++++++++++++++-----
 kernel/sched/features.h |  1 +
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 219fe58..8c1afa0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5793,27 +5793,38 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
        struct sched_domain *this_sd;
-       u64 avg_cost, avg_idle = this_rq()->avg_idle;
+       u64 avg_cost, avg_idle;
        u64 time, cost;
        s64 delta;
-       int cpu;
+       int cpu, nr = INT_MAX;
 
        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
        if (!this_sd)
                return -1;
 
-       avg_cost = this_sd->avg_scan_cost;
-
        /*
         * Due to large variance we need a large fuzz factor; hackbench in
         * particularly is sensitive here.
         */
-       if (sched_feat(SIS_AVG_CPU) && (avg_idle / 512) < avg_cost)
+       avg_idle = this_rq()->avg_idle / 512;
+       avg_cost = this_sd->avg_scan_cost + 1;
+
+       if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
                return -1;
 
+       if (sched_feat(SIS_PROP)) {
+               u64 span_avg = sd->span_weight * avg_idle;
+               if (span_avg > 4*avg_cost)
+                       nr = div_u64(span_avg, avg_cost);
+               else
+                       nr = 4;
+       }
+
        time = local_clock();
 
        for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+               if (!--nr)
+                       return -1;
                if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
                        continue;
                if (idle_cpu(cpu))
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index dc4d148..d3fb155 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -55,6 +55,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
+SCHED_FEAT(SIS_PROP, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls


