Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On 06/06/2017 05:21 AM, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:
> > On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
> > > On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> > > >
> > > > Please test..
> > >
> > > Results are still coming in but things do look better with your patch
> > > applied.
> > >
> > > It does look like there's a regression when running hackbench in
> > > process mode and when the CPUs are not fully utilised, e.g. check this
> > > out:
> >
> > This turned out to be a false positive; your patch improves things as
> > far as I can see.
>
> Hooray, I'll move it to a part of the queue intended for merging.

It's a little late, but Roman Gushchin helped get some runs of this with
our production workload. The patch is ever so slightly better.

Thanks!

-chris
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:
> On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
> > On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> > >
> > > Please test..
> >
> > Results are still coming in but things do look better with your patch
> > applied.
> >
> > It does look like there's a regression when running hackbench in
> > process mode and when the CPUs are not fully utilised, e.g. check this
> > out:
>
> This turned out to be a false positive; your patch improves things as
> far as I can see.

Hooray, I'll move it to a part of the queue intended for merging. Thanks!
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
> On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> >
> > Please test..
>
> Results are still coming in but things do look better with your patch
> applied.
>
> It does look like there's a regression when running hackbench in
> process mode and when the CPUs are not fully utilised, e.g. check this
> out:

This turned out to be a false positive; your patch improves things as
far as I can see.
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
>
> Please test..

Results are still coming in but things do look better with your patch
applied.

It does look like there's a regression when running hackbench in
process mode and when the CPUs are not fully utilised, e.g. check this
out:

hackbench-process-pipes
                  4.4.68                     4.4.68                4.4.68               4.4.68
              sles12-sp3 select-idle-cpu-aggressive for-each-cpu-wrap-fix latest-hackbench-fix
Amean    1   0.8853 (  0.00%)     1.2160 (-37.35%)      1.0350 (-16.91%)     1.1853 (-33.89%)

This machine has 80 CPUs and that's a 40 process workload.

Here's the key:

  select-idle-cpu-aggressive: 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
  for-each-cpu-wrap-fix:      c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()")
  latest-hackbench-fix:       this patch

But those results definitely look to be an exception. Here's the same
machine running the same number of tasks but with pthreads,

hackbench-thread-pipes
                  4.4.68                     4.4.68                4.4.68               4.4.68
              sles12-sp3 select-idle-cpu-aggressive for-each-cpu-wrap-fix latest-hackbench-fix
Amean    1   0.7427 (  0.00%)     0.9760 (-31.42%)      1.1907 (-60.32%)     0.7643 ( -2.92%)

Nice win.
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On 05/17/2017 06:53 AM, Peter Zijlstra wrote:
> On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
> > sched/fair, cpumask: Export for_each_cpu_wrap()
>
> > -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start,
> > -			      int *wrapped)
> > -{
>
> > -	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
>
> > -}
>
> OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
> implementation. The above 'n+1' should be 'n', and the effect is that
> it'll skip over CPUs, potentially resulting in an iteration that only
> sees every other CPU (for a fully contiguous mask).
>
> This in turn causes hackbench to further suffer from the regression
> introduced by commit:
>
>   4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
>
> So its well past time to fix this.
>
> Where the old scheme was a cliff-edge throttle on idle scanning, this
> introduces a more gradual approach. Instead of stopping to scan
> entirely, we limit how many CPUs we scan.
>
> Initial benchmarks show that it mostly recovers hackbench while not
> hurting anything else, except Mason's schbench, but not as bad as the
> old thing.
>
> It also appears to recover the tbench high-end, which also suffered like
> hackbench.
>
> I'm also hoping it will fix/preserve kitsunyan's interactivity issue.
>
> Please test..

We'll get some tests going here too.

-chris
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
> On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
> > sched/fair, cpumask: Export for_each_cpu_wrap()
>
> > -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start,
> > -			      int *wrapped)
> > -{
>
> > -	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
>
> > -}
>
> OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
> implementation. The above 'n+1' should be 'n', and the effect is that
> it'll skip over CPUs, potentially resulting in an iteration that only
> sees every other CPU (for a fully contiguous mask).
>
> This in turn causes hackbench to further suffer from the regression
> introduced by commit:
>
>   4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
>
> So its well past time to fix this.
>
> Where the old scheme was a cliff-edge throttle on idle scanning, this
> introduces a more gradual approach. Instead of stopping to scan
> entirely, we limit how many CPUs we scan.
>
> Initial benchmarks show that it mostly recovers hackbench while not
> hurting anything else, except Mason's schbench, but not as bad as the
> old thing.
>
> It also appears to recover the tbench high-end, which also suffered like
> hackbench.
>
> I'm also hoping it will fix/preserve kitsunyan's interactivity issue.
>
> Please test..

Tests queued up at SUSE. I'll let you know the results as soon as
they're ready -- it'll be at least a couple of days.
hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
> sched/fair, cpumask: Export for_each_cpu_wrap()

> -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start,
> -			      int *wrapped)
> -{

> -	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);

> -}

OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
implementation. The above 'n+1' should be 'n', and the effect is that
it'll skip over CPUs, potentially resulting in an iteration that only
sees every other CPU (for a fully contiguous mask).

This in turn causes hackbench to further suffer from the regression
introduced by commit:

  4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")

So it's well past time to fix this.

Where the old scheme was a cliff-edge throttle on idle scanning, this
introduces a more gradual approach. Instead of stopping to scan
entirely, we limit how many CPUs we scan.

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

I'm also hoping it will fix/preserve kitsunyan's interactivity issue.

Please test..
Cc: xiaolong...@intel.com
Cc: kitsunyan
Cc: Chris Mason
Cc: Mike Galbraith
Cc: Matt Fleming
---
 kernel/sched/fair.c     | 21 ++++++++++++++++-----
 kernel/sched/features.h |  1 +
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 219fe58..8c1afa0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5793,27 +5793,38 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *this_sd;
-	u64 avg_cost, avg_idle = this_rq()->avg_idle;
+	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu;
+	int cpu, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
 		return -1;
 
-	avg_cost = this_sd->avg_scan_cost;
-
 	/*
 	 * Due to large variance we need a large fuzz factor; hackbench in
 	 * particularly is sensitive here.
 	 */
-	if (sched_feat(SIS_AVG_CPU) && (avg_idle / 512) < avg_cost)
+	avg_idle = this_rq()->avg_idle / 512;
+	avg_cost = this_sd->avg_scan_cost + 1;
+
+	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
 		return -1;
 
+	if (sched_feat(SIS_PROP)) {
+		u64 span_avg = sd->span_weight * avg_idle;
+		if (span_avg > 4*avg_cost)
+			nr = div_u64(span_avg, avg_cost);
+		else
+			nr = 4;
+	}
+
 	time = local_clock();
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+		if (!--nr)
+			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (idle_cpu(cpu))
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index dc4d148..d3fb155 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -55,6 +55,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
+SCHED_FEAT(SIS_PROP, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls