Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
On Wed, Jan 20, 2021 at 10:21:47AM +0100, Vincent Guittot wrote:
> On Wed, 20 Jan 2021 at 10:12, Mel Gorman wrote:
> >
> > On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > > > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > >     }
> > > >
> > > >     for_each_cpu_wrap(cpu, cpus, target) {
> > > > -       if (!--nr)
> > > > -           return -1;
> > > > -       if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > > > -           break;
> > > > +       if (smt) {
> > > > +           i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > > +           if ((unsigned int)i < nr_cpumask_bits)
> > > > +               return i;
> > > > +
> > > > +       } else {
> > > > +           if (!--nr)
> > > > +               return -1;
> > > > +           i = __select_idle_cpu(cpu);
> > > > +           if ((unsigned int)i < nr_cpumask_bits) {
> > > > +               idle_cpu = i;
> > > > +               break;
> > > > +           }
> > > > +       }
> > > >     }
> > > >
> > > > -   if (sched_feat(SIS_PROP)) {
> > > > +   if (smt)
> > > > +       set_idle_cores(this, false);
> > >
> > > Shouldn't we set_idle_cores(false) only if this was the last idle
> > > core in the LLC ?
> > >
> >
> > That would involve rechecking the cpumask bits that have not been
> > scanned to see if any of them are an idle core. As the existence of
> > idle cores can change very rapidly, it's not worth the cost.
>
> But don't we reach this point only if we scanned all CPUs and didn't
> find an idle core ?

Yes, but my understanding of Gautham's suggestion was to check whether
an idle core that was found was the last idle core available and set
has_idle_cores to false in that case. I think this would be relatively
expensive and possibly futile, as returning the last idle core for this
wakeup does not mean there will be no idle core on the next wakeup;
other cores may go idle between wakeups.

--
Mel Gorman
SUSE Labs
Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >     }
> >
> >     for_each_cpu_wrap(cpu, cpus, target) {
> > -       if (!--nr)
> > -           return -1;
> > -       if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > -           break;
> > +       if (smt) {
> > +           i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > +           if ((unsigned int)i < nr_cpumask_bits)
> > +               return i;
> > +
> > +       } else {
> > +           if (!--nr)
> > +               return -1;
> > +           i = __select_idle_cpu(cpu);
> > +           if ((unsigned int)i < nr_cpumask_bits) {
> > +               idle_cpu = i;
> > +               break;
> > +           }
> > +       }
> >     }
> >
> > -   if (sched_feat(SIS_PROP)) {
> > +   if (smt)
> > +       set_idle_cores(this, false);
>
> Shouldn't we set_idle_cores(false) only if this was the last idle
> core in the LLC ?
>

That would involve rechecking the cpumask bits that have not been
scanned to see if any of them are an idle core. As the existence of
idle cores can change very rapidly, it's not worth the cost.

--
Mel Gorman
SUSE Labs
Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass
On Tue, Jan 19, 2021 at 12:33:04PM +0100, Vincent Guittot wrote:
> On Tue, 19 Jan 2021 at 12:22, Mel Gorman wrote:
> >
> > Changelog since v2
> > o Remove unnecessary parameters
> > o Update nr during scan only when scanning for cpus
>
> Hi Mel,
>
> I haven't looked at your previous version mainly because I'm chasing a
> performance regression on v5.11-rcx which prevents me from testing the
> impact of your patchset on my !SMT2 system.
> Will do this as soon as this problem is fixed
>

Thanks, that would be appreciated as I do not have access to a !SMT2
system to do my own evaluation.

--
Mel Gorman
SUSE Labs
[PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
From: Peter Zijlstra (Intel)

Both select_idle_core() and select_idle_cpu() do a loop over the same
cpumask. Observe that by clearing the already visited CPUs, we can fold
the iteration and iterate a core at a time.

All we need to do is remember any non-idle CPU we encountered while
scanning for an idle core. This way we'll only iterate every CPU once.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 101 ++--
 1 file changed, 61 insertions(+), 40 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12e08da90024..822851b39b65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 	return new_cpu;
 }
 
+static inline int __select_idle_cpu(int cpu)
+{
+	if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+		return cpu;
+
+	return -1;
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6066,40 +6074,34 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu;
+	bool idle = true;
+	int cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
-		return -1;
-
-	if (!test_idle_cores(target, false))
-		return -1;
-
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+		return __select_idle_cpu(core);
 
-	for_each_cpu_wrap(core, cpus, target) {
-		bool idle = true;
-
-		for_each_cpu(cpu, cpu_smt_mask(core)) {
-			if (!available_idle_cpu(cpu)) {
-				idle = false;
-				break;
+	for_each_cpu(cpu, cpu_smt_mask(core)) {
+		if (!available_idle_cpu(cpu)) {
+			idle = false;
+			if (*idle_cpu == -1) {
+				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
+					*idle_cpu = cpu;
+					break;
+				}
+				continue;
 			}
+			break;
 		}
-
-		if (idle)
-			return core;
-
-		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+			*idle_cpu = cpu;
 	}
 
-	/*
-	 * Failed to find an idle core; stop looking for one.
-	 */
-	set_idle_cores(target, 0);
+	if (idle)
+		return core;
 
+	cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 	return -1;
 }
 
@@ -6107,9 +6109,18 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 #define sched_smt_weight 1
 
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static inline void set_idle_cores(int cpu, int val)
 {
-	return -1;
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+	return def;
+}
+
+static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
+{
+	return __select_idle_cpu(core);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6124,10 +6135,11 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	bool smt = test_idle_cores(target, false);
+	int this = smp_processor_id();
 	struct sched_domain *this_sd;
 	u64 time;
-	int this = smp_processor_id();
-	int cpu, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-	if (sched_feat(SIS_PROP)) {
+	if (sched_feat(SIS_PROP) && !smt) {
 		u64 avg_cost, avg_idle, span_avg;
 
 		/*
@@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
-		if
[PATCH 4/5] sched/fair: Remove select_idle_smt()
From: Peter Zijlstra (Intel)

In order to make the next patch more readable, and to quantify the
actual effectiveness of this pass, start by removing it.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 30 --
 1 file changed, 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0811e2fe4f19..12e08da90024 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6103,27 +6103,6 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 	return -1;
 }
 
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	int cpu;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -1;
-
-	for_each_cpu(cpu, cpu_smt_mask(target)) {
-		if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
-		    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
-			continue;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			return cpu;
-	}
-
-	return -1;
-}
-
 #else /* CONFIG_SCHED_SMT */
 
 #define sched_smt_weight 1
@@ -6133,11 +6112,6 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s
 	return -1;
 }
 
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	return -1;
-}
-
 #endif /* CONFIG_SCHED_SMT */
 
 #define sis_min_cores 2
@@ -6331,10 +6305,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-	i = select_idle_smt(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
-
 	return target;
 }
-- 
2.26.2
[PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores
From: Peter Zijlstra (Intel)

Instead of calculating how many (logical) CPUs to scan, compute how
many cores to scan.

This changes behaviour for anything !SMT2.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Mel Gorman
---
 kernel/sched/core.c  | 18 +-
 kernel/sched/fair.c  | 12 ++--
 kernel/sched/sched.h |  2 ++
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15d2562118d1..ada8faac2e4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7444,11 +7444,19 @@ int sched_cpu_activate(unsigned int cpu)
 	balance_push_set(cpu, false);
 
 #ifdef CONFIG_SCHED_SMT
-	/*
-	 * When going up, increment the number of cores with SMT present.
-	 */
-	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
-		static_branch_inc_cpuslocked(&sched_smt_present);
+	do {
+		int weight = cpumask_weight(cpu_smt_mask(cpu));
+
+		if (weight > sched_smt_weight)
+			sched_smt_weight = weight;
+
+		/*
+		 * When going up, increment the number of cores with SMT present.
+		 */
+		if (weight == 2)
+			static_branch_inc_cpuslocked(&sched_smt_present);
+
+	} while (0);
 #endif
 	set_cpu_active(cpu, true);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8d8e185cf3b..0811e2fe4f19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6010,6 +6010,8 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+int sched_smt_weight __read_mostly = 1;
+
 static inline void set_idle_cores(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
@@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 
 #else /* CONFIG_SCHED_SMT */
 
+#define sched_smt_weight 1
+
 static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	return -1;
@@ -6136,6 +6140,8 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+#define sis_min_cores 2
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		avg_cost = this_sd->avg_scan_cost + 1;
 
 		span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		if (span_avg > sis_min_cores*avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
-			nr = 4;
+			nr = sis_min_cores;
+
+		nr *= sched_smt_weight;
 
 		time = cpu_clock(this);
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12ada79d40f3..29aabe98dd1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1107,6 +1107,8 @@ static inline void update_idle_core(struct rq *rq)
 		__update_idle_core(rq);
 }
 
+extern int sched_smt_weight;
+
 #else
 static inline void update_idle_core(struct rq *rq) { }
 #endif
-- 
2.26.2
[PATCH 2/5] sched/fair: Move avg_scan_cost calculations under SIS_PROP
As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a
SIS_PROP check and while we are at it, exclude the cost of initialising
the CPU mask from the average scan cost.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f5682aeda2e..c8d8e185cf3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
 	if (sched_feat(SIS_PROP)) {
 		u64 avg_cost, avg_idle, span_avg;
 
@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			nr = div_u64(span_avg, avg_cost);
 		else
 			nr = 4;
-	}
-
-	time = cpu_clock(this);
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+		time = cpu_clock(this);
+	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			break;
 	}
 
-	time = cpu_clock(this) - time;
-	update_avg(&this_sd->avg_scan_cost, time);
+	if (sched_feat(SIS_PROP)) {
+		time = cpu_clock(this) - time;
+		update_avg(&this_sd->avg_scan_cost, time);
+	}
 
 	return cpu;
 }
-- 
2.26.2
[PATCH 1/5] sched/fair: Remove SIS_AVG_CPU
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It was
a blunt instrument and disabled by commit 4c77b18cf8b7 ("sched/fair:
Make select_idle_cpu() more aggressive") and later replaced with a
proportional search depth by commit 1ad3aaf3fcd2 ("sched/core:
Implement new approach to scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now
been disabled for almost three years. As the intent of SIS_PROP is to
reduce the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU
and focus on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman
Reviewed-by: Vincent Guittot
---
 kernel/sched/fair.c     | 20 +---
 kernel/sched/features.h |  1 -
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..9f5682aeda2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	struct sched_domain *this_sd;
-	u64 avg_cost, avg_idle;
 	u64 time;
 	int this = smp_processor_id();
 	int cpu, nr = INT_MAX;
@@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	avg_idle = this_rq()->avg_idle / 512;
-	avg_cost = this_sd->avg_scan_cost + 1;
+	if (sched_feat(SIS_PROP)) {
+		u64 avg_cost, avg_idle, span_avg;
 
-	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-		return -1;
+		/*
+		 * Due to large variance we need a large fuzz factor;
+		 * hackbench in particularly is sensitive here.
+		 */
+		avg_idle = this_rq()->avg_idle / 512;
+		avg_cost = this_sd->avg_scan_cost + 1;
 
-	if (sched_feat(SIS_PROP)) {
-		u64 span_avg = sd->span_weight * avg_idle;
+		span_avg = sd->span_weight * avg_idle;
 		if (span_avg > 4*avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2
[PATCH v3 0/5] Scan for an idle sibling in a single pass
Changelog since v2
o Remove unnecessary parameters
o Update nr during scan only when scanning for cpus

Changelog since v1
o Move extern declaration to header for coding style
o Remove unnecessary parameter from __select_idle_cpu

This series of 5 patches reposts three patches from Peter entitled
"select_idle_sibling() wreckage". It only scans the runqueues in a
single pass when searching for an idle sibling.

Two patches from Peter were dropped. The first patch altered how scan
depth was calculated. Scan depth selection is a random number generator
with two major limitations. The avg_idle time is based on the time
between a CPU going idle and being woken up, clamped approximately by
2*sysctl_sched_migration_cost. This is difficult to compare in a
sensible fashion to avg_scan_cost. The second issue is that only the
avg_scan_cost of scan failures is recorded and it does not decay. This
requires deeper surgery that would justify a patch on its own although
Peter notes that
https://lkml.kernel.org/r/20180530143105.977759...@infradead.org is
potentially useful for an alternative avg_idle metric.

The second patch dropped converted the idle core scan throttling
mechanism to SIS_PROP. While this would unify the throttling of core
and CPU scanning, it was not free of regressions and has_idle_cores is
a fairly effective throttling mechanism with the caveat that it can
have a lot of false positives for workloads like hackbench.

Peter's series tried to solve three problems at once; this subset
addresses one problem. As with anything select_idle_sibling, it's a mix
of wins and losses but won more than it lost across a range of
workloads and machines.

 kernel/sched/core.c     |  18 +++--
 kernel/sched/fair.c     | 161
 kernel/sched/features.h |   1 -
 kernel/sched/sched.h    |   2 +
 4 files changed, 95 insertions(+), 87 deletions(-)

-- 
2.26.2
Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
On Mon, Jan 18, 2021 at 08:55:03PM +0800, Li, Aubrey wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 12e08da90024..6c0f841e9e75 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> >     return new_cpu;
> >  }
> >
> > +static inline int __select_idle_cpu(struct task_struct *p, int core, struct cpumask *cpus)
>
> Sorry if I missed anything, why p and cpus are needed here?
>

They are not needed. The original code was matching the calling pattern
for select_idle_core() which needs p and cpus to check if sibling CPUs
are allowed.

> > @@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >
> >     cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >
> > -   if (sched_feat(SIS_PROP)) {
> > +   if (sched_feat(SIS_PROP) && !smt) {
>
> Is it possible the system does have an idle core, but I still don't
> want to scan the entire llc domain?
>

This version is matching historical behaviour. To limit the scan for
cores, select_idle_core() would need to obey SIS_PROP and that patch
was dropped as it introduced regressions. It would only be considered
once SIS_PROP had better metrics for limiting the depth of the search.

> >     u64 avg_cost, avg_idle, span_avg;
> >
> >     /*
> > @@ -6159,16 +6171,29 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >     for_each_cpu_wrap(cpu, cpus, target) {
> >         if (!--nr)
> >             return -1;
>
> It looks like nr only makes sense when smt = false now, can it be
> moved into the else branch below?
>

It can. I expect the saving to be marginal and it will need to move
back when/if select_idle_core() obeys SIS_PROP.

> > -       if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > -           break;
> > +       if (smt) {
> > +           i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > +           if ((unsigned int)i < nr_cpumask_bits)
> > +               return i;
>
> What if the last idle core is selected here, should we set_idle_cores
> false before return?
>

We'd have to check what bits were still set in the cpus mask and
determine if they represent an idle core. I severely doubt it would be
worth the cost given that the availability of idle cores can change at
any instant.

--
Mel Gorman
SUSE Labs
Re: [RFC PATCH] Remove redundant sched_numa_balancing check.
On Mon, Jan 18, 2021 at 09:32:18PM +1100, Imran Khan wrote:
> task_numa_fault is invoked from do_numa_page/do_huge_pmd_numa_page,
> for task_numa_work induced memory faults. task_numa_work is scheduled
> from task_tick_numa which is invoked only if sched_numa_balancing
> is true.
>
> So task_numa_fault will not get invoked if sched_numa_balancing is
> false and hence we can avoid checking it again in task_numa_fault.
>
> Signed-off-by: Imran Khan

If NUMA balancing is disabled at runtime, there may still be PTEs that
are marked for NUMA balancing. While these still get handled at fault,
there is no point tracking the fault information in task_numa_fault and
this function can still get called after sched_numa_balancing is
disabled.

--
Mel Gorman
SUSE Labs
Re: [PATCH] mm, compaction: move high_pfn to the for loop scope.
On Mon, Jan 18, 2021 at 03:41:26PM +0800, Rokudo Yan wrote:
> In fast_isolate_freepages, high_pfn will be used if a preferred one
> (PFN >= low_pfn) is not found. But high_pfn is not reset before
> searching a free area, so when it is used as a freepage, it may come
> from another free area searched before. And
> move_freelist_head(freelist, freepage) will have unexpected behavior
> (eg. corrupt the MOVABLE freelist).
>
> Unable to handle kernel paging request at virtual address dead0200
> Mem abort info:
>   ESR = 0x9644
>   Exception class = DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
> Data abort info:
>   ISV = 0, ISS = 0x0044
>   CM = 0, WnR = 1
> [dead0200] address between user and kernel address ranges
>
> -000|list_cut_before(inline)
> -000|move_freelist_head(inline)
> -000|fast_isolate_freepages(inline)
> -000|isolate_freepages(inline)
> -000|compaction_alloc(?, ?)
> -001|unmap_and_move(inline)
> -001|migrate_pages([NSD:0xFF80088CBBD0] from = 0xFF80088CBD88, [NSD:0xFF80088CBBC8] get_new_p
> -002|__read_once_size(inline)
> -002|static_key_count(inline)
> -002|static_key_false(inline)
> -002|trace_mm_compaction_migratepages(inline)
> -002|compact_zone(?, [NSD:0xFF80088CBCB0] capc = 0x0)
> -003|kcompactd_do_work(inline)
> -003|kcompactd([X19] p = 0xFF93227FBC40)
> -004|kthread([X20] _create = 0xFFE1AFB26380)
> -005|ret_from_fork(asm)
> ---|end of frame
>
> The issue was reported on a smartphone product with 6GB ram and 3GB
> zram as swap device.
>
> This patch fixes the issue by resetting high_pfn before searching each
> free area, which ensures freepage and freelist match when calling
> move_freelist_head in fast_isolate_freepages().
>
> Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target")
>
> Signed-off-by: Rokudo Yan

Other than the Fixes line, I do not think this changed so my previous
Ack still applies:

Acked-by: Mel Gorman

--
Mel Gorman
SUSE Labs
Re: [PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores
On Mon, Jan 18, 2021 at 04:14:36PM +0800, Li, Aubrey wrote:
> > @@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
> >
> >  #else /* CONFIG_SCHED_SMT */
> >
> > +#define sched_smt_weight 1
> > +
> >  static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
> >  {
> >     return -1;
> >
> >
> > @@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >     avg_cost = this_sd->avg_scan_cost + 1;
> >
> >     span_avg = sd->span_weight * avg_idle;
> > -   if (span_avg > 4*avg_cost)
> > +   if (span_avg > sis_min_cores*avg_cost)
> >         nr = div_u64(span_avg, avg_cost);
> >     else
> > -       nr = 4;
> > +       nr = sis_min_cores;
> > +
> > +   nr *= sched_smt_weight;
>
> Is it better to put this into an inline wrapper to hide
> sched_smt_weight if !CONFIG_SCHED_SMT?
>

There already is a #define sched_smt_weight for !CONFIG_SCHED_SMT and I
do not think an inline wrapper would make it more readable or
maintainable.

--
Mel Gorman
SUSE Labs
[PATCH v2 0/5] Scan for an idle sibling in a single pass
Changelog since v1
o Move extern declaration to header for coding style
o Remove unnecessary parameter from __select_idle_cpu

This series of 5 patches reposts three patches from Peter entitled
"select_idle_sibling() wreckage". It only scans the runqueues in a
single pass when searching for an idle sibling.

Two patches from Peter were dropped. The first patch altered how scan
depth was calculated. Scan depth selection is a random number generator
with two major limitations. The avg_idle time is based on the time
between a CPU going idle and being woken up, clamped approximately by
2*sysctl_sched_migration_cost. This is difficult to compare in a
sensible fashion to avg_scan_cost. The second issue is that only the
avg_scan_cost of scan failures is recorded and it does not decay. This
requires deeper surgery that would justify a patch on its own although
Peter notes that
https://lkml.kernel.org/r/20180530143105.977759...@infradead.org is
potentially useful for an alternative avg_idle metric.

The second patch dropped converted the idle core scan throttling
mechanism to SIS_PROP. While this would unify the throttling of core
and CPU scanning, it was not free of regressions and has_idle_cores is
a fairly effective throttling mechanism with the caveat that it can
have a lot of false positives for workloads like hackbench.

Peter's series tried to solve three problems at once; this subset
addresses one problem. As with anything select_idle_sibling, it's a mix
of wins and losses but won more than it lost across a range of
workloads and machines.

 kernel/sched/core.c     |  18 +++--
 kernel/sched/fair.c     | 157
 kernel/sched/features.h |   1 -
 kernel/sched/sched.h    |   2 +
 4 files changed, 93 insertions(+), 85 deletions(-)

-- 
2.26.2
[PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
From: Peter Zijlstra (Intel)

Both select_idle_core() and select_idle_cpu() do a loop over the same
cpumask. Observe that by clearing the already visited CPUs, we can fold
the iteration and iterate a core at a time.

All we need to do is remember any non-idle CPU we encountered while
scanning for an idle core. This way we'll only iterate every CPU once.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 97 +++--
 1 file changed, 59 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12e08da90024..6c0f841e9e75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
        return new_cpu;
 }
 
+static inline int __select_idle_cpu(struct task_struct *p, int core, struct cpumask *cpus)
+{
+       if (available_idle_cpu(core) || sched_idle_cpu(core))
+               return core;
+
+       return -1;
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6066,40 +6074,34 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
-       struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-       int core, cpu;
+       bool idle = true;
+       int cpu;
 
        if (!static_branch_likely(&sched_smt_present))
-               return -1;
-
-       if (!test_idle_cores(target, false))
-               return -1;
-
-       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+               return __select_idle_cpu(p, core, cpus);
 
-       for_each_cpu_wrap(core, cpus, target) {
-               bool idle = true;
-
-               for_each_cpu(cpu, cpu_smt_mask(core)) {
-                       if (!available_idle_cpu(cpu)) {
-                               idle = false;
-                               break;
+       for_each_cpu(cpu, cpu_smt_mask(core)) {
+               if (!available_idle_cpu(cpu)) {
+                       idle = false;
+                       if (*idle_cpu == -1) {
+                               if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
+                                       *idle_cpu = cpu;
+                                       break;
+                               }
+                               continue;
                        }
+                       break;
                }
-
-               if (idle)
-                       return core;
-
-               cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+               if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+                       *idle_cpu = cpu;
        }
 
-       /*
-        * Failed to find an idle core; stop looking for one.
-        */
-       set_idle_cores(target, 0);
+       if (idle)
+               return core;
 
+       cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
        return -1;
 }
 
@@ -6107,9 +6109,18 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 #define sched_smt_weight 1
 
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static inline void set_idle_cores(int cpu, int val)
 {
-       return -1;
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+       return def;
+}
+
+static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
+{
+       return __select_idle_cpu(p, core, cpus);
 }
 
 #endif /* CONFIG_SCHED_SMT */
 
@@ -6124,10 +6135,11 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+       int i, cpu, idle_cpu = -1, nr = INT_MAX;
+       bool smt = test_idle_cores(target, false);
+       int this = smp_processor_id();
        struct sched_domain *this_sd;
        u64 time;
-       int this = smp_processor_id();
-       int cpu, nr = INT_MAX;
 
        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
        if (!this_sd)
@@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-       if (sched_feat(SIS_PROP)) {
+       if (sched_feat(SIS_PROP) && !smt) {
                u64 avg_cost, avg_idle, span_avg;
 
                /*
@@ -6159,16 +6171,29 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
On Thu, Jan 14, 2021 at 04:44:46PM +0100, Vincent Guittot wrote:
> > domain. There is no need to make it specific to the core account and we
> > are already doing the full scan. Throttling that would be a separate patch.
> >
> > > This patch 5 should focus on merging select_idle_core and
> > > select_idle_cpu so we keep (almost) the same behavior but each CPU is
> > > checked only once.
> > >
> >
> > Which I think it's already doing. Main glitch really is that
> > __select_idle_cpu() shouldn't be taking *idle_cpu as it does not consume
> > the information.
>
> don't really like the if (smt) else in the for_each_cpu_wrap(cpu,
> cpus, target) loop because it just looks like we fail to merge idle
> core and idle cpu search loop at the end.
>

While it's not the best, I did at one point have a series that fully
unified this function and it wasn't pretty.

> But there is probably not much we can do without changing what is
> accounted idle core search in the avg_scan_cost
>

Indeed. Maybe in the future it'll make more sense to consolidate it
further but between the depth search and possibly using SIS_PROP for
core searches, we've bigger fish to fry.
Current delta between this series and what is being tested is simply

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bfa73de6a8d..ada8faac2e4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7446,7 +7446,6 @@ int sched_cpu_activate(unsigned int cpu)
 #ifdef CONFIG_SCHED_SMT
        do {
                int weight = cpumask_weight(cpu_smt_mask(cpu));
-               extern int sched_smt_weight;
 
                if (weight > sched_smt_weight)
                        sched_smt_weight = weight;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84f02abb29e3..6c0f841e9e75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,7 +6006,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
        return new_cpu;
 }
 
-static inline int __select_idle_cpu(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
+static inline int __select_idle_cpu(struct task_struct *p, int core, struct cpumask *cpus)
 {
        if (available_idle_cpu(core) || sched_idle_cpu(core))
                return core;
@@ -6080,7 +6080,7 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
        int cpu;
 
        if (!static_branch_likely(&sched_smt_present))
-               return __select_idle_cpu(p, core, cpus, idle_cpu);
+               return __select_idle_cpu(p, core, cpus);
 
        for_each_cpu(cpu, cpu_smt_mask(core)) {
                if (!available_idle_cpu(cpu)) {
@@ -6120,7 +6120,7 @@ static inline bool test_idle_cores(int cpu, bool def)
 
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
-       return __select_idle_cpu(p, core, cpus, idle_cpu);
+       return __select_idle_cpu(p, core, cpus);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6177,7 +6177,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
                        return i;
 
                } else {
-                       i = __select_idle_cpu(p, cpu, cpus, &idle_cpu);
+                       i = __select_idle_cpu(p, cpu, cpus);
                        if ((unsigned int)i < nr_cpumask_bits) {
                                idle_cpu = i;
                                break;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12ada79d40f3..29aabe98dd1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1107,6 +1107,8 @@ static inline void update_idle_core(struct rq *rq)
                __update_idle_core(rq);
 }
 
+extern int sched_smt_weight;
+
 #else
 static inline void update_idle_core(struct rq *rq) { }
 #endif

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
On Thu, Jan 14, 2021 at 02:25:32PM +0100, Vincent Guittot wrote:
> On Thu, 14 Jan 2021 at 10:35, Mel Gorman wrote:
> >
> > On Wed, Jan 13, 2021 at 06:03:00PM +0100, Vincent Guittot wrote:
> > > > @@ -6159,16 +6171,29 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > >         for_each_cpu_wrap(cpu, cpus, target) {
> > > >                 if (!--nr)
> > > >                         return -1;
> > > > -               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > > > -                       break;
> > > > +               if (smt) {
> > >
> > > If we want to stay on something similar to the previous behavior, we
> > > want to check on all cores if test_idle_cores is true so nr should be
> > > set to number of cores
> > >
> >
> > I don't think we necessarily want to do that. has_idle_cores is an
> > effective throttling mechanism but it's not perfect. If the full domain
> > is always scanned for a core then there can be excessive scanning in
>
> But that's what the code is currently doing. Can this change be done
> in another patch so we can check the impact of each change more
> easily?

Ok, when looking at this again instead of just the mail, the flow is;

        int i, cpu, idle_cpu = -1, nr = INT_MAX;
        ...
        if (sched_feat(SIS_PROP) && !smt) {
                /* recalculate nr */
        }

The !smt check should mean that core scanning is still scanning the
entire domain. There is no need to make it specific to the core account
and we are already doing the full scan. Throttling that would be a
separate patch.

> This patch 5 should focus on merging select_idle_core and
> select_idle_cpu so we keep (almost) the same behavior but each CPU is
> checked only once.
>

Which I think it's already doing. Main glitch really is that
__select_idle_cpu() shouldn't be taking *idle_cpu as it does not consume
the information.

> > workloads like hackbench which tends to have has_idle_cores return false
> > positives. It becomes important once average busy CPUs is over half of
> > the domain for SMT2.
> >
> > At least with the patch if that change was made, we still would not scan
> > twice going over the same runqueues so it would still be an improvement
>
> yeah, it's for me the main goal of this patchset with the calculation
> of avg_scan_cost being done only when SIS_PROP is true and the remove
> of SIS_AVG
>
> any changes in the number of cpu/core to loop on is sensitive to
> regression and should be done in a separate patch IMHO
>

Understood.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
On Wed, Jan 13, 2021 at 06:03:00PM +0100, Vincent Guittot wrote:
> > @@ -6159,16 +6171,29 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >         for_each_cpu_wrap(cpu, cpus, target) {
> >                 if (!--nr)
> >                         return -1;
> > -               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > -                       break;
> > +               if (smt) {
>
> If we want to stay on something similar to the previous behavior, we
> want to check on all cores if test_idle_cores is true so nr should be
> set to number of cores
>

I don't think we necessarily want to do that. has_idle_cores is an
effective throttling mechanism but it's not perfect. If the full domain
is always scanned for a core then there can be excessive scanning in
workloads like hackbench which tends to have has_idle_cores return false
positives. It becomes important once the average number of busy CPUs is
over half of the domain for SMT2.

At least with the patch if that change was made, we still would not scan
twice going over the same runqueues so it would still be an improvement
but it would be nice to avoid excessive deep scanning when there are a
lot of busy CPUs but individual tasks are rapidly idling. However, in
the event regressions are found, changing to your suggested behaviour
would be an obvious starting point.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores
On Wed, Jan 13, 2021 at 05:49:58PM +0100, Vincent Guittot wrote:
> > @@ -7444,11 +7444,20 @@ int sched_cpu_activate(unsigned int cpu)
> >         balance_push_set(cpu, false);
> >
> >  #ifdef CONFIG_SCHED_SMT
> > -       /*
> > -        * When going up, increment the number of cores with SMT present.
> > -        */
> > -       if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> > -               static_branch_inc_cpuslocked(&sched_smt_present);
> > +       do {
> > +               int weight = cpumask_weight(cpu_smt_mask(cpu));
> > +               extern int sched_smt_weight;
>
> coding style problem
>

Presumably you are referring to an extern defined in a C file. That can
move to kernel/sched/sched.h in this patch.

> >  /*
> >   * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> >   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> > @@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >                 avg_cost = this_sd->avg_scan_cost + 1;
> >
> >                 span_avg = sd->span_weight * avg_idle;
> > -               if (span_avg > 4*avg_cost)
> > +               if (span_avg > sis_min_cores*avg_cost)
> >                         nr = div_u64(span_avg, avg_cost);
> >                 else
> > -                       nr = 4;
> > +                       nr = sis_min_cores;
> > +
> > +               nr *= sched_smt_weight;
>
> Also, patch 5 will look at all CPUs of a core in select_idle_core so
> nr will decrement by 1 per core so i don't see the need to multiply by
> sched_smt_weight once patch 5 is applied
>

It makes sense in the context of this patch but can be removed again in
the last patch and then I think sched_smt_weight only exists in core.c

-- 
Mel Gorman
SUSE Labs
Re: [PATCH] mm, compaction: move high_pfn to the for loop scope.
On Tue, Jan 12, 2021 at 05:47:20PM +0800, Rokudo Yan wrote: > In fast_isolate_freepages, high_pfn will be used if a prefered one(PFN >= > low_fn) not found. But the high_pfn > is not reset before searching an free area, so when it was used as freepage, > it may from another free area searched before. > And move_freelist_head(freelist, freepage) will have unexpected behavior(eg. > corrupt the MOVABLE freelist) > > Unable to handle kernel paging request at virtual address dead0200 > Mem abort info: > ESR = 0x9644 > Exception class = DABT (current EL), IL = 32 bits > SET = 0, FnV = 0 > EA = 0, S1PTW = 0 > Data abort info: > ISV = 0, ISS = 0x0044 > CM = 0, WnR = 1 > [dead0200] address between user and kernel address ranges > > -000|list_cut_before(inline) > -000|move_freelist_head(inline) > -000|fast_isolate_freepages(inline) > -000|isolate_freepages(inline) > -000|compaction_alloc(?, ?) > -001|unmap_and_move(inline) > -001|migrate_pages([NSD:0xFF80088CBBD0] from = 0xFF80088CBD88, > [NSD:0xFF80088CBBC8] get_new_p > -002|__read_once_size(inline) > -002|static_key_count(inline) > -002|static_key_false(inline) > -002|trace_mm_compaction_migratepages(inline) > -002|compact_zone(?, [NSD:0xFF80088CBCB0] capc = 0x0) > -003|kcompactd_do_work(inline) > -003|kcompactd([X19] p = 0xFF93227FBC40) > -004|kthread([X20] _create = 0xFFE1AFB26380) > -005|ret_from_fork(asm) > ---|end of frame > > The issue was reported on an smart phone product with 6GB ram and 3GB zram as > swap device. > > This patch fixes the issue by reset high_pfn before searching each free area, > which ensure > freepage and freelist match when call move_freelist_head in > fast_isolate_freepages(). > > Link: > http://lkml.kernel.org/r/20190118175136.31341-12-mgor...@techsingularity.net > Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a > migration target") Thanks. Acked-by: Mel Gorman -- Mel Gorman SUSE Labs
Re: [PATCH 3/3 v2] sched/fair: reduce cases for active balance
On Thu, Jan 07, 2021 at 11:33:25AM +0100, Vincent Guittot wrote:
> Active balance is triggered for a number of voluntary cases like misfit
> or pinned tasks cases but also after that a number of load balance
> attempts failed to migrate a task. There is no need to use active load
> balance when the group is overloaded because an overloaded state means
> that there is at least one waiting task. Nevertheless, the waiting task
> is not selected and detached until the threshold becomes higher than its
> load. This threshold increases with the number of failed lb (see the
> condition if ((load >> env->sd->nr_balance_failed) > env->imbalance) in
> detach_tasks()) and the waiting task will end up to be selected after a
> number of attempts.
>
> Signed-off-by: Vincent Guittot

I didn't see major problems, so:

Acked-by: Mel Gorman

Like you, I do not see significant performance differences, either
positive or negative (minor gains and losses that are borderline).
However, I didn't track the stats necessary to see what impact it had on
alb_* stats, which is an oversight, but schedstat tracking interferes
with results on its own. It would have been nice to see how often this
case is even hit in the overloaded case.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH] mm, compaction: make sure we isolate a valid freepage when high_pfn is used
On Tue, Jan 12, 2021 at 01:19:55PM +0800, Rokudo Yan wrote: > In fast_isolate_freepages, high_pfn will be used if a prefered one(PFN >= > low_fn) not found. But the high_pfn > is not reset before searching an free area, so when it was used as freepage, > it may from another free area searched before. > And move_freelist_head(freelist, freepage) will have unexpected behavior(eg. > corrupt the MOVABLE freelist) > > Unable to handle kernel paging request at virtual address dead0200 > Mem abort info: > ESR = 0x9644 > Exception class = DABT (current EL), IL = 32 bits > SET = 0, FnV = 0 > EA = 0, S1PTW = 0 > Data abort info: > ISV = 0, ISS = 0x0044 > CM = 0, WnR = 1 > [dead0200] address between user and kernel address ranges > > -000|list_cut_before(inline) > -000|move_freelist_head(inline) > -000|fast_isolate_freepages(inline) > -000|isolate_freepages(inline) > -000|compaction_alloc(?, ?) > -001|unmap_and_move(inline) > -001|migrate_pages([NSD:0xFF80088CBBD0] from = 0xFF80088CBD88, > [NSD:0xFF80088CBBC8] get_new_p > -002|__read_once_size(inline) > -002|static_key_count(inline) > -002|static_key_false(inline) > -002|trace_mm_compaction_migratepages(inline) > -002|compact_zone(?, [NSD:0xFF80088CBCB0] capc = 0x0) > -003|kcompactd_do_work(inline) > -003|kcompactd([X19] p = 0xFF93227FBC40) > -004|kthread([X20] _create = 0xFFE1AFB26380) > -005|ret_from_fork(asm) > ---|end of frame > > The issue was reported on an smart phone product with 6GB ram and 3GB zram as > swap device. > > This patch fixes the issue by reset high_pfn before searching each free area, > which ensure > freepage and freelist match when call move_freelist_head in > fast_isolate_freepages(). > > Link: > http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poul...@arm.com > Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a > migration target") In that case, please move the high_pfn declaration and initialiser within the for (order = cc->order - 1;...) 
loop to make it explicit what the scope of high_pfn is. -- Mel Gorman SUSE Labs
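The bug class being fixed here — a "best candidate so far" variable that survives from one search into the next because it is declared in too wide a scope — can be illustrated outside the kernel. This is a minimal sketch with hypothetical helper names, not the compaction code itself: the buggy form carries a stale fallback index into a search over a different area, while the fixed form scopes the fallback to a single search, as the review asks for `high_pfn`.

```c
#include <assert.h>

/* Buggy shape: the fallback persists across searches of different areas. */
static int find_buggy(const int *area, int len, int want, int min_ok,
		      int *fallback)
{
	/* *fallback is NOT reset here - a stale index from a previously
	 * searched area can leak into this search. */
	for (int i = 0; i < len; i++) {
		if (area[i] >= want)
			return i;		/* preferred hit */
		if (*fallback == -1 && area[i] >= min_ok)
			*fallback = i;		/* acceptable fallback */
	}
	return *fallback;	/* may index an element of a *previous* area */
}

/* Fixed shape: the fallback is scoped to one search, like moving the
 * high_pfn declaration into the per-order loop. */
static int find_fixed(const int *area, int len, int want, int min_ok)
{
	int fallback = -1;	/* reset for every area searched */

	for (int i = 0; i < len; i++) {
		if (area[i] >= want)
			return i;
		if (fallback == -1 && area[i] >= min_ok)
			fallback = i;
	}
	return fallback;
}
```

With `a1 = {5, 0}` and `a2 = {0, 0}` (thresholds `want = 10`, `min_ok = 3`), the buggy version returns index 0 for `a2` even though no element of `a2` passes `min_ok` — the index came from the earlier `a1` search, which is exactly how a freepage from one free area ended up paired with another area's freelist.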
Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
On Mon, Jan 11, 2021 at 03:36:57PM +0100, Vincent Guittot wrote:
> > > I think
> > > that we should decay it periodically to reflect there is less and less
> > > idle time (in fact no more) on this busy CPU that never goes to idle.
> > > If a cpu was idle for a long period but then a long running task
> > > starts, the avg_idle will stay stalled to the large value which is
> > > becoming less and less relevant.
> >
> > While I get what you're saying, it does not help extrapolate what the
> > idleness of a domain is.
>
> not but it gives a more up to date view of the idleness of the local
> cpu which is better than a stalled value
>

Fair enough.

> > > At the opposite, a cpu with a short running/idle period task will have
> > > a lower avg_idle whereas it is more often idle.
> > >
> > > Another thing that worries me, is that we use the avg_idle of the
> > > local cpu, which is obviously not idle otherwise it would have been
> > > selected, to decide how much time we should spend on looking for
> > > another idle CPU. I'm not sure that's the right metrics to use
> > > especially with a possibly stalled value.
> > >
> >
> > A better estimate requires heavy writes to sd_llc. The cost of that will
> > likely offset any benefit gained by a superior selection of a scan
> > depth.
> >
> > Treating a successful scan cost and a failed scan cost as being equal has
> > too many corner cases. If we do not want to weight the successful scan
> > cost, then the compromise is to keep the old behaviour that accounts for
>
> I think that keeping the current way to scan_cost is the best option
> for now
>

I sent a series that drops this patch for the moment as well as the
SIS_PROP for selecting a core.

-- 
Mel Gorman
SUSE Labs
[PATCH 4/5] sched/fair: Remove select_idle_smt()
From: Peter Zijlstra (Intel)

In order to make the next patch more readable, and to quantify the
actual effectiveness of this pass, start by removing it.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 30 --
 1 file changed, 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0811e2fe4f19..12e08da90024 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6103,27 +6103,6 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
        return -1;
 }
 
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-       int cpu;
-
-       if (!static_branch_likely(&sched_smt_present))
-               return -1;
-
-       for_each_cpu(cpu, cpu_smt_mask(target)) {
-               if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
-                   !cpumask_test_cpu(cpu, sched_domain_span(sd)))
-                       continue;
-               if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-                       return cpu;
-       }
-
-       return -1;
-}
-
 #else /* CONFIG_SCHED_SMT */
 
 #define sched_smt_weight 1
@@ -6133,11 +6112,6 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s
        return -1;
 }
 
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-       return -1;
-}
-
 #endif /* CONFIG_SCHED_SMT */
 
 #define sis_min_cores 2
@@ -6331,10 +6305,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
        if ((unsigned)i < nr_cpumask_bits)
                return i;
 
-       i = select_idle_smt(p, sd, target);
-       if ((unsigned)i < nr_cpumask_bits)
-               return i;
-
        return target;
 }
-- 
2.26.2
[PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores
From: Peter Zijlstra (Intel)

Instead of calculating how many (logical) CPUs to scan, compute how
many cores to scan.

This changes behaviour for anything !SMT2.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Mel Gorman
---
 kernel/sched/core.c | 19 ++-
 kernel/sched/fair.c | 12 ++--
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15d2562118d1..7bfa73de6a8d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7444,11 +7444,20 @@ int sched_cpu_activate(unsigned int cpu)
        balance_push_set(cpu, false);
 
 #ifdef CONFIG_SCHED_SMT
-       /*
-        * When going up, increment the number of cores with SMT present.
-        */
-       if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
-               static_branch_inc_cpuslocked(&sched_smt_present);
+       do {
+               int weight = cpumask_weight(cpu_smt_mask(cpu));
+               extern int sched_smt_weight;
+
+               if (weight > sched_smt_weight)
+                       sched_smt_weight = weight;
+
+               /*
+                * When going up, increment the number of cores with SMT present.
+                */
+               if (weight == 2)
+                       static_branch_inc_cpuslocked(&sched_smt_present);
+
+       } while (0);
 #endif
        set_cpu_active(cpu, true);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8d8e185cf3b..0811e2fe4f19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6010,6 +6010,8 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+int sched_smt_weight __read_mostly = 1;
+
 static inline void set_idle_cores(int cpu, int val)
 {
        struct sched_domain_shared *sds;
@@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 
 #else /* CONFIG_SCHED_SMT */
 
+#define sched_smt_weight 1
+
 static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
        return -1;
@@ -6136,6 +6140,8 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+#define sis_min_cores 2
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
                avg_cost = this_sd->avg_scan_cost + 1;
 
                span_avg = sd->span_weight * avg_idle;
-               if (span_avg > 4*avg_cost)
+               if (span_avg > sis_min_cores*avg_cost)
                        nr = div_u64(span_avg, avg_cost);
                else
-                       nr = 4;
+                       nr = sis_min_cores;
+
+               nr *= sched_smt_weight;
 
                time = cpu_clock(this);
        }
-- 
2.26.2
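The scan-depth calculation in the hunk above can be sketched in userspace. This is a simplified illustration (hypothetical helper name, plain integer division in place of div_u64, no per-domain state): the depth is first computed in cores, floored at `sis_min_cores`, and then scaled by the SMT weight so that the number of logical CPUs visited stays proportional on !SMT2 topologies.

```c
#include <assert.h>
#include <stdint.h>

#define SIS_MIN_CORES 2

/* Compute how many logical CPUs select_idle_cpu() may visit, given the
 * LLC span, the local CPU's average idle time, the average scan cost
 * and the widest SMT mask seen (sched_smt_weight). */
static uint64_t sis_scan_depth(uint64_t span_weight, uint64_t avg_idle,
			       uint64_t avg_scan_cost, uint64_t smt_weight)
{
	uint64_t avg_cost = avg_scan_cost + 1;	/* +1 avoids divide-by-zero */
	uint64_t span_avg = span_weight * avg_idle;
	uint64_t nr;

	if (span_avg > SIS_MIN_CORES * avg_cost)
		nr = span_avg / avg_cost;	/* depth in cores */
	else
		nr = SIS_MIN_CORES;		/* minimum core scan */

	return nr * smt_weight;			/* cores -> logical CPUs */
}
```

For example, with a span of 8 CPUs, avg_idle of 100 units and avg_scan_cost of 99 on an SMT2 machine, the budget is 8 cores and 16 logical CPUs; when avg_idle is small the floor of 2 cores (4 CPUs) applies, which is the behaviour Vincent's review then questions once patch 5 decrements nr per core.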
[PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
Both select_idle_core() and select_idle_cpu() do a loop over the same
cpumask. Observe that by clearing the already visited CPUs, we can fold
the iteration and iterate a core at a time.

All we need to do is remember any non-idle CPU we encountered while
scanning for an idle core. This way we'll only iterate every CPU once.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 97 +++--
 1 file changed, 59 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12e08da90024..84f02abb29e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
        return new_cpu;
 }
 
+static inline int __select_idle_cpu(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
+{
+       if (available_idle_cpu(core) || sched_idle_cpu(core))
+               return core;
+
+       return -1;
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6066,40 +6074,34 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
-       struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-       int core, cpu;
+       bool idle = true;
+       int cpu;
 
        if (!static_branch_likely(&sched_smt_present))
-               return -1;
-
-       if (!test_idle_cores(target, false))
-               return -1;
-
-       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+               return __select_idle_cpu(p, core, cpus, idle_cpu);
 
-       for_each_cpu_wrap(core, cpus, target) {
-               bool idle = true;
-
-               for_each_cpu(cpu, cpu_smt_mask(core)) {
-                       if (!available_idle_cpu(cpu)) {
-                               idle = false;
-                               break;
+       for_each_cpu(cpu, cpu_smt_mask(core)) {
+               if (!available_idle_cpu(cpu)) {
+                       idle = false;
+                       if (*idle_cpu == -1) {
+                               if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
+                                       *idle_cpu = cpu;
+                                       break;
+                               }
+                               continue;
                        }
+                       break;
                }
-
-               if (idle)
-                       return core;
-
-               cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+               if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+                       *idle_cpu = cpu;
        }
 
-       /*
-        * Failed to find an idle core; stop looking for one.
-        */
-       set_idle_cores(target, 0);
+       if (idle)
+               return core;
 
+       cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
        return -1;
 }
 
@@ -6107,9 +6109,18 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 #define sched_smt_weight 1
 
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static inline void set_idle_cores(int cpu, int val)
 {
-       return -1;
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+       return def;
+}
+
+static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
+{
+       return __select_idle_cpu(p, core, cpus, idle_cpu);
 }
 
 #endif /* CONFIG_SCHED_SMT */
 
@@ -6124,10 +6135,11 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+       int i, cpu, idle_cpu = -1, nr = INT_MAX;
+       bool smt = test_idle_cores(target, false);
+       int this = smp_processor_id();
        struct sched_domain *this_sd;
        u64 time;
-       int this = smp_processor_id();
-       int cpu, nr = INT_MAX;
 
        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
        if (!this_sd)
@@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-       if (sched_feat(SIS_PROP)) {
+       if (sched_feat(SIS_PROP) && !smt) {
                u64 avg_cost, avg_idle, span_avg;
 
                /*
@@ -6159,16 +6171,29 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
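The single-pass idea the patch implements — visit each CPU once, prefer a fully idle core, and fall back to any idle CPU remembered along the way — can be illustrated with a toy model. This is a sketch under stated assumptions (a plain `unsigned int` bitmask instead of `struct cpumask`, a fixed SMT2 topology where core c owns CPUs 2c and 2c+1, and a hypothetical helper name), not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8
#define SMT 2	/* assumed SMT2: siblings of core c are {2c, 2c+1} */

/* One walk over the LLC: whole-idle cores win immediately; otherwise the
 * first idle CPU seen is remembered as a fallback. Visited siblings are
 * cleared from *mask so no CPU is ever inspected twice. */
static int scan_llc_once(const bool *cpu_idle, unsigned int *mask)
{
	int idle_cpu = -1;

	for (int core = 0; core < NR_CPUS / SMT; core++) {
		bool whole_core_idle = true;

		if (!(*mask & (1u << (core * SMT))))
			continue;	/* siblings already visited */

		for (int s = 0; s < SMT; s++) {
			int cpu = core * SMT + s;

			if (cpu_idle[cpu]) {
				if (idle_cpu == -1)
					idle_cpu = cpu;	/* remember fallback */
			} else {
				whole_core_idle = false;
			}
			*mask &= ~(1u << cpu);	/* never revisit */
		}

		if (whole_core_idle)
			return core * SMT;	/* idle core beats any fallback */
	}

	return idle_cpu;	/* -1 if nothing idle at all */
}
```

This mirrors why the merged function only needs `*idle_cpu` as extra state: the core scan and the CPU scan share one traversal of the same mask, where the old code walked the mask twice.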
[PATCH 0/5] Scan for an idle sibling in a single pass
This series of 5 patches reposts three patches from Peter entitled
"select_idle_sibling() wreckage". It only scans the runqueues in a
single pass when searching for an idle sibling.

Two patches from Peter were dropped. The first patch altered how scan
depth was calculated. Scan depth selection is a random number generator
with two major limitations. The avg_idle time is based on the time
between a CPU going idle and being woken up clamped approximately by
2*sysctl_sched_migration_cost. This is difficult to compare in a
sensible fashion to avg_scan_cost. The second issue is that only the
avg_scan_cost of scan failures is recorded and it does not decay. This
requires deeper surgery that would justify a patch on its own although
Peter notes that
https://lkml.kernel.org/r/20180530143105.977759...@infradead.org is
potentially useful for an alternative avg_idle metric.

The second dropped patch converted the idle core scan throttling
mechanism to SIS_PROP. While this would unify the throttling of core
and CPU scanning, it was not free of regressions and has_idle_cores is
a fairly effective throttling mechanism with the caveat that it can
have a lot of false positives for workloads like hackbench.

Peter's series tried to solve three problems at once; this subset
addresses one problem. As with anything select_idle_sibling, it's a mix
of wins and losses but won more than it lost across a range of
workloads and machines.

 kernel/sched/core.c     |  19 +++--
 kernel/sched/fair.c     | 157 
 kernel/sched/features.h |   1 -
 3 files changed, 92 insertions(+), 85 deletions(-)

-- 
2.26.2
[PATCH 2/5] sched/fair: Move avg_scan_cost calculations under SIS_PROP
As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a
SIS_PROP check and while we are at it, exclude the cost of initialising
the CPU mask from the average scan cost.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f5682aeda2e..c8d8e185cf3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
 	if (sched_feat(SIS_PROP)) {
 		u64 avg_cost, avg_idle, span_avg;
 
@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			nr = div_u64(span_avg, avg_cost);
 		else
 			nr = 4;
-	}
-
-	time = cpu_clock(this);
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+		time = cpu_clock(this);
+	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			break;
 	}
 
-	time = cpu_clock(this) - time;
-	update_avg(&this_sd->avg_scan_cost, time);
+	if (sched_feat(SIS_PROP)) {
+		time = cpu_clock(this) - time;
+		update_avg(&this_sd->avg_scan_cost, time);
+	}
 
 	return cpu;
 }
-- 
2.26.2
[PATCH 1/5] sched/fair: Remove SIS_AVG_CPU
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It was
a blunt instrument and disabled by commit 4c77b18cf8b7 ("sched/fair:
Make select_idle_cpu() more aggressive") and later replaced with a
proportional search depth by commit 1ad3aaf3fcd2 ("sched/core:
Implement new approach to scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now
been disabled for almost three years. As the intent of SIS_PROP is to
reduce the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU
and focus on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman
Reviewed-by: Vincent Guittot
---
 kernel/sched/fair.c     | 20 +---
 kernel/sched/features.h |  1 -
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..9f5682aeda2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	struct sched_domain *this_sd;
-	u64 avg_cost, avg_idle;
 	u64 time;
 	int this = smp_processor_id();
 	int cpu, nr = INT_MAX;
@@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	avg_idle = this_rq()->avg_idle / 512;
-	avg_cost = this_sd->avg_scan_cost + 1;
+	if (sched_feat(SIS_PROP)) {
+		u64 avg_cost, avg_idle, span_avg;
 
-	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-		return -1;
+		/*
+		 * Due to large variance we need a large fuzz factor;
+		 * hackbench in particularly is sensitive here.
+		 */
+		avg_idle = this_rq()->avg_idle / 512;
+		avg_cost = this_sd->avg_scan_cost + 1;
 
-	if (sched_feat(SIS_PROP)) {
-		u64 span_avg = sd->span_weight * avg_idle;
+		span_avg = sd->span_weight * avg_idle;
 		if (span_avg > 4*avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2
Re: [PATCH 2/3 v2] sched/fair: don't set LBF_ALL_PINNED unnecessarily
On Thu, Jan 07, 2021 at 11:33:24AM +0100, Vincent Guittot wrote: > Setting LBF_ALL_PINNED during active load balance is only valid when there > is only 1 running task on the rq otherwise this ends up increasing the > balance interval whereas other tasks could migrate after the next interval > once they become cache-cold as an example. > > LBF_ALL_PINNED flag is now always set it by default. It is then cleared > when we find one task that can be pulled when calling detach_tasks() or > during active migration. > > Signed-off-by: Vincent Guittot When reviewing this, I found it curious that sched_nr_migrate_break is a const instead of a #define. It's not clear why as it only appears to exist in case sysctl_sched_nr_migrate is set higher than 32 to prevent rq lock starvation. With your patch, s/atleast/at least/ in the comments if there is another version. Otherwise, I didn't spot a real problem so Acked-by: Mel Gorman -- Mel Gorman SUSE Labs
Re: [PATCH 1/3 v2] sched/fair: skip idle cfs_rq
On Thu, Jan 07, 2021 at 11:33:23AM +0100, Vincent Guittot wrote: > Don't waste time checking whether an idle cfs_rq could be the busiest > queue. Furthermore, this can end up selecting a cfs_rq with a high load > but being idle in case of migrate_load. > > Signed-off-by: Vincent Guittot > Reviewed-by: Valentin Schneider Acked-by: Mel Gorman -- Mel Gorman SUSE Labs
Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
On Fri, Jan 08, 2021 at 08:45:44PM +0100, Peter Zijlstra wrote: > On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote: > > Also, there is another problem (that I'm investigating) which is that > > this_rq()->avg_idle is stalled when your cpu is busy. Which means that > > this avg_idle can just be a very old and meaningless value. I think > > that we should decay it periodically to reflect there is less and less > > https://lkml.kernel.org/r/20180530143105.977759...@infradead.org > > :-) This needs to be revived. I'm of the opinion that your initial series needs to be split somewhat into "single scan for core" parts followed by the "Fix depth selection". I have tests in the queue and one of them is just patch 2 on its own. Preliminary results for patch 2 on its own do not look bad but that's based on one test (tbench). It'll be tomorrow before it works through variants of patch 1 which I suspect will be inconclusive and make me more convinced it should be split out separately. The single scan for core would be patches 2-4 of the series this thread is about which is an orthogonal problem to avoiding repeated scans of the same runqueues during a single wakeup. I would prefer to drop patch 5 initially because the has_idle_cores throttling mechanism for idle core searches is reasonably effective and the first round of tests indicated that changing it was inconclusive and should be treated separately. The depth scan stuff would go on top because right now the depth scan is based on magic numbers and no amount of shuffling that around will make it less magic without an avg_idle value that decays and potentially the scan cost also aging. It's much less straight-forward than the single scan aspect. Switching idle core searches to SIS_PROP should also be treated separately. Thoughts? I don't want to go too far in this direction on my own because the testing requirements are severe in terms of time. 
Even then no matter how much I test, stuff will be missed as it's sensitive to domain sizes, CPU generation, cache topology, cache characteristics, domain utilisation, architecture, kitchen sink etc. -- Mel Gorman SUSE Labs
Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
On Fri, Jan 08, 2021 at 09:21:48PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 08, 2021 at 10:27:38AM +0000, Mel Gorman wrote:
> 
> > 1. avg_scan_cost is now based on the average scan cost of a rq but
> >    avg_idle is still scaled to the domain size. This is a bit problematic
> >    because it's comparing scan cost of a single rq with the estimated
> >    average idle time of a domain. As a result, the scan depth can be much
> >    larger than it was before the patch and led to some regressions.
> 
> > @@ -6164,25 +6164,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >  	 */
> >  	avg_idle = this_rq()->avg_idle / 512;
> >  	avg_cost = this_sd->avg_scan_cost + 1;
> > -
> > -	span_avg = sd->span_weight * avg_idle;
> > -	if (span_avg > 4*avg_cost)
> > -		nr = div_u64(span_avg, avg_cost);
> > -	else
> > +	nr = div_u64(avg_idle, avg_cost);
> > +	if (nr < 4)
> >  		nr = 4;
> 
> Oooh, could it be I simply didn't remember how that code was supposed to
> work and should kick my (much) younger self for not writing a comment?
> 
> Consider:
> 
>          span_weight * avg_idle              avg_cost
>     nr = ---------------------- = avg_idle / -----------
>                avg_cost                      span_weight
> 
> Where: avg_cost / span_weight ~= cost-per-rq
> 

This would definitely make sense and I even evaluated it but the nature
of avg_idle and the scale it works at (up to 2*sched_migration_cost)
just ended up generating lunatic values far outside the size of the
domain. Fitting that to the domain size just ended up looking silly too
and avg_cost does not decay. Still, in principle, it's the right
direction, it's just not what the code does right now.

-- 
Mel Gorman
SUSE Labs
Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote: > > > Trying to bias the avg_scan_cost with: loops <<= 2; > > > will just make avg_scan_cost lost any kind of meaning because it > > > doesn't reflect the avg cost of scanning a rq anymore > > > > > > > Before the series, the avg_scan_cost also did not represent the cost of > > scanning a RQ before either. Treating scan failures and successes equally > > I agree that the previous avg_scan_cost was not representing a RQ > because it was the avg cost of scanning the full domain. It was not even that. As the full domain was not necessarily scanned at all, it simply reflected how much time was spent scanning in general. It neither represented an rq scan cost (which is variable due to cache traffic) nor did it represent a full domain scan. > And we were > comparing it with the average idle time (weighted by few factors). > And this cost was impacted by the fact that the scan can return early > because it found a cpu. This has advantage and drawback but at least > stays coherent in what we are comparing > Not really because it represented the cost of a scan of some number of rqs from 1 to sd->span_weight. > Peter's patch wants to move on per rq avg scan cost. And what you're > proposing is to add a magic heuristic to bias the per rq which at the > end makes this value just an opaque metric. > The metric is a heuristic no matter what. The scan cost of a RQ is not fixed as it depends on whether cache data needs to be updated or not. Also bear in mind that the first round of results and the results that I posted showed that Peter's patch has significant corner cases that the patch mitigates. You also note that avg_idle is an unusual metric to compare against because it has numerous timing artifacts. At least one of them is that we are extrapolating the domain idle time from a single rq which is a poor proxy measure when a domain is only partially used. 
There just is not a better one available without heavy writes to sd_llc which causes its own set of problems. > If we really want to keep the impact of early return than IMO we > should stay on a full domain scan level instead of a per rq. > That also has the same class of side-effects. Once the scan cost of a successful scan is strictly accounted for, there are problems. Even tracking the success scans is costly as the CPU clock has to be read and sd_llc has to be updated. > Also, there is another problem (that I'm investigating) which is that > this_rq()->avg_idle is stalled when your cpu is busy. Which means that > this avg_idle can just be a very old and meaningless value. Yes, avg_idle in itself is just the average inter-arrival time between a CPU going idle and receiving a wakeup partially bound roughly by 2*sysctl_sched_migration_cost. If avg_idle is traced for each select_idle_cpu(), it's obvious that it takes time to adjust when a load starts. > I think > that we should decay it periodically to reflect there is less and less > idle time (in fact no more) on this busy CPU that never goes to idle. > If a cpu was idle for a long period but then a long running task > starts, the avg_idle will stay stalled to the large value which is > becoming less and less relevant. While I get what you're saying, it does not help extrapolate what the idleness of a domain is. > At the opposite, a cpu with a short running/idle period task will have > a lower avg_idle whereas it is more often idle. > > Another thing that worries me, is that we use the avg_idle of the > local cpu, which is obviously not idle otherwise it would have been > selected, to decide how much time we should spend on looking for > another idle CPU. I'm not sure that's the right metrics to use > especially with a possibly stalled value. > A better estimate requires heavy writes to sd_llc. The cost of that will likely offset any benefit gained by a superior selection of a scan depth. 
Treating a successful scan cost and a failed scan cost as being equal has too many corner cases. If we do not want to weight the successful scan cost, then the compromise is to keep the old behaviour that accounts for scan failures only (avoids an sd_llc write at least) but base it on the estimated scan cost for a single rq. The fact that we do not account for scan failures should be explicitly commented so we do not forget it again in the future. -- Mel Gorman SUSE Labs
Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
Amean     30        5.6573 (   0.00%)      5.7390 *  -1.44%*
Amean     48        8.9037 (   0.00%)      8.8053 *   1.10%*
Amean     79       14.9190 (   0.00%)     14.4360 *   3.24%*
Amean     110      22.5703 (   0.00%)     21.9210 (   2.88%)
Amean     141      29.2400 (   0.00%)     28.0110 *   4.20%*
Amean     172      36.3720 (   0.00%)     34.7963 (   4.33%)
Amean     203      43.5783 (   0.00%)     42.5537 *   2.35%*
Amean     234      50.3653 (   0.00%)     47.3463 *   5.99%*
Amean     265      57.6153 (   0.00%)     55.6247 *   3.46%*
Amean     296      62.7370 (   0.00%)     62.0720 (   1.06%)

Adjusting the scan cost for successes is neither a universal win nor a
failure but it's closer to historical behaviour and the strict
accounting does hit corner cases. If a deep scan is finding an idle
CPU, it makes some sense to continue scanning deeply by adjusting the
weight instead of prematurely failing.

The testing of the full series previously showed that some loads never
recovered from side-effects of the first patch and the last patch in
the series introduced new problems of its own. Hence, I would like to
limit the negative impact of the first patch and, if necessary, cut the
last patch altogether.

-- 
Mel Gorman
SUSE Labs
Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
On Fri, Jan 08, 2021 at 01:01:10PM +0000, Qais Yousef wrote:
> On 01/08/21 10:27, Mel Gorman wrote:
> >  	for_each_cpu_wrap(cpu, cpus, target) {
> > -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> > +			/* Adjust cost of a successful scan */
> > +			loops <<= 2;
> > +
> >  			break;
> > +		}
> 
> > -		if (loops >= nr) {
> > +		if (++loops >= nr) {
> >  			cpu = -1;
> >  			break;
> >  		}
> > -		loops++;
> 
> Random (out of the blue) comment.
> 
> Now this will increment loops before the comparison/break. ie: we're
> effectively doing one iteration less IIRC. Should loops be initialized to
> 0 instead of 1?
> 

Yep, although in practice it'll make little difference except after a
rapid phase change when avg_idle still appears high on a per-rq basis
yet the domain is fully busy with no idle CPUs.

-- 
Mel Gorman
SUSE Labs
Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
On Tue, Dec 15, 2020 at 08:59:11AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote:
> > On 2020/12/15 0:48, Peter Zijlstra wrote:
> > > We compute the average cost of the total scan, but then use it as a
> > > per-cpu scan cost when computing the scan proportion. Fix this by
> > > properly computing a per-cpu scan cost.
> > >
> > > This also fixes a bug where we would terminate early (!--nr, case) and
> > > not account that cost at all.
> >
> > I'm a bit worried this may introduce a regression under heavy load.
> > The overhead of adding another cpu_clock() and calculation becomes
> > significant when sis_scan is throttled by nr.
> 
> The thing is, the code as it exists today makes no sense what so ever.
> It's plain broken batshit.
> 
> We calculate the total scanning time (irrespective of how many CPUs we
> touched), and then use that to calculate the number of cpus to scan.
> That's just daft.
> 
> After this patch we calculate the avg cost of scanning 1 cpu and use
> that to calculate how many cpus to scan. Which is coherent and sane.
> 
> Maybe it can be improved, but that's a completely different thing.

In general I agree with everything you said so let's talk about the
last sentence.

1. avg_scan_cost is now based on the average scan cost of a rq but
   avg_idle is still scaled to the domain size. This is a bit
   problematic because it's comparing scan cost of a single rq with the
   estimated average idle time of a domain. As a result, the scan depth
   can be much larger than it was before the patch and led to some
   regressions.

2. Accounting for the scan cost of success makes sense but there is a
   big difference between a scan that finds an idle CPU and one that
   fails. For failures, the scan cost is wasted CPU time whereas a
   success means that an uncontested CPU is used. This can cause a
   search to be truncated earlier than it should be when the domain is
   lightly loaded.

The patch below attempts to deal with both of those points.
The remaining difficulty is the "fuzz" factor which is necessary to
bring large avg_idle values into a reasonable range for comparing
against avg_scan_cost. The max avg_idle value can be anything but
generally the ceiling is 2*sysctl_sched_migration_cost. I didn't
quickly spot a good way of mapping avg_idle to a range between 0 and
sd->span_weight. The obvious one was
(weight*avg_idle/(2*sched_migration_cost)) but it did not work very
well as avg_scan_cost accounts for the full cost of accessing a remote
rq. However, this could be revisited later after this series is sorted
out.

Anyway, for the enumerated concerns, how about this on top for your
first patch? It seemed to work well for a few workloads on 3 machines
at least and I'd like to nail this down before considering the
remaining patches. The first run indicated that the first patch offset
some of the benefits of the other patches.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65a2f208ada8..1e04a250e8ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6156,7 +6156,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
 	if (sched_feat(SIS_PROP)) {
-		u64 avg_cost, avg_idle, span_avg;
+		u64 avg_cost, avg_idle;
 
 		/*
 		 * Due to large variance we need a large fuzz factor;
@@ -6164,25 +6164,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		 */
 		avg_idle = this_rq()->avg_idle / 512;
 		avg_cost = this_sd->avg_scan_cost + 1;
-
-		span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
-			nr = div_u64(span_avg, avg_cost);
-		else
+		nr = div_u64(avg_idle, avg_cost);
+		if (nr < 4)
 			nr = 4;
 
 		time = cpu_clock(this);
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+			/* Adjust cost of a successful scan */
+			loops <<= 2;
+
 			break;
+		}
 
-		if (loops >= nr) {
+		if (++loops >= nr) {
 			cpu = -1;
 			break;
 		}
-		loops++;
 	}
 
 	if (sched_feat(SIS_PROP)) {

-- 
Mel Gorman
SUSE Labs
Re: [RFC][PATCH 0/5] select_idle_sibling() wreckage
On Wed, Dec 23, 2020 at 02:23:41PM +0100, Vincent Guittot wrote:
> > Tests are still running on my side but early results shows perf
> > regression for hackbench
> 
> Few more results before being off:
> On small embedded system, the problem seems to be mainly a matter of
> setting the right number of loops.
> 
> On large smt system, the system on which I usually run my tests is
> off for now so I haven't been able to finalize tests yet but the
> problem might be that we don't loop all cores anymore with this
> patchset compared to the current algorithm
> 

Tests ran over the holidays and are available at
http://www.skynet.ie/~mel/postings/peterz-20210104/dashboard.html

I am trawling through the data but by and large the two main
observations I've had so far are

1. The last patch seems the most problematic and the most likely to
   make a large change, particularly to hackbench. For example;
   http://www.skynet.ie/~mel/postings/peterz-20210104/scheduler-unbound/bing2/index.html#hackbench-thread-pipes
   The idle cpu cutoff is reasonably effective even though it triggers
   a lot of false positives, meaning that it may be better to treat
   that in isolation.

2. The cost accounting one had variable impact. Generally it was small
   gains and losses but tbench for low client counts is an exception as
   low thread counts saw variable impact. Some big losses although
   EPYC1 is a counter-example (toto in the dashboard)

The second issue might be responsible for the first issue, not sure.
However, it does not surprise me that properly accounting would have an
impact on the SMT depth search and likely needs tweaking.

-- 
Mel Gorman
SUSE Labs
Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
On Tue, Dec 15, 2020 at 08:59:11AM +0100, Peter Zijlstra wrote: > On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote: > > On 2020/12/15 0:48, Peter Zijlstra wrote: > > > We compute the average cost of the total scan, but then use it as a > > > per-cpu scan cost when computing the scan proportion. Fix this by > > > properly computing a per-cpu scan cost. > > > > > > This also fixes a bug where we would terminate early (!--nr, case) and > > > not account that cost at all. > > > > I'm a bit worried this may introduce a regression under heavy load. > > The overhead of adding another cpu_clock() and calculation becomes > > significant when sis_scan is throttled by nr. > > The thing is, the code as it exists today makes no sense what so ever. Which makes it very hard to reason about or change in a "safe" manner as all sorts of counter-intuitive effects occur. The series is queued and running and takes 1-2 days. I haven't reviewed the patches properly (holiday) but it'll be interesting to get some provisional data at least. -- Mel Gorman SUSE Labs
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
On Mon, Dec 14, 2020 at 10:18:16AM +0100, Vincent Guittot wrote:
> On Fri, 11 Dec 2020 at 18:45, Peter Zijlstra wrote:
> >
> > On Thu, Dec 10, 2020 at 12:58:33PM +0000, Mel Gorman wrote:
> > > The prerequisite patch to make that approach work was rejected though
> > > as on its own, it's not very helpful and Vincent didn't like that the
> > > load_balance_mask was abused to make it effective.
> >
> > So last time I poked at all this, I found that using more masks was a
> > performance issue as well due to added cache misses.
> >
> > Anyway, I finally found some time to look at all this, and while reading
> > over the whole SiS crud to refresh the brain came up with the below...
> >
> > It's still naf, but at least it gets rid of a bunch of duplicate
> > scanning and LoC decreases, so it should be awesome. Ofcourse, as
> > always, benchmarks will totally ruin everything, we'll see, I started
> > some.
> >
> > It goes on top of Mel's first two patches (which is about as far as I
> > got)...
> 
> We have several different things that Aubrey and Mel patches are
> trying to achieve:
> 
> Aubrey wants to avoid scanning busy cpus
> - patch works well on busy system: I see a significant benefit with
>   hackbench on a lot of group on my 2 nodes * 28 cores * 4 smt
>   hackbench -l 2000 -g 128
>   tip       3.334
>   w/ patch  2.978 (+12%)
> 

It's still the case that Aubrey's work does not conflict with what
Peter is doing. All that changes is what mask is applied. Similarly,
the p->recent_used_cpu changes I made initially (but got rejected)
reduce scanning. I intend to revisit that to use recent_used_cpu as a
search target because one side-effect of the patch was the siblings can
be used prematurely.

> - Aubrey also tried to not scan the cpus which are idle for a short
>   duration (when a tick is not stopped) but there are problems because
>   not stopping a tick doesn't mean a short idle.
> In fact, something similar to find_idlest_group_cpu() should be done
> in this case but then it's no more a fast path search
> 

The crowd most likely to complain about this is Facebook as they have
workloads that prefer deep searches to find idle CPUs.

> Mel wants to minimize looping over the cpus
> - patch 4 is an obvious win on light loaded system because it avoids
>   looping twice the cpus mask when some cpus are idle but no core

Peter's patch replaces that entirely which I'm fine with.

> - But patch 3 generates perf regression
>   hackbench -l 2000 -g 1
>   tip                                     12.293
>   /w all Mel's patches                    15.163  -14%
>   /w Aubrey + Mel patches minus patch 3 : 12.788  +3.8%
>   But I think that Aubrey's patch doesn't help here. Tests without
>   Aubrey's patch are running
> 
> - He also tried to use load_balance_mask to do something similar to
>   the below

Peter's approach removes load_balance_mask abuse so I think it's a
better approach overall to scanning LLC domains in a single pass. It
needs refinement and a lot of testing but I think it's promising.
> > -static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
> > +static int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
> >  {
> > -	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > -	int core, cpu;
> > -
> > -	if (!static_branch_likely(&sched_smt_present))
> > -		return -1;
> > -
> > -	if (!test_idle_cores(target, false))
> > -		return -1;
> > -
> > -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > -
> > -	for_each_cpu_wrap(core, cpus, target) {
> > -		bool idle = true;
> > +	bool idle = true;
> > +	int cpu;
> > 
> > -		for_each_cpu(cpu, cpu_smt_mask(core)) {
> > -			if (!available_idle_cpu(cpu)) {
> > -				idle = false;
> > -				break;
> > -			}
> > +	for_each_cpu(cpu, cpu_smt_mask(core)) {
> > +		if (!available_idle_cpu(cpu)) {
> > +			idle = false;
> > +			continue;
> 
> By not breaking on the first not idle cpu of the core, you will
> largely increase the number of loops. On my system, it increases 4
> times from 28 up to 112
> 

But breaking early will mean that SMT siblings are used prematurely.
However, if SIS_PROP was partially enforced for the idle core scan, it
could limit the search for an idle core once an idle candidate was
discovered.

-- 
Mel Gorman
SUSE Labs
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
On Mon, Dec 14, 2020 at 10:31:22AM +0100, Peter Zijlstra wrote: > On Mon, Dec 14, 2020 at 09:11:29AM +0100, Vincent Guittot wrote: > > On Fri, 11 Dec 2020 at 23:50, Mel Gorman > > wrote: > > > > I originally did something like that on purpose too but Vincent called > > > it out so it is worth mentioning now to avoid surprises. That said, I'm > > > at the point where anything SIS-related simply needs excessive exposure > > > because no matter what it does, someone is unhappy with it. > > > > Yeah, I don't like extending the idle core search loop for something > > that is not related to the idle core search. This adds more work out > > of control of the sis_prop. So I'm still against hiding some idle cpu > > search in idle core search. > > The idea, of course, is to do less. The current code is pretty crap in > that it will do a whole bunch of things multiple times. > ^^^ this While there is some overhead when searching for an idle core to track an idle sibling, it's better than double scanning when test_idle_cores() returns a false positive (or it races with a parallel search that takes the last idle core). > Also, a possible follow up, would be something like the below (and > remove all the sds->has_idle_cores crud), which brings core scanning > under SIS_PROP. > I'm less confident about this but admit I have no data. test_idle_core becomes critical for hackbench once it saturates the machine as it'll generally return a true positive. Where test_idle_cores causes problems is when domains are over 50% busy and returns false positives due to races and the work to find an idle core is wasted. The flip side is that updating the has_idle_core information is also expensive in this case as it causes lots of dirty cache line bouncing so maybe in practice it might be ok. It definitely should be a separate patch that is tested on top of your first prototype with the p->cpus_ptr check when picking an idle candidate.
The other side-effect is that with this patch the scan cost is *always* accounted for. While this makes intuitive sense, it was never clear to me why it was only accounted with scan failures. When I had tested something like this, it was a mix of wins and losses. At minimum, a patch that always accounts for scan cost and one that removes the test_idle_core should be separate patches for bisection purposes at the very least. This is the current set of results I have for your prototype plus the fixes I suggested on top http://www.skynet.ie/~mel/postings/peterz-20201214/dashboard.html It's not a universal win but appears to win more than it loses. The biggest loss is dbench on EPYC 2 http://www.skynet.ie/~mel/postings/peterz-20201214/io-dbench4-async-xfs/romulus/index.html#dbench4 It's not clear why it was so badly affected but in general, EPYC can be problematic as it has multiple small LLCs. The same machine for specjvm showed large gains. http://www.skynet.ie/~mel/postings/peterz-20201214/jvm-specjbb2005-multi/romulus/index.html#specjbb There are a lot of results to trawl through but mostly it shows that it's a mix of wins and losses and it's both workload and machine dependent, which is generally true for anything select_idle_sibling related. As the merge window is open, it's inevitable that this will need to be evaluated against 5.11-rc1 when all the current batch of scheduler code has been merged. Do you mind splitting your prototype into three patches and slapping some sort of changelog on them? I'll run them through the grid with p->recent_used_cpu changes on top to use recent_used_cpu as a search hint instead of an idle candidate so that it scans for a core. They'll take a while to run but it's automated and some people are going to be dropping off for holidays relatively soon anyway. I can test on arm too but as it does not have SMT enabled, it'll be less useful. -- Mel Gorman SUSE Labs
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
On Fri, Dec 11, 2020 at 11:19:05PM +0100, Peter Zijlstra wrote: > On Fri, Dec 11, 2020 at 08:43:37PM +0000, Mel Gorman wrote: > > One bug is in __select_idle_core() though. It's scanning the SMT mask, > > not select_idle_mask so it can return an idle candidate that is not in > > p->cpus_ptr. > > D'0h.. luckily the benchmarks don't hit that :-) > Yep, neither do mine for the most part which is why I ran it as-is but eventually someone would complain that sched_setscheduler was being ignored. The trick is whether a failed check should continue searching for an idle core or terminate early and just pick an allowed idle CPU. I tend to favour the latter but it's hard to predict what sort of reasonable workloads would be affected. > > There are some other potential caveats. > > > > This is a single pass so when test_idle_cores() is true, __select_idle_core > > is used to check all the siblings even if the core is not idle. That > > could have been cut short if __select_idle_core checked *idle_cpu == > > -1 and terminated the SMT scan if an idle candidate had already been found. > > So I did that on purpose, so as to track the last/most-recent idle cpu, > with the thinking that that cpu has the higher chance of still being > idle vs one we checked earlier/longer-ago. > > I suppose we benchmark both and see which is liked best. > I originally did something like that on purpose too but Vincent called it out so it is worth mentioning now to avoid surprises. That said, I'm at the point where anything SIS-related simply needs excessive exposure because no matter what it does, someone is unhappy with it. > > Second downside is related. If test_idle_cores() was true but no idle > > CPU is found then __select_idle_core has been called enough to scan > > the entire domain. In this corner case, the new code does *more* work > > because the old code would have failed select_idle_core() quickly and > > then select_idle_cpu() would be throttled by SIS_PROP.
> > I think this will
> > only be noticeable in the heavily overloaded case but if the corner case
> > hits enough then the new code will be slower than the old code for the
> > over-saturated case (i.e. hackbench with lots of groups).
>
> Right, due to scanning siblings, even if the first inspected thread is
> not idle, we scan more.
>

Yep.

> > The third potential downside is that the SMT sibling is not guaranteed to
> > be checked due to SIS_PROP throttling but in the old code, that would have
> > been checked by select_idle_smt(). That might result in premature stacking
> > of runnable tasks on the same CPU. Similarly, as __select_idle_core may
> > find multiple idle candidates, it will not pick the target's SMT sibling
> > if it is idle like select_idle_smt would have.
> >
> > That said, I am skeptical that select_idle_smt() matters all that often.
>
> This, I didn't really believe in it either.
>

Good because I think any benefit from select_idle_smt is so marginal
that it should be ignored if the full scan is simpler overall.

> The benchmarks I started are mostly noise, with a few wins for
> TCP_STREAM and UDP_RR around the 50% mark. Although I should run
> longer variants to make sure.

So far I have one benchmark from one machine. It's tbench again because
it's a reasonable communicating workload that is trivial to vary the
level of utilisation.
80-cpu CascadeLake, 2 sockets, HT enabled
tbench4
                      5.10.0-rc6             5.10.0-rc6             5.10.0-rc6
                   baseline-v1r1        singlescan-v1r1        singlescan-v1r3
Hmean     1       503.90 (   0.00%)      505.19 *   0.26%*      499.20 *  -0.93%*
Hmean     2       980.80 (   0.00%)      975.15 *  -0.58%*      983.79 *   0.31%*
Hmean     4      1912.37 (   0.00%)     1883.25 *  -1.52%*     1923.76 *   0.60%*
Hmean     8      3741.47 (   0.00%)     3568.66 *  -4.62%*     3690.60 *  -1.36%*
Hmean     16     6552.90 (   0.00%)     6549.97 *  -0.04%*     6478.37 *  -1.14%*
Hmean     32    10217.34 (   0.00%)    10266.66 *   0.48%*    10291.60 *   0.73%*
Hmean     64    13604.93 (   0.00%)    11240.88 * -17.38%*    12045.87 * -11.46%*
Hmean     128   21194.11 (   0.00%)    21316.08 *   0.58%*    21868.39 *   3.18%*
Hmean     256   21163.19 (   0.00%)    20989.14 *  -0.82%*    20831.20 *  -1.57%*
Hmean     320   20906.29 (   0.00%)    21024.11 *   0.56%*    20756.88 *  -0.71%*
Stddev    1         3.14 (   0.00%)       1.17 (  62.93%)        1.05 (  66.61%)
Stddev    2         4.44 (   0.00%)       3.72 (  16.35%)        2.20 (  50.56%)
Stddev    4         9.09 (   0.00%)      18.67 (-105.32%)        3.66 (  59.71%)
Stddev    8        12.87 (   0.00%)      18.96 ( -47.31%)       11.90 (   7.58%)
Stddev    16        8.
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
On Fri, Dec 11, 2020 at 06:44:42PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 10, 2020 at 12:58:33PM +0000, Mel Gorman wrote:
> > The prerequisite patch to make that approach work was rejected though
> > as on its own, it's not very helpful and Vincent didn't like that the
> > load_balance_mask was abused to make it effective.
>
> So last time I poked at all this, I found that using more masks was a
> performance issue as well due to added cache misses.
>
> Anyway, I finally found some time to look at all this, and while reading
> over the whole SiS crud to refresh the brain came up with the below...
>
> It's still naf, but at least it gets rid of a bunch of duplicate
> scanning and LoC decreases, so it should be awesome. Of course, as
> always, benchmarks will totally ruin everything, we'll see, I started
> some.
>
> It goes on top of Mel's first two patches (which is about as far as I
> got)...
>

It's not free of bugs but it's interesting! The positive aspects are
that it does a single scan, inits select_idle_mask once and caches an
idle candidate when searching for an idle core but in a far cleaner
fashion than what I did.

One bug is in __select_idle_core() though. It's scanning the SMT mask,
not select_idle_mask so it can return an idle candidate that is not in
p->cpus_ptr. I queued tests anyway to see what it looks like.

There are some other potential caveats.

This is a single pass so when test_idle_cores() is true, __select_idle_core
is used to check all the siblings even if the core is not idle. That
could have been cut short if __select_idle_core checked *idle_cpu == -1
and terminated the SMT scan if an idle candidate had already been found.

Second downside is related. If test_idle_cpus() was true but no idle
CPU is found then __select_idle_core has been called enough to scan
the entire domain.
In this corner case, the new code does *more* work because the old code
would have failed select_idle_core() quickly and then select_idle_cpu()
would be throttled by SIS_PROP. I think this will only be noticeable in
the heavily overloaded case but if the corner case hits often enough
then the new code will be slower than the old code for the
over-saturated case (i.e. hackbench with lots of groups).

The third potential downside is that the SMT sibling is not guaranteed
to be checked due to SIS_PROP throttling but in the old code, that would
have been checked by select_idle_smt(). That might result in premature
stacking of runnable tasks on the same CPU. Similarly, as
__select_idle_core may find multiple idle candidates, it will not pick
the target's SMT sibling if it is idle like select_idle_smt would have.

That said, I am skeptical that select_idle_smt() matters all that
often. If select_idle_cpu() has been throttled heavily then the machine
is either over-saturated or was over-saturated in the very recent
past. In that situation, the chances of the SMT siblings being free
are slim.
Each of the downsides could potentially be addressed with this untested
mess

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 637f610c7059..260592d143d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6061,7 +6061,7 @@ void __update_idle_core(struct rq *rq)
 	rcu_read_unlock();
 }
 
-static int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
+static int __select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	bool idle = true;
 	int cpu;
@@ -6070,9 +6070,11 @@ static int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
 		schedstat_inc(this_rq()->sis_scanned);
 		if (!available_idle_cpu(cpu)) {
 			idle = false;
-			continue;
+			if (*idle_cpu == -1)
+				continue;
+			break;
 		}
-		if (idle_cpu)
+		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
 			*idle_cpu = cpu;
 	}
 
@@ -6094,7 +6096,7 @@ static inline bool test_idle_cores(int cpu, bool def)
 	return def;
 }
 
-static inline int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
+static inline int __select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	return -1;
 }
@@ -6142,7 +6144,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (smt) {
-			i = __select_idle_core(cpu, cpus, &idle_cpu);
+			i = __select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned)i < nr_cpumask_bits)
 				return i;

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
On Fri, Dec 11, 2020 at 10:51:17AM +0100, Vincent Guittot wrote: > On Thu, 10 Dec 2020 at 12:04, Mel Gorman wrote: > > > > On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote: > > > > while testing your patchset and Aubrey one on top of tip, I'm facing > > > > some perf regression on my arm64 numa system on hackbench and reaim. > > > > The regression seems to comes from your patchset but i don't know > > > > which patch in particular yet > > > > > > > > hackbench -l 256000 -g 1 > > > > > > > > v5.10-rc7 + tip/sched/core 13,255(+/- 3.22%) > > > > with your patchset 15.368(+/- 2.74) -15.9% > > > > > > > > I'm also seeing perf regression on reaim but this one needs more > > > > investigation before confirming > > > > > > > > TBH, I was not expecting regressions. I'm running more test to find > > > > which patch is the culprit > > > > > > The regression comes from patch 3: sched/fair: Do not replace > > > recent_used_cpu with the new target > > > > > > > That's not entirely surprising. The intent of the patch is to increase the > > hit rate of p->recent_used_cpu but it's not a guaranteed win due to two > > corner cases. If multiple tasks have the same p->recent_used_cpu, they can > > race to use that CPU and stack as a result instead of searching the domain. > > If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has > > a busy sibling which the search would have avoided in select_idle_core(). > > > > I think you are using processes and sockets for hackbench but as you'll > > see later, hackbench can be used both to show losses and gains. > > I run more hackbench tests with pipe and socket and both show > regression with patch 3 whereas this is significant improvement with > other patches and Aubrey's one > Is SMT enabled on your test machine? If not, then patch 4 should make no difference but if SMT is enabled, I wonder how this untested version of patch 3 behaves for you. 
The main difference is that the recent used cpu is used as a search
target so that it would still check if it's an idle core and if not,
fall through so it's used as an idle CPU after checking it's allowed
by p->cpus_ptr.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c41875aec23..63980bcf6e70 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6275,21 +6275,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		return prev;
 	}
 
-	/* Check a recently used CPU as a potential idle candidate: */
+	/* Check a recently used CPU as a search target: */
 	recent_used_cpu = p->recent_used_cpu;
+	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
-	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
-	    asym_fits_capacity(task_util, recent_used_cpu)) {
-		/*
-		 * Replace recent_used_cpu with prev as it is a potential
-		 * candidate for the next wake:
-		 */
-		p->recent_used_cpu = prev;
-		return recent_used_cpu;
-	}
+	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
+		target = recent_used_cpu;
 
 	/*
 	 * For asymmetric CPU capacity systems, our domain of interest is
@@ -6768,9 +6761,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
-		if (want_affine)
-			current->recent_used_cpu = cpu;
 	}
 	rcu_read_unlock();

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target
On Fri, Dec 11, 2020 at 05:34:43PM +0800, Hillf Danton wrote:
> On Fri, 11 Dec 2020 09:02:28 +0000 Mel Gorman wrote:
> > On Fri, Dec 11, 2020 at 02:25:42PM +0800, Hillf Danton wrote:
> > > On Tue, 8 Dec 2020 15:35:00 +0000 Mel Gorman wrote:
> > > > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > > >
> > > > 	/* Check a recently used CPU as a potential idle candidate: */
> > > > 	recent_used_cpu = p->recent_used_cpu;
> > > > +	p->recent_used_cpu = prev;
> > > > 	if (recent_used_cpu != prev &&
> > > > 	    recent_used_cpu != target &&
> > > > 	    cpus_share_cache(recent_used_cpu, target) &&
> > > > 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> > > > 	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> > >
> > > Typo? Fix it in spin if so.
> > >
> >
> > What typo?
>
> After your change it is prev that we check against p->cpus_ptr instead of
> the recent CPU. Wonder the point to do such a check for returning the
> recent one.

Ah... yes, this is indeed wrong. It wouldn't affect Vincent's case that
showed a problem with a hackbench configuration (which I'm still
disappointed about as it's a trade-off depending on machine and
workload) but it allows a task to run on the wrong cpu if
sched_setscheduler() was called between wakeup events.

-- 
Mel Gorman
SUSE Labs
[tip: sched/core] sched/fair: Clear SMT siblings after determining the core is not idle
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     13d5a5e9f9b8515da3c04305ae1bb03ab91be7a7
Gitweb:        https://git.kernel.org/tip/13d5a5e9f9b8515da3c04305ae1bb03ab91be7a7
Author:        Mel Gorman
AuthorDate:    Mon, 30 Nov 2020 14:40:20
Committer:     Ingo Molnar
CommitterDate: Fri, 11 Dec 2020 10:30:38 +01:00

sched/fair: Clear SMT siblings after determining the core is not idle

The clearing of SMT siblings from the SIS mask before checking for an
idle core is a small but unnecessary cost. Defer the clearing of the
siblings until the scan moves to the next potential target. The cost of
this was not measured as it is borderline noise but it should be
self-evident.

Signed-off-by: Mel Gorman
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Reviewed-by: Vincent Guittot
Link: https://lkml.kernel.org/r/20201130144020.gs3...@techsingularity.net
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f5dceda..efac224 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6086,10 +6086,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 				break;
 			}
 		}
-		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 
 		if (idle)
 			return core;
+
+		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 	}
 
 	/*
Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target
On Fri, Dec 11, 2020 at 02:25:42PM +0800, Hillf Danton wrote:
> On Tue, 8 Dec 2020 15:35:00 +0000 Mel Gorman wrote:
> > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >
> > 	/* Check a recently used CPU as a potential idle candidate: */
> > 	recent_used_cpu = p->recent_used_cpu;
> > +	p->recent_used_cpu = prev;
> > 	if (recent_used_cpu != prev &&
> > 	    recent_used_cpu != target &&
> > 	    cpus_share_cache(recent_used_cpu, target) &&
> > 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> > 	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
>
> Typo? Fix it in spin if so.
>

What typo?

> > 	    asym_fits_capacity(task_util, recent_used_cpu)) {
> > -		/*
> > -		 * Replace recent_used_cpu with prev as it is a potential
> > -		 * candidate for the next wake:
> > -		 */
> > -		p->recent_used_cpu = prev;
> > 		return recent_used_cpu;
>
> I prefer to update the recent CPU after llc check.
>

That would prevent recent_used_cpu leaving the LLC the task first
started on.

-- 
Mel Gorman
SUSE Labs
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
ed.
>

That's somewhat more marginal as a benefit but fair enough. I took a
closer look at the scan rates for 320 clients which loads the machine
fairly heavily. mpstat reported 0% idle time but the domain scanned with
or without the patch is almost identical which hints that scans were not
avoided. As I write this, hackbench has not executed yet to see if it
makes a difference but it's in the queue.

> > I agree that it should never be worse/bigger but if the idle cpu mask is
> > often the same bits as the sched_domain_span then it's not going to be
> > any better either.
> >
>
> At least on my side, v5 is easier to see the benefit, as rapidly entering
> and exiting idle will make idle governor to retain the tick, v5 uses stop_tick
> signal from idle governor to decide if this CPU is added to idle cpumask.
> https://lore.kernel.org/lkml/20201118043113.53128-1-aubrey...@linux.intel.com/
>
> But as Vincent and Peter concerned, v5 may break the latency sensitive
> workload. Also the short idle definition need more work.
>

Indeed, this is why I was advocating an approach that would use the idle
CPU mask as a hint to quickly find idle CPUs but still fall through to
scan the rest of the domain if necessary. That would cover the "latency
sensitive" workload situation because a deeper scan still happens if
necessary. While I'm not 100% certain, they were probably thinking about
schbench which favours deep searches for idle CPUs and is latency
sensitive.

The prerequisite patch to make that approach work was rejected though as
on its own, it's not very helpful and Vincent didn't like that the
load_balance_mask was abused to make it effective.

-- 
Mel Gorman
SUSE Labs
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
-0 [001] d..1 137.409035: update_idle_cpumask: > set_idle-1, cpumask: 0-3 > What's the size of the LLC domain on this machine? If it's 4 then this is indicating that there is little difference between scanning the full domain and targetting idle CPUs via the idle cpumask. > stage 3: uperf running, select_idle_cpu scan all the CPUs in the scheduler > domain at the beginning > === >uperf-560 [000] d..3 138.418494: select_task_rq_fair: > scanning: 0-3 >uperf-560 [000] d..3 138.418506: select_task_rq_fair: > scanning: 0-3 >uperf-560 [000] d..3 138.418514: select_task_rq_fair: > scanning: 0-3 >uperf-560 [000] dN.3 138.418534: select_task_rq_fair: > scanning: 0-3 >uperf-560 [000] dN.3 138.418543: select_task_rq_fair: > scanning: 0-3 >uperf-560 [000] dN.3 138.418551: select_task_rq_fair: > scanning: 0-3 >uperf-561 [003] d..3 138.418577: select_task_rq_fair: > scanning: 0-3 >uperf-561 [003] d..3 138.418600: select_task_rq_fair: > scanning: 0-3 >uperf-561 [003] d..3 138.418617: select_task_rq_fair: > scanning: 0-3 >uperf-561 [003] d..3 138.418640: select_task_rq_fair: > scanning: 0-3 >uperf-561 [003] d..3 138.418652: select_task_rq_fair: > scanning: 0-3 >uperf-561 [003] d..3 138.418662: select_task_rq_fair: > scanning: 0-3 >uperf-561 [003] d..3 138.418672: select_task_rq_fair: > scanning: 0-3 >uperf-560 [000] d..5 138.418676: select_task_rq_fair: > scanning: 0-3 >uperf-561 [003] d..3 138.418693: select_task_rq_fair: > scanning: 0-3 > kworker/u8:3-110 [002] d..4 138.418746: select_task_rq_fair: > scanning: 0-3 > > stage 4: scheduler tick comes, update idle cpumask to EMPTY > >uperf-572 [002] d.h. 139.420568: update_idle_cpumask: > set_idle-0, cpumask: 1,3 >uperf-574 [000] d.H2 139.420568: update_idle_cpumask: > set_idle-0, cpumask: 1,3 >uperf-565 [003] d.H6 139.420569: update_idle_cpumask: > set_idle-0, cpumask: 1 > tmux: server-528 [001] d.h2 139.420572: update_idle_cpumask: > set_idle-0, cpumask: > It's the timing of the clear that may be problematic. 
For the configurations I use, the tick is every 4 milliseconds which is a very long time in the scheduler for communicating workloads. A simple round-trip of perf pipe where tasks will be entering/exiting rapidly is just 0.004 milliseconds on my desktop. > > > > I suspect the issue is that the mask is only marked busy from the tick > > context which is a very wide window. If select_idle_cpu() picks an idle > > CPU from the mask, it's still marked as idle in the mask. > > > That should be fine because we still check available_idle_cpu() and > sched_idle_cpu for the selected > CPU. And even if that CPU is marked as idle in the mask, that mask should not > be worse(bigger) than > the default sched_domain_span(sd). > I agree that it should never be worse/bigger but if the idle cpu mask is often the same bits as the sched_domain_span then it's not going to be any better either. -- Mel Gorman SUSE Labs
Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
ket CascadeLake

Amean     1       0.3663 (   0.00%)      0.3657 (   0.18%)
Amean     4       0.7510 (   0.00%)      0.7793 (  -3.77%)
Amean     7       1.2650 (   0.00%)      1.2583 (   0.53%)
Amean     12      1.9510 (   0.00%)      1.9450 (   0.31%)
Amean     21      2.9677 (   0.00%)      3.0277 (  -2.02%)
Amean     30      4.2993 (   0.00%)      4.0237 *   6.41%*
Amean     48      6.5373 (   0.00%)      6.2987 *   3.65%*
Amean     79     10.5513 (   0.00%)     10.3280 (   2.12%)
Amean     110    15.8567 (   0.00%)     13.9817 (  11.82%)
Amean     141    17.4243 (   0.00%)     17.3177 (   0.61%)
Amean     172    21.0473 (   0.00%)     20.9760 (   0.34%)
Amean     203    25.1070 (   0.00%)     25.1150 (  -0.03%)
Amean     234    28.6753 (   0.00%)     28.9383 (  -0.92%)
Amean     265    32.7970 (   0.00%)     32.9663 (  -0.52%)
Amean     296    36.6510 (   0.00%)     36.6753 (  -0.07%)

Neutral for 1 group, small regression for 4 groups, few gains around the
middle, neutral when over-saturated. select_idle_sibling is a curse
because it's very rare that a change to it is a universal win. On
balance, I think it's better to avoid searching the domain at all where
possible even if there are cases where searching can have a benefit such
as finding an idle core instead of picking an idle CPU with a busy
sibling via p->recent_used_cpu.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
On Thu, Dec 10, 2020 at 01:18:05PM +0800, Li, Aubrey wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac7b34e7372b..5c41875aec23 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > 	if (!this_sd)
> > 		return -1;
> >
> > +	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +
> > 	if (sched_feat(SIS_PROP)) {
> > 		u64 avg_cost, avg_idle, span_avg;
> >
> > @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > 			nr = div_u64(span_avg, avg_cost);
> > 		else
> > 			nr = 4;
> > -	}
> > -
> > -	time = cpu_clock(this);
> >
> > -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +		time = cpu_clock(this);
> > +	}
> >
> > 	for_each_cpu_wrap(cpu, cpus, target) {
> > 		if (!--nr)
> > 			return -1;
>
> I thought about this again and here seems not to be consistent:
> - even if nr reduces to 0, shouldn't avg_scan_cost be updated as well
>   before return -1?

You're right, but it's outside the scope of this patch. I noted that
this was a problem in
lore.kernel.org/r/20201203141124.7391-8-mgor...@techsingularity.net

It's neither a consistent win or loss to always account for it and so
was dropped for this series to keep the number of controversial patches
to a minimum.

> - if avg_scan_cost is not updated because nr is throttled, the first
>   time = cpu_clock(this);
>   can be optimized. As nr is calculated and we already know which of the
>   weight of cpumask and nr is greater.
>

That is also outside the scope of this patch. To do that,
cpumask_weight() would have to be calculated but it's likely to be a net
loss. Even under light load, nr will be smaller than the domain weight
incurring both the cost of cpumask_weight and the clock read in the
common case.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
On Tue, Dec 08, 2020 at 03:34:57PM +0000, Mel Gorman wrote:
> Changelog since v1
> o Drop single-pass patch				(vincent)
> o Scope variables used for SIS_AVG_CPU			(dietmar)
> o Remove redundant assignment				(dietmar)
>
> This reduces the amount of runqueue scanning in select_idle_sibling in
> the worst case.
>
> Patch 1 removes SIS_AVG_CPU because it's unused.
>
> Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
>
> Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
> of scanning. It should be relatively uncontroversial
>
> Patch 4 returns an idle candidate if one is found while scanning for a
> free core.
>

Any other objections to the series? Vincent marked 1, 3 and 4 as
reviewed. While patch 2 had some mild cosmetic concerns, I think the
version and how it treats SIS_PROP is fine as it is to keep it
functionally equivalent to !SIS_PROP and without adding too many
SIS_PROP checks.

-- 
Mel Gorman
SUSE Labs
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
On Wed, Dec 09, 2020 at 02:24:04PM +0800, Aubrey Li wrote:
> Add idle cpumask to track idle cpus in sched domain. Every time
> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> target. And if the CPU is not in idle, the CPU is cleared in idle
> cpumask during scheduler tick to ratelimit idle cpumask update.
>
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has lower cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
>
> Benchmarks including hackbench, schbench, uperf, sysbench mysql
> and kbuild were tested on a x86 4 socket system with 24 cores per
> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> found.
>

I ran this patch with tbench on top of the schedstat patches that track
SIS efficiency. The tracking adds overhead so it's not a perfect
performance comparison but the expectation would be that the patch
reduces the number of runqueues that are scanned

tbench4
                      5.10.0-rc6             5.10.0-rc6
                  schedstat-v1r1          idlemask-v7r1
Hmean     1       504.76 (   0.00%)      500.14 *  -0.91%*
Hmean     2      1001.22 (   0.00%)      970.37 *  -3.08%*
Hmean     4      1930.56 (   0.00%)     1880.96 *  -2.57%*
Hmean     8      3688.05 (   0.00%)     3537.72 *  -4.08%*
Hmean     16     6352.71 (   0.00%)     6439.53 *   1.37%*
Hmean     32    10066.37 (   0.00%)    10124.65 *   0.58%*
Hmean     64    12846.32 (   0.00%)    11627.27 *  -9.49%*
Hmean     128   22278.41 (   0.00%)    22304.33 *   0.12%*
Hmean     256   21455.52 (   0.00%)    20900.13 *  -2.59%*
Hmean     320   21802.38 (   0.00%)    21928.81 *   0.58%*

Not very optimistic result.
The schedstats indicate;

                                  5.10.0-rc6      5.10.0-rc6
                              schedstat-v1r1   idlemask-v7r1
Ops TTWU Count                 5599714302.00   5589495123.00
Ops TTWU Local                 2687713250.00   2563662550.00
Ops SIS Search                 5596677950.00   5586381168.00
Ops SIS Domain Search          3268344934.00   3229088045.00
Ops SIS Scanned               15909069113.00  16568899405.00
Ops SIS Domain Scanned        13580736097.00  14211606282.00
Ops SIS Failures               2944874939.00   2843113421.00
Ops SIS Core Search             262853975.00    311781774.00
Ops SIS Core Hit                185189656.00    216097102.00
Ops SIS Core Miss                77664319.00     95684672.00
Ops SIS Recent Used Hit         124265515.00    146021086.00
Ops SIS Recent Used Miss        338142547.00    403547579.00
Ops SIS Recent Attempts         462408062.00    549568665.00
Ops SIS Search Efficiency              35.18           33.72
Ops SIS Domain Search Eff              24.07           22.72
Ops SIS Fast Success Rate              41.60           42.20
Ops SIS Success Rate                   47.38           49.11
Ops SIS Recent Success Rate            26.87           26.57

The field I would expect to decrease is SIS Domain Scanned -- the number
of runqueues that were examined but it's actually worse and graphing
over time shows it's worse for the client thread counts.
select_idle_cpu() is definitely being called because "Domain Search" is
10 times higher than "Core Search" and there "Core Miss" is non-zero.

I suspect the issue is that the mask is only marked busy from the tick
context which is a very wide window. If select_idle_cpu() picks an idle
CPU from the mask, it's still marked as idle in the mask.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
On Wed, Dec 09, 2020 at 07:07:11PM +0800, Li, Aubrey wrote:
> On 2020/12/9 17:05, Mel Gorman wrote:
> > On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
> >>>> 			nr = div_u64(span_avg, avg_cost);
> >>>> 		else
> >>>> 			nr = 4;
> >>>> -	}
> >>>> -
> >>>> -	time = cpu_clock(this);
> >>>>
> >>>> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>>> +	time = cpu_clock(this);
> >>>> +	}
> >>>>
> >>>> 	for_each_cpu_wrap(cpu, cpus, target) {
> >>>> 		if (!--nr)
> >>
> >> nr is the key of this throttling mechanism, need to be placed under
> >> sched_feat(SIS_PROP) as well.
> >>
> >
> > It isn't necessary as nr is initialised to INT_MAX if !SIS_PROP.
>
> If !SIS_PROP, nr need to -1 then tested in the loop, instead of testing
> directly. But with SIS_PROP, need adding a test in the loop.
> Since SIS_PROP is default true, I think it's okay to keep the current way.
>

It's because it's default true and the cost is negligible that I'm
leaving it alone. The branch cost and nr accounting cost is negligible
and it avoids peppering select_idle_cpu() with too many SIS_PROP checks.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
> >> 			nr = div_u64(span_avg, avg_cost);
> >> 		else
> >> 			nr = 4;
> >> -	}
> >> -
> >> -	time = cpu_clock(this);
> >>
> >> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >> +	time = cpu_clock(this);
> >> +	}
> >>
> >> 	for_each_cpu_wrap(cpu, cpus, target) {
> >> 		if (!--nr)
>
> nr is the key of this throttling mechanism, need to be placed under
> sched_feat(SIS_PROP) as well.
>

It isn't necessary as nr is initialised to INT_MAX if !SIS_PROP.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
On Tue, Dec 08, 2020 at 05:03:21PM +0100, Vincent Guittot wrote:
> On Tue, 8 Dec 2020 at 16:35, Mel Gorman wrote:
> >
> > As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> > even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> > check and while we are at it, exclude the cost of initialising the CPU
> > mask from the average scan cost.
> >
> > Signed-off-by: Mel Gorman
> > ---
> >  kernel/sched/fair.c | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac7b34e7372b..5c41875aec23 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > 	if (!this_sd)
> > 		return -1;
>
> Just noticed while reviewing the patch that the above related to
> this_sd can also go under sched_feat(SIS_PROP)
>

Technically yes but I also decided against it. It's a functional
difference depending on whether SIS_PROP is set in the highly unlikely
case that this_sd == NULL.

I was also thinking in terms of what happens if SIS_PROP was disabled
and enabled while a search is in progress even if it's very unlikely. In
that case, this_sd would be uninitialised. That might be impossible in
practice depending on how static branching is implemented but I don't
think we should rely on the internals of jump labels and play it safe.

I can move it in if you feel strongly about it but I think the
disable/enable race is enough of a concern to leave it alone.

-- 
Mel Gorman
SUSE Labs
[PATCH 4/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core
select_idle_core is called when SMT is active and there is likely a free
core available. It may find idle CPUs but this information is simply
discarded and the scan starts over again with select_idle_cpu.

This patch caches information on idle CPUs found during the search for
a core and uses one if no core is found. This is a tradeoff. There may
be a slight impact when utilisation is low and an idle core can be found
quickly. It provides improvements as the number of busy CPUs approaches
50% of the domain size when SMT is enabled.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                      5.10.0-rc6             5.10.0-rc6
                       schedstat          idlecandidate
Hmean     1       500.06 (   0.00%)      505.67 *   1.12%*
Hmean     2       975.90 (   0.00%)      974.06 *  -0.19%*
Hmean     4      1902.95 (   0.00%)     1904.43 *   0.08%*
Hmean     8      3761.73 (   0.00%)     3721.02 *  -1.08%*
Hmean     16     6713.93 (   0.00%)     6769.17 *   0.82%*
Hmean     32    10435.31 (   0.00%)    10312.58 *  -1.18%*
Hmean     64    12325.51 (   0.00%)    13792.01 *  11.90%*
Hmean     128   21225.21 (   0.00%)    20963.44 *  -1.23%*
Hmean     256   20532.83 (   0.00%)    20335.62 *  -0.96%*
Hmean     320   20334.81 (   0.00%)    20147.25 *  -0.92%*

Note that there is a significant corner case. As the SMT scan may be
terminated early, not all CPUs have been visited and select_idle_cpu()
is still called for a full scan. This case is handled in the next patch.
Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 413d895bbbf8..ed6f45832d97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
+	int idle_candidate = -1;
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int core, cpu;
 
@@ -6085,6 +6086,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 			idle = false;
 			break;
 		}
+
+		if (idle_candidate == -1 &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr)) {
+			idle_candidate = cpu;
+		}
 	}
 
 	if (idle)
@@ -6098,7 +6104,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 	 */
 	set_idle_cores(target, 0);
 
-	return -1;
+	return idle_candidate;
 }
 
 /*
-- 
2.26.2
[PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target
After select_idle_sibling, p->recent_used_cpu is set to the new target. However, on the next wakeup, prev will be the same as recent_used_cpu unless the load balancer has moved the task since the last wakeup. It still works, but is less efficient than it could be given all the changes that have gone in since then to reduce unnecessary migrations, load balancer changes etc. This patch preserves recent_used_cpu for longer. With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                          5.10.0-rc6             5.10.0-rc6
                         baseline-v2           altrecent-v2
Hmean     1        508.39 (   0.00%)      502.05 *  -1.25%*
Hmean     2        986.70 (   0.00%)      983.65 *  -0.31%*
Hmean     4       1914.55 (   0.00%)     1920.24 *   0.30%*
Hmean     8       3702.37 (   0.00%)     3663.96 *  -1.04%*
Hmean     16      6573.11 (   0.00%)     6545.58 *  -0.42%*
Hmean     32     10142.57 (   0.00%)    10253.73 *   1.10%*
Hmean     64     14348.40 (   0.00%)    12506.31 * -12.84%*
Hmean     128    21842.59 (   0.00%)    21967.13 *   0.57%*
Hmean     256    20813.75 (   0.00%)    21534.52 *   3.46%*
Hmean     320    20684.33 (   0.00%)    21070.14 *   1.87%*

The difference was marginal except for 64 threads, where the baseline result was very unstable whereas the patch was much more stable. This is somewhat machine-specific as on a separate 80-cpu Broadwell machine the same test reported:
                          5.10.0-rc6             5.10.0-rc6
                         baseline-v2           altrecent-v2
Hmean     1        310.36 (   0.00%)      291.81 *  -5.98%*
Hmean     2        340.86 (   0.00%)      547.22 *  60.54%*
Hmean     4        912.29 (   0.00%)     1063.21 *  16.54%*
Hmean     8       2116.40 (   0.00%)     2103.60 *  -0.60%*
Hmean     16      4232.90 (   0.00%)     4362.92 *   3.07%*
Hmean     32      8442.03 (   0.00%)     8642.10 *   2.37%*
Hmean     64     11733.91 (   0.00%)    11473.66 *  -2.22%*
Hmean     128    17727.24 (   0.00%)    16784.23 *  -5.32%*
Hmean     256    16089.23 (   0.00%)    16110.79 *   0.13%*
Hmean     320    15992.60 (   0.00%)    16071.64 *   0.49%*

schedstats were not used in this series but from an earlier debugging effort, the schedstats after the test run were as follows;

Ops SIS Search               5653107942.00     5726545742.00
Ops SIS Domain Search        3365067916.00     3319768543.00
Ops SIS Scanned            112173512543.00    99194352541.00
Ops SIS Domain Scanned     109885472517.00    96787575342.00
Ops SIS Failures             2923185114.00     2950166441.00
Ops SIS Recent Used Hit           56547.00      118064916.00
Ops SIS Recent Used Miss     1590899250.00      354942791.00
Ops SIS Recent Attempts      1590955797.00      473007707.00
Ops SIS Search Efficiency            5.04              5.77
Ops SIS Domain Search Eff            3.06              3.43
Ops SIS Fast Success Rate           40.47             42.03
Ops SIS Success Rate                48.29             48.48
Ops SIS Recent Success Rate          0.00             24.96

First interesting point is the ridiculous number of times runqueues are scanned -- almost 97 billion times over the course of 40 minutes. With the patch, "Recent Used Hit" is over 2000 times more likely to succeed. The failure rate also increases by quite a lot but the cost is marginal even if the "Fast Success Rate" only increases by 2% overall. What cannot be observed from these stats is where the biggest impact is, as these stats cover everything from low utilisation to over-saturation. If graphed over time, the graphs show that the sched domain is only scanned at negligible rates until the machine is fully busy. With low utilisation, the "Fast Success Rate" is almost 100% until the machine is fully busy. For 320 clients, the success rate is close to 0% which is unsurprising.
Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c41875aec23..413d895bbbf8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
+	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
 	    asym_fits_capacity(task_util, recent_used_cpu)) {
-		/*
-		 * Replace recent_used_cpu with prev as it is a potential
-		 * candidate for the next wake:
-		 */
-		p->recent
[PATCH 1/4] sched/fair: Remove SIS_AVG_CPU
SIS_AVG_CPU was introduced as a means of avoiding a search when the average search cost indicated that the search would likely fail. It was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive") and later replaced with a proportional search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()"). While there are corner cases where SIS_AVG_CPU is better, it has now been disabled for almost three years. As the intent of SIS_PROP is to reduce the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c     | 20 ++++++++++-----------
 kernel/sched/features.h |  1 -
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98075f9ea9a8..ac7b34e7372b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	struct sched_domain *this_sd;
-	u64 avg_cost, avg_idle;
 	u64 time;
 	int this = smp_processor_id();
 	int cpu, nr = INT_MAX;
@@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	avg_idle = this_rq()->avg_idle / 512;
-	avg_cost = this_sd->avg_scan_cost + 1;
+	if (sched_feat(SIS_PROP)) {
+		u64 avg_cost, avg_idle, span_avg;
 
-	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-		return -1;
+		/*
+		 * Due to large variance we need a large fuzz factor;
+		 * hackbench in particularly is sensitive here.
+		 */
+		avg_idle = this_rq()->avg_idle / 512;
+		avg_cost = this_sd->avg_scan_cost + 1;
 
-	if (sched_feat(SIS_PROP)) {
-		u64 span_avg = sd->span_weight * avg_idle;
+		span_avg = sd->span_weight * avg_idle;
 		if (span_avg > 4*avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2
[PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP check and while we are at it, exclude the cost of initialising the CPU mask from the average scan cost.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac7b34e7372b..5c41875aec23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
 	if (sched_feat(SIS_PROP)) {
 		u64 avg_cost, avg_idle, span_avg;
 
@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			nr = div_u64(span_avg, avg_cost);
 		else
 			nr = 4;
-	}
-
-	time = cpu_clock(this);
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+		time = cpu_clock(this);
+	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			break;
 	}
 
-	time = cpu_clock(this) - time;
-	update_avg(&this_sd->avg_scan_cost, time);
+	if (sched_feat(SIS_PROP)) {
+		time = cpu_clock(this) - time;
+		update_avg(&this_sd->avg_scan_cost, time);
+	}
 
 	return cpu;
 }
-- 
2.26.2
[PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
Changelog since v1
o Drop single-pass patch			(vincent)
o Scope variables used for SIS_AVG_CPU		(dietmar)
o Remove redundant assignment			(dietmar)

This reduces the amount of runqueue scanning in select_idle_sibling in the worst case.

Patch 1 removes SIS_AVG_CPU because it's unused.

Patch 2 moves all SIS_PROP-related calculations under SIS_PROP.

Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount of scanning. It should be relatively uncontroversial.

Patch 4 returns an idle candidate if one is found while scanning for a free core.

Mel Gorman (4):
  sched/fair: Remove SIS_AVG_CPU
  sched/fair: Move avg_scan_cost calculations under SIS_PROP
  sched/fair: Do not replace recent_used_cpu with the new target
  sched/fair: Return an idle cpu if one is found after a failed search
    for an idle core

 kernel/sched/fair.c     | 51 ++++++++++++++++++++++-------------------
 kernel/sched/features.h |  1 -
 2 files changed, 25 insertions(+), 27 deletions(-)

-- 
2.26.2
Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU
On Tue, Dec 08, 2020 at 03:47:40PM +0100, Vincent Guittot wrote: > > I considered it but made the choice to exclude the cost of cpumask_and() > > from the avg_scan_cost instead. It's minor but when doing the original > > At the cost of a less readable code > Slightly less readable, yes. > > prototype, I didn't think it was appropriate to count the cpumask > > clearing as part of the scan cost as it's not directly related. > > hmm... I think it is because the number of loop is directly related to > the allowed cpus > While that is true, the cost of initialising the map is constant and what is most important is tracking the scan cost which is variable. Without SIS_AVG_CPU, the cpumask init can go before SIS_PROP without any penalty so is this version preferable? --8<-- sched/fair: Move avg_scan_cost calculations under SIS_PROP As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP check and while we are at it, exclude the cost of initialising the CPU mask from the average scan cost. 
Signed-off-by: Mel Gorman --- kernel/sched/fair.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ac7b34e7372b..5c41875aec23 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (!this_sd) return -1; + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); + if (sched_feat(SIS_PROP)) { u64 avg_cost, avg_idle, span_avg; @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t nr = div_u64(span_avg, avg_cost); else nr = 4; - } - - time = cpu_clock(this); - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); + time = cpu_clock(this); + } for_each_cpu_wrap(cpu, cpus, target) { if (!--nr) @@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t break; } - time = cpu_clock(this) - time; - update_avg(_sd->avg_scan_cost, time); + if (sched_feat(SIS_PROP)) { + time = cpu_clock(this) - time; + update_avg(_sd->avg_scan_cost, time); + } return cpu; }
Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU
On Tue, Dec 08, 2020 at 02:43:10PM +0100, Vincent Guittot wrote: > On Tue, 8 Dec 2020 at 14:36, Mel Gorman wrote: > > > > On Tue, Dec 08, 2020 at 02:24:32PM +0100, Vincent Guittot wrote: > > > > > Nitpick: > > > > > > > > > > Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could > > > > > go > > > > > completely into the SIS_PROP if condition. > > > > > > > > > > > > > Yeah, I can do that. In the initial prototype, that happened in a > > > > separate patch that split out SIS_PROP into a helper function and I > > > > never merged it back. It's a trivial change. > > > > > > while doing this, should you also put the update of > > > this_sd->avg_scan_cost under the SIS_PROP feature ? > > > > > > > It's outside the scope of the series but why not. This? > > > > --8<-- > > sched/fair: Move avg_scan_cost calculations under SIS_PROP > > > > As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP > > even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP > > check and while we are at it, exclude the cost of initialising the CPU > > mask from the average scan cost. > > > > Signed-off-by: Mel Gorman > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 19ca0265f8aa..0fee53b1aae4 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6176,10 +6176,10 @@ static int select_idle_cpu(struct task_struct *p, > > struct sched_domain *sd, int t > > nr = 4; > > } > > > > - time = cpu_clock(this); > > I would move it in the if (sched_feat(SIS_PROP)) above. > I considered it but made the choice to exclude the cost of cpumask_and() from the avg_scan_cost instead. It's minor but when doing the original prototype, I didn't think it was appropriate to count the cpumask clearing as part of the scan cost as it's not directly related. -- Mel Gorman SUSE Labs
Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU
On Tue, Dec 08, 2020 at 02:24:32PM +0100, Vincent Guittot wrote: > > > Nitpick: > > > > > > Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go > > > completely into the SIS_PROP if condition. > > > > > > > Yeah, I can do that. In the initial prototype, that happened in a > > separate patch that split out SIS_PROP into a helper function and I > > never merged it back. It's a trivial change. > > while doing this, should you also put the update of > this_sd->avg_scan_cost under the SIS_PROP feature ? > It's outside the scope of the series but why not. This? --8<-- sched/fair: Move avg_scan_cost calculations under SIS_PROP As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP check and while we are at it, exclude the cost of initialising the CPU mask from the average scan cost. Signed-off-by: Mel Gorman diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 19ca0265f8aa..0fee53b1aae4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6176,10 +6176,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t nr = 4; } - time = cpu_clock(this); - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); + if (sched_feat(SIS_PROP)) + time = cpu_clock(this); for_each_cpu_wrap(cpu, cpus, target) { if (!--nr) return -1; @@ -6187,8 +6187,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t break; } - time = cpu_clock(this) - time; - update_avg(_sd->avg_scan_cost, time); + if (sched_feat(SIS_PROP)) { + time = cpu_clock(this) - time; + update_avg(_sd->avg_scan_cost, time); + } return cpu; }
Re: [PATCH 2/4] sched/fair: Do not replace recent_used_cpu with the new target
On Tue, Dec 08, 2020 at 10:57:29AM +0100, Dietmar Eggemann wrote:
> On 07/12/2020 10:15, Mel Gorman wrote:
> > After select_idle_sibling, p->recent_used_cpu is set to the
> > new target. However on the next wakeup, prev will be the same as
>
> I'm confused here. Isn't current->recent_used_cpu set to 'cpu =
> smp_processor_id()' after sis()? Looking at v5.10-rc6.

If you are referring to this;

	if (want_affine)
		current->recent_used_cpu = cpu;

then it gets removed by the patch. That replaces recent_used_cpu with the waker's CPU which still works but the hit rate is lower.

> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 23934dbac635..01b38fc17bca 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6274,6 +6274,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >
> >  	/* Check a recently used CPU as a potential idle candidate: */
> >  	recent_used_cpu = p->recent_used_cpu;
> > +	p->recent_used_cpu = prev;
> >  	if (recent_used_cpu != prev &&
> >  	    recent_used_cpu != target &&
> >  	    cpus_share_cache(recent_used_cpu, target) &&
>
> p->recent_used_cpu is already set to prev in this if condition.

That can be removed as redundant, I'll fix it.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU
On Tue, Dec 08, 2020 at 11:07:19AM +0100, Dietmar Eggemann wrote: > On 07/12/2020 10:15, Mel Gorman wrote: > > SIS_AVG_CPU was introduced as a means of avoiding a search when the > > average search cost indicated that the search would likely fail. It > > was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make > > select_idle_cpu() more aggressive") and later replaced with a proportional > > search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to > > scale select_idle_cpu()"). > > > > While there are corner cases where SIS_AVG_CPU is better, it has now been > > disabled for almost three years. As the intent of SIS_PROP is to reduce > > the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus > > on SIS_PROP as a throttling mechanism. > > > > Signed-off-by: Mel Gorman > > --- > > kernel/sched/fair.c | 3 --- > > kernel/sched/features.h | 1 - > > 2 files changed, 4 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 98075f9ea9a8..23934dbac635 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6161,9 +6161,6 @@ static int select_idle_cpu(struct task_struct *p, > > struct sched_domain *sd, int t > > avg_idle = this_rq()->avg_idle / 512; > > avg_cost = this_sd->avg_scan_cost + 1; > > > > - if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost) > > - return -1; > > - > > if (sched_feat(SIS_PROP)) { > > u64 span_avg = sd->span_weight * avg_idle; > > if (span_avg > 4*avg_cost) > > Nitpick: > > Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go > completely into the SIS_PROP if condition. > Yeah, I can do that. In the initial prototype, that happened in a separate patch that split out SIS_PROP into a helper function and I never merged it back. It's a trivial change. Thanks. -- Mel Gorman SUSE Labs
Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup
On Mon, Dec 07, 2020 at 04:50:15PM +0100, Peter Zijlstra wrote: > On Wed, Nov 18, 2020 at 12:31:13PM +0800, Aubrey Li wrote: > > Add idle cpumask to track idle cpus in sched domain. When a CPU > > enters idle, if the idle driver indicates to stop tick, this CPU > > is set in the idle cpumask to be a wakeup target. And if the CPU > > is not in idle, the CPU is cleared in idle cpumask during scheduler > > tick to ratelimit idle cpumask update. > > So this will only have SIS consider CPUs that have the tick stopped? > That might affect the facebook tail-latency workloads quite a bit. > This is the concern I had as well. That's why patch "sched/fair: Avoid revisiting CPUs multiple times during select_idle_sibling" was created. The intent was that a mask could be used as a hint about what runqueues to look at first but continue looking at other runqueues without examining the same runqueue twice. While the patch would not be merged on its own, it may still be relevant in the context of an idle CPU mask depending on how up to date it is. -- Mel Gorman SUSE Labs
Re: [RFC PATCH 0/4] Reduce worst-case scanning of runqueues in select_idle_sibling
On Mon, Dec 07, 2020 at 04:04:41PM +0100, Vincent Guittot wrote: > On Mon, 7 Dec 2020 at 10:15, Mel Gorman wrote: > > > > This is a minimal series to reduce the amount of runqueue scanning in > > select_idle_sibling in the worst case. > > > > Patch 1 removes SIS_AVG_CPU because it's unused. > > > > Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount > > of scanning. It should be relatively uncontroversial > > > > Patch 3-4 scans the runqueues in a single pass for select_idle_core() > > and select_idle_cpu() so runqueues are not scanned twice. It's > > a tradeoff because it benefits deep scans but introduces overhead > > for shallow scans. > > > > Even if patch 3-4 is rejected to allow more time for Aubrey's idle cpu mask > > patch 3 looks fine and doesn't collide with Aubrey's work. But I don't > like patch 4 which manipulates different cpumask including > load_balance_mask out of LB and I prefer to wait for v6 of Aubrey's > patchset which should fix the problem of possibly scanning twice busy > cpus in select_idle_core and select_idle_cpu > Seems fair, we can see where we stand after V6 of Aubrey's work. A lot of the motivation for patch 4 would go away if we managed to avoid calling select_idle_core() unnecessarily. As it stands, we can call it a lot from hackbench even though the chance of getting an idle core are minimal. Assuming I revisit it, I'll update the schedstat debug patches to include the times select_idle_core() starts versus how many times it fails and see can I think of a useful heuristic. I'll wait for more review on patches 1-3 and if I hear nothing, I'll resend just those. Thanks Vincent. -- Mel Gorman SUSE Labs
[PATCH 4/4] sched/fair: Avoid revisiting CPUs multiple times during select_idle_sibling
select_idle_core() potentially searches a number of CPUs for idle candidates before select_idle_cpu() clears the mask and revisits the same CPUs. This patch moves the initialisation of select_idle_mask to the top-level and reuses the same mask across both select_idle_core and select_idle_cpu. select_idle_smt() is left alone as the cost of checking one SMT sibling is marginal relative to calling __clear_cpumask_cpu() for every CPU visited by select_idle_core(). With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                          5.10.0-rc6             5.10.0-rc6
                        altrecent-v2          singlepass-v2
Hmean     1        502.05 (   0.00%)      498.53 *  -0.70%*
Hmean     2        983.65 (   0.00%)      982.33 *  -0.13%*
Hmean     4       1920.24 (   0.00%)     1929.34 *   0.47%*
Hmean     8       3663.96 (   0.00%)     3536.94 *  -3.47%*
Hmean     16      6545.58 (   0.00%)     6560.21 *   0.22%*
Hmean     32     10253.73 (   0.00%)    10374.30 *   1.18%*
Hmean     64     12506.31 (   0.00%)    11692.26 *  -6.51%*
Hmean     128    21967.13 (   0.00%)    21705.80 *  -1.19%*
Hmean     256    21534.52 (   0.00%)    21223.50 *  -1.44%*
Hmean     320    21070.14 (   0.00%)    21023.31 *  -0.22%*

As before, results are somewhat workload and machine-specific with a mix of good and bad. For example, netperf UDP_STREAM on an 80-cpu Broadwell machine shows;

                              5.10.0-rc6             5.10.0-rc6
                            altrecent-v2          singlepass-v2
Hmean     send-64       232.78 (   0.00%)      258.02 *  10.85%*
Hmean     send-128      464.69 (   0.00%)      511.53 *  10.08%*
Hmean     send-256      904.72 (   0.00%)      999.40 *  10.46%*
Hmean     send-1024    3512.08 (   0.00%)     3868.86 *  10.16%*
Hmean     send-2048    6777.61 (   0.00%)     7435.05 *   9.70%*
Hmean     send-3312   10352.45 (   0.00%)    11321.43 *   9.36%*
Hmean     send-4096   12660.58 (   0.00%)    13577.08 *   7.24%*
Hmean     send-8192   19240.36 (   0.00%)    21321.34 *  10.82%*
Hmean     send-16384  29691.27 (   0.00%)    31296.47 *   5.41%*

The real question is -- is it better to scan CPU runqueues just once where possible or scan multiple times? This patch scans once but it must do additional work to track that scanning, so for shallow scans it's a loss and for deep scans it's a win.
The main downside to this patch is that cpumask manipulations are more complex because two cpumasks are involved. The alternative is that select_idle_core() would scan the full domain and always return a CPU which was previously considered. Alternatively, the depth of select_idle_core() could be limited. Similarly, select_idle_core() often scans when no idle core is available as test_idle_cores is very race-prone and maybe there is a better approach to determining if select_idle_core() is likely to succeed or not.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 00c3b526a5bd..3b1736dc5bde 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6064,10 +6064,11 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_core(struct task_struct *p, struct sched_domain *sd,
+			    int target, struct cpumask *cpus_scan)
 {
 	int idle_candidate = -1;
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	struct cpumask *cpus;
 	int core, cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
@@ -6076,12 +6077,21 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 	if (!test_idle_cores(target, false))
 		return -1;
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	/*
+	 * Repurpose load_balance_mask to avoid rescanning cores while
+	 * cpus_scan tracks the cpu if select_idle_cpu() is necessary.
+	 * In this context, it should be impossible to enter LB and
+	 * clobber load_balance_mask.
+	 */
+	cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+	cpumask_copy(cpus, cpus_scan);
 
 	for_each_cpu_wrap(core, cpus, target) {
 		bool idle = true;
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
+			__cpumask_clear_cpu(cpu, cpus_scan);
+
 			if (!available_idle_cpu(cpu)) {
 				idle = false;
 				break;
@@ -6130,7 +6140,8 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 
 #else /* CONFIG_
[PATCH 3/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core
select_idle_core is called when SMT is active and there is likely a free core available. It may find idle CPUs but this information is simply discarded and the scan starts over again with select_idle_cpu. This patch caches information on idle CPUs found during the search for a core and uses one if no core is found. This is a tradeoff. There may be a slight impact when utilisation is low and an idle core can be found quickly. It provides improvements as the number of busy CPUs approaches 50% of the domain size when SMT is enabled. With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                          5.10.0-rc6             5.10.0-rc6
                           schedstat          idlecandidate
Hmean     1        500.06 (   0.00%)      505.67 *   1.12%*
Hmean     2        975.90 (   0.00%)      974.06 *  -0.19%*
Hmean     4       1902.95 (   0.00%)     1904.43 *   0.08%*
Hmean     8       3761.73 (   0.00%)     3721.02 *  -1.08%*
Hmean     16      6713.93 (   0.00%)     6769.17 *   0.82%*
Hmean     32     10435.31 (   0.00%)    10312.58 *  -1.18%*
Hmean     64     12325.51 (   0.00%)    13792.01 *  11.90%*
Hmean     128    21225.21 (   0.00%)    20963.44 *  -1.23%*
Hmean     256    20532.83 (   0.00%)    20335.62 *  -0.96%*
Hmean     320    20334.81 (   0.00%)    20147.25 *  -0.92%*

Note that there is a significant corner case. As the SMT scan may be terminated early, not all CPUs have been visited and select_idle_cpu() is still called for a full scan. This case is handled in the next patch.
Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01b38fc17bca..00c3b526a5bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
+	int idle_candidate = -1;
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int core, cpu;
 
@@ -6085,6 +6086,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 				idle = false;
 				break;
 			}
+
+			if (idle_candidate == -1 &&
+			    cpumask_test_cpu(cpu, p->cpus_ptr)) {
+				idle_candidate = cpu;
+			}
 		}
 
 		if (idle)
@@ -6098,7 +6104,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 	 */
 	set_idle_cores(target, 0);
 
-	return -1;
+	return idle_candidate;
 }
 
 /*
-- 
2.26.2
[RFC PATCH 0/4] Reduce worst-case scanning of runqueues in select_idle_sibling
This is a minimal series to reduce the amount of runqueue scanning in select_idle_sibling in the worst case. Patch 1 removes SIS_AVG_CPU because it's unused. Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount of scanning. It should be relatively uncontroversial Patch 3-4 scans the runqueues in a single pass for select_idle_core() and select_idle_cpu() so runqueues are not scanned twice. It's a tradeoff because it benefits deep scans but introduces overhead for shallow scans. Even if patch 3-4 is rejected to allow more time for Aubrey's idle cpu mask approach to stand on its own, patches 1-2 should be fine. The main decision with patch 4 is whether select_idle_core() should do a full scan when searching for an idle core, whether it should be throttled in some other fashion or whether it should be just left alone. -- 2.26.2
[PATCH 2/4] sched/fair: Do not replace recent_used_cpu with the new target
After select_idle_sibling, p->recent_used_cpu is set to the new target. However, on the next wakeup, prev will be the same as recent_used_cpu unless the load balancer has moved the task since the last wakeup. It still works, but is less efficient than it could be given all the changes that have gone in since then to reduce unnecessary migrations, load balancer changes etc. This patch preserves recent_used_cpu for longer. With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                          5.10.0-rc6             5.10.0-rc6
                         baseline-v2           altrecent-v2
Hmean     1        508.39 (   0.00%)      502.05 *  -1.25%*
Hmean     2        986.70 (   0.00%)      983.65 *  -0.31%*
Hmean     4       1914.55 (   0.00%)     1920.24 *   0.30%*
Hmean     8       3702.37 (   0.00%)     3663.96 *  -1.04%*
Hmean     16      6573.11 (   0.00%)     6545.58 *  -0.42%*
Hmean     32     10142.57 (   0.00%)    10253.73 *   1.10%*
Hmean     64     14348.40 (   0.00%)    12506.31 * -12.84%*
Hmean     128    21842.59 (   0.00%)    21967.13 *   0.57%*
Hmean     256    20813.75 (   0.00%)    21534.52 *   3.46%*
Hmean     320    20684.33 (   0.00%)    21070.14 *   1.87%*

The difference was marginal except for 64 threads, where the baseline result was very unstable whereas the patch was much more stable. This is somewhat machine-specific as on a separate 80-cpu Broadwell machine the same test reported:
                          5.10.0-rc6             5.10.0-rc6
                         baseline-v2           altrecent-v2
Hmean     1        310.36 (   0.00%)      291.81 *  -5.98%*
Hmean     2        340.86 (   0.00%)      547.22 *  60.54%*
Hmean     4        912.29 (   0.00%)     1063.21 *  16.54%*
Hmean     8       2116.40 (   0.00%)     2103.60 *  -0.60%*
Hmean     16      4232.90 (   0.00%)     4362.92 *   3.07%*
Hmean     32      8442.03 (   0.00%)     8642.10 *   2.37%*
Hmean     64     11733.91 (   0.00%)    11473.66 *  -2.22%*
Hmean     128    17727.24 (   0.00%)    16784.23 *  -5.32%*
Hmean     256    16089.23 (   0.00%)    16110.79 *   0.13%*
Hmean     320    15992.60 (   0.00%)    16071.64 *   0.49%*

schedstats were not used in this series but from an earlier debugging effort, the schedstats after the test run were as follows;

Ops SIS Search               5653107942.00     5726545742.00
Ops SIS Domain Search        3365067916.00     3319768543.00
Ops SIS Scanned            112173512543.00    99194352541.00
Ops SIS Domain Scanned     109885472517.00    96787575342.00
Ops SIS Failures             2923185114.00     2950166441.00
Ops SIS Recent Used Hit           56547.00      118064916.00
Ops SIS Recent Used Miss     1590899250.00      354942791.00
Ops SIS Recent Attempts      1590955797.00      473007707.00
Ops SIS Search Efficiency            5.04              5.77
Ops SIS Domain Search Eff            3.06              3.43
Ops SIS Fast Success Rate           40.47             42.03
Ops SIS Success Rate                48.29             48.48
Ops SIS Recent Success Rate          0.00             24.96

First interesting point is the ridiculous number of times runqueues are scanned -- almost 97 billion times over the course of 40 minutes. With the patch, "Recent Used Hit" is over 2000 times more likely to succeed. The failure rate also increases by quite a lot but the cost is marginal even if the "Fast Success Rate" only increases by 2% overall. What cannot be observed from these stats is where the biggest impact is, as these stats cover everything from low utilisation to over-saturation. If graphed over time, the graphs show that the sched domain is only scanned at negligible rates until the machine is fully busy. With low utilisation, the "Fast Success Rate" is almost 100% until the machine is fully busy. For 320 clients, the success rate is close to 0% which is unsurprising.
Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23934dbac635..01b38fc17bca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,6 +6274,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
+	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
@@ -6765,9 +6766,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
-		if (want_affine)
-			current->recent_used_cpu = cpu;
 	}
 
 	rcu_read_unlock();
-- 
2.26.2
[PATCH 1/4] sched/fair: Remove SIS_AVG_CPU
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It was
a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now
been disabled for almost three years. As the intent of SIS_PROP is to
reduce the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU
and focus on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c     | 3 ---
 kernel/sched/features.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98075f9ea9a8..23934dbac635 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6161,9 +6161,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	avg_idle = this_rq()->avg_idle / 512;
 	avg_cost = this_sd->avg_scan_cost + 1;

-	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-		return -1;
-
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
 		if (span_avg > 4*avg_cost)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)

 /*
-- 
2.26.2
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Sat, Dec 05, 2020 at 09:26:47PM -0500, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Nov 26, 2020 at 10:47:20AM +0000, Mel Gorman wrote:
> > Agreed. This thread has a lot of different directions in it at this
> > point so what I'd hope for is first, a patch that initialises holes with
> > zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a
> > hole, it would be expected the pages are PageReserved. Second, a fix to
> > fast_isolate that forces PFNs returned to always be within the stated
> > zone boundaries.
> 
> The last two patches should resolve the struct page initialization:
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/
> and the VM_BUG_ON_PAGE never happened again as expected.
> 
> So I looked back to see how the "distance" logic is accurate. I added
> those checks and I get negative hits:
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index cc1a7f600a86..844a90b0acdf 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1331,6 +1331,12 @@ fast_isolate_freepages(struct compact_control *cc)
>  	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
>  	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
> 
> +	WARN_ON_ONCE((long) low_pfn < 0);
> +	WARN_ON_ONCE((long) min_pfn < 0);
> +	if ((long) low_pfn < 0)
> +		return cc->free_pfn;
> +	if ((long) min_pfn < 0)
> +		return cc->free_pfn;
>  	if (WARN_ON_ONCE(min_pfn > low_pfn))
>  		low_pfn = min_pfn;
> 
> Both warn-on-once triggers, so it goes negative. While it appears not
> kernel crashing since pfn_valid won't succeed on negative entries and
> they'll always be higher than any pfn in the freelist, is this sign
> that there's further room for improvement here?
> 

Possibly, checking the wrong pfns is simply risky. This is not tested
at all, just checking if it's in the right ballpark. The intent is that
when the free/migrate PFNs are too close or already overlapping, no
attempt is made and it returns back so the caller detects that the
scanners have met.
diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..208cb5857446 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1313,6 +1313,10 @@ fast_isolate_freepages(struct compact_control *cc)
 	if (cc->order <= 0)
 		return cc->free_pfn;

+	/* Ensure that migration and free scanner are not about to cross */
+	if (cc->migrate_pfn < cc->free_pfn)
+		return cc->free_pfn;
+
 	/*
 	 * If starting the scan, use a deeper search and use the highest
 	 * PFN found if a suitable one is not found.
@@ -1324,9 +1328,12 @@ fast_isolate_freepages(struct compact_control *cc)

 	/*
 	 * Preferred point is in the top quarter of the scan space but take
-	 * a pfn from the top half if the search is problematic.
+	 * a pfn from the top half if the search is problematic. Ensure
+	 * there is enough distance to make the fast search worthwhile.
 	 */
 	distance = (cc->free_pfn - cc->migrate_pfn);
+	if (distance <= (pageblock_nr_pages << 2))
+		return cc->free_pfn;
 	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
 	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched
On Fri, Dec 04, 2020 at 04:43:05PM +0100, Vincent Guittot wrote: > On Fri, 4 Dec 2020 at 16:40, Mel Gorman wrote: > > > > On Fri, Dec 04, 2020 at 04:23:48PM +0100, Vincent Guittot wrote: > > > On Fri, 4 Dec 2020 at 15:31, Mel Gorman > > > wrote: > > > > > > > > On Fri, Dec 04, 2020 at 02:47:48PM +0100, Vincent Guittot wrote: > > > > > > IIUC, select_idle_core and select_idle_cpu share the same > > > > > > cpumask(select_idle_mask)? > > > > > > If the target's sibling is removed from select_idle_mask from > > > > > > select_idle_core(), > > > > > > select_idle_cpu() will lose the chance to pick it up? > > > > > > > > > > This is only relevant for patch 10 which is not to be included IIUC > > > > > what mel said in cover letter : "Patches 9 and 10 are stupid in the > > > > > context of this series." > > > > > > > > > > > > > Patch 10 was stupid in the context of the prototype because > > > > select_idle_core always returned a CPU. A variation ended up being > > > > reintroduced at the end of the Series Yet To Be Posted so that SMT > > > > siblings > > > > are cleared during select_idle_core() but select_idle_cpu() still has a > > > > mask with unvisited CPUs to consider if no idle cores are found. > > > > > > > > As far as I know, this would still be compatible with Aubrey's idle > > > > cpu mask as long as it's visited and cleared between select_idle_core > > > > and select_idle_cpu. It relaxes the contraints on Aubrey to some extent > > > > because the idle cpu mask would be a hint so if the information is out > > > > of date, an idle cpu may still be found the normal way. 
> > > > > > But even without patch 10, just replacing sched_domain_span(sd) by > > > sds_idle_cpus(sd->shared) will ensure that sis loops only on cpus that > > > get a chance to be idle so select_idle_core is likely to return an > > > idle_candidate > > > > > > > Yes but if the idle mask is out of date for any reason then idle CPUs might > > In fact it's the opposite, a cpu in idle mask might not be idle but > all cpus that enter idle will be set > When I first checked, the information was based on the tick or a CPU stopping the tick. That was not guaranteed to be up to date so I considered the best option would be to treat idle cpu mask as advisory. It would not necessarily cover a CPU that was entering idle and polling before entering an idle state for example or a rq that would pass sched_idle_cpu() depending on the timing of the update_idle_cpumask call. I know you reviewed that patch and v6 may be very different but the more up to date that information is, the greater the cache conflicts will be on sched_domain_shared so maintaining the up-to-date information may cost enough to offset any benefit from reduced searching at wakeup. If this turns out to be wrong, then great, the idle cpu mask can be used as both the basis for an idle core search and a fast find of an individual CPU. If the cost of keeping up to date information is too high then the idle_cpu_mask can be treated as advisory to start the search and track CPUs visited. The series are not either/or, chunks of the series I posted are orthogonal (e.g. changes to p->recent_cpu_used), the latter parts could either work with idle cpu mask or be replaced by idle cpu mask depending on which performs better. -- Mel Gorman SUSE Labs
Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched
On Fri, Dec 04, 2020 at 04:23:48PM +0100, Vincent Guittot wrote: > On Fri, 4 Dec 2020 at 15:31, Mel Gorman wrote: > > > > On Fri, Dec 04, 2020 at 02:47:48PM +0100, Vincent Guittot wrote: > > > > IIUC, select_idle_core and select_idle_cpu share the same > > > > cpumask(select_idle_mask)? > > > > If the target's sibling is removed from select_idle_mask from > > > > select_idle_core(), > > > > select_idle_cpu() will lose the chance to pick it up? > > > > > > This is only relevant for patch 10 which is not to be included IIUC > > > what mel said in cover letter : "Patches 9 and 10 are stupid in the > > > context of this series." > > > > > > > Patch 10 was stupid in the context of the prototype because > > select_idle_core always returned a CPU. A variation ended up being > > reintroduced at the end of the Series Yet To Be Posted so that SMT siblings > > are cleared during select_idle_core() but select_idle_cpu() still has a > > mask with unvisited CPUs to consider if no idle cores are found. > > > > As far as I know, this would still be compatible with Aubrey's idle > > cpu mask as long as it's visited and cleared between select_idle_core > > and select_idle_cpu. It relaxes the contraints on Aubrey to some extent > > because the idle cpu mask would be a hint so if the information is out > > of date, an idle cpu may still be found the normal way. > > But even without patch 10, just replacing sched_domain_span(sd) by > sds_idle_cpus(sd->shared) will ensure that sis loops only on cpus that > get a chance to be idle so select_idle_core is likely to return an > idle_candidate > Yes but if the idle mask is out of date for any reason then idle CPUs might be missed -- hence the intent to maintain a mask of CPUs visited and use the idle cpu mask as a hint to prioritise CPUs that are likely idle but fall back to a normal scan if none of the "idle cpu mask" CPUs are actually idle. -- Mel Gorman SUSE Labs
Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched
On Fri, Dec 04, 2020 at 02:47:48PM +0100, Vincent Guittot wrote: > > IIUC, select_idle_core and select_idle_cpu share the same > > cpumask(select_idle_mask)? > > If the target's sibling is removed from select_idle_mask from > > select_idle_core(), > > select_idle_cpu() will lose the chance to pick it up? > > This is only relevant for patch 10 which is not to be included IIUC > what mel said in cover letter : "Patches 9 and 10 are stupid in the > context of this series." > Patch 10 was stupid in the context of the prototype because select_idle_core always returned a CPU. A variation ended up being reintroduced at the end of the Series Yet To Be Posted so that SMT siblings are cleared during select_idle_core() but select_idle_cpu() still has a mask with unvisited CPUs to consider if no idle cores are found. As far as I know, this would still be compatible with Aubrey's idle cpu mask as long as it's visited and cleared between select_idle_core and select_idle_cpu. It relaxes the constraints on Aubrey to some extent because the idle cpu mask would be a hint so if the information is out of date, an idle cpu may still be found the normal way. -- Mel Gorman SUSE Labs
Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched
On Fri, Dec 04, 2020 at 02:17:20PM +0100, Vincent Guittot wrote: > On Fri, 4 Dec 2020 at 14:13, Vincent Guittot > wrote: > > > > On Fri, 4 Dec 2020 at 12:30, Mel Gorman wrote: > > > > > > On Fri, Dec 04, 2020 at 11:56:36AM +0100, Vincent Guittot wrote: > > > > > The intent was that the sibling might still be an idle candidate. In > > > > > the current draft of the series, I do not even clear this so that the > > > > > SMT sibling is considered as an idle candidate. The reasoning is that > > > > > if > > > > > there are no idle cores then an SMT sibling of the target is as good > > > > > an > > > > > idle CPU to select as any. > > > > > > > > Isn't the purpose of select_idle_smt ? > > > > > > > > > > Only in part. > > > > > > > select_idle_core() looks for an idle core and opportunistically saves > > > > an idle CPU candidate to skip select_idle_cpu. In this case this is > > > > useless loops for select_idle_core() because we are sure that the core > > > > is not idle > > > > > > > > > > If select_idle_core() finds an idle candidate other than the sibling, > > > it'll use it if there is no idle core -- it picks a busy sibling based > > > on a linear walk of the cpumask. Similarly, select_idle_cpu() is not > > > > My point is that it's a waste of time to loop the sibling cpus of > > target in select_idle_core because it will not help to find an idle > > core. The sibling cpus will then be check either by select_idle_cpu > > of select_idle_smt > I understand and you're right, the full loop was in the context of a series that unified select_idle_* where it made sense. The version I'm currently testing aborts the SMT search if a !idle sibling is encountered. That means that select_idle_core() will no longer scan the entire domain if there are no idle cores. 
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/commit/?h=sched-sissearch-v2r6&id=eb04a344cf7d7ca64c0c8fc0bcade261fa08c19e With the patch on its own, it does mean that select_idle_sibling starts over because SMT siblings might have been cleared. As an aside, select_idle_core() has its own problems even then. It can start a scan for an idle sibling when cpu_rq(target)->nr_running is very large -- over 100+ running tasks which is almost certainly a useless scan for cores. However, I haven't done anything with that in this series as it seemed like it would be follow-up work. > also, while looping the cpumask, the sibling cpus of not idle cpu are > removed and will not be check > True and I spotted this. I think the load_balance_mask can be abused to clear siblings during select_idle_core() while using select_idle_mask to track CPUs that have not been scanned yet so select_idle_cpu only scans CPUs that have not already been visited. https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/commit/?h=sched-sissearch-v2r6&id=a6e986dae38855e3be26dfde86bbef1617431dd1 As both the idle candidate and the load_balance_mask abuse are likely to be controversial, I shuffled the series so that it's ordered from least controversial to most controversial. This https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/log/?h=sched-sissearch-v2r6 is what is currently being tested. It'll take most of the weekend and I'll post them properly if they pass tests and do not throw up nasty surprises. -- Mel Gorman SUSE Labs
Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched
On Fri, Dec 04, 2020 at 11:56:36AM +0100, Vincent Guittot wrote: > > The intent was that the sibling might still be an idle candidate. In > > the current draft of the series, I do not even clear this so that the > > SMT sibling is considered as an idle candidate. The reasoning is that if > > there are no idle cores then an SMT sibling of the target is as good an > > idle CPU to select as any. > > Isn't the purpose of select_idle_smt ? > Only in part. > select_idle_core() looks for an idle core and opportunistically saves > an idle CPU candidate to skip select_idle_cpu. In this case this is > useless loops for select_idle_core() because we are sure that the core > is not idle > If select_idle_core() finds an idle candidate other than the sibling, it'll use it if there is no idle core -- it picks a busy sibling based on a linear walk of the cpumask. Similarly, select_idle_cpu() is not guaranteed to scan the sibling first (ordering) or even reach the sibling (throttling). select_idle_smt() is a last-ditch effort. -- Mel Gorman SUSE Labs
Re: [PATCH 2/2] sched: Split the function show_schedstat()
On Fri, Dec 04, 2020 at 09:22:34AM +0800, Yunfeng Ye wrote: > > > On 2020/12/3 17:42, Mel Gorman wrote: > > On Thu, Dec 03, 2020 at 02:47:14PM +0800, Yunfeng Ye wrote: > >> The schedstat include runqueue-specific stats and domain-specific stats, > >> so split it into two functions, show_rqstat() and show_domainstat(). > >> > >> No functional changes. > >> > >> Signed-off-by: Yunfeng Ye > > > > Why? > > > > I could understand if there was a follow-up patch that adjusted some > > subset or there was a difference in checking for schedstat_enabled, > > locking or inserting new schedstat information. This can happen in the > > general case when the end result is easier to review here it seems to be > > just moving code around. > > > The rqstat and domainstat is independent state information. so I think > split it into two individual function is clearer. > The comments and the names of the structures being accessed are sufficient to make it clear. -- Mel Gorman SUSE Labs
Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched
On Thu, Dec 03, 2020 at 05:38:03PM +0100, Vincent Guittot wrote: > On Thu, 3 Dec 2020 at 15:11, Mel Gorman wrote: > > > > The target CPU is definitely not idle in both select_idle_core and > > select_idle_cpu. For select_idle_core(), the SMT is potentially > > checked unnecessarily as the core is definitely not idle if the > > target is busy. For select_idle_cpu(), the first CPU checked is > > simply a waste. > > > > > Signed-off-by: Mel Gorman > > --- > > kernel/sched/fair.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 68dd9cd62fbd..1d8f5c4b4936 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6077,6 +6077,7 @@ static int select_idle_core(struct task_struct *p, > > struct sched_domain *sd, int > > return -1; > > > > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); > > + __cpumask_clear_cpu(target, cpus); > > should clear cpu_smt_mask(target) as we are sure that the core will not be > idle > The intent was that the sibling might still be an idle candidate. In the current draft of the series, I do not even clear this so that the SMT sibling is considered as an idle candidate. The reasoning is that if there are no idle cores then an SMT sibling of the target is as good an idle CPU to select as any. -- Mel Gorman SUSE Labs
Re: [PATCH 04/10] sched/fair: Return an idle cpu if one is found after a failed search for an idle core
On Thu, Dec 03, 2020 at 05:35:29PM +0100, Vincent Guittot wrote: > > index fc48cc99b03d..845bc0cd9158 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq) > > */ > > static int select_idle_core(struct task_struct *p, struct sched_domain > > *sd, int target) > > { > > + int idle_candidate = -1; > > struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); > > int core, cpu; > > > > @@ -6084,7 +6085,13 @@ static int select_idle_core(struct task_struct *p, > > struct sched_domain *sd, int > > schedstat_inc(this_rq()->sis_scanned); > > if (!available_idle_cpu(cpu)) { > > idle = false; > > - break; > > + if (idle_candidate != -1) > > + break; > > > If I get your changes correctly, it will now continue to loop on all > cpus of the smt mask to try to find an idle cpu whereas it was That was an oversight, the intent is that the SMT search breaks but the search for an idle core continues. The patch was taken from a very different series that unified all the select_idle_* functions as a single function and I failed to fix it up properly. The unification series didn't generate good results back 9 months ago when I tried and I never finished it off. In the current context, it would not make sense to try a unification again. > With the change above you might end up looping all cpus of llc if > there is only one idle cpu in the llc whereas before we were looping > only 1 cpu per core at most. The bottom change makes sense but the > above on is in some way replacing completely select_idle_cpu and > bypass SIS_PROP and we should avoid that IMO > You're right of course, it was never intended to behave like that. 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a3d338770c4..49b1590e60a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6084,8 +6084,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
 			if (!available_idle_cpu(cpu)) {
 				idle = false;
-				if (idle_candidate != -1)
-					break;
+				break;
 			}

 			if (idle_candidate == -1 &&

-- 
Mel Gorman
SUSE Labs
[PATCH 10/10] sched/fair: Avoid revisiting CPUs multiple times during select_idle_sibling
Note: While this is done in the context of select_idle_core(), I would
not expect it to be done like this. The intent is to illustrate how
idle_cpu_mask could be filtered before select_idle_cpu() scans the rest
of a domain or a wider scan was done across a cluster.

select_idle_core() potentially searches a number of CPUs for idle
candidates before select_idle_cpu() clears the mask and revisits the
same CPUs. This patch moves the initialisation of select_idle_mask to
the top-level and reuses the same mask across both select_idle_core and
select_idle_cpu. select_idle_smt() is left alone as the cost of checking
one SMT sibling is marginal relative to calling __cpumask_clear_cpu()
for every CPU visited by select_idle_core().

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd95daf9f53e..af2e108c20c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6096,10 +6096,9 @@ void __update_idle_core(struct rq *rq)
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd,
-				int target, int nr)
+				int target, int nr, struct cpumask *cpus)
 {
 	int idle_candidate = -1;
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int core, cpu;

 	if (!static_branch_likely(&sched_smt_present))
@@ -6108,9 +6107,6 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd,
 	if (!test_idle_cores(target, false))
 		return -1;

-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-	__cpumask_clear_cpu(target, cpus);
-
 	for_each_cpu_wrap(core, cpus, target) {
 		bool idle = true;

@@ -6175,7 +6171,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 #else /* CONFIG_SCHED_SMT */

 static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd,
-					int target, int nr)
+					int target, int nr, struct cpumask *cpus)
 {
 	return -1;
 }
@@ -6193,14 +6189,10 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
  * average idle time for this rq (as found in rq->avg_idle).
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
-				int target, int nr)
+				int target, int nr, struct cpumask *cpus)
 {
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int cpu;

-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-	__cpumask_clear_cpu(target, cpus);
-
 	for_each_cpu_wrap(cpu, cpus, target) {
 		schedstat_inc(this_rq()->sis_scanned);
 		if (!--nr)
@@ -6260,6 +6252,7 @@ static inline bool asym_fits_capacity(int task_util, int cpu)
 static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
 	struct sched_domain *sd, *this_sd;
+	struct cpumask *cpus_visited;
 	unsigned long task_util;
 	int i, recent_used_cpu, depth;
 	u64 time;
@@ -6358,13 +6351,23 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)

 	depth = sis_search_depth(sd, this_sd);

+	/*
+	 * Init the select_idle_mask. select_idle_core() will mask
+	 * out the CPUs that have already been visited to limit the
+	 * search in select_idle_cpu(). Further clearing is not
+	 * done as select_idle_smt checks only one CPU.
+	 */
+	cpus_visited = this_cpu_cpumask_var_ptr(select_idle_mask);
+	cpumask_and(cpus_visited, sched_domain_span(sd), p->cpus_ptr);
+	__cpumask_clear_cpu(target, cpus_visited);
+
 	schedstat_inc(this_rq()->sis_domain_search);
-	i = select_idle_core(p, sd, target, depth);
+	i = select_idle_core(p, sd, target, depth, cpus_visited);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;

 	time = cpu_clock(smp_processor_id());
-	i = select_idle_cpu(p, sd, target, depth);
+	i = select_idle_cpu(p, sd, target, depth, cpus_visited);
 	if ((unsigned)i < nr_cpumask_bits)
 		goto acct_cost;

-- 
2.26.2
[PATCH 09/10] sched/fair: Limit the search for an idle core
Note: This is a bad idea, it's for illustration only to show how the
search space can be filtered at each stage. Searching an idle_cpu_mask
would be a potential option. select_idle_core() would be left alone as
it has its own throttling mechanism.

select_idle_core() may search a full domain for an idle core even if
idle CPUs exist, resulting in an excessive search. This patch partially
limits the search for an idle core similar to select_idle_cpu() once an
idle candidate is found.

Note that this patch can *increase* the number of runqueues considered.
Any searching done by select_idle_core() is duplicated by
select_idle_cpu() if an idle candidate is not found. If there is an idle
CPU then aborting select_idle_core() can have a negative impact. This is
addressed in the next patch.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33ce65b67381..cd95daf9f53e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6095,7 +6095,8 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_core(struct task_struct *p, struct sched_domain *sd,
+				int target, int nr)
 {
 	int idle_candidate = -1;
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
@@ -6115,6 +6116,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
 			schedstat_inc(this_rq()->sis_scanned);
+
+			/* Apply limits if there is an idle candidate */
+			if (idle_candidate != -1)
+				nr--;
+
 			if (!available_idle_cpu(cpu)) {
 				idle = false;
 				if (idle_candidate != -1)
@@ -6130,6 +6136,9 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 		if (idle)
 			return core;

+		if (!nr)
+			break;
+
 		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 	}

@@ -6165,7 +6174,8 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t

 #else /* CONFIG_SCHED_SMT */

-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd,
+					int target, int nr)
 {
 	return -1;
 }
@@ -6349,7 +6359,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	depth = sis_search_depth(sd, this_sd);

 	schedstat_inc(this_rq()->sis_domain_search);
-	i = select_idle_core(p, sd, target);
+	i = select_idle_core(p, sd, target, depth);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;

-- 
2.26.2
[PATCH 02/10] sched/fair: Track efficiency of task recent_used_cpu
This simply tracks the efficiency of the recent_used_cpu. The hit rate
of this matters as it can avoid a domain search. Similarly, the miss
rate matters because each miss is a penalty to the fast path. It is not
required that this patch be merged with the series but if we are looking
at the usefulness of p->recent_used_cpu, the stats generate hard data on
what the hit rate is.

MMTests uses this to generate additional metrics.

SIS Recent Used Hit: A recent CPU was eligible and used. Each hit is
	a domain search avoided.

SIS Recent Used Miss: A recent CPU was eligible but unavailable. Each
	time this is a miss, there was a small penalty to the fast path
	before a domain search happened.

SIS Recent Success Rate: A percentage of the number of hits versus the
	total attempts to use the recent CPU.

SIS Recent Attempts: The total number of times the recent CPU was
	examined.

A high number of Recent Attempts with a low Success Rate implies the
fast path is being punished severely. This could have been presented as
a weighting of hits and misses but calculating an appropriate weight for
misses is problematic.
Signed-off-by: Mel Gorman
---
 kernel/sched/debug.c |  2 ++
 kernel/sched/fair.c  | 23 +++++++++++++----------
 kernel/sched/sched.h |  2 ++
 kernel/sched/stats.c |  7 ++++---
 4 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2386cc5e79e5..8f933a9e8c25 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -718,6 +718,8 @@ do {									\
 		P(sis_domain_search);
 		P(sis_scanned);
 		P(sis_failed);
+		P(sis_recent_hit);
+		P(sis_recent_miss);
 	}
 #undef P
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 494ba01f3414..d9acd55d309b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6291,16 +6291,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	recent_used_cpu = p->recent_used_cpu;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
-	    cpus_share_cache(recent_used_cpu, target) &&
-	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
-	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
-	    asym_fits_capacity(task_util, recent_used_cpu)) {
-		/*
-		 * Replace recent_used_cpu with prev as it is a potential
-		 * candidate for the next wake:
-		 */
-		p->recent_used_cpu = prev;
-		return recent_used_cpu;
+	    cpus_share_cache(recent_used_cpu, target)) {
+		if ((available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
+		    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
+		    asym_fits_capacity(task_util, recent_used_cpu)) {
+			/*
+			 * Replace recent_used_cpu with prev as it is a potential
+			 * candidate for the next wake:
+			 */
+			p->recent_used_cpu = prev;
+			schedstat_inc(this_rq()->sis_recent_hit);
+			return recent_used_cpu;
+		}
+		schedstat_inc(this_rq()->sis_recent_miss);
 	}

 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 90a62dd9293d..6a6578c4c24b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1055,6 +1055,8 @@ struct rq {
 	unsigned int		sis_domain_search;
 	unsigned int		sis_scanned;
 	unsigned int		sis_failed;
+	unsigned int		sis_recent_hit;
+	unsigned int		sis_recent_miss;
 #endif

 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 390bfcc3842c..402fab75aa14 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -10,7 +10,7 @@
  * Bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 16
+#define SCHEDSTAT_VERSION 17

 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -30,14 +30,15 @@ static int show_schedstat(struct seq_file *seq, void *v)

 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq
[PATCH 09/10] sched/fair: Limit the search for an idle core
Note: This is a bad idea, it's for illustration only to show how the search space can be filtered at each stage. Searching an idle_cpu_mask would be a potential option. select_idle_core() would be left alone as it has its own throttling mechanism select_idle_core() may search a full domain for an idle core even if idle CPUs exist result in an excessive search. This patch partially limits the search for an idle core similar to select_idle_cpu() once an idle candidate is found. Note that this patch can *increase* the number of runqueues considered. Any searching done by select_idle_core() is duplicated by select_idle_cpu() if an idle candidate is not found. If there is an idle CPU then aborting select_idle_core() can have a negative impact. This is addressed in the next patch. Signed-off-by: Mel Gorman --- kernel/sched/fair.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 33ce65b67381..cd95daf9f53e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6095,7 +6095,8 @@ void __update_idle_core(struct rq *rq) * there are no idle cores left in the system; tracked through * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above. 
*/ -static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target) +static int select_idle_core(struct task_struct *p, struct sched_domain *sd, + int target, int nr) { int idle_candidate = -1; struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); @@ -6115,6 +6116,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int for_each_cpu(cpu, cpu_smt_mask(core)) { schedstat_inc(this_rq()->sis_scanned); + + /* Apply limits if there is an idle candidate */ + if (idle_candidate != -1) + nr--; + if (!available_idle_cpu(cpu)) { idle = false; if (idle_candidate != -1) @@ -6130,6 +6136,9 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int if (idle) return core; + if (!nr) + break; + cpumask_andnot(cpus, cpus, cpu_smt_mask(core)); } @@ -6165,7 +6174,8 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t #else /* CONFIG_SCHED_SMT */ -static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target) +static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, + int target, int nr) { return -1; } @@ -6349,7 +6359,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) depth = sis_search_depth(sd, this_sd); schedstat_inc(this_rq()->sis_domain_search); - i = select_idle_core(p, sd, target); + i = select_idle_core(p, sd, target, depth); if ((unsigned)i < nr_cpumask_bits) return i; -- 2.26.2
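Stripped of kernel plumbing, the throttling idea can be modelled in plain C. In this sketch, arrays stand in for cpumasks, a fixed SMT width of 2 is assumed, and all names (`scan_cores`, `cpu_idle`) are illustrative rather than the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the limited core search: walk cores of two SMT
 * siblings, cache the first idle CPU as a fallback and, once a
 * fallback exists, start spending the scan budget (nr).  When the
 * budget runs out, settle for the cached candidate rather than
 * scanning the rest of the domain.
 */
static int scan_cores(const bool *cpu_idle, int ncpus, int nr, int *scanned)
{
    int idle_candidate = -1;

    *scanned = 0;
    for (int core = 0; core < ncpus; core += 2) {
        bool whole_core_idle = true;

        for (int cpu = core; cpu < core + 2; cpu++) {
            (*scanned)++;
            /* Apply limits only once an idle candidate exists */
            if (idle_candidate != -1)
                nr--;
            if (!cpu_idle[cpu])
                whole_core_idle = false;
            else if (idle_candidate == -1)
                idle_candidate = cpu;
        }
        if (whole_core_idle)
            return core;       /* fully idle core: the preferred answer */
        if (nr <= 0)
            break;             /* throttled: fall back to the candidate */
    }
    return idle_candidate;     /* may be -1 if nothing idle was seen */
}
```

With a generous budget the fully idle core is still found; with a budget of 1 the search settles for the first idle sibling seen, which is exactly the trade-off the changelog warns about.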
[PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched
The target CPU is definitely not idle in both select_idle_core() and select_idle_cpu(). For select_idle_core(), the SMT siblings of the target are potentially checked unnecessarily as the core is definitely not idle if the target is busy. For select_idle_cpu(), the first CPU checked is simply a waste.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68dd9cd62fbd..1d8f5c4b4936 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6077,6 +6077,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 		return -1;
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	__cpumask_clear_cpu(target, cpus);
 
 	for_each_cpu_wrap(core, cpus, target) {
 		bool idle = true;
@@ -6181,6 +6182,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	time = cpu_clock(this);
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	__cpumask_clear_cpu(target, cpus);
 
 	for_each_cpu_wrap(cpu, cpus, target) {
 		schedstat_inc(this_rq()->sis_scanned);
-- 
2.26.2
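The effect of clearing the target before the wrapped walk can be shown with a standalone sketch, where a `uint64_t` bitmask stands in for a cpumask and the function names are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the patch: the wakeup target is known to be busy by the
 * time the domain is scanned, so clear it from the candidate mask
 * before walking the CPUs in wrapped order starting at target.
 */
static int scan_wrap(uint64_t allowed, uint64_t idle, int ncpus, int target)
{
    allowed &= ~(1ULL << target);   /* __cpumask_clear_cpu(target, cpus) */

    for (int i = 0; i < ncpus; i++) {
        int cpu = (target + i) % ncpus;   /* for_each_cpu_wrap() order */

        if ((allowed & (1ULL << cpu)) && (idle & (1ULL << cpu)))
            return cpu;
    }
    return -1;
}
```

Without the clear, a stale idle bit for the target would be returned immediately; with it, the scan moves straight on to CPUs that can actually be idle.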
[PATCH 05/10] sched/fair: Do not replace recent_used_cpu with the new target
After select_idle_sibling, p->recent_used_cpu is set to the new target. However on the next wakeup, prev will be the same as recent_used_cpu unless the load balancer has moved the task since the last wakeup. It still works, but is less efficient than it can be after all the changes that went in since then to reduce unnecessary migrations, load balancer changes, etc. This patch preserves recent_used_cpu for longer.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                          5.10.0-rc6             5.10.0-rc6
                     idlecandidate-v1r10      altrecent-v1r10
Hmean     1       505.67 (   0.00%)      501.34 *  -0.86%*
Hmean     2       974.06 (   0.00%)      981.39 *   0.75%*
Hmean     4      1904.43 (   0.00%)     1926.13 *   1.14%*
Hmean     8      3721.02 (   0.00%)     3799.86 *   2.12%*
Hmean     16     6769.17 (   0.00%)     6938.40 *   2.50%*
Hmean     32    10312.58 (   0.00%)    10632.11 *   3.10%*
Hmean     64    13792.01 (   0.00%)    13670.17 *  -0.88%*
Hmean     128   20963.44 (   0.00%)    21456.33 *   2.35%*
Hmean     256   20335.62 (   0.00%)    21070.24 *   3.61%*
Hmean     320   20147.25 (   0.00%)    20624.92 *   2.37%*

The benefit is marginal, the main impact is on how it affects p->recent_used_cpu and whether a domain search happens. From the schedstats patches, with schedstat enabled:

Ops SIS Search                 5653107942.00     5726545742.00
Ops SIS Domain Search          3365067916.00     3319768543.00
Ops SIS Scanned              112173512543.00    99194352541.00
Ops SIS Domain Scanned       109885472517.00    96787575342.00
Ops SIS Failures               2923185114.00     2950166441.00
Ops SIS Recent Used Hit             56547.00      118064916.00
Ops SIS Recent Used Miss       1590899250.00      354942791.00
Ops SIS Recent Attempts        1590955797.00      473007707.00
Ops SIS Search Efficiency               5.04              5.77
Ops SIS Domain Search Eff               3.06              3.43
Ops SIS Fast Success Rate              40.47             42.03
Ops SIS Success Rate                   48.29             48.48
Ops SIS Recent Success Rate             0.00             24.96

(First interesting point is the ridiculous number of times runqueues are scanned -- almost 97 billion times over the course of 40 minutes)

Note "Recent Used Hit" is over 2000 times more likely to succeed. 
The failure rate also increases by quite a lot but the cost is marginal even if the "Fast Success Rate" only increases by 2% overall. What cannot be observed from these stats is where the biggest impact is, as these stats cover everything from low utilisation to over-saturation. If graphed over time, the graphs show that the sched domain is only scanned at negligible rates until the machine is fully busy. With low utilisation, the "Fast Success Rate" is almost 100% until the machine is fully busy. For 320 clients, the success rate is close to 0% which is unsurprising.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 845bc0cd9158..68dd9cd62fbd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6293,6 +6293,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
+	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target)) {
@@ -6789,9 +6790,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
-		if (want_affine)
-			current->recent_used_cpu = cpu;
 	}
 
 	rcu_read_unlock();
-- 
2.26.2
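The fast-path change itself is small enough to model outside the kernel. In this sketch, `struct task` and `toy_idle()` are illustrative stand-ins for task_struct and available_idle_cpu(); the point is the unconditional refresh of the recent field while the old value is still used as a candidate:

```c
#include <assert.h>

/*
 * Toy model of the altrecent change: p->recent_used_cpu is refreshed
 * with prev on every wakeup, while the *old* value is checked as a
 * fast-path candidate, so it survives even when prev == recent.
 */
struct task {
    int recent_used_cpu;
};

static int toy_idle(int cpu)
{
    return cpu == 3;    /* example predicate: only CPU 3 is idle */
}

static int pick_recent(struct task *p, int prev, int target,
                       int (*is_idle)(int))
{
    int recent = p->recent_used_cpu;

    p->recent_used_cpu = prev;    /* preserved for the next wakeup */

    if (recent != prev && recent != target && is_idle(recent))
        return recent;            /* fast-path hit, no domain search */
    return -1;                    /* caller falls back to the domain scan */
}
```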
[PATCH 08/10] sched/fair: Reintroduce SIS_AVG_CPU but in the context of SIS_PROP to reduce search depth
Subject says it all but no supporting data at this time. This might help the hackbench case in isolation or throw other workloads under the bus. Final version will have proper data.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 185fc6e28f8e..33ce65b67381 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6024,6 +6024,14 @@ static int sis_search_depth(struct sched_domain *sd, struct sched_domain *this_sd)
 			nr = div_u64(span_avg, avg_cost);
 		else
 			nr = 4;
+
+		/*
+		 * Throttle the depth search further if average idle time is
+		 * below the average cost. This is primarily to deal with
+		 * the saturated case where searches are likely to fail.
+		 */
+		if (avg_idle < avg_cost)
+			nr >>= 1;
 	}
 
 	return nr;
-- 
2.26.2
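The depth calculation with the extra throttle can be sketched as a pure function. Here the kernel's /512 fuzz factor is assumed to be folded into the avg_idle argument, u64 arithmetic is reduced to `uint64_t`, and the function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <limits.h>

/*
 * Toy model of sis_search_depth() with patch 08 applied: proportional
 * depth as in SIS_PROP, halved again when average idle time is below
 * the average scan cost (the saturated case).
 */
static int depth(uint64_t avg_idle, uint64_t avg_cost_raw, int span_weight)
{
    uint64_t avg_cost = avg_cost_raw + 1;   /* avoid divide-by-zero */
    uint64_t span_avg = (uint64_t)span_weight * avg_idle;
    int nr;

    if (span_avg > 4 * avg_cost)
        nr = (int)(span_avg / avg_cost);
    else
        nr = 4;

    if (avg_idle < avg_cost)    /* the SIS_AVG_CPU-like extra throttle */
        nr >>= 1;

    return nr;
}
```

The old SIS_AVG_CPU behaviour returned -1 outright when avg_idle < avg_cost; halving the depth instead keeps a small search alive in the saturated case.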
[PATCH 07/10] sched/fair: Account for the idle cpu/smt search cost
select_idle_cpu() accounts for the average search cost for the purposes of conducting a limited proportional search if SIS_PROP is enabled. The issue is that select_idle_cpu() does not account for the cost if a candidate is found and select_idle_smt() is ignored.

This patch moves the accounting of avg_cost to cover the cpu/smt search costs. select_idle_core() costs could be accounted for but it has its own throttling mechanism by tracking whether idle cores are expected to exist.

This patch is a bisection hazard because SIS_PROP and how it balances avg_cost vs avg_idle was probably guided by the fact that avg_cost was not always accounted for.

Signed-off-by: Mel Gorman --- kernel/sched/fair.c | 82 + 1 file changed, 46 insertions(+), 36 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1d8f5c4b4936..185fc6e28f8e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6006,6 +6006,29 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p return new_cpu; } +static int sis_search_depth(struct sched_domain *sd, struct sched_domain *this_sd) +{ + u64 avg_cost, avg_idle, span_avg; + int nr = INT_MAX; + + if (sched_feat(SIS_PROP)) { + /* +* Due to large variance we need a large fuzz factor; hackbench in +* particularly is sensitive here. +*/ + avg_idle = this_rq()->avg_idle / 512; + avg_cost = this_sd->avg_scan_cost + 1; + + span_avg = sd->span_weight * avg_idle; + if (span_avg > 4*avg_cost) + nr = div_u64(span_avg, avg_cost); + else + nr = 4; + } + + return nr; +} + #ifdef CONFIG_SCHED_SMT DEFINE_STATIC_KEY_FALSE(sched_smt_present); EXPORT_SYMBOL_GPL(sched_smt_present); @@ -6151,35 +6174,11 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd * comparing the average scan cost (tracked in sd->avg_scan_cost) against the * average idle time for this rq (as found in rq->avg_idle). 
*/ -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target) +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, + int target, int nr) { struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); - struct sched_domain *this_sd; - u64 avg_cost, avg_idle; - u64 time; - int this = smp_processor_id(); - int cpu, nr = INT_MAX; - - this_sd = rcu_dereference(*this_cpu_ptr(_llc)); - if (!this_sd) - return -1; - - /* -* Due to large variance we need a large fuzz factor; hackbench in -* particularly is sensitive here. -*/ - avg_idle = this_rq()->avg_idle / 512; - avg_cost = this_sd->avg_scan_cost + 1; - - if (sched_feat(SIS_PROP)) { - u64 span_avg = sd->span_weight * avg_idle; - if (span_avg > 4*avg_cost) - nr = div_u64(span_avg, avg_cost); - else - nr = 4; - } - - time = cpu_clock(this); + int cpu; cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); __cpumask_clear_cpu(target, cpus); @@ -6192,9 +6191,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t break; } - time = cpu_clock(this) - time; - update_avg(_sd->avg_scan_cost, time); - return cpu; } @@ -6245,9 +6241,10 @@ static inline bool asym_fits_capacity(int task_util, int cpu) */ static int select_idle_sibling(struct task_struct *p, int prev, int target) { - struct sched_domain *sd; + struct sched_domain *sd, *this_sd; unsigned long task_util; - int i, recent_used_cpu; + int i, recent_used_cpu, depth; + u64 time; schedstat_inc(this_rq()->sis_search); @@ -6337,21 +6334,34 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (!sd) return target; + this_sd = rcu_dereference(*this_cpu_ptr(_llc)); + if (!this_sd) + return target; + + depth = sis_search_depth(sd, this_sd); + schedstat_inc(this_rq()->sis_domain_search); i = select_idle_core(p, sd, target); if ((unsigned)i < nr_cpumask_bits) return i; - i = select_idle_cpu(p, sd, target); + time = cpu_clock(smp_processor_id()); + i = select_idle_cpu(p, 
sd, target, depth); if ((unsigned)i < nr_cpumask_bits) - return i; + goto acct_cost; i = select_idle_smt(p, sd, target); if ((unsigned)i < nr_cpumask_bits) - return i; +
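The avg_scan_cost bookkeeping being moved here is a simple exponentially weighted moving average. A standalone sketch, with the function name illustrative and the weighting taken from the kernel's update_avg() helper, which folds in one eighth of the delta per sample:

```c
#include <assert.h>

/*
 * Toy model of the scan-cost accounting: a single clock delta covering
 * both the cpu and smt searches is folded into avg_scan_cost with
 * new = old + (sample - old) / 8.  Plain longs stand in for the u64
 * nanosecond values the kernel uses.
 */
static long ewma_scan_cost(long avg, long sample)
{
    return avg + (sample - avg) / 8;
}
```

Because one clock read now brackets both select_idle_cpu() and select_idle_smt(), every slow-path wakeup contributes a sample, rather than only the ones that fell through to select_idle_cpu() without finding a candidate.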
[PATCH 04/10] sched/fair: Return an idle cpu if one is found after a failed search for an idle core
select_idle_core is called when SMT is active and there is likely a free core available. It may find idle CPUs but this information is simply discarded and the scan starts over again with select_idle_cpu.

This patch caches information on idle CPUs found during the search for a core and uses one if no core is found. This is a tradeoff. There may be a slight impact when utilisation is low and an idle core can be found quickly. It provides improvements as the number of busy CPUs approaches 50% of the domain size when SMT is enabled.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                          5.10.0-rc6             5.10.0-rc6
                           schedstat           idlecandidate
Hmean     1       500.06 (   0.00%)      505.67 *   1.12%*
Hmean     2       975.90 (   0.00%)      974.06 *  -0.19%*
Hmean     4      1902.95 (   0.00%)     1904.43 *   0.08%*
Hmean     8      3761.73 (   0.00%)     3721.02 *  -1.08%*
Hmean     16     6713.93 (   0.00%)     6769.17 *   0.82%*
Hmean     32    10435.31 (   0.00%)    10312.58 *  -1.18%*
Hmean     64    12325.51 (   0.00%)    13792.01 *  11.90%*
Hmean     128   21225.21 (   0.00%)    20963.44 *  -1.23%*
Hmean     256   20532.83 (   0.00%)    20335.62 *  -0.96%*
Hmean     320   20334.81 (   0.00%)    20147.25 *  -0.92%*

In this particular test, the cost/benefit is marginal except for 64 which was a point where the machine was over 50% busy but not fully utilised. 
Signed-off-by: Mel Gorman --- kernel/sched/fair.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fc48cc99b03d..845bc0cd9158 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq) */ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target) { + int idle_candidate = -1; struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); int core, cpu; @@ -6084,7 +6085,13 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int schedstat_inc(this_rq()->sis_scanned); if (!available_idle_cpu(cpu)) { idle = false; - break; + if (idle_candidate != -1) + break; + } + + if (idle_candidate == -1 && + cpumask_test_cpu(cpu, p->cpus_ptr)) { + idle_candidate = cpu; } } @@ -6099,7 +6106,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int */ set_idle_cores(target, 0); - return -1; + return idle_candidate; } /* -- 2.26.2
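The caching logic reduces to a few lines once the kernel plumbing is removed. In this sketch, arrays stand in for cpumasks, the p->cpus_ptr check becomes an "allowed" flag, a fixed SMT width of 2 is assumed, and the function name is illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the idle-candidate cache: while scanning SMT siblings
 * for a fully idle core, remember the first idle CPU seen so a failed
 * core search can still return something useful.
 */
static int core_scan(const bool *idle, const bool *allowed, int ncpus)
{
    int idle_candidate = -1;

    for (int core = 0; core < ncpus; core += 2) {
        bool core_idle = true;

        for (int cpu = core; cpu < core + 2; cpu++) {
            if (!idle[cpu]) {
                core_idle = false;
                continue;
            }
            if (idle_candidate == -1 && allowed[cpu])
                idle_candidate = cpu;
        }
        if (core_idle)
            return core;        /* a fully idle core still wins */
    }
    return idle_candidate;      /* previously this was always -1 */
}
```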
[RFC PATCH 00/10] Reduce time complexity of select_idle_sibling
This is an early prototype that has not been tested heavily. While parts of it may stand on their own, the motivation to release early is Aubrey Li's series on using an idle cpumask to optimise the search and Barry Song's series on representing clusters on die. The series is based on tip/sched/core rebased to 5.10-rc6.

Patches 1-2 add schedstats to track the search efficiency of select_idle_sibling. They can be dropped from the final version but are useful when looking at select_idle_sibling in general. MMTests can already parse the stats and generate useful data including graphs over time.

Patch 3 kills SIS_AVG_CPU but it is partially reintroduced later in the context of SIS_PROP.

Patch 4 notes that select_idle_core() can find an idle CPU that is not part of a free core, yet that information is discarded and a second search is conducted in select_idle_cpu(), which is wasteful. Note that this patch will definitely change in the final version.

Patch 5 adjusts p->recent_used_cpu so that it has a higher success rate and avoids searching the domain in some cases.

Patch 6 notes that select_idle_* always starts with a CPU that is definitely not idle and fixes that.

Patch 7 notes that SIS_PROP only partially accounts for search costs. While this might be accidentally beneficial, it makes it much harder to reason about the effectiveness of SIS_PROP.

Patch 8 uses similar logic to SIS_AVG_CPU but in the context of SIS_PROP to throttle the search depth.

Patches 9 and 10 are stupid in the context of this series. They are included even though it makes no sense to use SIS_PROP logic in select_idle_core() as it already has throttling logic. The point is to illustrate that select_idle_mask can be initialised at the start of a domain search and used to mask out CPUs that have already been visited. In the context of Aubrey's and Barry's work, select_idle_mask would be initialised *after* select_idle_core as select_idle_core uses select_idle_mask for its own purposes. 
In Aubrey's case, the next step would be to scan idle_cpus_span as those CPUs may still be idle, biasing the search towards likely idle candidates. If that fails, the bits set in idle_cpus_span are cleared from select_idle_mask and the remainder is scanned. Similar observations apply to Barry's work: scan the local domain first, mask out those bits, then scan the remaining CPUs in the cluster.

The final version of this series will drop patches 1-2 unless there is demand and definitely drop patches 9-10. However, all 4 patches may be useful in the context of Aubrey's and Barry's work. Patches 1-2 would give more precise results on exactly how much they are improving "SIS Domain Search Efficiency" which may be more illustrative than just the headline performance figures of a given workload.

The final version of this series will also adjust patch 4. If select_idle_core() runs at all then it definitely should return a CPU -- either an idle CPU or the target as it has already searched the entire domain and no further searching should be conducted. Barry might change that back so that a cluster can be scanned but it would be done in the context of the cluster series.

-- 
2.26.2
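The two-phase masked scan described above can be sketched with plain bitmasks. Here `uint64_t` values stand in for cpumask_t, "likely_idle" plays the role of Aubrey's idle_cpus_span, and all names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the cover letter's idea: scan the likely-idle CPUs
 * first, then clear them from the candidate mask and scan the
 * remainder, so no runqueue is visited twice.
 */
static int two_phase_scan(uint64_t candidates, uint64_t likely_idle,
                          uint64_t actually_idle, int ncpus, int *visited)
{
    uint64_t phase[2] = {
        candidates & likely_idle,     /* bias towards idle_cpus_span */
        candidates & ~likely_idle,    /* then the unvisited remainder */
    };

    *visited = 0;
    for (int p = 0; p < 2; p++) {
        for (int cpu = 0; cpu < ncpus; cpu++) {
            if (!(phase[p] & (1ULL << cpu)))
                continue;
            (*visited)++;
            if (actually_idle & (1ULL << cpu))
                return cpu;
        }
    }
    return -1;
}
```

When the hint is accurate the first phase terminates after visiting very few runqueues; when it is stale the second phase still covers the rest of the domain exactly once.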
[PATCH 03/10] sched/fair: Remove SIS_AVG_CPU
SIS_AVG_CPU was introduced as a means of avoiding a search when the average search cost indicated that the search would likely fail. It was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive") and later replaced with a proportional search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been disabled for almost three years. As the intent of SIS_PROP is to reduce the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman
---
 kernel/sched/fair.c     | 3 ---
 kernel/sched/features.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9acd55d309b..fc48cc99b03d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6163,9 +6163,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	avg_idle = this_rq()->avg_idle / 512;
 	avg_cost = this_sd->avg_scan_cost + 1;
 
-	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-		return -1;
-
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
 		if (span_avg > 4*avg_cost)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2
[PATCH 01/10] sched/fair: Track efficiency of select_idle_sibling
select_idle_sibling is an important path that finds a nearby idle CPU on wakeup. As it is examining other CPUs' state, it can be expensive in terms of cache usage. This patch tracks the search efficiency if schedstats are enabled. In general, this is only useful for kernel developers but schedstats are typically disabled by default so it is convenient for development and mostly free otherwise.

It is not required that this patch be merged with the series but if we are looking at time or search complexity, the stats generate hard data on what the search costs actually are.

SIS Search: Number of calls to select_idle_sibling

SIS Domain Search: Number of times the domain was searched because the fast path failed.

SIS Scanned: Generally the number of runqueues scanned but the fast path counts as 1 regardless of the values for target, prev and recent.

SIS Domain Scanned: Number of runqueues scanned during a search of the LLC domain.

SIS Failures: Number of SIS calls that failed to find an idle CPU

SIS Search Efficiency: A ratio expressed as a percentage of runqueues scanned versus idle CPUs found. A 100% efficiency indicates that the target, prev or recent CPU of a task was idle at wakeup. The lower the efficiency, the more runqueues were scanned before an idle CPU was found.

SIS Domain Search Efficiency: Similar, except only for the slower SIS path.

SIS Fast Success Rate: Percentage of SIS that used target, prev or recent CPUs.

SIS Success rate: Percentage of scans that found an idle CPU. 
Signed-off-by: Mel Gorman --- kernel/sched/debug.c | 4 kernel/sched/fair.c | 14 ++ kernel/sched/sched.h | 6 ++ kernel/sched/stats.c | 8 +--- 4 files changed, 29 insertions(+), 3 deletions(-) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 2357921580f9..2386cc5e79e5 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -714,6 +714,10 @@ do { \ P(sched_goidle); P(ttwu_count); P(ttwu_local); + P(sis_search); + P(sis_domain_search); + P(sis_scanned); + P(sis_failed); } #undef P diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 98075f9ea9a8..494ba01f3414 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6081,6 +6081,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int bool idle = true; for_each_cpu(cpu, cpu_smt_mask(core)) { + schedstat_inc(this_rq()->sis_scanned); if (!available_idle_cpu(cpu)) { idle = false; break; @@ -6112,6 +6113,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t return -1; for_each_cpu(cpu, cpu_smt_mask(target)) { + schedstat_inc(this_rq()->sis_scanned); if (!cpumask_test_cpu(cpu, p->cpus_ptr) || !cpumask_test_cpu(cpu, sched_domain_span(sd))) continue; @@ -6177,6 +6179,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); for_each_cpu_wrap(cpu, cpus, target) { + schedstat_inc(this_rq()->sis_scanned); if (!--nr) return -1; if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) @@ -6240,6 +6243,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) unsigned long task_util; int i, recent_used_cpu; + schedstat_inc(this_rq()->sis_search); + + /* +* Checking if prev, target and recent is treated as one scan. A +* perfect hit on one of those is considered 100% efficiency. +* Further scanning impairs efficiency. 
+*/ + schedstat_inc(this_rq()->sis_scanned); + /* * On asymmetric system, update task utilization because we will check * that the task fits with cpu's capacity. @@ -6315,6 +6327,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (!sd) return target; + schedstat_inc(this_rq()->sis_domain_search); i = select_idle_core(p, sd, target); if ((unsigned)i < nr_cpumask_bits) return i; @@ -6327,6 +6340,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if ((unsigned)i < nr_cpumask_bits) return i; + schedstat_inc(this_rq()->sis_failed); return target; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f5acb6c5ce49..90a62dd9293d 100644 --- a/kernel
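The derived percentages in the patch description can be computed from the raw counters. The formulas below are inferred from the metric descriptions and the sample figures elsewhere in this series (e.g. 5653107942 searches over 112173512543 scans gives the quoted 5.04% efficiency); they are an assumption, not taken from a kernel or mmtests source:

```c
#include <assert.h>

/*
 * Sketch of the schedstat-derived ratios:
 *   search efficiency = searches / runqueues scanned
 *   fast success rate = wakeups satisfied without a domain search
 *   success rate      = wakeups that found any idle CPU
 */
static double search_efficiency(unsigned long long searches,
                                unsigned long long scanned)
{
    return scanned ? 100.0 * searches / scanned : 0.0;
}

static double fast_success_rate(unsigned long long searches,
                                unsigned long long domain_searches)
{
    return searches ? 100.0 * (searches - domain_searches) / searches : 0.0;
}

static double success_rate(unsigned long long searches,
                           unsigned long long failures)
{
    return searches ? 100.0 * (searches - failures) / searches : 0.0;
}
```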
Re: [PATCH -V6 RESEND 1/3] numa balancing: Migrate on fault among multiple bound nodes
On Thu, Dec 03, 2020 at 11:25:50AM +0100, Peter Zijlstra wrote: > On Wed, Dec 02, 2020 at 11:40:54AM +0000, Mel Gorman wrote: > > On Wed, Dec 02, 2020 at 04:42:32PM +0800, Huang Ying wrote: > > > Now, NUMA balancing can only optimize the page placement among the > > > NUMA nodes if the default memory policy is used. Because the memory > > > policy specified explicitly should take precedence. But this seems > > > too strict in some situations. For example, on a system with 4 NUMA > > > nodes, if the memory of an application is bound to the node 0 and 1, > > > NUMA balancing can potentially migrate the pages between the node 0 > > > and 1 to reduce cross-node accessing without breaking the explicit > > > memory binding policy. > > > > > > > Ok, I think this part is ok and while the test case is somewhat > > superficial, it at least demonstrated that the NUMA balancing overhead > > did not offset any potential benefit > > > > Acked-by: Mel Gorman > > Who do we expect to merge this, me through tip/sched/core or akpm ? I would expect akpm, it's much more on the mm side because it affects the semantics of memory policies. It should also have more mm-orientated review than just mine because it affects user-visible semantics and the ability to detect whether the feature is available or not needs to be treated with care. -- Mel Gorman SUSE Labs
Re: [RFC PATCH v2 2/2] scheduler: add scheduler level for clusters
On Thu, Dec 03, 2020 at 10:28:31AM +0100, Peter Zijlstra wrote: > On Tue, Dec 01, 2020 at 04:04:04PM +, Valentin Schneider wrote: > > > > Gating this behind this new config only leveraged by arm64 doesn't make it > > very generic. Note that powerpc also has this newish "CACHE" level which > > seems to overlap in function with your "CLUSTER" one (both are arch > > specific, though). > > > > I think what you are after here is an SD_SHARE_PKG_RESOURCES domain walk, > > i.e. scan CPUs by increasing cache "distance". We already have it in some > > form, as we scan SMT & LLC domains; AFAICT LLC always maps to MC, except > > for said powerpc's CACHE thingie. > > There's some intel chips with a smaller L2, but I don't think we ever > bothered. > > There's also the extended topology stuff from Intel: SMT, Core, Module, > Tile, Die, of which we've only partially used Die I think. > > Whatever we do, it might make sense to not all use different names. > > Also, I think Mel said he was cooking something for > select_idle_balance(). > > Also, I've previously posted patches that fold all the iterations into > one, so it might make sense to revisit some of that if Mel also already > didn.t I didn't. The NUMA/lb reconcilation took most of my attention and right now I'm looking at select_idle_sibling again in preparation for tracking the idle cpumask in a sensible fashion. The main idea I had in mind was special casing EPYC as it has multiple small L3 caches that may benefit from select_idle_sibling looking slightly beyond the LLC as a search domain but it has not even started yet. -- Mel Gorman SUSE Labs