Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-20 Thread Mel Gorman
On Wed, Jan 20, 2021 at 10:21:47AM +0100, Vincent Guittot wrote:
> On Wed, 20 Jan 2021 at 10:12, Mel Gorman  wrote:
> >
> > On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > > > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct 
> > > > *p, struct sched_domain *sd, int t
> > > > }
> > > >
> > > > for_each_cpu_wrap(cpu, cpus, target) {
> > > > -   if (!--nr)
> > > > -   return -1;
> > > > -   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > > > -   break;
> > > > +   if (smt) {
> > > > +   i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > > +   if ((unsigned int)i < nr_cpumask_bits)
> > > > +   return i;
> > > > +
> > > > +   } else {
> > > > +   if (!--nr)
> > > > +   return -1;
> > > > +   i = __select_idle_cpu(cpu);
> > > > +   if ((unsigned int)i < nr_cpumask_bits) {
> > > > +   idle_cpu = i;
> > > > +   break;
> > > > +   }
> > > > +   }
> > > > }
> > > >
> > > > -   if (sched_feat(SIS_PROP)) {
> > > > +   if (smt)
> > > > +   set_idle_cores(this, false);
> > >
> > > Shouldn't we set_idle_cores(false) only if this was the last idle
> > > core in the LLC ?
> > >
> >
> > That would involve rechecking the cpumask bits that have not been
> > scanned to see if any of them are an idle core. As the existence of idle
> > cores can change very rapidly, it's not worth the cost.
> 
> But don't we reach this point only if we scanned all CPUs and didn't
> find an idle core ?

Yes, but my understanding of Gautham's suggestion was to check whether an
idle core that was found is the last idle core available and set
has_idle_cores to false in that case. I think this would be relatively
expensive and possibly futile: returning the last idle core for this
wakeup does not mean there will be no idle core on the next wakeup, as
other cores may go idle between wakeups.
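
To make the cost concrete, a purely illustrative sketch (not something I
am proposing) of what such a recheck would have to do after a core has
been found, using the existing cpumask helpers:

	/* Hypothetical helper: walk the not-yet-visited bits in @cpus and
	 * re-test every sibling to see whether another fully idle core
	 * remains before clearing has_idle_cores.
	 */
	static bool llc_has_other_idle_core(struct cpumask *cpus, int found_core)
	{
		int core, cpu;

		cpumask_andnot(cpus, cpus, cpu_smt_mask(found_core));
		for_each_cpu(core, cpus) {
			bool idle = true;

			for_each_cpu(cpu, cpu_smt_mask(core)) {
				if (!available_idle_cpu(cpu))
					idle = false;
			}
			if (idle)
				return true;
			cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
		}
		return false;
	}

That is another near-full scan of the remaining siblings on every
successful core selection, which is exactly the cost being avoided.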

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-20 Thread Mel Gorman
On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> > }
> > 
> > for_each_cpu_wrap(cpu, cpus, target) {
> > -   if (!--nr)
> > -   return -1;
> > -   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > -   break;
> > +   if (smt) {
> > +   i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > +   if ((unsigned int)i < nr_cpumask_bits)
> > +   return i;
> > +
> > +   } else {
> > +   if (!--nr)
> > +   return -1;
> > +   i = __select_idle_cpu(cpu);
> > +   if ((unsigned int)i < nr_cpumask_bits) {
> > +   idle_cpu = i;
> > +   break;
> > +   }
> > +   }
> > }
> > 
> > -   if (sched_feat(SIS_PROP)) {
> > +   if (smt)
> > +   set_idle_cores(this, false);
> 
> Shouldn't we set_idle_cores(false) only if this was the last idle
> core in the LLC ? 
> 

That would involve rechecking the cpumask bits that have not been
scanned to see if any of them are an idle core. As the existence of idle
cores can change very rapidly, it's not worth the cost.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH v3 0/5] Scan for an idle sibling in a single pass

2021-01-19 Thread Mel Gorman
On Tue, Jan 19, 2021 at 12:33:04PM +0100, Vincent Guittot wrote:
> On Tue, 19 Jan 2021 at 12:22, Mel Gorman  wrote:
> >
> > Changelog since v2
> > o Remove unnecessary parameters
> > o Update nr during scan only when scanning for cpus
> 
> Hi Mel,
> 
> I haven't looked at your previous version mainly because I'm chasing a
> performance regression on v5.11-rcx which prevents me from testing the
> impact of your patchset on my !SMT2 system.
> Will do this as soon as this problem is fixed
> 

Thanks, that would be appreciated as I do not have access to a !SMT2
system to do my own evaluation.

-- 
Mel Gorman
SUSE Labs


[PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-19 Thread Mel Gorman
From: Peter Zijlstra (Intel) 

Both select_idle_core() and select_idle_cpu() do a loop over the same
cpumask. Observe that by clearing the already visited CPUs, we can
fold the iteration and iterate a core at a time.

All we need to do is remember any non-idle CPU we encountered while
scanning for an idle core. This way we'll only iterate every CPU once.
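
In outline, the merged loop in select_idle_cpu() becomes the following
(a simplified sketch of the hunk below, with the SIS_PROP throttling
omitted):

	for_each_cpu_wrap(cpu, cpus, target) {
		if (smt) {
			/* Checks the whole core, clears its siblings from
			 * "cpus" and remembers any idle sibling in idle_cpu.
			 */
			i = select_idle_core(p, cpu, cpus, &idle_cpu);
			if ((unsigned int)i < nr_cpumask_bits)
				return i;	/* fully idle core */
		} else {
			i = __select_idle_cpu(cpu);
			if ((unsigned int)i < nr_cpumask_bits) {
				idle_cpu = i;	/* best available CPU */
				break;
			}
		}
	}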

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 101 ++--
 1 file changed, 61 insertions(+), 40 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12e08da90024..822851b39b65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
return new_cpu;
 }
 
+static inline int __select_idle_cpu(int cpu)
+{
+   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+   return cpu;
+
+   return -1;
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6066,40 +6074,34 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
int target)
+static int select_idle_core(struct task_struct *p, int core, struct cpumask 
*cpus, int *idle_cpu)
 {
-   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-   int core, cpu;
+   bool idle = true;
+   int cpu;
 
if (!static_branch_likely(&sched_smt_present))
-   return -1;
-
-   if (!test_idle_cores(target, false))
-   return -1;
-
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   return __select_idle_cpu(core);
 
-   for_each_cpu_wrap(core, cpus, target) {
-   bool idle = true;
-
-   for_each_cpu(cpu, cpu_smt_mask(core)) {
-   if (!available_idle_cpu(cpu)) {
-   idle = false;
-   break;
+   for_each_cpu(cpu, cpu_smt_mask(core)) {
+   if (!available_idle_cpu(cpu)) {
+   idle = false;
+   if (*idle_cpu == -1) {
+   if (sched_idle_cpu(cpu) && 
cpumask_test_cpu(cpu, p->cpus_ptr)) {
+   *idle_cpu = cpu;
+   break;
+   }
+   continue;
}
+   break;
}
-
-   if (idle)
-   return core;
-
-   cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+   if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+   *idle_cpu = cpu;
}
 
-   /*
-* Failed to find an idle core; stop looking for one.
-*/
-   set_idle_cores(target, 0);
+   if (idle)
+   return core;
 
+   cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
return -1;
 }
 
@@ -6107,9 +6109,18 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
 
 #define sched_smt_weight   1
 
-static inline int select_idle_core(struct task_struct *p, struct sched_domain 
*sd, int target)
+static inline void set_idle_cores(int cpu, int val)
 {
-   return -1;
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+   return def;
+}
+
+static inline int select_idle_core(struct task_struct *p, int core, struct 
cpumask *cpus, int *idle_cpu)
+{
+   return __select_idle_cpu(core);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6124,10 +6135,11 @@ static inline int select_idle_core(struct task_struct 
*p, struct sched_domain *s
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int 
target)
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+   int i, cpu, idle_cpu = -1, nr = INT_MAX;
+   bool smt = test_idle_cores(target, false);
+   int this = smp_processor_id();
struct sched_domain *this_sd;
u64 time;
-   int this = smp_processor_id();
-   int cpu, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-   if (sched_feat(SIS_PROP)) {
+   if (sched_feat(SIS_PROP) && !smt) {
u64 avg_cost, avg_idle, span_avg;
 
/*
@@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
}
 
for_each_cpu_wrap(cpu, cpus, target) {
-   if

[PATCH 4/5] sched/fair: Remove select_idle_smt()

2021-01-19 Thread Mel Gorman
From: Peter Zijlstra (Intel) 

In order to make the next patch more readable, and to quantify the
actual effectiveness of this pass, start by removing it.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 30 --
 1 file changed, 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0811e2fe4f19..12e08da90024 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6103,27 +6103,6 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
return -1;
 }
 
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int 
target)
-{
-   int cpu;
-
-   if (!static_branch_likely(&sched_smt_present))
-   return -1;
-
-   for_each_cpu(cpu, cpu_smt_mask(target)) {
-   if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
-   !cpumask_test_cpu(cpu, sched_domain_span(sd)))
-   continue;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-   return cpu;
-   }
-
-   return -1;
-}
-
 #else /* CONFIG_SCHED_SMT */
 
 #define sched_smt_weight   1
@@ -6133,11 +6112,6 @@ static inline int select_idle_core(struct task_struct 
*p, struct sched_domain *s
return -1;
 }
 
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain 
*sd, int target)
-{
-   return -1;
-}
-
 #endif /* CONFIG_SCHED_SMT */
 
 #define sis_min_cores  2
@@ -6331,10 +6305,6 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-   i = select_idle_smt(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
-
return target;
 }
 
-- 
2.26.2



[PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores

2021-01-19 Thread Mel Gorman
From: Peter Zijlstra (Intel) 

Instead of calculating how many (logical) CPUs to scan, compute how
many cores to scan.

This changes behaviour for anything !SMT2.
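
To make the !SMT2 note concrete, the minimum scan depth implied by the
change works out as below. This is a toy user-space calculation with
assumed SMT widths, not kernel code:

	#include <stdio.h>

	int main(void)
	{
		const int sis_min_cores = 2;
		const int smt_weight[] = { 1, 2, 4, 8 };	/* threads per core */

		for (int i = 0; i < 4; i++)
			printf("SMT width %d: minimum CPUs scanned = %d (previously 4)\n",
			       smt_weight[i], sis_min_cores * smt_weight[i]);
		return 0;
	}

For SMT2 the floor stays at 4 logical CPUs, matching the old behaviour;
non-SMT machines scan fewer CPUs at the floor and SMT4/SMT8 scan more.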

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Mel Gorman 
---
 kernel/sched/core.c  | 18 +-
 kernel/sched/fair.c  | 12 ++--
 kernel/sched/sched.h |  2 ++
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15d2562118d1..ada8faac2e4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7444,11 +7444,19 @@ int sched_cpu_activate(unsigned int cpu)
balance_push_set(cpu, false);
 
 #ifdef CONFIG_SCHED_SMT
-   /*
-* When going up, increment the number of cores with SMT present.
-*/
-   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
-   static_branch_inc_cpuslocked(&sched_smt_present);
+   do {
+   int weight = cpumask_weight(cpu_smt_mask(cpu));
+
+   if (weight > sched_smt_weight)
+   sched_smt_weight = weight;
+
+   /*
+* When going up, increment the number of cores with SMT 
present.
+*/
+   if (weight == 2)
+   static_branch_inc_cpuslocked(&sched_smt_present);
+
+   } while (0);
 #endif
set_cpu_active(cpu, true);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8d8e185cf3b..0811e2fe4f19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6010,6 +6010,8 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+int sched_smt_weight __read_mostly = 1;
+
 static inline void set_idle_cores(int cpu, int val)
 {
struct sched_domain_shared *sds;
@@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_struct *p, struct 
sched_domain *sd, int t
 
 #else /* CONFIG_SCHED_SMT */
 
+#define sched_smt_weight   1
+
 static inline int select_idle_core(struct task_struct *p, struct sched_domain 
*sd, int target)
 {
return -1;
@@ -6136,6 +6140,8 @@ static inline int select_idle_smt(struct task_struct *p, 
struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+#define sis_min_cores  2
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
avg_cost = this_sd->avg_scan_cost + 1;
 
span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   if (span_avg > sis_min_cores*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
-   nr = 4;
+   nr = sis_min_cores;
+
+   nr *= sched_smt_weight;
 
time = cpu_clock(this);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12ada79d40f3..29aabe98dd1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1107,6 +1107,8 @@ static inline void update_idle_core(struct rq *rq)
__update_idle_core(rq);
 }
 
+extern int sched_smt_weight;
+
 #else
 static inline void update_idle_core(struct rq *rq) { }
 #endif
-- 
2.26.2



[PATCH 2/5] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2021-01-19 Thread Mel Gorman
As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f5682aeda2e..c8d8e185cf3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (!this_sd)
return -1;
 
+   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
if (sched_feat(SIS_PROP)) {
u64 avg_cost, avg_idle, span_avg;
 
@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
nr = div_u64(span_avg, avg_cost);
else
nr = 4;
-   }
-
-   time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   time = cpu_clock(this);
+   }
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
break;
}
 
-   time = cpu_clock(this) - time;
-   update_avg(&this_sd->avg_scan_cost, time);
+   if (sched_feat(SIS_PROP)) {
+   time = cpu_clock(this) - time;
+   update_avg(&this_sd->avg_scan_cost, time);
+   }
 
return cpu;
 }
-- 
2.26.2



[PATCH 1/5] sched/fair: Remove SIS_AVG_CPU

2021-01-19 Thread Mel Gorman
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It was
a blunt instrument and disabled by commit 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by commit 1ad3aaf3fcd2 ("sched/core: Implement new approach
to scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.
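
For reference, the SIS_PROP depth that remains scales the number of CPUs
scanned with how idle this CPU has recently been relative to the average
scan cost. A toy user-space version of that calculation, with made-up
numbers:

	#include <stdio.h>

	/* nr = span_weight * (avg_idle / 512) / (avg_scan_cost + 1), floored at 4 */
	static unsigned long long sis_prop_nr(unsigned long long span_weight,
					      unsigned long long avg_idle_ns,
					      unsigned long long avg_scan_cost_ns)
	{
		unsigned long long avg_idle = avg_idle_ns / 512;
		unsigned long long avg_cost = avg_scan_cost_ns + 1;
		unsigned long long span_avg = span_weight * avg_idle;

		return span_avg > 4 * avg_cost ? span_avg / avg_cost : 4;
	}

	int main(void)
	{
		/* 64-CPU LLC; the nanosecond values are illustrative only */
		printf("mostly idle: nr = %llu\n", sis_prop_nr(64, 500000, 4000)); /* 15 */
		printf("mostly busy: nr = %llu\n", sis_prop_nr(64, 3000, 4000));   /*  4 */
		return 0;
	}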

Signed-off-by: Mel Gorman 
Reviewed-by: Vincent Guittot 
---
 kernel/sched/fair.c | 20 +---
 kernel/sched/features.h |  1 -
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..9f5682aeda2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
struct sched_domain *this_sd;
-   u64 avg_cost, avg_idle;
u64 time;
int this = smp_processor_id();
int cpu, nr = INT_MAX;
@@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
if (!this_sd)
return -1;
 
-   /*
-* Due to large variance we need a large fuzz factor; hackbench in
-* particularly is sensitive here.
-*/
-   avg_idle = this_rq()->avg_idle / 512;
-   avg_cost = this_sd->avg_scan_cost + 1;
+   if (sched_feat(SIS_PROP)) {
+   u64 avg_cost, avg_idle, span_avg;
 
-   if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-   return -1;
+   /*
+* Due to large variance we need a large fuzz factor;
+* hackbench in particularly is sensitive here.
+*/
+   avg_idle = this_rq()->avg_idle / 512;
+   avg_cost = this_sd->avg_scan_cost + 1;
 
-   if (sched_feat(SIS_PROP)) {
-   u64 span_avg = sd->span_weight * avg_idle;
+   span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2



[PATCH v3 0/5] Scan for an idle sibling in a single pass

2021-01-19 Thread Mel Gorman
Changelog since v2
o Remove unnecessary parameters
o Update nr during scan only when scanning for cpus

Changelog since v1
o Move extern declaration to header for coding style
o Remove unnecessary parameter from __select_idle_cpu

This series of 5 patches reposts three patches from Peter entitled
"select_idle_sibling() wreckage". It only scans the runqueues in a single
pass when searching for an idle sibling.

Two patches from Peter were dropped. The first patch altered how scan
depth was calculated. Scan depth selection is a random number generator
with two major limitations. The avg_idle time is based on the time
between a CPU going idle and being woken up clamped approximately by
2*sysctl_sched_migration_cost.  This is difficult to compare in a sensible
fashion to avg_scan_cost. The second issue is that only the avg_scan_cost
of scan failures is recorded and it does not decay.  This requires deeper
surgery that would justify a patch on its own although Peter notes that
https://lkml.kernel.org/r/20180530143105.977759...@infradead.org is
potentially useful for an alternative avg_idle metric.
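
To illustrate the clamping problem, the wakeup path maintains avg_idle
roughly as follows (paraphrased and simplified, so treat the exact names
as approximate):

	delta = rq_clock(rq) - rq->idle_stamp;
	update_avg(&rq->avg_idle, delta);		/* avg += (delta - avg) / 8 */
	if (rq->avg_idle > 2 * rq->max_idle_balance_cost)
		rq->avg_idle = 2 * rq->max_idle_balance_cost;	/* ~2 * sysctl_sched_migration_cost */

so avg_idle saturates at the clamp on lightly loaded machines and is not
updated at all while the CPU stays busy, which is what makes it awkward
to compare against avg_scan_cost.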

The second patch dropped converted the idle core scan throttling
mechanism to SIS_PROP. While this would unify the throttling of core
and CPU scanning, it was not free of regressions and has_idle_cores is
a fairly effective throttling mechanism with the caveat that it can have
a lot of false positives for workloads like hackbench.

Peter's series tried to solve three problems at once, this subset addresses
one problem. As with anything select_idle_sibling, it's a mix of wins and
losses but won more than it lost across a range of workloads and machines.

 kernel/sched/core.c |  18 +++--
 kernel/sched/fair.c | 161 
 kernel/sched/features.h |   1 -
 kernel/sched/sched.h|   2 +
 4 files changed, 95 insertions(+), 87 deletions(-)

-- 
2.26.2



Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-18 Thread Mel Gorman
On Mon, Jan 18, 2021 at 08:55:03PM +0800, Li, Aubrey wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 12e08da90024..6c0f841e9e75 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct 
> > sched_domain *sd, struct task_struct *p
> > return new_cpu;
> >  }
> >  
> > +static inline int __select_idle_cpu(struct task_struct *p, int core, 
> > struct cpumask *cpus)
> 
> Sorry if I missed anything, why p and cpus are needed here?
> 

They are not needed. The original code was matching the calling pattern
for select_idle_core() which needs p and cpus to check if sibling CPUs
are allowed.

> > @@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> >  
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >  
> > -   if (sched_feat(SIS_PROP)) {
> > +   if (sched_feat(SIS_PROP) && !smt) {
>
> Is it possible the system does have a idle core, but I still don't want to 
> scan the entire llc domain?
> 

This version is matching historical behaviour. To limit the scan for cores,
select_idle_core() would need to obey SIS_PROP and that patch was dropped
as it introduced regressions. It would only be considered once SIS_PROP
had better metrics for limiting the depth of the search.

> > u64 avg_cost, avg_idle, span_avg;
> >  
> > /*
> > @@ -6159,16 +6171,29 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> > for_each_cpu_wrap(cpu, cpus, target) {
> > if (!--nr)
> > return -1;
> 
> It looks like nr only makes sense when smt = false now, can it be moved into 
> else branch below?
> 

It can. I expect the saving to be marginal and it will need to move back
when/if select_idle_core() obeys SIS_PROP.

> > -   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > -   break;
> > +   if (smt) {
> > +   i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > +   if ((unsigned int)i < nr_cpumask_bits)
> > +   return i;
> 
> What if the last idle core is selected here, should we set_idle_cores false 
> before return?
> 

We'd have to check what bits were still set in the cpus mask and
determine if they represent an idle core. I severely doubt it would be
worth the cost given that the availability of idle cores can change at
any instant.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH] Remove redundant sched_numa_balancing check.

2021-01-18 Thread Mel Gorman
On Mon, Jan 18, 2021 at 09:32:18PM +1100, Imran Khan wrote:
> task_numa_fault is invoked from do_numa_page/do_huge_pmd_numa_page,
> for task_numa_work induced memory faults. task_numa_work is scheduled
> from task_tick_numa which is invoked only if sched_numa_balancing
> is true.
> 
> So task_numa_fault will not get invoked if sched_numa_balancing is
> false and hence we can avoid checking it again in task_numa_fault.
> 
> Signed-off-by: Imran Khan 

If NUMA balancing is disabled at runtime, there may still be PTEs that
are marked for NUMA balancing. While these still get handled at fault,
task_numa_fault() can still be called after sched_numa_balancing has been
disabled, and at that point there is no value in tracking the fault
information, so the check is not redundant.
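
That check is the early return at the top of task_numa_fault(), roughly
(paraphrasing the kernel source, so treat the exact form as approximate):

	void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
	{
		if (!static_branch_likely(&sched_numa_balancing))
			return;	/* still reachable via already-marked PTEs */

		/* ... NUMA hinting fault accounting ... */
	}

and that early return is what stops stale NUMA hinting faults from being
accounted once the feature is switched off.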

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm, compaction: move high_pfn to the for loop scope.

2021-01-18 Thread Mel Gorman
On Mon, Jan 18, 2021 at 03:41:26PM +0800, Rokudo Yan wrote:
> In fast_isolate_freepages, high_pfn will be used if a prefered one(PFN >= 
> low_fn) not found. But the high_pfn
> is not reset before searching an free area, so when it was used as freepage, 
> it may from another free area searched before.
> And move_freelist_head(freelist, freepage) will have unexpected behavior(eg. 
> corrupt the MOVABLE freelist)
> 
> Unable to handle kernel paging request at virtual address dead0200
> Mem abort info:
>   ESR = 0x9644
>   Exception class = DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
> Data abort info:
>   ISV = 0, ISS = 0x0044
>   CM = 0, WnR = 1
> [dead0200] address between user and kernel address ranges
> 
> -000|list_cut_before(inline)
> -000|move_freelist_head(inline)
> -000|fast_isolate_freepages(inline)
> -000|isolate_freepages(inline)
> -000|compaction_alloc(?, ?)
> -001|unmap_and_move(inline)
> -001|migrate_pages([NSD:0xFF80088CBBD0] from = 0xFF80088CBD88, 
> [NSD:0xFF80088CBBC8] get_new_p
> -002|__read_once_size(inline)
> -002|static_key_count(inline)
> -002|static_key_false(inline)
> -002|trace_mm_compaction_migratepages(inline)
> -002|compact_zone(?, [NSD:0xFF80088CBCB0] capc = 0x0)
> -003|kcompactd_do_work(inline)
> -003|kcompactd([X19] p = 0xFF93227FBC40)
> -004|kthread([X20] _create = 0xFFE1AFB26380)
> -005|ret_from_fork(asm)
> ---|end of frame
> 
> The issue was reported on an smart phone product with 6GB ram and 3GB zram as 
> swap device.
> 
> This patch fixes the issue by reset high_pfn before searching each free area, 
> which ensure
> freepage and freelist match when call move_freelist_head in 
> fast_isolate_freepages().
> 
> Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a 
> migration target")
> 
> Signed-off-by: Rokudo Yan 

Other than the Fixes line, I do not think this changed, so my previous
Ack still applies:

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores

2021-01-18 Thread Mel Gorman
On Mon, Jan 18, 2021 at 04:14:36PM +0800, Li, Aubrey wrote:
> > 
> > @@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_struct *p, 
> > struct sched_domain *sd, int t
> >  
> >  #else /* CONFIG_SCHED_SMT */
> >  
> > +#define sched_smt_weight   1
> > +
> >  static inline int select_idle_core(struct task_struct *p, struct 
> > sched_domain *sd, int target)
> >  {
> > return -1;
> >
> > 
> >
> > @@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> > avg_cost = this_sd->avg_scan_cost + 1;
> >  
> > span_avg = sd->span_weight * avg_idle;
> > -   if (span_avg > 4*avg_cost)
> > +   if (span_avg > sis_min_cores*avg_cost)
> > nr = div_u64(span_avg, avg_cost);
> > else
> > -   nr = 4;
> > +   nr = sis_min_cores;
> > +
> > +   nr *= sched_smt_weight;
> 
> Is it better to put this into an inline wrapper to hide sched_smt_weight if 
> !CONFIG_SCHED_SMT?
> 

There already is a #define sched_smt_weight for !CONFIG_SCHED_SMT and I
do not think an inline wrapper would make it more readable or maintainable.

-- 
Mel Gorman
SUSE Labs


[PATCH v2 0/5] Scan for an idle sibling in a single pass

2021-01-15 Thread Mel Gorman
Changelog since v1
o Move extern declaration to header for coding style
o Remove unnecessary parameter from __select_idle_cpu

This series of 5 patches reposts three patches from Peter entitled
"select_idle_sibling() wreckage". It only scans the runqueues in a single
pass when searching for an idle sibling.

Two patches from Peter were dropped. The first patch altered how scan
depth was calculated. Scan depth selection is a random number generator
with two major limitations. The avg_idle time is based on the time
between a CPU going idle and being woken up clamped approximately by
2*sysctl_sched_migration_cost.  This is difficult to compare in a sensible
fashion to avg_scan_cost. The second issue is that only the avg_scan_cost
of scan failures is recorded and it does not decay.  This requires deeper
surgery that would justify a patch on its own although Peter notes that
https://lkml.kernel.org/r/20180530143105.977759...@infradead.org is
potentially useful for an alternative avg_idle metric.

The second patch dropped converted the idle core scan throttling
mechanism to SIS_PROP. While this would unify the throttling of core
and CPU scanning, it was not free of regressions and has_idle_cores is
a fairly effective throttling mechanism with the caveat that it can have
a lot of false positives for workloads like hackbench.

Peter's series tried to solve three problems at once, this subset addresses
one problem. As with anything select_idle_sibling, it's a mix of wins and
losses but won more than it lost across a range of workloads and machines.

 kernel/sched/core.c |  18 +++--
 kernel/sched/fair.c | 157 
 kernel/sched/features.h |   1 -
 kernel/sched/sched.h|   2 +
 4 files changed, 93 insertions(+), 85 deletions(-)

-- 
2.26.2



[PATCH 1/5] sched/fair: Remove SIS_AVG_CPU

2021-01-15 Thread Mel Gorman
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It was
a blunt instrument and disabled by commit 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by commit 1ad3aaf3fcd2 ("sched/core: Implement new approach
to scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman 
Reviewed-by: Vincent Guittot 
---
 kernel/sched/fair.c | 20 +---
 kernel/sched/features.h |  1 -
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..9f5682aeda2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
struct sched_domain *this_sd;
-   u64 avg_cost, avg_idle;
u64 time;
int this = smp_processor_id();
int cpu, nr = INT_MAX;
@@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
if (!this_sd)
return -1;
 
-   /*
-* Due to large variance we need a large fuzz factor; hackbench in
-* particularly is sensitive here.
-*/
-   avg_idle = this_rq()->avg_idle / 512;
-   avg_cost = this_sd->avg_scan_cost + 1;
+   if (sched_feat(SIS_PROP)) {
+   u64 avg_cost, avg_idle, span_avg;
 
-   if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-   return -1;
+   /*
+* Due to large variance we need a large fuzz factor;
+* hackbench in particularly is sensitive here.
+*/
+   avg_idle = this_rq()->avg_idle / 512;
+   avg_cost = this_sd->avg_scan_cost + 1;
 
-   if (sched_feat(SIS_PROP)) {
-   u64 span_avg = sd->span_weight * avg_idle;
+   span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2



[PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores

2021-01-15 Thread Mel Gorman
From: Peter Zijlstra (Intel) 

Instead of calculating how many (logical) CPUs to scan, compute how
many cores to scan.

This changes behaviour for anything !SMT2.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Mel Gorman 
---
 kernel/sched/core.c  | 18 +-
 kernel/sched/fair.c  | 12 ++--
 kernel/sched/sched.h |  2 ++
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15d2562118d1..ada8faac2e4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7444,11 +7444,19 @@ int sched_cpu_activate(unsigned int cpu)
balance_push_set(cpu, false);
 
 #ifdef CONFIG_SCHED_SMT
-   /*
-* When going up, increment the number of cores with SMT present.
-*/
-   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
-   static_branch_inc_cpuslocked(&sched_smt_present);
+   do {
+   int weight = cpumask_weight(cpu_smt_mask(cpu));
+
+   if (weight > sched_smt_weight)
+   sched_smt_weight = weight;
+
+   /*
+* When going up, increment the number of cores with SMT 
present.
+*/
+   if (weight == 2)
+   static_branch_inc_cpuslocked(&sched_smt_present);
+
+   } while (0);
 #endif
set_cpu_active(cpu, true);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8d8e185cf3b..0811e2fe4f19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6010,6 +6010,8 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+int sched_smt_weight __read_mostly = 1;
+
 static inline void set_idle_cores(int cpu, int val)
 {
struct sched_domain_shared *sds;
@@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_struct *p, struct 
sched_domain *sd, int t
 
 #else /* CONFIG_SCHED_SMT */
 
+#define sched_smt_weight   1
+
 static inline int select_idle_core(struct task_struct *p, struct sched_domain 
*sd, int target)
 {
return -1;
@@ -6136,6 +6140,8 @@ static inline int select_idle_smt(struct task_struct *p, 
struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+#define sis_min_cores  2
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
avg_cost = this_sd->avg_scan_cost + 1;
 
span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   if (span_avg > sis_min_cores*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
-   nr = 4;
+   nr = sis_min_cores;
+
+   nr *= sched_smt_weight;
 
time = cpu_clock(this);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12ada79d40f3..29aabe98dd1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1107,6 +1107,8 @@ static inline void update_idle_core(struct rq *rq)
__update_idle_core(rq);
 }
 
+extern int sched_smt_weight;
+
 #else
 static inline void update_idle_core(struct rq *rq) { }
 #endif
-- 
2.26.2



[PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-15 Thread Mel Gorman
From: Peter Zijlstra (Intel) 

Both select_idle_core() and select_idle_cpu() do a loop over the same
cpumask. Observe that by clearing the already visited CPUs, we can
fold the iteration and iterate a core at a time.

All we need to do is remember any non-idle CPU we encountered while
scanning for an idle core. This way we'll only iterate every CPU once.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 97 +++--
 1 file changed, 59 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12e08da90024..6c0f841e9e75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
return new_cpu;
 }
 
+static inline int __select_idle_cpu(struct task_struct *p, int core, struct 
cpumask *cpus)
+{
+   if (available_idle_cpu(core) || sched_idle_cpu(core))
+   return core;
+
+   return -1;
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6066,40 +6074,34 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
int target)
+static int select_idle_core(struct task_struct *p, int core, struct cpumask 
*cpus, int *idle_cpu)
 {
-   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-   int core, cpu;
+   bool idle = true;
+   int cpu;
 
if (!static_branch_likely(&sched_smt_present))
-   return -1;
-
-   if (!test_idle_cores(target, false))
-   return -1;
-
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   return __select_idle_cpu(p, core, cpus);
 
-   for_each_cpu_wrap(core, cpus, target) {
-   bool idle = true;
-
-   for_each_cpu(cpu, cpu_smt_mask(core)) {
-   if (!available_idle_cpu(cpu)) {
-   idle = false;
-   break;
+   for_each_cpu(cpu, cpu_smt_mask(core)) {
+   if (!available_idle_cpu(cpu)) {
+   idle = false;
+   if (*idle_cpu == -1) {
+   if (sched_idle_cpu(cpu) && 
cpumask_test_cpu(cpu, p->cpus_ptr)) {
+   *idle_cpu = cpu;
+   break;
+   }
+   continue;
}
+   break;
}
-
-   if (idle)
-   return core;
-
-   cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+   if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+   *idle_cpu = cpu;
}
 
-   /*
-* Failed to find an idle core; stop looking for one.
-*/
-   set_idle_cores(target, 0);
+   if (idle)
+   return core;
 
+   cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
return -1;
 }
 
@@ -6107,9 +6109,18 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
 
 #define sched_smt_weight   1
 
-static inline int select_idle_core(struct task_struct *p, struct sched_domain 
*sd, int target)
+static inline void set_idle_cores(int cpu, int val)
 {
-   return -1;
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+   return def;
+}
+
+static inline int select_idle_core(struct task_struct *p, int core, struct 
cpumask *cpus, int *idle_cpu)
+{
+   return __select_idle_cpu(p, core, cpus);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6124,10 +6135,11 @@ static inline int select_idle_core(struct task_struct 
*p, struct sched_domain *s
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int 
target)
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+   int i, cpu, idle_cpu = -1, nr = INT_MAX;
+   bool smt = test_idle_cores(target, false);
+   int this = smp_processor_id();
struct sched_domain *this_sd;
u64 time;
-   int this = smp_processor_id();
-   int cpu, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-   if (sched_feat(SIS_PROP)) {
+   if (sched_feat(SIS_PROP) && !smt) {
u64 avg_cost, avg_idle, span_avg;
 
/*
@@ -6159,16 +6171,29 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
 

[PATCH 2/5] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2021-01-15 Thread Mel Gorman
As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f5682aeda2e..c8d8e185cf3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (!this_sd)
return -1;
 
+   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
if (sched_feat(SIS_PROP)) {
u64 avg_cost, avg_idle, span_avg;
 
@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
nr = div_u64(span_avg, avg_cost);
else
nr = 4;
-   }
-
-   time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   time = cpu_clock(this);
+   }
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
break;
}
 
-   time = cpu_clock(this) - time;
-   update_avg(&this_sd->avg_scan_cost, time);
+   if (sched_feat(SIS_PROP)) {
+   time = cpu_clock(this) - time;
+   update_avg(&this_sd->avg_scan_cost, time);
+   }
 
return cpu;
 }
-- 
2.26.2



[PATCH 4/5] sched/fair: Remove select_idle_smt()

2021-01-15 Thread Mel Gorman
From: Peter Zijlstra (Intel) 

In order to make the next patch more readable, and to quantify the
actual effectiveness of this pass, start by removing it.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 30 --
 1 file changed, 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0811e2fe4f19..12e08da90024 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6103,27 +6103,6 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
return -1;
 }
 
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int 
target)
-{
-   int cpu;
-
-   if (!static_branch_likely(&sched_smt_present))
-   return -1;
-
-   for_each_cpu(cpu, cpu_smt_mask(target)) {
-   if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
-   !cpumask_test_cpu(cpu, sched_domain_span(sd)))
-   continue;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-   return cpu;
-   }
-
-   return -1;
-}
-
 #else /* CONFIG_SCHED_SMT */
 
 #define sched_smt_weight   1
@@ -6133,11 +6112,6 @@ static inline int select_idle_core(struct task_struct 
*p, struct sched_domain *s
return -1;
 }
 
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain 
*sd, int target)
-{
-   return -1;
-}
-
 #endif /* CONFIG_SCHED_SMT */
 
 #define sis_min_cores  2
@@ -6331,10 +6305,6 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-   i = select_idle_smt(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
-
return target;
 }
 
-- 
2.26.2



Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-14 Thread Mel Gorman
On Thu, Jan 14, 2021 at 04:44:46PM +0100, Vincent Guittot wrote:
> > domain. There is no need to make it specific to the core account and we
> > are already doing the full scan. Throttling that would be a separate patch.
> >
> > > This patch 5 should focus on merging select_idle_core and
> > > select_idle_cpu so we keep (almost) the same behavior but each CPU is
> > > checked only once.
> > >
> >
> > Which I think it's already doing. Main glitch really is that
> > __select_idle_cpu() shouldn't be taking *idle_cpu as it does not consume
> > the information.
> 
> I don't really like the if (smt) else in the for_each_cpu_wrap(cpu,
> cpus, target) loop  because it just looks like we fail to merge idle
> core and idle cpu search loop at the end.
> 

While it's not the best, I did at one point have a series that fully
unified this function and it wasn't pretty.

> But there is probably not much we can do without changing what is
> accounted idle core  search in the avg_scan_cost
> 

Indeed. Maybe in the future it'll make more sense to consolidate it
further, but between the depth search and possibly using SIS_PROP for
core searches, we've bigger fish to fry.

Current delta between this series and what is being tested is simply

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bfa73de6a8d..ada8faac2e4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7446,7 +7446,6 @@ int sched_cpu_activate(unsigned int cpu)
 #ifdef CONFIG_SCHED_SMT
do {
int weight = cpumask_weight(cpu_smt_mask(cpu));
-   extern int sched_smt_weight;
 
if (weight > sched_smt_weight)
sched_smt_weight = weight;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84f02abb29e3..6c0f841e9e75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,7 +6006,7 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
return new_cpu;
 }
 
-static inline int __select_idle_cpu(struct task_struct *p, int core, struct 
cpumask *cpus, int *idle_cpu)
+static inline int __select_idle_cpu(struct task_struct *p, int core, struct 
cpumask *cpus)
 {
if (available_idle_cpu(core) || sched_idle_cpu(core))
return core;
@@ -6080,7 +6080,7 @@ static int select_idle_core(struct task_struct *p, int 
core, struct cpumask *cpu
int cpu;
 
if (!static_branch_likely(&sched_smt_present))
-   return __select_idle_cpu(p, core, cpus, idle_cpu);
+   return __select_idle_cpu(p, core, cpus);
 
for_each_cpu(cpu, cpu_smt_mask(core)) {
if (!available_idle_cpu(cpu)) {
@@ -6120,7 +6120,7 @@ static inline bool test_idle_cores(int cpu, bool def)
 
 static inline int select_idle_core(struct task_struct *p, int core, struct 
cpumask *cpus, int *idle_cpu)
 {
-   return __select_idle_cpu(p, core, cpus, idle_cpu);
+   return __select_idle_cpu(p, core, cpus);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6177,7 +6177,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
return i;
 
} else {
-   i = __select_idle_cpu(p, cpu, cpus, &idle_cpu);
+   i = __select_idle_cpu(p, cpu, cpus);
if ((unsigned int)i < nr_cpumask_bits) {
idle_cpu = i;
break;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12ada79d40f3..29aabe98dd1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1107,6 +1107,8 @@ static inline void update_idle_core(struct rq *rq)
__update_idle_core(rq);
 }
 
+extern int sched_smt_weight;
+
 #else
 static inline void update_idle_core(struct rq *rq) { }
 #endif
-- 
Mel Gorman
SUSE Labs


Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-14 Thread Mel Gorman
On Thu, Jan 14, 2021 at 02:25:32PM +0100, Vincent Guittot wrote:
> On Thu, 14 Jan 2021 at 10:35, Mel Gorman  wrote:
> >
> > On Wed, Jan 13, 2021 at 06:03:00PM +0100, Vincent Guittot wrote:
> > > > @@ -6159,16 +6171,29 @@ static int select_idle_cpu(struct task_struct 
> > > > *p, struct sched_domain *sd, int t
> > > > for_each_cpu_wrap(cpu, cpus, target) {
> > > > if (!--nr)
> > > > return -1;
> > > > -   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > > > -   break;
> > > > +   if (smt) {
> > >
> > > If we want to stay on something similar to the previous behavior, we
> > > want to check on all cores if test_idle_cores is true so nr should be
> > > set to number of cores
> > >
> >
> > I don't think we necessarily want to do that. has_idle_cores is an
> > effective throttling mechanism but it's not perfect. If the full domain
> > is always scanned for a core then there can be excessive scanning in
> 
> But that's what the code is currently doing. Can this change be done
> in another patch so we can check the impact of each change more
> easily?

Ok, when looking at this again instead of just the mail, the flow is:

int i, cpu, idle_cpu = -1, nr = INT_MAX;
...
if (sched_feat(SIS_PROP) && !smt) {
/* recalculate nr */
}

The !smt check should mean that core scanning is still scanning the entire
domain. There is no need to make it specific to the core account and we
are already doing the full scan. Throttling that would be a separate patch.

> This patch 5 should focus on merging select_idle_core and
> select_idle_cpu so we keep (almost) the same behavior but each CPU is
> checked only once.
> 

Which I think it's already doing. Main glitch really is that
__select_idle_cpu() shouldn't be taking *idle_cpu as it does not consume
the information.

> > workloads like hackbench which tends to have has_idle_cores return false
> > positives. It becomes important once average busy CPUs is over half of
> > the domain for SMT2.
> >
> > At least with the patch if that change was made, we still would not scan
> > twice going over the same runqueues so it would still be an improvement
> 
> yeah, it's for me the main goal of this patchset with the calculation
> of avg_scan_cost being done only when SIS_PROP is true and the removal
> of SIS_AVG
> 
> any changes in the number of cpu/core to loop on is sensitive to
> regression and should be done in a separate patch IMHO
> 

Understood.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-14 Thread Mel Gorman
On Wed, Jan 13, 2021 at 06:03:00PM +0100, Vincent Guittot wrote:
> > @@ -6159,16 +6171,29 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> > for_each_cpu_wrap(cpu, cpus, target) {
> > if (!--nr)
> > return -1;
> > -   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > -   break;
> > +   if (smt) {
> 
> If we want to stay on something similar to the previous behavior, we
> want to check on all cores if test_idle_cores is true so nr should be
> set to number of cores
> 

I don't think we necessarily want to do that. has_idle_cores is an
effective throttling mechanism but it's not perfect. If the full domain
is always scanned for a core then there can be excessive scanning in
workloads like hackbench which tends to have has_idle_cores return false
positives. It becomes important once average busy CPUs is over half of
the domain for SMT2.

At least with the patch if that change was made, we still would not scan
twice going over the same runqueues so it would still be an improvement
but it would be nice to avoid excessive deep scanning when there are a
lot of busy CPUs but individual tasks are rapidly idling.

However, in the event regressions are found, changing to your suggested
behaviour would be an obvious starting point.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores

2021-01-14 Thread Mel Gorman
On Wed, Jan 13, 2021 at 05:49:58PM +0100, Vincent Guittot wrote:
> > @@ -7444,11 +7444,20 @@ int sched_cpu_activate(unsigned int cpu)
> > balance_push_set(cpu, false);
> >
> >  #ifdef CONFIG_SCHED_SMT
> > -   /*
> > -* When going up, increment the number of cores with SMT present.
> > -*/
> > -   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> > -   static_branch_inc_cpuslocked(&sched_smt_present);
> > +   do {
> > +   int weight = cpumask_weight(cpu_smt_mask(cpu));
> > +   extern int sched_smt_weight;
> 
> coding style problem
> 

Presumably you are referring to an extern defined in a C file. That can
move to kernel/sched/sched.h in this patch.

> > 
> >  /*
> >   * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> >   * comparing the average scan cost (tracked in sd->avg_scan_cost) against 
> > the
> > @@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> > avg_cost = this_sd->avg_scan_cost + 1;
> >
> > span_avg = sd->span_weight * avg_idle;
> > -   if (span_avg > 4*avg_cost)
> > +   if (span_avg > sis_min_cores*avg_cost)
> > nr = div_u64(span_avg, avg_cost);
> > else
> > -   nr = 4;
> > +   nr = sis_min_cores;
> > +
> > +   nr *= sched_smt_weight;
> 
> Also,  patch 5 will look at all CPUs of a core in select_idle_core so
> nr will decrement by 1 per core so i don't see the need to multiply by
> sched_smt_weight one patch 5 is applied
> 

It makes sense in the context of this patch but can be removed again in
the last patch, at which point I think sched_smt_weight only exists in core.c.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm, compaction: move high_pfn to the for loop scope.

2021-01-12 Thread Mel Gorman
On Tue, Jan 12, 2021 at 05:47:20PM +0800, Rokudo Yan wrote:
> In fast_isolate_freepages, high_pfn will be used if a prefered one(PFN >= 
> low_fn) not found. But the high_pfn
> is not reset before searching an free area, so when it was used as freepage, 
> it may from another free area searched before.
> And move_freelist_head(freelist, freepage) will have unexpected behavior(eg. 
> corrupt the MOVABLE freelist)
> 
> Unable to handle kernel paging request at virtual address dead0200
> Mem abort info:
>   ESR = 0x9644
>   Exception class = DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
> Data abort info:
>   ISV = 0, ISS = 0x0044
>   CM = 0, WnR = 1
> [dead0200] address between user and kernel address ranges
> 
> -000|list_cut_before(inline)
> -000|move_freelist_head(inline)
> -000|fast_isolate_freepages(inline)
> -000|isolate_freepages(inline)
> -000|compaction_alloc(?, ?)
> -001|unmap_and_move(inline)
> -001|migrate_pages([NSD:0xFF80088CBBD0] from = 0xFF80088CBD88, 
> [NSD:0xFF80088CBBC8] get_new_p
> -002|__read_once_size(inline)
> -002|static_key_count(inline)
> -002|static_key_false(inline)
> -002|trace_mm_compaction_migratepages(inline)
> -002|compact_zone(?, [NSD:0xFF80088CBCB0] capc = 0x0)
> -003|kcompactd_do_work(inline)
> -003|kcompactd([X19] p = 0xFF93227FBC40)
> -004|kthread([X20] _create = 0xFFE1AFB26380)
> -005|ret_from_fork(asm)
> ---|end of frame
> 
> The issue was reported on an smart phone product with 6GB ram and 3GB zram as 
> swap device.
> 
> This patch fixes the issue by reset high_pfn before searching each free area, 
> which ensure
> freepage and freelist match when call move_freelist_head in 
> fast_isolate_freepages().
> 
> Link: 
> http://lkml.kernel.org/r/20190118175136.31341-12-mgor...@techsingularity.net
> Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a 
> migration target")

Thanks.

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 3/3 v2] sched/fair: reduce cases for active balance

2021-01-12 Thread Mel Gorman
On Thu, Jan 07, 2021 at 11:33:25AM +0100, Vincent Guittot wrote:
> Active balance is triggered for a number of voluntary cases like misfit
> or pinned tasks cases but also after that a number of load balance
> attempts failed to migrate a task. There is no need to use active load
> balance when the group is overloaded because an overloaded state means
> that there is at least one waiting task. Nevertheless, the waiting task
> is not selected and detached until the threshold becomes higher than its
> load. This threshold increases with the number of failed lb (see the
> condition if ((load >> env->sd->nr_balance_failed) > env->imbalance) in
> detach_tasks()) and the waiting task will end up to be selected after a
> number of attempts.
> 
> Signed-off-by: Vincent Guittot 

I didn't see major problems, so:

Acked-by: Mel Gorman 

Like you, I do not see significant performance differences, either
positive or negative (minor gains and losses that are borderline).
However, I didn't track the stats necessary to see what impact it had on
alb_* stats which is an oversight but schedstat tracking interferes with
results on its own. It would have been nice to see how often this case
is even hit in the overloaded case.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm, compaction: make sure we isolate a valid freepage when high_pfn is used

2021-01-12 Thread Mel Gorman
On Tue, Jan 12, 2021 at 01:19:55PM +0800, Rokudo Yan wrote:
> In fast_isolate_freepages, high_pfn will be used if a prefered one(PFN >= 
> low_fn) not found. But the high_pfn
> is not reset before searching an free area, so when it was used as freepage, 
> it may from another free area searched before.
> And move_freelist_head(freelist, freepage) will have unexpected behavior(eg. 
> corrupt the MOVABLE freelist)
> 
> Unable to handle kernel paging request at virtual address dead0200
> Mem abort info:
>   ESR = 0x9644
>   Exception class = DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
> Data abort info:
>   ISV = 0, ISS = 0x0044
>   CM = 0, WnR = 1
> [dead0200] address between user and kernel address ranges
> 
> -000|list_cut_before(inline)
> -000|move_freelist_head(inline)
> -000|fast_isolate_freepages(inline)
> -000|isolate_freepages(inline)
> -000|compaction_alloc(?, ?)
> -001|unmap_and_move(inline)
> -001|migrate_pages([NSD:0xFF80088CBBD0] from = 0xFF80088CBD88, 
> [NSD:0xFF80088CBBC8] get_new_p
> -002|__read_once_size(inline)
> -002|static_key_count(inline)
> -002|static_key_false(inline)
> -002|trace_mm_compaction_migratepages(inline)
> -002|compact_zone(?, [NSD:0xFF80088CBCB0] capc = 0x0)
> -003|kcompactd_do_work(inline)
> -003|kcompactd([X19] p = 0xFF93227FBC40)
> -004|kthread([X20] _create = 0xFFE1AFB26380)
> -005|ret_from_fork(asm)
> ---|end of frame
> 
> The issue was reported on an smart phone product with 6GB ram and 3GB zram as 
> swap device.
> 
> This patch fixes the issue by reset high_pfn before searching each free area, 
> which ensure
> freepage and freelist match when call move_freelist_head in 
> fast_isolate_freepages().
> 
> Link: 
> http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poul...@arm.com
> Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a 
> migration target")

In that case, please move the high_pfn declaration and initialiser within
the for (order = cc->order - 1;...) loop to make it explicit what the
scope of high_pfn is.
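
That is, something along these lines in fast_isolate_freepages() (a
sketch only, untested):

	for (order = cc->order - 1;
	     order >= 0 && !page;
	     order--) {
		unsigned long high_pfn = 0;	/* reset for each free area */
		struct free_area *area = &cc->zone->free_area[order];

		/* ... existing search of this free area unchanged ... */
	}

so a stale high_pfn from a previously searched free area can never be
handed to move_freelist_head() for the current one.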

-- 
Mel Gorman
SUSE Labs


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2021-01-11 Thread Mel Gorman
On Mon, Jan 11, 2021 at 03:36:57PM +0100, Vincent Guittot wrote:
> > > 
> > >
> > > I think
> > > that we should decay it periodically to reflect there is less and less
> > > idle time (in fact no more)  on this busy CPU that never goes to idle.
> > > If a cpu was idle for a long period but then a long running task
> > > starts, the avg_idle will stay stalled to the large value which is
> > > becoming less and less relevant.
> >
> > While I get what you're saying, it does not help extrapolate what the
> > idleness of a domain is.
> 
> not but it gives a more up to date view of the idleness of the local
> cpu which is better than a stalled value
> 

Fair enough.

> >
> > > At the opposite, a cpu with a short running/idle period task will have
> > > a lower avg_idle whereas it is more often idle.
> > >
> > > Another thing that worries me, is that we use the avg_idle of the
> > > local cpu, which is obviously not idle otherwise it would have been
> > > selected, to decide how much time we should spend on looking for
> > > another idle CPU. I'm not sure that's the right metrics to use
> > > especially with a possibly stalled value.
> > >
> >
> > A better estimate requires heavy writes to sd_llc. The cost of that will
> > likely offset any benefit gained by a superior selection of a scan
> > depth.
> >
> > Treating a successful scan cost and a failed scan cost as being equal has
> > too many corner cases. If we do not want to weight the successful scan
> > cost, then the compromise is to keep the old behaviour that accounts for
> 
> I think that keeping the current way to account scan_cost is the best option for now
> 

I sent a series that drops this patch for the moment as well as the
SIS_PROP for selecting a core.

-- 
Mel Gorman
SUSE Labs


[PATCH 4/5] sched/fair: Remove select_idle_smt()

2021-01-11 Thread Mel Gorman
From: Peter Zijlstra (Intel) 

In order to make the next patch more readable, and to quantify the
actual effectiveness of this pass, start by removing it.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 30 --
 1 file changed, 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0811e2fe4f19..12e08da90024 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6103,27 +6103,6 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
return -1;
 }
 
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int 
target)
-{
-   int cpu;
-
-   if (!static_branch_likely(_smt_present))
-   return -1;
-
-   for_each_cpu(cpu, cpu_smt_mask(target)) {
-   if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
-   !cpumask_test_cpu(cpu, sched_domain_span(sd)))
-   continue;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-   return cpu;
-   }
-
-   return -1;
-}
-
 #else /* CONFIG_SCHED_SMT */
 
 #define sched_smt_weight   1
@@ -6133,11 +6112,6 @@ static inline int select_idle_core(struct task_struct 
*p, struct sched_domain *s
return -1;
 }
 
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain 
*sd, int target)
-{
-   return -1;
-}
-
 #endif /* CONFIG_SCHED_SMT */
 
 #define sis_min_cores  2
@@ -6331,10 +6305,6 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-   i = select_idle_smt(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
-
return target;
 }
 
-- 
2.26.2



[PATCH 3/5] sched/fair: Make select_idle_cpu() proportional to cores

2021-01-11 Thread Mel Gorman
From: Peter Zijlstra (Intel) 

Instead of calculating how many (logical) CPUs to scan, compute how
many cores to scan.

This changes behaviour for anything !SMT2.
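
As a worked example (numbers illustrative): with the minimum depth of
sis_min_cores = 2 cores, an SMT2 machine ends up with 2 * 2 = 4 logical CPUs,
matching the old floor of 4; an SMT4 machine gets 2 * 4 = 8 and a non-SMT
machine gets 2 * 1 = 2, which is why only !SMT2 topologies see a change.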

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Mel Gorman 
---
 kernel/sched/core.c | 19 ++-
 kernel/sched/fair.c | 12 ++--
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15d2562118d1..7bfa73de6a8d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7444,11 +7444,20 @@ int sched_cpu_activate(unsigned int cpu)
balance_push_set(cpu, false);
 
 #ifdef CONFIG_SCHED_SMT
-   /*
-* When going up, increment the number of cores with SMT present.
-*/
-   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
-   static_branch_inc_cpuslocked(&sched_smt_present);
+   do {
+   int weight = cpumask_weight(cpu_smt_mask(cpu));
+   extern int sched_smt_weight;
+
+   if (weight > sched_smt_weight)
+   sched_smt_weight = weight;
+
+   /*
+* When going up, increment the number of cores with SMT 
present.
+*/
+   if (weight == 2)
+   static_branch_inc_cpuslocked(&sched_smt_present);
+
+   } while (0);
 #endif
set_cpu_active(cpu, true);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8d8e185cf3b..0811e2fe4f19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6010,6 +6010,8 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+int sched_smt_weight __read_mostly = 1;
+
 static inline void set_idle_cores(int cpu, int val)
 {
struct sched_domain_shared *sds;
@@ -6124,6 +6126,8 @@ static int select_idle_smt(struct task_struct *p, struct 
sched_domain *sd, int t
 
 #else /* CONFIG_SCHED_SMT */
 
+#define sched_smt_weight   1
+
 static inline int select_idle_core(struct task_struct *p, struct sched_domain 
*sd, int target)
 {
return -1;
@@ -6136,6 +6140,8 @@ static inline int select_idle_smt(struct task_struct *p, 
struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+#define sis_min_cores  2
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6166,10 +6172,12 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
avg_cost = this_sd->avg_scan_cost + 1;
 
span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   if (span_avg > sis_min_cores*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
-   nr = 4;
+   nr = sis_min_cores;
+
+   nr *= sched_smt_weight;
 
time = cpu_clock(this);
}
-- 
2.26.2



[PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-11 Thread Mel Gorman
Both select_idle_core() and select_idle_cpu() do a loop over the same
cpumask. Observe that by clearing the already visited CPUs, we can
fold the iteration and iterate a core at a time.

All we need to do is remember any non-idle CPU we encountered while
scanning for an idle core. This way we'll only iterate every CPU once.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 97 +++--
 1 file changed, 59 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12e08da90024..84f02abb29e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
return new_cpu;
 }
 
+static inline int __select_idle_cpu(struct task_struct *p, int core, struct 
cpumask *cpus, int *idle_cpu)
+{
+   if (available_idle_cpu(core) || sched_idle_cpu(core))
+   return core;
+
+   return -1;
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6066,40 +6074,34 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
int target)
+static int select_idle_core(struct task_struct *p, int core, struct cpumask 
*cpus, int *idle_cpu)
 {
-   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-   int core, cpu;
+   bool idle = true;
+   int cpu;
 
if (!static_branch_likely(&sched_smt_present))
-   return -1;
-
-   if (!test_idle_cores(target, false))
-   return -1;
-
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   return __select_idle_cpu(p, core, cpus, idle_cpu);
 
-   for_each_cpu_wrap(core, cpus, target) {
-   bool idle = true;
-
-   for_each_cpu(cpu, cpu_smt_mask(core)) {
-   if (!available_idle_cpu(cpu)) {
-   idle = false;
-   break;
+   for_each_cpu(cpu, cpu_smt_mask(core)) {
+   if (!available_idle_cpu(cpu)) {
+   idle = false;
+   if (*idle_cpu == -1) {
+   if (sched_idle_cpu(cpu) && 
cpumask_test_cpu(cpu, p->cpus_ptr)) {
+   *idle_cpu = cpu;
+   break;
+   }
+   continue;
}
+   break;
}
-
-   if (idle)
-   return core;
-
-   cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+   if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+   *idle_cpu = cpu;
}
 
-   /*
-* Failed to find an idle core; stop looking for one.
-*/
-   set_idle_cores(target, 0);
+   if (idle)
+   return core;
 
+   cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
return -1;
 }
 
@@ -6107,9 +6109,18 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
 
 #define sched_smt_weight   1
 
-static inline int select_idle_core(struct task_struct *p, struct sched_domain 
*sd, int target)
+static inline void set_idle_cores(int cpu, int val)
 {
-   return -1;
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+   return def;
+}
+
+static inline int select_idle_core(struct task_struct *p, int core, struct 
cpumask *cpus, int *idle_cpu)
+{
+   return __select_idle_cpu(p, core, cpus, idle_cpu);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6124,10 +6135,11 @@ static inline int select_idle_core(struct task_struct 
*p, struct sched_domain *s
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int 
target)
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+   int i, cpu, idle_cpu = -1, nr = INT_MAX;
+   bool smt = test_idle_cores(target, false);
+   int this = smp_processor_id();
struct sched_domain *this_sd;
u64 time;
-   int this = smp_processor_id();
-   int cpu, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-   if (sched_feat(SIS_PROP)) {
+   if (sched_feat(SIS_PROP) && !smt) {
u64 avg_cost, avg_idle, span_avg;
 
/*
@@ -6159,16 +6171,29 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
 

[PATCH 0/5] Scan for an idle sibling in a single pass

2021-01-11 Thread Mel Gorman
This series of 5 patches reposts three patches from Peter entitled
"select_idle_sibling() wreckage". It only scans the runqueues in a single
pass when searching for an idle sibling.

Two patches from Peter were dropped. The first patch altered how scan
depth was calculated. Scan depth selection is essentially a random number generator
with two major limitations. The avg_idle time is based on the time
between a CPU going idle and being woken up clamped approximately by
2*sysctl_sched_migration_cost.  This is difficult to compare in a sensible
fashion to avg_scan_cost. The second issue is that only the avg_scan_cost
of scan failures is recorded and it does not decay.  This requires deeper
surgery that would justify a patch on its own although Peter notes that
https://lkml.kernel.org/r/20180530143105.977759...@infradead.org is
potentially useful for an alternative avg_idle metric.
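
To make that mismatch concrete (all numbers assumed purely for illustration):
with the default sysctl_sched_migration_cost of 500000ns, avg_idle saturates
at roughly 1ms. select_idle_cpu() then scales it by 1/512 and by
sd->span_weight, so on a 40-CPU LLC the span budget works out at about
(1000000 / 512) * 40 ~= 78000, which is then divided by an avg_scan_cost that
is a raw nanosecond measurement and never decays. Two values measured on
different scales and updated under different rules end up deciding the scan
depth, which is why the result behaves like a random number generator.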

The second dropped patch converted the idle core scan throttling
mechanism to SIS_PROP. While this would unify the throttling of core
and CPU scanning, it was not free of regressions and has_idle_cores is
a fairly effective throttling mechanism with the caveat that it can have
a lot of false positives for workloads like hackbench.

Peter's series tried to solve three problems at once; this subset addresses
one of them. As with anything select_idle_sibling, it's a mix of wins and
losses but won more than it lost across a range of workloads and machines.

 kernel/sched/core.c |  19 +++--
 kernel/sched/fair.c | 157 
 kernel/sched/features.h |   1 -
 3 files changed, 92 insertions(+), 85 deletions(-)

-- 
2.26.2



[PATCH 2/5] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2021-01-11 Thread Mel Gorman
As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f5682aeda2e..c8d8e185cf3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (!this_sd)
return -1;
 
+   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
if (sched_feat(SIS_PROP)) {
u64 avg_cost, avg_idle, span_avg;
 
@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
nr = div_u64(span_avg, avg_cost);
else
nr = 4;
-   }
-
-   time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   time = cpu_clock(this);
+   }
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
break;
}
 
-   time = cpu_clock(this) - time;
-   update_avg(&this_sd->avg_scan_cost, time);
+   if (sched_feat(SIS_PROP)) {
+   time = cpu_clock(this) - time;
+   update_avg(&this_sd->avg_scan_cost, time);
+   }
 
return cpu;
 }
-- 
2.26.2



[PATCH 1/5] sched/fair: Remove SIS_AVG_CPU

2021-01-11 Thread Mel Gorman
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It was
a blunt instrument and disabled by commit 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by commit 1ad3aaf3fcd2 ("sched/core: Implement new approach
to scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman 
Reviewed-by: Vincent Guittot 
---
 kernel/sched/fair.c | 20 +---
 kernel/sched/features.h |  1 -
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..9f5682aeda2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
struct sched_domain *this_sd;
-   u64 avg_cost, avg_idle;
u64 time;
int this = smp_processor_id();
int cpu, nr = INT_MAX;
@@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
if (!this_sd)
return -1;
 
-   /*
-* Due to large variance we need a large fuzz factor; hackbench in
-* particularly is sensitive here.
-*/
-   avg_idle = this_rq()->avg_idle / 512;
-   avg_cost = this_sd->avg_scan_cost + 1;
+   if (sched_feat(SIS_PROP)) {
+   u64 avg_cost, avg_idle, span_avg;
 
-   if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-   return -1;
+   /*
+* Due to large variance we need a large fuzz factor;
+* hackbench in particularly is sensitive here.
+*/
+   avg_idle = this_rq()->avg_idle / 512;
+   avg_cost = this_sd->avg_scan_cost + 1;
 
-   if (sched_feat(SIS_PROP)) {
-   u64 span_avg = sd->span_weight * avg_idle;
+   span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2



Re: [PATCH 2/3 v2] sched/fair: don't set LBF_ALL_PINNED unnecessarily

2021-01-11 Thread Mel Gorman
On Thu, Jan 07, 2021 at 11:33:24AM +0100, Vincent Guittot wrote:
> Setting LBF_ALL_PINNED during active load balance is only valid when there
> is only 1 running task on the rq otherwise this ends up increasing the
> balance interval whereas other tasks could migrate after the next interval
> once they become cache-cold as an example.
> 
> The LBF_ALL_PINNED flag is now always set by default. It is then cleared
> when we find one task that can be pulled when calling detach_tasks() or
> during active migration.
> 
> Signed-off-by: Vincent Guittot 

When reviewing this, I found it curious that sched_nr_migrate_break is
a const instead of a #define. It's not clear why as it only appears to
exist in case sysctl_sched_nr_migrate is set higher than 32 to prevent
rq lock starvation.
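
In other words, if it really is just a fixed cut-off, something like the
following would say the same thing (illustrative only; the value matches the
existing const):

	/* Hypothetical #define replacement for sched_nr_migrate_break. */
	#define SCHED_NR_MIGRATE_BREAK 32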

With your patch, s/atleast/at least/ in the comments if there is another
version.

Otherwise, I didn't spot a real problem so

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/3 v2] sched/fair: skip idle cfs_rq

2021-01-11 Thread Mel Gorman
On Thu, Jan 07, 2021 at 11:33:23AM +0100, Vincent Guittot wrote:
> Don't waste time checking whether an idle cfs_rq could be the busiest
> queue. Furthermore, this can end up selecting a cfs_rq with a high load
> but being idle in case of migrate_load.
> 
> Signed-off-by: Vincent Guittot 
> Reviewed-by: Valentin Schneider 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2021-01-09 Thread Mel Gorman
On Fri, Jan 08, 2021 at 08:45:44PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote:
> > Also, there is another problem (that I'm investigating)  which is that
> > this_rq()->avg_idle is stalled when your cpu is busy. Which means that
> > this avg_idle can just be a very old and meaningless value. I think
> > that we should decay it periodically to reflect there is less and less
> 
> https://lkml.kernel.org/r/20180530143105.977759...@infradead.org
> 
> :-)

This needs to be revived. I'm of the opinion that your initial series
needs to be split somewhat into "single scan for core" parts followed by
the "Fix depth selection". I have tests in the queue and one of them is
just patch 2 on its own. Preliminary results for patch 2 on its own do not
look bad but that's based on one test (tbench). It'll be tomorrow before
it works through variants of patch 1 which I suspect will be inconclusive
and make me more convinced it should be split out separately.

The single scan for core would be patches 2-4 of the series this thread
is about which is an orthogonal problem to avoiding repeated scans of
the same runqueues during a single wakeup. I would prefer to drop patch
5 initially because the has_idle_cores throttling mechanism for idle core
searches is reasonably effective and the first round of tests indicated
that changing it was inconclusive and should be treated separately.

The depth scan stuff would go on top because right now the depth scan is
based on magic numbers and no amount of shuffling that around will make
it less magic without an avg_idle value that decays and potentially the
scan cost also aging.  It's much less straight-forward than the single
scan aspect.

Switching idle core searches to SIS_PROP should also be treated
separately.

Thoughts? I don't want to go too far in this direction on my own because
the testing requirements are severe in terms of time. Even then no matter
how much I test, stuff will be missed as it's sensitive to domain sizes,
CPU generation, cache topology, cache characteristics, domain utilisation,
architecture, kitchen sink etc.

-- 
Mel Gorman
SUSE Labs


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2021-01-09 Thread Mel Gorman
On Fri, Jan 08, 2021 at 09:21:48PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 08, 2021 at 10:27:38AM +0000, Mel Gorman wrote:
> 
> > 1. avg_scan_cost is now based on the average scan cost of a rq but
> >avg_idle is still scaled to the domain size. This is a bit problematic
> >because it's comparing scan cost of a single rq with the estimated
> >average idle time of a domain. As a result, the scan depth can be much
> >larger than it was before the patch and led to some regressions.
> 
> > @@ -6164,25 +6164,25 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> >  */
> > avg_idle = this_rq()->avg_idle / 512;
> > avg_cost = this_sd->avg_scan_cost + 1;
> > -
> > -   span_avg = sd->span_weight * avg_idle;
> > -   if (span_avg > 4*avg_cost)
> > -   nr = div_u64(span_avg, avg_cost);
> > -   else
> > +   nr = div_u64(avg_idle, avg_cost);
> > +   if (nr < 4)
> > nr = 4;
> 
> Oooh, could it be I simply didn't remember how that code was supposed to
> work and should kick my (much) younger self for not writing a comment?
> 
> Consider:
> 
>          span_weight * avg_idle              avg_cost
>     nr = ---------------------- = avg_idle / -----------
>                avg_cost                      span_weight
> 
> Where: avg_cost / span_weight ~= cost-per-rq
> 

This would definitely make sense and I even evaluated it but the nature
of avg_idle and the scale it works at (up to 2*sched_migration_cost)
just ended up generating lunatic values far outside the domain size.
Fitting that to the domain size just ended up looking silly too and
avg_cost does not decay. Still, in principle, it's the right direction,
it's just not what the code does right now.

-- 
Mel Gorman
SUSE Labs


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2021-01-08 Thread Mel Gorman
On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote:
> > > Trying to bias the avg_scan_cost with:  loops <<= 2;
> > > will just make avg_scan_cost lost any kind of meaning because it
> > > doesn't reflect the avg cost of scanning a rq anymore
> > >
> >
> > Before the series, the avg_scan_cost also did not represent the cost of
> > scanning a RQ before either. Treating scan failures and successes equally
> 
> I agree that the previous avg_scan_cost was not representing a RQ
> because it was the avg cost of scanning the full domain.

It was not even that. As the full domain was not necessarily scanned at
all, it simply reflected how much time was spent scanning in general. It
neither represented an rq scan cost (which is variable due to cache
traffic) nor did it represent a full domain scan.

> And we were
> comparing it with the average idle time (weighted by few factors).
> And this cost was impacted by the fact that the scan can return early
> because it found a cpu. This has advantage and drawback but at least
> stays coherent in what we are comparing
> 

Not really because it represented the cost of a scan of some number of
rqs from 1 to sd->span_weight.

> Peter's patch wants to move on per rq avg scan cost. And what you're
> proposing is to add a magic heuristic to bias the per rq which at the
> end makes this value just an opaque metric.
> 

The metric is a heuristic no matter what. The scan cost of a RQ is not
fixed as it depends on whether cache data needs to be updated or not. Also
bear in mind that the first round of results and the results that I posted
showed that Peter's patch has significant corner cases that the patch
mitigates. You also note that avg_idle is an unusual metric to compare
against because it has numerous timing artifacts. At least one of them
is that we are extrapolating the domain idle time from a single rq which
is a poor proxy measure when a domain is only partially used. There just
is not a better one available without heavy writes to sd_llc which causes
its own set of problems.

> If we really want to keep the impact of early return than IMO we
> should stay on a full domain scan level instead of a per rq.
> 

That also has the same class of side-effects. Once the scan cost of
a successful scan is strictly accounted for, there are problems. Even
tracking the success scans is costly as the CPU clock has to be read and
sd_llc has to be updated.

> Also, there is another problem (that I'm investigating)  which is that
> this_rq()->avg_idle is stalled when your cpu is busy. Which means that
> this avg_idle can just be a very old and meaningless value.

Yes, avg_idle in itself is just the average inter-arrival time between
a CPU going idle and receiving a wakeup partially bound roughly
by 2*sysctl_sched_migration_cost. If avg_idle is traced for each
select_idle_cpu(), it's obvious that it takes time to adjust when a
load starts.

> I think
> that we should decay it periodically to reflect there is less and less
> idle time (in fact no more)  on this busy CPU that never goes to idle.
> If a cpu was idle for a long period but then a long running task
> starts, the avg_idle will stay stalled to the large value which is
> becoming less and less relevant.

While I get what you're saying, it does not help extrapolate what the
idleness of a domain is.

> At the opposite, a cpu with a short running/idle period task will have
> a lower avg_idle whereas it is more often idle.
> 
> Another thing that worries me, is that we use the avg_idle of the
> local cpu, which is obviously not idle otherwise it would have been
> selected, to decide how much time we should spend on looking for
> another idle CPU. I'm not sure that's the right metrics to use
> especially with a possibly stalled value.
> 

A better estimate requires heavy writes to sd_llc. The cost of that will
likely offset any benefit gained by a superior selection of a scan
depth.

Treating a successful scan cost and a failed scan cost as being equal has
too many corner cases. If we do not want to weight the successful scan
cost, then the compromise is to keep the old behaviour that accounts for
scan failures only (avoids an sd_llc write at least) but base it on the
estimated scan cost for a single rq. The fact that we do not account for
scan successes should be explicitly commented so we do not forget it
again in the future.
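
For concreteness, a rough sketch of that compromise in select_idle_cpu()
(untested, written on top of the per-rq accounting change; the loops counter
is assumed to start at 0 and nr comes from the existing SIS_PROP block):

	time = cpu_clock(this);

	for_each_cpu_wrap(cpu, cpus, target) {
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;

		if (++loops >= nr) {
			cpu = -1;
			break;
		}
	}

	/*
	 * Deliberately account for failed scans only; a successful scan
	 * hands back a CPU that will be used, so its cost is not wasted.
	 * Scale the sample down to an estimated cost per rq visited.
	 */
	if (cpu == -1) {
		time = cpu_clock(this) - time;
		update_avg(&this_sd->avg_scan_cost, div_u64(time, loops));
	}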

-- 
Mel Gorman
SUSE Labs


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2021-01-08 Thread Mel Gorman
   30   5.6573 (   0.00%)  5.7390 *  -1.44%*
Amean 48   8.9037 (   0.00%)  8.8053 *   1.10%*
Amean 79  14.9190 (   0.00%) 14.4360 *   3.24%*
Amean 110 22.5703 (   0.00%) 21.9210 (   2.88%)
Amean 141 29.2400 (   0.00%) 28.0110 *   4.20%*
Amean 172 36.3720 (   0.00%) 34.7963 (   4.33%)
Amean 203 43.5783 (   0.00%) 42.5537 *   2.35%*
Amean 234 50.3653 (   0.00%) 47.3463 *   5.99%*
Amean 265 57.6153 (   0.00%) 55.6247 *   3.46%*
Amean 296 62.7370 (   0.00%) 62.0720 (   1.06%)

Adjusting the scan cost for successes is neither a universal win nor a
universal loss, but it's closer to historical behaviour and the strict
accounting does hit corner cases.  If a deep scan is finding an idle CPU,
it makes some sense to continue scanning deeply by adjusting the weight
instead of prematurely failing.

The testing of the full series previously showed that some loads never
recovered from side-effects of the first patch and the last patch in the
series introduced new problems of its own. Hence, I would like to limit
the negative impact of the first patch and, if necessary, cut the last
patch altogether.

-- 
Mel Gorman
SUSE Labs


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2021-01-08 Thread Mel Gorman
On Fri, Jan 08, 2021 at 01:01:10PM +, Qais Yousef wrote:
> On 01/08/21 10:27, Mel Gorman wrote:
> > for_each_cpu_wrap(cpu, cpus, target) {
> > -   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > +   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> > +   /* Adjust cost of a successful scan */
> > +   loops <<= 2;
> > +
> > break;
> > +   }
> >  
> > -   if (loops >= nr) {
> > +   if (++loops >= nr) {
> > cpu = -1;
> > break;
> > }
> > -   loops++;
> 
> Random (out of the blue) comment.
> 
> Now this will increment loops before the comparison/break. ie: we're
> effectively doing one iteration less IIRC. Should loops be initialized to
> 0 instead of 1?
> 

Yep, although in practice it'll make little difference except after a
rapid phase change when avg_idle still appears high on a per-rq basis
yet the domain is fully busy with no idle CPUs.

-- 
Mel Gorman
SUSE Labs


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2021-01-08 Thread Mel Gorman
On Tue, Dec 15, 2020 at 08:59:11AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote:
> > On 2020/12/15 0:48, Peter Zijlstra wrote:
> > > We compute the average cost of the total scan, but then use it as a
> > > per-cpu scan cost when computing the scan proportion. Fix this by
> > > properly computing a per-cpu scan cost.
> > > 
> > > This also fixes a bug where we would terminate early (!--nr, case) and
> > > not account that cost at all.
> > 
> > I'm a bit worried this may introduce a regression under heavy load.
> > The overhead of adding another cpu_clock() and calculation becomes 
> > significant when sis_scan is throttled by nr.
> 
> The thing is, the code as it exists today makes no sense what so ever.
> It's plain broken batshit.
> 
> We calculate the total scanning time (irrespective of how many CPUs we
> touched), and then use that calculate the number of cpus to scan. That's
> just daft.
> 
> After this patch we calculate the avg cost of scanning 1 cpu and use
> that to calculate how many cpus to scan. Which is coherent and sane.
> 
> Maybe it can be improved, but that's a completely different thing.

In general I agree with everything you said so let's talk about the last
sentence.

1. avg_scan_cost is now based on the average scan cost of a rq but
   avg_idle is still scaled to the domain size. This is a bit problematic
   because it's comparing scan cost of a single rq with the estimated
   average idle time of a domain. As a result, the scan depth can be much
   larger than it was before the patch and led to some regressions.

2. Accounting for the scan cost of success makes sense but there is a
   big difference between a scan that finds an idle CPU and one that fails.
   For failures, the scan cost is wasted CPU time where as a success
   means that an uncontested CPU is used. This can cause a search to be
   truncated earlier than it should be when the domain is lightly loaded.

The patch below attempts to deal with both of those points. The
remaining difficulty is the "fuzz" factor which is necessary to bring
large avg_idle values into a reasonable range for comparing against
avg_scan_cost. The max avg_idle value can be anything but generally the
ceiling is 2*sysctl_sched_migration_cost. I didn't quickly spot a good way
of mapping avg_idle to a range between 0 and sd->span_weight.  The obvious
one was (weight*avg_idle/(2*sched_migration_cost)) but it did not work very
well as avg_scan_cost accounts for the full cost of accessing a remote rq.
However, this could be revisited later after this series is sorted out.

Anyway, for the enumerated concerns, how about this on top for your
first patch? It seemed to work well for a few workloads on 3 machines
at least and I'd like to nail this down before considering the remaining
patches. The first run indicated that the first patch offset some of the
benefits of the other patches.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65a2f208ada8..1e04a250e8ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6156,7 +6156,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
if (sched_feat(SIS_PROP)) {
-   u64 avg_cost, avg_idle, span_avg;
+   u64 avg_cost, avg_idle;
 
/*
 * Due to large variance we need a large fuzz factor;
@@ -6164,25 +6164,25 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
 */
avg_idle = this_rq()->avg_idle / 512;
avg_cost = this_sd->avg_scan_cost + 1;
-
-   span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
-   nr = div_u64(span_avg, avg_cost);
-   else
+   nr = div_u64(avg_idle, avg_cost);
+   if (nr < 4)
nr = 4;
 
time = cpu_clock(this);
}
 
for_each_cpu_wrap(cpu, cpus, target) {
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+   /* Adjust cost of a successful scan */
+   loops <<= 2;
+
break;
+   }
 
-   if (loops >= nr) {
+   if (++loops >= nr) {
    cpu = -1;
break;
}
-   loops++;
}
 
if (sched_feat(SIS_PROP)) {

-- 
Mel Gorman
SUSE Labs


Re: [RFC][PATCH 0/5] select_idle_sibling() wreckage

2021-01-04 Thread Mel Gorman
On Wed, Dec 23, 2020 at 02:23:41PM +0100, Vincent Guittot wrote:
> > Tests are still running on my side but early results shows perf
> > regression for hackbench
> 
> Few more results before being off:
> On small embedded system, the problem seems to be mainly a matter of
> setting the right number of loops.
> 
> On large smt system, The system on which  I usually run my tests  if
> off for now so i haven't been able to finalize tests yet but the
> problem might be that we don't loop all core anymore with this
> patchset compare to current algorithm
> 

Tests ran over the holidays and are available at 
http://www.skynet.ie/~mel/postings/peterz-20210104/dashboard.html

I am trawling through the data but by and large the two main
observations I've had so far are

1. The last patch seems the most problematic and the most likely to make
   a large change, particularly to hackbench. For example;
   
http://www.skynet.ie/~mel/postings/peterz-20210104/scheduler-unbound/bing2/index.html#hackbench-thread-pipes

   The idle cpu cutoff is reasonably effective even though it triggers a
   lot of false positives meaning that it may be better to treat that in
   isolation

2. The cost accounting one had variable impact. Generally it was small
   gains and losses but tbench at low client counts is an exception as
   low thread counts saw variable impact. Some big losses, although EPYC1
   is a counter-example (toto in the dashboard)

The second issue might be responsible for the first, but I'm not sure.
However, it does not surprise me that proper accounting would have an
impact on the SMT depth search and likely needs tweaking.

-- 
Mel Gorman
SUSE Labs


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2020-12-15 Thread Mel Gorman
On Tue, Dec 15, 2020 at 08:59:11AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 15, 2020 at 11:36:35AM +0800, Li, Aubrey wrote:
> > On 2020/12/15 0:48, Peter Zijlstra wrote:
> > > We compute the average cost of the total scan, but then use it as a
> > > per-cpu scan cost when computing the scan proportion. Fix this by
> > > properly computing a per-cpu scan cost.
> > > 
> > > This also fixes a bug where we would terminate early (!--nr, case) and
> > > not account that cost at all.
> > 
> > I'm a bit worried this may introduce a regression under heavy load.
> > The overhead of adding another cpu_clock() and calculation becomes 
> > significant when sis_scan is throttled by nr.
> 
> The thing is, the code as it exists today makes no sense what so ever.

Which makes it very hard to reason about or change in a "safe" manner as
all sorts of counter-intuitive effects occur.

The series is queued and running and takes 1-2 days. I haven't reviewed
the patches properly (holiday) but it'll be interesting to get some
provisional data at least.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-14 Thread Mel Gorman
On Mon, Dec 14, 2020 at 10:18:16AM +0100, Vincent Guittot wrote:
> On Fri, 11 Dec 2020 at 18:45, Peter Zijlstra  wrote:
> >
> > On Thu, Dec 10, 2020 at 12:58:33PM +0000, Mel Gorman wrote:
> > > The prequisite patch to make that approach work was rejected though
> > > as on its own, it's not very helpful and Vincent didn't like that the
> > > load_balance_mask was abused to make it effective.
> >
> > So last time I poked at all this, I found that using more masks was a
> > performance issue as well due to added cache misses.
> >
> > Anyway, I finally found some time to look at all this, and while reading
> > over the whole SiS crud to refresh the brain came up with the below...
> >
> > It's still naf, but at least it gets rid of a bunch of duplicate
> > scanning and LoC decreases, so it should be awesome. Ofcourse, as
> > always, benchmarks will totally ruin everything, we'll see, I started
> > some.
> >
> > It goes on top of Mel's first two patches (which is about as far as I
> > got)...
> 
> We have several different things that Aubrey and Mel patches are
> trying to achieve:
> 
> Aubrey wants to avoid scanning busy cpus
> - patch works well on busy system: I see a significant benefit with
> hackbench on a lot of group on my 2 nodes * 28 cores * 4 smt
> hackbench -l 2000 -g 128
> tip 3.334
> w/ patch 2.978 (+12%)
> 

It's still the case that Aubrey's work does not conflict with what Peter
is doing. All that changes is what mask is applied.

Similarly, the p->recent_used_cpu changes I made initially (but which got
rejected) reduce scanning. I intend to revisit that to use recent_used_cpu
as a search target because one side-effect of the patch was that the
siblings could be used prematurely.

> - Aubey also tried to not scan the cpus which are idle for a short
> duration (when a tick not stopped) but there are problems because not
> stopping a tick doesn't mean a short idle. In fact , something similar
> to find_idlest_group_cpu() should be done in this case but then it's
> no more a fast path search
> 

The crowd most likely to complain about this is Facebook as they have
workloads that prefer deep searches to find idle CPUs.

> Mel wants to minimize looping over the cpus
> -patch 4 is an obvious win on light loaded system because it avoids
> looping twice the cpus mask when some cpus are idle but no core

Peter's patch replaces that entirely which I'm fine with.

> -But patch 3 generates perf régression
> hackbench -l 2000 -g 1
> tip 12.293
> /w all Mel's patches 15.163 -14%
> /w Aubrey + Mel patches minus patch 3 : 12.788 +3.8% But I think that
> Aubreay's patch doesn't help here. Test without aubrey's patch are
> running
> 
> -he also tried to used load_balance_mask to do something similar to the below
> 

Peter's approach removes load_balance_mask abuse so I think it's a
better approach overall for scanning LLC domains in a single pass. It
needs refinement and a lot of testing but I think it's promising.

> > -static int select_idle_core(struct task_struct *p, struct sched_domain 
> > *sd, int target)
> > +static int __select_idle_core(int core, struct cpumask *cpus, int 
> > *idle_cpu)
> >  {
> > -   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > -   int core, cpu;
> > -
> > -   if (!static_branch_likely(_smt_present))
> > -   return -1;
> > -
> > -   if (!test_idle_cores(target, false))
> > -   return -1;
> > -
> > -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > -
> > -   for_each_cpu_wrap(core, cpus, target) {
> > -   bool idle = true;
> > +   bool idle = true;
> > +   int cpu;
> >
> > -   for_each_cpu(cpu, cpu_smt_mask(core)) {
> > -   if (!available_idle_cpu(cpu)) {
> > -   idle = false;
> > -   break;
> > -   }
> > +   for_each_cpu(cpu, cpu_smt_mask(core)) {
> > +   if (!available_idle_cpu(cpu)) {
> > +   idle = false;
> > +   continue;
> 
> By not breaking on the first not idle cpu of the core, you will
> largely increase the number of loops. On my system, it increases 4
> times from 28 up tu 112
> 

But breaking early will mean that SMT siblings are used prematurely.
However, if SIS_PROP was partially enforced for the idle core scan, it
could limit the search for an idle core once an idle candidate was
discovered.
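
One way that could look, sketched against the single-pass loop from the
prototype (illustrative only; smt, nr, idle_cpu and __select_idle_core() are
as in the prototype and nr is assumed to be initialised for the core scan
too, the throttle on the core scan is the new part):

	for_each_cpu_wrap(cpu, cpus, target) {
		if (smt) {
			i = __select_idle_core(p, cpu, cpus, &idle_cpu);
			if ((unsigned int)i < nr_cpumask_bits)
				return i;

			/*
			 * Hypothetical partial SIS_PROP enforcement: once an
			 * idle CPU candidate exists, stop hunting for a fully
			 * idle core when the scan budget runs out.
			 */
			if (idle_cpu != -1 && !--nr)
				return idle_cpu;
		} else {
			if (!--nr)
				return -1;
			if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
				return cpu;
		}
	}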

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-14 Thread Mel Gorman
On Mon, Dec 14, 2020 at 10:31:22AM +0100, Peter Zijlstra wrote:
> On Mon, Dec 14, 2020 at 09:11:29AM +0100, Vincent Guittot wrote:
> > On Fri, 11 Dec 2020 at 23:50, Mel Gorman  
> > wrote:
> 
> > > I originally did something like that on purpose too but Vincent called
> > > it out so it is worth mentioning now to avoid surprises. That said, I'm
> > > at the point where anything SIS-related simply needs excessive exposure
> > > because no matter what it does, someone is unhappy with it.
> > 
> > Yeah, I don't like extending the idle core search loop for something
> > that is not related to the idle core search. This adds more work out
> > of  control of the sis_prop. So I'm still against hiding some idle cpu
> > search in idle core search.
> 
> The idea, of course, is to do less. The current code is pretty crap in
> that it will do a whole bunch of things multiple times.
> 

^^^ this

While there is some overhead when searching for an idle core to track
an idle sibling, it's better than double scanning when test_idle_cores()
returns a false positive (or it races with a parallel search that takes
the last idle core).

> Also, a possible follow up, would be something like the below (and
> remove all the sds->has_idle_cores crud), which brings core scanning
> under SIS_PROP.
> 

I'm less confident about this but admit I have no data. test_idle_cores()
becomes critical for hackbench once it saturates the machine as it'll
generally return a true positive.

Where test_idle_cores causes problems is when domains are over 50%
busy and returns false positives due to races and the work to find an
idle core is wasted. The flip side is that updating the has_idle_core
information is also expensive in this case as it causes lots of dirty
cache line bouncing so maybe in practice it might be ok. It definitely
should be a separate patch that is tested on top of your first prototype
with the p->cpus_ptr check when picking an idle candidate.

The other side-effect is that with this patch the scan cost is *always*
accounted for. That makes intuitive sense, although it was never clear to me
why it was only accounted for scan failures before. When I had tested
something like this, it was a mix of wins and losses. At minimum, a patch
that always accounts for scan cost and one that removes the test_idle_cores
check should be separate patches for bisection purposes.

This is the current set of results I have for your prototype plus the
fixes I suggested on top

http://www.skynet.ie/~mel/postings/peterz-20201214/dashboard.html

It's not a universal win but appears to win more than it loses

The biggest loss is dbench on EPYC 2

http://www.skynet.ie/~mel/postings/peterz-20201214/io-dbench4-async-xfs/romulus/index.html#dbench4

It's not at all clear why it was so badly affected but, in general, EPYC can
be problematic as it has multiple small LLCs. The same machine showed large
gains for specjbb.

http://www.skynet.ie/~mel/postings/peterz-20201214/jvm-specjbb2005-multi/romulus/index.html#specjbb

There are a lot of results to trawl through but mostly it shows that
it's a mix of wins and losses and it's both workload and machine
dependant which is generally true for anything select_idle_sibling
related.

As the merge window is open, it's inevitable that this will need to be
evaluated against 5.11-rc1 when all the current batch of scheduler code
has been merged. Do you mind splitting your prototype into three patches
and slapping some sort of changelog on them? I'll run them through the grid
with the p->recent_used_cpu changes on top to use recent_used_cpu as a search
hint instead of an idle candidate so that it scans for a core. They'll
take a while to run but it's automated and some people are going to be
dropping off for holidays relatively soon anyway. I can test on arm too
but as it does not have SMT enabled, it'll be less useful.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-11 Thread Mel Gorman
On Fri, Dec 11, 2020 at 11:19:05PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2020 at 08:43:37PM +0000, Mel Gorman wrote:
> > One bug is in __select_idle_core() though. It's scanning the SMT mask,
> > not select_idle_mask so it can return an idle candidate that is not in
> > p->cpus_ptr.
> 
> D'0h.. luckily the benchmarks don't hit that :-)
> 

Yep, neither do mine for the most part which is why I ran it as-is but
eventually someone would complain that sched_setscheduler was being
ignored.

The trick is whether a failed check should continue searching for an idle core
or terminate early and just pick an allowed idle CPU. I tend to favour
the latter but it's hard to predict what sort of reasonable workloads
would be affected.

> > There are some other potential caveats.
> > 
> > This is a single pass so when test_idle_cores() is true, __select_idle_core
> > is used to to check all the siblings even if the core is not idle. That
> > could have been cut short if __select_idle_core checked *idle_cpu ==
> > 1 and terminated the SMT scan if an idle candidate had already been found.
> 
> So I did that on purpose, so as to track the last/most-recent idle cpu,
> with the thinking that that cpu has the higher chance of still being
> idle vs one we checked earlier/longer-ago.
> 
> I suppose we benchmark both and see which is liked best.
> 

I originally did something like that on purpose too but Vincent called
it out so it is worth mentioning now to avoid surprises. That said, I'm
at the point where anything SIS-related simply needs excessive exposure
because no matter what it does, someone is unhappy with it.

> > Second downside is related. If test_idle_cpus() was true but no idle
> > CPU is found then __select_idle_core has been called enough to scan
> > the entire domain. In this corner case, the new code does *more* work
> > because the old code would have failed select_idle_core() quickly and
> > then select_idle_cpu() would be throttled by SIS_PROP. I think this will
> > only be noticable in the heavily overloaded case but if the corner case
> > hits enough then the new code will be slower than the old code for the
> > over-saturated case (i.e. hackbench with lots of groups).
> 
> Right, due to scanning siblings, even if the first inspected thread is
> not idle, we scan more.
> 

Yep.

> > The third potential downside is that the SMT sibling is not guaranteed to
> > be checked due to SIS_PROP throttling but in the old code, that would have
> > been checked by select_idle_smt(). That might result in premature stacking
> > of runnable tasks on the same CPU. Similarly, as __select_idle_core may
> > find multiple idle candidates, it will not pick the targets SMT sibling
> > if it is idle like select_idle_smt would have.
> > 
> > That said, I am skeptical that select_idle_smt() matters all that often.
> 
> This, I didn't really believe in it either.
> 

Good because I think any benefit from select_idle_smt is so marginal
that it should be ignored if the full scan is simpler overall.

> The benchmarks I started are mostly noise, with a few wins for
> TCP_STREAM and UDP_RR around the 50% mark. Although I should run
> longer variants to make sure.

So far I have one benchmark from one machine. It's tbench again because
it's a reasonable communicating workload where it is trivial to vary the
level of utilisation.

80-cpu CascadeLake, 2 sockets, HT enabled
tbench4
                             5.10.0-rc6             5.10.0-rc6             5.10.0-rc6
                          baseline-v1r1        singlescan-v1r1        singlescan-v1r3
Hmean     1       503.90 (   0.00%)      505.19 *   0.26%*      499.20 *  -0.93%*
Hmean     2       980.80 (   0.00%)      975.15 *  -0.58%*      983.79 *   0.31%*
Hmean     4      1912.37 (   0.00%)     1883.25 *  -1.52%*     1923.76 *   0.60%*
Hmean     8      3741.47 (   0.00%)     3568.66 *  -4.62%*     3690.60 *  -1.36%*
Hmean     16     6552.90 (   0.00%)     6549.97 *  -0.04%*     6478.37 *  -1.14%*
Hmean     32    10217.34 (   0.00%)    10266.66 *   0.48%*    10291.60 *   0.73%*
Hmean     64    13604.93 (   0.00%)    11240.88 * -17.38%*    12045.87 * -11.46%*
Hmean     128   21194.11 (   0.00%)    21316.08 *   0.58%*    21868.39 *   3.18%*
Hmean     256   21163.19 (   0.00%)    20989.14 *  -0.82%*    20831.20 *  -1.57%*
Hmean     320   20906.29 (   0.00%)    21024.11 *   0.56%*    20756.88 *  -0.71%*
Stddev    1         3.14 (   0.00%)        1.17 (  62.93%)        1.05 (  66.61%)
Stddev    2         4.44 (   0.00%)        3.72 (  16.35%)        2.20 (  50.56%)
Stddev    4         9.09 (   0.00%)       18.67 (-105.32%)        3.66 (  59.71%)
Stddev    8        12.87 (   0.00%)       18.96 ( -47.31%)       11.90 (   7.58%)
Stddev    16        8.

Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-11 Thread Mel Gorman
On Fri, Dec 11, 2020 at 06:44:42PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 10, 2020 at 12:58:33PM +0000, Mel Gorman wrote:
> > The prequisite patch to make that approach work was rejected though
> > as on its own, it's not very helpful and Vincent didn't like that the
> > load_balance_mask was abused to make it effective.
> 
> So last time I poked at all this, I found that using more masks was a
> performance issue as well due to added cache misses.
> 
> Anyway, I finally found some time to look at all this, and while reading
> over the whole SiS crud to refresh the brain came up with the below...
> 
> It's still naf, but at least it gets rid of a bunch of duplicate
> scanning and LoC decreases, so it should be awesome. Ofcourse, as
> always, benchmarks will totally ruin everything, we'll see, I started
> some.
> 
> It goes on top of Mel's first two patches (which is about as far as I
> got)...
> 

It's not free of bugs but it's interesting! The positive aspects are that
it does a single scan, inits select_idle_mask once and caches an idle
candidate when searching for an idle core but in a far cleaner fashion
than what I did.

One bug is in __select_idle_core() though. It's scanning the SMT mask,
not select_idle_mask so it can return an idle candidate that is not in
p->cpus_ptr.

I queued tests anyway to see what it looks like.

There are some other potential caveats.

This is a single pass so when test_idle_cores() is true, __select_idle_core
is used to check all the siblings even if the core is not idle. That
could have been cut short if __select_idle_core checked whether *idle_cpu was
already set and terminated the SMT scan once an idle candidate had been found.

The second downside is related. If test_idle_cores() was true but no idle
CPU is found then __select_idle_core has been called enough to scan
the entire domain. In this corner case, the new code does *more* work
because the old code would have failed select_idle_core() quickly and
then select_idle_cpu() would be throttled by SIS_PROP. I think this will
only be noticeable in the heavily overloaded case but if the corner case
hits enough then the new code will be slower than the old code for the
over-saturated case (i.e. hackbench with lots of groups).

The third potential downside is that the SMT sibling is not guaranteed to
be checked due to SIS_PROP throttling but in the old code, that would have
been checked by select_idle_smt(). That might result in premature stacking
of runnable tasks on the same CPU. Similarly, as __select_idle_core may
find multiple idle candidates, it will not pick the target's SMT sibling
if it is idle, like select_idle_smt() would have.

That said, I am skeptical that select_idle_smt() matters all that often.
If select_idle_cpu() has been throttled heavily then the machine is
either over-saturated or was over-saturated in the very recent past. In
that situation, the chances of the SMT siblings being free are slim.

Each of the downsides could potentially be addressed with this untested
mess

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 637f610c7059..260592d143d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6061,7 +6061,7 @@ void __update_idle_core(struct rq *rq)
rcu_read_unlock();
 }
 
-static int __select_idle_core(int core, struct cpumask *cpus, int *idle_cpu)
+static int __select_idle_core(struct task_struct *p, int core, struct cpumask 
*cpus, int *idle_cpu)
 {
bool idle = true;
int cpu;
@@ -6070,9 +6070,11 @@ static int __select_idle_core(int core, struct cpumask 
*cpus, int *idle_cpu)
schedstat_inc(this_rq()->sis_scanned);
if (!available_idle_cpu(cpu)) {
idle = false;
-   continue;
+   if (*idle_cpu == -1)
+   continue;
+   break;
}
-   if (idle_cpu)
+   if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
*idle_cpu = cpu;
}
 
@@ -6094,7 +6096,7 @@ static inline bool test_idle_cores(int cpu, bool def)
return def;
 }
 
-static inline int __select_idle_core(int core, struct cpumask *cpus, int 
*idle_cpu)
+static inline int __select_idle_core(struct task_struct *p, int core, struct 
cpumask *cpus, int *idle_cpu)
 {
return -1;
 }
@@ -6142,7 +6144,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
for_each_cpu_wrap(cpu, cpus, target) {
if (smt) {
-   i = __select_idle_core(cpu, cpus, &idle_cpu);
+   i = __select_idle_core(p, cpu, cpus, &idle_cpu);
if ((unsigned)i < nr_cpumask_bits)
return i;
 

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

2020-12-11 Thread Mel Gorman
On Fri, Dec 11, 2020 at 10:51:17AM +0100, Vincent Guittot wrote:
> On Thu, 10 Dec 2020 at 12:04, Mel Gorman  wrote:
> >
> > On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > > > while testing your patchset and Aubrey one on top of tip, I'm facing
> > > > some perf regression on my arm64 numa system on hackbench and reaim.
> > > > The regression seems to comes from your patchset but i don't know
> > > > which patch in particular yet
> > > >
> > > > hackbench -l 256000 -g 1
> > > >
> > > > v5.10-rc7 + tip/sched/core 13,255(+/- 3.22%)
> > > > with your patchset 15.368(+/- 2.74)  -15.9%
> > > >
> > > > I'm also seeing perf regression on reaim but this one needs more
> > > > investigation before confirming
> > > >
> > > > TBH, I was not expecting regressions. I'm running more test to find
> > > > which patch is the culprit
> > >
> > > The regression comes from patch 3: sched/fair: Do not replace
> > > recent_used_cpu with the new target
> > >
> >
> > That's not entirely surprising. The intent of the patch is to increase the
> > hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
> > corner cases. If multiple tasks have the same p->recent_used_cpu, they can
> > race to use that CPU and stack as a result instead of searching the domain.
> > If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
> > a busy sibling which the search would have avoided in select_idle_core().
> >
> > I think you are using processes and sockets for hackbench but as you'll
> > see later, hackbench can be used both to show losses and gains.
> 
> I run more hackbench tests with pipe and socket and both show
> regression with patch 3 whereas this is significant improvement with
> other patches and Aubrey's one
> 

Is SMT enabled on your test machine? If not, then patch 4 should make no
difference but if SMT is enabled, I wonder how this untested version of
patch 3 behaves for you. The main difference is that the recent used cpu
is used as a search target so that it would still check if it's an idle
core and if not, fall through so it's used as an idle CPU after checking
it's allowed by p->cpus_ptr.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c41875aec23..63980bcf6e70 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6275,21 +6275,14 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
return prev;
}
 
-   /* Check a recently used CPU as a potential idle candidate: */
+   /* Check a recently used CPU as a search target: */
recent_used_cpu = p->recent_used_cpu;
+   p->recent_used_cpu = prev;
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
-   (available_idle_cpu(recent_used_cpu) || 
sched_idle_cpu(recent_used_cpu)) &&
-   cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
-   asym_fits_capacity(task_util, recent_used_cpu)) {
-   /*
-* Replace recent_used_cpu with prev as it is a potential
-* candidate for the next wake:
-*/
-   p->recent_used_cpu = prev;
-   return recent_used_cpu;
-   }
+   (available_idle_cpu(recent_used_cpu) || 
sched_idle_cpu(recent_used_cpu)))
+   target = recent_used_cpu;
 
/*
 * For asymmetric CPU capacity systems, our domain of interest is
@@ -6768,9 +6761,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int wake_flags)
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
-   if (want_affine)
-   current->recent_used_cpu = cpu;
}
rcu_read_unlock();
 

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target

2020-12-11 Thread Mel Gorman
On Fri, Dec 11, 2020 at 05:34:43PM +0800, Hillf Danton wrote:
> On Fri, 11 Dec 2020 09:02:28 +0000 Mel Gorman wrote:
> >On Fri, Dec 11, 2020 at 02:25:42PM +0800, Hillf Danton wrote:
> >> On Tue,  8 Dec 2020 15:35:00 + Mel Gorman wrote:
> >> > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct 
> >> > task_struct *p, int prev, int target)
> >> >  
> >> >  /* Check a recently used CPU as a potential idle candidate: */
> >> >  recent_used_cpu = p->recent_used_cpu;
> >> > +p->recent_used_cpu = prev;
> >> >  if (recent_used_cpu != prev &&
> >> >  recent_used_cpu != target &&
> >> >  cpus_share_cache(recent_used_cpu, target) &&
> >> >  (available_idle_cpu(recent_used_cpu) || 
> >> > sched_idle_cpu(recent_used_cpu)) &&
> >> >  cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> >> 
> >> Typo? Fix it in spin if so.
> >> 
> >
> >What typo?
> 
> After your change it is prev that we check against p->cpus_ptr instead of
> the recent CPU. I wonder what the point of such a check is when returning the
> recent one.

Ah... yes, this is indeed wrong. It wouldn't affect Vincent's case
that showed a problem with a hackbench configuration (which I'm still
disappointed about as it's a trade-off depending on machine and workload)
but it allows a task to run on the wrong cpu if sched_setscheduler()
was called between wakeup events.
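
For reference, a corrected ordering would snapshot recent_used_cpu first
and test that snapshot, not p->recent_used_cpu (which by then holds prev),
against p->cpus_ptr. A sketch of the intended logic, not necessarily the
exact patch that was eventually applied:

	/* Check a recently used CPU as a potential idle candidate: */
	recent_used_cpu = p->recent_used_cpu;
	p->recent_used_cpu = prev;
	if (recent_used_cpu != prev &&
	    recent_used_cpu != target &&
	    cpus_share_cache(recent_used_cpu, target) &&
	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
	    cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&	/* not p->recent_used_cpu */
	    asym_fits_capacity(task_util, recent_used_cpu))
		return recent_used_cpu;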

-- 
Mel Gorman
SUSE Labs


[tip: sched/core] sched/fair: Clear SMT siblings after determining the core is not idle

2020-12-11 Thread tip-bot2 for Mel Gorman
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 13d5a5e9f9b8515da3c04305ae1bb03ab91be7a7
Gitweb:
https://git.kernel.org/tip/13d5a5e9f9b8515da3c04305ae1bb03ab91be7a7
Author:Mel Gorman 
AuthorDate:Mon, 30 Nov 2020 14:40:20 
Committer: Ingo Molnar 
CommitterDate: Fri, 11 Dec 2020 10:30:38 +01:00

sched/fair: Clear SMT siblings after determining the core is not idle

The clearing of SMT siblings from the SIS mask before checking for an idle
core is a small but unnecessary cost. Defer the clearing of the siblings
until the scan moves to the next potential target. The cost of this was
not measured as it is borderline noise but it should be self-evident.

Signed-off-by: Mel Gorman 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Reviewed-by: Vincent Guittot 
Link: https://lkml.kernel.org/r/20201130144020.gs3...@techsingularity.net
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f5dceda..efac224 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6086,10 +6086,11 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int 
break;
}
}
-   cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 
if (idle)
return core;
+
+   cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
}
 
/*


Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target

2020-12-11 Thread Mel Gorman
On Fri, Dec 11, 2020 at 02:25:42PM +0800, Hillf Danton wrote:
> On Tue,  8 Dec 2020 15:35:00 +0000 Mel Gorman wrote:
> > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct 
> > *p, int prev, int target)
> >  
> > /* Check a recently used CPU as a potential idle candidate: */
> > recent_used_cpu = p->recent_used_cpu;
> > +   p->recent_used_cpu = prev;
> > if (recent_used_cpu != prev &&
> > recent_used_cpu != target &&
> > cpus_share_cache(recent_used_cpu, target) &&
> > (available_idle_cpu(recent_used_cpu) || 
> > sched_idle_cpu(recent_used_cpu)) &&
> > cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> 
> Typo? Fix it in spin if so.
> 

What typo?

> > asym_fits_capacity(task_util, recent_used_cpu)) {
> > -   /*
> > -* Replace recent_used_cpu with prev as it is a potential
> > -* candidate for the next wake:
> > -*/
> > -   p->recent_used_cpu = prev;
> >     return recent_used_cpu;
> 
> I prefer to update the recent CPU after llc check.
> 

That would prevent recent_used_cpu leaving the LLC the task first
started on.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-10 Thread Mel Gorman
ed.
> 

That's somewhat more marginal as a benefit but fair enough. I took a
closer look at the scan rates for 320 clients which loads the machine
fairly heavily. mpstat reported 0% idle time but the domain scanned
with or without the patch is almost identical which hints that scans were
not avoided. As I write this, hackbench has not executed yet to see if
it makes a difference but it's in the queue.

> > I agree that it should never be worse/bigger but if the idle cpu mask is
> > often the same bits as the sched_domain_span then it's not going to be
> > any better either.
> > 
> 
> At least on my side, v5 is easier to see the benefit, as rapidly entering
> and exiting idle will make idle governor to retain the tick, v5 uses stop_tick
> signal from idle governor to decide if this CPU is added to idle cpumask.
> https://lore.kernel.org/lkml/20201118043113.53128-1-aubrey...@linux.intel.com/
> 
> But as Vincent and Peter concerned, v5 may break the latency sensitive 
> workload.
> Also the short idle definition need more work.
> 

Indeed, this is why I was advocating an approach that would use the idle
CPU mask as a hint to quickly find idle CPUs but still fall through
the scan the rest of the domain if necessary. That would cover the
"latency sensitive" workload situation because a deeper scan still
happens if necessary. While I'm not 100% certain, they were probably
thinking about schbench which favours deep searches for idle CPUs and is
latency sensitive.
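
To make the idea concrete, a hint-then-fallback scan could look roughly
like the sketch below. sds_idle_cpus() stands in for the per-LLC idle
mask accessor from Aubrey's series; the naming and plumbing here are
assumptions for illustration only, not something that was posted:

static int select_idle_cpu_hinted(struct task_struct *p, struct sched_domain *sd, int target)
{
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
	int cpu;

	/* Pass 1: CPUs the idle mask claims are idle; the hint may be stale. */
	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
	for_each_cpu_wrap(cpu, cpus, target) {
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			return cpu;
	}

	/* Pass 2: deeper scan of the rest of the LLC, skipping pass 1 CPUs. */
	cpumask_andnot(cpus, sched_domain_span(sd), sds_idle_cpus(sd->shared));
	cpumask_and(cpus, cpus, p->cpus_ptr);
	for_each_cpu_wrap(cpu, cpus, target) {
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			return cpu;
	}

	return -1;
}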

The prequisite patch to make that approach work was rejected though
as on its own, it's not very helpful and Vincent didn't like that the
load_balance_mask was abused to make it effective.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-10 Thread Mel Gorman
 -0   [001] d..1   137.409035: update_idle_cpumask: 
> set_idle-1, cpumask: 0-3
> 

What's the size of the LLC domain on this machine? If it's 4 then this
is indicating that there is little difference between scanning the full
domain and targetting idle CPUs via the idle cpumask.

> stage 3: uperf running, select_idle_cpu scan all the CPUs in the scheduler 
> domain at the beginning
> ===
>uperf-560 [000] d..3   138.418494: select_task_rq_fair: 
> scanning: 0-3
>uperf-560 [000] d..3   138.418506: select_task_rq_fair: 
> scanning: 0-3
>uperf-560 [000] d..3   138.418514: select_task_rq_fair: 
> scanning: 0-3
>uperf-560 [000] dN.3   138.418534: select_task_rq_fair: 
> scanning: 0-3
>uperf-560 [000] dN.3   138.418543: select_task_rq_fair: 
> scanning: 0-3
>uperf-560 [000] dN.3   138.418551: select_task_rq_fair: 
> scanning: 0-3
>uperf-561 [003] d..3   138.418577: select_task_rq_fair: 
> scanning: 0-3
>uperf-561 [003] d..3   138.418600: select_task_rq_fair: 
> scanning: 0-3
>uperf-561 [003] d..3   138.418617: select_task_rq_fair: 
> scanning: 0-3
>uperf-561 [003] d..3   138.418640: select_task_rq_fair: 
> scanning: 0-3
>uperf-561 [003] d..3   138.418652: select_task_rq_fair: 
> scanning: 0-3
>uperf-561 [003] d..3   138.418662: select_task_rq_fair: 
> scanning: 0-3
>uperf-561 [003] d..3   138.418672: select_task_rq_fair: 
> scanning: 0-3
>uperf-560 [000] d..5   138.418676: select_task_rq_fair: 
> scanning: 0-3
>uperf-561 [003] d..3   138.418693: select_task_rq_fair: 
> scanning: 0-3
> kworker/u8:3-110 [002] d..4   138.418746: select_task_rq_fair: 
> scanning: 0-3
> 
> stage 4: scheduler tick comes, update idle cpumask to EMPTY
> 
>uperf-572 [002] d.h.   139.420568: update_idle_cpumask: 
> set_idle-0, cpumask: 1,3
>uperf-574 [000] d.H2   139.420568: update_idle_cpumask: 
> set_idle-0, cpumask: 1,3
>uperf-565 [003] d.H6   139.420569: update_idle_cpumask: 
> set_idle-0, cpumask: 1
> tmux: server-528 [001] d.h2   139.420572: update_idle_cpumask: 
> set_idle-0, cpumask: 
> 

It's the timing of the clear that may be problematic. For the
configurations I use, the tick is every 4 milliseconds which is a very
long time in the scheduler for communicating workloads. A simple
round-trip of perf pipe where tasks will be entering/exiting rapidly is
just 0.004 milliseconds on my desktop.
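
For anyone wanting to reproduce that number, it comes from something like
perf bench sched pipe. A minimal standalone equivalent (plain userspace C,
nothing here is kernel code) is:

/* pipe-rtt.c: rough pipe ping-pong round-trip latency.
 * Build with: gcc -O2 pipe-rtt.c -o pipe-rtt
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define LOOPS 1000000L

int main(void)
{
	int ping[2], pong[2];
	char c = 0;
	struct timespec ts, te;

	if (pipe(ping) || pipe(pong))
		return 1;

	if (fork() == 0) {
		/* Child: echo each byte straight back. */
		close(ping[1]);
		close(pong[0]);
		while (read(ping[0], &c, 1) == 1)
			if (write(pong[1], &c, 1) != 1)
				break;
		_exit(0);
	}

	close(ping[0]);
	close(pong[1]);
	clock_gettime(CLOCK_MONOTONIC, &ts);
	for (long i = 0; i < LOOPS; i++)
		if (write(ping[1], &c, 1) != 1 || read(pong[0], &c, 1) != 1)
			return 1;
	clock_gettime(CLOCK_MONOTONIC, &te);

	double ns = (te.tv_sec - ts.tv_sec) * 1e9 + (te.tv_nsec - ts.tv_nsec);
	printf("%.3f usec per round trip\n", ns / LOOPS / 1000.0);
	return 0;
}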

> > 
> > I suspect the issue is that the mask is only marked busy from the tick
> > context which is a very wide window. If select_idle_cpu() picks an idle
> > CPU from the mask, it's still marked as idle in the mask.
> > 
> That should be fine because we still check available_idle_cpu() and 
> sched_idle_cpu for the selected
> CPU. And even if that CPU is marked as idle in the mask, that mask should not 
> be worse(bigger) than
> the default sched_domain_span(sd).
> 

I agree that it should never be worse/bigger but if the idle cpu mask is
often the same bits as the sched_domain_span then it's not going to be
any better either.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

2020-12-10 Thread Mel Gorman
ket CascadeLake

Amean 1    0.3663 (   0.00%)  0.3657 (   0.18%)
Amean 4    0.7510 (   0.00%)  0.7793 (  -3.77%)
Amean 7    1.2650 (   0.00%)  1.2583 (   0.53%)
Amean 12   1.9510 (   0.00%)  1.9450 (   0.31%)
Amean 21   2.9677 (   0.00%)  3.0277 (  -2.02%)
Amean 30   4.2993 (   0.00%)  4.0237 *   6.41%*
Amean 48   6.5373 (   0.00%)  6.2987 *   3.65%*
Amean 79  10.5513 (   0.00%) 10.3280 (   2.12%)
Amean 110 15.8567 (   0.00%) 13.9817 (  11.82%)
Amean 141 17.4243 (   0.00%) 17.3177 (   0.61%)
Amean 172 21.0473 (   0.00%) 20.9760 (   0.34%)
Amean 203 25.1070 (   0.00%) 25.1150 (  -0.03%)
Amean 234 28.6753 (   0.00%) 28.9383 (  -0.92%)
Amean 265 32.7970 (   0.00%) 32.9663 (  -0.52%)
Amean 296 36.6510 (   0.00%) 36.6753 (  -0.07%)

Neutral for 1 group, small regression for 4 groups, few gains around the
middle, neutral when over-saturated.

select_idle_sibling is a curse because it's very rare that a change to
it is a universal win. On balance, I think it's better to avoid searching
the domain at all where possible even if there are cases where searching
can have a benefit such as finding an idle core instead of picking an
idle CPU with a busy sibling via p->recent_used_cpu.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2020-12-10 Thread Mel Gorman
On Thu, Dec 10, 2020 at 01:18:05PM +0800, Li, Aubrey wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac7b34e7372b..5c41875aec23 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> > if (!this_sd)
> > return -1;
> >  
> > +   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +
> > if (sched_feat(SIS_PROP)) {
> > u64 avg_cost, avg_idle, span_avg;
> >  
> > @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> > nr = div_u64(span_avg, avg_cost);
> > else
> > nr = 4;
> > -   }
> > -
> > -   time = cpu_clock(this);
> >  
> > -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +   time = cpu_clock(this);
> > +   }
> >  
> > for_each_cpu_wrap(cpu, cpus, target) {
> > if (!--nr)
> > return -1;
> 
> I thought about this again and here seems not to be consistent:
> - even if nr reduces to 0, shouldn't avg_scan_cost be updated as well before 
> return -1?

You're right, but it's outside the scope
of this patch. I noted that this was a problem in
lore.kernel.org/r/20201203141124.7391-8-mgor...@techsingularity.net
It's neither a consistent win or loss to always account for it and so
was dropped for this series to keep the number of controversial patches
to a minimum.
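
For completeness, accounting the throttled case would look something like
the sketch below. This only illustrates the idea discussed in that link,
it is not a patch that was merged:

	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr) {
			/* Account the partial scan before giving up. */
			if (sched_feat(SIS_PROP)) {
				time = cpu_clock(this) - time;
				update_avg(&this_sd->avg_scan_cost, time);
			}
			return -1;
		}
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;
	}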

> - if avg_scan_cost is not updated because nr is throttled, the first 
>   time = cpu_clock(this);
>   can be optimized. As nr is calculated and we already know which of the 
> weight of cpumask and nr is greater.
> 

That is also outside the scope of this patch. To do that, cpumask_weight()
would have to be calculated but it's likely to be a net loss. Even under
light load, nr will be smaller than the domain weight incurring both the
cost of cpumask_weight and the clock read in the common case.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

2020-12-09 Thread Mel Gorman
On Tue, Dec 08, 2020 at 03:34:57PM +, Mel Gorman wrote:
> Changelog since v1
> o Drop single-pass patch                  (vincent)
> o Scope variables used for SIS_AVG_CPU    (dietmar)
> o Remove redundant assignment             (dietmar)
> 
> This reduces the amount of runqueue scanning in select_idle_sibling in
> the worst case.
> 
> Patch 1 removes SIS_AVG_CPU because it's unused.
> 
> Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
> 
> Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
>   of scanning. It should be relatively uncontroversial
> 
> Patch 4 returns an idle candidate if one is found while scanning for a
>   free core.
> 

Any other objections to the series? Vincent marked 1, 3 and 4 as
reviewed. While patch 2 had some mild cosmetic concerns, I think the
version and how it treats SIS_PROP is fine as it is to keep it
functionally equivalent to !SIS_PROP and without adding too many
SIS_PROP checks.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-09 Thread Mel Gorman
On Wed, Dec 09, 2020 at 02:24:04PM +0800, Aubrey Li wrote:
> Add idle cpumask to track idle cpus in sched domain. Every time
> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> target. And if the CPU is not in idle, the CPU is cleared in idle
> cpumask during scheduler tick to ratelimit idle cpumask update.
> 
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has lower cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
> 
> Benchmarks including hackbench, schbench, uperf, sysbench mysql
> and kbuild were tested on a x86 4 socket system with 24 cores per
> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> found.
> 

I ran this patch with tbench on top of of the schedstat patches that
track SIS efficiency. The tracking adds overhead so it's not a perfect
performance comparison but the expectation would be that the patch reduces
the number of runqueues that are scanned

tbench4
  5.10.0-rc6 5.10.0-rc6
  schedstat-v1r1  idlemask-v7r1
Hmean 1        504.76 (   0.00%)      500.14 *  -0.91%*
Hmean 2       1001.22 (   0.00%)      970.37 *  -3.08%*
Hmean 4       1930.56 (   0.00%)     1880.96 *  -2.57%*
Hmean 8       3688.05 (   0.00%)     3537.72 *  -4.08%*
Hmean 16      6352.71 (   0.00%)     6439.53 *   1.37%*
Hmean 32     10066.37 (   0.00%)    10124.65 *   0.58%*
Hmean 64     12846.32 (   0.00%)    11627.27 *  -9.49%*
Hmean 128    22278.41 (   0.00%)    22304.33 *   0.12%*
Hmean 256    21455.52 (   0.00%)    20900.13 *  -2.59%*
Hmean 320    21802.38 (   0.00%)    21928.81 *   0.58%*

Not very optimistic result. The schedstats indicate;

5.10.0-rc6 5.10.0-rc6
schedstat-v1r1  idlemask-v7r1
Ops TTWU Count   5599714302.00  5589495123.00
Ops TTWU Local   2687713250.00  2563662550.00
Ops SIS Search   5596677950.00  5586381168.00
Ops SIS Domain Search        3268344934.00  3229088045.00
Ops SIS Scanned 15909069113.00 16568899405.00
Ops SIS Domain Scanned  13580736097.00 14211606282.00
Ops SIS Failures 2944874939.00  2843113421.00
Ops SIS Core Search   262853975.00   311781774.00
Ops SIS Core Hit  185189656.00   216097102.00
Ops SIS Core Miss  77664319.0095684672.00
Ops SIS Recent Used Hit   124265515.00   146021086.00
Ops SIS Recent Used Miss  338142547.00   403547579.00
Ops SIS Recent Attempts   462408062.00   549568665.00
Ops SIS Search Efficiency        35.18          33.72
Ops SIS Domain Search Eff        24.07          22.72
Ops SIS Fast Success Rate        41.60          42.20
Ops SIS Success Rate 47.38  49.11
Ops SIS Recent Success Rate  26.87  26.57

The field I would expect to decrease is SIS Domain Scanned -- the number
of runqueues that were examined but it's actually worse and graphing over
time shows it's worse for the client thread counts.  select_idle_cpu()
is definitely being called because "Domain Search" is 10 times higher than
"Core Search" and "Core Miss" is non-zero.

I suspect the issue is that the mask is only marked busy from the tick
context which is a very wide window. If select_idle_cpu() picks an idle
CPU from the mask, it's still marked as idle in the mask.
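
One way to narrow that window, sketched here only to illustrate the point,
would be to clear the CPU from the shared idle mask at selection time
rather than waiting for the next tick. sds_idle_cpus() again stands in for
the accessor from the patch and the extra update cost would need to be
measured:

	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr)
			return -1;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
			/* Stop advertising this CPU as idle until it re-enters idle. */
			cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared));
			break;
		}
	}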

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2020-12-09 Thread Mel Gorman
On Wed, Dec 09, 2020 at 07:07:11PM +0800, Li, Aubrey wrote:
> On 2020/12/9 17:05, Mel Gorman wrote:
> > On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
> >>>> nr = div_u64(span_avg, avg_cost);
> >>>> else
> >>>> nr = 4;
> >>>> -   }
> >>>> -
> >>>> -   time = cpu_clock(this);
> >>>>
> >>>> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>>> +   time = cpu_clock(this);
> >>>> +   }
> >>>>
> >>>> for_each_cpu_wrap(cpu, cpus, target) {
> >>>> if (!--nr)
> >>
> >> nr is the key of this throttling mechanism, need to be placed under 
> >> sched_feat(SIS_PROP) as well.
> >>
> > 
> > It isn't necessary as nr is initialised to INT_MAX if !SIS_PROP.
> If !SIS_PROP, nr need to -1 then tested in the loop, instead of testing
> directly. But with SIS_PROP, need adding a test in the loop.
> Since SIS_PROP is default true, I think it's okay to keep the current way.
> 

It's because it's default true and the cost is negligible that I'm leaving
it alone. The branch cost and nr accounting cost is negligible and it
avoids peppering select_idle_cpu() with too many SIS_PROP checks.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2020-12-09 Thread Mel Gorman
On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
> >> nr = div_u64(span_avg, avg_cost);
> >> else
> >> nr = 4;
> >> -   }
> >> -
> >> -   time = cpu_clock(this);
> >>
> >> -   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >> +   time = cpu_clock(this);
> >> +   }
> >>
> >> for_each_cpu_wrap(cpu, cpus, target) {
> >> if (!--nr)
> 
> nr is the key of this throttling mechanism, need to be placed under 
> sched_feat(SIS_PROP) as well.
> 

It isn't necessary as nr is initialised to INT_MAX if !SIS_PROP.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2020-12-08 Thread Mel Gorman
On Tue, Dec 08, 2020 at 05:03:21PM +0100, Vincent Guittot wrote:
> On Tue, 8 Dec 2020 at 16:35, Mel Gorman  wrote:
> >
> > As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> > even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> > check and while we are at it, exclude the cost of initialising the CPU
> > mask from the average scan cost.
> >
> > Signed-off-by: Mel Gorman 
> > ---
> >  kernel/sched/fair.c | 14 --
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac7b34e7372b..5c41875aec23 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> > if (!this_sd)
> > return -1;
> 
> Just noticed while reviewing the patch that the above related to
> this_sd can also go under sched_feat(SIS_PROP)
> 

Technically yes but I also decided against it. It's a functional difference
depending on whether SIS_PROP is set in the highly unlikely case that
this_sd == NULL. I was also thinking in terms of what happens if SIS_PROP
was disabled and enabled while a search is in progress even if it's very
unlikely. In that case, this_sd would be uninitialised. That might be
impossible in practice depending on how static branching is implemented
but I don't think we should rely on the internals of jump labels and play
it safe. I can move it in if you feel strongly about it but I think the
disable/enable race is enough of a concern to leave it alone.

-- 
Mel Gorman
SUSE Labs


[PATCH 4/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core

2020-12-08 Thread Mel Gorman
select_idle_core is called when SMT is active and there is likely a free
core available. It may find idle CPUs but this information is simply
discarded and the scan starts over again with select_idle_cpu.

This patch caches information on idle CPUs found during the search for
a core and uses one if no core is found. This is a tradeoff. There may
be a slight impact when utilisation is low and an idle core can be
found quickly. It provides improvements as the number of busy CPUs
approaches 50% of the domain size when SMT is enabled.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

  5.10.0-rc6 5.10.0-rc6
   schedstat  idlecandidate
Hmean 1        500.06 (   0.00%)      505.67 *   1.12%*
Hmean 2        975.90 (   0.00%)      974.06 *  -0.19%*
Hmean 4       1902.95 (   0.00%)     1904.43 *   0.08%*
Hmean 8       3761.73 (   0.00%)     3721.02 *  -1.08%*
Hmean 16      6713.93 (   0.00%)     6769.17 *   0.82%*
Hmean 32     10435.31 (   0.00%)    10312.58 *  -1.18%*
Hmean 64     12325.51 (   0.00%)    13792.01 *  11.90%*
Hmean 128    21225.21 (   0.00%)    20963.44 *  -1.23%*
Hmean 256    20532.83 (   0.00%)    20335.62 *  -0.96%*
Hmean 320    20334.81 (   0.00%)    20147.25 *  -0.92%*

Note that there is a significant corner case. As the SMT scan may be
terminated early, not all CPUs have been visited and select_idle_cpu()
is still called for a full scan. This case is handled in the next
patch.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 413d895bbbf8..ed6f45832d97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
int target)
 {
+   int idle_candidate = -1;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int core, cpu;
 
@@ -6085,6 +6086,11 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
idle = false;
break;
}
+
+   if (idle_candidate == -1 &&
+   cpumask_test_cpu(cpu, p->cpus_ptr)) {
+   idle_candidate = cpu;
+   }
}
 
if (idle)
@@ -6098,7 +6104,7 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd, int
 */
set_idle_cores(target, 0);
 
-   return -1;
+   return idle_candidate;
 }
 
 /*
-- 
2.26.2



[PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target

2020-12-08 Thread Mel Gorman
After select_idle_sibling, p->recent_used_cpu is set to the
new target. However on the next wakeup, prev will be the same as
recent_used_cpu unless the load balancer has moved the task since the last
wakeup. It still works, but is less efficient than it can be after all
the changes that went in since that reduce unnecessary migrations, load
balancer changes etc.  This patch preserves recent_used_cpu for longer.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

  5.10.0-rc6 5.10.0-rc6
 baseline-v2   altrecent-v2
Hmean 1        508.39 (   0.00%)      502.05 *  -1.25%*
Hmean 2        986.70 (   0.00%)      983.65 *  -0.31%*
Hmean 4       1914.55 (   0.00%)     1920.24 *   0.30%*
Hmean 8       3702.37 (   0.00%)     3663.96 *  -1.04%*
Hmean 16      6573.11 (   0.00%)     6545.58 *  -0.42%*
Hmean 32     10142.57 (   0.00%)    10253.73 *   1.10%*
Hmean 64     14348.40 (   0.00%)    12506.31 * -12.84%*
Hmean 128    21842.59 (   0.00%)    21967.13 *   0.57%*
Hmean 256    20813.75 (   0.00%)    21534.52 *   3.46%*
Hmean 320    20684.33 (   0.00%)    21070.14 *   1.87%*

The difference was marginal except for 64 threads, which showed in the
baseline that the result was very unstable whereas the patch was much
more stable. This is somewhat machine-specific as on a separate 80-cpu
Broadwell machine the same test reported the following.

  5.10.0-rc6 5.10.0-rc6
 baseline-v2   altrecent-v2
Hmean 1        310.36 (   0.00%)      291.81 *  -5.98%*
Hmean 2        340.86 (   0.00%)      547.22 *  60.54%*
Hmean 4        912.29 (   0.00%)     1063.21 *  16.54%*
Hmean 8       2116.40 (   0.00%)     2103.60 *  -0.60%*
Hmean 16      4232.90 (   0.00%)     4362.92 *   3.07%*
Hmean 32      8442.03 (   0.00%)     8642.10 *   2.37%*
Hmean 64     11733.91 (   0.00%)    11473.66 *  -2.22%*
Hmean 128    17727.24 (   0.00%)    16784.23 *  -5.32%*
Hmean 256    16089.23 (   0.00%)    16110.79 *   0.13%*
Hmean 320    15992.60 (   0.00%)    16071.64 *   0.49%*

schedstats were not used in this series but from an earlier debugging
effort, the schedstats after the test run were as follows;

Ops SIS Search   5653107942.00  5726545742.00
Ops SIS Domain Search        3365067916.00  3319768543.00
Ops SIS Scanned            112173512543.00 99194352541.00
Ops SIS Domain Scanned 109885472517.00 96787575342.00
Ops SIS Failures 2923185114.00  2950166441.00
Ops SIS Recent Used Hit   56547.00   118064916.00
Ops SIS Recent Used Miss 1590899250.00   354942791.00
Ops SIS Recent Attempts  1590955797.00   473007707.00
Ops SIS Search Efficiency 5.04   5.77
Ops SIS Domain Search Eff 3.06   3.43
Ops SIS Fast Success Rate       40.47          42.03
Ops SIS Success Rate 48.29  48.48
Ops SIS Recent Success Rate   0.00  24.96

The first interesting point is the ridiculous number of times runqueues
are scanned -- almost 97 billion times over the course of 40 minutes

With the patch, "Recent Used Hit" is over 2000 times more likely to
succeed. The failure rate also increases by quite a lot but the cost is
marginal even if the "Fast Success Rate" only increases by 2% overall. What
cannot be observed from these stats is where the biggest impact is, as
these stats cover everything from low utilisation to over-saturation.

If graphed over time, the graphs show that the sched domain is only
scanned at negligible rates until the machine is fully busy. With
low utilisation, the "Fast Success Rate" is almost 100% until the
machine is fully busy. For 320 clients, the success rate is close to
0% which is unsurprising.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c41875aec23..413d895bbbf8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
 
/* Check a recently used CPU as a potential idle candidate: */
recent_used_cpu = p->recent_used_cpu;
+   p->recent_used_cpu = prev;
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || 
sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
asym_fits_capacity(task_util, recent_used_cpu)) {
-   /*
-* Replace recent_used_cpu with prev as it is a potential
-* candidate for the next wake:
-*/
-   p->recent

[PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

2020-12-08 Thread Mel Gorman
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It
was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 20 +---
 kernel/sched/features.h |  1 -
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98075f9ea9a8..ac7b34e7372b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
struct sched_domain *this_sd;
-   u64 avg_cost, avg_idle;
u64 time;
int this = smp_processor_id();
int cpu, nr = INT_MAX;
@@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
if (!this_sd)
return -1;
 
-   /*
-* Due to large variance we need a large fuzz factor; hackbench in
-* particularly is sensitive here.
-*/
-   avg_idle = this_rq()->avg_idle / 512;
-   avg_cost = this_sd->avg_scan_cost + 1;
+   if (sched_feat(SIS_PROP)) {
+   u64 avg_cost, avg_idle, span_avg;
 
-   if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-   return -1;
+   /*
+* Due to large variance we need a large fuzz factor;
+* hackbench in particularly is sensitive here.
+*/
+   avg_idle = this_rq()->avg_idle / 512;
+   avg_cost = this_sd->avg_scan_cost + 1;
 
-   if (sched_feat(SIS_PROP)) {
-   u64 span_avg = sd->span_weight * avg_idle;
+   span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2



[PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP

2020-12-08 Thread Mel Gorman
As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac7b34e7372b..5c41875aec23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (!this_sd)
return -1;
 
+   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
if (sched_feat(SIS_PROP)) {
u64 avg_cost, avg_idle, span_avg;
 
@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
nr = div_u64(span_avg, avg_cost);
else
nr = 4;
-   }
-
-   time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   time = cpu_clock(this);
+   }
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
break;
}
 
-   time = cpu_clock(this) - time;
-   update_avg(&this_sd->avg_scan_cost, time);
+   if (sched_feat(SIS_PROP)) {
+   time = cpu_clock(this) - time;
+   update_avg(&this_sd->avg_scan_cost, time);
+   }
 
return cpu;
 }
-- 
2.26.2



[PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling

2020-12-08 Thread Mel Gorman
Changelog since v1
o Drop single-pass patch                  (vincent)
o Scope variables used for SIS_AVG_CPU    (dietmar)
o Remove redundant assignment             (dietmar)

This reduces the amount of runqueue scanning in select_idle_sibling in
the worst case.

Patch 1 removes SIS_AVG_CPU because it's unused.

Patch 2 moves all SIS_PROP-related calculations under SIS_PROP

Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
of scanning. It should be relatively uncontroversial

Patch 4 returns an idle candidate if one is found while scanning for a
free core.

-- 
2.26.2

Mel Gorman (4):
  sched/fair: Remove SIS_AVG_CPU
  sched/fair: Move avg_scan_cost calculations under SIS_PROP
  sched/fair: Do not replace recent_used_cpu with the new target
  sched/fair: Return an idle cpu if one is found after a failed search
for an idle core

 kernel/sched/fair.c | 51 -
 kernel/sched/features.h |  1 -
 2 files changed, 25 insertions(+), 27 deletions(-)

-- 
2.26.2



Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

2020-12-08 Thread Mel Gorman
On Tue, Dec 08, 2020 at 03:47:40PM +0100, Vincent Guittot wrote:
> > I considered it but made the choice to exclude the cost of cpumask_and()
> > from the avg_scan_cost instead. It's minor but when doing the original
> 
> At the cost of a less readable code
> 

Slightly less readable, yes.

> > prototype, I didn't think it was appropriate to count the cpumask
> > clearing as part of the scan cost as it's not directly related.
> 
> hmm... I think it is because the number of loop is directly related to
> the allowed cpus
> 

While that is true, the cost of initialising the map is constant and
what is most important is tracking the scan cost which is variable.
Without SIS_AVG_CPU, the cpumask init can go before SIS_PROP without any
penalty so is this version preferable?

--8<--
sched/fair: Move avg_scan_cost calculations under SIS_PROP

As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac7b34e7372b..5c41875aec23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
if (!this_sd)
return -1;
 
+   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
if (sched_feat(SIS_PROP)) {
u64 avg_cost, avg_idle, span_avg;
 
@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
nr = div_u64(span_avg, avg_cost);
else
nr = 4;
-   }
-
-   time = cpu_clock(this);
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   time = cpu_clock(this);
+   }
 
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
break;
}
 
-   time = cpu_clock(this) - time;
-   update_avg(&this_sd->avg_scan_cost, time);
+   if (sched_feat(SIS_PROP)) {
+   time = cpu_clock(this) - time;
+   update_avg(&this_sd->avg_scan_cost, time);
+   }
 
return cpu;
 }


Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

2020-12-08 Thread Mel Gorman
On Tue, Dec 08, 2020 at 02:43:10PM +0100, Vincent Guittot wrote:
> On Tue, 8 Dec 2020 at 14:36, Mel Gorman  wrote:
> >
> > On Tue, Dec 08, 2020 at 02:24:32PM +0100, Vincent Guittot wrote:
> > > > > Nitpick:
> > > > >
> > > > > Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could 
> > > > > go
> > > > > completely into the SIS_PROP if condition.
> > > > >
> > > >
> > > > Yeah, I can do that. In the initial prototype, that happened in a
> > > > separate patch that split out SIS_PROP into a helper function and I
> > > > never merged it back. It's a trivial change.
> > >
> > > while doing this, should you also put the update of
> > > this_sd->avg_scan_cost under the SIS_PROP feature ?
> > >
> >
> > It's outside the scope of the series but why not. This?
> >
> > --8<--
> > sched/fair: Move avg_scan_cost calculations under SIS_PROP
> >
> > As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> > even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> > check and while we are at it, exclude the cost of initialising the CPU
> > mask from the average scan cost.
> >
> > Signed-off-by: Mel Gorman 
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 19ca0265f8aa..0fee53b1aae4 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6176,10 +6176,10 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> > nr = 4;
> > }
> >
> > -   time = cpu_clock(this);
> 
> I would move it in the if (sched_feat(SIS_PROP)) above.
> 

I considered it but made the choice to exclude the cost of cpumask_and()
from the avg_scan_cost instead. It's minor but when doing the original
prototype, I didn't think it was appropriate to count the cpumask
clearing as part of the scan cost as it's not directly related.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

2020-12-08 Thread Mel Gorman
On Tue, Dec 08, 2020 at 02:24:32PM +0100, Vincent Guittot wrote:
> > > Nitpick:
> > >
> > > Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go
> > > completely into the SIS_PROP if condition.
> > >
> >
> > Yeah, I can do that. In the initial prototype, that happened in a
> > separate patch that split out SIS_PROP into a helper function and I
> > never merged it back. It's a trivial change.
> 
> while doing this, should you also put the update of
> this_sd->avg_scan_cost under the SIS_PROP feature ?
> 

It's outside the scope of the series but why not. This?

--8<--
sched/fair: Move avg_scan_cost calculations under SIS_PROP

As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.

Signed-off-by: Mel Gorman 

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19ca0265f8aa..0fee53b1aae4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6176,10 +6176,10 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
nr = 4;
}
 
-   time = cpu_clock(this);
-
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
+   if (sched_feat(SIS_PROP))
+   time = cpu_clock(this);
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
@@ -6187,8 +6187,10 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
break;
}
 
-   time = cpu_clock(this) - time;
-   update_avg(&this_sd->avg_scan_cost, time);
+   if (sched_feat(SIS_PROP)) {
+   time = cpu_clock(this) - time;
+   update_avg(&this_sd->avg_scan_cost, time);
+   }
 
return cpu;
 }


Re: [PATCH 2/4] sched/fair: Do not replace recent_used_cpu with the new target

2020-12-08 Thread Mel Gorman
On Tue, Dec 08, 2020 at 10:57:29AM +0100, Dietmar Eggemann wrote:
> On 07/12/2020 10:15, Mel Gorman wrote:
> > After select_idle_sibling, p->recent_used_cpu is set to the
> > new target. However on the next wakeup, prev will be the same as
> 
> I'm confused here. Isn't current->recent_used_cpu set to 'cpu =
> smp_processor_id()' after sis()? Looking at v5.10-rc6.

If you are referring to this;

   if (want_affine)
   current->recent_used_cpu = cpu;

then it gets removed by the patch. That code replaces recent_used_cpu with
the waker's CPU, which still works but the hit rate is lower.

> 
> [...]
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 23934dbac635..01b38fc17bca 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6274,6 +6274,7 @@ static int select_idle_sibling(struct task_struct *p, 
> > int prev, int target)
> >  
> > /* Check a recently used CPU as a potential idle candidate: */
> > recent_used_cpu = p->recent_used_cpu;
> > +   p->recent_used_cpu = prev;
> > if (recent_used_cpu != prev &&
> > recent_used_cpu != target &&
> > cpus_share_cache(recent_used_cpu, target) &&
> 
> p->recent_used_cpu is already set to prev in this if condition.
> 

That can be removed as redundant, I'll fix it.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

2020-12-08 Thread Mel Gorman
On Tue, Dec 08, 2020 at 11:07:19AM +0100, Dietmar Eggemann wrote:
> On 07/12/2020 10:15, Mel Gorman wrote:
> > SIS_AVG_CPU was introduced as a means of avoiding a search when the
> > average search cost indicated that the search would likely fail. It
> > was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
> > select_idle_cpu() more aggressive") and later replaced with a proportional
> > search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
> > scale select_idle_cpu()").
> > 
> > While there are corner cases where SIS_AVG_CPU is better, it has now been
> > disabled for almost three years. As the intent of SIS_PROP is to reduce
> > the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
> > on SIS_PROP as a throttling mechanism.
> > 
> > Signed-off-by: Mel Gorman 
> > ---
> >  kernel/sched/fair.c | 3 ---
> >  kernel/sched/features.h | 1 -
> >  2 files changed, 4 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 98075f9ea9a8..23934dbac635 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6161,9 +6161,6 @@ static int select_idle_cpu(struct task_struct *p, 
> > struct sched_domain *sd, int t
> > avg_idle = this_rq()->avg_idle / 512;
> > avg_cost = this_sd->avg_scan_cost + 1;
> >  
> > -   if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
> > -   return -1;
> > -
> > if (sched_feat(SIS_PROP)) {
> > u64 span_avg = sd->span_weight * avg_idle;
> > if (span_avg > 4*avg_cost)
> 
> Nitpick:
> 
> Since now avg_cost and avg_idle are only used w/ SIS_PROP, they could go
> completely into the SIS_PROP if condition.
> 

Yeah, I can do that. In the initial prototype, that happened in a
separate patch that split out SIS_PROP into a helper function and I
never merged it back. It's a trivial change.

Thanks.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup

2020-12-07 Thread Mel Gorman
On Mon, Dec 07, 2020 at 04:50:15PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 18, 2020 at 12:31:13PM +0800, Aubrey Li wrote:
> > Add idle cpumask to track idle cpus in sched domain. When a CPU
> > enters idle, if the idle driver indicates to stop tick, this CPU
> > is set in the idle cpumask to be a wakeup target. And if the CPU
> > is not in idle, the CPU is cleared in idle cpumask during scheduler
> > tick to ratelimit idle cpumask update.
> 
> So this will only have SIS consider CPUs that have the tick stopped?
> That might affect the facebook tail-latency workloads quite a bit.
> 

This is the concern I had as well. That's why patch "sched/fair: Avoid
revisiting CPUs multiple times during select_idle_sibling" was created. The
intent was that a mask could be used as a hint about what runqueues to
look at first but continue looking at other runqueues without examining
the same runqueue twice. While the patch would not be merged on its own,
it may still be relevant in the context of an idle CPU mask depending on
how up to date it is.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH 0/4] Reduce worst-case scanning of runqueues in select_idle_sibling

2020-12-07 Thread Mel Gorman
On Mon, Dec 07, 2020 at 04:04:41PM +0100, Vincent Guittot wrote:
> On Mon, 7 Dec 2020 at 10:15, Mel Gorman  wrote:
> >
> > This is a minimal series to reduce the amount of runqueue scanning in
> > select_idle_sibling in the worst case.
> >
> > Patch 1 removes SIS_AVG_CPU because it's unused.
> >
> > Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
> > of scanning. It should be relatively uncontroversial
> >
> > Patch 3-4 scans the runqueues in a single pass for select_idle_core()
> > and select_idle_cpu() so runqueues are not scanned twice. It's
> > a tradeoff because it benefits deep scans but introduces overhead
> > for shallow scans.
> >
> > Even if patch 3-4 is rejected to allow more time for Aubrey's idle cpu mask
> 
> patch 3 looks fine and doesn't collide with Aubrey's work. But I don't
> like patch 4  which manipulates different cpumask including
> load_balance_mask out of LB and I prefer to wait for v6 of Aubrey's
> patchset which should fix the problem of possibly  scanning twice busy
> cpus in select_idle_core and select_idle_cpu
> 

Seems fair, we can see where we stand after V6 of Aubrey's work.  A lot
of the motivation for patch 4 would go away if we managed to avoid calling
select_idle_core() unnecessarily. As it stands, we can call it a lot from
hackbench even though the chance of getting an idle core are minimal.

Assuming I revisit it, I'll update the schedstat debug patches to include
the times select_idle_core() starts versus how many times it fails and
see can I think of a useful heuristic.

I'll wait for more review on patches 1-3 and if I hear nothing, I'll
resend just those.

Thanks Vincent.

-- 
Mel Gorman
SUSE Labs


[PATCH 4/4] sched/fair: Avoid revisiting CPUs multiple times during select_idle_sibling

2020-12-07 Thread Mel Gorman
select_idle_core() potentially searches a number of CPUs for idle candidates
before select_idle_cpu() clears the mask and revisits the same CPUs. This
patch moves the initialisation of select_idle_mask to the top-level and
reuses the same mask across both select_idle_core and select_idle_cpu.
select_idle_smt() is left alone as the cost of checking one SMT sibling
is marginal relative to calling __clear_cpumask_cpu() for every CPU
visited by select_idle_core().

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

  5.10.0-rc6 5.10.0-rc6
altrecent-v2  singlepass-v2
Hmean 1        502.05 (   0.00%)      498.53 *  -0.70%*
Hmean 2        983.65 (   0.00%)      982.33 *  -0.13%*
Hmean 4       1920.24 (   0.00%)     1929.34 *   0.47%*
Hmean 8       3663.96 (   0.00%)     3536.94 *  -3.47%*
Hmean 16      6545.58 (   0.00%)     6560.21 *   0.22%*
Hmean 32     10253.73 (   0.00%)    10374.30 *   1.18%*
Hmean 64     12506.31 (   0.00%)    11692.26 *  -6.51%*
Hmean 128    21967.13 (   0.00%)    21705.80 *  -1.19%*
Hmean 256    21534.52 (   0.00%)    21223.50 *  -1.44%*
Hmean 320    21070.14 (   0.00%)    21023.31 *  -0.22%*

As before, results are somewhat workload and machine-specific with a mix
of good and bad. For example, netperf UDP_STREAM on an 80-cpu Broadwell
machine shows;

 5.10.0-rc6 5.10.0-rc6
   altrecent-v2  singlepass-v2
Hmean send-64       232.78 (   0.00%)      258.02 *  10.85%*
Hmean send-128      464.69 (   0.00%)      511.53 *  10.08%*
Hmean send-256      904.72 (   0.00%)      999.40 *  10.46%*
Hmean send-1024    3512.08 (   0.00%)     3868.86 *  10.16%*
Hmean send-2048    6777.61 (   0.00%)     7435.05 *   9.70%*
Hmean send-3312   10352.45 (   0.00%)    11321.43 *   9.36%*
Hmean send-4096   12660.58 (   0.00%)    13577.08 *   7.24%*
Hmean send-8192   19240.36 (   0.00%)    21321.34 *  10.82%*
Hmean send-16384  29691.27 (   0.00%)    31296.47 *   5.41%*

The real question is -- is it better to scan CPU runqueues just once
where possible or scan multiple times? This patch scans once but it must
do additional work to track that scanning, so for shallow scans it's a
loss and for deep scans it's a win.

The main downside to this patch is that cpumask manipulations are
more complex because two cpumasks are involved. The alternative is that
select_idle_core() would scan the full domain and always return a CPU which
was previously considered. Alternatively, the depth of select_idle_core()
could be limited. Similarly, select_idle_core() often scans when no idle
core is available as test_idle_cores is very race-prone and maybe there
is a better approach to determining if select_idle_core() is likely to
succeed or not.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 40 +---
 1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 00c3b526a5bd..3b1736dc5bde 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6064,10 +6064,11 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
int target)
+static int select_idle_core(struct task_struct *p, struct sched_domain *sd,
+   int target, struct cpumask *cpus_scan)
 {
int idle_candidate = -1;
-   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+   struct cpumask *cpus;
int core, cpu;
 
if (!static_branch_likely(&sched_smt_present))
@@ -6076,12 +6077,21 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
if (!test_idle_cores(target, false))
return -1;
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   /*
+* Repurpose load_balance_mask to avoid rescanning cores while
+* cpus_scan tracks the cpu if select_idle_cpu() is necessary.
+* In this context, it should be impossible to enter LB and
+* clobber load_balance_mask.
+*/
+   cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+   cpumask_copy(cpus, cpus_scan);
 
for_each_cpu_wrap(core, cpus, target) {
bool idle = true;
 
for_each_cpu(cpu, cpu_smt_mask(core)) {
+   __cpumask_clear_cpu(cpu, cpus_scan);
+
if (!available_idle_cpu(cpu)) {
idle = false;
break;
@@ -6130,7 +6140,8 @@ static int select_idle_smt(struct task_struct *p, struct 
sched_domain *sd, int t
 
 #else /* CONFIG_

[PATCH 3/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core

2020-12-07 Thread Mel Gorman
select_idle_core is called when SMT is active and there is likely a free
core available. It may find idle CPUs but this information is simply
discarded and the scan starts over again with select_idle_cpu.

This patch caches information on idle CPUs found during the search for
a core and uses one if no core is found. This is a tradeoff. There may
be a slight impact when utilisation is low and an idle core can be
found quickly. It provides improvements as the number of busy CPUs
approaches 50% of the domain size when SMT is enabled.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

  5.10.0-rc6 5.10.0-rc6
   schedstat  idlecandidate
Hmean 1        500.06 (   0.00%)      505.67 *   1.12%*
Hmean 2        975.90 (   0.00%)      974.06 *  -0.19%*
Hmean 4       1902.95 (   0.00%)     1904.43 *   0.08%*
Hmean 8       3761.73 (   0.00%)     3721.02 *  -1.08%*
Hmean 16      6713.93 (   0.00%)     6769.17 *   0.82%*
Hmean 32     10435.31 (   0.00%)    10312.58 *  -1.18%*
Hmean 64     12325.51 (   0.00%)    13792.01 *  11.90%*
Hmean 128    21225.21 (   0.00%)    20963.44 *  -1.23%*
Hmean 256    20532.83 (   0.00%)    20335.62 *  -0.96%*
Hmean 320    20334.81 (   0.00%)    20147.25 *  -0.92%*

Note that there is a significant corner case. As the SMT scan may be
terminated early, not all CPUs have been visited and select_idle_cpu()
is still called for a full scan. This case is handled in the next
patch.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01b38fc17bca..00c3b526a5bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
int target)
 {
+   int idle_candidate = -1;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int core, cpu;
 
@@ -6085,6 +6086,11 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
idle = false;
break;
}
+
+   if (idle_candidate == -1 &&
+   cpumask_test_cpu(cpu, p->cpus_ptr)) {
+   idle_candidate = cpu;
+   }
}
 
if (idle)
@@ -6098,7 +6104,7 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd, int
 */
set_idle_cores(target, 0);
 
-   return -1;
+   return idle_candidate;
 }
 
 /*
-- 
2.26.2



[RFC PATCH 0/4] Reduce worst-case scanning of runqueues in select_idle_sibling

2020-12-07 Thread Mel Gorman
This is a minimal series to reduce the amount of runqueue scanning in
select_idle_sibling in the worst case.

Patch 1 removes SIS_AVG_CPU because it's unused.

Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount
of scanning. It should be relatively uncontroversial

Patch 3-4 scans the runqueues in a single pass for select_idle_core()
and select_idle_cpu() so runqueues are not scanned twice. It's
a tradeoff because it benefits deep scans but introduces overhead
for shallow scans.

Even if patch 3-4 is rejected to allow more time for Aubrey's idle cpu mask
approach to stand on its own, patches 1-2 should be fine. The main decision
with patch 4 is whether select_idle_core() should do a full scan when searching
for an idle core, whether it should be throttled in some other fashion or
whether it should be just left alone.

-- 
2.26.2



[PATCH 2/4] sched/fair: Do not replace recent_used_cpu with the new target

2020-12-07 Thread Mel Gorman
After select_idle_sibling, p->recent_used_cpu is set to the
new target. However on the next wakeup, prev will be the same as
recent_used_cpu unless the load balancer has moved the task since the last
wakeup. It still works, but is less efficient than it can be after all
the changes that went in since that reduce unnecessary migrations, load
balancer changes etc.  This patch preserves recent_used_cpu for longer.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

  5.10.0-rc6 5.10.0-rc6
 baseline-v2   altrecent-v2
Hmean 1        508.39 (   0.00%)      502.05 *  -1.25%*
Hmean 2        986.70 (   0.00%)      983.65 *  -0.31%*
Hmean 4       1914.55 (   0.00%)     1920.24 *   0.30%*
Hmean 8       3702.37 (   0.00%)     3663.96 *  -1.04%*
Hmean 16      6573.11 (   0.00%)     6545.58 *  -0.42%*
Hmean 32     10142.57 (   0.00%)    10253.73 *   1.10%*
Hmean 64     14348.40 (   0.00%)    12506.31 * -12.84%*
Hmean 128    21842.59 (   0.00%)    21967.13 *   0.57%*
Hmean 256    20813.75 (   0.00%)    21534.52 *   3.46%*
Hmean 320    20684.33 (   0.00%)    21070.14 *   1.87%*

The difference was marginal except for 64 threads, which showed in the
baseline that the result was very unstable whereas the patch was much
more stable. This is somewhat machine-specific as on a separate 80-cpu
Broadwell machine the same test reported the following.

  5.10.0-rc6 5.10.0-rc6
 baseline-v2   altrecent-v2
Hmean 1        310.36 (   0.00%)      291.81 *  -5.98%*
Hmean 2        340.86 (   0.00%)      547.22 *  60.54%*
Hmean 4        912.29 (   0.00%)     1063.21 *  16.54%*
Hmean 8       2116.40 (   0.00%)     2103.60 *  -0.60%*
Hmean 16      4232.90 (   0.00%)     4362.92 *   3.07%*
Hmean 32      8442.03 (   0.00%)     8642.10 *   2.37%*
Hmean 64     11733.91 (   0.00%)    11473.66 *  -2.22%*
Hmean 128    17727.24 (   0.00%)    16784.23 *  -5.32%*
Hmean 256    16089.23 (   0.00%)    16110.79 *   0.13%*
Hmean 320    15992.60 (   0.00%)    16071.64 *   0.49%*

schedstats were not used in this series but from an earlier debugging
effort, the schedstats after the test run were as follows;

Ops SIS Search   5653107942.00  5726545742.00
Ops SIS Domain Search        3365067916.00  3319768543.00
Ops SIS Scanned            112173512543.00 99194352541.00
Ops SIS Domain Scanned 109885472517.00 96787575342.00
Ops SIS Failures 2923185114.00  2950166441.00
Ops SIS Recent Used Hit   56547.00   118064916.00
Ops SIS Recent Used Miss 1590899250.00   354942791.00
Ops SIS Recent Attempts  1590955797.00   473007707.00
Ops SIS Search Efficiency 5.04   5.77
Ops SIS Domain Search Eff 3.06   3.43
Ops SIS Fast Success Rate       40.47          42.03
Ops SIS Success Rate 48.29  48.48
Ops SIS Recent Success Rate   0.00  24.96

The first interesting point is the ridiculous number of times runqueues
are scanned -- almost 97 billion times over the course of 40 minutes

With the patch, "Recent Used Hit" is over 2000 times more likely to
succeed. The failure rate also increases by quite a lot but the cost is
marginal even if the "Fast Success Rate" only increases by 2% overall. What
cannot be observed from these stats is where the biggest impact is, as
these stats cover everything from low utilisation to over-saturation.

If graphed over time, the graphs show that the sched domain is only
scanned at negligible rates until the machine is fully busy. With
low utilisation, the "Fast Success Rate" is almost 100% until the
machine is fully busy. For 320 clients, the success rate is close to
0% which is unsurprising.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23934dbac635..01b38fc17bca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,6 +6274,7 @@ static int select_idle_sibling(struct task_struct *p, int 
prev, int target)
 
/* Check a recently used CPU as a potential idle candidate: */
recent_used_cpu = p->recent_used_cpu;
+   p->recent_used_cpu = prev;
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
@@ -6765,9 +6766,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int wake_flags)
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
-   if (want_affine)
-   current->recent_used_cpu = cpu;
}
rcu_read_unlock();
 
-- 
2.26.2



[PATCH 1/4] sched/fair: Remove SIS_AVG_CPU

2020-12-07 Thread Mel Gorman
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It
was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 3 ---
 kernel/sched/features.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98075f9ea9a8..23934dbac635 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6161,9 +6161,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
avg_idle = this_rq()->avg_idle / 512;
avg_cost = this_sd->avg_scan_cost + 1;
 
-   if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-   return -1;
-
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2



Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages

2020-12-06 Thread Mel Gorman
On Sat, Dec 05, 2020 at 09:26:47PM -0500, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Nov 26, 2020 at 10:47:20AM +0000, Mel Gorman wrote:
> > Agreed. This thread has a lot of different directions in it at this
> > point so what I'd hope for is first, a patch that initialises holes with
> > zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a
> > hole, it would be expected the pages are PageReserved. Second, a fix to
> > fast_isolate that forces PFNs returned to always be within the stated
> > zone boundaries.
> 
> The last two patches should resolve the struct page
> initialization
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/ and the
> VM_BUG_ON_PAGE never happened again as expected.
> 
> So I looked back to see how the "distance" logic is accurate. I added
> those checks and I get negative hits:
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index cc1a7f600a86..844a90b0acdf 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1331,6 +1331,12 @@ fast_isolate_freepages(struct compact_control *cc)
>   low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
>   min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
>  
> + WARN_ON_ONCE((long) low_pfn < 0);
> + WARN_ON_ONCE((long) min_pfn < 0);
> + if ((long) low_pfn < 0)
> + return cc->free_pfn;
> + if ((long) min_pfn < 0)
> + return cc->free_pfn;
>   if (WARN_ON_ONCE(min_pfn > low_pfn))
>   low_pfn = min_pfn;
> 
> Both warn-on-once triggers, so it goes negative. While it appears not
> kernel crashing since pfn_valid won't succeed on negative entries and
> they'll always be higher than any pfn in the freelist, is this sign
> that there's further room for improvement here?
> 

Possibly, checking the wrong pfns is simply risky. This is not tested
at all, just checking whether it is in the right ballpark. The intent is
that when the free/migrate PFNs are too close or already overlapping, no
attempt is made and it returns so the caller detects the scanners have met.

diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..208cb5857446 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1313,6 +1313,10 @@ fast_isolate_freepages(struct compact_control *cc)
if (cc->order <= 0)
return cc->free_pfn;
 
+   /* Ensure that migration and free scanner are not about to cross */
+   if (cc->migrate_pfn >= cc->free_pfn)
+   return cc->free_pfn;
+
/*
 * If starting the scan, use a deeper search and use the highest
 * PFN found if a suitable one is not found.
@@ -1324,9 +1328,12 @@ fast_isolate_freepages(struct compact_control *cc)
 
/*
 * Preferred point is in the top quarter of the scan space but take
-* a pfn from the top half if the search is problematic.
+* a pfn from the top half if the search is problematic. Ensure
+* there is enough distance to make the fast search worthwhile.
 */
distance = (cc->free_pfn - cc->migrate_pfn);
+   if (distance <= (pageblock_nr_pages << 2))
+   return cc->free_pfn;
low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
 

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched

2020-12-04 Thread Mel Gorman
On Fri, Dec 04, 2020 at 04:43:05PM +0100, Vincent Guittot wrote:
> On Fri, 4 Dec 2020 at 16:40, Mel Gorman  wrote:
> >
> > On Fri, Dec 04, 2020 at 04:23:48PM +0100, Vincent Guittot wrote:
> > > On Fri, 4 Dec 2020 at 15:31, Mel Gorman  
> > > wrote:
> > > >
> > > > On Fri, Dec 04, 2020 at 02:47:48PM +0100, Vincent Guittot wrote:
> > > > > > IIUC, select_idle_core and select_idle_cpu share the same 
> > > > > > cpumask(select_idle_mask)?
> > > > > > If the target's sibling is removed from select_idle_mask from 
> > > > > > select_idle_core(),
> > > > > > select_idle_cpu() will lose the chance to pick it up?
> > > > >
> > > > > This is only relevant for patch 10 which is not to be included IIUC
> > > > > what mel said in cover letter : "Patches 9 and 10 are stupid in the
> > > > > context of this series."
> > > > >
> > > >
> > > > Patch 10 was stupid in the context of the prototype because
> > > > select_idle_core always returned a CPU. A variation ended up being
> > > > reintroduced at the end of the Series Yet To Be Posted so that SMT 
> > > > siblings
> > > > are cleared during select_idle_core() but select_idle_cpu() still has a
> > > > mask with unvisited CPUs to consider if no idle cores are found.
> > > >
> > > > As far as I know, this would still be compatible with Aubrey's idle
> > > > cpu mask as long as it's visited and cleared between select_idle_core
> > > > and select_idle_cpu. It relaxes the constraints on Aubrey to some extent
> > > > because the idle cpu mask would be a hint so if the information is out
> > > > of date, an idle cpu may still be found the normal way.
> > >
> > > But even without patch 10, just replacing sched_domain_span(sd) by
> > > sds_idle_cpus(sd->shared) will ensure that sis loops only on cpus that
> > > get a chance to be idle so select_idle_core is likely to return an
> > > idle_candidate
> > >
> >
> > Yes but if the idle mask is out of date for any reason then idle CPUs might
> 
> In fact it's the opposite, a cpu in idle mask might not be idle but
> all cpus that enter idle will be set
> 

When I first checked, the information was based on the tick or a CPU
stopping the tick. That was not guaranteed to be up to date so I considered
the best option would be to treat idle cpu mask as advisory. It would
not necessarily cover a CPU that was entering idle and polling before
entering an idle state for example or a rq that would pass sched_idle_cpu()
depending on the timing of the update_idle_cpumask call.

I know you reviewed that patch and v6 may be very different but the more
up to date that information is, the greater the cache conflicts will be
on sched_domain_shared so maintaining the up-to-date information may cost
enough to offset any benefit from reduced searching at wakeup.

If this turns out to be wrong, then great, the idle cpu mask can be used
as both the basis for an idle core search and a fast find of an individual
CPU. If the cost of keeping up to date information is too high then the
idle_cpu_mask can be treated as advisory to start the search and track
CPUs visited.

The series are not either/or, chunks of the series I posted are orthogonal
(e.g. changes to p->recent_cpu_used), the latter parts could either work
with idle cpu mask or be replaced by idle cpu mask depending on which
performs better.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched

2020-12-04 Thread Mel Gorman
On Fri, Dec 04, 2020 at 04:23:48PM +0100, Vincent Guittot wrote:
> On Fri, 4 Dec 2020 at 15:31, Mel Gorman  wrote:
> >
> > On Fri, Dec 04, 2020 at 02:47:48PM +0100, Vincent Guittot wrote:
> > > > IIUC, select_idle_core and select_idle_cpu share the same 
> > > > cpumask(select_idle_mask)?
> > > > If the target's sibling is removed from select_idle_mask from 
> > > > select_idle_core(),
> > > > select_idle_cpu() will lose the chance to pick it up?
> > >
> > > This is only relevant for patch 10 which is not to be included IIUC
> > > what mel said in cover letter : "Patches 9 and 10 are stupid in the
> > > context of this series."
> > >
> >
> > Patch 10 was stupid in the context of the prototype because
> > select_idle_core always returned a CPU. A variation ended up being
> > reintroduced at the end of the Series Yet To Be Posted so that SMT siblings
> > are cleared during select_idle_core() but select_idle_cpu() still has a
> > mask with unvisited CPUs to consider if no idle cores are found.
> >
> > As far as I know, this would still be compatible with Aubrey's idle
> > cpu mask as long as it's visited and cleared between select_idle_core
> > and select_idle_cpu. It relaxes the constraints on Aubrey to some extent
> > because the idle cpu mask would be a hint so if the information is out
> > of date, an idle cpu may still be found the normal way.
> 
> But even without patch 10, just replacing sched_domain_span(sd) by
> sds_idle_cpus(sd->shared) will ensure that sis loops only on cpus that
> get a chance to be idle so select_idle_core is likely to return an
> idle_candidate
> 

Yes but if the idle mask is out of date for any reason then idle CPUs might
be missed -- hence the intent to maintain a mask of CPUs visited and use
the idle cpu mask as a hint to prioritise CPUs that are likely idle but
fall back to a normal scan if none of the "idle cpu mask" CPUs are
actually idle.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched

2020-12-04 Thread Mel Gorman
On Fri, Dec 04, 2020 at 02:47:48PM +0100, Vincent Guittot wrote:
> > IIUC, select_idle_core and select_idle_cpu share the same 
> > cpumask(select_idle_mask)?
> > If the target's sibling is removed from select_idle_mask from 
> > select_idle_core(),
> > select_idle_cpu() will lose the chance to pick it up?
> 
> This is only relevant for patch 10 which is not to be included IIUC
> what mel said in cover letter : "Patches 9 and 10 are stupid in the
> context of this series."
> 

Patch 10 was stupid in the context of the prototype because
select_idle_core always returned a CPU. A variation ended up being
reintroduced at the end of the Series Yet To Be Posted so that SMT siblings
are cleared during select_idle_core() but select_idle_cpu() still has a
mask with unvisited CPUs to consider if no idle cores are found.

As far as I know, this would still be compatible with Aubrey's idle
cpu mask as long as it's visited and cleared between select_idle_core
and select_idle_cpu. It relaxes the constraints on Aubrey to some extent
because the idle cpu mask would be a hint so if the information is out
of date, an idle cpu may still be found the normal way.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched

2020-12-04 Thread Mel Gorman
On Fri, Dec 04, 2020 at 02:17:20PM +0100, Vincent Guittot wrote:
> On Fri, 4 Dec 2020 at 14:13, Vincent Guittot  
> wrote:
> >
> > On Fri, 4 Dec 2020 at 12:30, Mel Gorman  wrote:
> > >
> > > On Fri, Dec 04, 2020 at 11:56:36AM +0100, Vincent Guittot wrote:
> > > > > The intent was that the sibling might still be an idle candidate. In
> > > > > the current draft of the series, I do not even clear this so that the
> > > > > SMT sibling is considered as an idle candidate. The reasoning is that 
> > > > > if
> > > > > there are no idle cores then an SMT sibling of the target is as good 
> > > > > an
> > > > > idle CPU to select as any.
> > > >
> > > > Isn't the purpose of select_idle_smt ?
> > > >
> > >
> > > Only in part.
> > >
> > > > select_idle_core() looks for an idle core and opportunistically saves
> > > > an idle CPU candidate to skip select_idle_cpu. In this case this is
> > > > useless loops for select_idle_core() because we are sure that the core
> > > > is not idle
> > > >
> > >
> > > If select_idle_core() finds an idle candidate other than the sibling,
> > > it'll use it if there is no idle core -- it picks a busy sibling based
> > > on a linear walk of the cpumask. Similarly, select_idle_cpu() is not
> >
> > My point is that it's a waste of time to loop the sibling cpus of
> > target in select_idle_core because it will not help to find an idle
> > core. The sibling  cpus will then be check either by select_idle_cpu
> > of select_idle_smt
> 

I understand and you're right, the full loop was in the context of a series
that unified select_idle_* where it made sense. The version I'm currently
testing aborts the SMT search if a !idle sibling is encountered. That
means that select_idle_core() will no longer scan the entire domain if
there are no idle cores.

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/commit/?h=sched-sissearch-v2r6=eb04a344cf7d7ca64c0c8fc0bcade261fa08c19e

With the patch on its own, it does mean that select_idle_sibling
starts over because SMT siblings might have been cleared. As an aside,
select_idle_core() has it's own problems even then.  It can start a scan
for an idle sibling when cpu_rq(target)->nr_running is very large --
over 100+ running tasks which is almost certainly a useless scan for
cores. However, I haven't done anything with that in this series as it
seemed like it would be follow-up work.

> also, while looping the cpumask, the sibling cpus of not idle cpu are
> removed and will not be check
> 

True and I spotted this. I think the load_balance_mask can be abused to
clear siblings during select_idle_core() while using select_idle_mask to
track CPUs that have not been scanned yet so select_idle_cpu only scans
CPUs that have not already been visited.

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/commit/?h=sched-sissearch-v2r6=a6e986dae38855e3be26dfde86bbef1617431dd1

As both the idle candidate and the load_balance_mask abuse are likely to
be controversial, I shuffled the series so that it's ordered from least
controversial to most controversial.

This
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/log/?h=sched-sissearch-v2r6
is what is currently being tested. It'll take most of the weekend and I'll
post them properly if they pass tests and do not throw up nasty surprises.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched

2020-12-04 Thread Mel Gorman
On Fri, Dec 04, 2020 at 11:56:36AM +0100, Vincent Guittot wrote:
> > The intent was that the sibling might still be an idle candidate. In
> > the current draft of the series, I do not even clear this so that the
> > SMT sibling is considered as an idle candidate. The reasoning is that if
> > there are no idle cores then an SMT sibling of the target is as good an
> > idle CPU to select as any.
> 
> Isn't the purpose of select_idle_smt ?
> 

Only in part.

> select_idle_core() looks for an idle core and opportunistically saves
> an idle CPU candidate to skip select_idle_cpu. In this case this is
> useless loops for select_idle_core() because we are sure that the core
> is not idle
> 

If select_idle_core() finds an idle candidate other than the sibling,
it'll use it if there is no idle core -- it picks a busy sibling based
on a linear walk of the cpumask. Similarly, select_idle_cpu() is not
guaranteed to scan the sibling first (ordering) or even reach the sibling
(throttling). select_idle_smt() is a last-ditch effort.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/2] sched: Split the function show_schedstat()

2020-12-04 Thread Mel Gorman
On Fri, Dec 04, 2020 at 09:22:34AM +0800, Yunfeng Ye wrote:
> 
> 
> On 2020/12/3 17:42, Mel Gorman wrote:
> > On Thu, Dec 03, 2020 at 02:47:14PM +0800, Yunfeng Ye wrote:
> >> The schedstat include runqueue-specific stats and domain-specific stats,
> >> so split it into two functions, show_rqstat() and show_domainstat().
> >>
> >> No functional changes.
> >>
> >> Signed-off-by: Yunfeng Ye 
> > 
> > Why?
> > 
> > I could understand if there was a follow-up patch that adjusted some
> > subset or there was a difference in checking for schedstat_enabled,
> > locking or inserting new schedstat information. This can happen in the
> > general case when the end result is easier to review here it seems to be
> > just moving code around.
> > 
> The rqstat and domainstat is independent state information. so I think
> split it into two individual function is clearer.
> 

The comments and the names of the structures being accessed are sufficient
to make it clear.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched

2020-12-03 Thread Mel Gorman
On Thu, Dec 03, 2020 at 05:38:03PM +0100, Vincent Guittot wrote:
> On Thu, 3 Dec 2020 at 15:11, Mel Gorman  wrote:
> >
> > The target CPU is definitely not idle in both select_idle_core and
> > select_idle_cpu. For select_idle_core(), the SMT is potentially
> > checked unnecessarily as the core is definitely not idle if the
> > target is busy. For select_idle_cpu(), the first CPU checked is
> > simply a waste.
> 
> >
> > Signed-off-by: Mel Gorman 
> > ---
> >  kernel/sched/fair.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 68dd9cd62fbd..1d8f5c4b4936 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6077,6 +6077,7 @@ static int select_idle_core(struct task_struct *p, 
> > struct sched_domain *sd, int
> > return -1;
> >
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +   __cpumask_clear_cpu(target, cpus);
> 
> should clear cpu_smt_mask(target) as we are sure that the core will not be 
> idle
> 

The intent was that the sibling might still be an idle candidate. In
the current draft of the series, I do not even clear this so that the
SMT sibling is considered as an idle candidate. The reasoning is that if
there are no idle cores then an SMT sibling of the target is as good an
idle CPU to select as any.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 04/10] sched/fair: Return an idle cpu if one is found after a failed search for an idle core

2020-12-03 Thread Mel Gorman
On Thu, Dec 03, 2020 at 05:35:29PM +0100, Vincent Guittot wrote:
> > index fc48cc99b03d..845bc0cd9158 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
> >   */
> >  static int select_idle_core(struct task_struct *p, struct sched_domain 
> > *sd, int target)
> >  {
> > +   int idle_candidate = -1;
> > struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > int core, cpu;
> >
> > @@ -6084,7 +6085,13 @@ static int select_idle_core(struct task_struct *p, 
> > struct sched_domain *sd, int
> > schedstat_inc(this_rq()->sis_scanned);
> > if (!available_idle_cpu(cpu)) {
> > idle = false;
> > -   break;
> > +   if (idle_candidate != -1)
> > +   break;
> 
> 
> If I get your changes correctly, it will now continue to loop on all
> cpus of the smt mask to try to find an idle cpu whereas it was

That was an oversight, the intent is that the SMT search breaks but
the search for an idle core continues. The patch was taken from a very
different series that unified all the select_idle_* functions as a single
function and I failed to fix it up properly. The unification series
didn't generate good results back 9 months ago when I tried and I never
finished it off. In the current context, it would not make sense to try
a unification again.

> With the change above you might end up looping all cpus of llc if
> there is only one idle cpu in the llc whereas before we were looping
> only 1 cpu per core at most. The bottom change makes sense but the
> above on is in some way replacing completely select_idle_cpu and
> bypass SIS_PROP and we should avoid that IMO
> 

You're right of course, it was never intended to behave like that.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a3d338770c4..49b1590e60a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6084,8 +6084,7 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd, int
for_each_cpu(cpu, cpu_smt_mask(core)) {
if (!available_idle_cpu(cpu)) {
idle = false;
-   if (idle_candidate != -1)
-       break;
+   break;
}
 
if (idle_candidate == -1 &&

-- 
Mel Gorman
SUSE Labs


[PATCH 10/10] sched/fair: Avoid revisiting CPUs multiple times during select_idle_sibling

2020-12-03 Thread Mel Gorman
Note: While this is done in the context of select_idle_core(), I would not
expect it to be done like this. The intent is to illustrate how
idle_cpu_mask could be filtered before select_idle_cpus() scans
the rest of a domain or a wider scan was done across a cluster.

select_idle_core() potentially searches a number of CPUs for idle candidates
before select_idle_cpu() clears the mask and revisits the same CPUs. This
patch moves the initialisation of select_idle_mask to the top-level and
reuses the same mask across both select_idle_core and select_idle_cpu.
select_idle_smt() is left alone as the cost of checking one SMT sibling
is marginal relative to calling __cpumask_clear_cpu() for every CPU
visited by select_idle_core().

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 29 -
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd95daf9f53e..af2e108c20c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6096,10 +6096,9 @@ void __update_idle_core(struct rq *rq)
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd,
-   int target, int nr)
+   int target, int nr, struct cpumask *cpus)
 {
int idle_candidate = -1;
-   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int core, cpu;
 
if (!static_branch_likely(_smt_present))
@@ -6108,9 +6107,6 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd,
if (!test_idle_cores(target, false))
return -1;
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-   __cpumask_clear_cpu(target, cpus);
-
for_each_cpu_wrap(core, cpus, target) {
bool idle = true;
 
@@ -6175,7 +6171,7 @@ static int select_idle_smt(struct task_struct *p, struct 
sched_domain *sd, int t
 #else /* CONFIG_SCHED_SMT */
 
 static inline int select_idle_core(struct task_struct *p, struct sched_domain 
*sd,
-   int target, int nr)
+   int target, int nr, struct cpumask 
*cpus)
 {
return -1;
 }
@@ -6193,14 +6189,10 @@ static inline int select_idle_smt(struct task_struct 
*p, struct sched_domain *sd
  * average idle time for this rq (as found in rq->avg_idle).
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
-   int target, int nr)
+   int target, int nr, struct cpumask *cpus)
 {
-   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int cpu;
 
-   cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-   __cpumask_clear_cpu(target, cpus);
-
for_each_cpu_wrap(cpu, cpus, target) {
schedstat_inc(this_rq()->sis_scanned);
if (!--nr)
@@ -6260,6 +6252,7 @@ static inline bool asym_fits_capacity(int task_util, int 
cpu)
 static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
struct sched_domain *sd, *this_sd;
+   struct cpumask *cpus_visited;
unsigned long task_util;
int i, recent_used_cpu, depth;
u64 time;
@@ -6358,13 +6351,23 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
 
depth = sis_search_depth(sd, this_sd);
 
+   /*
+* Init the select_idle_mask. select_idle_core() will mask
+* out the CPUs that have already been limited to limit the
+* search in select_idle_cpu(). Further clearing is not
+* done as select_idle_smt checks only one CPU.
+*/
+   cpus_visited = this_cpu_cpumask_var_ptr(select_idle_mask);
+   cpumask_and(cpus_visited, sched_domain_span(sd), p->cpus_ptr);
+   __cpumask_clear_cpu(target, cpus_visited);
+
schedstat_inc(this_rq()->sis_domain_search);
-   i = select_idle_core(p, sd, target, depth);
+   i = select_idle_core(p, sd, target, depth, cpus_visited);
if ((unsigned)i < nr_cpumask_bits)
return i;
 
time = cpu_clock(smp_processor_id());
-   i = select_idle_cpu(p, sd, target, depth);
+   i = select_idle_cpu(p, sd, target, depth, cpus_visited);
if ((unsigned)i < nr_cpumask_bits)
goto acct_cost;
 
-- 
2.26.2



[PATCH 09/10] sched/fair: Limit the search for an idle core

2020-12-03 Thread Mel Gorman
Note: This is a bad idea, it's for illustration only to show how the
search space can be filtered at each stage. Searching an
idle_cpu_mask would be a potential option. select_idle_core()
would be left alone as it has its own throttling mechanism

select_idle_core() may search a full domain for an idle core even if idle
CPUs exist, resulting in an excessive search. This patch partially limits
the search for an idle core similar to select_idle_cpu() once an idle
candidate is found.

Note that this patch can *increase* the number of runqueues considered.
Any searching done by select_idle_core() is duplicated by select_idle_cpu()
if an idle candidate is not found. If there is an idle CPU then aborting
select_idle_core() can have a negative impact. This is addressed in the
next patch.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33ce65b67381..cd95daf9f53e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6095,7 +6095,8 @@ void __update_idle_core(struct rq *rq)
  * there are no idle cores left in the system; tracked through
  * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
  */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
int target)
+static int select_idle_core(struct task_struct *p, struct sched_domain *sd,
+   int target, int nr)
 {
int idle_candidate = -1;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
@@ -6115,6 +6116,11 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
 
for_each_cpu(cpu, cpu_smt_mask(core)) {
schedstat_inc(this_rq()->sis_scanned);
+
+   /* Apply limits if there is an idle candidate */
+   if (idle_candidate != -1)
+   nr--;
+
if (!available_idle_cpu(cpu)) {
idle = false;
if (idle_candidate != -1)
@@ -6130,6 +6136,9 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd, int
if (idle)
return core;
 
+   if (!nr)
+   break;
+
cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
}
 
@@ -6165,7 +6174,8 @@ static int select_idle_smt(struct task_struct *p, struct 
sched_domain *sd, int t
 
 #else /* CONFIG_SCHED_SMT */
 
-static inline int select_idle_core(struct task_struct *p, struct sched_domain 
*sd, int target)
+static inline int select_idle_core(struct task_struct *p, struct sched_domain 
*sd,
+   int target, int nr)
 {
return -1;
 }
@@ -6349,7 +6359,7 @@ static int select_idle_sibling(struct task_struct *p, int 
prev, int target)
depth = sis_search_depth(sd, this_sd);
 
schedstat_inc(this_rq()->sis_domain_search);
-   i = select_idle_core(p, sd, target);
+   i = select_idle_core(p, sd, target, depth);
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-- 
2.26.2



[PATCH 02/10] sched/fair: Track efficiency of task recent_used_cpu

2020-12-03 Thread Mel Gorman
This simply tracks the efficiency of the recent_used_cpu. The hit rate
of this matters as it can avoid a domain search. Similarly, the miss
rate matters because each miss is a penalty to the fast path.

It is not required that this patch be merged with the series but if we
are looking at the usefulness of p->recent_used_cpu, the stats generate
hard data on what the hit rate is.

MMTests uses this to generate additional metrics.

SIS Recent Used Hit: A recent CPU was eligible and used. Each hit is
a domain search avoided.

SIS Recent Used Miss: A recent CPU was eligible but unavailable. Each
time this is a miss, there was a small penalty to the fast path
before a domain search happened.

SIS Recent Success Rate: A percentage of the number of hits versus
the total attempts to use the recent CPU.

SIS Recent Attempts: The total number of times the recent CPU was examined.
A high number of Recent Attempts with a low Success Rate implies
the fast path is being punished severely. This could have been
presented as a weighting of hits and misses but calculating an
appropriate weight for misses is problematic.
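
As a worked example using the figures from the altrecent comparison
later in the series: 118064916 hits out of 473007707 attempts gives a
Recent Success Rate of 118064916 / 473007707 * 100 ~= 24.96%.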

Signed-off-by: Mel Gorman 
---
 kernel/sched/debug.c |  2 ++
 kernel/sched/fair.c  | 23 +--
 kernel/sched/sched.h |  2 ++
 kernel/sched/stats.c |  7 ---
 4 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2386cc5e79e5..8f933a9e8c25 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -718,6 +718,8 @@ do {
\
P(sis_domain_search);
P(sis_scanned);
P(sis_failed);
+   P(sis_recent_hit);
+   P(sis_recent_miss);
}
 #undef P
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 494ba01f3414..d9acd55d309b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6291,16 +6291,19 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
recent_used_cpu = p->recent_used_cpu;
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
-   cpus_share_cache(recent_used_cpu, target) &&
-   (available_idle_cpu(recent_used_cpu) || 
sched_idle_cpu(recent_used_cpu)) &&
-   cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
-   asym_fits_capacity(task_util, recent_used_cpu)) {
-   /*
-* Replace recent_used_cpu with prev as it is a potential
-* candidate for the next wake:
-*/
-   p->recent_used_cpu = prev;
-   return recent_used_cpu;
+   cpus_share_cache(recent_used_cpu, target)) {
+   if ((available_idle_cpu(recent_used_cpu) || 
sched_idle_cpu(recent_used_cpu)) &&
+   cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
+   asym_fits_capacity(task_util, recent_used_cpu)) {
+   /*
+* Replace recent_used_cpu with prev as it is a 
potential
+* candidate for the next wake:
+*/
+   p->recent_used_cpu = prev;
+   schedstat_inc(this_rq()->sis_recent_hit);
+   return recent_used_cpu;
+   }
+   schedstat_inc(this_rq()->sis_recent_miss);
}
 
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 90a62dd9293d..6a6578c4c24b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1055,6 +1055,8 @@ struct rq {
unsigned intsis_domain_search;
unsigned intsis_scanned;
unsigned intsis_failed;
+   unsigned intsis_recent_hit;
+   unsigned intsis_recent_miss;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 390bfcc3842c..402fab75aa14 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -10,7 +10,7 @@
  * Bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 16
+#define SCHEDSTAT_VERSION 17
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -30,14 +30,15 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
/* runqueue-specific stats */
seq_printf(seq,
-   "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u",
+   "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u %u %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq

[PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched

2020-12-03 Thread Mel Gorman
The target CPU is definitely not idle in both select_idle_core and
select_idle_cpu. For select_idle_core(), the SMT is potentially
checked unnecessarily as the core is definitely not idle if the
target is busy. For select_idle_cpu(), the first CPU checked is
simply a waste.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68dd9cd62fbd..1d8f5c4b4936 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6077,6 +6077,7 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd, int
return -1;
 
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   __cpumask_clear_cpu(target, cpus);
 
for_each_cpu_wrap(core, cpus, target) {
bool idle = true;
@@ -6181,6 +6182,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
time = cpu_clock(this);
 
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+   __cpumask_clear_cpu(target, cpus);
 
for_each_cpu_wrap(cpu, cpus, target) {
schedstat_inc(this_rq()->sis_scanned);
-- 
2.26.2



[PATCH 05/10] sched/fair: Do not replace recent_used_cpu with the new target

2020-12-03 Thread Mel Gorman
After select_idle_sibling, p->recent_used_cpu is set to the
new target. However on the next wakeup, prev will be the same as
recent_used_cpu unless the load balancer has moved the task since the last
wakeup. It still works, but it is less efficient than it could be given all
the changes that have gone in since (reduced unnecessary migrations, load
balancer changes and so on). This patch preserves recent_used_cpu for longer.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                           5.10.0-rc6             5.10.0-rc6
                  idlecandidate-v1r10        altrecent-v1r10
Hmean 1        505.67 (   0.00%)      501.34 *  -0.86%*
Hmean 2        974.06 (   0.00%)      981.39 *   0.75%*
Hmean 4       1904.43 (   0.00%)     1926.13 *   1.14%*
Hmean 8       3721.02 (   0.00%)     3799.86 *   2.12%*
Hmean 16      6769.17 (   0.00%)     6938.40 *   2.50%*
Hmean 32     10312.58 (   0.00%)    10632.11 *   3.10%*
Hmean 64     13792.01 (   0.00%)    13670.17 *  -0.88%*
Hmean 128    20963.44 (   0.00%)    21456.33 *   2.35%*
Hmean 256    20335.62 (   0.00%)    21070.24 *   3.61%*
Hmean 320    20147.25 (   0.00%)    20624.92 *   2.37%*

The benefit is marginal, the main impact is on how it affects
p->recent_used_cpu and whether a domain search happens. From the schedstats
patches and schedstat enabled

Ops SIS Search   5653107942.00  5726545742.00
Ops SIS Domain Search       3365067916.00   3319768543.00
Ops SIS Scanned           112173512543.00  99194352541.00
Ops SIS Domain Scanned 109885472517.00 96787575342.00
Ops SIS Failures 2923185114.00  2950166441.00
Ops SIS Recent Used Hit   56547.00   118064916.00
Ops SIS Recent Used Miss 1590899250.00   354942791.00
Ops SIS Recent Attempts  1590955797.00   473007707.00
Ops SIS Search Efficiency 5.04   5.77
Ops SIS Domain Search Eff 3.06   3.43
Ops SIS Fast Success Rate40.47  42.03
Ops SIS Success Rate 48.29  48.48
Ops SIS Recent Success Rate   0.00  24.96

(First interesting point is the ridiculous number of times runqueues are
scanned -- almost 97 billion times over the course of 40 minutes)

Note "Recent Used Hit" is over 2000 times more likely to succeed. The
failure rate also increases by quite a lot but the cost is marginal
even if the "Fast Success Rate" only increases by 2% overall. What
cannot be observed from these stats is where the biggest impact is, as
these stats cover low utilisation through to over saturation.
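
(For reference, the factor is straight arithmetic on the table above:
118064916 hits with the patch versus 56547 without is roughly a
2088-fold increase in raw hits.)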

If graphed over time, the graphs show that the sched domain is only
scanned at negligible rates until the machine is fully busy. With
low utilisation, the "Fast Success Rate" is almost 100% until the
machine is fully busy. For 320 clients, the success rate is close to
0% which is unsurprising.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 845bc0cd9158..68dd9cd62fbd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6293,6 +6293,7 @@ static int select_idle_sibling(struct task_struct *p, int 
prev, int target)
 
/* Check a recently used CPU as a potential idle candidate: */
recent_used_cpu = p->recent_used_cpu;
+   p->recent_used_cpu = prev;
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target)) {
@@ -6789,9 +6790,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int wake_flags)
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
-   if (want_affine)
-   current->recent_used_cpu = cpu;
}
rcu_read_unlock();
 
-- 
2.26.2



[PATCH 08/10] sched/fair: Reintroduce SIS_AVG_CPU but in the context of SIS_PROP to reduce search depth

2020-12-03 Thread Mel Gorman
Subject says it all but no supporting data at this time. This might help
the hackbench case in isolation or throw other workloads under the bus.
Final version will have proper data.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 185fc6e28f8e..33ce65b67381 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6024,6 +6024,14 @@ static int sis_search_depth(struct sched_domain *sd, 
struct sched_domain *this_s
nr = div_u64(span_avg, avg_cost);
else
nr = 4;
+
+   /*
+* Throttle the search depth further if average idle time is
+* below the average cost. This is primarily to deal with
+* the saturated case where searches are likely to fail.
+*/
+   if (avg_idle < avg_cost)
+   nr >>= 1;
}
 
return nr;
-- 
2.26.2



[PATCH 07/10] sched/fair: Account for the idle cpu/smt search cost

2020-12-03 Thread Mel Gorman
select_idle_cpu() accounts average search cost for the purposes of
conducting a limited proportional search if SIS_PROP is enabled. The issue
is that select_idle_cpu() does not account for the cost if a candidate
is found and select_idle_smt() is ignored.

This patch moves the accounting of avg_cost to cover the cpu/smt search
costs. select_idle_core() costs could be accounted for but it has its
own throttling mechanism by tracking depending on whether idle cores are
expected to exist.

This patch is a bisection hazard because SIS_PROP and how it balances
avg_cost vs avg_idle was probably guided by the fact that avg_cost was
not always accounted for.
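
As an illustration of the resulting depth calculation, with made-up
numbers: for a 40-CPU LLC (sd->span_weight = 40), rq->avg_idle of
256000ns and avg_scan_cost of 999ns, avg_idle becomes 256000 / 512 =
500, avg_cost becomes 1000, span_avg = 40 * 500 = 20000 which is
greater than 4 * avg_cost = 4000, so nr = 20000 / 1000 = 20 runqueues
would be scanned before giving up.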

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 82 +
 1 file changed, 46 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d8f5c4b4936..185fc6e28f8e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6006,6 +6006,29 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
return new_cpu;
 }
 
+static int sis_search_depth(struct sched_domain *sd, struct sched_domain 
*this_sd)
+{
+   u64 avg_cost, avg_idle, span_avg;
+   int nr = INT_MAX;
+
+   if (sched_feat(SIS_PROP)) {
+   /*
+* Due to large variance we need a large fuzz factor; hackbench 
in
+* particularly is sensitive here.
+*/
+   avg_idle = this_rq()->avg_idle / 512;
+   avg_cost = this_sd->avg_scan_cost + 1;
+
+   span_avg = sd->span_weight * avg_idle;
+   if (span_avg > 4*avg_cost)
+   nr = div_u64(span_avg, avg_cost);
+   else
+   nr = 4;
+   }
+
+   return nr;
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6151,35 +6174,11 @@ static inline int select_idle_smt(struct task_struct 
*p, struct sched_domain *sd
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
  */
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int 
target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
+   int target, int nr)
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-   struct sched_domain *this_sd;
-   u64 avg_cost, avg_idle;
-   u64 time;
-   int this = smp_processor_id();
-   int cpu, nr = INT_MAX;
-
-   this_sd = rcu_dereference(*this_cpu_ptr(_llc));
-   if (!this_sd)
-   return -1;
-
-   /*
-* Due to large variance we need a large fuzz factor; hackbench in
-* particularly is sensitive here.
-*/
-   avg_idle = this_rq()->avg_idle / 512;
-   avg_cost = this_sd->avg_scan_cost + 1;
-
-   if (sched_feat(SIS_PROP)) {
-   u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
-   nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
-   }
-
-   time = cpu_clock(this);
+   int cpu;
 
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
__cpumask_clear_cpu(target, cpus);
@@ -6192,9 +6191,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
break;
}
 
-   time = cpu_clock(this) - time;
-   update_avg(_sd->avg_scan_cost, time);
-
return cpu;
 }
 
@@ -6245,9 +6241,10 @@ static inline bool asym_fits_capacity(int task_util, int 
cpu)
  */
 static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
-   struct sched_domain *sd;
+   struct sched_domain *sd, *this_sd;
unsigned long task_util;
-   int i, recent_used_cpu;
+   int i, recent_used_cpu, depth;
+   u64 time;
 
schedstat_inc(this_rq()->sis_search);
 
@@ -6337,21 +6334,34 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if (!sd)
return target;
 
+   this_sd = rcu_dereference(*this_cpu_ptr(_llc));
+   if (!this_sd)
+   return target;
+
+   depth = sis_search_depth(sd, this_sd);
+
schedstat_inc(this_rq()->sis_domain_search);
i = select_idle_core(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-   i = select_idle_cpu(p, sd, target);
+   time = cpu_clock(smp_processor_id());
+   i = select_idle_cpu(p, sd, target, depth);
if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   goto acct_cost;
 
i = select_idle_smt(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   

[PATCH 04/10] sched/fair: Return an idle cpu if one is found after a failed search for an idle core

2020-12-03 Thread Mel Gorman
select_idle_core is called when SMT is active and there is likely a free
core available. It may find idle CPUs but this information is simply
discarded and the scan starts over again with select_idle_cpu.

This patch caches information on idle CPUs found during the search for
a core and uses one if no core is found. This is a tradeoff. There may
be a slight impact when utilisation is low and an idle core can be
found quickly. It provides improvements as the number of busy CPUs
approaches 50% of the domain size when SMT is enabled.

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                           5.10.0-rc6             5.10.0-rc6
                            schedstat          idlecandidate
Hmean 1        500.06 (   0.00%)      505.67 *   1.12%*
Hmean 2        975.90 (   0.00%)      974.06 *  -0.19%*
Hmean 4       1902.95 (   0.00%)     1904.43 *   0.08%*
Hmean 8       3761.73 (   0.00%)     3721.02 *  -1.08%*
Hmean 16      6713.93 (   0.00%)     6769.17 *   0.82%*
Hmean 32     10435.31 (   0.00%)    10312.58 *  -1.18%*
Hmean 64     12325.51 (   0.00%)    13792.01 *  11.90%*
Hmean 128    21225.21 (   0.00%)    20963.44 *  -1.23%*
Hmean 256    20532.83 (   0.00%)    20335.62 *  -0.96%*
Hmean 320    20334.81 (   0.00%)    20147.25 *  -0.92%*

In this particular test, the cost/benefit is marginal except
for 64 which was a point where the machine was over 50% busy
but not fully utilised.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc48cc99b03d..845bc0cd9158 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
int target)
 {
+   int idle_candidate = -1;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int core, cpu;
 
@@ -6084,7 +6085,13 @@ static int select_idle_core(struct task_struct *p, 
struct sched_domain *sd, int
schedstat_inc(this_rq()->sis_scanned);
if (!available_idle_cpu(cpu)) {
idle = false;
-   break;
+   if (idle_candidate != -1)
+   break;
+   }
+
+   if (idle_candidate == -1 &&
+   cpumask_test_cpu(cpu, p->cpus_ptr)) {
+   idle_candidate = cpu;
}
}
 
@@ -6099,7 +6106,7 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd, int
 */
set_idle_cores(target, 0);
 
-   return -1;
+   return idle_candidate;
 }
 
 /*
-- 
2.26.2



[RFC PATCH 00/10] Reduce time complexity of select_idle_sibling

2020-12-03 Thread Mel Gorman
This is an early prototype that has not been tested heavily. While parts
of it may stand on their own, the motivation to release early is Aubrey
Li's series on using an idle cpumask to optimise the search and Barry
Song's series on representing clusters on die. The series is based on
tip/sched/core rebased to 5.10-rc6.

Patches 1-2 add schedstats to track the search efficiency of
select_idle_sibling. They can be dropped from the final version but
are useful when looking at select_idle_sibling in general. MMTests
can already parse the stats and generate useful data including
graphs over time.

Patch 3 kills SIS_AVG_CPU but is partially reintroduced later in the
context of SIS_PROP.

Patch 4 notes that select_idle_core() can find an idle CPU that is
not part of a free core, yet it is ignored and a second search is conducted
in select_idle_cpu(), which is wasteful. Note that this patch
will definitely change in the final version.

Patch 5 adjusts p->recent_used_cpu so that it has a higher success rate
and avoids searching the domain in some cases.

Patch 6 notes that select_idle_* always starts with a CPU that is
definitely not idle and fixes that.

Patch 7 notes that SIS_PROP is only partially accounting for search
costs. While this might be accidentally beneficial, it makes it
much harder to reason about the effectiveness of SIS_PROP.

Patch 8 uses similar logic to SIS_AVG_CPU but in the context of
SIS_PROP to throttle the search depth.

Patches 9 and 10 are stupid in the context of this series. They
are included even though it makes no sense to use SIS_PROP logic in
select_idle_core() as it already has throttling logic. The point
is to illustrate that the select_idle_mask can be initialised
at the start of a domain search and used to mask out CPUs that have
already been visited.

In the context of Aubrey's and Barry's work, select_idle_mask would
be initialised *after* select_idle_core as select_idle_core uses
select_idle_mask for its own purposes. In Aubrey's case, the next
step would be to scan idle_cpus_span as those CPUs may still be idle
and bias the search towards likely idle candidates. If that fails,
select_idle_mask clears all the bits set in idle_cpus_span and then
scans the remainder. Similar observations apply to Barry's work, scan the
local domain first, mask out those bits then scan the remaining CPUs in
the cluster.
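
To make the ordering concrete, below is a rough userspace model of that
two-phase scan (made-up helpers and 64-bit CPU masks, not the actual
kernel code):

#include <stdint.h>
#include <stdio.h>

/* Stand-in for available_idle_cpu(); the kernel reads runqueue state. */
static int cpu_actually_idle(int cpu, uint64_t truly_idle)
{
        return !!(truly_idle & (1ULL << cpu));
}

/*
 * Phase 1 scans the advisory idle mask, phase 2 falls back to the
 * unvisited remainder of the LLC span if the hint turned out stale.
 */
static int scan_for_idle(uint64_t llc_span, uint64_t idle_hint,
                         uint64_t truly_idle)
{
        uint64_t visited = 0;
        int cpu;

        for (cpu = 0; cpu < 64; cpu++) {
                if (!(idle_hint & llc_span & (1ULL << cpu)))
                        continue;
                visited |= 1ULL << cpu;
                if (cpu_actually_idle(cpu, truly_idle))
                        return cpu;
        }

        for (cpu = 0; cpu < 64; cpu++) {
                if (!(llc_span & (1ULL << cpu)) || (visited & (1ULL << cpu)))
                        continue;
                if (cpu_actually_idle(cpu, truly_idle))
                        return cpu;
        }

        return -1;
}

int main(void)
{
        /* 8-CPU LLC; the hint claims CPUs 2 and 5 are idle but only 6 is. */
        uint64_t span = 0xff, hint = (1ULL << 2) | (1ULL << 5);
        uint64_t idle = 1ULL << 6;

        printf("picked CPU %d\n", scan_for_idle(span, hint, idle));
        return 0;
}

The point of the model is that the second phase never revisits anything
the first phase already rejected, which is what initialising the mask
once and clearing bits as the search progresses buys us.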

The final version of this series will drop patches 1-2 unless there is
demand and definitely drop patches 9-10. However, all 4 patches may be
useful in the context of Aubrey's and Barry's work. Patches 1-2 would
give more precise results on exactly how much they are improving "SIS
Domain Search Efficiency" which may be more illustrative than just the
headline performance figures of a given workload. The final version of
this series will also adjust patch 4. If select_idle_core() runs at all
then it definitely should return a CPU -- either an idle CPU or the target
as it has already searched the entire domain and no further searching
should be conducted. Barry might change that back so that a cluster can
be scanned but it would be done in the context of the cluster series.

-- 
2.26.2



[PATCH 03/10] sched/fair: Remove SIS_AVG_CPU

2020-12-03 Thread Mel Gorman
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It
was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 3 ---
 kernel/sched/features.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9acd55d309b..fc48cc99b03d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6163,9 +6163,6 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
avg_idle = this_rq()->avg_idle / 512;
avg_cost = this_sd->avg_scan_cost + 1;
 
-   if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-   return -1;
-
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
if (span_avg > 4*avg_cost)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2



[PATCH 01/10] sched/fair: Track efficiency of select_idle_sibling

2020-12-03 Thread Mel Gorman
select_idle_sibling is an important path that finds a nearby idle CPU on
wakeup. As it is examining other CPUs' state, it can be expensive in terms
of cache usage. This patch tracks the search efficiency if schedstats
are enabled. In general, this is only useful for kernel developers but
schedstats are typically disabled by default so it is convenient for
development and mostly free otherwise.

It is not required that this patch be merged with the series but if we
are looking at time or search complexity, the stats generate hard data
on what the search costs actually are.

SIS Search: Number of calls to select_idle_sibling

SIS Domain Search: Number of times the domain was searched because the
fast path failed.

SIS Scanned: Generally the number of runqueues scanned but the fast
path counts as 1 regardless of the values for target, prev
and recent.

SIS Domain Scanned: Number of runqueues scanned during a search of the
LLC domain.

SIS Failures: Number of SIS calls that failed to find an idle CPU

SIS Search Efficiency: A ratio expressed as a percentage of runqueues
scanned versus idle CPUs found. A 100% efficiency indicates that
the target, prev or recent CPU of a task was idle at wakeup. The
lower the efficiency, the more runqueues were scanned before an
idle CPU was found.

SIS Domain Search Efficiency: Similar, except only for the slower SIS
path.

SIS Fast Success Rate: Percentage of SIS that used target, prev or
recent CPUs.

SIS Success rate: Percentage of scans that found an idle CPU.
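
As a worked example using the schedstat figures from the altrecent
comparison later in the series: SIS Search Efficiency works out as
SIS Search / SIS Scanned (5653107942 / 112173512543 ~= 5.04%), SIS
Fast Success Rate as the share of calls that avoided the domain search
((5653107942 - 3365067916) / 5653107942 ~= 40.47%) and SIS Success
Rate as (SIS Search - SIS Failures) / SIS Search (~= 48.29%).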

Signed-off-by: Mel Gorman 
---
 kernel/sched/debug.c |  4 
 kernel/sched/fair.c  | 14 ++
 kernel/sched/sched.h |  6 ++
 kernel/sched/stats.c |  8 +---
 4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2357921580f9..2386cc5e79e5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -714,6 +714,10 @@ do {   
\
P(sched_goidle);
P(ttwu_count);
P(ttwu_local);
+   P(sis_search);
+   P(sis_domain_search);
+   P(sis_scanned);
+   P(sis_failed);
}
 #undef P
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98075f9ea9a8..494ba01f3414 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6081,6 +6081,7 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd, int
bool idle = true;
 
for_each_cpu(cpu, cpu_smt_mask(core)) {
+   schedstat_inc(this_rq()->sis_scanned);
if (!available_idle_cpu(cpu)) {
idle = false;
break;
@@ -6112,6 +6113,7 @@ static int select_idle_smt(struct task_struct *p, struct 
sched_domain *sd, int t
return -1;
 
for_each_cpu(cpu, cpu_smt_mask(target)) {
+   schedstat_inc(this_rq()->sis_scanned);
if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
!cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
@@ -6177,6 +6179,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
for_each_cpu_wrap(cpu, cpus, target) {
+   schedstat_inc(this_rq()->sis_scanned);
if (!--nr)
return -1;
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
@@ -6240,6 +6243,15 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
unsigned long task_util;
int i, recent_used_cpu;
 
+   schedstat_inc(this_rq()->sis_search);
+
+   /*
+* Checking if prev, target and recent is treated as one scan. A
+* perfect hit on one of those is considered 100% efficiency.
+* Further scanning impairs efficiency.
+*/
+   schedstat_inc(this_rq()->sis_scanned);
+
/*
 * On asymmetric system, update task utilization because we will check
 * that the task fits with cpu's capacity.
@@ -6315,6 +6327,7 @@ static int select_idle_sibling(struct task_struct *p, int 
prev, int target)
if (!sd)
return target;
 
+   schedstat_inc(this_rq()->sis_domain_search);
i = select_idle_core(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
@@ -6327,6 +6340,7 @@ static int select_idle_sibling(struct task_struct *p, int 
prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;
 
+   schedstat_inc(this_rq()->sis_failed);
return target;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f5acb6c5ce49..90a62dd9293d 100644
--- a/kernel

Re: [PATCH -V6 RESEND 1/3] numa balancing: Migrate on fault among multiple bound nodes

2020-12-03 Thread Mel Gorman
On Thu, Dec 03, 2020 at 11:25:50AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 02, 2020 at 11:40:54AM +0000, Mel Gorman wrote:
> > On Wed, Dec 02, 2020 at 04:42:32PM +0800, Huang Ying wrote:
> > > Now, NUMA balancing can only optimize the page placement among the
> > > NUMA nodes if the default memory policy is used.  Because the memory
> > > policy specified explicitly should take precedence.  But this seems
> > > too strict in some situations.  For example, on a system with 4 NUMA
> > > nodes, if the memory of an application is bound to the node 0 and 1,
> > > NUMA balancing can potentially migrate the pages between the node 0
> > > and 1 to reduce cross-node accessing without breaking the explicit
> > > memory binding policy.
> > > 
> > 
> > Ok, I think this part is ok and while the test case is somewhat
> > superficial, it at least demonstrated that the NUMA balancing overhead
> > did not offset any potential benefit
> > 
> > Acked-by: Mel Gorman 
> 
> Who do we expect to merge this, me through tip/sched/core or akpm ?

I would expect akpm, it's much more on the mm side because it affects
the semantics of memory policies. It should also have more mm-orientated
review than just mine because it affects user-visible semantics and the
ability to detect whether the feature is available or not needs to be
treated with care.

-- 
Mel Gorman
SUSE Labs


Re: [RFC PATCH v2 2/2] scheduler: add scheduler level for clusters

2020-12-03 Thread Mel Gorman
On Thu, Dec 03, 2020 at 10:28:31AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 01, 2020 at 04:04:04PM +, Valentin Schneider wrote:
> > 
> > Gating this behind this new config only leveraged by arm64 doesn't make it
> > very generic. Note that powerpc also has this newish "CACHE" level which
> > seems to overlap in function with your "CLUSTER" one (both are arch
> > specific, though).
> > 
> > I think what you are after here is an SD_SHARE_PKG_RESOURCES domain walk,
> > i.e. scan CPUs by increasing cache "distance". We already have it in some
> > form, as we scan SMT & LLC domains; AFAICT LLC always maps to MC, except
> > for said powerpc's CACHE thingie.
> 
> There's some intel chips with a smaller L2, but I don't think we ever
> bothered.
> 
> There's also the extended topology stuff from Intel: SMT, Core, Module,
> Tile, Die, of which we've only partially used Die I think.
> 
> Whatever we do, it might make sense to not all use different names.
> 
> Also, I think Mel said he was cooking something for
> select_idle_balance().
> 
> Also, I've previously posted patches that fold all the iterations into
> one, so it might make sense to revisit some of that if Mel also already
> didn.t

I didn't. The NUMA/lb reconciliation took most of my attention and right
now I'm looking at select_idle_sibling again in preparation for tracking
the idle cpumask in a sensible fashion. The main idea I had in mind was
special casing EPYC as it has multiple small L3 caches that may benefit
from select_idle_sibling looking slightly beyond the LLC as a search
domain but it has not even started yet.

-- 
Mel Gorman
SUSE Labs

