Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Wed, 10 Mar 2021 at 06:53, Srikar Dronamraju wrote:
>
> * Vincent Guittot [2021-03-08 14:52:39]:
>
> > On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju wrote:
>
> Thanks Vincent for your review comments.
>
> > > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > > +{
> > > +        struct sched_domain_shared *tsds, *psds;
> > > +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > > +
> > > +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > > +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > > +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> > > +
> > > +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > > +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > > +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > > +
> > > +        /* No need to compare, if both LLCs are fully loaded */
> > > +        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> > > +                return nr_cpumask_bits;
> > > +
> > > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > +                return this_cpu;
> >
> > Why have you chosen to favor this_cpu instead of prev_cpu unlike for
> > wake_idle ?
>
> At this point, we know the waker running on this_cpu and the wakee, which
> was running on prev_cpu, are affine to each other, and this_cpu and
> prev_cpu don't share cache. I chose to move them close to each other to
> benefit from the cache sharing. Based on feedback from Peter and Rik, I
> made the check more conservative, i.e. tnr_busy <= tllc_size/smt_weight
> (where smt_weight is the cpumask weight of the smt domain for this_cpu),
> i.e. if we have a free core in this llc domain, choose this_cpu.
> select_idle_sibling() should pick an idle cpu/core/smt within the llc
> domain for this_cpu.

yeah make sense

> Do you feel this may not be the correct option?

I was worried that we end up pulling tasks in the same llc, but the
condition above and wake_wide should prevent such behavior.

> We are also experimenting with another option, where we call
> prefer_idler_cpu after wa_weight. I.e.
> 1. if wake_affine_weight chooses this_cpu, but the llc of prev_cpu has an
>    idle smt/CPU and there is no idle smt/CPU in this_cpu's llc, then
>    choose the idle smt/CPU in prev_cpu's llc
> 2. if wake_affine_weight chooses nr_cpumask (aka prev_cpu), but the llc
>    of this_cpu has an idle smt/CPU and there is no idle smt/CPU in
>    prev_cpu's llc, then choose the idle smt/CPU in this_cpu's llc
>
> > > +
> > > +        /* For better wakeup latency, prefer idler LLC to cache affinity */
> > > +        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> > > +        if (!diff)
> > > +                return nr_cpumask_bits;
> > > +        if (diff < 0)
> > > +                return this_cpu;
> > > +
> > > +        return prev_cpu;
> > > +}
> > > +
> > >  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > >                         int this_cpu, int prev_cpu, int sync)
> > >  {
> > > @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > >          if (sched_feat(WA_IDLE))
> > >                  target = wake_affine_idle(this_cpu, prev_cpu, sync);
> > >
> > > +        if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> > > +                        !cpus_share_cache(this_cpu, prev_cpu))
> > > +                target = prefer_idler_llc(this_cpu, prev_cpu, sync);
> >
> > could you use the same naming convention as other functions ?
> > wake_affine_llc as an example
>
> I guess you meant s/prefer_idler_llc/wake_affine_llc/

yes

> Sure. I can modify.
>
> > > +
> > >          if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> > >                  target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> > >
> > > @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > >          if (target == nr_cpumask_bits)
> > >                  return prev_cpu;
> > >
> > > -        schedstat_inc(sd->ttwu_move_affine);
> > > -        schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > > +        if (target == this_cpu) {
> >
> > How is this condition related to $subject ?
>
> Before this change, wake_affine_weight and wake_affine_idle would either
> return this_cpu or nr_cpumask_bits. Just before this check, we check if
> target is nr_cpumask_bits and return prev_cpu. So the stats were only
> incremented when target was this_cpu.
>
> However, with prefer_idler_llc, we may return this_cpu, prev_cpu or
> nr_cpumask_bits. Now we only want to update stats when we have chosen to
> migrate the task to this_cpu. Hence I had this check.

ok got it. Maybe return earlier in this case, like for the
if (target == nr_cpumask_bits) check above.

> If we use the slightly lazier approach, which is to check wa_weight first
> before wa_idler_llc, then we may not need this change at all.
>
> --
> Thanks and Regards
> Srikar Dronamraju
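For reference, Vincent's early-return suggestion could reshape wake_affine()
roughly as follows — a sketch on top of the posted patch, not code from the
thread:

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                       int this_cpu, int prev_cpu, int sync)
{
        int target = nr_cpumask_bits;

        if (sched_feat(WA_IDLE))
                target = wake_affine_idle(this_cpu, prev_cpu, sync);

        if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
            !cpus_share_cache(this_cpu, prev_cpu))
                target = prefer_idler_llc(this_cpu, prev_cpu, sync);

        if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
                target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

        /* Both "no preference" and "stay on prev_cpu" return early here. */
        if (target != this_cpu)
                return prev_cpu;

        /* Stats are accounted only when the task migrates to the waker. */
        schedstat_inc(sd->ttwu_move_affine);
        schedstat_inc(p->se.statistics.nr_wakeups_affine);
        return this_cpu;
}

This keeps the schedstat accounting on a single this_cpu exit path, which is
what the `if (target == this_cpu)` hunk in the posted patch was working around.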
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Vincent Guittot [2021-03-08 14:52:39]:

> On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju wrote:

Thanks Vincent for your review comments.

> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +        struct sched_domain_shared *tsds, *psds;
> > +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > +        /* No need to compare, if both LLCs are fully loaded */
> > +        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> > +                return nr_cpumask_bits;
> > +
> > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +                return this_cpu;
>
> Why have you chosen to favor this_cpu instead of prev_cpu unlike for
> wake_idle ?

At this point, we know the waker running on this_cpu and the wakee, which
was running on prev_cpu, are affine to each other, and this_cpu and
prev_cpu don't share cache. I chose to move them close to each other to
benefit from the cache sharing. Based on feedback from Peter and Rik, I
made the check more conservative, i.e. tnr_busy <= tllc_size/smt_weight
(where smt_weight is the cpumask weight of the smt domain for this_cpu),
i.e. if we have a free core in this llc domain, choose this_cpu.
select_idle_sibling() should pick an idle cpu/core/smt within the llc
domain for this_cpu.

Do you feel this may not be the correct option?

We are also experimenting with another option, where we call
prefer_idler_cpu after wa_weight. I.e.
1. if wake_affine_weight chooses this_cpu, but the llc of prev_cpu has an
   idle smt/CPU and there is no idle smt/CPU in this_cpu's llc, then
   choose the idle smt/CPU in prev_cpu's llc
2. if wake_affine_weight chooses nr_cpumask (aka prev_cpu), but the llc
   of this_cpu has an idle smt/CPU and there is no idle smt/CPU in
   prev_cpu's llc, then choose the idle smt/CPU in this_cpu's llc

> > +
> > +        /* For better wakeup latency, prefer idler LLC to cache affinity */
> > +        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> > +        if (!diff)
> > +                return nr_cpumask_bits;
> > +        if (diff < 0)
> > +                return this_cpu;
> > +
> > +        return prev_cpu;
> > +}
> > +
> >  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >                         int this_cpu, int prev_cpu, int sync)
> >  {
> > @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >          if (sched_feat(WA_IDLE))
> >                  target = wake_affine_idle(this_cpu, prev_cpu, sync);
> >
> > +        if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> > +                        !cpus_share_cache(this_cpu, prev_cpu))
> > +                target = prefer_idler_llc(this_cpu, prev_cpu, sync);
>
> could you use the same naming convention as other functions ?
> wake_affine_llc as an example

I guess you meant s/prefer_idler_llc/wake_affine_llc/
Sure. I can modify.

> > +
> >          if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> >                  target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> >
> > @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >          if (target == nr_cpumask_bits)
> >                  return prev_cpu;
> >
> > -        schedstat_inc(sd->ttwu_move_affine);
> > -        schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > +        if (target == this_cpu) {
>
> How is this condition related to $subject ?

Before this change, wake_affine_weight and wake_affine_idle would either
return this_cpu or nr_cpumask_bits. Just before this check, we check if
target is nr_cpumask_bits and return prev_cpu. So the stats were only
incremented when target was this_cpu.

However, with prefer_idler_llc, we may return this_cpu, prev_cpu or
nr_cpumask_bits. Now we only want to update stats when we have chosen to
migrate the task to this_cpu. Hence I had this check.

If we use the slightly lazier approach, which is to check wa_weight first
before wa_idler_llc, then we may not need this change at all.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju wrote:
>
> On POWER8 and POWER9, the last level cache (L2) has been at the level of
> a group of 8 threads (SMT8 on POWER8, a big-core comprising a pair of
> SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
> level of a group of SMT4 threads within the SMT8 core. Due to the
> shrinking size of the LLC domain, the probability of finding an idle CPU
> in the LLC domain of the target is lower on POWER10 compared to the
> previous generation processors.
>
> With commit 9538abee18cc ("powerpc/smp: Add support detecting
> thread-groups sharing L2 cache") benchmarks such as Daytrader
> (https://github.com/WASdev/sample.daytrader7) show a drop in throughput
> in a configuration consisting of 1 JVM spanning across 6-8 big-cores on
> POWER10. Analysis showed that this was because more wakeups were
> happening on busy CPUs when the utilization was 60-70%. This drop in
> throughput also shows up as a drop in CPU utilization. However, most
> other benchmarks benefit from detecting the thread-groups that share L2
> cache.
>
> Current order of preference to pick an LLC while waking a wake-affine
> task:
> 1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is idle.
>
> 2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is less heavily loaded.
>
> In the current situation, where the waker and previous CPUs are busy but
> only one of their LLCs has an idle CPU, the scheduler may end up picking
> an LLC with no idle CPUs. To mitigate this, add a new step between 1 and
> 2 where the scheduler compares the idle CPUs in the waker and previous
> LLCs and picks the appropriate one.
>
> The other alternative is to search for an idle CPU in the other LLC if
> the current select_idle_sibling() is unable to find an idle CPU in the
> preferred LLC. But that may increase the time to select a CPU.
>
>                                  5.11-rc6     5.11-rc6+revert  5.11-rc6+patch
> 8CORE/1JVM 80USERS  throughput   6651.6       6716.3 (0.97%)   6940 (4.34%)
>            sys/user:time         59.75/23.86  61.77/24.55      60/24
>
> 8CORE/2JVM 80USERS  throughput   6425.4       6446.8 (0.33%)   6473.2 (0.74%)
>            sys/user:time         70.59/24.25  72.28/23.77      70/24
>
> 8CORE/4JVM 80USERS  throughput   5355.3       5551.2 (3.66%)   5586.6 (4.32%)
>            sys/user:time         76.74/21.79  76.54/22.73      76/22
>
> 8CORE/8JVM 80USERS  throughput   4420.6       4553.3 (3.00%)   4405.8 (-0.33%)
>            sys/user:time         79.13/20.32  78.76/21.01      79/20
>
> Cc: LKML
> Cc: Michael Ellerman
> Cc: Michael Neuling
> Cc: Gautham R Shenoy
> Cc: Parth Shah
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Valentin Schneider
> Cc: Dietmar Eggemann
> Cc: Mel Gorman
> Cc: Vincent Guittot
> Co-developed-by: Gautham R Shenoy
> Signed-off-by: Gautham R Shenoy
> Co-developed-by: Parth Shah
> Signed-off-by: Parth Shah
> Signed-off-by: Srikar Dronamraju
> ---
>  kernel/sched/fair.c     | 41 +++--
>  kernel/sched/features.h |  2 ++
>  2 files changed, 41 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..d49bfcdc4a19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
>          return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
>  }
>
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> +        struct sched_domain_shared *tsds, *psds;
> +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +
> +        /* No need to compare, if both LLCs are fully loaded */
> +        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> +                return nr_cpumask_bits;
> +
> +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> +                return this_cpu;

Why have you chosen to favor this_cpu instead of prev_cpu unlike for
wake_idle ?

> +
> +        /* For better wakeup latency, prefer idler LLC to cache affinity */
> +        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> +        if (!diff)
> +                return nr_cpumask_bits;
> +        if (diff < 0)
> +                return this_cpu;
> +
> +        return prev_cpu;
> +}
> +
>  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>                         int this_cpu, int prev_cpu, int sync)
>  {
> @@ -5877,6 +5907,10 @@ static int wake_affine(struct
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-02 10:10:32]:

> On Tue, Mar 02, 2021 at 01:09:46PM +0530, Srikar Dronamraju wrote:
> > > Oh, could be, I didn't grep :/ We could have core code keep track of
> > > the smt count I suppose.
> >
> > Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?
>
> cpumask_weight() is potentially super expensive. With CPUMASK_OFFSTACK
> you get at least one more cache miss, and then the bitmap might be
> really long.
>
> Best to compute the result once and use it later.

Oh okay .. Noted.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Dietmar Eggemann [2021-03-02 10:53:06]:

> On 26/02/2021 17:40, Srikar Dronamraju wrote:
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a8bd7b13634..d49bfcdc4a19 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> >          return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
> >  }
> >
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +        struct sched_domain_shared *tsds, *psds;
> > +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > +        /* No need to compare, if both LLCs are fully loaded */
> > +        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
>                                                     ^
>                             shouldn't this be tllc_size ?

Yes, thanks for pointing out.

--
Thanks and Regards
Srikar Dronamraju
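With the agreed fix applied, the check compares each LLC's busy count
against its own size:

        /* No need to compare, if both LLCs are fully loaded */
        if (pnr_busy == pllc_size && tnr_busy == tllc_size)
                return nr_cpumask_bits;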
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On 26/02/2021 17:40, Srikar Dronamraju wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..d49bfcdc4a19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
>          return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
>  }
>
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> +        struct sched_domain_shared *tsds, *psds;
> +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +
> +        /* No need to compare, if both LLCs are fully loaded */
> +        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
                                                    ^
                            shouldn't this be tllc_size ?

> +                return nr_cpumask_bits;
> +
> +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> +                return this_cpu;
> +
> +        /* For better wakeup latency, prefer idler LLC to cache affinity */
> +        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> +        if (!diff)
> +                return nr_cpumask_bits;
> +        if (diff < 0)
> +                return this_cpu;
> +
> +        return prev_cpu;
> +}
> +
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Tue, Mar 02, 2021 at 01:09:46PM +0530, Srikar Dronamraju wrote:
> > Oh, could be, I didn't grep :/ We could have core code keep track of
> > the smt count I suppose.
>
> Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?

cpumask_weight() is potentially super expensive. With CPUMASK_OFFSTACK
you get at least one more cache miss, and then the bitmap might be really
long.

Best to compute the result once and use it later.
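As an illustration of the compute-once approach, a minimal sketch follows.
The per-CPU variable and the update hook are hypothetical names, not code
from this thread, and it assumes CONFIG_SCHED_SMT so that cpu_smt_mask()
is available:

/* Hypothetical cache of each CPU's SMT sibling count. */
static DEFINE_PER_CPU(int, sd_smt_weight);

/*
 * Meant to run from the sched domain rebuild path, where calling
 * cpumask_weight() once per CPU is harmless.
 */
static void update_smt_weight(int cpu)
{
        per_cpu(sd_smt_weight, cpu) = cpumask_weight(cpu_smt_mask(cpu));
}

/* The wakeup fast path then does a plain per-CPU load. */
static inline int smt_weight(int cpu)
{
        return per_cpu(sd_smt_weight, cpu);
}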
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 18:18:28]:

> On Mon, Mar 01, 2021 at 10:36:01PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra [2021-03-01 16:44:42]:
> >
> > > On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > > > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> > > >
> > > > Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> > > > something like that?
> > >
> > > How about:
> > >
> > >         tnr_busy < tllc_size / topology_max_smt_threads()
> > >
> > > ?
> >
> > Isn't topology_max_smt_threads only for x86 as of today?
> > Or am I missing out?
>
> Oh, could be, I didn't grep :/ We could have core code keep track of the
> smt count I suppose.

Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Mon, Mar 01, 2021 at 10:36:01PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra [2021-03-01 16:44:42]:
>
> > On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> > >
> > > > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > > +                return this_cpu;
> > >
> > > I wonder if we need to use a slightly lower threshold on
> > > very large LLCs, both to account for the fact that the
> > > select_idle_cpu code may not find the single idle CPU
> > > among a dozen busy ones, or because on a system with
> > > hyperthreading we may often be better off picking another
> > > LLC for HT contention issues?
> > >
> > > Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> > > something like that?
> >
> > How about:
> >
> >         tnr_busy < tllc_size / topology_max_smt_threads()
> >
> > ?
>
> Isn't topology_max_smt_threads only for x86 as of today?
> Or am I missing out?

Oh, could be, I didn't grep :/ We could have core code keep track of the
smt count I suppose.
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 16:40:33]:

> On Fri, Feb 26, 2021 at 10:10:29PM +0530, Srikar Dronamraju wrote:
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +        struct sched_domain_shared *tsds, *psds;
> > +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
>
> nr_busy_cpus is NO_HZ_COMMON, so this code that consumes it should be
> too.

Thanks Peter, will take care of this along with other changes, including
calling within rcu_read_lock and checking tsds and psds after
rcu_dereference.

--
Thanks and Regards
Srikar Dronamraju
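Putting the acked changes together, a sketch of how the reworked helper
might look, folding in Dietmar's tllc_size fix; the NULL checks after
rcu_dereference() and the !NO_HZ_COMMON fallback value are assumptions
about the unposted next version:

#ifdef CONFIG_NO_HZ_COMMON
static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
{
        struct sched_domain_shared *tsds, *psds;
        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;

        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
        /* A concurrent domain rebuild can leave either pointer NULL. */
        if (!tsds || !psds)
                return nr_cpumask_bits;

        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
        tllc_size = per_cpu(sd_llc_size, this_cpu);
        pnr_busy = atomic_read(&psds->nr_busy_cpus);
        pllc_size = per_cpu(sd_llc_size, prev_cpu);

        /* No need to compare, if both LLCs are fully loaded */
        if (pnr_busy == pllc_size && tnr_busy == tllc_size)
                return nr_cpumask_bits;

        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
                return this_cpu;

        /* For better wakeup latency, prefer idler LLC to cache affinity */
        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
        if (!diff)
                return nr_cpumask_bits;
        if (diff < 0)
                return this_cpu;

        return prev_cpu;
}
#else
static inline int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
{
        /* nr_busy_cpus is only maintained under CONFIG_NO_HZ_COMMON;
         * "no preference" keeps the existing wake_affine behavior. */
        return nr_cpumask_bits;
}
#endif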
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 16:44:42]:

> On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> >
> > > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > +                return this_cpu;
> >
> > I wonder if we need to use a slightly lower threshold on
> > very large LLCs, both to account for the fact that the
> > select_idle_cpu code may not find the single idle CPU
> > among a dozen busy ones, or because on a system with
> > hyperthreading we may often be better off picking another
> > LLC for HT contention issues?
> >
> > Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> > something like that?
>
> How about:
>
>         tnr_busy < tllc_size / topology_max_smt_threads()
>
> ?

Isn't topology_max_smt_threads only for x86 as of today?
Or am I missing out?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
>
> > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +                return this_cpu;
>
> I wonder if we need to use a slightly lower threshold on
> very large LLCs, both to account for the fact that the
> select_idle_cpu code may not find the single idle CPU
> among a dozen busy ones, or because on a system with
> hyperthreading we may often be better off picking another
> LLC for HT contention issues?
>
> Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> something like that?

How about:

        tnr_busy < tllc_size / topology_max_smt_threads()

?
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Fri, Feb 26, 2021 at 10:10:29PM +0530, Srikar Dronamraju wrote:
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> +        struct sched_domain_shared *tsds, *psds;
> +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +

nr_busy_cpus is NO_HZ_COMMON, so this code that consumes it should be
too.
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Rik van Riel [2021-02-27 14:56:07]:

> > In the current situation, where the waker and previous CPUs are busy
> > but only one of their LLCs has an idle CPU, the scheduler may end up
> > picking an LLC with no idle CPUs. To mitigate this, add a new step
> > between 1 and 2 where the scheduler compares the idle CPUs in the
> > waker and previous LLCs and picks the appropriate one.
>
> I like that idea a lot.

Thanks.

> That could also solve some of the issues sometimes observed on
> multi-node x86 systems, and probably on the newer AMD chips with
> several LLCs on chip.

Okay.

> > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +                return this_cpu;
>
> I wonder if we need to use a slightly lower threshold on
> very large LLCs, both to account for the fact that the
> select_idle_cpu code may not find the single idle CPU
> among a dozen busy ones, or because on a system with
> hyperthreading we may often be better off picking another
> LLC for HT contention issues?
>
> Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> something like that?
>
> That way we will only try to find the last 5 idle CPUs
> in a 22 CPU LLC if the other LLC also has fewer than 6
> idle cores.
>
> That might increase our chances of finding an idle CPU
> with SIS_PROP enabled, and might allow WA_WAKER to be
> true by default.

Agree we need to be conservative, especially if we want to make WA_WAKER
on by default. I would still like to hear from other people whether they
think it is okay to enable it by default. I wonder if enabling it by
default can cause some load imbalances, leading to more active load
balancing down the line. I haven't benchmarked with WA_WAKER enabled.

Thanks Rik for your inputs.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:

> Current order of preference to pick an LLC while waking a wake-affine
> task:
> 1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is idle.
>
> 2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is less heavily loaded.
>
> In the current situation, where the waker and previous CPUs are busy
> but only one of their LLCs has an idle CPU, the scheduler may end up
> picking an LLC with no idle CPUs. To mitigate this, add a new step
> between 1 and 2 where the scheduler compares the idle CPUs in the
> waker and previous LLCs and picks the appropriate one.

I like that idea a lot. That could also solve some of the
issues sometimes observed on multi-node x86 systems, and
probably on the newer AMD chips with several LLCs on chip.

> +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> +                return this_cpu;

I wonder if we need to use a slightly lower threshold on
very large LLCs, both to account for the fact that the
select_idle_cpu code may not find the single idle CPU
among a dozen busy ones, or because on a system with
hyperthreading we may often be better off picking another
LLC for HT contention issues?

Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
something like that?

That way we will only try to find the last 5 idle CPUs
in a 22 CPU LLC if the other LLC also has fewer than 6
idle cores.

That might increase our chances of finding an idle CPU
with SIS_PROP enabled, and might allow WA_WAKER to be
true by default.

> +        /* For better wakeup latency, prefer idler LLC to cache affinity */
> +        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> +        if (!diff)
> +                return nr_cpumask_bits;
> +        if (diff < 0)
> +                return this_cpu;
> +
> +        return prev_cpu;
> +}

--
All Rights Reversed.
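Concretely, Rik's threshold would change the WA_WAKER check along these
lines (a sketch; the 4/3 ratio is his example):

        /*
         * Only pull the wakee over when at least a quarter of the
         * waker's LLC is idle. For a 22-CPU LLC, tllc_size * 3 = 66,
         * so the check passes only while tnr_busy <= 16, i.e. with at
         * least 6 idle CPUs.
         */
        if (sched_feat(WA_WAKER) && tnr_busy * 4 < tllc_size * 3)
                return this_cpu;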
[PATCH] sched/fair: Prefer idle CPU to cache affinity
On POWER8 and POWER9, the last level cache (L2) has been at the level of
a group of 8 threads (SMT8 on POWER8, a big-core comprising a pair of
SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
level of a group of SMT4 threads within the SMT8 core. Due to the
shrinking size of the LLC domain, the probability of finding an idle CPU
in the LLC domain of the target is lower on POWER10 compared to the
previous generation processors.

With commit 9538abee18cc ("powerpc/smp: Add support detecting
thread-groups sharing L2 cache") benchmarks such as Daytrader
(https://github.com/WASdev/sample.daytrader7) show a drop in throughput
in a configuration consisting of 1 JVM spanning across 6-8 big-cores on
POWER10. Analysis showed that this was because more wakeups were
happening on busy CPUs when the utilization was 60-70%. This drop in
throughput also shows up as a drop in CPU utilization. However, most
other benchmarks benefit from detecting the thread-groups that share L2
cache.

Current order of preference to pick an LLC while waking a wake-affine
task:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is idle.

2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is less heavily loaded.

In the current situation, where the waker and previous CPUs are busy but
only one of their LLCs has an idle CPU, the scheduler may end up picking
an LLC with no idle CPUs. To mitigate this, add a new step between 1 and
2 where the scheduler compares the idle CPUs in the waker and previous
LLCs and picks the appropriate one.

The other alternative is to search for an idle CPU in the other LLC if
the current select_idle_sibling() is unable to find an idle CPU in the
preferred LLC. But that may increase the time to select a CPU.

                                 5.11-rc6     5.11-rc6+revert  5.11-rc6+patch
8CORE/1JVM 80USERS  throughput   6651.6       6716.3 (0.97%)   6940 (4.34%)
           sys/user:time         59.75/23.86  61.77/24.55      60/24

8CORE/2JVM 80USERS  throughput   6425.4       6446.8 (0.33%)   6473.2 (0.74%)
           sys/user:time         70.59/24.25  72.28/23.77      70/24

8CORE/4JVM 80USERS  throughput   5355.3       5551.2 (3.66%)   5586.6 (4.32%)
           sys/user:time         76.74/21.79  76.54/22.73      76/22

8CORE/8JVM 80USERS  throughput   4420.6       4553.3 (3.00%)   4405.8 (-0.33%)
           sys/user:time         79.13/20.32  78.76/21.01      79/20

Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Parth Shah
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Co-developed-by: Gautham R Shenoy
Signed-off-by: Gautham R Shenoy
Co-developed-by: Parth Shah
Signed-off-by: Parth Shah
Signed-off-by: Srikar Dronamraju
---
 kernel/sched/fair.c     | 41 +++--
 kernel/sched/features.h |  2 ++
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..d49bfcdc4a19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
         return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
+static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
+{
+        struct sched_domain_shared *tsds, *psds;
+        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
+
+        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+        tllc_size = per_cpu(sd_llc_size, this_cpu);
+
+        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
+        pnr_busy = atomic_read(&psds->nr_busy_cpus);
+        pllc_size = per_cpu(sd_llc_size, prev_cpu);
+
+        /* No need to compare, if both LLCs are fully loaded */
+        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
+                return nr_cpumask_bits;
+
+        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
+                return this_cpu;
+
+        /* For better wakeup latency, prefer idler LLC to cache affinity */
+        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
+        if (!diff)
+                return nr_cpumask_bits;
+        if (diff < 0)
+                return this_cpu;
+
+        return prev_cpu;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                        int this_cpu, int prev_cpu, int sync)
 {
@@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
         if (sched_feat(WA_IDLE))
                 target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
+        if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
+                        !cpus_share_cache(this_cpu, prev_cpu))
+                target = prefer_idler_llc(this_cpu, prev_cpu, sync);
+
         if
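The kernel/sched/features.h hunk is truncated above. Given the diffstat
(2 added lines) and the feature names used in fair.c, it plausibly reads
as below; the default values are an assumption — the thread suggests
WA_WAKER is not on by default, since Srikar notes he has not benchmarked
with it enabled:

/* Plausible reconstruction of the truncated features.h addition. */
SCHED_FEAT(WA_IDLER_LLC, true)
SCHED_FEAT(WA_WAKER, false)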