Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Wed, 10 Mar 2021 at 06:53, Srikar Dronamraju wrote:
>
> * Vincent Guittot [2021-03-08 14:52:39]:
>
> > On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju wrote:
>
> Thanks Vincent for your review comments.
>
> > > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > > +{
> > > +        struct sched_domain_shared *tsds, *psds;
> > > +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > > +
> > > +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > > +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > > +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> > > +
> > > +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > > +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > > +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > > +
> > > +        /* No need to compare, if both LLCs are fully loaded */
> > > +        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> > > +                return nr_cpumask_bits;
> > > +
> > > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > +                return this_cpu;
> >
> > Why have you chosen to favor this_cpu instead of prev_cpu unlike for
> > wake_idle ?
>
> At this point, we know the waker running on this_cpu and the wakee, which
> was running on prev_cpu, are affine to each other, and this_cpu and
> prev_cpu don't share cache. I chose to move them close to each other to
> benefit from the cache sharing. Based on feedback from Peter and Rik, I
> made the check more conservative, i.e. tnr_busy <= tllc_size/smt_weight
> (where smt_weight is the cpumask weight of the smt domain for this_cpu),
> i.e. if we have a free core in this llc domain, choose this_cpu.
> select_idle_sibling() should pick an idle cpu/core/smt within the llc
> domain for this_cpu.

yeah make sense

> Do you feel this may not be the correct option?

I was worried that we end up pulling tasks in the same llc, but the
condition above and wake_wide should prevent such behavior.

> We are also experimenting with another option, where we call
> prefer_idler_cpu after wa_weight. I.e.
> 1. if wake_affine_weight chooses this_cpu, but the llc of prev_cpu has an
>    idle smt/CPU and there is no idle smt/CPU in this_cpu's llc, then
>    choose the idle smt/CPU in prev_cpu's llc
> 2. if wake_affine_weight chooses nr_cpumask (aka prev_cpu), but the llc
>    of this_cpu has an idle smt/CPU and there is no idle smt/CPU in
>    prev_cpu's llc, then choose the idle smt/CPU in this_cpu's llc
>
> > > +
> > > +        /* For better wakeup latency, prefer idler LLC to cache affinity */
> > > +        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> > > +        if (!diff)
> > > +                return nr_cpumask_bits;
> > > +        if (diff < 0)
> > > +                return this_cpu;
> > > +
> > > +        return prev_cpu;
> > > +}
> > > +
> > >  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > >                         int this_cpu, int prev_cpu, int sync)
> > >  {
> > > @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > >          if (sched_feat(WA_IDLE))
> > >                  target = wake_affine_idle(this_cpu, prev_cpu, sync);
> > >
> > > +        if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> > > +                        !cpus_share_cache(this_cpu, prev_cpu))
> > > +                target = prefer_idler_llc(this_cpu, prev_cpu, sync);
> >
> > could you use the same naming convention as other functions ?
> > wake_affine_llc as an example
>
> I guess you meant s/prefer_idler_llc/wake_affine_llc/

yes

> Sure. I can modify.
>
> > > +
> > >          if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> > >                  target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> > >
> > > @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > >          if (target == nr_cpumask_bits)
> > >                  return prev_cpu;
> > >
> > > -        schedstat_inc(sd->ttwu_move_affine);
> > > -        schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > > +        if (target == this_cpu) {
> >
> > How is this condition related to $subject ?
>
> Before this change, wake_affine_weight and wake_affine_idle would either
> return this_cpu or nr_cpumask_bits. Just before this check, we check if
> target is nr_cpumask_bits and return prev_cpu. So the stats were only
> incremented when target was this_cpu.
>
> However, with prefer_idler_llc, we may return this_cpu, prev_cpu or
> nr_cpumask_bits. Now we only want to update stats when we have chosen to
> migrate the task to this_cpu. Hence I had this check.

ok got it. Maybe return earlier in this case, like for the
if (target == nr_cpumask_bits) check above.

> If we use the slightly lazier approach, which is to check wa_weight first
> before wa_idler_llc, then we may not need this change at all.
>
> --
> Thanks and Regards
> Srikar Dronamraju
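For reference, Vincent's early-return suggestion could reshape wake_affine()
roughly as follows — a sketch on top of the posted patch, not code from the
thread:

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                       int this_cpu, int prev_cpu, int sync)
{
        int target = nr_cpumask_bits;

        if (sched_feat(WA_IDLE))
                target = wake_affine_idle(this_cpu, prev_cpu, sync);

        if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
            !cpus_share_cache(this_cpu, prev_cpu))
                target = prefer_idler_llc(this_cpu, prev_cpu, sync);

        if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
                target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

        /* Both "no preference" and "stay on prev_cpu" return early here. */
        if (target != this_cpu)
                return prev_cpu;

        /* Stats are accounted only when the task migrates to the waker. */
        schedstat_inc(sd->ttwu_move_affine);
        schedstat_inc(p->se.statistics.nr_wakeups_affine);
        return this_cpu;
}

This keeps the schedstat accounting on a single this_cpu exit path, which is
what the `if (target == this_cpu)` hunk in the posted patch was working around.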
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Vincent Guittot [2021-03-08 14:52:39]:

> On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju wrote:

Thanks Vincent for your review comments.

> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +        struct sched_domain_shared *tsds, *psds;
> > +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > +        /* No need to compare, if both LLCs are fully loaded */
> > +        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> > +                return nr_cpumask_bits;
> > +
> > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +                return this_cpu;
>
> Why have you chosen to favor this_cpu instead of prev_cpu unlike for
> wake_idle ?

At this point, we know the waker running on this_cpu and the wakee, which
was running on prev_cpu, are affine to each other, and this_cpu and
prev_cpu don't share cache. I chose to move them close to each other to
benefit from the cache sharing. Based on feedback from Peter and Rik, I
made the check more conservative, i.e. tnr_busy <= tllc_size/smt_weight
(where smt_weight is the cpumask weight of the smt domain for this_cpu),
i.e. if we have a free core in this llc domain, choose this_cpu.
select_idle_sibling() should pick an idle cpu/core/smt within the llc
domain for this_cpu.

Do you feel this may not be the correct option?

We are also experimenting with another option, where we call
prefer_idler_cpu after wa_weight. I.e.
1. if wake_affine_weight chooses this_cpu, but the llc of prev_cpu has an
   idle smt/CPU and there is no idle smt/CPU in this_cpu's llc, then
   choose the idle smt/CPU in prev_cpu's llc
2. if wake_affine_weight chooses nr_cpumask (aka prev_cpu), but the llc
   of this_cpu has an idle smt/CPU and there is no idle smt/CPU in
   prev_cpu's llc, then choose the idle smt/CPU in this_cpu's llc

> > +
> > +        /* For better wakeup latency, prefer idler LLC to cache affinity */
> > +        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> > +        if (!diff)
> > +                return nr_cpumask_bits;
> > +        if (diff < 0)
> > +                return this_cpu;
> > +
> > +        return prev_cpu;
> > +}
> > +
> >  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >                         int this_cpu, int prev_cpu, int sync)
> >  {
> > @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >          if (sched_feat(WA_IDLE))
> >                  target = wake_affine_idle(this_cpu, prev_cpu, sync);
> >
> > +        if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> > +                        !cpus_share_cache(this_cpu, prev_cpu))
> > +                target = prefer_idler_llc(this_cpu, prev_cpu, sync);
>
> could you use the same naming convention as other functions ?
> wake_affine_llc as an example

I guess you meant s/prefer_idler_llc/wake_affine_llc/
Sure. I can modify.

> > +
> >          if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> >                  target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> >
> > @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >          if (target == nr_cpumask_bits)
> >                  return prev_cpu;
> >
> > -        schedstat_inc(sd->ttwu_move_affine);
> > -        schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > +        if (target == this_cpu) {
>
> How is this condition related to $subject ?

Before this change, wake_affine_weight and wake_affine_idle would either
return this_cpu or nr_cpumask_bits. Just before this check, we check if
target is nr_cpumask_bits and return prev_cpu. So the stats were only
incremented when target was this_cpu.

However, with prefer_idler_llc, we may return this_cpu, prev_cpu or
nr_cpumask_bits. Now we only want to update stats when we have chosen to
migrate the task to this_cpu. Hence I had this check.

If we use the slightly lazier approach, which is to check wa_weight first
before wa_idler_llc, then we may not need this change at all.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju wrote:
>
> On POWER8 and POWER9, the last level cache (L2) has been at the level of
> a group of 8 threads (SMT8 on POWER8, a big-core comprising a pair of
> SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
> level of a group of SMT4 threads within the SMT8 core. Due to the
> shrinking size of the LLC domain, the probability of finding an idle CPU
> in the LLC domain of the target is lower on POWER10 compared to the
> previous generation processors.
>
> With commit 9538abee18cc ("powerpc/smp: Add support detecting
> thread-groups sharing L2 cache") benchmarks such as Daytrader
> (https://github.com/WASdev/sample.daytrader7) show a drop in throughput
> in a configuration consisting of 1 JVM spanning across 6-8 big-cores on
> POWER10. Analysis showed that this was because more wakeups were
> happening on busy CPUs when the utilization was 60-70%. This drop in
> throughput also shows up as a drop in CPU utilization. However, most
> other benchmarks benefit from detecting the thread-groups that share L2
> cache.
>
> Current order of preference to pick an LLC while waking a wake-affine
> task:
> 1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is idle.
>
> 2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is less heavily loaded.
>
> In the current situation, where the waker and previous CPUs are busy but
> only one of their LLCs has an idle CPU, the scheduler may end up picking
> an LLC with no idle CPUs. To mitigate this, add a new step between 1 and
> 2 where the scheduler compares the idle CPUs in the waker and previous
> LLCs and picks the appropriate one.
>
> The other alternative is to search for an idle CPU in the other LLC if
> the current select_idle_sibling() is unable to find an idle CPU in the
> preferred LLC. But that may increase the time to select a CPU.
>
>                                  5.11-rc6     5.11-rc6+revert  5.11-rc6+patch
> 8CORE/1JVM 80USERS  throughput   6651.6       6716.3 (0.97%)   6940 (4.34%)
>            sys/user:time         59.75/23.86  61.77/24.55      60/24
>
> 8CORE/2JVM 80USERS  throughput   6425.4       6446.8 (0.33%)   6473.2 (0.74%)
>            sys/user:time         70.59/24.25  72.28/23.77      70/24
>
> 8CORE/4JVM 80USERS  throughput   5355.3       5551.2 (3.66%)   5586.6 (4.32%)
>            sys/user:time         76.74/21.79  76.54/22.73      76/22
>
> 8CORE/8JVM 80USERS  throughput   4420.6       4553.3 (3.00%)   4405.8 (-0.33%)
>            sys/user:time         79.13/20.32  78.76/21.01      79/20
>
> Cc: LKML
> Cc: Michael Ellerman
> Cc: Michael Neuling
> Cc: Gautham R Shenoy
> Cc: Parth Shah
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Valentin Schneider
> Cc: Dietmar Eggemann
> Cc: Mel Gorman
> Cc: Vincent Guittot
> Co-developed-by: Gautham R Shenoy
> Signed-off-by: Gautham R Shenoy
> Co-developed-by: Parth Shah
> Signed-off-by: Parth Shah
> Signed-off-by: Srikar Dronamraju
> ---
>  kernel/sched/fair.c     | 41 +++--
>  kernel/sched/features.h |  2 ++
>  2 files changed, 41 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..d49bfcdc4a19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
>          return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
>  }
>
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> +        struct sched_domain_shared *tsds, *psds;
> +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +
> +        /* No need to compare, if both LLCs are fully loaded */
> +        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> +                return nr_cpumask_bits;
> +
> +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> +                return this_cpu;

Why have you chosen to favor this_cpu instead of prev_cpu unlike for
wake_idle ?

> +
> +        /* For better wakeup latency, prefer idler LLC to cache affinity */
> +        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> +        if (!diff)
> +                return nr_cpumask_bits;
> +        if (diff < 0)
> +                return this_cpu;
> +
> +        return prev_cpu;
> +}
> +
>  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>                         int this_cpu, int prev_cpu, int sync)
>  {
> @@ -5877,6 +5907,10 @@ static int wake_affine(struct
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-02 10:10:32]:

> On Tue, Mar 02, 2021 at 01:09:46PM +0530, Srikar Dronamraju wrote:
> > > Oh, could be, I didn't grep :/ We could have core code keep track of
> > > the smt count I suppose.
> >
> > Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?
>
> cpumask_weight() is potentially super expensive. With CPUMASK_OFFSTACK
> you get at least one more cache miss, and then the bitmap might be
> really long.
>
> Best to compute the result once and use it later.

Oh okay .. Noted.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Dietmar Eggemann [2021-03-02 10:53:06]:

> On 26/02/2021 17:40, Srikar Dronamraju wrote:
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a8bd7b13634..d49bfcdc4a19 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> >          return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
> >  }
> >
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +        struct sched_domain_shared *tsds, *psds;
> > +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > +        /* No need to compare, if both LLCs are fully loaded */
> > +        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
>                                                     ^
>                             shouldn't this be tllc_size ?

Yes, thanks for pointing out.

--
Thanks and Regards
Srikar Dronamraju
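With the agreed fix applied, the check compares each LLC's busy count
against its own size:

        /* No need to compare, if both LLCs are fully loaded */
        if (pnr_busy == pllc_size && tnr_busy == tllc_size)
                return nr_cpumask_bits;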
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On 26/02/2021 17:40, Srikar Dronamraju wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..d49bfcdc4a19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
>          return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
>  }
>
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> +        struct sched_domain_shared *tsds, *psds;
> +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +
> +        /* No need to compare, if both LLCs are fully loaded */
> +        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
                                                    ^
                            shouldn't this be tllc_size ?

> +                return nr_cpumask_bits;
> +
> +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> +                return this_cpu;
> +
> +        /* For better wakeup latency, prefer idler LLC to cache affinity */
> +        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> +        if (!diff)
> +                return nr_cpumask_bits;
> +        if (diff < 0)
> +                return this_cpu;
> +
> +        return prev_cpu;
> +}
> +
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Tue, Mar 02, 2021 at 01:09:46PM +0530, Srikar Dronamraju wrote:
> > Oh, could be, I didn't grep :/ We could have core code keep track of
> > the smt count I suppose.
>
> Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?

cpumask_weight() is potentially super expensive. With CPUMASK_OFFSTACK
you get at least one more cache miss, and then the bitmap might be really
long.

Best to compute the result once and use it later.
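As an illustration of the compute-once approach, a minimal sketch follows.
The per-CPU variable and the update hook are hypothetical names, not code
from this thread, and it assumes CONFIG_SCHED_SMT so that cpu_smt_mask()
is available:

/* Hypothetical cache of each CPU's SMT sibling count. */
static DEFINE_PER_CPU(int, sd_smt_weight);

/*
 * Meant to run from the sched domain rebuild path, where calling
 * cpumask_weight() once per CPU is harmless.
 */
static void update_smt_weight(int cpu)
{
        per_cpu(sd_smt_weight, cpu) = cpumask_weight(cpu_smt_mask(cpu));
}

/* The wakeup fast path then does a plain per-CPU load. */
static inline int smt_weight(int cpu)
{
        return per_cpu(sd_smt_weight, cpu);
}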
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 18:18:28]:

> On Mon, Mar 01, 2021 at 10:36:01PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra [2021-03-01 16:44:42]:
> >
> > > On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > > > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> > > >
> > > > Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> > > > something like that?
> > >
> > > How about:
> > >
> > >         tnr_busy < tllc_size / topology_max_smt_threads()
> > >
> > > ?
> >
> > Isn't topology_max_smt_threads only for x86 as of today?
> > Or am I missing out?
>
> Oh, could be, I didn't grep :/ We could have core code keep track of the
> smt count I suppose.

Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Mon, Mar 01, 2021 at 10:36:01PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra [2021-03-01 16:44:42]:
>
> > On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> > >
> > > > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > > +                return this_cpu;
> > >
> > > I wonder if we need to use a slightly lower threshold on
> > > very large LLCs, both to account for the fact that the
> > > select_idle_cpu code may not find the single idle CPU
> > > among a dozen busy ones, or because on a system with
> > > hyperthreading we may often be better off picking another
> > > LLC for HT contention issues?
> > >
> > > Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> > > something like that?
> >
> > How about:
> >
> >         tnr_busy < tllc_size / topology_max_smt_threads()
> >
> > ?
>
> Isn't topology_max_smt_threads only for x86 as of today?
> Or am I missing out?

Oh, could be, I didn't grep :/ We could have core code keep track of the
smt count I suppose.
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 16:40:33]:

> On Fri, Feb 26, 2021 at 10:10:29PM +0530, Srikar Dronamraju wrote:
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +        struct sched_domain_shared *tsds, *psds;
> > +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
>
> nr_busy_cpus is NO_HZ_COMMON, so this code that consumes it should be
> too.

Thanks Peter, will take care of this along with other changes, including
calling within rcu_read_lock and checking tsds and psds after
rcu_dereference.

--
Thanks and Regards
Srikar Dronamraju
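Putting the acked changes together, a sketch of how the reworked helper
might look, folding in Dietmar's tllc_size fix; the NULL checks after
rcu_dereference() and the !NO_HZ_COMMON fallback value are assumptions
about the unposted next version:

#ifdef CONFIG_NO_HZ_COMMON
static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
{
        struct sched_domain_shared *tsds, *psds;
        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;

        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
        /* A concurrent domain rebuild can leave either pointer NULL. */
        if (!tsds || !psds)
                return nr_cpumask_bits;

        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
        tllc_size = per_cpu(sd_llc_size, this_cpu);
        pnr_busy = atomic_read(&psds->nr_busy_cpus);
        pllc_size = per_cpu(sd_llc_size, prev_cpu);

        /* No need to compare, if both LLCs are fully loaded */
        if (pnr_busy == pllc_size && tnr_busy == tllc_size)
                return nr_cpumask_bits;

        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
                return this_cpu;

        /* For better wakeup latency, prefer idler LLC to cache affinity */
        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
        if (!diff)
                return nr_cpumask_bits;
        if (diff < 0)
                return this_cpu;

        return prev_cpu;
}
#else
static inline int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
{
        /* nr_busy_cpus is only maintained under CONFIG_NO_HZ_COMMON;
         * "no preference" keeps the existing wake_affine behavior. */
        return nr_cpumask_bits;
}
#endif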
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 16:44:42]:

> On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> >
> > > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > +                return this_cpu;
> >
> > I wonder if we need to use a slightly lower threshold on
> > very large LLCs, both to account for the fact that the
> > select_idle_cpu code may not find the single idle CPU
> > among a dozen busy ones, or because on a system with
> > hyperthreading we may often be better off picking another
> > LLC for HT contention issues?
> >
> > Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> > something like that?
>
> How about:
>
>         tnr_busy < tllc_size / topology_max_smt_threads()
>
> ?

Isn't topology_max_smt_threads only for x86 as of today?
Or am I missing out?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
>
> > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +                return this_cpu;
>
> I wonder if we need to use a slightly lower threshold on
> very large LLCs, both to account for the fact that the
> select_idle_cpu code may not find the single idle CPU
> among a dozen busy ones, or because on a system with
> hyperthreading we may often be better off picking another
> LLC for HT contention issues?
>
> Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> something like that?

How about:

        tnr_busy < tllc_size / topology_max_smt_threads()

?
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Fri, Feb 26, 2021 at 10:10:29PM +0530, Srikar Dronamraju wrote:
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> +        struct sched_domain_shared *tsds, *psds;
> +        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> +        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> +        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> +        tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> +        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> +        pnr_busy = atomic_read(&psds->nr_busy_cpus);
> +        pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +

nr_busy_cpus is NO_HZ_COMMON, so this code that consumes it should be
too.
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Rik van Riel [2021-02-27 14:56:07]:

> > In the current situation, where the waker and previous CPUs are busy
> > but only one of their LLCs has an idle CPU, the scheduler may end up
> > picking an LLC with no idle CPUs. To mitigate this, add a new step
> > between 1 and 2 where the scheduler compares the idle CPUs in the
> > waker and previous LLCs and picks the appropriate one.
>
> I like that idea a lot.

Thanks.

> That could also solve some of the issues sometimes observed on
> multi-node x86 systems, and probably on the newer AMD chips with
> several LLCs on chip.

Okay.

> > +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +                return this_cpu;
>
> I wonder if we need to use a slightly lower threshold on
> very large LLCs, both to account for the fact that the
> select_idle_cpu code may not find the single idle CPU
> among a dozen busy ones, or because on a system with
> hyperthreading we may often be better off picking another
> LLC for HT contention issues?
>
> Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> something like that?
>
> That way we will only try to find the last 5 idle CPUs
> in a 22 CPU LLC if the other LLC also has fewer than 6
> idle cores.
>
> That might increase our chances of finding an idle CPU
> with SIS_PROP enabled, and might allow WA_WAKER to be
> true by default.

Agree we need to be conservative, especially if we want to make WA_WAKER
on by default. I would still like to hear from other people whether they
think it is okay to enable it by default. I wonder if enabling it by
default can cause some load imbalances, leading to more active load
balancing down the line. I haven't benchmarked with WA_WAKER enabled.

Thanks Rik for your inputs.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:

> Current order of preference to pick an LLC while waking a wake-affine
> task:
> 1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is idle.
>
> 2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is less heavily loaded.
>
> In the current situation, where the waker and previous CPUs are busy
> but only one of their LLCs has an idle CPU, the scheduler may end up
> picking an LLC with no idle CPUs. To mitigate this, add a new step
> between 1 and 2 where the scheduler compares the idle CPUs in the
> waker and previous LLCs and picks the appropriate one.

I like that idea a lot. That could also solve some of the
issues sometimes observed on multi-node x86 systems, and
probably on the newer AMD chips with several LLCs on chip.

> +        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> +                return this_cpu;

I wonder if we need to use a slightly lower threshold on
very large LLCs, both to account for the fact that the
select_idle_cpu code may not find the single idle CPU
among a dozen busy ones, or because on a system with
hyperthreading we may often be better off picking another
LLC for HT contention issues?

Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
something like that?

That way we will only try to find the last 5 idle CPUs
in a 22 CPU LLC if the other LLC also has fewer than 6
idle cores.

That might increase our chances of finding an idle CPU
with SIS_PROP enabled, and might allow WA_WAKER to be
true by default.

> +        /* For better wakeup latency, prefer idler LLC to cache affinity */
> +        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> +        if (!diff)
> +                return nr_cpumask_bits;
> +        if (diff < 0)
> +                return this_cpu;
> +
> +        return prev_cpu;
> +}

--
All Rights Reversed.
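Concretely, Rik's threshold would change the WA_WAKER check along these
lines (a sketch; the 4/3 ratio is his example):

        /*
         * Only pull the wakee over when at least a quarter of the
         * waker's LLC is idle. For a 22-CPU LLC, tllc_size * 3 = 66,
         * so the check passes only while tnr_busy <= 16, i.e. with at
         * least 6 idle CPUs.
         */
        if (sched_feat(WA_WAKER) && tnr_busy * 4 < tllc_size * 3)
                return this_cpu;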
[PATCH] sched/fair: Prefer idle CPU to cache affinity
On POWER8 and POWER9, the last level cache (L2) has been at the level of
a group of 8 threads (SMT8 on POWER8, a big-core comprising a pair of
SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
level of a group of SMT4 threads within the SMT8 core. Due to the
shrinking size of the LLC domain, the probability of finding an idle CPU
in the LLC domain of the target is lower on POWER10 compared to the
previous generation processors.

With commit 9538abee18cc ("powerpc/smp: Add support detecting
thread-groups sharing L2 cache") benchmarks such as Daytrader
(https://github.com/WASdev/sample.daytrader7) show a drop in throughput
in a configuration consisting of 1 JVM spanning across 6-8 big-cores on
POWER10. Analysis showed that this was because more wakeups were
happening on busy CPUs when the utilization was 60-70%. This drop in
throughput also shows up as a drop in CPU utilization. However, most
other benchmarks benefit from detecting the thread-groups that share L2
cache.

Current order of preference to pick an LLC while waking a wake-affine
task:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is idle.

2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is less heavily loaded.

In the current situation, where the waker and previous CPUs are busy but
only one of their LLCs has an idle CPU, the scheduler may end up picking
an LLC with no idle CPUs. To mitigate this, add a new step between 1 and
2 where the scheduler compares the idle CPUs in the waker and previous
LLCs and picks the appropriate one.

The other alternative is to search for an idle CPU in the other LLC if
the current select_idle_sibling() is unable to find an idle CPU in the
preferred LLC. But that may increase the time to select a CPU.

                                 5.11-rc6     5.11-rc6+revert  5.11-rc6+patch
8CORE/1JVM 80USERS  throughput   6651.6       6716.3 (0.97%)   6940 (4.34%)
           sys/user:time         59.75/23.86  61.77/24.55      60/24

8CORE/2JVM 80USERS  throughput   6425.4       6446.8 (0.33%)   6473.2 (0.74%)
           sys/user:time         70.59/24.25  72.28/23.77      70/24

8CORE/4JVM 80USERS  throughput   5355.3       5551.2 (3.66%)   5586.6 (4.32%)
           sys/user:time         76.74/21.79  76.54/22.73      76/22

8CORE/8JVM 80USERS  throughput   4420.6       4553.3 (3.00%)   4405.8 (-0.33%)
           sys/user:time         79.13/20.32  78.76/21.01      79/20

Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Parth Shah
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Co-developed-by: Gautham R Shenoy
Signed-off-by: Gautham R Shenoy
Co-developed-by: Parth Shah
Signed-off-by: Parth Shah
Signed-off-by: Srikar Dronamraju
---
 kernel/sched/fair.c     | 41 +++--
 kernel/sched/features.h |  2 ++
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..d49bfcdc4a19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
         return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
+static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
+{
+        struct sched_domain_shared *tsds, *psds;
+        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
+
+        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+        tllc_size = per_cpu(sd_llc_size, this_cpu);
+
+        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
+        pnr_busy = atomic_read(&psds->nr_busy_cpus);
+        pllc_size = per_cpu(sd_llc_size, prev_cpu);
+
+        /* No need to compare, if both LLCs are fully loaded */
+        if (pnr_busy == pllc_size && tnr_busy == pllc_size)
+                return nr_cpumask_bits;
+
+        if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
+                return this_cpu;
+
+        /* For better wakeup latency, prefer idler LLC to cache affinity */
+        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
+        if (!diff)
+                return nr_cpumask_bits;
+        if (diff < 0)
+                return this_cpu;
+
+        return prev_cpu;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                        int this_cpu, int prev_cpu, int sync)
 {
@@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
         if (sched_feat(WA_IDLE))
                 target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
+        if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
+                        !cpus_share_cache(this_cpu, prev_cpu))
+                target = prefer_idler_llc(this_cpu, prev_cpu, sync);
+
         if
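The kernel/sched/features.h hunk is truncated above. Given the diffstat
(2 added lines) and the feature names used in fair.c, it plausibly reads
as below; the default values are an assumption — the thread suggests
WA_WAKER is not on by default, since Srikar notes he has not benchmarked
with it enabled:

/* Plausible reconstruction of the truncated features.h addition. */
SCHED_FEAT(WA_IDLER_LLC, true)
SCHED_FEAT(WA_WAKER, false)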