Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-09 Thread Valentin Schneider
On 02/09/2018 11:41 AM, Vincent Guittot wrote:
> On 8 February 2018 at 20:21, Valentin Schneider
>  wrote:
>> On 02/08/2018 01:36 PM, Vincent Guittot wrote:
>>> On 8 February 2018 at 13:46, Valentin Schneider
>>>  wrote:
 On 02/06/2018 07:23 PM, Vincent Guittot wrote:
> [...]
> 
>>
>> In summary:
>>
>> 20 iterations per test case
>> All non-mentioned CPUs are idling
>>
>> -
>> kick_ilb() test case:
>> -
>>
>> a, b = 100% rt-app tasks
>> - = idling
>>
>> Accumulating load before sleeping
>> ^
>> ^
>> CPU1| a a a - - - a
>> CPU2| - - b b b b b
>>   v
>>   v > Periodically kicking ILBs to update nohz blocked load
>>
>> Baseline:
>> _nohz_idle_balance() takes 39µs in average
>> nohz_balance_enter_idle() takes 233ns in average
>>
>> W/ cpumask:
>> _nohz_idle_balance() takes 33µs in average
>> nohz_balance_enter_idle() takes 283ns in average
>>
>> Diff:
>> _nohz_idle_balance() -6µs in average (-16%)
>> nohz_balance_enter_idle() +50ns in average (+21%)
> 
> In your use case, there is no contention when accessing the cpumask.
> Have you tried a use case with tasks that wake ups and go back to idle
> simultaneously on several/all cpus so they will fight to update the
> atomic resources ?
> That would be interesting to see the impact on the runtime of the
> nohz_balance_enter_idle function

No, I haven't tried that yet. For now these tests picture the "best case" 
scenario since all but one CPU is idle. I've been meaning to test busier
scenarios - I'll give your idle/sleep storm a try, thanks for the suggestion.

I also need to work on a test case for the load_balance() call in
idle_balance(). As Peter mentioned, the clearing of has_blocked in 
update_sd_lb_stats() can only be done with atomic ops, so that's another 
thing to profile against the baseline.



Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-09 Thread Valentin Schneider
On 02/09/2018 11:41 AM, Vincent Guittot wrote:
> On 8 February 2018 at 20:21, Valentin Schneider
>  wrote:
>> On 02/08/2018 01:36 PM, Vincent Guittot wrote:
>>> On 8 February 2018 at 13:46, Valentin Schneider
>>>  wrote:
 On 02/06/2018 07:23 PM, Vincent Guittot wrote:
> [...]
> 
>>
>> In summary:
>>
>> 20 iterations per test case
>> All non-mentioned CPUs are idling
>>
>> -
>> kick_ilb() test case:
>> -
>>
>> a, b = 100% rt-app tasks
>> - = idling
>>
>> Accumulating load before sleeping
>> ^
>> ^
>> CPU1| a a a - - - a
>> CPU2| - - b b b b b
>>   v
>>   v > Periodically kicking ILBs to update nohz blocked load
>>
>> Baseline:
>> _nohz_idle_balance() takes 39µs in average
>> nohz_balance_enter_idle() takes 233ns in average
>>
>> W/ cpumask:
>> _nohz_idle_balance() takes 33µs in average
>> nohz_balance_enter_idle() takes 283ns in average
>>
>> Diff:
>> _nohz_idle_balance() -6µs in average (-16%)
>> nohz_balance_enter_idle() +50ns in average (+21%)
> 
> In your use case, there is no contention when accessing the cpumask.
> Have you tried a use case with tasks that wake ups and go back to idle
> simultaneously on several/all cpus so they will fight to update the
> atomic resources ?
> That would be interesting to see the impact on the runtime of the
> nohz_balance_enter_idle function

No, I haven't tried that yet. For now these tests picture the "best case" 
scenario since all but one CPU is idle. I've been meaning to test busier
scenarios - I'll give your idle/sleep storm a try, thanks for the suggestion.

I also need to work on a test case for the load_balance() call in
idle_balance(). As Peter mentioned, the clearing of has_blocked in 
update_sd_lb_stats() can only be done with atomic ops, so that's another 
thing to profile against the baseline.



Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-09 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 05:52:00PM +0100, Vincent Guittot wrote:
> On 8 February 2018 at 16:44, Peter Zijlstra  wrote:
> > On Thu, Feb 08, 2018 at 04:05:58PM +0100, Vincent Guittot wrote:
> >> On 8 February 2018 at 15:00, Peter Zijlstra  wrote:
> >> > On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
> >> >
> >> >> @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
> >> >>   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
> >> >>   return;
> >> >>
> >> >> + rq->has_blocked_load = 1;
> >
> > Should we not set that with rq->lock held? We already clear it while
> > holding rq->lock.
> 
> I think it's safe because it is used to re-enable the periodic decay
> unconditionally.
> It is cleared with  rq->lock held to prevents any update of the cfs_rq
> *_avg  while deciding if we can clear has_blocked_load

Yes, and that clearing must then have observed the new addition. But
please put so in a comment.




Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-09 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 05:52:00PM +0100, Vincent Guittot wrote:
> On 8 February 2018 at 16:44, Peter Zijlstra  wrote:
> > On Thu, Feb 08, 2018 at 04:05:58PM +0100, Vincent Guittot wrote:
> >> On 8 February 2018 at 15:00, Peter Zijlstra  wrote:
> >> > On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
> >> >
> >> >> @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
> >> >>   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
> >> >>   return;
> >> >>
> >> >> + rq->has_blocked_load = 1;
> >
> > Should we not set that with rq->lock held? We already clear it while
> > holding rq->lock.
> 
> I think it's safe because it is used to re-enable the periodic decay
> unconditionally.
> It is cleared with  rq->lock held to prevents any update of the cfs_rq
> *_avg  while deciding if we can clear has_blocked_load

Yes, and that clearing must then have observed the new addition. But
please put so in a comment.




Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-09 Thread Vincent Guittot
On 8 February 2018 at 20:21, Valentin Schneider
 wrote:
> On 02/08/2018 01:36 PM, Vincent Guittot wrote:
>> On 8 February 2018 at 13:46, Valentin Schneider
>>  wrote:
>>> On 02/06/2018 07:23 PM, Vincent Guittot wrote:
 [...]

>
> For now I've been using those made-up rt-app workloads to stress specific
> bits of code, but I agree it would be really nice to have a "real" workload
> to see how both power (additional kick_ilb()'s) and performance (additional
> atomic ops) are affected. As Vincent suggested offline, it could be worth
> giving it a shot on Android...
>
>
> In the meantime I've done some more accurate profiling with cpumasks on my
> Juno r2 (patch at the bottom). As a sidenote, I realised I don't have a test
> case for the load update through load_balance() in idle_balance() - I only
> test the !overload segment. Will think about that.
>
> In summary:
>
> 20 iterations per test case
> All non-mentioned CPUs are idling
>
> -
> kick_ilb() test case:
> -
>
> a, b = 100% rt-app tasks
> - = idling
>
> Accumulating load before sleeping
> ^
> ^
> CPU1| a a a - - - a
> CPU2| - - b b b b b
>   v
>   v > Periodically kicking ILBs to update nohz blocked load
>
> Baseline:
> _nohz_idle_balance() takes 39µs in average
> nohz_balance_enter_idle() takes 233ns in average
>
> W/ cpumask:
> _nohz_idle_balance() takes 33µs in average
> nohz_balance_enter_idle() takes 283ns in average
>
> Diff:
> _nohz_idle_balance() -6µs in average (-16%)
> nohz_balance_enter_idle() +50ns in average (+21%)

In your use case, there is no contention when accessing the cpumask.
Have you tried a use case with tasks that wake ups and go back to idle
simultaneously on several/all cpus so they will fight to update the
atomic resources ?
That would be interesting to see the impact on the runtime of the
nohz_balance_enter_idle function

>
>
> ---
> Local _nohz_idle_balance() for NEWLY_IDLE test case:
> ---
>
> a = 100% rt-app task
> b = periodic rt-app task
> - = idling
>
> Accumulating load before sleeping
> ^
> ^
> CPU1| a a a - - - - - a
> CPU2| - - b - b - b - b
>v
>v > Periodically updating nohz blocked load through local
>_nohz_idle_balance() in idle_balance()
>
>
> (execution times are x2 as fast as previous test case because CPU2 is a
> big CPU, whereas CPU0 - the ILB 99% of the time in the previous test - is a
> LITTLE CPU ~ half as powerful).
>
> Baseline:
> _nohz_idle_balance() takes 20µs in average
> nohz_balance_enter_idle() takes 183ns in average
>
> W/ cpumask:
> _nohz_idle_balance() takes 13µs in average
> nohz_balance_enter_idle() takes 217ns in average
>
> Diff:
> _nohz_idle_balance() -7µs in average (-38%)
> nohz_balance_enter_idle() +34ns in average (+18%)
>
>
>
> For more details: 
> https://gist.github.com/valschneider/78099acee87a18057d56cc6cc03978b1
>
> ---
>  kernel/sched/fair.c | 82 
> +
>  1 file changed, 39 insertions(+), 43 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3abb3bc..4042025 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5404,8 +5404,8 @@ decay_load_missed(unsigned long load, unsigned long 
> missed_updates, int idx)
>
>  static struct {
> cpumask_var_t idle_cpus_mask;
> +   cpumask_var_t stale_cpus_mask; /* Idle CPUs with blocked load */
> atomic_t nr_cpus;
> -   int has_blocked;/* Idle CPUS has blocked load */
> unsigned long next_balance; /* in jiffy units */
> unsigned long next_blocked; /* Next update of blocked load in 
> jiffies */
>  } nohz cacheline_aligned;
> @@ -6968,7 +6968,6 @@ enum fbq_type { regular, remote, all };
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED0x08
>  #define LBF_NOHZ_STATS 0x10
> -#define LBF_NOHZ_AGAIN 0x20
>
>  struct lb_env {
> struct sched_domain *sd;
> @@ -7827,25 +7826,24 @@ group_type group_classify(struct sched_group *group,
> return group_other;
>  }
>
> -static bool update_nohz_stats(struct rq *rq)
> +static void update_nohz_stats(struct rq *rq)
>  {
>  #ifdef CONFIG_NO_HZ_COMMON
> unsigned int cpu = rq->cpu;
>
> -   if (!rq->has_blocked_load)
> -   return false;
>
> -   if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> -   return false;
> +   if (!cpumask_test_cpu(cpu, nohz.stale_cpus_mask))
> +   return;
>
> if (!time_after(jiffies, rq->last_blocked_load_update_tick))
> -   return true;
> +   return;
>
> update_blocked_averages(cpu);
>
> -   return rq->has_blocked_load;
> +   if 

Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-09 Thread Vincent Guittot
On 8 February 2018 at 20:21, Valentin Schneider
 wrote:
> On 02/08/2018 01:36 PM, Vincent Guittot wrote:
>> On 8 February 2018 at 13:46, Valentin Schneider
>>  wrote:
>>> On 02/06/2018 07:23 PM, Vincent Guittot wrote:
 [...]

>
> For now I've been using those made-up rt-app workloads to stress specific
> bits of code, but I agree it would be really nice to have a "real" workload
> to see how both power (additional kick_ilb()'s) and performance (additional
> atomic ops) are affected. As Vincent suggested offline, it could be worth
> giving it a shot on Android...
>
>
> In the meantime I've done some more accurate profiling with cpumasks on my
> Juno r2 (patch at the bottom). As a sidenote, I realised I don't have a test
> case for the load update through load_balance() in idle_balance() - I only
> test the !overload segment. Will think about that.
>
> In summary:
>
> 20 iterations per test case
> All non-mentioned CPUs are idling
>
> -
> kick_ilb() test case:
> -
>
> a, b = 100% rt-app tasks
> - = idling
>
> Accumulating load before sleeping
> ^
> ^
> CPU1| a a a - - - a
> CPU2| - - b b b b b
>   v
>   v > Periodically kicking ILBs to update nohz blocked load
>
> Baseline:
> _nohz_idle_balance() takes 39µs in average
> nohz_balance_enter_idle() takes 233ns in average
>
> W/ cpumask:
> _nohz_idle_balance() takes 33µs in average
> nohz_balance_enter_idle() takes 283ns in average
>
> Diff:
> _nohz_idle_balance() -6µs in average (-16%)
> nohz_balance_enter_idle() +50ns in average (+21%)

In your use case, there is no contention when accessing the cpumask.
Have you tried a use case with tasks that wake ups and go back to idle
simultaneously on several/all cpus so they will fight to update the
atomic resources ?
That would be interesting to see the impact on the runtime of the
nohz_balance_enter_idle function

>
>
> ---
> Local _nohz_idle_balance() for NEWLY_IDLE test case:
> ---
>
> a = 100% rt-app task
> b = periodic rt-app task
> - = idling
>
> Accumulating load before sleeping
> ^
> ^
> CPU1| a a a - - - - - a
> CPU2| - - b - b - b - b
>v
>v > Periodically updating nohz blocked load through local
>_nohz_idle_balance() in idle_balance()
>
>
> (execution times are x2 as fast as previous test case because CPU2 is a
> big CPU, whereas CPU0 - the ILB 99% of the time in the previous test - is a
> LITTLE CPU ~ half as powerful).
>
> Baseline:
> _nohz_idle_balance() takes 20µs in average
> nohz_balance_enter_idle() takes 183ns in average
>
> W/ cpumask:
> _nohz_idle_balance() takes 13µs in average
> nohz_balance_enter_idle() takes 217ns in average
>
> Diff:
> _nohz_idle_balance() -7µs in average (-38%)
> nohz_balance_enter_idle() +34ns in average (+18%)
>
>
>
> For more details: 
> https://gist.github.com/valschneider/78099acee87a18057d56cc6cc03978b1
>
> ---
>  kernel/sched/fair.c | 82 
> +
>  1 file changed, 39 insertions(+), 43 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3abb3bc..4042025 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5404,8 +5404,8 @@ decay_load_missed(unsigned long load, unsigned long 
> missed_updates, int idx)
>
>  static struct {
> cpumask_var_t idle_cpus_mask;
> +   cpumask_var_t stale_cpus_mask; /* Idle CPUs with blocked load */
> atomic_t nr_cpus;
> -   int has_blocked;/* Idle CPUS has blocked load */
> unsigned long next_balance; /* in jiffy units */
> unsigned long next_blocked; /* Next update of blocked load in 
> jiffies */
>  } nohz cacheline_aligned;
> @@ -6968,7 +6968,6 @@ enum fbq_type { regular, remote, all };
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED0x08
>  #define LBF_NOHZ_STATS 0x10
> -#define LBF_NOHZ_AGAIN 0x20
>
>  struct lb_env {
> struct sched_domain *sd;
> @@ -7827,25 +7826,24 @@ group_type group_classify(struct sched_group *group,
> return group_other;
>  }
>
> -static bool update_nohz_stats(struct rq *rq)
> +static void update_nohz_stats(struct rq *rq)
>  {
>  #ifdef CONFIG_NO_HZ_COMMON
> unsigned int cpu = rq->cpu;
>
> -   if (!rq->has_blocked_load)
> -   return false;
>
> -   if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> -   return false;
> +   if (!cpumask_test_cpu(cpu, nohz.stale_cpus_mask))
> +   return;
>
> if (!time_after(jiffies, rq->last_blocked_load_update_tick))
> -   return true;
> +   return;
>
> update_blocked_averages(cpu);
>
> -   return rq->has_blocked_load;
> +   if (!rq->has_blocked_load)
> +   cpumask_clear_cpu(cpu, 

Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-09 Thread Andrea Parri
On Thu, Feb 08, 2018 at 04:03:41PM +, Will Deacon wrote:
> On Thu, Feb 08, 2018 at 04:46:43PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 08, 2018 at 03:30:31PM +, Will Deacon wrote:
> > > On Thu, Feb 08, 2018 at 03:00:05PM +0100, Peter Zijlstra wrote:
> > 
> > > > Without this ordering I think it would be possible to loose has_blocked
> > > > and not observe the CPU either.
> > > 
> > > I had a quick look at this, and I think you're right. This looks very much
> > > like an 'R'-shaped test, which means it's smp_mb() all round otherwise 
> > > Power
> > > will go wrong. That also means the smp_mb__after_atomic() in
> > > nohz_balance_enter_idle *cannot* be an smp_wmb(), so you might want a
> > > comment stating that explicitly.
> > 
> > Thanks Will. BTW, where does that 'R' shape nomenclature come from?
> > This is the first I've heard of it.
> 
> I don't know where it originates from, but the imfamous "test6.pdf" has it:
> 
> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
> 
> half way down the first page on the left. It says "needs sync+sync" which

Indeed.  As a curiosity: I've never _observed_ R+lwsync+sync (the lwsync
separating the two writes), and other people who tried found the same

  http://moscova.inria.fr/~maranget/cats7/linux/hard.html#unseen
  http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/ppc051.html#toc8 .

It would be interesting to hear about different results ... ;-)

  Andrea


> is about as bad as it gets for Power (compare with "2+2w", which gets away
> with lwsync+lwsync). See also:
> 
> http://materials.dagstuhl.de/files/16/16471/16471.DerekWilliams.Slides.pdf
> 
> for a light-hearted, yet technically accurate story about the latter.
> 
> Will


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-09 Thread Andrea Parri
On Thu, Feb 08, 2018 at 04:03:41PM +, Will Deacon wrote:
> On Thu, Feb 08, 2018 at 04:46:43PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 08, 2018 at 03:30:31PM +, Will Deacon wrote:
> > > On Thu, Feb 08, 2018 at 03:00:05PM +0100, Peter Zijlstra wrote:
> > 
> > > > Without this ordering I think it would be possible to loose has_blocked
> > > > and not observe the CPU either.
> > > 
> > > I had a quick look at this, and I think you're right. This looks very much
> > > like an 'R'-shaped test, which means it's smp_mb() all round otherwise 
> > > Power
> > > will go wrong. That also means the smp_mb__after_atomic() in
> > > nohz_balance_enter_idle *cannot* be an smp_wmb(), so you might want a
> > > comment stating that explicitly.
> > 
> > Thanks Will. BTW, where does that 'R' shape nomenclature come from?
> > This is the first I've heard of it.
> 
> I don't know where it originates from, but the imfamous "test6.pdf" has it:
> 
> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
> 
> half way down the first page on the left. It says "needs sync+sync" which

Indeed.  As a curiosity: I've never _observed_ R+lwsync+sync (the lwsync
separating the two writes), and other people who tried found the same

  http://moscova.inria.fr/~maranget/cats7/linux/hard.html#unseen
  http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/ppc051.html#toc8 .

It would be interesting to hear about different results ... ;-)

  Andrea


> is about as bad as it gets for Power (compare with "2+2w", which gets away
> with lwsync+lwsync). See also:
> 
> http://materials.dagstuhl.de/files/16/16471/16471.DerekWilliams.Slides.pdf
> 
> for a light-hearted, yet technically accurate story about the latter.
> 
> Will


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Valentin Schneider
On 02/08/2018 01:36 PM, Vincent Guittot wrote:
> On 8 February 2018 at 13:46, Valentin Schneider
>  wrote:
>> On 02/06/2018 07:23 PM, Vincent Guittot wrote:
>>> [...]
>>> @@ -7826,8 +7842,8 @@ static inline void update_sg_lb_stats(struct lb_env 
>>> *env,
>>>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>>>   struct rq *rq = cpu_rq(i);
>>>
>>> - if (env->flags & LBF_NOHZ_STATS)
>>> - update_nohz_stats(rq);
>>> + if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq))
>>> + env->flags |= LBF_NOHZ_AGAIN;
>>>
>>>   /* Bias balancing toward cpus of our domain */
>>>   if (local_group)
>>> @@ -7979,18 +7995,15 @@ static inline void update_sd_lb_stats(struct lb_env 
>>> *env, struct sd_lb_stats *sd
>>>   struct sg_lb_stats *local = >local_stat;
>>>   struct sg_lb_stats tmp_sgs;
>>>   int load_idx, prefer_sibling = 0;
>>> + int has_blocked = READ_ONCE(nohz.has_blocked);
>>>   bool overload = false;
>>>
>>>   if (child && child->flags & SD_PREFER_SIBLING)
>>>   prefer_sibling = 1;
>>>
>>>  #ifdef CONFIG_NO_HZ_COMMON
>>> - if (env->idle == CPU_NEWLY_IDLE) {
>>> + if (env->idle == CPU_NEWLY_IDLE && has_blocked)
>>>   env->flags |= LBF_NOHZ_STATS;
>>> -
>>> - if (cpumask_subset(nohz.idle_cpus_mask, 
>>> sched_domain_span(env->sd)))
>>> - nohz.next_stats = jiffies + 
>>> msecs_to_jiffies(LOAD_AVG_PERIOD);
>>> - }
>>>  #endif
>>>
>>>   load_idx = get_sd_load_idx(env->sd, env->idle);
>>> @@ -8046,6 +8059,15 @@ static inline void update_sd_lb_stats(struct lb_env 
>>> *env, struct sd_lb_stats *sd
>>>   sg = sg->next;
>>>   } while (sg != env->sd->groups);
>>>
>>> +#ifdef CONFIG_NO_HZ_COMMON
>>> + if ((env->flags & LBF_NOHZ_AGAIN) &&
>>> + cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
>>> +
>>> + WRITE_ONCE(nohz.next_blocked,
>>> + jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
>>
>> Here we push the stats update forward if we visited all the nohz CPUs but 
>> they
>> still have blocked load. IMO we should also clear the nohz.has_blocked flag
>> if we visited all the nohz CPUs and none had blocked load left.
>>
>> If we don't do that, we could very well have cleared all of the nohz blocked
>> load in idle_balance and successfully pulled a task, but the flag isn't
>> cleared so we'll end up doing a _nohz_idle_balance() later on for nothing.
>>
>>
>> As I said in a previous comment, we also have this problem with periodic
>> load balance: if a CPU goes nohz (nohz.has_blocked is raised) but wakes up,
>> e.g. before the next nohz.next_blocked, we should stop kicking ILBs
>>
>> Now I'd need to test this, but I think it can actually get worse: if that
>> CPU keeps generating blocked load after this short idle period, no matter
>> how many _nohz_idle_balance() we go through we will never reach a point
>> where nohz.has_blocked gets cleared, and we'll keep kicking those ILBs to
>> update blocked load that already gets updated in the periodic balance.
>>
>> I think that's where a nohz blocked load cpumask can also help: on top of
>> skipping nohz CPUs that don't need an update, we can stop the whole remote
>> update machinery when the last nohz cpu with blocked load wakes up, or say
>> when it goes through its first periodic balance.
> 
> The main question is : Do we want to remove all useless kicks to ilb
> or useless calls to _nohz_idle_balance at the cost of adding more
> latency in the idle/wakeup path because of the additional atomic
> operations to keep track of which CPUs are idle, tickless with blocked
> load or do we accept to kick ilb or call _nohz_idle_balance uselessly
> from time to time for some use cases
> 
> I agree with you that there are some useless calls with the proposal
> which can be removed with additional checks, variables and cpumask
> manipulation. Which use case will benefits from these additional
> checks and does it worth ?
> 
> Vincent

For now I've been using those made-up rt-app workloads to stress specific
bits of code, but I agree it would be really nice to have a "real" workload
to see how both power (additional kick_ilb()'s) and performance (additional
atomic ops) are affected. As Vincent suggested offline, it could be worth
giving it a shot on Android...


In the meantime I've done some more accurate profiling with cpumasks on my
Juno r2 (patch at the bottom). As a sidenote, I realised I don't have a test
case for the load update through load_balance() in idle_balance() - I only
test the !overload segment. Will think about that.

In summary:

20 iterations per test case
All non-mentioned CPUs are idling

-
kick_ilb() test case:
-

a, b = 100% rt-app tasks
- = idling

Accumulating load before sleeping
^
^
CPU1| 

Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Valentin Schneider
On 02/08/2018 01:36 PM, Vincent Guittot wrote:
> On 8 February 2018 at 13:46, Valentin Schneider
>  wrote:
>> On 02/06/2018 07:23 PM, Vincent Guittot wrote:
>>> [...]
>>> @@ -7826,8 +7842,8 @@ static inline void update_sg_lb_stats(struct lb_env 
>>> *env,
>>>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>>>   struct rq *rq = cpu_rq(i);
>>>
>>> - if (env->flags & LBF_NOHZ_STATS)
>>> - update_nohz_stats(rq);
>>> + if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq))
>>> + env->flags |= LBF_NOHZ_AGAIN;
>>>
>>>   /* Bias balancing toward cpus of our domain */
>>>   if (local_group)
>>> @@ -7979,18 +7995,15 @@ static inline void update_sd_lb_stats(struct lb_env 
>>> *env, struct sd_lb_stats *sd
>>>   struct sg_lb_stats *local = >local_stat;
>>>   struct sg_lb_stats tmp_sgs;
>>>   int load_idx, prefer_sibling = 0;
>>> + int has_blocked = READ_ONCE(nohz.has_blocked);
>>>   bool overload = false;
>>>
>>>   if (child && child->flags & SD_PREFER_SIBLING)
>>>   prefer_sibling = 1;
>>>
>>>  #ifdef CONFIG_NO_HZ_COMMON
>>> - if (env->idle == CPU_NEWLY_IDLE) {
>>> + if (env->idle == CPU_NEWLY_IDLE && has_blocked)
>>>   env->flags |= LBF_NOHZ_STATS;
>>> -
>>> - if (cpumask_subset(nohz.idle_cpus_mask, 
>>> sched_domain_span(env->sd)))
>>> - nohz.next_stats = jiffies + 
>>> msecs_to_jiffies(LOAD_AVG_PERIOD);
>>> - }
>>>  #endif
>>>
>>>   load_idx = get_sd_load_idx(env->sd, env->idle);
>>> @@ -8046,6 +8059,15 @@ static inline void update_sd_lb_stats(struct lb_env 
>>> *env, struct sd_lb_stats *sd
>>>   sg = sg->next;
>>>   } while (sg != env->sd->groups);
>>>
>>> +#ifdef CONFIG_NO_HZ_COMMON
>>> + if ((env->flags & LBF_NOHZ_AGAIN) &&
>>> + cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
>>> +
>>> + WRITE_ONCE(nohz.next_blocked,
>>> + jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
>>
>> Here we push the stats update forward if we visited all the nohz CPUs but 
>> they
>> still have blocked load. IMO we should also clear the nohz.has_blocked flag
>> if we visited all the nohz CPUs and none had blocked load left.
>>
>> If we don't do that, we could very well have cleared all of the nohz blocked
>> load in idle_balance and successfully pulled a task, but the flag isn't
>> cleared so we'll end up doing a _nohz_idle_balance() later on for nothing.
>>
>>
>> As I said in a previous comment, we also have this problem with periodic
>> load balance: if a CPU goes nohz (nohz.has_blocked is raised) but wakes up,
>> e.g. before the next nohz.next_blocked, we should stop kicking ILBs
>>
>> Now I'd need to test this, but I think it can actually get worse: if that
>> CPU keeps generating blocked load after this short idle period, no matter
>> how many _nohz_idle_balance() we go through we will never reach a point
>> where nohz.has_blocked gets cleared, and we'll keep kicking those ILBs to
>> update blocked load that already gets updated in the periodic balance.
>>
>> I think that's where a nohz blocked load cpumask can also help: on top of
>> skipping nohz CPUs that don't need an update, we can stop the whole remote
>> update machinery when the last nohz cpu with blocked load wakes up, or say
>> when it goes through its first periodic balance.
> 
> The main question is : Do we want to remove all useless kicks to ilb
> or useless calls to _nohz_idle_balance at the cost of adding more
> latency in the idle/wakeup path because of the additional atomic
> operations to keep track of which CPUs are idle, tickless with blocked
> load or do we accept to kick ilb or call _nohz_idle_balance uselessly
> from time to time for some use cases
> 
> I agree with you that there are some useless calls with the proposal
> which can be removed with additional checks, variables and cpumask
> manipulation. Which use case will benefits from these additional
> checks and does it worth ?
> 
> Vincent

For now I've been using those made-up rt-app workloads to stress specific
bits of code, but I agree it would be really nice to have a "real" workload
to see how both power (additional kick_ilb()'s) and performance (additional
atomic ops) are affected. As Vincent suggested offline, it could be worth
giving it a shot on Android...


In the meantime I've done some more accurate profiling with cpumasks on my
Juno r2 (patch at the bottom). As a sidenote, I realised I don't have a test
case for the load update through load_balance() in idle_balance() - I only
test the !overload segment. Will think about that.

In summary:

20 iterations per test case
All non-mentioned CPUs are idling

-
kick_ilb() test case:
-

a, b = 100% rt-app tasks
- = idling

Accumulating load before sleeping
^
^
CPU1| a a a - - - a
CPU2| - - b b 

Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Vincent Guittot
On 8 February 2018 at 16:44, Peter Zijlstra  wrote:
> On Thu, Feb 08, 2018 at 04:05:58PM +0100, Vincent Guittot wrote:
>> On 8 February 2018 at 15:00, Peter Zijlstra  wrote:
>> > On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
>> >
>> >> @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
>> >>   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>> >>   return;
>> >>
>> >> + rq->has_blocked_load = 1;
>
> Should we not set that with rq->lock held? We already clear it while
> holding rq->lock.

I think it's safe because it is used to re-enable the periodic decay
unconditionally.
It is cleared with  rq->lock held to prevents any update of the cfs_rq
*_avg  while deciding if we can clear has_blocked_load

>
>> >> +
>> >>   if (rq->nohz_tick_stopped)
>> >> - return;
>> >
>> > this case is difficult... needs thinking
>>
>> The use case happens when a CPU wakes up and goes back to idle before
>> the tick fires and clears nohz_tick_stopped.
>
> Yes, and so we could have accrued blocked load. Now in this case the CPU
> must already be set in the cpumask, but we could've already cleared
> has_blocked.
>
> My question is mostly about needing that "goto out" to set the flag,
> because I think we can loose it on a store collision vs clearing it. But
> in that case I suppose the iteration must already be in progress, which
> in turn will observe rq->has_blocked_load and re-set nohz.has_blocked.
>
> So yes, this is good, but could use a comment.
>
>> > Without this ordering I think it would be possible to loose has_blocked
>> > and not observe the CPU either.
>>
>> I think that you are right.
>> I also wondered if some barriers were necessary but wrongly concluded
>> that set operation on nohz.idle_cpus_mask and WRITE_ONCE with volatile
>> would be enough to ensure the right ordering
>
> Yeah, so I forgot to write the comment in my patch, but it had the
> barriers implied by cmpxchg.


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Vincent Guittot
On 8 February 2018 at 16:44, Peter Zijlstra  wrote:
> On Thu, Feb 08, 2018 at 04:05:58PM +0100, Vincent Guittot wrote:
>> On 8 February 2018 at 15:00, Peter Zijlstra  wrote:
>> > On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
>> >
>> >> @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
>> >>   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>> >>   return;
>> >>
>> >> + rq->has_blocked_load = 1;
>
> Should we not set that with rq->lock held? We already clear it while
> holding rq->lock.

I think it's safe because it is used to re-enable the periodic decay
unconditionally.
It is cleared with  rq->lock held to prevents any update of the cfs_rq
*_avg  while deciding if we can clear has_blocked_load

>
>> >> +
>> >>   if (rq->nohz_tick_stopped)
>> >> - return;
>> >
>> > this case is difficult... needs thinking
>>
>> The use case happens when a CPU wakes up and goes back to idle before
>> the tick fires and clears nohz_tick_stopped.
>
> Yes, and so we could have accrued blocked load. Now in this case the CPU
> must already be set in the cpumask, but we could've already cleared
> has_blocked.
>
> My question is mostly about needing that "goto out" to set the flag,
> because I think we can loose it on a store collision vs clearing it. But
> in that case I suppose the iteration must already be in progress, which
> in turn will observe rq->has_blocked_load and re-set nohz.has_blocked.
>
> So yes, this is good, but could use a comment.
>
>> > Without this ordering I think it would be possible to loose has_blocked
>> > and not observe the CPU either.
>>
>> I think that you are right.
>> I also wondered if some barriers were necessary but wrongly concluded
>> that set operation on nohz.idle_cpus_mask and WRITE_ONCE with volatile
>> would be enough to ensure the right ordering
>
> Yeah, so I forgot to write the comment in my patch, but it had the
> barriers implied by cmpxchg.


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Will Deacon
On Thu, Feb 08, 2018 at 04:46:43PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 08, 2018 at 03:30:31PM +, Will Deacon wrote:
> > On Thu, Feb 08, 2018 at 03:00:05PM +0100, Peter Zijlstra wrote:
> 
> > > Without this ordering I think it would be possible to loose has_blocked
> > > and not observe the CPU either.
> > 
> > I had a quick look at this, and I think you're right. This looks very much
> > like an 'R'-shaped test, which means it's smp_mb() all round otherwise Power
> > will go wrong. That also means the smp_mb__after_atomic() in
> > nohz_balance_enter_idle *cannot* be an smp_wmb(), so you might want a
> > comment stating that explicitly.
> 
> Thanks Will. BTW, where does that 'R' shape nomenclature come from?
> This is the first I've heard of it.

I don't know where it originates from, but the imfamous "test6.pdf" has it:

https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf

half way down the first page on the left. It says "needs sync+sync" which
is about as bad as it gets for Power (compare with "2+2w", which gets away
with lwsync+lwsync). See also:

http://materials.dagstuhl.de/files/16/16471/16471.DerekWilliams.Slides.pdf

for a light-hearted, yet technically accurate story about the latter.

Will


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Will Deacon
On Thu, Feb 08, 2018 at 04:46:43PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 08, 2018 at 03:30:31PM +, Will Deacon wrote:
> > On Thu, Feb 08, 2018 at 03:00:05PM +0100, Peter Zijlstra wrote:
> 
> > > Without this ordering I think it would be possible to loose has_blocked
> > > and not observe the CPU either.
> > 
> > I had a quick look at this, and I think you're right. This looks very much
> > like an 'R'-shaped test, which means it's smp_mb() all round otherwise Power
> > will go wrong. That also means the smp_mb__after_atomic() in
> > nohz_balance_enter_idle *cannot* be an smp_wmb(), so you might want a
> > comment stating that explicitly.
> 
> Thanks Will. BTW, where does that 'R' shape nomenclature come from?
> This is the first I've heard of it.

I don't know where it originates from, but the imfamous "test6.pdf" has it:

https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf

half way down the first page on the left. It says "needs sync+sync" which
is about as bad as it gets for Power (compare with "2+2w", which gets away
with lwsync+lwsync). See also:

http://materials.dagstuhl.de/files/16/16471/16471.DerekWilliams.Slides.pdf

for a light-hearted, yet technically accurate story about the latter.

Will


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 03:30:31PM +, Will Deacon wrote:
> On Thu, Feb 08, 2018 at 03:00:05PM +0100, Peter Zijlstra wrote:

> > Without this ordering I think it would be possible to loose has_blocked
> > and not observe the CPU either.
> 
> I had a quick look at this, and I think you're right. This looks very much
> like an 'R'-shaped test, which means it's smp_mb() all round otherwise Power
> will go wrong. That also means the smp_mb__after_atomic() in
> nohz_balance_enter_idle *cannot* be an smp_wmb(), so you might want a
> comment stating that explicitly.

Thanks Will. BTW, where does that 'R' shape nomenclature come from?
This is the first I've heard of it.


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 03:30:31PM +, Will Deacon wrote:
> On Thu, Feb 08, 2018 at 03:00:05PM +0100, Peter Zijlstra wrote:

> > Without this ordering I think it would be possible to loose has_blocked
> > and not observe the CPU either.
> 
> I had a quick look at this, and I think you're right. This looks very much
> like an 'R'-shaped test, which means it's smp_mb() all round otherwise Power
> will go wrong. That also means the smp_mb__after_atomic() in
> nohz_balance_enter_idle *cannot* be an smp_wmb(), so you might want a
> comment stating that explicitly.

Thanks Will. BTW, where does that 'R' shape nomenclature come from?
This is the first I've heard of it.


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 04:05:58PM +0100, Vincent Guittot wrote:
> On 8 February 2018 at 15:00, Peter Zijlstra  wrote:
> > On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
> >
> >> @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
> >>   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
> >>   return;
> >>
> >> + rq->has_blocked_load = 1;

Should we not set that with rq->lock held? We already clear it while
holding rq->lock.

> >> +
> >>   if (rq->nohz_tick_stopped)
> >> - return;
> >
> > this case is difficult... needs thinking
> 
> The use case happens when a CPU wakes up and goes back to idle before
> the tick fires and clears nohz_tick_stopped.

Yes, and so we could have accrued blocked load. Now in this case the CPU
must already be set in the cpumask, but we could've already cleared
has_blocked.

My question is mostly about needing that "goto out" to set the flag,
because I think we can loose it on a store collision vs clearing it. But
in that case I suppose the iteration must already be in progress, which
in turn will observe rq->has_blocked_load and re-set nohz.has_blocked.

So yes, this is good, but could use a comment.

> > Without this ordering I think it would be possible to loose has_blocked
> > and not observe the CPU either.
> 
> I think that you are right.
> I also wondered if some barriers were necessary but wrongly concluded
> that set operation on nohz.idle_cpus_mask and WRITE_ONCE with volatile
> would be enough to ensure the right ordering

Yeah, so I forgot to write the comment in my patch, but it had the
barriers implied by cmpxchg.


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 04:05:58PM +0100, Vincent Guittot wrote:
> On 8 February 2018 at 15:00, Peter Zijlstra  wrote:
> > On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
> >
> >> @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
> >>   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
> >>   return;
> >>
> >> + rq->has_blocked_load = 1;

Should we not set that with rq->lock held? We already clear it while
holding rq->lock.

> >> +
> >>   if (rq->nohz_tick_stopped)
> >> - return;
> >
> > this case is difficult... needs thinking
> 
> The use case happens when a CPU wakes up and goes back to idle before
> the tick fires and clears nohz_tick_stopped.

Yes, and so we could have accrued blocked load. Now in this case the CPU
must already be set in the cpumask, but we could've already cleared
has_blocked.

My question is mostly about needing that "goto out" to set the flag,
because I think we can loose it on a store collision vs clearing it. But
in that case I suppose the iteration must already be in progress, which
in turn will observe rq->has_blocked_load and re-set nohz.has_blocked.

So yes, this is good, but could use a comment.

> > Without this ordering I think it would be possible to loose has_blocked
> > and not observe the CPU either.
> 
> I think that you are right.
> I also wondered if some barriers were necessary but wrongly concluded
> that set operation on nohz.idle_cpus_mask and WRITE_ONCE with volatile
> would be enough to ensure the right ordering

Yeah, so I forgot to write the comment in my patch, but it had the
barriers implied by cmpxchg.


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Will Deacon
On Thu, Feb 08, 2018 at 03:00:05PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
> > @@ -9222,6 +9248,13 @@ void nohz_balance_enter_idle(int cpu)
> > atomic_inc(_cpus);
> >  
> > set_cpu_sd_state_idle(cpu);
> 
>   /*
>* Ensures that if nohz_idle_balance() fails to observe our
>* @idle_cpus_mask store, it must observe the @has_blocked
>* store.
>*/
>   smp_mb__after_atomic();
> 
> > +
> > +out:
> > +   /*
> > +* Each time a cpu enter idle, we assume that it has blocked load and
> > +* enable the periodic update of the load of idle cpus
> > +*/
> > +   WRITE_ONCE(nohz.has_blocked, 1);
> >  }
> 
> 
> 
> > @@ -9374,6 +9407,16 @@ static bool nohz_idle_balance(struct rq *this_rq, 
> > enum cpu_idle_type idle)
> >  
> > SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
> >  
> > +   /*
> > +* We assume there will be no idle load after this update and clear
> > +* the has_blocked flag. If a cpu enters idle in the mean time, it will
> > +* set the has_blocked flag and trig another update of idle load.
> > +* Because a cpu that becomes idle, is added to idle_cpus_mask before
> > +* setting the flag, we are sure to not clear the state and not
> > +* check the load of an idle cpu.
> > +*/
> > +   WRITE_ONCE(nohz.has_blocked, 0);
> 
>   /*
>* Ensures that if we miss the CPU, we must see the has_blocked
>* store from nohz_balance_enter_idle().
>*/
>   smp_mb();
> 
> > for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> > if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> > continue;
> 
> 
> I _think_, but my brain isn't quite willing to turn on today.
> 
> Without this ordering I think it would be possible to loose has_blocked
> and not observe the CPU either.

I had a quick look at this, and I think you're right. This looks very much
like an 'R'-shaped test, which means it's smp_mb() all round otherwise Power
will go wrong. That also means the smp_mb__after_atomic() in
nohz_balance_enter_idle *cannot* be an smp_wmb(), so you might want a
comment stating that explicitly.

On arm64, release/acquire would work, but that's basically not the case for
anybody else including x86, so let's not go there.

Will


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Will Deacon
On Thu, Feb 08, 2018 at 03:00:05PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
> > @@ -9222,6 +9248,13 @@ void nohz_balance_enter_idle(int cpu)
> > atomic_inc(_cpus);
> >  
> > set_cpu_sd_state_idle(cpu);
> 
>   /*
>* Ensures that if nohz_idle_balance() fails to observe our
>* @idle_cpus_mask store, it must observe the @has_blocked
>* store.
>*/
>   smp_mb__after_atomic();
> 
> > +
> > +out:
> > +   /*
> > +* Each time a cpu enter idle, we assume that it has blocked load and
> > +* enable the periodic update of the load of idle cpus
> > +*/
> > +   WRITE_ONCE(nohz.has_blocked, 1);
> >  }
> 
> 
> 
> > @@ -9374,6 +9407,16 @@ static bool nohz_idle_balance(struct rq *this_rq, 
> > enum cpu_idle_type idle)
> >  
> > SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
> >  
> > +   /*
> > +* We assume there will be no idle load after this update and clear
> > +* the has_blocked flag. If a cpu enters idle in the mean time, it will
> > +* set the has_blocked flag and trig another update of idle load.
> > +* Because a cpu that becomes idle, is added to idle_cpus_mask before
> > +* setting the flag, we are sure to not clear the state and not
> > +* check the load of an idle cpu.
> > +*/
> > +   WRITE_ONCE(nohz.has_blocked, 0);
> 
>   /*
>* Ensures that if we miss the CPU, we must see the has_blocked
>* store from nohz_balance_enter_idle().
>*/
>   smp_mb();
> 
> > for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> > if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> > continue;
> 
> 
> I _think_, but my brain isn't quite willing to turn on today.
> 
> Without this ordering I think it would be possible to loose has_blocked
> and not observe the CPU either.

I had a quick look at this, and I think you're right. This looks very much
like an 'R'-shaped test, which means it's smp_mb() all round otherwise Power
will go wrong. That also means the smp_mb__after_atomic() in
nohz_balance_enter_idle *cannot* be an smp_wmb(), so you might want a
comment stating that explicitly.

On arm64, release/acquire would work, but that's basically not the case for
anybody else including x86, so let's not go there.

Will


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Vincent Guittot
On 8 February 2018 at 15:00, Peter Zijlstra  wrote:
> On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
>
>> @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
>>   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>>   return;
>>
>> + rq->has_blocked_load = 1;
>> +
>>   if (rq->nohz_tick_stopped)
>> - return;
>
> this case is difficult... needs thinking

The use case happens when a CPU wakes up and goes back to idle before
the tick fires and clears nohz_tick_stopped.

>
>> + goto out;
>>
>>   /*
>>* If we're a completely isolated CPU, we don't play.
>>*/
>> - if (on_null_domain(cpu_rq(cpu)))
>> + if (on_null_domain(rq))
>>   return;
>>
>>   rq->nohz_tick_stopped = 1;
>> @@ -9222,6 +9248,13 @@ void nohz_balance_enter_idle(int cpu)
>>   atomic_inc(_cpus);
>>
>>   set_cpu_sd_state_idle(cpu);
>
> /*
>  * Ensures that if nohz_idle_balance() fails to observe our
>  * @idle_cpus_mask store, it must observe the @has_blocked
>  * store.
>  */
> smp_mb__after_atomic();
>
>> +
>> +out:
>> + /*
>> +  * Each time a cpu enter idle, we assume that it has blocked load and
>> +  * enable the periodic update of the load of idle cpus
>> +  */
>> + WRITE_ONCE(nohz.has_blocked, 1);
>>  }
>
>
>
>> @@ -9374,6 +9407,16 @@ static bool nohz_idle_balance(struct rq *this_rq, 
>> enum cpu_idle_type idle)
>>
>>   SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>>
>> + /*
>> +  * We assume there will be no idle load after this update and clear
>> +  * the has_blocked flag. If a cpu enters idle in the mean time, it will
>> +  * set the has_blocked flag and trig another update of idle load.
>> +  * Because a cpu that becomes idle, is added to idle_cpus_mask before
>> +  * setting the flag, we are sure to not clear the state and not
>> +  * check the load of an idle cpu.
>> +  */
>> + WRITE_ONCE(nohz.has_blocked, 0);
>
> /*
>  * Ensures that if we miss the CPU, we must see the has_blocked
>  * store from nohz_balance_enter_idle().
>  */
> smp_mb();
>
>>   for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>>   if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>>   continue;
>
>
> I _think_, but my brain isn't quite willing to turn on today.
>
> Without this ordering I think it would be possible to loose has_blocked
> and not observe the CPU either.

I think that you are right.
I also wondered if some barriers were necessary but wrongly concluded
that set operation on nohz.idle_cpus_mask and WRITE_ONCE with volatile
would be enough to ensure the right ordering

>
>


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Vincent Guittot
On 8 February 2018 at 15:00, Peter Zijlstra  wrote:
> On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
>
>> @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
>>   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>>   return;
>>
>> + rq->has_blocked_load = 1;
>> +
>>   if (rq->nohz_tick_stopped)
>> - return;
>
> this case is difficult... needs thinking

The use case happens when a CPU wakes up and goes back to idle before
the tick fires and clears nohz_tick_stopped.

>
>> + goto out;
>>
>>   /*
>>* If we're a completely isolated CPU, we don't play.
>>*/
>> - if (on_null_domain(cpu_rq(cpu)))
>> + if (on_null_domain(rq))
>>   return;
>>
>>   rq->nohz_tick_stopped = 1;
>> @@ -9222,6 +9248,13 @@ void nohz_balance_enter_idle(int cpu)
>>   atomic_inc(_cpus);
>>
>>   set_cpu_sd_state_idle(cpu);
>
> /*
>  * Ensures that if nohz_idle_balance() fails to observe our
>  * @idle_cpus_mask store, it must observe the @has_blocked
>  * store.
>  */
> smp_mb__after_atomic();
>
>> +
>> +out:
>> + /*
>> +  * Each time a cpu enter idle, we assume that it has blocked load and
>> +  * enable the periodic update of the load of idle cpus
>> +  */
>> + WRITE_ONCE(nohz.has_blocked, 1);
>>  }
>
>
>
>> @@ -9374,6 +9407,16 @@ static bool nohz_idle_balance(struct rq *this_rq, 
>> enum cpu_idle_type idle)
>>
>>   SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>>
>> + /*
>> +  * We assume there will be no idle load after this update and clear
>> +  * the has_blocked flag. If a cpu enters idle in the mean time, it will
>> +  * set the has_blocked flag and trig another update of idle load.
>> +  * Because a cpu that becomes idle, is added to idle_cpus_mask before
>> +  * setting the flag, we are sure to not clear the state and not
>> +  * check the load of an idle cpu.
>> +  */
>> + WRITE_ONCE(nohz.has_blocked, 0);
>
> /*
>  * Ensures that if we miss the CPU, we must see the has_blocked
>  * store from nohz_balance_enter_idle().
>  */
> smp_mb();
>
>>   for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>>   if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>>   continue;
>
>
> I _think_, but my brain isn't quite willing to turn on today.
>
> Without this ordering I think it would be possible to loose has_blocked
> and not observe the CPU either.

I think that you are right.
I also wondered if some barriers were necessary but wrongly concluded
that set operation on nohz.idle_cpus_mask and WRITE_ONCE with volatile
would be enough to ensure the right ordering

>
>


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 03:00:05PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
> 
> > @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
> > if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
> > return;
> >  
> > +   rq->has_blocked_load = 1;
> > +
> > if (rq->nohz_tick_stopped)
> > -   return;
> 
> this case is difficult... needs thinking
> 
> > +   goto out;
> >  
> > /*
> >  * If we're a completely isolated CPU, we don't play.
> >  */
> > -   if (on_null_domain(cpu_rq(cpu)))
> > +   if (on_null_domain(rq))
> > return;
> >  
> > rq->nohz_tick_stopped = 1;
> > @@ -9222,6 +9248,13 @@ void nohz_balance_enter_idle(int cpu)
> > atomic_inc(_cpus);
> >  
> > set_cpu_sd_state_idle(cpu);
> 
>   /*
>* Ensures that if nohz_idle_balance() fails to observe our
>* @idle_cpus_mask store, it must observe the @has_blocked
>* store.
>*/
>   smp_mb__after_atomic();
> 
> > +
> > +out:
> > +   /*
> > +* Each time a cpu enter idle, we assume that it has blocked load and
> > +* enable the periodic update of the load of idle cpus
> > +*/
> > +   WRITE_ONCE(nohz.has_blocked, 1);
> >  }
> 
> 
> 
> > @@ -9374,6 +9407,16 @@ static bool nohz_idle_balance(struct rq *this_rq, 
> > enum cpu_idle_type idle)
> >  
> > SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
> >  
> > +   /*
> > +* We assume there will be no idle load after this update and clear
> > +* the has_blocked flag. If a cpu enters idle in the mean time, it will
> > +* set the has_blocked flag and trig another update of idle load.
> > +* Because a cpu that becomes idle, is added to idle_cpus_mask before
> > +* setting the flag, we are sure to not clear the state and not
> > +* check the load of an idle cpu.
> > +*/
> > +   WRITE_ONCE(nohz.has_blocked, 0);
> 
>   /*
>* Ensures that if we miss the CPU, we must see the has_blocked
>* store from nohz_balance_enter_idle().
>*/
>   smp_mb();
> 
> > for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> > if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> > continue;
> 
> 
> I _think_, but my brain isn't quite willing to turn on today.
> 
> Without this ordering I think it would be possible to loose has_blocked
> and not observe the CPU either.
> 
> 

might be the exact case from litmus-tests/R+mbonceonces.litmus but I
can't seem to get herd7 working today.


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 03:00:05PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:
> 
> > @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
> > if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
> > return;
> >  
> > +   rq->has_blocked_load = 1;
> > +
> > if (rq->nohz_tick_stopped)
> > -   return;
> 
> this case is difficult... needs thinking
> 
> > +   goto out;
> >  
> > /*
> >  * If we're a completely isolated CPU, we don't play.
> >  */
> > -   if (on_null_domain(cpu_rq(cpu)))
> > +   if (on_null_domain(rq))
> > return;
> >  
> > rq->nohz_tick_stopped = 1;
> > @@ -9222,6 +9248,13 @@ void nohz_balance_enter_idle(int cpu)
> > atomic_inc(_cpus);
> >  
> > set_cpu_sd_state_idle(cpu);
> 
>   /*
>* Ensures that if nohz_idle_balance() fails to observe our
>* @idle_cpus_mask store, it must observe the @has_blocked
>* store.
>*/
>   smp_mb__after_atomic();
> 
> > +
> > +out:
> > +   /*
> > +* Each time a cpu enter idle, we assume that it has blocked load and
> > +* enable the periodic update of the load of idle cpus
> > +*/
> > +   WRITE_ONCE(nohz.has_blocked, 1);
> >  }
> 
> 
> 
> > @@ -9374,6 +9407,16 @@ static bool nohz_idle_balance(struct rq *this_rq, 
> > enum cpu_idle_type idle)
> >  
> > SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
> >  
> > +   /*
> > +* We assume there will be no idle load after this update and clear
> > +* the has_blocked flag. If a cpu enters idle in the mean time, it will
> > +* set the has_blocked flag and trig another update of idle load.
> > +* Because a cpu that becomes idle, is added to idle_cpus_mask before
> > +* setting the flag, we are sure to not clear the state and not
> > +* check the load of an idle cpu.
> > +*/
> > +   WRITE_ONCE(nohz.has_blocked, 0);
> 
>   /*
>* Ensures that if we miss the CPU, we must see the has_blocked
>* store from nohz_balance_enter_idle().
>*/
>   smp_mb();
> 
> > for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> > if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> > continue;
> 
> 
> I _think_, but my brain isn't quite willing to turn on today.
> 
> Without this ordering I think it would be possible to loose has_blocked
> and not observe the CPU either.
> 
> 

might be the exact case from litmus-tests/R+mbonceonces.litmus but I
can't seem to get herd7 working today.


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Peter Zijlstra
On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:

> @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
>   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>   return;
>  
> + rq->has_blocked_load = 1;
> +
>   if (rq->nohz_tick_stopped)
> - return;

this case is difficult... needs thinking

> + goto out;
>  
>   /*
>* If we're a completely isolated CPU, we don't play.
>*/
> - if (on_null_domain(cpu_rq(cpu)))
> + if (on_null_domain(rq))
>   return;
>  
>   rq->nohz_tick_stopped = 1;
> @@ -9222,6 +9248,13 @@ void nohz_balance_enter_idle(int cpu)
>   atomic_inc(_cpus);
>  
>   set_cpu_sd_state_idle(cpu);

/*
 * Ensures that if nohz_idle_balance() fails to observe our
 * @idle_cpus_mask store, it must observe the @has_blocked
 * store.
 */
smp_mb__after_atomic();

> +
> +out:
> + /*
> +  * Each time a cpu enter idle, we assume that it has blocked load and
> +  * enable the periodic update of the load of idle cpus
> +  */
> + WRITE_ONCE(nohz.has_blocked, 1);
>  }



> @@ -9374,6 +9407,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum 
> cpu_idle_type idle)
>  
>   SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>  
> + /*
> +  * We assume there will be no idle load after this update and clear
> +  * the has_blocked flag. If a cpu enters idle in the mean time, it will
> +  * set the has_blocked flag and trig another update of idle load.
> +  * Because a cpu that becomes idle, is added to idle_cpus_mask before
> +  * setting the flag, we are sure to not clear the state and not
> +  * check the load of an idle cpu.
> +  */
> + WRITE_ONCE(nohz.has_blocked, 0);

/*
 * Ensures that if we miss the CPU, we must see the has_blocked
 * store from nohz_balance_enter_idle().
 */
smp_mb();

>   for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>   if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>   continue;


I _think_, but my brain isn't quite willing to turn on today.

Without this ordering I think it would be possible to loose has_blocked
and not observe the CPU either.




Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Peter Zijlstra
On Tue, Feb 06, 2018 at 08:23:05PM +0100, Vincent Guittot wrote:

> @@ -9207,13 +9231,15 @@ void nohz_balance_enter_idle(int cpu)
>   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>   return;
>  
> + rq->has_blocked_load = 1;
> +
>   if (rq->nohz_tick_stopped)
> - return;

this case is difficult... needs thinking

> + goto out;
>  
>   /*
>* If we're a completely isolated CPU, we don't play.
>*/
> - if (on_null_domain(cpu_rq(cpu)))
> + if (on_null_domain(rq))
>   return;
>  
>   rq->nohz_tick_stopped = 1;
> @@ -9222,6 +9248,13 @@ void nohz_balance_enter_idle(int cpu)
>   atomic_inc(_cpus);
>  
>   set_cpu_sd_state_idle(cpu);

/*
 * Ensures that if nohz_idle_balance() fails to observe our
 * @idle_cpus_mask store, it must observe the @has_blocked
 * store.
 */
smp_mb__after_atomic();

> +
> +out:
> + /*
> +  * Each time a cpu enter idle, we assume that it has blocked load and
> +  * enable the periodic update of the load of idle cpus
> +  */
> + WRITE_ONCE(nohz.has_blocked, 1);
>  }



> @@ -9374,6 +9407,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum 
> cpu_idle_type idle)
>  
>   SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>  
> + /*
> +  * We assume there will be no idle load after this update and clear
> +  * the has_blocked flag. If a cpu enters idle in the mean time, it will
> +  * set the has_blocked flag and trig another update of idle load.
> +  * Because a cpu that becomes idle, is added to idle_cpus_mask before
> +  * setting the flag, we are sure to not clear the state and not
> +  * check the load of an idle cpu.
> +  */
> + WRITE_ONCE(nohz.has_blocked, 0);

/*
 * Ensures that if we miss the CPU, we must see the has_blocked
 * store from nohz_balance_enter_idle().
 */
smp_mb();

>   for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>   if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>   continue;


I _think_, but my brain isn't quite willing to turn on today.

Without this ordering I think it would be possible to loose has_blocked
and not observe the CPU either.




Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Vincent Guittot
On 8 February 2018 at 13:46, Valentin Schneider
 wrote:
> On 02/06/2018 07:23 PM, Vincent Guittot wrote:
>> [...]
>> @@ -7826,8 +7842,8 @@ static inline void update_sg_lb_stats(struct lb_env 
>> *env,
>>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>>   struct rq *rq = cpu_rq(i);
>>
>> - if (env->flags & LBF_NOHZ_STATS)
>> - update_nohz_stats(rq);
>> + if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq))
>> + env->flags |= LBF_NOHZ_AGAIN;
>>
>>   /* Bias balancing toward cpus of our domain */
>>   if (local_group)
>> @@ -7979,18 +7995,15 @@ static inline void update_sd_lb_stats(struct lb_env 
>> *env, struct sd_lb_stats *sd
>>   struct sg_lb_stats *local = >local_stat;
>>   struct sg_lb_stats tmp_sgs;
>>   int load_idx, prefer_sibling = 0;
>> + int has_blocked = READ_ONCE(nohz.has_blocked);
>>   bool overload = false;
>>
>>   if (child && child->flags & SD_PREFER_SIBLING)
>>   prefer_sibling = 1;
>>
>>  #ifdef CONFIG_NO_HZ_COMMON
>> - if (env->idle == CPU_NEWLY_IDLE) {
>> + if (env->idle == CPU_NEWLY_IDLE && has_blocked)
>>   env->flags |= LBF_NOHZ_STATS;
>> -
>> - if (cpumask_subset(nohz.idle_cpus_mask, 
>> sched_domain_span(env->sd)))
>> - nohz.next_stats = jiffies + 
>> msecs_to_jiffies(LOAD_AVG_PERIOD);
>> - }
>>  #endif
>>
>>   load_idx = get_sd_load_idx(env->sd, env->idle);
>> @@ -8046,6 +8059,15 @@ static inline void update_sd_lb_stats(struct lb_env 
>> *env, struct sd_lb_stats *sd
>>   sg = sg->next;
>>   } while (sg != env->sd->groups);
>>
>> +#ifdef CONFIG_NO_HZ_COMMON
>> + if ((env->flags & LBF_NOHZ_AGAIN) &&
>> + cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
>> +
>> + WRITE_ONCE(nohz.next_blocked,
>> + jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
>
> Here we push the stats update forward if we visited all the nohz CPUs but they
> still have blocked load. IMO we should also clear the nohz.has_blocked flag
> if we visited all the nohz CPUs and none had blocked load left.
>
> If we don't do that, we could very well have cleared all of the nohz blocked
> load in idle_balance and successfully pulled a task, but the flag isn't
> cleared so we'll end up doing a _nohz_idle_balance() later on for nothing.
>
>
> As I said in a previous comment, we also have this problem with periodic
> load balance: if a CPU goes nohz (nohz.has_blocked is raised) but wakes up,
> e.g. before the next nohz.next_blocked, we should stop kicking ILBs
>
> Now I'd need to test this, but I think it can actually get worse: if that
> CPU keeps generating blocked load after this short idle period, no matter
> how many _nohz_idle_balance() we go through we will never reach a point
> where nohz.has_blocked gets cleared, and we'll keep kicking those ILBs to
> update blocked load that already gets updated in the periodic balance.
>
> I think that's where a nohz blocked load cpumask can also help: on top of
> skipping nohz CPUs that don't need an update, we can stop the whole remote
> update machinery when the last nohz cpu with blocked load wakes up, or say
> when it goes through its first periodic balance.

The main question is : Do we want to remove all useless kicks to ilb
or useless calls to _nohz_idle_balance at the cost of adding more
latency in the idle/wakeup path because of the additional atomic
operations to keep track of which CPUs are idle, tickless with blocked
load or do we accept to kick ilb or call _nohz_idle_balance uselessly
from time to time for some use cases

I agree with you that there are some useless calls with the proposal
which can be removed with additional checks, variables and cpumask
manipulation. Which use case will benefits from these additional
checks and does it worth ?

Vincent

>
>> + }
>> +#endif
>> +
>


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Vincent Guittot
On 8 February 2018 at 13:46, Valentin Schneider
 wrote:
> On 02/06/2018 07:23 PM, Vincent Guittot wrote:
>> [...]
>> @@ -7826,8 +7842,8 @@ static inline void update_sg_lb_stats(struct lb_env 
>> *env,
>>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>>   struct rq *rq = cpu_rq(i);
>>
>> - if (env->flags & LBF_NOHZ_STATS)
>> - update_nohz_stats(rq);
>> + if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq))
>> + env->flags |= LBF_NOHZ_AGAIN;
>>
>>   /* Bias balancing toward cpus of our domain */
>>   if (local_group)
>> @@ -7979,18 +7995,15 @@ static inline void update_sd_lb_stats(struct lb_env 
>> *env, struct sd_lb_stats *sd
>>   struct sg_lb_stats *local = >local_stat;
>>   struct sg_lb_stats tmp_sgs;
>>   int load_idx, prefer_sibling = 0;
>> + int has_blocked = READ_ONCE(nohz.has_blocked);
>>   bool overload = false;
>>
>>   if (child && child->flags & SD_PREFER_SIBLING)
>>   prefer_sibling = 1;
>>
>>  #ifdef CONFIG_NO_HZ_COMMON
>> - if (env->idle == CPU_NEWLY_IDLE) {
>> + if (env->idle == CPU_NEWLY_IDLE && has_blocked)
>>   env->flags |= LBF_NOHZ_STATS;
>> -
>> - if (cpumask_subset(nohz.idle_cpus_mask, 
>> sched_domain_span(env->sd)))
>> - nohz.next_stats = jiffies + 
>> msecs_to_jiffies(LOAD_AVG_PERIOD);
>> - }
>>  #endif
>>
>>   load_idx = get_sd_load_idx(env->sd, env->idle);
>> @@ -8046,6 +8059,15 @@ static inline void update_sd_lb_stats(struct lb_env 
>> *env, struct sd_lb_stats *sd
>>   sg = sg->next;
>>   } while (sg != env->sd->groups);
>>
>> +#ifdef CONFIG_NO_HZ_COMMON
>> + if ((env->flags & LBF_NOHZ_AGAIN) &&
>> + cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
>> +
>> + WRITE_ONCE(nohz.next_blocked,
>> + jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
>
> Here we push the stats update forward if we visited all the nohz CPUs but they
> still have blocked load. IMO we should also clear the nohz.has_blocked flag
> if we visited all the nohz CPUs and none had blocked load left.
>
> If we don't do that, we could very well have cleared all of the nohz blocked
> load in idle_balance and successfully pulled a task, but the flag isn't
> cleared so we'll end up doing a _nohz_idle_balance() later on for nothing.
>
>
> As I said in a previous comment, we also have this problem with periodic
> load balance: if a CPU goes nohz (nohz.has_blocked is raised) but wakes up,
> e.g. before the next nohz.next_blocked, we should stop kicking ILBs
>
> Now I'd need to test this, but I think it can actually get worse: if that
> CPU keeps generating blocked load after this short idle period, no matter
> how many _nohz_idle_balance() we go through we will never reach a point
> where nohz.has_blocked gets cleared, and we'll keep kicking those ILBs to
> update blocked load that already gets updated in the periodic balance.
>
> I think that's where a nohz blocked load cpumask can also help: on top of
> skipping nohz CPUs that don't need an update, we can stop the whole remote
> update machinery when the last nohz cpu with blocked load wakes up, or say
> when it goes through its first periodic balance.

The main question is : Do we want to remove all useless kicks to ilb
or useless calls to _nohz_idle_balance at the cost of adding more
latency in the idle/wakeup path because of the additional atomic
operations to keep track of which CPUs are idle, tickless with blocked
load or do we accept to kick ilb or call _nohz_idle_balance uselessly
from time to time for some use cases

I agree with you that there are some useless calls with the proposal
which can be removed with additional checks, variables and cpumask
manipulation. Which use case will benefits from these additional
checks and does it worth ?

Vincent

>
>> + }
>> +#endif
>> +
>


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 12:46:53PM +, Valentin Schneider wrote:
> On 02/06/2018 07:23 PM, Vincent Guittot wrote:
> > [...]
> > @@ -7826,8 +7842,8 @@ static inline void update_sg_lb_stats(struct lb_env 
> > *env,
> > for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> > struct rq *rq = cpu_rq(i);
> >  
> > -   if (env->flags & LBF_NOHZ_STATS)
> > -   update_nohz_stats(rq);
> > +   if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq))
> > +   env->flags |= LBF_NOHZ_AGAIN;
> >  
> > /* Bias balancing toward cpus of our domain */
> > if (local_group)
> > @@ -7979,18 +7995,15 @@ static inline void update_sd_lb_stats(struct lb_env 
> > *env, struct sd_lb_stats *sd
> > struct sg_lb_stats *local = >local_stat;
> > struct sg_lb_stats tmp_sgs;
> > int load_idx, prefer_sibling = 0;
> > +   int has_blocked = READ_ONCE(nohz.has_blocked);
> > bool overload = false;
> >  
> > if (child && child->flags & SD_PREFER_SIBLING)
> > prefer_sibling = 1;
> >  
> >  #ifdef CONFIG_NO_HZ_COMMON
> > -   if (env->idle == CPU_NEWLY_IDLE) {
> > +   if (env->idle == CPU_NEWLY_IDLE && has_blocked)
> > env->flags |= LBF_NOHZ_STATS;
> > -
> > -   if (cpumask_subset(nohz.idle_cpus_mask, 
> > sched_domain_span(env->sd)))
> > -   nohz.next_stats = jiffies + 
> > msecs_to_jiffies(LOAD_AVG_PERIOD);
> > -   }
> >  #endif
> >  
> > load_idx = get_sd_load_idx(env->sd, env->idle);
> > @@ -8046,6 +8059,15 @@ static inline void update_sd_lb_stats(struct lb_env 
> > *env, struct sd_lb_stats *sd
> > sg = sg->next;
> > } while (sg != env->sd->groups);
> >  
> > +#ifdef CONFIG_NO_HZ_COMMON
> > +   if ((env->flags & LBF_NOHZ_AGAIN) &&
> > +   cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> > +
> > +   WRITE_ONCE(nohz.next_blocked,
> > +   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> 
> Here we push the stats update forward if we visited all the nohz CPUs but they
> still have blocked load. IMO we should also clear the nohz.has_blocked flag
> if we visited all the nohz CPUs and none had blocked load left.

You can't, new cpu's can have joined the set. I used to detect that, but
that requires atomic ops.


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 12:46:53PM +, Valentin Schneider wrote:
> On 02/06/2018 07:23 PM, Vincent Guittot wrote:
> > [...]
> > @@ -7826,8 +7842,8 @@ static inline void update_sg_lb_stats(struct lb_env 
> > *env,
> > for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> > struct rq *rq = cpu_rq(i);
> >  
> > -   if (env->flags & LBF_NOHZ_STATS)
> > -   update_nohz_stats(rq);
> > +   if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq))
> > +   env->flags |= LBF_NOHZ_AGAIN;
> >  
> > /* Bias balancing toward cpus of our domain */
> > if (local_group)
> > @@ -7979,18 +7995,15 @@ static inline void update_sd_lb_stats(struct lb_env 
> > *env, struct sd_lb_stats *sd
> > struct sg_lb_stats *local = >local_stat;
> > struct sg_lb_stats tmp_sgs;
> > int load_idx, prefer_sibling = 0;
> > +   int has_blocked = READ_ONCE(nohz.has_blocked);
> > bool overload = false;
> >  
> > if (child && child->flags & SD_PREFER_SIBLING)
> > prefer_sibling = 1;
> >  
> >  #ifdef CONFIG_NO_HZ_COMMON
> > -   if (env->idle == CPU_NEWLY_IDLE) {
> > +   if (env->idle == CPU_NEWLY_IDLE && has_blocked)
> > env->flags |= LBF_NOHZ_STATS;
> > -
> > -   if (cpumask_subset(nohz.idle_cpus_mask, 
> > sched_domain_span(env->sd)))
> > -   nohz.next_stats = jiffies + 
> > msecs_to_jiffies(LOAD_AVG_PERIOD);
> > -   }
> >  #endif
> >  
> > load_idx = get_sd_load_idx(env->sd, env->idle);
> > @@ -8046,6 +8059,15 @@ static inline void update_sd_lb_stats(struct lb_env 
> > *env, struct sd_lb_stats *sd
> > sg = sg->next;
> > } while (sg != env->sd->groups);
> >  
> > +#ifdef CONFIG_NO_HZ_COMMON
> > +   if ((env->flags & LBF_NOHZ_AGAIN) &&
> > +   cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> > +
> > +   WRITE_ONCE(nohz.next_blocked,
> > +   jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> 
> Here we push the stats update forward if we visited all the nohz CPUs but they
> still have blocked load. IMO we should also clear the nohz.has_blocked flag
> if we visited all the nohz CPUs and none had blocked load left.

You can't, new cpu's can have joined the set. I used to detect that, but
that requires atomic ops.


Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Valentin Schneider
On 02/06/2018 07:23 PM, Vincent Guittot wrote:
> [...]
> @@ -7826,8 +7842,8 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   struct rq *rq = cpu_rq(i);
>  
> - if (env->flags & LBF_NOHZ_STATS)
> - update_nohz_stats(rq);
> + if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq))
> + env->flags |= LBF_NOHZ_AGAIN;
>  
>   /* Bias balancing toward cpus of our domain */
>   if (local_group)
> @@ -7979,18 +7995,15 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   struct sg_lb_stats *local = >local_stat;
>   struct sg_lb_stats tmp_sgs;
>   int load_idx, prefer_sibling = 0;
> + int has_blocked = READ_ONCE(nohz.has_blocked);
>   bool overload = false;
>  
>   if (child && child->flags & SD_PREFER_SIBLING)
>   prefer_sibling = 1;
>  
>  #ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE) {
> + if (env->idle == CPU_NEWLY_IDLE && has_blocked)
>   env->flags |= LBF_NOHZ_STATS;
> -
> - if (cpumask_subset(nohz.idle_cpus_mask, 
> sched_domain_span(env->sd)))
> - nohz.next_stats = jiffies + 
> msecs_to_jiffies(LOAD_AVG_PERIOD);
> - }
>  #endif
>  
>   load_idx = get_sd_load_idx(env->sd, env->idle);
> @@ -8046,6 +8059,15 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   sg = sg->next;
>   } while (sg != env->sd->groups);
>  
> +#ifdef CONFIG_NO_HZ_COMMON
> + if ((env->flags & LBF_NOHZ_AGAIN) &&
> + cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> +
> + WRITE_ONCE(nohz.next_blocked,
> + jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));

Here we push the stats update forward if we visited all the nohz CPUs but they
still have blocked load. IMO we should also clear the nohz.has_blocked flag
if we visited all the nohz CPUs and none had blocked load left.

If we don't do that, we could very well have cleared all of the nohz blocked
load in idle_balance and successfully pulled a task, but the flag isn't
cleared so we'll end up doing a _nohz_idle_balance() later on for nothing.


As I said in a previous comment, we also have this problem with periodic
load balance: if a CPU goes nohz (nohz.has_blocked is raised) but wakes up,
e.g. before the next nohz.next_blocked, we should stop kicking ILBs

Now I'd need to test this, but I think it can actually get worse: if that 
CPU keeps generating blocked load after this short idle period, no matter 
how many _nohz_idle_balance() we go through we will never reach a point 
where nohz.has_blocked gets cleared, and we'll keep kicking those ILBs to
update blocked load that already gets updated in the periodic balance.

I think that's where a nohz blocked load cpumask can also help: on top of
skipping nohz CPUs that don't need an update, we can stop the whole remote
update machinery when the last nohz cpu with blocked load wakes up, or say
when it goes through its first periodic balance.

> + }
> +#endif
> +



Re: [PATCH v2 1/3] sched: Stop nohz stats when decayed

2018-02-08 Thread Valentin Schneider
On 02/06/2018 07:23 PM, Vincent Guittot wrote:
> [...]
> @@ -7826,8 +7842,8 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   struct rq *rq = cpu_rq(i);
>  
> - if (env->flags & LBF_NOHZ_STATS)
> - update_nohz_stats(rq);
> + if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq))
> + env->flags |= LBF_NOHZ_AGAIN;
>  
>   /* Bias balancing toward cpus of our domain */
>   if (local_group)
> @@ -7979,18 +7995,15 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   struct sg_lb_stats *local = >local_stat;
>   struct sg_lb_stats tmp_sgs;
>   int load_idx, prefer_sibling = 0;
> + int has_blocked = READ_ONCE(nohz.has_blocked);
>   bool overload = false;
>  
>   if (child && child->flags & SD_PREFER_SIBLING)
>   prefer_sibling = 1;
>  
>  #ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE) {
> + if (env->idle == CPU_NEWLY_IDLE && has_blocked)
>   env->flags |= LBF_NOHZ_STATS;
> -
> - if (cpumask_subset(nohz.idle_cpus_mask, 
> sched_domain_span(env->sd)))
> - nohz.next_stats = jiffies + 
> msecs_to_jiffies(LOAD_AVG_PERIOD);
> - }
>  #endif
>  
>   load_idx = get_sd_load_idx(env->sd, env->idle);
> @@ -8046,6 +8059,15 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   sg = sg->next;
>   } while (sg != env->sd->groups);
>  
> +#ifdef CONFIG_NO_HZ_COMMON
> + if ((env->flags & LBF_NOHZ_AGAIN) &&
> + cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> +
> + WRITE_ONCE(nohz.next_blocked,
> + jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));

Here we push the stats update forward if we visited all the nohz CPUs but they
still have blocked load. IMO we should also clear the nohz.has_blocked flag
if we visited all the nohz CPUs and none had blocked load left.

If we don't do that, we could very well have cleared all of the nohz blocked
load in idle_balance and successfully pulled a task, but the flag isn't
cleared so we'll end up doing a _nohz_idle_balance() later on for nothing.


As I said in a previous comment, we also have this problem with periodic
load balance: if a CPU goes nohz (nohz.has_blocked is raised) but wakes up,
e.g. before the next nohz.next_blocked, we should stop kicking ILBs

Now I'd need to test this, but I think it can actually get worse: if that 
CPU keeps generating blocked load after this short idle period, no matter 
how many _nohz_idle_balance() we go through we will never reach a point 
where nohz.has_blocked gets cleared, and we'll keep kicking those ILBs to
update blocked load that already gets updated in the periodic balance.

I think that's where a nohz blocked load cpumask can also help: on top of
skipping nohz CPUs that don't need an update, we can stop the whole remote
update machinery when the last nohz cpu with blocked load wakes up, or say
when it goes through its first periodic balance.

> + }
> +#endif
> +