Re: [PATCH v3 3/3] sched: update blocked load when newly idle

2018-03-01 Thread Frederic Weisbecker
On Mon, Feb 12, 2018 at 04:38:05PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 12, 2018 at 03:34:44PM +0100, Vincent Guittot wrote:
> > > Aside from the above being an unreadable mess, I dislike that it breaks
> > > the various isolation crud, we should not touch CPUs outside of our
> > > domain.
> > >
> > > 
> > > Maybe something like the below? (unfinished)
> > >
> > 
> > good catch. I completely miss the isolation stuff.
> > But isn't already the case when kicking ilb ? I mean that an idle CPU 
> > touches
> > all idle CPUs and some can be outside its domain during ilb.
> 
> > Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
> > make sure that an isolated/full nohz CPU will not be used for updating 
> > blocked
> > load of CPUs outside its domain ?
> 
> I _thought_ we had some 'housekeeping' crud in the ilb selection logic,
> but now I can't find it. Frederic?

I think you're referring to nohz_balance_idle(). The call is still there but 
HK_FLAG_SCHED
is unused for now. I initially turned it on by default on nohz_full but some 
people
complained. I don't recall why exactly. Anyway I'm waiting for a suitable 
interface
to use it.


Re: [PATCH v3 3/3] sched: update blocked load when newly idle

2018-03-01 Thread Frederic Weisbecker
On Mon, Feb 12, 2018 at 04:38:05PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 12, 2018 at 03:34:44PM +0100, Vincent Guittot wrote:
> > > Aside from the above being an unreadable mess, I dislike that it breaks
> > > the various isolation crud, we should not touch CPUs outside of our
> > > domain.
> > >
> > > 
> > > Maybe something like the below? (unfinished)
> > >
> > 
> > good catch. I completely miss the isolation stuff.
> > But isn't already the case when kicking ilb ? I mean that an idle CPU 
> > touches
> > all idle CPUs and some can be outside its domain during ilb.
> 
> > Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
> > make sure that an isolated/full nohz CPU will not be used for updating 
> > blocked
> > load of CPUs outside its domain ?
> 
> I _thought_ we had some 'housekeeping' crud in the ilb selection logic,
> but now I can't find it. Frederic?

I think you're referring to nohz_balance_idle(). The call is still there but 
HK_FLAG_SCHED
is unused for now. I initially turned it on by default on nohz_full but some 
people
complained. I don't recall why exactly. Anyway I'm waiting for a suitable 
interface
to use it.


Re: [PATCH v3 3/3] sched: update blocked load when newly idle

2018-02-12 Thread Vincent Guittot
On 12 February 2018 at 16:38, Peter Zijlstra  wrote:
> On Mon, Feb 12, 2018 at 03:34:44PM +0100, Vincent Guittot wrote:
>> Le Monday 12 Feb 2018 à 13:04:11 (+0100), Peter Zijlstra a écrit :
>> > On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:
>
>> > So I really hate this one, also I suspect its broken, because we do this
>> > check before dropping rq->lock and _nohz_idle_balance() will take
>> > rq->lock.
>>
>> yes. it will take both newly idle rq and idle rq lock
>
> Right, can't do that, there's ordering rules for multiple RQ locks etc..
>
>>
>> >
>> >
>> > Aside from the above being an unreadable mess, I dislike that it breaks
>> > the various isolation crud, we should not touch CPUs outside of our
>> > domain.
>> >
>> >
>> > Maybe something like the below? (unfinished)
>> >
>>
>> good catch. I completely miss the isolation stuff.
>> But isn't already the case when kicking ilb ? I mean that an idle CPU touches
>> all idle CPUs and some can be outside its domain during ilb.
>
>> Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
>> make sure that an isolated/full nohz CPU will not be used for updating 
>> blocked
>> load of CPUs outside its domain ?
>
> I _thought_ we had some 'housekeeping' crud in the ilb selection logic,
> but now I can't find it. Frederic?
>
>> Is something below more readable:
>>
>>   /*
>> +  * This CPU doesn't want to be disturbed by scheduler
>> +  * houskeeping
>>*/
>> + if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>> + goto out;
>> +
>> + /* Will wake up very soon. No time for doing anything else*/
>> + if (this_rq->avg_idle < sysctl_sched_migration_cost)
>> + goto out;
>> +
>> + /* Don't need to update blocked load of idle CPUs*/
>> + if (!has_blocked || time_after_eq(jiffies, next_blocked)
>> + goto out;
>> +
>> + raw_spin_unlock(_rq->lock);
>> + /*
>> +  * This CPU is going to be idle and blocked load of idle CPUs
>> +  * need to be updated. Run the ilb locally as it is a good
>> +  * candidate for ilb instead of waking up another idle CPU.
>> +  * Kick an normal ilb if we failed to do the update.
>> +  */
>> + if !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
>> CPU_NEWLY_IDLE))
>>   kick_ilb(NOHZ_STATS_KICK);
>> + raw_spin_lock(_rq->lock);
>>
>>   goto out;
>
> It is, but I think you're still doing that avg_idle thing twice now,
> right?

yes the goal was to try to not exceed idle time but I wonder if it is
really needed because the need_resched() in the
"for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
" will abort the loop if something is schedule on this_cpu just like
for a normal ilb().
So I think that we can remove this test with avg_idle.

>
>> > @@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
>> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
>> > return false;
>> >
>> > -   if (!time_after(jiffies, rq->last_blocked_load_update_tick))
>> > +   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
>>
>> This fix the concern raised on the other thread, isn't it ?
>
> Yes.
>
>> > +static int nohz_age(struct sched_domain *sd)
>> > +{
>> > +   struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
>> > +   bool has_blocked_load;
>> > +
>> > +   WRITE_ONCE(nohz.has_blocked, 0);
>> > +
>> > +   smp_mb();
>> > +
>> > +   cpumask_and(cpus, sched_domain_span(sd), nohz.idle_cpus_mask);
>> > +
>> > +   has_blocked_load = cpumask_subset(nohz.idle_cpus_mask, 
>> > sched_domain_span(sd));
>> > +
>> > +   for_each_cpu(cpu, cpus) {
>> > +   struct rq *rq = cpu_rq(cpu);
>> > +
>> > +   has_blocked_load |= update_nohz_stats(rq, true);
>> > +   }
>> > +
>> > +   if (has_blocked_load)
>> > +   WRITE_ONCE(nohz.has_blocked, 1);
>> > +}
>> > +
>>
>> we duplicate what is done in nohe_idle_balance
>
> In parts yes.. I was too lazy to combine :-)
>
>> > @@ -8919,9 +8955,13 @@ static int idle_balance(struct rq *this_
>> > if (sd->flags & SD_BALANCE_NEWIDLE) {
>> > t0 = sched_clock_cpu(this_cpu);
>> >
>> > -   pulled_task = load_balance(this_cpu, this_rq,
>> > -  sd, CPU_NEWLY_IDLE,
>> > -  _balancing);
>> > +   if (nohz_blocked) {
>> > +   nohz_age(sd);
>>
>> Do we really need to loop all sched_domain of newly idle CPU and call
>> nohz_age for each level ?
>> Can't we only call  nohz_age with the widest/last sched_domain level ?
>
> Yeah, dunno. I went back and forth on that a bit. The largest is
> rq->rd->span. The reason I settled on this variant in the end is that 

Re: [PATCH v3 3/3] sched: update blocked load when newly idle

2018-02-12 Thread Vincent Guittot
On 12 February 2018 at 16:38, Peter Zijlstra  wrote:
> On Mon, Feb 12, 2018 at 03:34:44PM +0100, Vincent Guittot wrote:
>> Le Monday 12 Feb 2018 à 13:04:11 (+0100), Peter Zijlstra a écrit :
>> > On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:
>
>> > So I really hate this one, also I suspect its broken, because we do this
>> > check before dropping rq->lock and _nohz_idle_balance() will take
>> > rq->lock.
>>
>> yes. it will take both newly idle rq and idle rq lock
>
> Right, can't do that, there's ordering rules for multiple RQ locks etc..
>
>>
>> >
>> >
>> > Aside from the above being an unreadable mess, I dislike that it breaks
>> > the various isolation crud, we should not touch CPUs outside of our
>> > domain.
>> >
>> >
>> > Maybe something like the below? (unfinished)
>> >
>>
>> good catch. I completely miss the isolation stuff.
>> But isn't already the case when kicking ilb ? I mean that an idle CPU touches
>> all idle CPUs and some can be outside its domain during ilb.
>
>> Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
>> make sure that an isolated/full nohz CPU will not be used for updating 
>> blocked
>> load of CPUs outside its domain ?
>
> I _thought_ we had some 'housekeeping' crud in the ilb selection logic,
> but now I can't find it. Frederic?
>
>> Is something below more readable:
>>
>>   /*
>> +  * This CPU doesn't want to be disturbed by scheduler
>> +  * houskeeping
>>*/
>> + if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>> + goto out;
>> +
>> + /* Will wake up very soon. No time for doing anything else*/
>> + if (this_rq->avg_idle < sysctl_sched_migration_cost)
>> + goto out;
>> +
>> + /* Don't need to update blocked load of idle CPUs*/
>> + if (!has_blocked || time_after_eq(jiffies, next_blocked)
>> + goto out;
>> +
>> + raw_spin_unlock(_rq->lock);
>> + /*
>> +  * This CPU is going to be idle and blocked load of idle CPUs
>> +  * need to be updated. Run the ilb locally as it is a good
>> +  * candidate for ilb instead of waking up another idle CPU.
>> +  * Kick an normal ilb if we failed to do the update.
>> +  */
>> + if !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
>> CPU_NEWLY_IDLE))
>>   kick_ilb(NOHZ_STATS_KICK);
>> + raw_spin_lock(_rq->lock);
>>
>>   goto out;
>
> It is, but I think you're still doing that avg_idle thing twice now,
> right?

yes the goal was to try to not exceed idle time but I wonder if it is
really needed because the need_resched() in the
"for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
" will abort the loop if something is schedule on this_cpu just like
for a normal ilb().
So I think that we can remove this test with avg_idle.

>
>> > @@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
>> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
>> > return false;
>> >
>> > -   if (!time_after(jiffies, rq->last_blocked_load_update_tick))
>> > +   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
>>
>> This fix the concern raised on the other thread, isn't it ?
>
> Yes.
>
>> > +static int nohz_age(struct sched_domain *sd)
>> > +{
>> > +   struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
>> > +   bool has_blocked_load;
>> > +
>> > +   WRITE_ONCE(nohz.has_blocked, 0);
>> > +
>> > +   smp_mb();
>> > +
>> > +   cpumask_and(cpus, sched_domain_span(sd), nohz.idle_cpus_mask);
>> > +
>> > +   has_blocked_load = cpumask_subset(nohz.idle_cpus_mask, 
>> > sched_domain_span(sd));
>> > +
>> > +   for_each_cpu(cpu, cpus) {
>> > +   struct rq *rq = cpu_rq(cpu);
>> > +
>> > +   has_blocked_load |= update_nohz_stats(rq, true);
>> > +   }
>> > +
>> > +   if (has_blocked_load)
>> > +   WRITE_ONCE(nohz.has_blocked, 1);
>> > +}
>> > +
>>
>> we duplicate what is done in nohe_idle_balance
>
> In parts yes.. I was too lazy to combine :-)
>
>> > @@ -8919,9 +8955,13 @@ static int idle_balance(struct rq *this_
>> > if (sd->flags & SD_BALANCE_NEWIDLE) {
>> > t0 = sched_clock_cpu(this_cpu);
>> >
>> > -   pulled_task = load_balance(this_cpu, this_rq,
>> > -  sd, CPU_NEWLY_IDLE,
>> > -  _balancing);
>> > +   if (nohz_blocked) {
>> > +   nohz_age(sd);
>>
>> Do we really need to loop all sched_domain of newly idle CPU and call
>> nohz_age for each level ?
>> Can't we only call  nohz_age with the widest/last sched_domain level ?
>
> Yeah, dunno. I went back and forth on that a bit. The largest is
> rq->rd->span. The reason I settled on this variant in the end is that it
> keeps locality. 

Re: [PATCH v3 3/3] sched: update blocked load when newly idle

2018-02-12 Thread Peter Zijlstra
On Mon, Feb 12, 2018 at 03:34:44PM +0100, Vincent Guittot wrote:
> Le Monday 12 Feb 2018 à 13:04:11 (+0100), Peter Zijlstra a écrit :
> > On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:

> > So I really hate this one, also I suspect its broken, because we do this
> > check before dropping rq->lock and _nohz_idle_balance() will take
> > rq->lock.
> 
> yes. it will take both newly idle rq and idle rq lock

Right, can't do that, there's ordering rules for multiple RQ locks etc..

> 
> >
> > 
> > Aside from the above being an unreadable mess, I dislike that it breaks
> > the various isolation crud, we should not touch CPUs outside of our
> > domain.
> >
> > 
> > Maybe something like the below? (unfinished)
> >
> 
> good catch. I completely miss the isolation stuff.
> But isn't already the case when kicking ilb ? I mean that an idle CPU touches
> all idle CPUs and some can be outside its domain during ilb.

> Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
> make sure that an isolated/full nohz CPU will not be used for updating blocked
> load of CPUs outside its domain ?

I _thought_ we had some 'housekeeping' crud in the ilb selection logic,
but now I can't find it. Frederic?

> Is something below more readable:
>  
>   /*
> +  * This CPU doesn't want to be disturbed by scheduler
> +  * houskeeping
>*/
> + if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
> + goto out;
> +
> + /* Will wake up very soon. No time for doing anything else*/
> + if (this_rq->avg_idle < sysctl_sched_migration_cost)
> + goto out;
> +
> + /* Don't need to update blocked load of idle CPUs*/
> + if (!has_blocked || time_after_eq(jiffies, next_blocked)
> + goto out;
> +
> + raw_spin_unlock(_rq->lock);
> + /*
> +  * This CPU is going to be idle and blocked load of idle CPUs
> +  * need to be updated. Run the ilb locally as it is a good
> +  * candidate for ilb instead of waking up another idle CPU.
> +  * Kick an normal ilb if we failed to do the update.
> +  */
> + if !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
> CPU_NEWLY_IDLE))
>   kick_ilb(NOHZ_STATS_KICK);
> + raw_spin_lock(_rq->lock);
>  
>   goto out;

It is, but I think you're still doing that avg_idle thing twice now,
right?

> > @@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > return false;
> >  
> > -   if (!time_after(jiffies, rq->last_blocked_load_update_tick))
> > +   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> 
> This fix the concern raised on the other thread, isn't it ?

Yes.

> > +static int nohz_age(struct sched_domain *sd)
> > +{
> > +   struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> > +   bool has_blocked_load;
> > +
> > +   WRITE_ONCE(nohz.has_blocked, 0);
> > +
> > +   smp_mb();
> > +
> > +   cpumask_and(cpus, sched_domain_span(sd), nohz.idle_cpus_mask);
> > +
> > +   has_blocked_load = cpumask_subset(nohz.idle_cpus_mask, 
> > sched_domain_span(sd));
> > +
> > +   for_each_cpu(cpu, cpus) {
> > +   struct rq *rq = cpu_rq(cpu);
> > +
> > +   has_blocked_load |= update_nohz_stats(rq, true);
> > +   }
> > +
> > +   if (has_blocked_load)
> > +   WRITE_ONCE(nohz.has_blocked, 1);
> > +}
> > +
> 
> we duplicate what is done in nohe_idle_balance

In parts yes.. I was too lazy to combine :-)

> > @@ -8919,9 +8955,13 @@ static int idle_balance(struct rq *this_
> > if (sd->flags & SD_BALANCE_NEWIDLE) {
> > t0 = sched_clock_cpu(this_cpu);
> >  
> > -   pulled_task = load_balance(this_cpu, this_rq,
> > -  sd, CPU_NEWLY_IDLE,
> > -  _balancing);
> > +   if (nohz_blocked) {
> > +   nohz_age(sd);
> 
> Do we really need to loop all sched_domain of newly idle CPU and call
> nohz_age for each level ?
> Can't we only call  nohz_age with the widest/last sched_domain level ?

Yeah, dunno. I went back and forth on that a bit. The largest is
rq->rd->span. The reason I settled on this variant in the end is that it
keeps locality. When short idle, it will only scan nearby CPUs instead
of reaching half-way across the machine.

> Furthermore, we use sd->max_newidle_lb_cost to decide to abort the loop.
> But this is updated with full load balancing which is longer than just
> updating blocked load.
> This will increase the chance to abort before reaching the last level.

Yes.. I figured we'd take that hit :/

> > +   } else {
> > +   pulled_task = load_balance(this_cpu, 

Re: [PATCH v3 3/3] sched: update blocked load when newly idle

2018-02-12 Thread Peter Zijlstra
On Mon, Feb 12, 2018 at 03:34:44PM +0100, Vincent Guittot wrote:
> Le Monday 12 Feb 2018 à 13:04:11 (+0100), Peter Zijlstra a écrit :
> > On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:

> > So I really hate this one, also I suspect its broken, because we do this
> > check before dropping rq->lock and _nohz_idle_balance() will take
> > rq->lock.
> 
> yes. it will take both newly idle rq and idle rq lock

Right, can't do that, there's ordering rules for multiple RQ locks etc..

> 
> >
> > 
> > Aside from the above being an unreadable mess, I dislike that it breaks
> > the various isolation crud, we should not touch CPUs outside of our
> > domain.
> >
> > 
> > Maybe something like the below? (unfinished)
> >
> 
> good catch. I completely miss the isolation stuff.
> But isn't already the case when kicking ilb ? I mean that an idle CPU touches
> all idle CPUs and some can be outside its domain during ilb.

> Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
> make sure that an isolated/full nohz CPU will not be used for updating blocked
> load of CPUs outside its domain ?

I _thought_ we had some 'housekeeping' crud in the ilb selection logic,
but now I can't find it. Frederic?

> Is something below more readable:
>  
>   /*
> +  * This CPU doesn't want to be disturbed by scheduler
> +  * houskeeping
>*/
> + if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
> + goto out;
> +
> + /* Will wake up very soon. No time for doing anything else*/
> + if (this_rq->avg_idle < sysctl_sched_migration_cost)
> + goto out;
> +
> + /* Don't need to update blocked load of idle CPUs*/
> + if (!has_blocked || time_after_eq(jiffies, next_blocked)
> + goto out;
> +
> + raw_spin_unlock(_rq->lock);
> + /*
> +  * This CPU is going to be idle and blocked load of idle CPUs
> +  * need to be updated. Run the ilb locally as it is a good
> +  * candidate for ilb instead of waking up another idle CPU.
> +  * Kick an normal ilb if we failed to do the update.
> +  */
> + if !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
> CPU_NEWLY_IDLE))
>   kick_ilb(NOHZ_STATS_KICK);
> + raw_spin_lock(_rq->lock);
>  
>   goto out;

It is, but I think you're still doing that avg_idle thing twice now,
right?

> > @@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > return false;
> >  
> > -   if (!time_after(jiffies, rq->last_blocked_load_update_tick))
> > +   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> 
> This fix the concern raised on the other thread, isn't it ?

Yes.

> > +static int nohz_age(struct sched_domain *sd)
> > +{
> > +   struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> > +   bool has_blocked_load;
> > +
> > +   WRITE_ONCE(nohz.has_blocked, 0);
> > +
> > +   smp_mb();
> > +
> > +   cpumask_and(cpus, sched_domain_span(sd), nohz.idle_cpus_mask);
> > +
> > +   has_blocked_load = cpumask_subset(nohz.idle_cpus_mask, 
> > sched_domain_span(sd));
> > +
> > +   for_each_cpu(cpu, cpus) {
> > +   struct rq *rq = cpu_rq(cpu);
> > +
> > +   has_blocked_load |= update_nohz_stats(rq, true);
> > +   }
> > +
> > +   if (has_blocked_load)
> > +   WRITE_ONCE(nohz.has_blocked, 1);
> > +}
> > +
> 
> we duplicate what is done in nohe_idle_balance

In parts yes.. I was too lazy to combine :-)

> > @@ -8919,9 +8955,13 @@ static int idle_balance(struct rq *this_
> > if (sd->flags & SD_BALANCE_NEWIDLE) {
> > t0 = sched_clock_cpu(this_cpu);
> >  
> > -   pulled_task = load_balance(this_cpu, this_rq,
> > -  sd, CPU_NEWLY_IDLE,
> > -  _balancing);
> > +   if (nohz_blocked) {
> > +   nohz_age(sd);
> 
> Do we really need to loop all sched_domain of newly idle CPU and call
> nohz_age for each level ?
> Can't we only call  nohz_age with the widest/last sched_domain level ?

Yeah, dunno. I went back and forth on that a bit. The largest is
rq->rd->span. The reason I settled on this variant in the end is that it
keeps locality. When short idle, it will only scan nearby CPUs instead
of reaching half-way across the machine.

> Furthermore, we use sd->max_newidle_lb_cost to decide to abort the loop.
> But this is updated with full load balancing which is longer than just
> updating blocked load.
> This will increase the chance to abort before reaching the last level.

Yes.. I figured we'd take that hit :/

> > +   } else {
> > +   pulled_task = load_balance(this_cpu, 

Re: [PATCH v3 3/3] sched: update blocked load when newly idle

2018-02-12 Thread Vincent Guittot
Le Monday 12 Feb 2018 à 13:04:11 (+0100), Peter Zijlstra a écrit :
> On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:
> > @@ -8863,12 +8866,26 @@ static int idle_balance(struct rq *this_rq, struct 
> > rq_flags *rf)
> >  
> > if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> > !this_rq->rd->overload) {
> > +   unsigned long has_blocked = READ_ONCE(nohz.has_blocked);
> > +   unsigned long next_blocked = READ_ONCE(nohz.next_blocked);
> > +
> > rcu_read_lock();
> > sd = rcu_dereference_check_sched_domain(this_rq->sd);
> > if (sd)
> > update_next_balance(sd, _balance);
> > rcu_read_unlock();
> >  
> > +   /*
> > +* Update blocked idle load if it has not been done for a
> > +* while. Try to do it locally before entering idle but kick a
> > +* ilb if it takes too much time and/or might delay next local
> > +* wake up
> > +*/
> > +   if (has_blocked && time_after_eq(jiffies, next_blocked) &&
> > +   (this_rq->avg_idle < 
> > sysctl_sched_migration_cost ||
> > +   !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
> > CPU_NEWLY_IDLE)))
> > +   kick_ilb(NOHZ_STATS_KICK);
> > +
> > goto out;
> > }
> 
> So I really hate this one, also I suspect its broken, because we do this
> check before dropping rq->lock and _nohz_idle_balance() will take
> rq->lock.

yes. it will take both newly idle rq and idle rq lock

>
> 
> Aside from the above being an unreadable mess, I dislike that it breaks
> the various isolation crud, we should not touch CPUs outside of our
> domain.
>
> 
> Maybe something like the below? (unfinished)
>

good catch. I completely miss the isolation stuff.
But isn't already the case when kicking ilb ? I mean that an idle CPU touches
all idle CPUs and some can be outside its domain during ilb.

The goal of this call to _nohz_idle_balance was to consider this cpu as idle
because that's what will happen as it will not try to pull other task.
We save the kick of an already idle cpu for doing the update of blocked load

Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
make sure that an isolated/full nohz CPU will not be used for updating blocked
load of CPUs outside its domain ?

Is something below more readable:
 
/*
-* Update blocked idle load if it has not been done for a
-* while. Try to do it locally before entering idle but kick a
-* ilb if it takes too much time and/or might delay next local
-* wake up
+* This CPU doesn't want to be disturbed by scheduler
+* houskeeping
 */
-   if (has_blocked && time_after_eq(jiffies, next_blocked) &&
-   (this_rq->avg_idle < 
sysctl_sched_migration_cost ||
-   !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
CPU_NEWLY_IDLE)))
+   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
+   goto out;
+
+   /* Will wake up very soon. No time for doing anything else*/
+   if (this_rq->avg_idle < sysctl_sched_migration_cost)
+   goto out;
+
+   /* Don't need to update blocked load of idle CPUs*/
+   if (!has_blocked || time_after_eq(jiffies, next_blocked)
+   goto out;
+
+   raw_spin_unlock(_rq->lock);
+   /*
+* This CPU is going to be idle and blocked load of idle CPUs
+* need to be updated. Run the ilb locally as it is a good
+* candidate for ilb instead of waking up another idle CPU.
+* Kick an normal ilb if we failed to do the update.
+*/
+   if !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
CPU_NEWLY_IDLE))
kick_ilb(NOHZ_STATS_KICK);
+   raw_spin_lock(_rq->lock);
 
goto out;

> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7839,7 +7839,7 @@ group_type group_classify(struct sched_g
>   return group_other;
>  }
>  
> -static bool update_nohz_stats(struct rq *rq)
> +static bool update_nohz_stats(struct rq *rq, bool force)
>  {
>  #ifdef CONFIG_NO_HZ_COMMON
>   unsigned int cpu = rq->cpu;
> @@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
>   if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
>   return false;
>  
> - if (!time_after(jiffies, rq->last_blocked_load_update_tick))
> + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))

This fix the concern raised on the other thread, isn't it ?

>   return true;
>  
>   update_blocked_averages(cpu);
> @@ -7883,7 +7883,7 @@ static inline void update_sg_lb_stats(st
>   

Re: [PATCH v3 3/3] sched: update blocked load when newly idle

2018-02-12 Thread Vincent Guittot
Le Monday 12 Feb 2018 à 13:04:11 (+0100), Peter Zijlstra a écrit :
> On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:
> > @@ -8863,12 +8866,26 @@ static int idle_balance(struct rq *this_rq, struct 
> > rq_flags *rf)
> >  
> > if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> > !this_rq->rd->overload) {
> > +   unsigned long has_blocked = READ_ONCE(nohz.has_blocked);
> > +   unsigned long next_blocked = READ_ONCE(nohz.next_blocked);
> > +
> > rcu_read_lock();
> > sd = rcu_dereference_check_sched_domain(this_rq->sd);
> > if (sd)
> > update_next_balance(sd, _balance);
> > rcu_read_unlock();
> >  
> > +   /*
> > +* Update blocked idle load if it has not been done for a
> > +* while. Try to do it locally before entering idle but kick a
> > +* ilb if it takes too much time and/or might delay next local
> > +* wake up
> > +*/
> > +   if (has_blocked && time_after_eq(jiffies, next_blocked) &&
> > +   (this_rq->avg_idle < 
> > sysctl_sched_migration_cost ||
> > +   !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
> > CPU_NEWLY_IDLE)))
> > +   kick_ilb(NOHZ_STATS_KICK);
> > +
> > goto out;
> > }
> 
> So I really hate this one, also I suspect its broken, because we do this
> check before dropping rq->lock and _nohz_idle_balance() will take
> rq->lock.

yes. it will take both newly idle rq and idle rq lock

>
> 
> Aside from the above being an unreadable mess, I dislike that it breaks
> the various isolation crud, we should not touch CPUs outside of our
> domain.
>
> 
> Maybe something like the below? (unfinished)
>

good catch. I completely miss the isolation stuff.
But isn't already the case when kicking ilb ? I mean that an idle CPU touches
all idle CPUs and some can be outside its domain during ilb.

The goal of this call to _nohz_idle_balance was to consider this cpu as idle
because that's what will happen as it will not try to pull other task.
We save the kick of an already idle cpu for doing the update of blocked load

Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
make sure that an isolated/full nohz CPU will not be used for updating blocked
load of CPUs outside its domain ?

Is something below more readable:
 
/*
-* Update blocked idle load if it has not been done for a
-* while. Try to do it locally before entering idle but kick a
-* ilb if it takes too much time and/or might delay next local
-* wake up
+* This CPU doesn't want to be disturbed by scheduler
+* houskeeping
 */
-   if (has_blocked && time_after_eq(jiffies, next_blocked) &&
-   (this_rq->avg_idle < 
sysctl_sched_migration_cost ||
-   !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
CPU_NEWLY_IDLE)))
+   if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
+   goto out;
+
+   /* Will wake up very soon. No time for doing anything else*/
+   if (this_rq->avg_idle < sysctl_sched_migration_cost)
+   goto out;
+
+   /* Don't need to update blocked load of idle CPUs*/
+   if (!has_blocked || time_after_eq(jiffies, next_blocked)
+   goto out;
+
+   raw_spin_unlock(_rq->lock);
+   /*
+* This CPU is going to be idle and blocked load of idle CPUs
+* need to be updated. Run the ilb locally as it is a good
+* candidate for ilb instead of waking up another idle CPU.
+* Kick an normal ilb if we failed to do the update.
+*/
+   if !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
CPU_NEWLY_IDLE))
kick_ilb(NOHZ_STATS_KICK);
+   raw_spin_lock(_rq->lock);
 
goto out;

> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7839,7 +7839,7 @@ group_type group_classify(struct sched_g
>   return group_other;
>  }
>  
> -static bool update_nohz_stats(struct rq *rq)
> +static bool update_nohz_stats(struct rq *rq, bool force)
>  {
>  #ifdef CONFIG_NO_HZ_COMMON
>   unsigned int cpu = rq->cpu;
> @@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
>   if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
>   return false;
>  
> - if (!time_after(jiffies, rq->last_blocked_load_update_tick))
> + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))

This fix the concern raised on the other thread, isn't it ?

>   return true;
>  
>   update_blocked_averages(cpu);
> @@ -7883,7 +7883,7 @@ static inline void update_sg_lb_stats(st
>   

Re: [PATCH v3 3/3] sched: update blocked load when newly idle

2018-02-12 Thread Peter Zijlstra
On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:
> @@ -8863,12 +8866,26 @@ static int idle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>  
>   if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>   !this_rq->rd->overload) {
> + unsigned long has_blocked = READ_ONCE(nohz.has_blocked);
> + unsigned long next_blocked = READ_ONCE(nohz.next_blocked);
> +
>   rcu_read_lock();
>   sd = rcu_dereference_check_sched_domain(this_rq->sd);
>   if (sd)
>   update_next_balance(sd, _balance);
>   rcu_read_unlock();
>  
> + /*
> +  * Update blocked idle load if it has not been done for a
> +  * while. Try to do it locally before entering idle but kick a
> +  * ilb if it takes too much time and/or might delay next local
> +  * wake up
> +  */
> + if (has_blocked && time_after_eq(jiffies, next_blocked) &&
> + (this_rq->avg_idle < 
> sysctl_sched_migration_cost ||
> + !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
> CPU_NEWLY_IDLE)))
> + kick_ilb(NOHZ_STATS_KICK);
> +
>   goto out;
>   }

So I really hate this one, also I suspect its broken, because we do this
check before dropping rq->lock and _nohz_idle_balance() will take
rq->lock.

Aside from the above being an unreadable mess, I dislike that it breaks
the various isolation crud, we should not touch CPUs outside of our
domain.

Maybe something like the below? (unfinished)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7839,7 +7839,7 @@ group_type group_classify(struct sched_g
return group_other;
 }
 
-static bool update_nohz_stats(struct rq *rq)
+static bool update_nohz_stats(struct rq *rq, bool force)
 {
 #ifdef CONFIG_NO_HZ_COMMON
unsigned int cpu = rq->cpu;
@@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
return false;
 
-   if (!time_after(jiffies, rq->last_blocked_load_update_tick))
+   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
return true;
 
update_blocked_averages(cpu);
@@ -7883,7 +7883,7 @@ static inline void update_sg_lb_stats(st
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);
 
-   if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq))
+   if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, 
false))
env->flags |= LBF_NOHZ_AGAIN;
 
/* Bias balancing toward cpus of our domain */
@@ -8857,6 +8857,29 @@ update_next_balance(struct sched_domain
*next_balance = next;
 }
 
+static int nohz_age(struct sched_domain *sd)
+{
+   struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+   bool has_blocked_load;
+
+   WRITE_ONCE(nohz.has_blocked, 0);
+
+   smp_mb();
+
+   cpumask_and(cpus, sched_domain_span(sd), nohz.idle_cpus_mask);
+
+   has_blocked_load = cpumask_subset(nohz.idle_cpus_mask, 
sched_domain_span(sd));
+
+   for_each_cpu(cpu, cpus) {
+   struct rq *rq = cpu_rq(cpu);
+
+   has_blocked_load |= update_nohz_stats(rq, true);
+   }
+
+   if (has_blocked_load)
+   WRITE_ONCE(nohz.has_blocked, 1);
+}
+
 /*
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
@@ -8868,6 +8891,7 @@ static int idle_balance(struct rq *this_
struct sched_domain *sd;
int pulled_task = 0;
u64 curr_cost = 0;
+   bool nohz_blocked = false;
 
/*
 * We must set idle_stamp _before_ calling idle_balance(), such that we
@@ -8889,8 +8913,8 @@ static int idle_balance(struct rq *this_
 */
rq_unpin_lock(this_rq, rf);
 
-   if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-   !this_rq->rd->overload) {
+   if (this_rq->avg_idle < sysctl_sched_migration_cost) {
+short_idle:
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
@@ -8900,6 +8924,18 @@ static int idle_balance(struct rq *this_
goto out;
}
 
+   if (!this_rq->rd->overload) {
+#ifdef CONFIG_NO_HZ_COMMON
+   unsigned int has_blocked = READ_ONCE(nohz.has_blocked);
+   unsigned long next_blocked = READ_ONE(nohz.next_blocked);
+
+   if (has_blocked && time_after_eq(jiffies, next_blocked))
+   nohz_blocked = true;
+   else
+#endif
+   goto short_idle;
+   }
+
raw_spin_unlock(_rq->lock);
 
update_blocked_averages(this_cpu);
@@ -8919,9 +8955,13 @@ static int idle_balance(struct rq *this_
  

Re: [PATCH v3 3/3] sched: update blocked load when newly idle

2018-02-12 Thread Peter Zijlstra
On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:
> @@ -8863,12 +8866,26 @@ static int idle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>  
>   if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>   !this_rq->rd->overload) {
> + unsigned long has_blocked = READ_ONCE(nohz.has_blocked);
> + unsigned long next_blocked = READ_ONCE(nohz.next_blocked);
> +
>   rcu_read_lock();
>   sd = rcu_dereference_check_sched_domain(this_rq->sd);
>   if (sd)
>   update_next_balance(sd, _balance);
>   rcu_read_unlock();
>  
> + /*
> +  * Update blocked idle load if it has not been done for a
> +  * while. Try to do it locally before entering idle but kick a
> +  * ilb if it takes too much time and/or might delay next local
> +  * wake up
> +  */
> + if (has_blocked && time_after_eq(jiffies, next_blocked) &&
> + (this_rq->avg_idle < 
> sysctl_sched_migration_cost ||
> + !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, 
> CPU_NEWLY_IDLE)))
> + kick_ilb(NOHZ_STATS_KICK);
> +
>   goto out;
>   }

So I really hate this one, also I suspect its broken, because we do this
check before dropping rq->lock and _nohz_idle_balance() will take
rq->lock.

Aside from the above being an unreadable mess, I dislike that it breaks
the various isolation crud, we should not touch CPUs outside of our
domain.

Maybe something like the below? (unfinished)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7839,7 +7839,7 @@ group_type group_classify(struct sched_g
return group_other;
 }
 
-static bool update_nohz_stats(struct rq *rq)
+static bool update_nohz_stats(struct rq *rq, bool force)
 {
 #ifdef CONFIG_NO_HZ_COMMON
unsigned int cpu = rq->cpu;
@@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
return false;
 
-   if (!time_after(jiffies, rq->last_blocked_load_update_tick))
+   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
return true;
 
update_blocked_averages(cpu);
@@ -7883,7 +7883,7 @@ static inline void update_sg_lb_stats(st
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);
 
-   if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq))
+   if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, 
false))
env->flags |= LBF_NOHZ_AGAIN;
 
/* Bias balancing toward cpus of our domain */
@@ -8857,6 +8857,29 @@ update_next_balance(struct sched_domain
*next_balance = next;
 }
 
+static int nohz_age(struct sched_domain *sd)
+{
+   struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+   bool has_blocked_load;
+
+   WRITE_ONCE(nohz.has_blocked, 0);
+
+   smp_mb();
+
+   cpumask_and(cpus, sched_domain_span(sd), nohz.idle_cpus_mask);
+
+   has_blocked_load = cpumask_subset(nohz.idle_cpus_mask, 
sched_domain_span(sd));
+
+   for_each_cpu(cpu, cpus) {
+   struct rq *rq = cpu_rq(cpu);
+
+   has_blocked_load |= update_nohz_stats(rq, true);
+   }
+
+   if (has_blocked_load)
+   WRITE_ONCE(nohz.has_blocked, 1);
+}
+
 /*
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
@@ -8868,6 +8891,7 @@ static int idle_balance(struct rq *this_
struct sched_domain *sd;
int pulled_task = 0;
u64 curr_cost = 0;
+   bool nohz_blocked = false;
 
/*
 * We must set idle_stamp _before_ calling idle_balance(), such that we
@@ -8889,8 +8913,8 @@ static int idle_balance(struct rq *this_
 */
rq_unpin_lock(this_rq, rf);
 
-   if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-   !this_rq->rd->overload) {
+   if (this_rq->avg_idle < sysctl_sched_migration_cost) {
+short_idle:
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
@@ -8900,6 +8924,18 @@ static int idle_balance(struct rq *this_
goto out;
}
 
+   if (!this_rq->rd->overload) {
+#ifdef CONFIG_NO_HZ_COMMON
+   unsigned int has_blocked = READ_ONCE(nohz.has_blocked);
+   unsigned long next_blocked = READ_ONE(nohz.next_blocked);
+
+   if (has_blocked && time_after_eq(jiffies, next_blocked))
+   nohz_blocked = true;
+   else
+#endif
+   goto short_idle;
+   }
+
raw_spin_unlock(_rq->lock);
 
update_blocked_averages(this_cpu);
@@ -8919,9 +8955,13 @@ static int idle_balance(struct rq *this_