Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-04-09 Thread Tim Chen



On 4/9/21 8:26 AM, Vincent Guittot wrote:


 I was expecting idle load balancer to be rate limited to 60 Hz, which
>>>
>>> Why 60Hz ?
>>>
>>
>> My thinking is we will trigger load balance only after rq->next_balance.
>>
>> void trigger_load_balance(struct rq *rq)
>> {
>> /* Don't need to rebalance while attached to NULL domain */
>> if (unlikely(on_null_domain(rq)))
>> return;
>>
>> if (time_after_eq(jiffies, rq->next_balance))
>> raise_softirq(SCHED_SOFTIRQ);
>>
>> nohz_balancer_kick(rq);
>> }
>>
>> And it seems like next_balance is set to be 60 Hz
>>
>> static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>> {
>> int continue_balancing = 1;
>> int cpu = rq->cpu;
>> int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
>> unsigned long interval;
>> struct sched_domain *sd;
>> /* Earliest time when we have to do rebalance again */
>> unsigned long next_balance = jiffies + 60*HZ;
> 
> This doesn't mean a 60 Hz period but 60*HZ, with HZ being the number of
> jiffies per second. We init next_balance with now + 60 sec to make
> sure it's far later than the next balance of the sched_domains.
> 
> Then, update_next_balance() keeps track of the first balance to happen next time.
> 

Thanks for pointing out my misread of the code.  In this case the
balance frequency should be lower than I thought, as balances should be 60 sec
apart in theory.
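
To make the units concrete, here is a minimal userspace C sketch of the point above,
assuming HZ=250 (CONFIG_HZ_250, as mentioned later in the thread). It is illustrative
only, not kernel code:

#include <stdio.h>

#define HZ 250  /* jiffies per second, i.e. CONFIG_HZ_250 */

int main(void)
{
        unsigned long jiffies = 1000000;         /* arbitrary "now" */
        unsigned long init = jiffies + 60 * HZ;  /* now + 60 seconds */

        printf("60*HZ = %d jiffies = %d seconds\n", 60 * HZ, (60 * HZ) / HZ);
        printf("a 60 Hz period would instead be about HZ/60 = %d jiffies\n", HZ / 60);
        printf("next_balance is initialized to %lu, i.e. now + 60s\n", init);
        return 0;
}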

>> Here's a snapshot of the trace. However I didn't have the current task in my 
>> trace.
>> You can tell the frequency that update_blocked_averages is called on
>> cpu 2 by the jiffies value.  They are quite close together (1 to 3 jiffies 
>> apart).
>> When I have a chance to get on the machine, I'll take another look
>> at the current task and whether we got to trigger_load_balance() from 
>> scheduler_tick().
>>
>>
>>  3.505 ( ): probe:update_blocked_averages:(810cf070) 
>> cpu=2 jiffies=0x1004fb731
>>  4.505 ( ): probe:update_blocked_averages:(810cf070) 
>> cpu=2 jiffies=0x1004fb732
>>  6.484 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
>>  6.506 ( ): probe:update_blocked_averages:(810cf070) 
>> cpu=2 jiffies=0x1004fb734
>>  9.503 ( ): probe:update_blocked_averages:(810cf070) 
>> cpu=2 jiffies=0x1004fb737
>> 11.504 ( ): probe:update_blocked_averages:(810cf070) 
>> cpu=2 jiffies=0x1004fb739
>> 11.602 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
>> 11.624 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 11.642 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 11.645 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 11.977 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.003 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.015 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.043 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.567 ( ): probe:update_blocked_averages:(810cf070) 
>> cpu=2 jiffies=0x1004fb73a
>> 13.856 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
>> 13.910 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.003 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.159 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.203 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.223 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.301 ( ): probe:newidle_balance:(810d2470) 
>> this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.504 ( ): probe:update_blocked_averages:(810cf070) 
>> cpu=2 jiffies=0x1004fb73c
>> 14.637 ( ): 
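
As a quick cross-check of the spacing discussed above, here is a small userspace
sketch that diffs the jiffies values of the update_blocked_averages events quoted
in the trace, assuming CONFIG_HZ_250 (4 ms per jiffy). The event list is copied
from the snippet and is illustrative, not exhaustive:

#include <stdio.h>

#define HZ 250

int main(void)
{
        /* jiffies of the update_blocked_averages events on cpu 2 above */
        unsigned long long ev[] = {
                0x1004fb731, 0x1004fb732, 0x1004fb734, 0x1004fb737,
                0x1004fb739, 0x1004fb73a, 0x1004fb73c,
        };
        int n = sizeof(ev) / sizeof(ev[0]);

        for (int i = 1; i < n; i++) {
                unsigned long long gap = ev[i] - ev[i - 1];
                printf("gap: %llu jiffies (%llu ms)\n", gap, gap * 1000 / HZ);
        }
        return 0;
}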

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-04-09 Thread Vincent Guittot
On Fri, 9 Apr 2021 at 01:05, Tim Chen  wrote:
>
>
>
>
> On 4/8/21 7:51 AM, Vincent Guittot wrote:
>
> >> I was surprised to find the overall cpu% consumption of 
> >> update_blocked_averages
> >> and throughput of the benchmark still didn't change much.  So I took a
> >> peek into the profile and found the update_blocked_averages calls shifted 
> >> to the idle load balancer.
> >> The call to update_blocked_averages was reduced in newidle_balance so the 
> >> patch did
> >> what we intended.  But the overall rate of calls to
> >
> > At least , we have removed the useless call to update_blocked_averages
> > in newidle_balance when we will not perform any newly idle load
> > balance
> >
> >> update_blocked_averages remain roughly the same, shifting from
> >> newidle_balance to run_rebalance_domains.
> >>
> >>100.00%  (810cf070)
> >> |
> >> ---update_blocked_averages
> >>|
> >>|--95.47%--run_rebalance_domains
> >>|  __do_softirq
> >>|  |
> >>|  |--94.27%--asm_call_irq_on_stack
> >>|  |  do_softirq_own_stack
> >
> > The calls to update_blocked_averages mainly come from SCHED_SOFTIRQ,
> > and as a result not from the new path
> > do_idle()->nohz_run_idle_balance() which has been added by this patch
> > to defer the call to update_nohz_stats() after newidle_balance() and
> > before entering idle.
> >
> >>|  |  |
> >>|  |  |--93.74%--irq_exit_rcu
> >>|  |  |  |
> >>|  |  |  
> >> |--88.20%--sysvec_apic_timer_interrupt
> >>|  |  |  |  
> >> asm_sysvec_apic_timer_interrupt
> >>|  |  |  |  |
> >>...
> >>|
> >>|
> >> --4.53%--newidle_balance
> >>   pick_next_task_fair
> >>
> >> I was expecting idle load balancer to be rate limited to 60 Hz, which
> >
> > Why 60Hz ?
> >
>
> My thinking is we will trigger load balance only after rq->next_balance.
>
> void trigger_load_balance(struct rq *rq)
> {
> /* Don't need to rebalance while attached to NULL domain */
> if (unlikely(on_null_domain(rq)))
> return;
>
> if (time_after_eq(jiffies, rq->next_balance))
> raise_softirq(SCHED_SOFTIRQ);
>
> nohz_balancer_kick(rq);
> }
>
> And it seems like next_balance is set to be 60 Hz
>
> static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> {
> int continue_balancing = 1;
> int cpu = rq->cpu;
> int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> unsigned long interval;
> struct sched_domain *sd;
> /* Earliest time when we have to do rebalance again */
> unsigned long next_balance = jiffies + 60*HZ;

This doesn't mean a 60 Hz period but 60*HZ, with HZ being the number of
jiffies per second. We init next_balance with now + 60 sec to make
sure it's far later than the next balance of the sched_domains.

Then, update_next_balance() keeps track of the first balance to happen next time.

>
>
> >> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
> >> When I did a trace on a single CPU, I see that update_blocked_averages
> >> are often called between 1 to 4 jiffies apart, which is at a much higher
> >> rate than I expected.  I haven't taken a closer look yet.  But you may
> >
> > 2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
> > - the need for an update of blocked load which should not happen more
> > than once every 32ms which means a rate of around 30Hz
> > - the need for a load balance of a sched_domain. The min interval for
> > a sched_domain is its weight when the CPU is idle which is usually few
> > jiffies
> >
> > The only idea that I have for now is that we spend less time in
> > newidle_balance which changes the dynamic of your system.
> >
> > In your trace, could you check if update_blocked_averages is called
> > during the tick? And is the current task the idle task?
>
> Here's a snapshot of the trace. However I didn't have the current task in my 
> trace.
> You can tell the frequency that update_blocked_averages is called on
> cpu 2 by the jiffies value.  They are quite close together (1 to 3 jiffies 
> apart).
> When I have a chance to get on the machine, I'll take another look
> at the current task and whether we got to trigger_load_balance() from 
> scheduler_tick().
>
>
>  3.505 ( ): probe:update_blocked_averages:(810cf070) 
> cpu=2 jiffies=0x1004fb731
>  4.505 ( ): probe:update_blocked_averages:(810cf070) 
> cpu=2 jiffies=0x1004fb732
>  6.484 ( ): probe:newidle_balance:(810d2470) 
> this_rq=0x88fe7f8aae00 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-04-08 Thread Tim Chen




On 4/8/21 7:51 AM, Vincent Guittot wrote:

>> I was surprised to find the overall cpu% consumption of 
>> update_blocked_averages
>> and throughput of the benchmark still didn't change much.  So I took a
>> peek into the profile and found the update_blocked_averages calls shifted to 
>> the idle load balancer.
>> The call to update_blocked_averages was reduced in newidle_balance so the 
>> patch did
>> what we intended.  But the overall rate of calls to
> 
> At least , we have removed the useless call to update_blocked_averages
> in newidle_balance when we will not perform any newly idle load
> balance
> 
>> update_blocked_averages remain roughly the same, shifting from
>> newidle_balance to run_rebalance_domains.
>>
>>100.00%  (810cf070)
>> |
>> ---update_blocked_averages
>>|
>>|--95.47%--run_rebalance_domains
>>|  __do_softirq
>>|  |
>>|  |--94.27%--asm_call_irq_on_stack
>>|  |  do_softirq_own_stack
> 
> The calls to update_blocked_averages mainly come from SCHED_SOFTIRQ,
> and as a result not from the new path
> do_idle()->nohz_run_idle_balance() which has been added by this patch
> to defer the call to update_nohz_stats() after newidle_balance() and
> before entering idle.
> 
>>|  |  |
>>|  |  |--93.74%--irq_exit_rcu
>>|  |  |  |
>>|  |  |  
>> |--88.20%--sysvec_apic_timer_interrupt
>>|  |  |  |  
>> asm_sysvec_apic_timer_interrupt
>>|  |  |  |  |
>>...
>>|
>>|
>> --4.53%--newidle_balance
>>   pick_next_task_fair
>>
>> I was expecting idle load balancer to be rate limited to 60 Hz, which
> 
> Why 60Hz ?
> 

My thinking is we will trigger load balance only after rq->next_balance.

void trigger_load_balance(struct rq *rq)
{
/* Don't need to rebalance while attached to NULL domain */
if (unlikely(on_null_domain(rq)))
return;

if (time_after_eq(jiffies, rq->next_balance))
raise_softirq(SCHED_SOFTIRQ);

nohz_balancer_kick(rq);
}

And it seems like next_balance is set to be 60 Hz

static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
int continue_balancing = 1;
int cpu = rq->cpu;
int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
unsigned long interval;
struct sched_domain *sd;
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;


>> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
>> When I did a trace on a single CPU, I see that update_blocked_averages
>> are often called between 1 to 4 jiffies apart, which is at a much higher
>> rate than I expected.  I haven't taken a closer look yet.  But you may
> 
> 2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
> - the need for an update of blocked load which should not happen more
> than once every 32ms which means a rate of around 30Hz
> - the need for a load balance of a sched_domain. The min interval for
> a sched_domain is its weight when the CPU is idle which is usually few
> jiffies
> 
> The only idea that I have for now is that we spend less time in
> newidle_balance which changes the dynamic of your system.
> 
> In your trace, could you check if update_blocked_averages is called
> during the tick? And is the current task the idle task?

Here's a snapshot of the trace. However I didn't have the current task in my 
trace.
You can tell the frequency that update_blocked_averages is called on
cpu 2 by the jiffies value.  They are quite close together (1 to 3 jiffies 
apart).
When I have a chance to get on the machine, I'll take another look
at the current task and whether we got to trigger_load_balance() from 
scheduler_tick().


 3.505 ( ): probe:update_blocked_averages:(810cf070) cpu=2 
jiffies=0x1004fb731
 4.505 ( ): probe:update_blocked_averages:(810cf070) cpu=2 
jiffies=0x1004fb732
 6.484 ( ): probe:newidle_balance:(810d2470) 
this_rq=0x88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
 6.506 ( ): probe:update_blocked_averages:(810cf070) cpu=2 
jiffies=0x1004fb734
 9.503 ( ): probe:update_blocked_averages:(810cf070) cpu=2 
jiffies=0x1004fb737
11.504 ( ): probe:update_blocked_averages:(810cf070) cpu=2 
jiffies=0x1004fb739
11.602 ( ): probe:newidle_balance:(810d2470) 
this_rq=0x88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
11.624 ( ): probe:newidle_balance:(810d2470) 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-04-08 Thread Vincent Guittot
On Wed, 7 Apr 2021 at 19:19, Tim Chen  wrote:
>
>
>
> On 4/7/21 7:02 AM, Vincent Guittot wrote:
> > Hi Tim,
> >
> > On Wed, 24 Mar 2021 at 17:05, Tim Chen  wrote:
> >>
> >>
> >>
> >> On 3/24/21 6:44 AM, Vincent Guittot wrote:
> >>> Hi Tim,
> >>
> >>>
> >>> IIUC your problem, we call update_blocked_averages() but because of:
> >>>
> >>>   if (this_rq->avg_idle < curr_cost + 
> >>> sd->max_newidle_lb_cost) {
> >>>   update_next_balance(sd, &next_balance);
> >>>   break;
> >>>   }
> >>>
> >>> the for_each_domain loop stops even before running load_balance on the 1st
> >>> sched domain level which means that update_blocked_averages() was called
> >>> unnecessarily.
> >>>
> >>
> >> That's right
> >>
> >>> And this is even more true with a small sysctl_sched_migration_cost which 
> >>> allows newly
> >>> idle LB for very small this_rq->avg_idle. We could wonder why you set 
> >>> such a low value
> >>> for sysctl_sched_migration_cost which is lower than the 
> >>> max_newidle_lb_cost of the
> >>> smallest domain but that's probably because of task_hot().
> >>>
> >>> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st 
> >>> sched_domain, we should
> >>> skip spin_unlock/lock and for_each_domain() loop entirely
> >>>
> >>> Maybe something like below:
> >>>
> >>
> >> The patch makes sense.  I'll ask our benchmark team to queue this patch 
> >> for testing.
> >
> > Do you have feedback from your benchmark team ?
> >
>
> Vincent,
>
> Thanks for following up. I just got some data back from the benchmark team.
> The performance didn't change with your patch.  And the overall cpu% of 
> update_blocked_averages
> also remain at about the same level.  My first thought was perhaps this update
> still didn't catch all the calls to update_blocked_averages
>
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> -   !READ_ONCE(this_rq->rd->overload)) {
> +   !READ_ONCE(this_rq->rd->overload) ||
> +   (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> To experiment, I added one more check on the next_balance to further limit
> the path to actually do idle load balance with the next_balance time.
>
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> -   !READ_ONCE(this_rq->rd->overload)) {
> +   time_before(jiffies, this_rq->next_balance) ||
> +   !READ_ONCE(this_rq->rd->overload) ||
> +   (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> I was surprised to find the overall cpu% consumption of update_blocked_averages
> and throughput of the benchmark still didn't change much.  So I took a
> peek into the profile and found the update_blocked_averages calls shifted to 
> the idle load balancer.
> The call to update_blocked_averages was reduced in newidle_balance so the 
> patch did
> what we intended.  But the overall rate of calls to

At least , we have removed the useless call to update_blocked_averages
in newidle_balance when we will not perform any newly idle load
balance

> update_blocked_averages remain roughly the same, shifting from
> newidle_balance to run_rebalance_domains.
>
>100.00%  (810cf070)
> |
> ---update_blocked_averages
>|
>|--95.47%--run_rebalance_domains
>|  __do_softirq
>|  |
>|  |--94.27%--asm_call_irq_on_stack
>|  |  do_softirq_own_stack

The calls to update_blocked_averages mainly come from SCHED_SOFTIRQ,
and as a result not from the new path
do_idle()->nohz_run_idle_balance() which has been added by this patch
to defer the call to update_nohz_stats() after newidle_balance() and
before entering idle.
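
As a rough sketch of that deferral idea (illustrative userspace code; the names
mirror the discussion but this is not the kernel implementation of
nohz_run_idle_balance()/update_nohz_stats()):

#include <stdbool.h>
#include <stdio.h>

static bool blocked_load_update_pending;

static void newidle_balance(void)
{
        /* try to pull a task without touching the blocked load ... */
        blocked_load_update_pending = true;   /* defer the expensive update */
}

static void do_idle(int nr_running)
{
        /* run the deferred update only if nothing got queued in the meantime */
        if (blocked_load_update_pending && nr_running == 0) {
                printf("updating blocked load just before entering idle\n");
                blocked_load_update_pending = false;
        }
        /* ... enter the idle loop ... */
}

int main(void)
{
        newidle_balance();
        do_idle(0);   /* CPU is still idle: the deferred update runs here */

        newidle_balance();
        do_idle(1);   /* a task showed up meanwhile: the update is skipped */
        return 0;
}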

>|  |  |
>|  |  |--93.74%--irq_exit_rcu
>|  |  |  |
>|  |  |  
> |--88.20%--sysvec_apic_timer_interrupt
>|  |  |  |  
> asm_sysvec_apic_timer_interrupt
>|  |  |  |  |
>...
>|
>|
> --4.53%--newidle_balance
>   pick_next_task_fair
>
> I was expecting idle load balancer to be rate limited to 60 Hz, which

Why 60Hz ?

> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
> When I did a trace on a single CPU, I see that update_blocked_averages
> are often called between 1 to 4 jiffies apart, which is at a much higher
> rate than I expected.  I haven't taken a closer look yet.  But you may

2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
- the need for an update of blocked load, which should not happen more
than once every 32ms, which means a rate of around 30Hz
- the need for a load balance of a sched_domain. The min interval for
a sched_domain is its weight when the CPU is idle, which is usually a few
jiffies
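
For the first trigger, here is a hedged userspace sketch of a jiffies-based
throttle that allows at most one blocked-load update every 32ms (LOAD_AVG_PERIOD),
assuming HZ=250. It mirrors the nohz.next_blocked handling visible in the patch
quoted later in the thread, but it is not the kernel code itself:

#include <stdbool.h>
#include <stdio.h>

#define HZ              250
#define LOAD_AVG_PERIOD 32    /* ms */
#define MSECS_TO_JIFFIES(ms)  ((ms) * HZ / 1000)

static unsigned long next_blocked;    /* earliest jiffy for the next update */

static bool blocked_load_update_allowed(unsigned long jiffies)
{
        if ((long)(jiffies - next_blocked) < 0)   /* i.e. time_before(jiffies, next_blocked) */
                return false;
        next_blocked = jiffies + MSECS_TO_JIFFIES(LOAD_AVG_PERIOD);
        return true;
}

int main(void)
{
        for (unsigned long j = 0; j < 40; j++)    /* 40 jiffies = 160 ms at HZ=250 */
                if (blocked_load_update_allowed(j))
                        printf("update allowed at jiffy %lu\n", j);   /* 0, 8, 16, 24, 32 */
        return 0;
}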

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-04-07 Thread Tim Chen



On 4/7/21 7:02 AM, Vincent Guittot wrote:
> Hi Tim,
> 
> On Wed, 24 Mar 2021 at 17:05, Tim Chen  wrote:
>>
>>
>>
>> On 3/24/21 6:44 AM, Vincent Guittot wrote:
>>> Hi Tim,
>>
>>>
>>> IIUC your problem, we call update_blocked_averages() but because of:
>>>
>>>   if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>   update_next_balance(sd, &next_balance);
>>>   break;
>>>   }
>>>
>>> the for_each_domain loop stops even before running load_balance on the 1st
>>> sched domain level which means that update_blocked_averages() was called
>>> unnecessarily.
>>>
>>
>> That's right
>>
>>> And this is even more true with a small sysctl_sched_migration_cost which 
>>> allows newly
>>> idle LB for very small this_rq->avg_idle. We could wonder why you set such 
>>> a low value
>>> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost 
>>> of the
>>> smallest domain but that's probably because of task_hot().
>>>
>>> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st 
>>> sched_domain, we should
>>> skip spin_unlock/lock and for_each_domain() loop entirely
>>>
>>> Maybe something like below:
>>>
>>
>> The patch makes sense.  I'll ask our benchmark team to queue this patch for 
>> testing.
> 
> Do you have feedback from your benchmark team ?
> 

Vincent,

Thanks for following up. I just got some data back from the benchmark team.
The performance didn't change with your patch.  And the overall cpu% of 
update_blocked_averages
also remain at about the same level.  My first thought was perhaps this update
still didn't catch all the calls to update_blocked_averages

if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-   !READ_ONCE(this_rq->rd->overload)) {
+   !READ_ONCE(this_rq->rd->overload) ||
+   (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

To experiment, I added one more check on the next_balance to further limit
the path to actually do idle load balance with the next_balance time.

if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-   !READ_ONCE(this_rq->rd->overload)) {
+   time_before(jiffies, this_rq->next_balance) ||  
+   !READ_ONCE(this_rq->rd->overload) ||
+   (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

I was surprised to find the overall cpu% consumption of update_blocked_averages
and throughput of the benchmark still didn't change much.  So I took a
peek into the profile and found the update_blocked_averages calls shifted to
the idle load balancer.
The call to update_blocked_averages was reduced in newidle_balance so the patch
did what we intended.  But the overall rate of calls to
update_blocked_averages remains roughly the same, shifting from
newidle_balance to run_rebalance_domains.

   100.00%  (810cf070)
|
---update_blocked_averages
   |  
   |--95.47%--run_rebalance_domains
   |  __do_softirq
   |  |  
   |  |--94.27%--asm_call_irq_on_stack
   |  |  do_softirq_own_stack
   |  |  |  
   |  |  |--93.74%--irq_exit_rcu
   |  |  |  |  
   |  |  |  
|--88.20%--sysvec_apic_timer_interrupt
   |  |  |  |  
asm_sysvec_apic_timer_interrupt
   |  |  |  |  |  
   ...
   |
   |  
--4.53%--newidle_balance
  pick_next_task_fair

I was expecting idle load balancer to be rate limited to 60 Hz, which
should be 15 jiffies apart on the test system with CONFIG_HZ_250.
When I did a trace on a single CPU, I see that update_blocked_averages
are often called between 1 to 4 jiffies apart, which is at a much higher
rate than I expected.  I haven't taken a closer look yet.  But you may
have a better idea.  I won't have access to the test system and workload
till probably next week.

Thanks.

Tim


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-04-07 Thread Vincent Guittot
Hi Tim,

On Wed, 24 Mar 2021 at 17:05, Tim Chen  wrote:
>
>
>
> On 3/24/21 6:44 AM, Vincent Guittot wrote:
> > Hi Tim,
>
> >
> > IIUC your problem, we call update_blocked_averages() but because of:
> >
> >   if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> >   update_next_balance(sd, &next_balance);
> >   break;
> >   }
> >
> > the for_each_domain loop stops even before running load_balance on the 1st
> > sched domain level which means that update_blocked_averages() was called
> > unnecessarily.
> >
>
> That's right
>
> > And this is even more true with a small sysctl_sched_migration_cost which 
> > allows newly
> > idle LB for very small this_rq->avg_idle. We could wonder why you set such 
> > a low value
> > for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost 
> > of the
> > smallest domain but that's probably because of task_hot().
> >
> > if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st 
> > sched_domain, we should
> > skip spin_unlock/lock and for_each_domain() loop entirely
> >
> > Maybe something like below:
> >
>
> The patch makes sense.  I'll ask our benchmark team to queue this patch for 
> testing.

Do you have feedback from your benchmark team ?

Regards,
Vincent
>
> Tim
>
>


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-03-24 Thread Tim Chen



On 3/24/21 6:44 AM, Vincent Guittot wrote:
> Hi Tim,

> 
> IIUC your problem, we call update_blocked_averages() but because of:
> 
>   if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>   update_next_balance(sd, &next_balance);
>   break;
>   }
> 
> the for_each_domain loop stops even before running load_balance on the 1st
> sched domain level which means that update_blocked_averages() was called
> unnecessarily. 
> 

That's right

> And this is even more true with a small sysctl_sched_migration_cost which 
> allows newly
> idle LB for very small this_rq->avg_idle. We could wonder why you set such a 
> low value 
> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost 
> of the
> smallest domain but that's probably because of task_hot().
> 
> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st 
> sched_domain, we should
> skip spin_unlock/lock and for_each_domain() loop entirely
> 
> Maybe something like below:
> 

The patch makes sense.  I'll ask our benchmark team to queue this patch for 
testing.

Tim




Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-03-24 Thread Vincent Guittot
Hi Tim,

On Tuesday, 23 March 2021 at 14:37:59 (-0700), Tim Chen wrote:
> 
> 
> On 1/29/21 9:27 AM, Vincent Guittot wrote:
> > 
> > The patch below moves the update of the blocked load of CPUs outside 
> > newidle_balance().
> 
> On a well known database workload, we also saw a lot of overhead to do 
> update_blocked_averages
> in newidle_balance().  So changes to reduce this overhead is much welcomed.
> 
> Turning on cgroup induces 9% throughput degradation on a 2 socket 40 cores 
> per socket Icelake system.  
> 
> A big part of the overhead in our database workload comes from updating
> blocked averages in newidle_balance, caused by I/O threads making
> some CPUs go in and out of idle frequently in the following code path:
> 
> __blkdev_direct_IO_simple
>   io_schedule_timeout
>     schedule_timeout
>       schedule
>         __schedule
>           pick_next_task_fair
>             newidle_balance
>               update_blocked_averages
> 
> We found update_blocked_averages() now consumed most CPU time, eating up 2% 
> of the CPU cycles once cgroup
> gets turned on.
> 
> I hacked up Joe's original patch to rate limit the update of blocked
> averages called from newidle_balance().  The 9% throughput degradation 
> reduced to
> 5.4%.  We'll be testing Vincent's change to see if it can give
> similar performance improvement.
> 
> Though in our test environment, sysctl_sched_migration_cost was kept
> much lower (25000) compared to the default (50), to encourage migrations 
> to idle cpu
> and reduce latency.  We got quite a lot of calls to update_blocked_averages 
> directly 
> and then try to load_balance in newidle_balance instead of relegating
> the responsibility to idle load balancer.  (See code snippet in 
> newidle_balance below)  
> 
> 
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||   
> <-sched_migration_cost check
> !READ_ONCE(this_rq->rd->overload)) {
> 
> rcu_read_lock();
> sd = rcu_dereference_check_sched_domain(this_rq->sd);
> if (sd)
> update_next_balance(sd, &next_balance);
> rcu_read_unlock();
> 
> goto out;  <--- invoke idle load balancer
> }
> 
> raw_spin_unlock(&this_rq->lock);
> 
> update_blocked_averages(this_cpu);
> 
>    followed by load balance code ---
> 
 
> So the update_blocked_averages offload to idle_load_balancer in Vincent's 
> patch is less 
> effective in this case with small sched_migration_cost.
> 
> Looking at the code a bit more, we don't actually load balance every time in 
> this code path
> unless our avg_idle time exceeds some threshold.  Doing 
> update_blocked_averages immediately 

IIUC your problem, we call update_blocked_averages() but because of:

if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
update_next_balance(sd, &next_balance);
break;
}

the for_each_domain loop stops even before running load_balance on the 1st
sched domain level which means that update_blocked_averages() was called
unnecessarily. 

And this is even more true with a small sysctl_sched_migration_cost which 
allows newly
idle LB for very small this_rq->avg_idle. We could wonder why you set such a 
low value 
for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of 
the
smallest domain but that's probably because of task_hot().

if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, 
we should
skip spin_unlock/lock and for_each_domain() loop entirely

Maybe something like below:


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76e33a70d575..08933e0d87ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10723,17 +10723,21 @@ static int newidle_balance(struct rq *this_rq, struct 
rq_flags *rf)
 */
rq_unpin_lock(this_rq, rf);

+   rcu_read_lock();
+   sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-   !READ_ONCE(this_rq->rd->overload)) {
+   !READ_ONCE(this_rq->rd->overload) ||
+   (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

-   rcu_read_lock();
-   

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-03-23 Thread Tim Chen



On 1/29/21 9:27 AM, Vincent Guittot wrote:
> 
> The patch below moves the update of the blocked load of CPUs outside 
> newidle_balance().

On a well-known database workload, we also saw a lot of overhead to do
update_blocked_averages
in newidle_balance().  So changes to reduce this overhead are much welcomed.

Turning on cgroups induces a 9% throughput degradation on a 2-socket,
40-cores-per-socket Icelake system.

A big part of the overhead in our database workload comes from updating
blocked averages in newidle_balance, caused by I/O threads making
some CPUs go in and out of idle frequently in the following code path:

__blkdev_direct_IO_simple
  io_schedule_timeout
    schedule_timeout
      schedule
        __schedule
          pick_next_task_fair
            newidle_balance
              update_blocked_averages

We found update_blocked_averages() now consumed most CPU time, eating up 2% of 
the CPU cycles once cgroup
gets turned on.

I hacked up Joe's original patch to rate limit the update of blocked
averages called from newidle_balance().  The 9% throughput degradation reduced 
to
5.4%.  We'll be testing Vincent's change to see if it can give
similar performance improvement.

Though in our test environment, sysctl_sched_migration_cost was kept
much lower (25000) compared to the default (500000), to encourage migrations to
idle cpus and reduce latency.  We got quite a lot of calls to update_blocked_averages
directly, and then try to load_balance in newidle_balance instead of relegating
the responsibility to the idle load balancer.  (See code snippet in newidle_balance
below)


if (this_rq->avg_idle < sysctl_sched_migration_cost ||   
<-sched_migration_cost check
!READ_ONCE(this_rq->rd->overload)) {

rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();

goto out;  <--- invoke idle load balancer
}

raw_spin_unlock(&this_rq->lock);

update_blocked_averages(this_cpu);

 followed by load balance code ---


So the update_blocked_averages offload to idle_load_balancer in Vincent's patch 
is less 
effective in this case with small sched_migration_cost.

Looking at the code a bit more, we don't actually load balance every time in 
this code path
unless our avg_idle time exceeds some threshold.  Doing update_blocked_averages 
immediately 
is only needed if we do call load_balance().  If we don't do any load balance 
in the code path,
we can let the idle load balancer update the blocked averages lazily.

Something like the following perhaps on top of Vincent's patch?  We haven't 
really tested
this change yet but want to see if this change makes sense to you.

Tim

Signed-off-by: Tim Chen 
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63950d80fd0b..b93f5f52658a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct 
rq_flags *rf)
struct sched_domain *sd;
int pulled_task = 0;
u64 curr_cost = 0;
+   bool updated_blocked_avg = false;
 
update_misfit_status(NULL, this_rq);
/*
@@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct 
rq_flags *rf)
 
raw_spin_unlock(&this_rq->lock);
 
-   update_blocked_averages(this_cpu);
rcu_read_lock();
for_each_domain(this_cpu, sd) {
int continue_balancing = 1;
@@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct 
rq_flags *rf)
}
 
if (sd->flags & SD_BALANCE_NEWIDLE) {
+   if (!updated_blocked_avg) {
+   update_blocked_averages(this_cpu);
+   updated_blocked_avg = true;
+   }
+
t0 = sched_clock_cpu(this_cpu);
 
pulled_task = load_balance(this_cpu, this_rq,
 



Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-04 Thread Qais Yousef
On 02/03/21 18:35, Vincent Guittot wrote:
> > >   raw_spin_unlock(&this_rq->lock);
> > > - /*
> > > -  * This CPU is going to be idle and blocked load of idle CPUs
> > > -  * need to be updated. Run the ilb locally as it is a good
> > > -  * candidate for ilb instead of waking up another idle CPU.
> > > -  * Kick an normal ilb if we failed to do the update.
> > > -  */
> > > - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> >
> > Since we removed the call to this function (which uses this_rq)
> >
> > > - kick_ilb(NOHZ_STATS_KICK);
> > > + kick_ilb(NOHZ_STATS_KICK);
> >
> > And unconditionally call kick_ilb() which will find a suitable cpu to run 
> > the
> > lb at regardless what this_rq is.
> >
> > Doesn't the below become unnecessary now?
> 
> The end goal is to keep running on this cpu that is about to become idle.
> 
> This patch is mainly there to check that Joel's problem disappears if
> the update of the blocked load of the cpus is not done in the
> newidle_balance context. I'm preparing a few other patches on top to
> clean up the full path.

+1

> >
> >   10494 /*
> >   10495  * This CPU doesn't want to be disturbed by 
> > scheduler
> >   10496  * housekeeping
> >   10497  */
> >   10498 if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
> >   10499 return;
> >   10500
> >   10501 /* Will wake up very soon. No time for doing 
> > anything else*/
> >   10502 if (this_rq->avg_idle < sysctl_sched_migration_cost)
> >   10503 return;
> >
> > And we can drop this_rq arg altogether?
> >
> > >   raw_spin_lock(&this_rq->lock);
> > >  }
> > >
> > > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, 
> > > struct rq_flags *rf)
> > >   update_next_balance(sd, &next_balance);
> > >   rcu_read_unlock();
> > >
> > > - nohz_newidle_balance(this_rq);
> > > -
> > >   goto out;
> > >   }
> > >
> > > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, 
> > > struct rq_flags *rf)
> > >
> > >   if (pulled_task)
> > >   this_rq->idle_stamp = 0;
> > > + else
> > > + nohz_newidle_balance(this_rq);
> >
> > Since nohz_newidle_balance() will not do any real work now, I couldn't 
> > figure
> > out what moving this here achieves. Fault from my end to parse the change 
> > most
> > likely :-)
> 
> The goal is to schedule the update only if we are about to be idle and
> nothing else has been queued in the meantime

I see. This shortcoming already existed and is not *strictly* related to moving
the update of blocked load out of newidle balance.

Thanks

--
Qais Yousef


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-04 Thread Dietmar Eggemann
On 03/02/2021 14:12, Vincent Guittot wrote:
> On Wed, 3 Feb 2021 at 12:54, Dietmar Eggemann  
> wrote:
>>
>> On 29/01/2021 18:27, Vincent Guittot wrote:
>>> On Friday, 29 January 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
 On Thu, 28 Jan 2021 at 16:09, Joel Fernandes  
 wrote:
>
> Hi Vincent,
>
> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
>  wrote:
>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
 On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  
 wrote:
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>  wrote:
>>
>> [...]
>>
 The only point that I agree with, is that running
 update_blocked_averages with preempt and irq off is not a good thing
 because we don't manage the number of cfs_rq to update and I'm going
 to provide a patchset for this
>>>
>>> The patch below moves the update of the blocked load of CPUs outside 
>>> newidle_balance().
>>>
>>> Instead, the update is done with the usual idle load balance update. I'm 
>>> working on an
>>> additional patch that will select this cpu that is about to become idle, 
>>> instead of a
>>> random idle cpu, but this 1st step fixes the problem of lots of updates in 
>>> newly idle.
>>
>> I'm trying to understand the need for this extra patch.
>>
>> The patch below moves away from doing update_blocked_averages() (1) for
>> all CPUs in sched groups of the sched domain:
>>
>> newidle_balance()->load_balance()->
>> find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()
>>
>> to:
>>
>> calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
>> update_nohz_stats() and for the ilb CPU.
>>
>> newidle_balance() calls (1) for newidle CPU already.
>>
>> What would be the benefit to choose newidle CPU as ilb CPU?
> 
> To prevent waking up another idle cpu to run the update whereas the
> newidle cpu is already woken up and about to be idle, so it is the best
> candidate.
> The whole aim of the removed code was to prevent waking up an idle cpu
> for doing something that could be done by the newidle cpu before it
> enters idle state.

Ah OK, makes sense. So the only diff would be that newidle CPU will call
(1) on nohz.idle_cpus_mask rather than on sd->span.
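
A small illustrative sketch of that scope change, with plain bitmasks standing in
for cpumasks; the masks and helper below are made up for the example and are not
the kernel's:

#include <stdio.h>

static void update_blocked_averages(int cpu)
{
        printf("update blocked load of cpu %d\n", cpu);
}

static void refresh_mask(unsigned long mask)
{
        for (int cpu = 0; cpu < 8; cpu++)
                if (mask & (1UL << cpu))
                        update_blocked_averages(cpu);
}

int main(void)
{
        unsigned long sd_span = 0xff;          /* all 8 CPUs of the sched domain */
        unsigned long nohz_idle_cpus = 0x28;   /* say only CPUs 3 and 5 are idle */

        puts("old path (every CPU in the domain span):");
        refresh_mask(sd_span);

        puts("new path (only the nohz idle CPUs):");
        refresh_mask(nohz_idle_cpus);
        return 0;
}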

[...]


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-03 Thread Joel Fernandes
On Fri, Jan 29, 2021 at 06:27:27PM +0100, Vincent Guittot wrote:
[...]
> > update_blocked_averages with preempt and irq off is not a good thing
> > because we don't manage the number of cfs_rq to update and I'm going
> > to provide a patchset for this
> 
> The patch below moves the update of the blocked load of CPUs outside 
> newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm 
> working on an
> additional patch that will select this cpu that is about to become idle, 
> instead of a
> random idle cpu, but this 1st step fixes the problem of lots of updates in newly 
> idle.
> 
> Signed-off-by: Vincent Guittot 

I confirmed that with this patch, I don't see the preemptoff issues related
to update_blocked_averages() anymore (tested using preemptoff tracer).

I went through the patch and it looks correct to me, I will further review it
and await further reviews from others as well, and then backport the patch to
our kernels. Thanks Vince and everyone!

Tested-by: Joel Fernandes (Google) 

thanks,

 - Joel



> ---
>  kernel/sched/fair.c | 32 +++-
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK   0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED  0x08
> -#define LBF_NOHZ_STATS   0x10
> -#define LBF_NOHZ_AGAIN   0x20
>  
>  struct lb_env {
>   struct sched_domain *sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   struct rq *rq = cpu_rq(i);
>  
> - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, 
> false))
> - env->flags |= LBF_NOHZ_AGAIN;
> -
>   sgs->group_load += cpu_load(rq);
>   sgs->group_util += cpu_util(i);
>   sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   struct sg_lb_stats tmp_sgs;
>   int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> - env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>   do {
>   struct sg_lb_stats *sgs = &tmp_sgs;
>   int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   /* Tag domain that child domain prefers tasks go to siblings first */
>   sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if ((env->flags & LBF_NOHZ_AGAIN) &&
> - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> - WRITE_ONCE(nohz.next_blocked,
> -jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> - }
> -#endif
> -
>   if (env->sd->flags & SD_NUMA)
>   env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>   return;
>  
>   raw_spin_unlock(&this_rq->lock);
> - /*
> -  * This CPU is going to be idle and blocked load of idle CPUs
> -  * need to be updated. Run the ilb locally as it is a good
> -  * candidate for ilb instead of waking up another idle CPU.
> -  * Kick an normal ilb if we failed to do the update.
> -  */
> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> - kick_ilb(NOHZ_STATS_KICK);
> + kick_ilb(NOHZ_STATS_KICK);
>   raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   update_next_balance(sd, &next_balance);
>   rcu_read_unlock();
>  
> - nohz_newidle_balance(this_rq);
> -
>   goto out;
>   }
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>  
>   if (pulled_task)
>   this_rq->idle_stamp = 0;
> + else
> + nohz_newidle_balance(this_rq);
>  
>   rq_repin_lock(this_rq, rf);
>  
> -- 
> 2.17.1
> 
> 
> > 
> > > for this.
> > >
> > > > Also update_blocked_averages was supposed called in newlyidle_balance
> > > > when the coming idle duration is expected to be long enough
> > >
> > > No, we do not want the schedule loop to take half a millisecond.
> > 
> > keep in mind that you are scaling frequency so everything takes time
> > at lowest frequency/capacity ...
> > 
> > >
> > > > > > IIUC, your real problem is that newidle_balance is running whereas a
> > > > > > task is about to wake up on the cpu and we don't abort quickly 
> > > > > > during
> > > > > > 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-03 Thread Vincent Guittot
On Wed, 3 Feb 2021 at 18:09, Qais Yousef  wrote:
>
> On 01/29/21 18:27, Vincent Guittot wrote:
> > The patch below moves the update of the blocked load of CPUs outside 
> > newidle_balance().
> >
> > Instead, the update is done with the usual idle load balance update. I'm 
> > working on an
> > additional patch that will select this cpu that is about to become idle, 
> > instead of a
> > random idle cpu, but this 1st step fixes the problem of lots of updates in 
> > newly idle.
> >
> > Signed-off-by: Vincent Guittot 
> > ---
> >  kernel/sched/fair.c | 32 +++-
> >  1 file changed, 3 insertions(+), 29 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 197a51473e0c..8200b1d4df3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7421,8 +7421,6 @@ enum migration_type {
> >  #define LBF_NEED_BREAK   0x02
> >  #define LBF_DST_PINNED  0x04
> >  #define LBF_SOME_PINNED  0x08
> > -#define LBF_NOHZ_STATS   0x10
> > -#define LBF_NOHZ_AGAIN   0x20
> >
> >  struct lb_env {
> >   struct sched_domain *sd;
> > @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env 
> > *env,
> >   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> >   struct rq *rq = cpu_rq(i);
> >
> > - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, 
> > false))
> > - env->flags |= LBF_NOHZ_AGAIN;
> > -
> >   sgs->group_load += cpu_load(rq);
> >   sgs->group_util += cpu_util(i);
> >   sgs->group_runnable += cpu_runnable(rq);
> > @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> > *env, struct sd_lb_stats *sd
> >   struct sg_lb_stats tmp_sgs;
> >   int sg_status = 0;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> > - env->flags |= LBF_NOHZ_STATS;
> > -#endif
> > -
> >   do {
> >   struct sg_lb_stats *sgs = &tmp_sgs;
> >   int local_group;
> > @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> > *env, struct sd_lb_stats *sd
> >   /* Tag domain that child domain prefers tasks go to siblings first */
> >   sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > - if ((env->flags & LBF_NOHZ_AGAIN) &&
> > - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> > -
> > - WRITE_ONCE(nohz.next_blocked,
> > -jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> > - }
> > -#endif
> > -
> >   if (env->sd->flags & SD_NUMA)
> >   env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> >
> > @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq 
> > *this_rq)
> >   return;
> >
> >   raw_spin_unlock(&this_rq->lock);
> > - /*
> > -  * This CPU is going to be idle and blocked load of idle CPUs
> > -  * need to be updated. Run the ilb locally as it is a good
> > -  * candidate for ilb instead of waking up another idle CPU.
> > -  * Kick an normal ilb if we failed to do the update.
> > -  */
> > - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>
> Since we removed the call to this function (which uses this_rq)
>
> > - kick_ilb(NOHZ_STATS_KICK);
> > + kick_ilb(NOHZ_STATS_KICK);
>
> And unconditionally call kick_ilb() which will find a suitable cpu to run the
> lb at regardless what this_rq is.
>
> Doesn't the below become unnecessary now?

The end goal is to keep running on this cpu that is about to become idle.

This patch is mainly there to check that Joel's problem disappears if
the update of the blocked load of the cpus is not done in the
newidle_balance context. I'm preparing a few other patches on top to
clean up the full path.

>
>   10494 /*
>   10495  * This CPU doesn't want to be disturbed by scheduler
>   10496  * housekeeping
>   10497  */
>   10498 if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
>   10499 return;
>   10500
>   10501 /* Will wake up very soon. No time for doing anything 
> else*/
>   10502 if (this_rq->avg_idle < sysctl_sched_migration_cost)
>   10503 return;
>
> And we can drop this_rq arg altogether?
>
> >   raw_spin_lock(&this_rq->lock);
> >  }
> >
> > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, 
> > struct rq_flags *rf)
> >   update_next_balance(sd, &next_balance);
> >   rcu_read_unlock();
> >
> > - nohz_newidle_balance(this_rq);
> > -
> >   goto out;
> >   }
> >
> > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, 
> > struct rq_flags *rf)
> >
> >   if (pulled_task)
> > 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-03 Thread Qais Yousef
On 01/29/21 18:27, Vincent Guittot wrote:
> The patch below moves the update of the blocked load of CPUs outside 
> newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm 
> working on an
> additional patch that will select this cpu that is about to become idle, 
> instead of a
> random idle cpu, but this 1st step fixes the problem of lots of updates in newly 
> idle.
> 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 32 +++-
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK   0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED  0x08
> -#define LBF_NOHZ_STATS   0x10
> -#define LBF_NOHZ_AGAIN   0x20
>  
>  struct lb_env {
>   struct sched_domain *sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   struct rq *rq = cpu_rq(i);
>  
> - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, 
> false))
> - env->flags |= LBF_NOHZ_AGAIN;
> -
>   sgs->group_load += cpu_load(rq);
>   sgs->group_util += cpu_util(i);
>   sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   struct sg_lb_stats tmp_sgs;
>   int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> - env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>   do {
>   struct sg_lb_stats *sgs = &tmp_sgs;
>   int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   /* Tag domain that child domain prefers tasks go to siblings first */
>   sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if ((env->flags & LBF_NOHZ_AGAIN) &&
> - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> - WRITE_ONCE(nohz.next_blocked,
> -jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> - }
> -#endif
> -
>   if (env->sd->flags & SD_NUMA)
>   env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>   return;
>  
>   raw_spin_unlock(&this_rq->lock);
> - /*
> -  * This CPU is going to be idle and blocked load of idle CPUs
> -  * need to be updated. Run the ilb locally as it is a good
> -  * candidate for ilb instead of waking up another idle CPU.
> -  * Kick an normal ilb if we failed to do the update.
> -  */
> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))

Since we removed the call to this function (which uses this_rq)

> - kick_ilb(NOHZ_STATS_KICK);
> + kick_ilb(NOHZ_STATS_KICK);

And unconditionally call kick_ilb() which will find a suitable cpu to run the
lb at regardless what this_rq is.

Doesn't the below become unnecessary now?

  10494 /*
  10495  * This CPU doesn't want to be disturbed by scheduler
  10496  * housekeeping
  10497  */
  10498 if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
  10499 return;
  10500
  10501 /* Will wake up very soon. No time for doing anything 
else*/
  10502 if (this_rq->avg_idle < sysctl_sched_migration_cost)
  10503 return;

And we can drop this_rq arg altogether?

>   raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   update_next_balance(sd, &next_balance);
>   rcu_read_unlock();
>  
> - nohz_newidle_balance(this_rq);
> -
>   goto out;
>   }
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>  
>   if (pulled_task)
>   this_rq->idle_stamp = 0;
> + else
> + nohz_newidle_balance(this_rq);

Since nohz_newidle_balance() will not do any real work now, I couldn't figure
out what moving this here achieves. Fault from my end to parse the change most
likely :-)

Joel can still test this patch as is of course. This is just an early review
since I already spent the time trying to understand it.

Thanks

--
Qais Yousef

>  
>   rq_repin_lock(this_rq, rf);
>  
> -- 
> 2.17.1


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-03 Thread Vincent Guittot
On Wed, 3 Feb 2021 at 12:54, Dietmar Eggemann  wrote:
>
> On 29/01/2021 18:27, Vincent Guittot wrote:
> > On Friday, 29 January 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
> >> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes  
> >> wrote:
> >>>
> >>> Hi Vincent,
> >>>
> >>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
> >>>  wrote:
> > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> >> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  
> >> wrote:
> >>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>  On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>   wrote:
>
> [...]
>
> >> The only point that I agree with, is that running
> >> update_blocked_averages with preempt and irq off is not a good thing
> >> because we don't manage the number of cfs_rq to update and I'm going
> >> to provide a patchset for this
> >
> > The patch below moves the update of the blocked load of CPUs outside 
> > newidle_balance().
> >
> > Instead, the update is done with the usual idle load balance update. I'm 
> > working on an
> > additional patch that will select this cpu that is about to become idle, 
> > instead of a
> > random idle cpu, but this 1st step fixes the problem of lots of updates in 
> > newly idle.
>
> I'm trying to understand the need for this extra patch.
>
> The patch below moves away from doing update_blocked_averages() (1) for
> all CPUs in sched groups of the sched domain:
>
> newidle_balance()->load_balance()->
> find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()
>
> to:
>
> calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
> update_nohz_stats() and for the ilb CPU.
>
> newidle_balance() calls (1) for newidle CPU already.
>
> What would be the benefit to choose newidle CPU as ilb CPU?

To prevent waking up another idle cpu to run the update whereas the
newidle cpu is already woken up and about to be idle, so it is the best
candidate.
The whole aim of the removed code was to prevent waking up an idle cpu
for doing something that could be done by the newidle cpu before it
enters idle state.

>
> > Signed-off-by: Vincent Guittot 
> > ---
> >  kernel/sched/fair.c | 32 +++-
> >  1 file changed, 3 insertions(+), 29 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 197a51473e0c..8200b1d4df3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7421,8 +7421,6 @@ enum migration_type {
> >  #define LBF_NEED_BREAK   0x02
> >  #define LBF_DST_PINNED  0x04
> >  #define LBF_SOME_PINNED  0x08
> > -#define LBF_NOHZ_STATS   0x10
> > -#define LBF_NOHZ_AGAIN   0x20
> >
> >  struct lb_env {
> >   struct sched_domain *sd;
> > @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env 
> > *env,
> >   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> >   struct rq *rq = cpu_rq(i);
> >
> > - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, 
> > false))
> > - env->flags |= LBF_NOHZ_AGAIN;
> > -
> >   sgs->group_load += cpu_load(rq);
> >   sgs->group_util += cpu_util(i);
> >   sgs->group_runnable += cpu_runnable(rq);
> > @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> > *env, struct sd_lb_stats *sd
> >   struct sg_lb_stats tmp_sgs;
> >   int sg_status = 0;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> > - env->flags |= LBF_NOHZ_STATS;
> > -#endif
> > -
> >   do {
> >   struct sg_lb_stats *sgs = &tmp_sgs;
> >   int local_group;
> > @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> > *env, struct sd_lb_stats *sd
> >   /* Tag domain that child domain prefers tasks go to siblings first */
> >   sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > - if ((env->flags & LBF_NOHZ_AGAIN) &&
> > - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> > -
> > - WRITE_ONCE(nohz.next_blocked,
> > -jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> > - }
> > -#endif
> > -
> >   if (env->sd->flags & SD_NUMA)
> >   env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> >
> > @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq 
> > *this_rq)
> >   return;
> >
> >   raw_spin_unlock(&this_rq->lock);
> > - /*
> > -  * This CPU is going to be idle and blocked load of idle CPUs
> > -  * need to be updated. Run the ilb locally as it is a good
> > -  * candidate for ilb instead of waking up another idle CPU.
> > -  * Kick an normal ilb if we failed to do the update.
> > -  */
> > - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> > - 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-03 Thread Dietmar Eggemann
On 29/01/2021 18:27, Vincent Guittot wrote:
> On Friday, 29 January 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
>> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes  wrote:
>>>
>>> Hi Vincent,
>>>
>>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
>>>  wrote:
> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  
>> wrote:
>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
 On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
  wrote:

[...]

>> The only point that I agree with, is that running
>> update_blocked_averages with preempt and irq off is not a good thing
>> because we don't manage the number of cfs_rq to update and I'm going
>> to provide a patchset for this
> 
> The patch below moves the update of the blocked load of CPUs outside 
> newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm 
> working on an
> additional patch that will select this cpu that is about to become idle, 
> instead of a
> random idle cpu, but this 1st step fixes the problem of lots of updates in newly 
> idle.

I'm trying to understand the need for this extra patch.

The patch below moves away from doing update_blocked_averages() (1) for
all CPUs in sched groups of the sched domain:

newidle_balance()->load_balance()->
find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()

to:

calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
update_nohz_stats() and for the ilb CPU.

newidle_balance() calls (1) for newidle CPU already.

What would be the benefit to choose newidle CPU as ilb CPU?

> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 32 +++-
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK   0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED  0x08
> -#define LBF_NOHZ_STATS   0x10
> -#define LBF_NOHZ_AGAIN   0x20
>  
>  struct lb_env {
>   struct sched_domain *sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   struct rq *rq = cpu_rq(i);
>  
> - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, 
> false))
> - env->flags |= LBF_NOHZ_AGAIN;
> -
>   sgs->group_load += cpu_load(rq);
>   sgs->group_util += cpu_util(i);
>   sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   struct sg_lb_stats tmp_sgs;
>   int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> - env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>   do {
>   struct sg_lb_stats *sgs = &tmp_sgs;
>   int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   /* Tag domain that child domain prefers tasks go to siblings first */
>   sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if ((env->flags & LBF_NOHZ_AGAIN) &&
> - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> - WRITE_ONCE(nohz.next_blocked,
> -jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> - }
> -#endif
> -
>   if (env->sd->flags & SD_NUMA)
>   env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>   return;
>  
>   raw_spin_unlock(&this_rq->lock);
> - /*
> -  * This CPU is going to be idle and blocked load of idle CPUs
> -  * need to be updated. Run the ilb locally as it is a good
> -  * candidate for ilb instead of waking up another idle CPU.
> -  * Kick an normal ilb if we failed to do the update.
> -  */
> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> - kick_ilb(NOHZ_STATS_KICK);
> + kick_ilb(NOHZ_STATS_KICK);
>   raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   update_next_balance(sd, &next_balance);
>   rcu_read_unlock();
>  
> - nohz_newidle_balance(this_rq);
> -
>   goto out;
>   }
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>  
>   if (pulled_task)
>   this_rq->idle_stamp = 0;
> + else
> + nohz_newidle_balance(this_rq);

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-01 Thread Joel Fernandes
On Fri, Jan 29, 2021 at 5:33 AM Vincent Guittot
 wrote:
[...]
> > > > > So why is it a problem for you ? You are mentioning newly idle load
> > > > > balance so I assume that your root problem is the scheduling delay
> > > > > generated by the newly idle load balance which then calls
> > > > > update_blocked_averages
> > > >
> > > > Yes, the new idle balance is when I see it happen quite often. I do see 
> > > > it
> > > > happen with other load balance as well, but it not that often as those 
> > > > LB
> > > > don't run as often as new idle balance.
> > >
> > > The update of average blocked load has been added in newidle_balance
> > > to take advantage of the cpu becoming idle but it seems to create a
> > > long preempt off sequence. I'm going to prepare a patch to move it
> > > out of the schedule path.
> >
> > Ok thanks, that would really help!
> >
> > > > > rate limiting the call to update_blocked_averages() will only reduce
> > > > > the number of time it happens but it will not prevent it to happen.
> > > >
> > > > Sure, but soft real-time issue can tolerate if the issue does not 
> > > > happen very
> > > > often. In this case though, it is frequent.
> > >
> > > Could you provide details of the problem that you are facing ? It's
> > > not clear for me what happens in your case at the end. Have you got an
> > > audio glitch as an example?
> > >
> > > "Not often" doesn't really give any clue.
> >
> > I believe from my side I have provided details. I shared output from a
> > function graph tracer and schbench micro benchmark clearly showing the
> > issue and improvements. Sorry, I don't have a real-world reproducer
>
> In fact I disagree and I'm not sure that your results show the right
> problem but just a side effect related to your system.
>
> With schbench -t 2 -m 2 -r 5, the test runs 1 task per CPU and newidle
> balance should never be triggered because tasks will get an idle cpus
> everytime. When I run schbench -t 3 -m 2 -r 5 (in order to get 8
> threads on my 8 cpus system), all threads directly wake up on an idle
> cpu and newidle balance is never involved.
> As a result, the schbench
> results are not impacted by newidle_balance whatever its duration.

I disagree. Here you are assuming that schbench is the only task
running on the system. There are other processes, daemons as well. I
see a strong correlation between commenting out
update_blocked_averages() and not seeing the latency hit at the higher
percentiles.

> This means that a problem in newidle_balance doesn't impact schbench
> results with a correct task placement. This also means that in your
> case, some threads are placed on the same cpu and wait to be scheduled
> and finally a lot of things can generate the increased delay. If
> you look at your results for schbench -t 2 -m 2 -r 5: the *99.0th:
> 12656 (8 samples) shows a delay of around 12ms, which is the typical
> running time slice of a task when several tasks are fighting for the
> same cpu and one has to wait. So this result does not reflect the
> duration of newidle balance but instead that the task placement was
> wrong and one task has to wait before running. Your RFC patch probably
> changes the dynamic and as a result the task placement but it does not
> save 12ms and is irrelevant regarding the problem that you raised
> about the duration of update_blocked_load.
> If I run schbench -t 3 -m 2 -r 5 on a dragonboard 845 (4 little, 4
> big) with schedutil and EAS enabled:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 315 (159 samples)
> 75.0th: 735 (80 samples)
> 90.0th: 773 (48 samples)
> 95.0th: 845 (16 samples)
> *99.0th: 12336 (12 samples)
> 99.5th: 15408 (2 samples)
> 99.9th: 17504 (1 samples)
> min=4, max=17473

Sure, there could be a different problem causing these higher
percentile latencies on your device. I still think 12ms is awful.

> I have similar results and a quick look at the trace shows that 2
> tasks are fighting for the same cpu whereas there are idle cpus. Then
> If I use another cpufreq governor than schedutil like ondemand as an
> example, the EAS is disabled and the results becomes:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 232 (159 samples)
> 75.0th: 268 (80 samples)
> 90.0th: 292 (49 samples)
> 95.0th: 307 (15 samples)
> *99.0th: 394 (12 samples)
> 99.5th: 397 (2 samples)
> 99.9th: 400 (1 samples)
> min=114, max=400

Yes, definitely changing the governor also solves the problem (like
for example if performance governor is used). The problem happens at
low frequencies.

> So a quick and wrong conclusion could be to say that you should disable EAS 
> ;-)

Sure, except the power management guys may come after me ;-)

[..]
> The only point that I agree with, is that running
> update_blocked_averages with preempt and irq off is not a good thing
> because we don't manage the number of cfs_rq to update and

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-29 Thread Vincent Guittot
On Friday 29 Jan 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes  wrote:
> >
> > Hi Vincent,
> >
> > On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
> >  wrote:
> > > > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  
> > > > > wrote:
> > > > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, 
> > > > > > > > I found
> > > > > > > > that there are a lot of calls to update_blocked_averages(). 
> > > > > > > > This causes
> > > > > > > > the schedule loop to slow down to taking upto 500 micro seconds 
> > > > > > > > at
> > > > > > > > times (due to newidle load balance). I have also seen this 
> > > > > > > > manifest in
> > > > > > > > the periodic balancer.
> > > > > > > >
> > > > > > > > Closer look shows that the problem is caused by the following
> > > > > > > > ingredients:
> > > > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar 
> > > > > > > > for
> > > > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > > > __update_blocked_fair() take a long time.
> > > > > > >
> > > > > > > Inactive cgroups are removed from the list so they should not 
> > > > > > > impact
> > > > > > > the duration
> > > > > >
> > > > > > I meant blocked CGroups. According to this code, a cfs_rq can be 
> > > > > > partially
> > > > > > decayed and not have any tasks running on it but its load needs to 
> > > > > > be
> > > > > > decayed, correct? That's what I meant by 'inactive'. I can reword 
> > > > > > it to
> > > > > > 'blocked'.
> > > > >
> > > > > How many blocked cgroups have you got ?
> > > >
> > > > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many 
> > > > times
> > > > this loop runs per new idle balance. When the problem happens I see 
> > > > this loop
> > > > run 35-40 times (for one single instance of newidle balance). So in 
> > > > total
> > > > there are at least these many cfs_rq load updates.
> > >
> > > Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?
> >
> > All CPUs.
> >
> > > > I also see that new idle balance can be called 200-500 times per second.
> > >
> > > This is not surprising because newidle_balance() is called every time
> > > the CPU is about to become idle
> >
> > Sure.
> >
> > > > > >
> > > > > >   * There can be a lot of idle CPU cgroups.  Don't 
> > > > > > let fully
> > > > > >   * decayed cfs_rqs linger on the list.
> > > > > >   */
> > > > > >  if (cfs_rq_is_decayed(cfs_rq))
> > > > > >  list_del_leaf_cfs_rq(cfs_rq);
> > > > > >
> > > > > > > > 2. The device has a lot of CPUs in a cluster which causes 
> > > > > > > > schedutil in a
> > > > > > > > shared frequency domain configuration to be slower than usual. 
> > > > > > > > (the load
> > > > > > >
> > > > > > > What do you mean exactly by it causes schedutil to be slower than 
> > > > > > > usual ?
> > > > > >
> > > > > > sugov_next_freq_shared() is order number of CPUs in the a cluster. 
> > > > > > This
> > > > > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared 
> > > > > > policy
> > > > > > frequency update needs to go through utilization of other CPUs in 
> > > > > > the
> > > > > > cluster. I believe this could be adding to the problem but is not 
> > > > > > really
> > > > > > needed to optimize if we can rate limit the calls to 
> > > > > > update_blocked_averages
> > > > > > to begin with.
> > > > >
> > > > > Qais mentioned half of the time being used by
> > > > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > > > this call ?
> > > >
> > > > I do not see a frequency update happening at the time of the problem. 
> > > > However
> > > > note that sugov_iowait_boost() does run even if frequency is not being
> > > > updated. IIRC, this function is also not that light weight and I am not 
> > > > sure
> > > > if it is a good idea to call this that often.
> > >
> > > Scheduler can't make any assumption about how often schedutil/cpufreq
> > > wants to be called. Some are fast and straightforward and can be
> > > called very often to adjust frequency; Others can't handle much
> > > updates. The rate limit mechanism in schedutil and io-boost should be
> > > there for such purpose.
> >
> > Sure, I know that's the intention.
> >
> > > > > > > > average updates also try to update the frequency in schedutil).
> > > > > > > >
> > > > > > > > 3. The CPU is running at a low frequency causing the 
> > > > > > > > scheduler/schedutil
> > > > > > > > code paths to take longer than when running at a high CPU 
> > > > > > > > frequency.
> > > > > > >
> > > > > > > Low frequency 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-28 Thread Qais Yousef
On 01/28/21 10:09, Joel Fernandes wrote:
> > > > Qais mentioned half of the time being used by
> > > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > > this call ?
> > >
> > > I do not see a frequency update happening at the time of the problem. 
> > > However
> > > note that sugov_iowait_boost() does run even if frequency is not being
> > > updated. IIRC, this function is also not that light weight and I am not 
> > > sure
> > > if it is a good idea to call this that often.
> >
> > Scheduler can't make any assumption about how often schedutil/cpufreq
> > wants to be called. Some are fast and straightforward and can be
> > called very often to adjust frequency; Others can't handle much
> > updates. The rate limit mechanism in schedutil and io-boost should be
> > there for such purpose.
> 
> Sure, I know that's the intention.

How about the below patch to help with reducing the impact of
sugov_update_shared()?

I basically made sure it'll bail out immediately if it gets 2 calls within the
defined rate_limit_us value.

And tweaked the way we call cpufreq_update_util() from
update_blocked_averages() too so that we first update blocked load on all cpus,
then we ask for the frequency update. Combined with the above, this should
result in a single call to sugov_update_shared() for each policy. Each call
should see the final state of what the load really is.

Only compile tested!

Thanks

--
Qais Yousef

->8-


diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 6931f0cdeb80..bd83e9343749 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -109,7 +109,6 @@ static bool sugov_update_next_freq(struct sugov_policy 
*sg_policy, u64 time,
return false;
 
sg_policy->next_freq = next_freq;
-   sg_policy->last_freq_update_time = time;
 
return true;
 }
@@ -554,20 +553,23 @@ sugov_update_shared(struct update_util_data *hook, u64 
time, unsigned int flags)
 
raw_spin_lock(&sg_policy->update_lock);
 
+   ignore_dl_rate_limit(sg_cpu, sg_policy);
+
+   if (!sugov_should_update_freq(sg_policy, time))
+   goto out;
+
sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
 
-   ignore_dl_rate_limit(sg_cpu, sg_policy);
-
-   if (sugov_should_update_freq(sg_policy, time)) {
-   next_f = sugov_next_freq_shared(sg_cpu, time);
+   next_f = sugov_next_freq_shared(sg_cpu, time);
 
-   if (sg_policy->policy->fast_switch_enabled)
-   sugov_fast_switch(sg_policy, time, next_f);
-   else
-   sugov_deferred_update(sg_policy, time, next_f);
-   }
+   if (sg_policy->policy->fast_switch_enabled)
+   sugov_fast_switch(sg_policy, time, next_f);
+   else
+   sugov_deferred_update(sg_policy, time, next_f);
 
+   sg_policy->last_freq_update_time = time;
+out:
raw_spin_unlock(&sg_policy->update_lock);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..47bd8be03a6c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8033,7 +8033,7 @@ static unsigned long task_h_load(struct task_struct *p)
 }
 #endif
 
-static void update_blocked_averages(int cpu)
+static void update_blocked_averages(int cpu, bool update_freq)
 {
bool decayed = false, done = true;
struct rq *rq = cpu_rq(cpu);
@@ -8046,7 +8046,7 @@ static void update_blocked_averages(int cpu)
decayed |= __update_blocked_fair(rq, &done);
 
update_blocked_load_status(rq, !done);
-   if (decayed)
+   if (decayed && update_freq)
cpufreq_update_util(rq, 0);
rq_unlock_irqrestore(rq, &rf);
 }
@@ -8384,7 +8384,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
return true;
 
-   update_blocked_averages(cpu);
+   update_blocked_averages(cpu, false);
 
return rq->has_blocked_load;
 #else
@@ -8454,6 +8454,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
}
 
+   /*
+* We did a bunch a blocked average updates, let cpufreq know about
+* them in one go. For system with a lot of cpus on a frequency domain
+* this will make sure we take all the changes that affects the policy
+* in one go.
+*/
+   for_each_cpu_and(i, sched_group_span(group), env->cpus)
+   cpufreq_update_util(cpu_rq(i), 0);
+
/* Check if dst CPU is idle and preferred to this group */
if (env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE &&
@@ -10464,7 +10473,7 @@ static bool _nohz_idle_balance(struct rq *this_rq, 
unsigned int flags,
 
/* Newly idle CPU doesn't need an update */
if (idle != CPU_NEWLY_IDLE) {
-   update_blocked_averages(this_cpu);
+ 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-28 Thread Joel Fernandes
Hi Vincent,

On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
 wrote:
> > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  
> > > wrote:
> > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > >  wrote:
> > > > > >
> > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I 
> > > > > > found
> > > > > > that there are a lot of calls to update_blocked_averages(). This 
> > > > > > causes
> > > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > > times (due to newidle load balance). I have also seen this manifest 
> > > > > > in
> > > > > > the periodic balancer.
> > > > > >
> > > > > > Closer look shows that the problem is caused by the following
> > > > > > ingredients:
> > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > __update_blocked_fair() take a long time.
> > > > >
> > > > > Inactive cgroups are removed from the list so they should not impact
> > > > > the duration
> > > >
> > > > I meant blocked CGroups. According to this code, a cfs_rq can be 
> > > > partially
> > > > decayed and not have any tasks running on it but its load needs to be
> > > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > > 'blocked'.
> > >
> > > How many blocked cgroups have you got ?
> >
> > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> > this loop runs per new idle balance. When the problem happens I see this 
> > loop
> > run 35-40 times (for one single instance of newidle balance). So in total
> > there are at least these many cfs_rq load updates.
>
> Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?

All CPUs.

> > I also see that new idle balance can be called 200-500 times per second.
>
> This is not surprising because newidle_balance() is called every time
> the CPU is about to become idle

Sure.

> > > >
> > > >   * There can be a lot of idle CPU cgroups.  Don't let 
> > > > fully
> > > >   * decayed cfs_rqs linger on the list.
> > > >   */
> > > >  if (cfs_rq_is_decayed(cfs_rq))
> > > >  list_del_leaf_cfs_rq(cfs_rq);
> > > >
> > > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil 
> > > > > > in a
> > > > > > shared frequency domain configuration to be slower than usual. (the 
> > > > > > load
> > > > >
> > > > > What do you mean exactly by it causes schedutil to be slower than 
> > > > > usual ?
> > > >
> > > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > > frequency update needs to go through utilization of other CPUs in the
> > > > cluster. I believe this could be adding to the problem but is not really
> > > > needed to optimize if we can rate limit the calls to 
> > > > update_blocked_averages
> > > > to begin with.
> > >
> > > Qais mentioned half of the time being used by
> > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > this call ?
> >
> > I do not see a frequency update happening at the time of the problem. 
> > However
> > note that sugov_iowait_boost() does run even if frequency is not being
> > updated. IIRC, this function is also not that light weight and I am not sure
> > if it is a good idea to call this that often.
>
> Scheduler can't make any assumption about how often schedutil/cpufreq
> wants to be called. Some are fast and straightforward and can be
> called very often to adjust frequency; Others can't handle much
> updates. The rate limit mechanism in schedutil and io-boost should be
> there for such purpose.

Sure, I know that's the intention.

> > > > > > average updates also try to update the frequency in schedutil).
> > > > > >
> > > > > > 3. The CPU is running at a low frequency causing the 
> > > > > > scheduler/schedutil
> > > > > > code paths to take longer than when running at a high CPU frequency.
> > > > >
> > > > > Low frequency usually means low utilization so it should happen that 
> > > > > much.
> > > >
> > > > It happens a lot as can be seen with schbench. It is super easy to 
> > > > reproduce.
> > >
> > > Happening a lot in itself is not a problem if there is nothing else to
> > > do so it's not a argument in itself
> >
> > It is a problem - it shows up in the preempt off critical section latency
>
> But this point is not related to the point above which is about how
> often it happens.

It is related. A bad thing happening quite often is worse than a bad
thing happening infrequently. I agree it is not a root cause fix, but
it makes things "better".

> > tracer. Are you saying its Ok for preemption to be disabled on 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-28 Thread Vincent Guittot
Hi Joel,

On Wed, 27 Jan 2021 at 19:43, Joel Fernandes  wrote:
>
> Hi Vincent,
>
> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  wrote:
> > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > >  wrote:
> > > > >
> > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I 
> > > > > found
> > > > > that there are a lot of calls to update_blocked_averages(). This 
> > > > > causes
> > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > the periodic balancer.
> > > > >
> > > > > Closer look shows that the problem is caused by the following
> > > > > ingredients:
> > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > __update_blocked_fair() take a long time.
> > > >
> > > > Inactive cgroups are removed from the list so they should not impact
> > > > the duration
> > >
> > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > decayed and not have any tasks running on it but its load needs to be
> > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > 'blocked'.
> >
> > How many blocked cgroups have you got ?
>
> I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> this loop runs per new idle balance. When the problem happens I see this loop
> run 35-40 times (for one single instance of newidle balance). So in total
> there are at least these many cfs_rq load updates.

Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?

>
> I also see that new idle balance can be called 200-500 times per second.

This is not surprising because newidle_balance() is called every time
the CPU is about to become idle

>
> > >
> > >   * There can be a lot of idle CPU cgroups.  Don't let 
> > > fully
> > >   * decayed cfs_rqs linger on the list.
> > >   */
> > >  if (cfs_rq_is_decayed(cfs_rq))
> > >  list_del_leaf_cfs_rq(cfs_rq);
> > >
> > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil 
> > > > > in a
> > > > > shared frequency domain configuration to be slower than usual. (the 
> > > > > load
> > > >
> > > > What do you mean exactly by it causes schedutil to be slower than usual 
> > > > ?
> > >
> > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > frequency update needs to go through utilization of other CPUs in the
> > > cluster. I believe this could be adding to the problem but is not really
> > > needed to optimize if we can rate limit the calls to 
> > > update_blocked_averages
> > > to begin with.
> >
> > Qais mentioned half of the time being used by
> > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > this call ?
>
> I do not see a frequency update happening at the time of the problem. However
> note that sugov_iowait_boost() does run even if frequency is not being
> updated. IIRC, this function is also not that light weight and I am not sure
> if it is a good idea to call this that often.

Scheduler can't make any assumption about how often schedutil/cpufreq
wants to be called. Some are fast and straightforward and can be
called very often to adjust frequency; Others can't handle much
updates. The rate limit mechanism in schedutil and io-boost should be
there for such purpose.

>
> > > > > average updates also try to update the frequency in schedutil).
> > > > >
> > > > > 3. The CPU is running at a low frequency causing the 
> > > > > scheduler/schedutil
> > > > > code paths to take longer than when running at a high CPU frequency.
> > > >
> > > > Low frequency usually means low utilization so it should happen that 
> > > > much.
> > >
> > > It happens a lot as can be seen with schbench. It is super easy to 
> > > reproduce.
> >
> > Happening a lot in itself is not a problem if there is nothing else to
> > do so it's not a argument in itself
>
> It is a problem - it shows up in the preempt off critical section latency

But this point is not related to the point above which is about how
often it happens.

> tracer. Are you saying its Ok for preemption to be disabled on system for 500
> micro seconds?  That hurts real-time applications (audio etc).

So. Is your problem related to real-time applications (audio etc) ?

>
> > So why is it a problem for you ? You are mentioning newly idle load
> > balance so I assume that your root problem is the scheduling delay
> > generated by the newly idle load balance which then calls
> > update_blocked_averages
>
> Yes, the new idle balance is when I see it happen 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-27 Thread Joel Fernandes
Hi Vincent,

On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  wrote:
> > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > >  wrote:
> > > >
> > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > times (due to newidle load balance). I have also seen this manifest in
> > > > the periodic balancer.
> > > >
> > > > Closer look shows that the problem is caused by the following
> > > > ingredients:
> > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > __update_blocked_fair() take a long time.
> > >
> > > Inactive cgroups are removed from the list so they should not impact
> > > the duration
> >
> > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > decayed and not have any tasks running on it but its load needs to be
> > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > 'blocked'.
> 
> How many blocked cgroups have you got ?

I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
this loop runs per new idle balance. When the problem happens I see this loop
run 35-40 times (for one single instance of newidle balance). So in total
there are at least these many cfs_rq load updates.

I also see that new idle balance can be called 200-500 times per second.
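
For concreteness, a minimal sketch of such a counter (illustrative only, not
the exact debug hack used for the numbers above):

static unsigned int count_leaf_cfs_rqs(struct rq *rq)
{
	struct cfs_rq *cfs_rq, *pos;
	unsigned int nr = 0;

	/* same list that __update_blocked_fair() walks and decays */
	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)
		nr++;

	return nr;
}

The 35-40 figure is the total across all the per-CPU walks done during one
newidle balance.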

> >
> >   * There can be a lot of idle CPU cgroups.  Don't let fully
> >   * decayed cfs_rqs linger on the list.
> >   */
> >  if (cfs_rq_is_decayed(cfs_rq))
> >  list_del_leaf_cfs_rq(cfs_rq);
> >
> > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > shared frequency domain configuration to be slower than usual. (the load
> > >
> > > What do you mean exactly by it causes schedutil to be slower than usual ?
> >
> > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > frequency update needs to go through utilization of other CPUs in the
> > cluster. I believe this could be adding to the problem but is not really
> > needed to optimize if we can rate limit the calls to update_blocked_averages
> > to begin with.
> 
> Qais mentioned half of the time being used by
> sugov_next_freq_shared(). Are there any frequency changes resulting in
> this call ?

I do not see a frequency update happening at the time of the problem. However
note that sugov_iowait_boost() does run even if frequency is not being
updated. IIRC, this function is also not that light weight and I am not sure
if it is a good idea to call this that often.

> > > > average updates also try to update the frequency in schedutil).
> > > >
> > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > code paths to take longer than when running at a high CPU frequency.
> > >
> > > Low frequency usually means low utilization so it should happen that much.
> >
> > It happens a lot as can be seen with schbench. It is super easy to 
> > reproduce.
> 
> Happening a lot in itself is not a problem if there is nothing else to
> do so it's not a argument in itself

It is a problem - it shows up in the preempt off critical section latency
tracer. Are you saying its Ok for preemption to be disabled on system for 500
micro seconds?  That hurts real-time applications (audio etc).

> So why is it a problem for you ? You are mentioning newly idle load
> balance so I assume that your root problem is the scheduling delay
> generated by the newly idle load balance which then calls
> update_blocked_averages

Yes, the new idle balance is when I see it happen quite often. I do see it
happen with other load balance as well, but it not that often as those LB
don't run as often as new idle balance.

> 
> rate limiting the call to update_blocked_averages() will only reduce
> the number of time it happens but it will not prevent it to happen.

Sure, but soft real-time issue can tolerate if the issue does not happen very
often. In this case though, it is frequent.

> IIUC, your real problem is that newidle_balance is running whereas a
> task is about to wake up on the cpu and we don't abort quickly during
> this load_balance

Yes.

> so we could also try to abort earlier in case of newly idle load balance

I think interrupts are disabled when the load balance runs, so there's no way
for say an audio interrupt to even run in order to wake up a task. How would
you know to abort the new idle load balance?
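
(Just to illustrate the idea being discussed, a hypothetical sketch -- not
something proposed in this thread -- assuming a per-rq "wakeup pending"
indication such as the ttwu_pending flag that later kernels grew:)

static inline bool newidle_should_abort(struct rq *this_rq)
{
	/* something became runnable, or a remote wakeup is queued for us */
	return this_rq->nr_running > 0 || this_rq->ttwu_pending;
}

Polling a check like that between sched domains could catch wakeups queued
from other CPUs, but a wakeup driven by a local interrupt still has to wait
for interrupts to be re-enabled, which is the concern above.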

Could you elaborate more also on the drawback of the rate limiting patch we
posted? Do you see 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-26 Thread Qais Yousef
On 01/25/21 14:23, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 19:39, Qais Yousef  wrote:
> >
> > On 01/22/21 17:56, Vincent Guittot wrote:
> > > > ---
> > > >  kernel/sched/fair.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 04a3ce20da67..fe2dc0024db5 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool 
> > > > force)
> > > > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > > return false;
> > > >
> > > > -   if (!force && !time_after(jiffies, 
> > > > rq->last_blocked_load_update_tick))
> > > > +   if (!force && !time_after(jiffies, 
> > > > rq->last_blocked_load_update_tick + (HZ/20)))
> > >
> > > This condition is there to make sure to update blocked load at most
> > > once a tick in order to filter newly idle case otherwise the rate
> > > limit is already done by load balance interval
> > > This hard coded (HZ/20) looks really like an ugly hack
> >
> > This was meant as an RFC patch to discuss the problem really.
> >
> > Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> > processing __update_blocked_fair() and the other half in 
> > sugov_update_shared().
> > So roughly 50us each. Note that each function is calling an iterator in
> 
> Can I assume that a freq change happens if sugov_update_shared() takes 50us ?
> which would mean that the update was useful at the end ?

I couldn't reproduce his problem on Juno. But I think it is not actually doing
any frequency update. sugov_update_shared() is rate limited by
sugov_should_update_freq(). Joel has a 1ms rate limit for schedutil in sysfs.
The function traces showed that it is continuously doing the full scan which
indicates that sugov_update_next_freq() is continuously bailing out at

else if (sg_policy->next_freq == next_freq)
return false;
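
For reference, the gate in front of that is sugov_should_update_freq();
roughly (from memory, so treat it as a sketch of the ~v5.4 code rather than
an exact copy):

static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
	s64 delta_ns;

	if (!cpufreq_this_cpu_can_update(sg_policy->policy))
		return false;

	if (unlikely(sg_policy->limits_changed)) {
		sg_policy->limits_changed = false;
		sg_policy->need_freq_update = true;
		return true;
	}

	/* the rate_limit_us Joel set to 1ms ends up in freq_update_delay_ns */
	delta_ns = time - sg_policy->last_freq_update_time;

	return delta_ns >= sg_policy->freq_update_delay_ns;
}

So the 1ms rate limit only gates how often the shared-policy scan is
attempted; once it passes, the full scan still runs even if the resulting
next_freq turns out to be unchanged.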

> 
> > return. Correct me if my numbers are wrong Joel.
> >
> > Running on a little core on low frequency these numbers don't look too odd.
> > So I'm not seeing how we can speed these functions up.
> >
> > But since update_sg_lb_stats() will end up with multiple calls to
> > update_blocked_averages() in one go, this latency adds up quickly.
> >
> > One noticeable factor in Joel's system is the presence of a lot of cgroups.
> > Which is essentially what makes __update_blocked_fair() expensive, and it 
> > seems
> > to always return something has decayed so we end up with a call to
> > sugov_update_shared() in every call.
> >
> > I think we should limit the expensive call to update_blocked_averages() but
> 
> At the opposite, some will complain that block values  stay stall to
> high value and prevent any useful adjustment.
> 
> Also update_blocked average is already rate limited with idle and busy
> load_balance
> 
> Seems like the problem raised by Joel is the number of newly idle load balance

It could be. When Joel comments out update_blocked_averages() or rate limits
it, the big schedule delay disappears. Which is just an indication of where we
could do better: either by making update_blocked_averages() faster, or by
controlling how often we do it in a row. I thought the latter made more sense -
though implementation wise I'm not sure how we should do that.
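
One hypothetical shape for that "how often in a row" control (illustration
only, names made up, not a proposal):

/*
 * Skip the blocked-load walk if this CPU already did one this jiffy,
 * so that a burst of newidle balances only pays the cost once.
 */
static DEFINE_PER_CPU(unsigned long, last_blocked_walk);

static bool blocked_walk_ratelimited(int cpu)
{
	unsigned long *last = per_cpu_ptr(&last_blocked_walk, cpu);

	if (time_before(jiffies, *last + 1))
		return true;

	*last = jiffies;
	return false;
}

which is more or less what rq->last_blocked_load_update_tick already tries to
do for the nohz stats path, so maybe it is a matter of honouring that in more
of the callers rather than adding anything new.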

We might actually help make update_blocked_averages() a bit faster by not
doing cpufreq_update_util() in every call and doing it once in the last call to
update_blocked_averages(). Since it seemed the for_each_leaf_cfs_rq_safe() loop
in __update_blocked_fair() is expensive too, I'm not sure reducing the calls to
cpufreq_update_util() will be enough.

Thanks

--
Qais Yousef


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-25 Thread Vincent Guittot
On Fri, 22 Jan 2021 at 19:39, Qais Yousef  wrote:
>
> On 01/22/21 17:56, Vincent Guittot wrote:
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool 
> > > force)
> > > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > return false;
> > >
> > > -   if (!force && !time_after(jiffies, 
> > > rq->last_blocked_load_update_tick))
> > > +   if (!force && !time_after(jiffies, 
> > > rq->last_blocked_load_update_tick + (HZ/20)))
> >
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
> > This hard coded (HZ/20) looks really like an ugly hack
>
> This was meant as an RFC patch to discuss the problem really.
>
> Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> processing __update_blocked_fair() and the other half in 
> sugov_update_shared().
> So roughly 50us each. Note that each function is calling an iterator in

Can I assume that a freq change happens if sugov_update_shared() takes 50us ?
which would mean that the update was useful at the end ?

> return. Correct me if my numbers are wrong Joel.
>
> Running on a little core on low frequency these numbers don't look too odd.
> So I'm not seeing how we can speed these functions up.
>
> But since update_sg_lb_stats() will end up with multiple calls to
> update_blocked_averages() in one go, this latency adds up quickly.
>
> One noticeable factor in Joel's system is the presence of a lot of cgroups.
> Which is essentially what makes __update_blocked_fair() expensive, and it 
> seems
> to always return something has decayed so we end up with a call to
> sugov_update_shared() in every call.
>
> I think we should limit the expensive call to update_blocked_averages() but

On the opposite side, some will complain that blocked values stay stale at a
high value and prevent any useful adjustment.

Also, update_blocked_averages() is already rate limited with idle and busy
load_balance

Seems like the problem raised by Joel is the number of newly idle load balance

> I honestly don't know what would be the right way to do it :-/
>
> Or maybe there's another better alternative too..
>
> Thanks
>
> --
> Qais Yousef
>
> >
> > > return true;
> > >
> > > update_blocked_averages(cpu);
> > > --
> > > 2.30.0.280.ga3ce27912f-goog
> > >


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-25 Thread Dietmar Eggemann
On 25/01/2021 18:30, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 11:45, Dietmar Eggemann  
> wrote:
>>
>> On 22/01/2021 20:10, Joel Fernandes wrote:
>>> Hi Vincent,
>>>
>>> Thanks for reply. Please see the replies below:
>>>
>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
 On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
  wrote:

[...]

>> If I understood you correctly, you want to avoid these frequent calls
>> to update_blocked_averages() here to further avoid invoking sched_util
>> via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
>> is set) very often in your setup.
> 
> So It's not clear if the problem that joel wants to raise, is about:
> - the running time of  update_blocked_averages
> - the running time of the cpufreq_update_util which is called because
> utilization has decayed during the update of blocked load
> - the wake up latency because of newly_idle lb

Pretty much so.

IIRC his interest is driven by the fact that he saw much less activity
in newly_idle lb, and therefore in cpufreq_update_util(), on a system with the
same kernel and userspace but with fewer CPUs (i.e. also smaller
frequency domains) and fewer cgroups (with blocked load), and started
wondering why.

I assume that since he understands this environment now much better, he
should be able to come up with better test numbers to show if there is a
performance issue on his 2+6 DynamIQ system and if yes, where exactly in
this code path.


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-25 Thread Vincent Guittot
On Mon, 25 Jan 2021 at 11:45, Dietmar Eggemann  wrote:
>
> On 22/01/2021 20:10, Joel Fernandes wrote:
> > Hi Vincent,
> >
> > Thanks for reply. Please see the replies below:
> >
> > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> >> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> >>  wrote:
> >>>
> >>> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> >>> that there are a lot of calls to update_blocked_averages(). This causes
> >>> the schedule loop to slow down to taking upto 500 micro seconds at
> >>> times (due to newidle load balance). I have also seen this manifest in
> >>> the periodic balancer.
> >>>
> >>> Closer look shows that the problem is caused by the following
> >>> ingredients:
> >>> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> >>> suggesting to inspect /proc/sched_debug for this), this can make
> >>> __update_blocked_fair() take a long time.
> >>
> >> Inactive cgroups are removed from the list so they should not impact
> >> the duration
> >
> > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > decayed and not have any tasks running on it but its load needs to be
> > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > 'blocked'.
> >
> >   * There can be a lot of idle CPU cgroups.  Don't let fully
> >   * decayed cfs_rqs linger on the list.
> >   */
> >  if (cfs_rq_is_decayed(cfs_rq))
> >  list_del_leaf_cfs_rq(cfs_rq);
> >
> >>> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> >>> shared frequency domain configuration to be slower than usual. (the load
> >>
> >> What do you mean exactly by it causes schedutil to be slower than usual ?
> >
> > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > frequency update needs to go through utilization of other CPUs in the
> > cluster. I believe this could be adding to the problem but is not really
> > needed to optimize if we can rate limit the calls to update_blocked_averages
> > to begin with.
> >
> >>> average updates also try to update the frequency in schedutil).
> >>>
> >>> 3. The CPU is running at a low frequency causing the scheduler/schedutil
> >>> code paths to take longer than when running at a high CPU frequency.
> >>
> >> Low frequency usually means low utilization so it should happen that much.
> >
> > It happens a lot as can be seen with schbench. It is super easy to 
> > reproduce.
> >
> > schedule() can result in new idle balance with the CFS pick call happening
> > often. Here is a function graph trace.  The tracer shows
> > update_blocked_averages taking a lot of time.
> >
> >  sugov:0-2454  [002]  2657.992570: funcgraph_entry:   | 
> >  load_balance() {
> >  sugov:0-2454  [002]  2657.992577: funcgraph_entry:   | 
> >update_group_capacity() {
> >  sugov:0-2454  [002]  2657.992580: funcgraph_entry:2.656 us   | 
> >  __msecs_to_jiffies();
> >  sugov:0-2454  [002]  2657.992585: funcgraph_entry:2.447 us   | 
> >  _raw_spin_lock_irqsave();
> >  sugov:0-2454  [002]  2657.992591: funcgraph_entry:2.552 us   | 
> >  _raw_spin_unlock_irqrestore();
> >  sugov:0-2454  [002]  2657.992595: funcgraph_exit:   + 17.448 us  | 
> >}
> >  sugov:0-2454  [002]  2657.992597: funcgraph_entry:1.875 us   | 
> >update_nohz_stats();
> >  sugov:0-2454  [002]  2657.992601: funcgraph_entry:1.667 us   | 
> >idle_cpu();
> >  sugov:0-2454  [002]  2657.992605: funcgraph_entry:   | 
> >update_nohz_stats() {
> >  sugov:0-2454  [002]  2657.992608: funcgraph_entry:  + 33.333 us  | 
> >  update_blocked_averages();
> >  sugov:0-2454  [002]  2657.992643: funcgraph_exit:   + 38.073 us  | 
> >}
> >  sugov:0-2454  [002]  2657.992645: funcgraph_entry:1.770 us   | 
> >idle_cpu();
> >  sugov:0-2454  [002]  2657.992649: funcgraph_entry:   | 
> >update_nohz_stats() {
> >  sugov:0-2454  [002]  2657.992651: funcgraph_entry:  + 41.823 us  | 
> >  update_blocked_averages();
> >  sugov:0-2454  [002]  2657.992694: funcgraph_exit:   + 45.729 us  | 
> >}
> >  sugov:0-2454  [002]  2657.992696: funcgraph_entry:1.823 us   | 
> >idle_cpu();
> >  sugov:0-2454  [002]  2657.992700: funcgraph_entry:   | 
> >update_nohz_stats() {
> >  sugov:0-2454  [002]  2657.992702: funcgraph_entry:  + 35.312 us  | 
> >  update_blocked_averages();
> >  sugov:0-2454  [002]  2657.992740: funcgraph_exit:   + 39.792 us  | 
> >}
> >  sugov:0-2454  [002]  2657.992742: funcgraph_entry:1.771 us   | 
> >idle_cpu();
> >  sugov:0-2454  [002]  2657.992746: 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-25 Thread Vincent Guittot
On Fri, 22 Jan 2021 at 20:10, Joel Fernandes  wrote:
>
> Hi Vincent,
>
> Thanks for reply. Please see the replies below:
>
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> >  wrote:
> > >
> > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > that there are a lot of calls to update_blocked_averages(). This causes
> > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > times (due to newidle load balance). I have also seen this manifest in
> > > the periodic balancer.
> > >
> > > Closer look shows that the problem is caused by the following
> > > ingredients:
> > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > suggesting to inspect /proc/sched_debug for this), this can make
> > > __update_blocked_fair() take a long time.
> >
> > Inactive cgroups are removed from the list so they should not impact
> > the duration
>
> I meant blocked CGroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it but its load needs to be
> decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.

How many blocked cgroups have you got ?

>
>   * There can be a lot of idle CPU cgroups.  Don't let fully
>   * decayed cfs_rqs linger on the list.
>   */
>  if (cfs_rq_is_decayed(cfs_rq))
>  list_del_leaf_cfs_rq(cfs_rq);
>
> > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > shared frequency domain configuration to be slower than usual. (the load
> >
> > What do you mean exactly by it causes schedutil to be slower than usual ?
>
> sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> frequency update needs to go through utilization of other CPUs in the
> cluster. I believe this could be adding to the problem but is not really
> needed to optimize if we can rate limit the calls to update_blocked_averages
> to begin with.

Qais mentioned half of the time being used by
sugov_next_freq_shared(). Are there any frequency changes resulting in
this call ?

>
> > > average updates also try to update the frequency in schedutil).
> > >
> > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > code paths to take longer than when running at a high CPU frequency.
> >
> > Low frequency usually means low utilization so it should happen that much.
>
> It happens a lot as can be seen with schbench. It is super easy to reproduce.

Happening a lot in itself is not a problem if there is nothing else to
do, so it's not an argument in itself

So why is it a problem for you ? You are mentioning newly idle load
balance so I assume that your root problem is the scheduling delay
generated by the newly idle load balance which then calls
update_blocked_averages

rate limiting the call to update_blocked_averages() will only reduce
the number of times it happens but it will not prevent it from happening.

>
> schedule() can result in new idle balance with the CFS pick call happening
> often. Here is a function graph trace.  The tracer shows
> update_blocked_averages taking a lot of time.
>
>  sugov:0-2454  [002]  2657.992570: funcgraph_entry:   |  
> load_balance() {
>  sugov:0-2454  [002]  2657.992577: funcgraph_entry:   |   
>  update_group_capacity() {
>  sugov:0-2454  [002]  2657.992580: funcgraph_entry:2.656 us   |   
>__msecs_to_jiffies();
>  sugov:0-2454  [002]  2657.992585: funcgraph_entry:2.447 us   |   
>_raw_spin_lock_irqsave();
>  sugov:0-2454  [002]  2657.992591: funcgraph_entry:2.552 us   |   
>_raw_spin_unlock_irqrestore();
>  sugov:0-2454  [002]  2657.992595: funcgraph_exit:   + 17.448 us  |   
>  }
>  sugov:0-2454  [002]  2657.992597: funcgraph_entry:1.875 us   |   
>  update_nohz_stats();
>  sugov:0-2454  [002]  2657.992601: funcgraph_entry:1.667 us   |   
>  idle_cpu();
>  sugov:0-2454  [002]  2657.992605: funcgraph_entry:   |   
>  update_nohz_stats() {
>  sugov:0-2454  [002]  2657.992608: funcgraph_entry:  + 33.333 us  |   
>update_blocked_averages();
>  sugov:0-2454  [002]  2657.992643: funcgraph_exit:   + 38.073 us  |   
>  }
>  sugov:0-2454  [002]  2657.992645: funcgraph_entry:1.770 us   |   
>  idle_cpu();
>  sugov:0-2454  [002]  2657.992649: funcgraph_entry:   |   
>  update_nohz_stats() {
>  sugov:0-2454  [002]  2657.992651: funcgraph_entry:  + 41.823 us  |   
>update_blocked_averages();
>  sugov:0-2454  [002]  2657.992694: funcgraph_exit:   + 45.729 us  |   
>  }
>  sugov:0-2454  [002]  2657.992696: funcgraph_entry:1.823 us   |   
>  idle_cpu();
>  

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-25 Thread Dietmar Eggemann
On 22/01/2021 20:10, Joel Fernandes wrote:
> Hi Vincent,
> 
> Thanks for reply. Please see the replies below:
> 
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>  wrote:
>>>
>>> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
>>> that there are a lot of calls to update_blocked_averages(). This causes
>>> the schedule loop to slow down to taking upto 500 micro seconds at
>>> times (due to newidle load balance). I have also seen this manifest in
>>> the periodic balancer.
>>>
>>> Closer look shows that the problem is caused by the following
>>> ingredients:
>>> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
>>> suggesting to inspect /proc/sched_debug for this), this can make
>>> __update_blocked_fair() take a long time.
>>
>> Inactive cgroups are removed from the list so they should not impact
>> the duration
> 
> I meant blocked CGroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it but its load needs to be
> decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.
> 
>   * There can be a lot of idle CPU cgroups.  Don't let fully
>   * decayed cfs_rqs linger on the list.
>   */
>  if (cfs_rq_is_decayed(cfs_rq))
>  list_del_leaf_cfs_rq(cfs_rq);
> 
>>> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
>>> shared frequency domain configuration to be slower than usual. (the load
>>
>> What do you mean exactly by it causes schedutil to be slower than usual ?
> 
> sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> frequency update needs to go through utilization of other CPUs in the
> cluster. I believe this could be adding to the problem but is not really
> needed to optimize if we can rate limit the calls to update_blocked_averages
> to begin with.
> 
>>> average updates also try to update the frequency in schedutil).
>>>
>>> 3. The CPU is running at a low frequency causing the scheduler/schedutil
>>> code paths to take longer than when running at a high CPU frequency.
>>
>> Low frequency usually means low utilization so it should happen that much.
> 
> It happens a lot as can be seen with schbench. It is super easy to reproduce.
> 
> schedule() can result in new idle balance with the CFS pick call happening
> often. Here is a function graph trace.  The tracer shows
> update_blocked_averages taking a lot of time.
> 
>  sugov:0-2454  [002]  2657.992570: funcgraph_entry:   |  
> load_balance() {
>  sugov:0-2454  [002]  2657.992577: funcgraph_entry:   |   
>  update_group_capacity() {
>  sugov:0-2454  [002]  2657.992580: funcgraph_entry:2.656 us   |   
>__msecs_to_jiffies();
>  sugov:0-2454  [002]  2657.992585: funcgraph_entry:2.447 us   |   
>_raw_spin_lock_irqsave();
>  sugov:0-2454  [002]  2657.992591: funcgraph_entry:2.552 us   |   
>_raw_spin_unlock_irqrestore();
>  sugov:0-2454  [002]  2657.992595: funcgraph_exit:   + 17.448 us  |   
>  }
>  sugov:0-2454  [002]  2657.992597: funcgraph_entry:1.875 us   |   
>  update_nohz_stats();
>  sugov:0-2454  [002]  2657.992601: funcgraph_entry:1.667 us   |   
>  idle_cpu();
>  sugov:0-2454  [002]  2657.992605: funcgraph_entry:   |   
>  update_nohz_stats() {
>  sugov:0-2454  [002]  2657.992608: funcgraph_entry:  + 33.333 us  |   
>update_blocked_averages();
>  sugov:0-2454  [002]  2657.992643: funcgraph_exit:   + 38.073 us  |   
>  }
>  sugov:0-2454  [002]  2657.992645: funcgraph_entry:1.770 us   |   
>  idle_cpu();
>  sugov:0-2454  [002]  2657.992649: funcgraph_entry:   |   
>  update_nohz_stats() {
>  sugov:0-2454  [002]  2657.992651: funcgraph_entry:  + 41.823 us  |   
>update_blocked_averages();
>  sugov:0-2454  [002]  2657.992694: funcgraph_exit:   + 45.729 us  |   
>  }
>  sugov:0-2454  [002]  2657.992696: funcgraph_entry:1.823 us   |   
>  idle_cpu();
>  sugov:0-2454  [002]  2657.992700: funcgraph_entry:   |   
>  update_nohz_stats() {
>  sugov:0-2454  [002]  2657.992702: funcgraph_entry:  + 35.312 us  |   
>update_blocked_averages();
>  sugov:0-2454  [002]  2657.992740: funcgraph_exit:   + 39.792 us  |   
>  }
>  sugov:0-2454  [002]  2657.992742: funcgraph_entry:1.771 us   |   
>  idle_cpu();
>  sugov:0-2454  [002]  2657.992746: funcgraph_entry:   |   
>  update_nohz_stats() {
>  sugov:0-2454  [002]  2657.992748: funcgraph_entry:  + 33.438 us  |   
>update_blocked_averages();
>  sugov:0-2454  [002]  2657.992783: funcgraph_exit:   + 37.500 us  |   
> 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-22 Thread Qais Yousef
On 01/22/21 17:56, Vincent Guittot wrote:
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..fe2dc0024db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool 
> > force)
> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > return false;
> >
> > -   if (!force && !time_after(jiffies, 
> > rq->last_blocked_load_update_tick))
> > +   if (!force && !time_after(jiffies, 
> > rq->last_blocked_load_update_tick + (HZ/20)))
> 
> This condition is there to make sure to update blocked load at most
> once a tick in order to filter newly idle case otherwise the rate
> limit is already done by load balance interval
> This hard coded (HZ/20) looks really like an ugly hack

This was meant as an RFC patch to discuss the problem really.

Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
processing __update_blocked_fair() and the other half in sugov_update_shared().
So roughly 50us each. Note that each function is calling an iterator in
return. Correct me if my numbers are wrong Joel.

Running on a little core on low frequency these numbers don't look too odd.
So I'm not seeing how we can speed these functions up.

But since update_sg_lb_stats() will end up with multiple calls to
update_blocked_averages() in one go, this latency adds up quickly.

One noticeable factor in Joel's system is the presence of a lot of cgroups,
which is essentially what makes __update_blocked_fair() expensive. It also
seems to always report that something has decayed, so we end up with a call to
sugov_update_shared() on every invocation.

I think we should limit the expensive call to update_blocked_averages() but
I honestly don't know what would be the right way to do it :-/

Or maybe there's another better alternative too..
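
To put a number on what a HZ/20 style limit actually does, here is a tiny
standalone model of the rate limiting arithmetic. It is plain userspace C with
made-up names and values (the HZ value, the interval, and the stand-in for
rq->last_blocked_load_update_tick), just a sketch to check the numbers rather
than a proposal for fair.c:

/*
 * Standalone model of the proposed rate limit, so the arithmetic can be
 * checked outside the kernel.  HZ, jiffies and the rq field are modelled
 * with ordinary variables; all names and values here are invented.
 */
#include <stdbool.h>
#include <stdio.h>

#define HZ 250				/* assume a 250 Hz tick for the demo */
#define UPDATE_INTERVAL_MS 50		/* ~20 updates/second, like HZ/20 */

static unsigned long jiffies;		/* stands in for the kernel's jiffies */
static unsigned long last_update_tick;	/* rq->last_blocked_load_update_tick */

static unsigned long msecs_to_jiffies(unsigned int ms)
{
	return (unsigned long)ms * HZ / 1000;
}

static bool blocked_load_update_due(void)
{
	return jiffies - last_update_tick >= msecs_to_jiffies(UPDATE_INTERVAL_MS);
}

int main(void)
{
	int updates = 0;

	/* Simulate one second of ticks, attempting an update on every tick. */
	for (jiffies = 0; jiffies < HZ; jiffies++) {
		if (blocked_load_update_due()) {
			last_update_tick = jiffies;
			updates++;
		}
	}

	printf("updates in 1s: %d\n", updates);	/* prints 20 with these values */
	return 0;
}

So even if the newidle path attempts an update on every jiffy, the blocked
load would actually be refreshed at most ~20 times per second per CPU, which
is the figure the changelog quotes.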

Thanks

--
Qais Yousef

> 
> > return true;
> >
> > update_blocked_averages(cpu);
> > --
> > 2.30.0.280.ga3ce27912f-goog
> >


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-22 Thread Joel Fernandes
On Fri, Jan 22, 2021 at 06:39:27PM +, Qais Yousef wrote:
> On 01/22/21 17:56, Vincent Guittot wrote:
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > return false;
> > >
> > > -   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > +   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> > 
> > This condition is there to make sure we update the blocked load at most
> > once per tick, in order to filter the newly idle case; otherwise the rate
> > limiting is already done by the load balance interval.
> > This hard coded (HZ/20) really looks like an ugly hack.
> 
> This was meant as an RFC patch to discuss the problem really.

Agreed, sorry.

> Joel is seeing update_blocked_averages() taking ~100us. Half of it seems to be
> spent in __update_blocked_fair() and the other half in sugov_update_shared(),
> so roughly 50us each. Note that each of these functions is itself iterating
> over a list (the leaf cfs_rqs and the CPUs of the policy, respectively).
> Correct me if my numbers are wrong, Joel.

Correct, and I see update_nohz_stats() itself called around 8 times during a
load balance, which multiplies the overhead.

Dietmar also found that the reason update_nohz_stats() is called 8 times is
that in our setup there is only 1 MC sched domain spanning all 8 CPUs, versus
say 2 MC domains with 4 CPUs each.
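
Back of the envelope: 8 calls at ~100us each is roughly 800us of blocked load
processing in a single balance pass in the worst case. Not every one of the 8
CPUs passes the nohz/idle checks on every pass, so the realized cost varies,
but it is the same order as the up-to-500-microsecond schedule loop slowdowns
mentioned in the changelog.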

> Running on a little core at a low frequency, these numbers don't look too odd,
> so I'm not seeing how we can speed these functions up.

Agreed.

> But since update_sg_lb_stats() will end up with multiple calls to
> update_blocked_averages() in one go, this latency adds up quickly.

True!

> One noticeable factor in Joel's system is the presence of a lot of cgroups,
> which is essentially what makes __update_blocked_fair() expensive. It also
> seems to always report that something has decayed, so we end up with a call to
> sugov_update_shared() on every invocation.

Correct.

thanks,

 - Joel

[..]


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-22 Thread Joel Fernandes
Hi Vincent,

Thanks for the reply. Please see my replies below:

On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>  wrote:
> >
> > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > that there are a lot of calls to update_blocked_averages(). This causes
> > the schedule loop to slow down to taking up to 500 microseconds at
> > times (due to newidle load balance). I have also seen this manifest in
> > the periodic balancer.
> >
> > Closer look shows that the problem is caused by the following
> > ingredients:
> > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > suggesting to inspect /proc/sched_debug for this), this can make
> > __update_blocked_fair() take a long time.
> 
> Inactive cgroups are removed from the list so they should not impact
> the duration

I meant blocked CGroups. According to this code, a cfs_rq can be partially
decayed and not have any tasks running on it, but its load still needs to be
decayed, correct? That's what I meant by 'inactive'. I can reword it to
'blocked'.

 /*
  * There can be a lot of idle CPU cgroups.  Don't let fully
  * decayed cfs_rqs linger on the list.
  */
 if (cfs_rq_is_decayed(cfs_rq))
         list_del_leaf_cfs_rq(cfs_rq);
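
In other words, as far as I read __update_blocked_fair(): a cfs_rq with no
runnable tasks can still carry non-zero blocked PELT averages. It stays on the
leaf list and has its averages decayed on every update_blocked_averages()
pass, and is only dropped from the list once cfs_rq_is_decayed() reports that
the averages have decayed all the way to zero. With a lot of cgroups, that
list can stay long for quite a while.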

> > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > shared frequency domain configuration to be slower than usual. (the load
> 
> What exactly do you mean by it causing schedutil to be slower than usual?

sugov_next_freq_shared() is O(number of CPUs in the cluster). This system is a
6+2 system with 6 CPUs in one cluster, so a schedutil shared-policy frequency
update has to go through the utilization of the other CPUs in the cluster. I
believe this adds to the problem, but it is not really worth optimizing on its
own if we can rate limit the calls to update_blocked_averages() to begin with.
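
For anyone not familiar with schedutil, the shape of the shared-policy update
is roughly the following. This is a simplified standalone sketch, not the
actual sugov_next_freq_shared() code; the CPU count, utilization values and
the capacity/frequency numbers are made up for illustration:

/*
 * Simplified model of a shared-policy frequency update: look at every CPU in
 * the frequency domain, take the max utilization, then map it to a frequency
 * with the roughly 1.25 * max_freq * util / capacity heuristic schedutil
 * uses.  All values below are invented.
 */
#include <stdio.h>

#define NR_POLICY_CPUS 6	/* e.g. the 6-CPU little cluster */

static const unsigned long cpu_util[NR_POLICY_CPUS] = { 120, 80, 300, 50, 10, 220 };
static const unsigned long max_capacity = 1024;
static const unsigned long max_freq_khz = 1800000;

static unsigned long next_freq_shared(void)
{
	unsigned long util = 0;

	/* One pass over every CPU sharing the policy: the O(nr_cpus) part. */
	for (int cpu = 0; cpu < NR_POLICY_CPUS; cpu++) {
		if (cpu_util[cpu] > util)
			util = cpu_util[cpu];
	}

	return (max_freq_khz + (max_freq_khz >> 2)) * util / max_capacity;
}

int main(void)
{
	printf("next freq: %lu kHz\n", next_freq_shared());
	return 0;
}

The point is just the single pass over every CPU sharing the frequency domain;
with 6 CPUs in the policy, that pass runs on every blocked load update that
reports something has decayed.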

> > average updates also try to update the frequency in schedutil).
> >
> > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > code paths to take longer than when running at a high CPU frequency.
> 
> Low frequency usually means low utilization, so it shouldn't happen that much.

It happens a lot, as can be seen with schbench; it is super easy to reproduce.

schedule() can result in a newidle balance from the CFS pick path, and that
happens often. Here is a function graph trace. The tracer shows
update_blocked_averages() taking a lot of time.

 sugov:0-2454  [002]  2657.992570: funcgraph_entry:               |   load_balance() {
 sugov:0-2454  [002]  2657.992577: funcgraph_entry:               |     update_group_capacity() {
 sugov:0-2454  [002]  2657.992580: funcgraph_entry:    2.656 us   |       __msecs_to_jiffies();
 sugov:0-2454  [002]  2657.992585: funcgraph_entry:    2.447 us   |       _raw_spin_lock_irqsave();
 sugov:0-2454  [002]  2657.992591: funcgraph_entry:    2.552 us   |       _raw_spin_unlock_irqrestore();
 sugov:0-2454  [002]  2657.992595: funcgraph_exit:   + 17.448 us  |     }
 sugov:0-2454  [002]  2657.992597: funcgraph_entry:    1.875 us   |     update_nohz_stats();
 sugov:0-2454  [002]  2657.992601: funcgraph_entry:    1.667 us   |     idle_cpu();
 sugov:0-2454  [002]  2657.992605: funcgraph_entry:               |     update_nohz_stats() {
 sugov:0-2454  [002]  2657.992608: funcgraph_entry:  + 33.333 us  |       update_blocked_averages();
 sugov:0-2454  [002]  2657.992643: funcgraph_exit:   + 38.073 us  |     }
 sugov:0-2454  [002]  2657.992645: funcgraph_entry:    1.770 us   |     idle_cpu();
 sugov:0-2454  [002]  2657.992649: funcgraph_entry:               |     update_nohz_stats() {
 sugov:0-2454  [002]  2657.992651: funcgraph_entry:  + 41.823 us  |       update_blocked_averages();
 sugov:0-2454  [002]  2657.992694: funcgraph_exit:   + 45.729 us  |     }
 sugov:0-2454  [002]  2657.992696: funcgraph_entry:    1.823 us   |     idle_cpu();
 sugov:0-2454  [002]  2657.992700: funcgraph_entry:               |     update_nohz_stats() {
 sugov:0-2454  [002]  2657.992702: funcgraph_entry:  + 35.312 us  |       update_blocked_averages();
 sugov:0-2454  [002]  2657.992740: funcgraph_exit:   + 39.792 us  |     }
 sugov:0-2454  [002]  2657.992742: funcgraph_entry:    1.771 us   |     idle_cpu();
 sugov:0-2454  [002]  2657.992746: funcgraph_entry:               |     update_nohz_stats() {
 sugov:0-2454  [002]  2657.992748: funcgraph_entry:  + 33.438 us  |       update_blocked_averages();
 sugov:0-2454  [002]  2657.992783: funcgraph_exit:   + 37.500 us  |     }
 sugov:0-2454  [002]  2657.992785: funcgraph_entry:    1.771 us   |     idle_cpu();
 sugov:0-2454  [002]  2657.992790: funcgraph_entry:               |     update_nohz_stats() {
 

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-22 Thread Vincent Guittot
On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
 wrote:
>
> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> that there are a lot of calls to update_blocked_averages(). This causes
> the schedule loop to slow down to taking up to 500 microseconds at
> times (due to newidle load balance). I have also seen this manifest in
> the periodic balancer.
>
> Closer look shows that the problem is caused by the following
> ingredients:
> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> suggesting to inspect /proc/sched_debug for this), this can make
> __update_blocked_fair() take a long time.

Inactive cgroups are removed from the list so they should not impact
the duration

>
> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> shared frequency domain configuration to be slower than usual. (the load

What exactly do you mean by it causing schedutil to be slower than usual?

> average updates also try to update the frequency in schedutil).
>
> 3. The CPU is running at a low frequency causing the scheduler/schedutil
> code paths to take longer than when running at a high CPU frequency.

Low frequency usually means low utilization, so it shouldn't happen that much.

>
> The fix is to simply rate limit the calls to update_blocked_averages() to at
> most 20 times per second. It appears that updating the blocked averages less
> often is sufficient. Currently I see about 200 calls per second

It would be good to explain why updating less often is sufficient.

> sometimes, which seems overkill.
>
> schbench shows a clear improvement with the change:

Have you got more details about your test setup?
Which platform?
Which kernel?

>
> Without patch:
> ~/schbench -t 2 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (212 total samples)
> 50.0th: 210 (106 samples)
> 75.0th: 619 (53 samples)
> 90.0th: 665 (32 samples)
> 95.0th: 703 (11 samples)
> *99.0th: 12656 (8 samples)
> 99.5th: 12784 (1 samples)
> 99.9th: 13424 (1 samples)
> min=15, max=13424
>
> With patch:
> ~/schbench -t 2 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (214 total samples)
> 50.0th: 188 (108 samples)
> 75.0th: 238 (53 samples)
> 90.0th: 623 (32 samples)
> 95.0th: 657 (12 samples)
> *99.0th: 717 (7 samples)
> 99.5th: 725 (2 samples)
> 99.9th: 725 (0 samples)
>
> Cc: Paul McKenney 
> Cc: Frederic Weisbecker 
> Suggested-by: Dietmar Eggemann 
> Co-developed-by: Qais Yousef 
> Signed-off-by: Qais Yousef 
> Signed-off-by: Joel Fernandes (Google) 
>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04a3ce20da67..fe2dc0024db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> return false;
>
> -   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> +   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))

This condition is there to make sure we update the blocked load at most
once per tick, in order to filter the newly idle case; otherwise the rate
limiting is already done by the load balance interval.
This hard coded (HZ/20) really looks like an ugly hack.
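
(For reference, HZ/20 jiffies is 1/20 of a second regardless of the configured
HZ value (5 jiffies at HZ=100, 50 at HZ=1000), so the check caps this path at
roughly one blocked load update per 50ms per CPU, which is where the
changelog's "20 times per second" figure comes from.)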

> return true;
>
> update_blocked_averages(cpu);
> --
> 2.30.0.280.ga3ce27912f-goog
>


[PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-22 Thread Joel Fernandes (Google)
On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
that there are a lot of calls to update_blocked_averages(). This causes
the schedule loop to slow down to taking up to 500 microseconds at
times (due to newidle load balance). I have also seen this manifest in
the periodic balancer.

Closer look shows that the problem is caused by the following
ingredients:
1. If the system has a lot of inactive CGroups (thanks Dietmar for
suggesting to inspect /proc/sched_debug for this), this can make
__update_blocked_fair() take a long time.

2. The device has a lot of CPUs in a cluster which causes schedutil in a
shared frequency domain configuration to be slower than usual. (the load
average updates also try to update the frequency in schedutil).

3. The CPU is running at a low frequency causing the scheduler/schedutil
code paths to take longer than when running at a high CPU frequency.

The fix is to simply rate limit the calls to update_blocked_averages() to at
most 20 times per second. It appears that updating the blocked averages less
often is sufficient. Currently I see about 200 calls per second
sometimes, which seems overkill.

schbench shows a clear improvement with the change:

Without patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (212 total samples)
50.0th: 210 (106 samples)
75.0th: 619 (53 samples)
90.0th: 665 (32 samples)
95.0th: 703 (11 samples)
*99.0th: 12656 (8 samples)
99.5th: 12784 (1 samples)
99.9th: 13424 (1 samples)
min=15, max=13424

With patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (214 total samples)
50.0th: 188 (108 samples)
75.0th: 238 (53 samples)
90.0th: 623 (32 samples)
95.0th: 657 (12 samples)
*99.0th: 717 (7 samples)
99.5th: 725 (2 samples)
99.9th: 725 (0 samples)

Cc: Paul McKenney 
Cc: Frederic Weisbecker 
Suggested-by: Dietmar Eggemann 
Co-developed-by: Qais Yousef 
Signed-off-by: Qais Yousef 
Signed-off-by: Joel Fernandes (Google) 

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..fe2dc0024db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
return false;
 
-   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
+   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
return true;
 
update_blocked_averages(cpu);
-- 
2.30.0.280.ga3ce27912f-goog