Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-28 Thread Wanpeng Li
On Sat, 29 Sep 2018 at 01:36, Dietmar Eggemann  wrote:
>
> On 09/28/2018 06:10 PM, Steve Muckle wrote:
> > On 09/27/2018 05:43 PM, Wanpeng Li wrote:
>  On your CPU4:
>  scheduler_ipi()
>   -> sched_ttwu_pending()
>    -> ttwu_do_activate()    => p->sched_remote_wakeup should be
>       false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
>     -> ttwu_activate()
>      -> activate_task()
>       -> enqueue_task()
>        -> enqueue_task_fair()
>         -> enqueue_entity()
>            bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
>  so renorm is false in enqueue_entity(); why did you mention that the
>  cfs_rq->min_vruntime is still added to the se->vruntime in
>  enqueue_task_fair()?
> >>>
> >>> Maybe this is a misunderstanding on my side, but didn't you ask me
> >>> '... Could you point out when the fair rq's min_vruntime is added to the
> >>> task's vruntime in your *later* scenario? ...'
> >>
> >> Yeah, if the calltrace above and my analysis are correct, then the fair
> >> rq's min_vruntime will not be added to the task's vruntime in your
> >> *later* scenario, which means that your patch is not necessary.
> >
> > In the scenario I observed, the task is not waking - it is running and
> > being deboosted from priority inheritance, transitioning from RT to CFS.
> >
> > Dietmar and I both were able to reproduce the issue with the testcase I
> > posted earlier in this thread.
>
> Correct, and with the same testcase I got this call stack in this scenario:
>
> [   35.588509] CPU: 1 PID: 2926 Comm: fair_task Not tainted
> 4.18.0-rc6-00052-g11b7dafa2edb-dirty #5
> [   35.597217] Hardware name: ARM Juno development board (r0) (DT)
> [   35.603080] Call trace:
> [   35.605509]  dump_backtrace+0x0/0x168
> [   35.609138]  show_stack+0x24/0x30
> [   35.612424]  dump_stack+0xac/0xe4
> [   35.615710]  enqueue_task_fair+0xae0/0x11c0
> [   35.619854]  rt_mutex_setprio+0x5a0/0x628
> [   35.623827]  mark_wakeup_next_waiter+0x7c/0xc8
> [   35.628228]  __rt_mutex_futex_unlock+0x30/0x50
> [   35.632630]  do_futex+0x74c/0xb28
> [   35.635912]  sys_futex+0x118/0x198
> [   35.639280]  el0_svc_naked+0x30/0x34

Thanks for pointing that out. :)

Regards,
Wanpeng Li


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-28 Thread Dietmar Eggemann

On 09/28/2018 06:10 PM, Steve Muckle wrote:

On 09/27/2018 05:43 PM, Wanpeng Li wrote:

On your CPU4:
scheduler_ipi()
 -> sched_ttwu_pending()
  -> ttwu_do_activate()    => p->sched_remote_wakeup should be
     false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
   -> ttwu_activate()
    -> activate_task()
     -> enqueue_task()
      -> enqueue_task_fair()
       -> enqueue_entity()
          bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
so renorm is false in enqueue_entity(); why did you mention that the
cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair()?


Maybe this is a misunderstanding on my side, but didn't you ask me
'... Could you point out when the fair rq's min_vruntime is added to the
task's vruntime in your *later* scenario? ...'


Yeah, if the calltrace above and my analysis are correct, then the fair
rq's min_vruntime will not be added to the task's vruntime in your
*later* scenario, which means that your patch is not necessary.


In the scenario I observed, the task is not waking - it is running and 
being deboosted from priority inheritance, transitioning from RT to CFS.


Dietmar and I both were able to reproduce the issue with the testcase I 
posted earlier in this thread.


Correct, and with the same testcase I got this call stack in this scenario:

[   35.588509] CPU: 1 PID: 2926 Comm: fair_task Not tainted 
4.18.0-rc6-00052-g11b7dafa2edb-dirty #5

[   35.597217] Hardware name: ARM Juno development board (r0) (DT)
[   35.603080] Call trace:
[   35.605509]  dump_backtrace+0x0/0x168
[   35.609138]  show_stack+0x24/0x30
[   35.612424]  dump_stack+0xac/0xe4
[   35.615710]  enqueue_task_fair+0xae0/0x11c0
[   35.619854]  rt_mutex_setprio+0x5a0/0x628
[   35.623827]  mark_wakeup_next_waiter+0x7c/0xc8
[   35.628228]  __rt_mutex_futex_unlock+0x30/0x50
[   35.632630]  do_futex+0x74c/0xb28
[   35.635912]  sys_futex+0x118/0x198
[   35.639280]  el0_svc_naked+0x30/0x34

[...]



Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-28 Thread Dietmar Eggemann

On 09/28/2018 02:43 AM, Wanpeng Li wrote:

On Thu, 27 Sep 2018 at 21:23, Dietmar Eggemann  wrote:


On 09/27/2018 03:19 AM, Wanpeng Li wrote:

On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann  wrote:


Hi,

On 09/26/2018 11:50 AM, Wanpeng Li wrote:

Hi Dietmar,
On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann  wrote:


On 08/27/2018 12:14 PM, Peter Zijlstra wrote:

On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:

On 08/24/2018 02:47 AM, Peter Zijlstra wrote:

On 08/17/2018 11:27 AM, Steve Muckle wrote:


[...]


- later, when the prio is deboosted and the task is moved back
to the fair class, the fair rq's min_vruntime is added to
the task's vruntime, even though it wasn't subtracted earlier.


Could you point out when the fair rq's min_vruntime is added to the
task's vruntime in your *later* scenario? attach_task_cfs_rq() will not
do that, for the same reason as detach_task_cfs_rq(). The fair task's
sched_remote_wakeup is false, which means the vruntime will not be
renormalized in enqueue_entity().


The cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair().


I understand what your patch does,


It's not my patch ;-) I just helped to find out under which
circumstances this issue can happen.


On your CPU4:
scheduler_ipi()
 -> sched_ttwu_pending()
  -> ttwu_do_activate()    => p->sched_remote_wakeup should be
     false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
   -> ttwu_activate()
    -> activate_task()
     -> enqueue_task()
      -> enqueue_task_fair()
       -> enqueue_entity()
          bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
so renorm is false in enqueue_entity(); why did you mention that the
cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair()?


Maybe this is a misunderstanding on my side, but didn't you ask me
'... Could you point out when the fair rq's min_vruntime is added to the
task's vruntime in your *later* scenario? ...'


Yeah, if the calltrace above and my analysis are correct, then the fair
rq's min_vruntime will not be added to the task's vruntime in your
*later* scenario, which means that your patch is not necessary.


Ah, OK: both ENQUEUE_WAKEUP and ENQUEUE_MIGRATED are not set in
the enqueue_entity() call, so renorm is 1 (flags is 0xe).







Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-28 Thread Joel Fernandes
On Fri, Sep 28, 2018 at 9:10 AM, 'Steve Muckle' via kernel-team
 wrote:
> On 09/27/2018 05:43 PM, Wanpeng Li wrote:

On your CPU4:
scheduler_ipi()
 -> sched_ttwu_pending()
  -> ttwu_do_activate()    => p->sched_remote_wakeup should be
     false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
   -> ttwu_activate()
    -> activate_task()
     -> enqueue_task()
      -> enqueue_task_fair()
       -> enqueue_entity()
          bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
so renorm is false in enqueue_entity(); why did you mention that the
cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair()?
>>>
>>>
>>> Maybe this is a misunderstanding on my side, but didn't you ask me
>>> '... Could you point out when the fair rq's min_vruntime is added to the
>>> task's vruntime in your *later* scenario? ...'
>>
>>
>> Yeah, if the calltrace above and my analysis are correct, then the fair
>> rq's min_vruntime will not be added to the task's vruntime in your
>> *later* scenario, which means that your patch is not necessary.
>
>
> In the scenario I observed, the task is not waking - it is running and being
> deboosted from priority inheritance, transitioning from RT to CFS.
>
> Dietmar and I both were able to reproduce the issue with the testcase I
> posted earlier in this thread.

Can this issue still show up if the wakeup was not remote, i.e. the
task was locally awakened? In that case, do we still need to check
the class in vruntime_normalized() as John was doing? I just want to
make sure we have caught all the scenarios.

- Joel


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-28 Thread Joel Fernandes
On Wed, Sep 26, 2018 at 6:19 PM, Wanpeng Li  wrote:
> On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann  
> wrote:
>>
>> Hi,
>>
>> On 09/26/2018 11:50 AM, Wanpeng Li wrote:
>> > Hi Dietmar,
>> > On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann  
>> > wrote:
>> >>
>> >> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
>> >>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
>>  On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
>> >>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>>
>> [...]
>>
>>  - later, when the prio is deboosted and the task is moved back
>>    to the fair class, the fair rq's min_vruntime is added to
>>    the task's vruntime, even though it wasn't subtracted earlier.
>> >
>> > Could you point out when the fair rq's min_vruntime is added to the
>> > task's vruntime in your *later* scenario? attach_task_cfs_rq() will not
>> > do that, for the same reason as detach_task_cfs_rq(). The fair task's
>> > sched_remote_wakeup is false, which means the vruntime will not be
>> > renormalized in enqueue_entity().
>>
>> The cfs_rq->min_vruntime is still added to the se->vruntime in
>> enqueue_task_fair().
>
> I understand what your patch does,
>
> On your CPU4:
> scheduler_ipi()
>  -> sched_ttwu_pending()
>   -> ttwu_do_activate()    => p->sched_remote_wakeup should be
>      false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
>    -> ttwu_activate()
>     -> activate_task()
>      -> enqueue_task()
>       -> enqueue_task_fair()
>        -> enqueue_entity()
>           bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
> so renorm is false in enqueue_entity(); why did you mention that the
> cfs_rq->min_vruntime is still added to the se->vruntime in
> enqueue_task_fair()?

If I understand John's original patch correctly, the additional
vruntime is added when the class of the waking task is changed during
priority deboost so that path is different from the one you listed
above? Perhaps you should be looking at rt_mutex_setprio?

 - Joel


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-28 Thread Steve Muckle

On 09/27/2018 05:43 PM, Wanpeng Li wrote:

On your CPU4:
scheduler_ipi()
 -> sched_ttwu_pending()
  -> ttwu_do_activate()    => p->sched_remote_wakeup should be
     false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
   -> ttwu_activate()
    -> activate_task()
     -> enqueue_task()
      -> enqueue_task_fair()
       -> enqueue_entity()
          bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
so renorm is false in enqueue_entity(); why did you mention that the
cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair()?


Maybe this is a misunderstanding on my side, but didn't you ask me
'... Could you point out when the fair rq's min_vruntime is added to the
task's vruntime in your *later* scenario? ...'


Yeah, if the calltrace above and my analysis are correct, then the fair
rq's min_vruntime will not be added to the task's vruntime in your
*later* scenario, which means that your patch is not necessary.


In the scenario I observed, the task is not waking - it is running and 
being deboosted from priority inheritance, transitioning from RT to CFS.


Dietmar and I both were able to reproduce the issue with the testcase I 
posted earlier in this thread.


thanks,
Steve


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-27 Thread Wanpeng Li
On Thu, 27 Sep 2018 at 21:23, Dietmar Eggemann  wrote:
>
> On 09/27/2018 03:19 AM, Wanpeng Li wrote:
> > On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann  
> > wrote:
> >>
> >> Hi,
> >>
> >> On 09/26/2018 11:50 AM, Wanpeng Li wrote:
> >>> Hi Dietmar,
> >>> On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann  
> >>> wrote:
> 
>  On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
> > On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
> >> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
> > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> >>
> >> [...]
> >>
> >> - later, when the prio is deboosted and the task is moved back
> >>to the fair class, the fair rq's min_vruntime is added to
> >>the task's vruntime, even though it wasn't subtracted 
> >> earlier.
> >>>
> >>> Could you point out when the fair rq's min_vruntime is added to the
> >>> task's vruntime in your *later* scenario? attach_task_cfs_rq() will not
> >>> do that, for the same reason as detach_task_cfs_rq(). The fair task's
> >>> sched_remote_wakeup is false, which means the vruntime will not be
> >>> renormalized in enqueue_entity().
> >>
> >> The cfs_rq->min_vruntime is still added to the se->vruntime in
> >> enqueue_task_fair().
> >
> > I understand what your patch does,
>
> It's not my patch ;-) I just helped to find out under which
> circumstances this issue can happen.
>
> > On your CPU4:
> > scheduler_ipi()
> >  -> sched_ttwu_pending()
> >   -> ttwu_do_activate()    => p->sched_remote_wakeup should be
> >      false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
> >    -> ttwu_activate()
> >     -> activate_task()
> >      -> enqueue_task()
> >       -> enqueue_task_fair()
> >        -> enqueue_entity()
> >           bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
> > so renorm is false in enqueue_entity(); why did you mention that the
> > cfs_rq->min_vruntime is still added to the se->vruntime in
> > enqueue_task_fair()?
>
> Maybe this is a misunderstanding on my side, but didn't you ask me
> '... Could you point out when the fair rq's min_vruntime is added to the
> task's vruntime in your *later* scenario? ...'

Yeah, if the calltrace above and my analysis are correct, then the fair
rq's min_vruntime will not be added to the task's vruntime in your
*later* scenario, which means that your patch is not necessary.

Regards,
Wanpeng Li


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-27 Thread Dietmar Eggemann

On 09/27/2018 03:19 AM, Wanpeng Li wrote:

On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann  wrote:


Hi,

On 09/26/2018 11:50 AM, Wanpeng Li wrote:

Hi Dietmar,
On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann  wrote:


On 08/27/2018 12:14 PM, Peter Zijlstra wrote:

On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:

On 08/24/2018 02:47 AM, Peter Zijlstra wrote:

On 08/17/2018 11:27 AM, Steve Muckle wrote:


[...]


- later, when the prio is deboosted and the task is moved back
   to the fair class, the fair rq's min_vruntime is added to
   the task's vruntime, even though it wasn't subtracted earlier.


Could you point out when the fair rq's min_vruntime is added to the
task's vruntime in your *later* scenario? attach_task_cfs_rq() will not
do that, for the same reason as detach_task_cfs_rq(). The fair task's
sched_remote_wakeup is false, which means the vruntime will not be
renormalized in enqueue_entity().


The cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair().


I understand what your patch does,


It's not my patch ;-) I just helped to find out under which 
circumstances this issue can happen.



On your CPU4:
scheduler_ipi()
 -> sched_ttwu_pending()
  -> ttwu_do_activate()    => p->sched_remote_wakeup should be
     false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
   -> ttwu_activate()
    -> activate_task()
     -> enqueue_task()
      -> enqueue_task_fair()
       -> enqueue_entity()
          bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
so renorm is false in enqueue_entity(); why did you mention that the
cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair()?


Maybe this is a misunderstanding on my side, but didn't you ask me
'... Could you point out when the fair rq's min_vruntime is added to the
task's vruntime in your *later* scenario? ...'


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-26 Thread Wanpeng Li
On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann  wrote:
>
> Hi,
>
> On 09/26/2018 11:50 AM, Wanpeng Li wrote:
> > Hi Dietmar,
> > On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann  
> > wrote:
> >>
> >> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
> >>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
>  On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
> >>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>
> [...]
>
>  - later, when the prio is deboosted and the task is moved back
>    to the fair class, the fair rq's min_vruntime is added to
>    the task's vruntime, even though it wasn't subtracted earlier.
> >
> > Could you point out when the fair rq's min_vruntime is added to the
> > task's vruntime in your *later* scenario? attach_task_cfs_rq() will not
> > do that, for the same reason as detach_task_cfs_rq(). The fair task's
> > sched_remote_wakeup is false, which means the vruntime will not be
> > renormalized in enqueue_entity().
>
> The cfs_rq->min_vruntime is still added to the se->vruntime in
> enqueue_task_fair().

I understand what your patch does,

On your CPU4:
scheduler_ipi()
 -> sched_ttwu_pending()
  -> ttwu_do_activate()    => p->sched_remote_wakeup should be
     false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
   -> ttwu_activate()
    -> activate_task()
     -> enqueue_task()
      -> enqueue_task_fair()
       -> enqueue_entity()
          bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
so renorm is false in enqueue_entity(); why did you mention that the
cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair()?

Regards,
Wanpeng Li


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-26 Thread Dietmar Eggemann
Hi,

On 09/26/2018 11:50 AM, Wanpeng Li wrote:
> Hi Dietmar,
> On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann  
> wrote:
>>
>> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
>>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
 On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:

[...]

 - later, when the prio is deboosted and the task is moved back
   to the fair class, the fair rq's min_vruntime is added to
   the task's vruntime, even though it wasn't subtracted earlier.
> 
> Could you point out when the fair rq's min_vruntime is added to the
> task's vruntime in your *later* scenario? attach_task_cfs_rq() will not
> do that, for the same reason as detach_task_cfs_rq(). The fair task's
> sched_remote_wakeup is false, which means the vruntime will not be
> renormalized in enqueue_entity().

The cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair().

It's just that without this patch, which adds the '&&
p->sched_remote_wakeup' bit to the condition under which
vruntime_normalized() returns true, detach_task_cfs_rq() won't take the
'if (!vruntime_normalized(p))' path and therefore won't subtract
cfs_rq->min_vruntime from se->vruntime.

Since 'task_cpu(p) == cpu' in try_to_wake_up() for the fair task,
WF_MIGRATED is not set, and set_task_cpu() -> migrate_task_rq_fair(),
which could subtract cfs_rq->min_vruntime from se->vruntime as well,
is not called.

My former example with a different set of trace events:

fair_task-3580  [004]35.389346: sched_stat_runtime:   comm=fair_task 
pid=3580 runtime=45312 [ns] vruntime=46922871 [ns] <-- se->vruntime=46.922.871
...
  rt_task-3579  [000]35.391573: sched_waking: comm=fair_task 
pid=3580 prio=120 target_cpu=004
...
  rt_task-3579  [000]35.391627: sched_pi_setprio: comm=fair_task 
pid=3580 oldprio=120 newprio=19
...
  rt_task-3579  [000]35.391661: bprint:   detach_task_cfs_rq: 
task=fair_task pid=3580 cpu=4 vruntime_normalized=1
  rt_task-3579  [000]35.391706: sched_switch: rt_task:3579 [19] D 
==> swapper/0:0 [120]
   -0 [004]35.391834: sched_wakeup: fair_task:3580 [19] 
success=1 CPU:004
   -0 [004]35.391840: sched_switch: swapper/4:0 [120] S 
==> fair_task:3580 [19]
fair_task-3580  [004]35.391853: sched_pi_setprio: comm=fair_task 
pid=3580 oldprio=19 newprio=120
...
fair_task-3580  [004]35.391863: bprint:   enqueue_task_fair: 
task=fair_task pid=3580 curr=0 se->vruntime=93845742 cpu=4 
cfs_rq->min_vruntime=46922871
...
fair_task-3580  [004]35.391877: sched_waking: comm=rt_task pid=3579 
prio=19 target_cpu=000
...
fair_task-3580  [004]35.391885: sched_stat_runtime:   comm=fair_task 
pid=3580 runtime=31250 [ns] vruntime=93876992 [ns] <-- se->vruntime=93.876.992


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-26 Thread Wanpeng Li
Hi Dietmar,
On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann  wrote:
>
> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
> > On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
> >> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
> > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> >>>
> >> When rt_mutex_setprio changes a task's scheduling class to RT,
> >> we're seeing cases where the task's vruntime is not updated
> >> correctly upon return to the fair class.
> >>>
> >> Specifically, the following is being observed:
> >> - task is deactivated while still in the fair class
> >> - task is boosted to RT via rt_mutex_setprio, which changes
> >>  the task to RT and calls check_class_changed.
> >> - check_class_changed leads to detach_task_cfs_rq, at which point
> >>  the vruntime_normalized check sees that the task's state is 
> >> TASK_WAKING,
> >>  which results in skipping the subtraction of the rq's min_vruntime
> >>  from the task's vruntime

> >> - later, when the prio is deboosted and the task is moved back
> >>  to the fair class, the fair rq's min_vruntime is added to
> >>  the task's vruntime, even though it wasn't subtracted earlier.

Could you point out when the fair rq's min_vruntime is added to the
task's vruntime in your *later* scenario? attach_task_cfs_rq() will not
do that, for the same reason as detach_task_cfs_rq(). The fair task's
sched_remote_wakeup is false, which means the vruntime will not be
renormalized in enqueue_entity().

Regards,
Wanpeng Li

> >>>
> >>> I'm thinking that is an incomplete scenario; where do we get to
> >>> TASK_WAKING.
> >>
> >> Yes there's a missing bit of context here at the beginning that the task to
> >> be boosted had already been put into TASK_WAKING.
> >
> > See, I'm confused...
> >
> > The only time TASK_WAKING is visible, is if we've done a remote wakeup
> > and it's 'stuck' on the remote wake_list. And in that case we've done
> > migrate_task_rq_fair() on it.
> >
> > So by the time either rt_mutex_setprio() or __sched_setscheduler() get
> > to calling check_class_changed(), under both pi_lock and rq->lock, the
> > vruntime_normalized() thing should be right.
> >
> > So please detail the exact scenario. Because I'm not seeing it.
>
> Using Steve's test program (https://lkml.org/lkml/2018/8/24/686) I see the
> issue but only if the two tasks (rt_task, fair_task) run on 2 cpus which
> don't share LLC (e.g. CPU0 and CPU4 on hikey960).
>
> So the wakeup goes the TTWU_QUEUE && !share_cache (ttwu_queue_remote) path.
>
> ...
> rt_task-3579  [000] 35.391573: sched_waking: comm=fair_task pid=3580 
> prio=120 target_cpu=004
> rt_task-3579  [000] 35.391580: bprint:   try_to_wake_up: 
> try_to_wake_up: task=fair_task pid=3580 task_cpu(p)=4 p->on_rq=0
> rt_task-3579  [000] 35.391584: bprint:   try_to_wake_up: ttwu_queue: 
> task=fair_task pid=3580
> rt_task-3579  [000] 35.391588: bprint:   try_to_wake_up: 
> ttwu_queue_remote: task=fair_task pid=3580
> rt_task-3579  [000] 35.391591: bprint:   try_to_wake_up: 
> ttwu_queue_remote: cpu=4 smp_send_reschedule
> rt_task-3579  [000] 35.391627: sched_pi_setprio: comm=fair_task pid=3580 
> oldprio=120 newprio=19
> rt_task-3579  [000] 35.391635: bprint:   rt_mutex_setprio: 
> task=fair_task pid=3580 prio=120->19 queued=0 running=0 state=0x200 
> vruntime=46922871 cpu=4 cfs_rq->min_vruntime=7807420844
> rt_task-3579  [000] 35.391641: bprint:   rt_mutex_setprio: p->prio 
> set: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 
> vruntime=46922871
> rt_task-3579  [000] 35.391646: bprint:   rt_mutex_setprio: queued 
> checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 
> vruntime=46922871
> rt_task-3579  [000] 35.391652: bprint:   rt_mutex_setprio: running 
> checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 
> vruntime=46922871
> rt_task-3579  [000] 35.391657: bprint:   rt_mutex_setprio: 
> fair_class=0x08da2c80 rt_class=0x08da2d70 
> prev_class=0x08da2c80 p->sched_class=0x08da2d70 oldprio=120 
> p->prio=19
> rt_task-3579  [000] 35.391661: bprint:   detach_task_cfs_rq: 
> task=fair_task pid=3580 cpu=4 vruntime_normalized=1
> rt_task-3579  [000] 35.391706: sched_switch: rt_task:3579 [19] D ==> 
> swapper/0:0 [120]
>  -0 [004] 35.391828: bprint:   ttwu_do_activate: 
> ttwu_do_activate: task=fair_task pid=3580
>  -0 [004] 35.391832: bprint:   ttwu_do_activate: 
> ttwu_activate: task=fair_task pid=3580
>  -0 [004] 35.391833: bprint:   ttwu_do_wakeup: 
> ttwu_do_wakeup: task=fair_task pid=3580
>  -0 [004] 35.391834: sched_wakeup: fair_task:3580 [19] 
> success=1 CPU:004
>
> It doesn't happen on hikey960 when I use two cpus of the same LLC or on my
> laptop (i7-4750HQ).


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-10 Thread Dietmar Eggemann

On 09/07/2018 12:58 AM, Vincent Guittot wrote:

On Fri, 7 Sep 2018 at 09:16, Juri Lelli  wrote:


On 06/09/18 16:25, Dietmar Eggemann wrote:

Hi Juri,

On 08/23/2018 11:54 PM, Juri Lelli wrote:

On 23/08/18 18:52, Dietmar Eggemann wrote:

Hi,

On 08/21/2018 01:54 AM, Miguel de Dios wrote:

On 08/17/2018 11:27 AM, Steve Muckle wrote:

From: John Dias 


[...]


Adding semaphores is possible but rt-app has no easy way to initialize
individual objects, e.g. sem_init(..., value). The only way I see is via the
global section, like "pi_enabled". But then, this is true for all objects of
this kind (in this case mutexes)?


Right, global section should work fine. Why do you think this is a
problem/limitation?


Keep in mind that rt-app still has a "resources" section. It is
optional and almost never used, since resources can be created on the
fly, but it is still there and can be used to initialize resources,
such as semaphores, when needed.


I wasn't aware of that but this will do the job AFAICS. I just have to 
re-introduce the direct calls to init_foo_resource() (in this case 
init_sem_resource()) in init_resource_data() and call that instead of 
init_resource_data() for semaphores listed in the global resources section.


Example for a semaphore b_sem with initial value eq. 1:

"resources" : {
"b_sem" : { "type" : "sem_wait", "value" : 1 }
}

[...]


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-07 Thread Vincent Guittot
On Fri, 7 Sep 2018 at 09:16, Juri Lelli  wrote:
>
> On 06/09/18 16:25, Dietmar Eggemann wrote:
> > Hi Juri,
> >
> > On 08/23/2018 11:54 PM, Juri Lelli wrote:
> > > On 23/08/18 18:52, Dietmar Eggemann wrote:
> > > > Hi,
> > > >
> > > > On 08/21/2018 01:54 AM, Miguel de Dios wrote:
> > > > > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > > > > > From: John Dias 
> >
> > [...]
> >
> > > >
> > > > I tried to catch this issue on my Arm64 Juno board using pi_test (and a
> > > > slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
> > > > rt-tests but wasn't able to do so.
> > > >
> > > > # pi_stress --inversions=1 --duration=1 --groups=1 --sched 
> > > > id=low,policy=cfs
> > > >
> > > > Starting PI Stress Test
> > > > Number of thread groups: 1
> > > > Duration of test run: 1 seconds
> > > > Number of inversions per group: 1
> > > >   Admin thread SCHED_FIFO priority 4
> > > > 1 groups of 3 threads will be created
> > > >High thread SCHED_FIFO priority 3
> > > > Med thread SCHED_FIFO priority 2
> > > > Low thread SCHED_OTHER nice 0
> > > >
> > > > # ./pip_stress
> > > >
> > > > In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
> > > > dequeue_task_fair()->dequeue_entity(), which subtracts 
> > > > cfs_rq->min_vruntime
> > > > from se->vruntime, is called on it before it gets the rt prio.
> > > >
> > > > Maybe it requires a very specific use of the pthread library to provoke 
> > > > this
> > > > issue by making sure that the cfs tasks really blocks/sleeps?
> > >
> > > Maybe one could play with rt-app to recreate such specific use case?
> > >
> > > https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459
> >
> > I played a little bit with rt-app on hikey960 to re-create Steve's test
> > program.
>
> Oh, nice! Thanks for sharing what you have got.
>
> > Since there is no semaphore support (sem_wait(), sem_post()) I used
> > condition variables (wait: pthread_cond_wait() , signal:
> > pthread_cond_signal()). It's not really the same since this is stateless but
> > sleeps before the signals help to maintain the state in this easy example.
> >
> > This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1:
> >
> >
> > "global": {
> > "calibration" : 130,
> >   "pi_enabled" : true
> > },
> > "tasks": {
> > "rt_task": {
> >   "loop" : 100,
> >   "policy" : "SCHED_FIFO",
> >   "cpus" : [0],
> >
> >   "lock" : "b_mutex",
> >   "wait" : { "ref" : "b_cond", "mutex" : "b_mutex" },
> >   "unlock" : "b_mutex",
> >   "sleep" : 3000,
> >   "lock1" : "a_mutex",
> >   "signal" : "a_cond",
> >   "unlock1" : "a_mutex",
> >   "lock2" : "pi-mutex",
> >   "unlock2" : "pi-mutex"
> > },
> >   "cfs_task": {
> >   "loop" : 100,
> >   "policy" : "SCHED_OTHER",
> >   "cpus" : [4],
> >
> >   "lock" : "pi-mutex",
> >   "sleep" : 3000,
> >   "lock1" : "b_mutex",
> >   "signal" : "b_cond",
> >   "unlock" : "b_mutex",
> >   "lock2" : "a_mutex",
> >   "wait" : { "ref" : "a_cond", "mutex" : "a_mutex" },
> >   "unlock1" : "a_mutex",
> >   "unlock2" : "pi-mutex"
> >   }
> > }
> > }
> >
> > Adding semaphores is possible but rt-app has no easy way to initialize
> > individual objects, e.g. sem_init(..., value). The only way I see is via the
> > global section, like "pi_enabled". But then, this is true for all objects of
> > this kind (in this case mutexes)?
>
> Right, global section should work fine. Why do you think this is a
> problem/limitation?

Keep in mind that rt-app still has a "resources" section. It is
optional and almost never used, since resources can be created on the
fly, but it is still there and can be used to initialize resources,
such as semaphores, when needed.

>
> > So the following couple of lines extension to rt-app works because both
> > semaphores can be initialized to 0:
> >
> >  {
> > "global": {
> > "calibration" : 130,
> >   "pi_enabled" : true
> > },
> > "tasks": {
> > "rt_task": {
> >   "loop" : 100,
> >   "policy" : "SCHED_FIFO",
> >   "cpus" : [0],
> >
> >   "sem_wait" : "b_sem",
> >   "sleep" : 1000,
> >   "sem_post" : "a_sem",
> >
> >   "lock" : "pi-mutex",
> >   "unlock" : "pi-mutex"
> > },
> >   "cfs_task": {
> >   "loop" : 100,
> >   "policy" : "SCHED_OTHER",
> >   "cpus" : [4],
> >
> >   "lock" : "pi-mutex",
> >   "sleep" : 1000,
> >   "sem_post" : "b_sem",
> >   "sem_wait" : "a_sem",
> >   "unlock" : "pi-mutex"
> >   }
> > }
> > }
> >
> > Any thoughts on that? I can see something like this as infrastructure to
> > create a regression test case based on rt-app and standard ftrace.
>
> Agree. I guess we s

Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-07 Thread Juri Lelli
On 06/09/18 16:25, Dietmar Eggemann wrote:
> Hi Juri,
> 
> On 08/23/2018 11:54 PM, Juri Lelli wrote:
> > On 23/08/18 18:52, Dietmar Eggemann wrote:
> > > Hi,
> > > 
> > > On 08/21/2018 01:54 AM, Miguel de Dios wrote:
> > > > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > > > > From: John Dias 
> 
> [...]
> 
> > > 
> > > I tried to catch this issue on my Arm64 Juno board using pi_test (and a
> > > slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
> > > rt-tests but wasn't able to do so.
> > > 
> > > # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs
> > > 
> > > Starting PI Stress Test
> > > Number of thread groups: 1
> > > Duration of test run: 1 seconds
> > > Number of inversions per group: 1
> > >   Admin thread SCHED_FIFO priority 4
> > > 1 groups of 3 threads will be created
> > >High thread SCHED_FIFO priority 3
> > > Med thread SCHED_FIFO priority 2
> > > Low thread SCHED_OTHER nice 0
> > > 
> > > # ./pip_stress
> > > 
> > > In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
> > > dequeue_task_fair()->dequeue_entity(), which subtracts 
> > > cfs_rq->min_vruntime
> > > from se->vruntime, is called on it before it gets the rt prio.
> > > 
> > > Maybe it requires a very specific use of the pthread library to provoke 
> > > this
> > > issue by making sure that the cfs tasks really blocks/sleeps?
> > 
> > Maybe one could play with rt-app to recreate such specific use case?
> > 
> > https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459
> 
> I played a little bit with rt-app on hikey960 to re-create Steve's test
> program.

Oh, nice! Thanks for sharing what you have got.

> Since there is no semaphore support (sem_wait(), sem_post()) I used
> condition variables (wait: pthread_cond_wait(), signal:
> pthread_cond_signal()). It's not really the same, since condition variables
> are stateless, but sleeping before the signals helps to maintain the state
> in this simple example.
> 
> This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1:
> 
> 
> "global": {
> "calibration" : 130,
>   "pi_enabled" : true
> },
> "tasks": {
> "rt_task": {
>   "loop" : 100,
>   "policy" : "SCHED_FIFO",
>   "cpus" : [0],
> 
>   "lock" : "b_mutex",
>   "wait" : { "ref" : "b_cond", "mutex" : "b_mutex" },
>   "unlock" : "b_mutex",
>   "sleep" : 3000,
>   "lock1" : "a_mutex",
>   "signal" : "a_cond",
>   "unlock1" : "a_mutex",
>   "lock2" : "pi-mutex",
>   "unlock2" : "pi-mutex"
> },
>   "cfs_task": {
>   "loop" : 100,
>   "policy" : "SCHED_OTHER",
>   "cpus" : [4],
> 
>   "lock" : "pi-mutex",
>   "sleep" : 3000,
>   "lock1" : "b_mutex",
>   "signal" : "b_cond",
>   "unlock" : "b_mutex",
>   "lock2" : "a_mutex",
>   "wait" : { "ref" : "a_cond", "mutex" : "a_mutex" },
>   "unlock1" : "a_mutex",
>   "unlock2" : "pi-mutex"
>   }
> }
> }
> 
> Adding semaphores is possible but rt-app has no easy way to initialize
> individual objects, e.g. sem_init(..., value). The only way I see is via the
> global section, like "pi_enabled". But then, this is true for all objects of
> this kind (in this case mutexes)?

Right, global section should work fine. Why do you think this is a
problem/limitation?

> So the following couple of lines extension to rt-app works because both
> semaphores can be initialized to 0:
> 
>  {
> "global": {
> "calibration" : 130,
>   "pi_enabled" : true
> },
> "tasks": {
> "rt_task": {
>   "loop" : 100,
>   "policy" : "SCHED_FIFO",
>   "cpus" : [0],
> 
>   "sem_wait" : "b_sem",
>   "sleep" : 1000,
>   "sem_post" : "a_sem",
> 
>   "lock" : "pi-mutex",
>   "unlock" : "pi-mutex"
> },
>   "cfs_task": {
>   "loop" : 100,
>   "policy" : "SCHED_OTHER",
>   "cpus" : [4],
> 
>   "lock" : "pi-mutex",
>   "sleep" : 1000,
>   "sem_post" : "b_sem",
>   "sem_wait" : "a_sem",
>   "unlock" : "pi-mutex"
>   }
> }
> }
> 
> Any thoughts on that? I can see something like this as infrastructure to
> create a regression test case based on rt-app and standard ftrace.

Agree. I guess we should add your first example to the repo (you'd be
very welcome to create a PR) already and then work to support the second?


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-06 Thread Dietmar Eggemann

Hi Juri,

On 08/23/2018 11:54 PM, Juri Lelli wrote:

On 23/08/18 18:52, Dietmar Eggemann wrote:

Hi,

On 08/21/2018 01:54 AM, Miguel de Dios wrote:

On 08/17/2018 11:27 AM, Steve Muckle wrote:

From: John Dias 


[...]



I tried to catch this issue on my Arm64 Juno board using pi_test (and a
slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
rt-tests but wasn't able to do so.

# pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs

Starting PI Stress Test
Number of thread groups: 1
Duration of test run: 1 seconds
Number of inversions per group: 1
  Admin thread SCHED_FIFO priority 4
1 groups of 3 threads will be created
   High thread SCHED_FIFO priority 3
Med thread SCHED_FIFO priority 2
Low thread SCHED_OTHER nice 0

# ./pip_stress

In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime
from se->vruntime, is called on it before it gets the rt prio.

Maybe it requires a very specific use of the pthread library to provoke this
issue by making sure that the cfs tasks really blocks/sleeps?


Maybe one could play with rt-app to recreate such specific use case?

https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459


I played a little bit with rt-app on hikey960 to re-create Steve's test 
program.
Since there is no semaphore support (sem_wait(), sem_post()) I used 
condition variables (wait: pthread_cond_wait(), signal: 
pthread_cond_signal()). It's not really the same, since condition 
variables are stateless, but sleeping before the signals helps to 
maintain the state in this simple example.


This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1:


"global": {
"calibration" : 130,
"pi_enabled" : true
},
"tasks": {
"rt_task": {
"loop" : 100,
"policy" : "SCHED_FIFO",
"cpus" : [0],

"lock" : "b_mutex",
"wait" : { "ref" : "b_cond", "mutex" : "b_mutex" },
"unlock" : "b_mutex",
"sleep" : 3000,
"lock1" : "a_mutex",
"signal" : "a_cond",
"unlock1" : "a_mutex",
"lock2" : "pi-mutex",
"unlock2" : "pi-mutex"
},
"cfs_task": {
"loop" : 100,
"policy" : "SCHED_OTHER",
"cpus" : [4],

"lock" : "pi-mutex",
"sleep" : 3000,
"lock1" : "b_mutex",
"signal" : "b_cond",
"unlock" : "b_mutex",
"lock2" : "a_mutex",
"wait" : { "ref" : "a_cond", "mutex" : "a_mutex" },
"unlock1" : "a_mutex",
"unlock2" : "pi-mutex"
}
}
}

Adding semaphores is possible but rt-app has no easy way to initialize 
individual objects, e.g. sem_init(..., value). The only way I see is via 
the global section, like "pi_enabled". But then, this is true for all 
objects of this kind (in this case mutexes)?


So the following couple of lines extension to rt-app works because both 
semaphores can be initialized to 0:


 {
"global": {
"calibration" : 130,
"pi_enabled" : true
},
"tasks": {
"rt_task": {
"loop" : 100,
"policy" : "SCHED_FIFO",
"cpus" : [0],

"sem_wait" : "b_sem",
"sleep" : 1000,
"sem_post" : "a_sem",

"lock" : "pi-mutex",
"unlock" : "pi-mutex"
},
"cfs_task": {
"loop" : 100,
"policy" : "SCHED_OTHER",
"cpus" : [4],

"lock" : "pi-mutex",
"sleep" : 1000,
"sem_post" : "b_sem",
"sem_wait" : "a_sem",
"unlock" : "pi-mutex"
}
}
}

Any thoughts on that? I can see something like this as infrastructure to 
create a regression test case based on rt-app and standard ftrace.


[...]


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-31 Thread Steve Muckle

On 08/29/2018 08:33 AM, Dietmar Eggemann wrote:

Yes, this solves the issue for the case I described. Using
'p->sched_remote_wakeup' (WF_MIGRATED) looks more elegant than using
'p->sched_class == &fair_sched_class'.


It's confirmed that this patch solves the original issue we saw (and my 
test case passes for me as well). I will send this version out shortly.


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-29 Thread Dietmar Eggemann

On 08/29/2018 12:59 PM, Peter Zijlstra wrote:

On Wed, Aug 29, 2018 at 11:54:58AM +0100, Dietmar Eggemann wrote:

I forgot to mention that since fair_task's cpu affinity is restricted to
CPU4, there is no call to set_task_cpu()->migrate_task_rq_fair() since if
(task_cpu(p) != cpu) fails.

I think the combination of cpu affinity of the fair_task to CPU4 and the
fact that the scheduler runs on CPU1 when waking fair_task (with the two
cpus not sharing LLC) while TTWU_QUEUE is enabled is the situation in which
this vruntime issue can happen.


Ohhh, D'0h. A remote wakeup that doesn't migrate.


Ah, there is this WF_MIGRATED flag, perfect for the distinction whether 
a task migrated or not.



That would suggest something like so:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..b3b62cf37fb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 * - A task which has been woken up by try_to_wake_up() and
 *   waiting for actually being woken up by sched_ttwu_pending().
 */
-   if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+   if (!se->sum_exec_runtime ||
+   (p->state == TASK_WAKING && p->sched_remote_wakeup))
return true;
  
  	return false;
Yes, this solves the issue for the case I described. Using 
'p->sched_remote_wakeup' (WF_MIGRATED) looks more elegant than using 
'p->sched_class == &fair_sched_class'.


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-29 Thread Peter Zijlstra
On Wed, Aug 29, 2018 at 11:54:58AM +0100, Dietmar Eggemann wrote:
> I forgot to mention that since fair_task's cpu affinity is restricted to
> CPU4, there is no call to set_task_cpu()->migrate_task_rq_fair() since if
> (task_cpu(p) != cpu) fails.
>
> I think the combination of cpu affinity of the fair_task to CPU4 and the
> fact that the scheduler runs on CPU1 when waking fair_task (with the two
> cpus not sharing LLC) while TTWU_QUEUE is enabled is the situation in which
> this vruntime issue can happen.

Ohhh, D'0h. A remote wakeup that doesn't migrate.

That would suggest something like so:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..b3b62cf37fb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 * - A task which has been woken up by try_to_wake_up() and
 *   waiting for actually being woken up by sched_ttwu_pending().
 */
-   if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+   if (!se->sum_exec_runtime ||
+   (p->state == TASK_WAKING && p->sched_remote_wakeup))
return true;
 
return false;





Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-29 Thread Dietmar Eggemann

On 08/28/2018 03:53 PM, Dietmar Eggemann wrote:

On 08/27/2018 12:14 PM, Peter Zijlstra wrote:

On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:

On 08/24/2018 02:47 AM, Peter Zijlstra wrote:

On 08/17/2018 11:27 AM, Steve Muckle wrote:



When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.



Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
  the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
  the vruntime_normalized check sees that the task's state is TASK_WAKING,
  which results in skipping the subtraction of the rq's min_vruntime
  from the task's vruntime
- later, when the prio is deboosted and the task is moved back
  to the fair class, the fair rq's min_vruntime is added to
  the task's vruntime, even though it wasn't subtracted earlier.


I'm thinking that is an incomplete scenario; where do we get to
TASK_WAKING.


Yes there's a missing bit of context here at the beginning that the task to
be boosted had already been put into TASK_WAKING.


See, I'm confused...

The only time TASK_WAKING is visible, is if we've done a remote wakeup
and it's 'stuck' on the remote wake_list. And in that case we've done
migrate_task_rq_fair() on it.

So by the time either rt_mutex_setprio() or __sched_setscheduler() get
to calling check_class_changed(), under both pi_lock and rq->lock, the
vruntime_normalized() thing should be right.

So please detail the exact scenario. Because I'm not seeing it.


Using Steve's test program (https://lkml.org/lkml/2018/8/24/686) I see the
issue but only if the two tasks (rt_task, fair_task) run on 2 cpus which
don't share LLC (e.g. CPU0 and CPU4 on hikey960).

So the wakeup goes the TTWU_QUEUE && !share_cache (ttwu_queue_remote) path.


I forgot to mention that since fair_task's cpu affinity is restricted to 
CPU4, there is no call to set_task_cpu()->migrate_task_rq_fair() since 
if (task_cpu(p) != cpu) fails.


I think the combination of cpu affinity of the fair_task to CPU4 and the 
fact that the scheduler runs on CPU1 when waking fair_task (with the two 
cpus not sharing LLC) while TTWU_QUEUE is enabled is the situation in 
which this vruntime issue can happen.


[...]


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-28 Thread Dietmar Eggemann
On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
>> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>>>
>> When rt_mutex_setprio changes a task's scheduling class to RT,
>> we're seeing cases where the task's vruntime is not updated
>> correctly upon return to the fair class.
>>>
>> Specifically, the following is being observed:
>> - task is deactivated while still in the fair class
>> - task is boosted to RT via rt_mutex_setprio, which changes
>>  the task to RT and calls check_class_changed.
>> - check_class_changed leads to detach_task_cfs_rq, at which point
>>  the vruntime_normalized check sees that the task's state is 
>> TASK_WAKING,
>>  which results in skipping the subtraction of the rq's min_vruntime
>>  from the task's vruntime
>> - later, when the prio is deboosted and the task is moved back
>>  to the fair class, the fair rq's min_vruntime is added to
>>  the task's vruntime, even though it wasn't subtracted earlier.
>>>
>>> I'm thinking that is an incomplete scenario; where do we get to
>>> TASK_WAKING.
>>
>> Yes there's a missing bit of context here at the beginning that the task to
>> be boosted had already been put into TASK_WAKING.
> 
> See, I'm confused...
> 
> The only time TASK_WAKING is visible, is if we've done a remote wakeup
> and it's 'stuck' on the remote wake_list. And in that case we've done
> migrate_task_rq_fair() on it.
> 
> So by the time either rt_mutex_setprio() or __sched_setscheduler() get
> to calling check_class_changed(), under both pi_lock and rq->lock, the
> vruntime_normalized() thing should be right.
> 
> So please detail the exact scenario. Because I'm not seeing it.

Using Steve's test program (https://lkml.org/lkml/2018/8/24/686) I see the
issue but only if the two tasks (rt_task, fair_task) run on 2 cpus which
don't share LLC (e.g. CPU0 and CPU4 on hikey960).

So the wakeup goes the TTWU_QUEUE && !share_cache (ttwu_queue_remote) path.

...
rt_task-3579  [000] 35.391573: sched_waking: comm=fair_task pid=3580 
prio=120 target_cpu=004
rt_task-3579  [000] 35.391580: bprint:   try_to_wake_up: 
try_to_wake_up: task=fair_task pid=3580 task_cpu(p)=4 p->on_rq=0
rt_task-3579  [000] 35.391584: bprint:   try_to_wake_up: ttwu_queue: 
task=fair_task pid=3580
rt_task-3579  [000] 35.391588: bprint:   try_to_wake_up: 
ttwu_queue_remote: task=fair_task pid=3580
rt_task-3579  [000] 35.391591: bprint:   try_to_wake_up: 
ttwu_queue_remote: cpu=4 smp_send_reschedule
rt_task-3579  [000] 35.391627: sched_pi_setprio: comm=fair_task pid=3580 
oldprio=120 newprio=19
rt_task-3579  [000] 35.391635: bprint:   rt_mutex_setprio: 
task=fair_task pid=3580 prio=120->19 queued=0 running=0 state=0x200 
vruntime=46922871 cpu=4 cfs_rq->min_vruntime=7807420844
rt_task-3579  [000] 35.391641: bprint:   rt_mutex_setprio: p->prio set: 
task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 vruntime=46922871
rt_task-3579  [000] 35.391646: bprint:   rt_mutex_setprio: queued 
checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 
vruntime=46922871
rt_task-3579  [000] 35.391652: bprint:   rt_mutex_setprio: running 
checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 
vruntime=46922871
rt_task-3579  [000] 35.391657: bprint:   rt_mutex_setprio: 
fair_class=0x08da2c80 rt_class=0x08da2d70 
prev_class=0x08da2c80 p->sched_class=0x08da2d70 oldprio=120 
p->prio=19
rt_task-3579  [000] 35.391661: bprint:   detach_task_cfs_rq: 
task=fair_task pid=3580 cpu=4 vruntime_normalized=1
rt_task-3579  [000] 35.391706: sched_switch: rt_task:3579 [19] D ==> 
swapper/0:0 [120] 
 <idle>-0 [004] 35.391828: bprint:   ttwu_do_activate: 
ttwu_do_activate: task=fair_task pid=3580
 <idle>-0 [004] 35.391832: bprint:   ttwu_do_activate: 
ttwu_activate: task=fair_task pid=3580
 <idle>-0 [004] 35.391833: bprint:   ttwu_do_wakeup: 
ttwu_do_wakeup: task=fair_task pid=3580
 <idle>-0 [004] 35.391834: sched_wakeup: fair_task:3580 [19] success=1 
CPU:004

It doesn't happen on hikey960 when I use two cpus of the same LLC or on my
laptop (i7-4750HQ).


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-27 Thread Peter Zijlstra
On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
> > > > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > 
> > > > > When rt_mutex_setprio changes a task's scheduling class to RT,
> > > > > we're seeing cases where the task's vruntime is not updated
> > > > > correctly upon return to the fair class.
> > 
> > > > > Specifically, the following is being observed:
> > > > > - task is deactivated while still in the fair class
> > > > > - task is boosted to RT via rt_mutex_setprio, which changes
> > > > > the task to RT and calls check_class_changed.
> > > > > - check_class_changed leads to detach_task_cfs_rq, at which point
> > > > > the vruntime_normalized check sees that the task's state is 
> > > > > TASK_WAKING,
> > > > > which results in skipping the subtraction of the rq's min_vruntime
> > > > > from the task's vruntime
> > > > > - later, when the prio is deboosted and the task is moved back
> > > > > to the fair class, the fair rq's min_vruntime is added to
> > > > > the task's vruntime, even though it wasn't subtracted earlier.
> > 
> > I'm thinking that is an incomplete scenario; where do we get to
> > TASK_WAKING.
> 
> Yes there's a missing bit of context here at the beginning that the task to
> be boosted had already been put into TASK_WAKING.

See, I'm confused...

The only time TASK_WAKING is visible, is if we've done a remote wakeup
and it's 'stuck' on the remote wake_list. And in that case we've done
migrate_task_rq_fair() on it.

So by the time either rt_mutex_setprio() or __sched_setscheduler() get
to calling check_class_changed(), under both pi_lock and rq->lock, the
vruntime_normalized() thing should be right.

So please detail the exact scenario. Because I'm not seeing it.



Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-24 Thread Steve Muckle

On 08/24/2018 02:47 AM, Peter Zijlstra wrote:

On 08/17/2018 11:27 AM, Steve Muckle wrote:



When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.



Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
the vruntime_normalized check sees that the task's state is TASK_WAKING,
which results in skipping the subtraction of the rq's min_vruntime
from the task's vruntime
- later, when the prio is deboosted and the task is moved back
to the fair class, the fair rq's min_vruntime is added to
the task's vruntime, even though it wasn't subtracted earlier.


I'm thinking that is an incomplete scenario; where do we get to
TASK_WAKING.


Yes there's a missing bit of context here at the beginning that the task 
to be boosted had already been put into TASK_WAKING.


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-24 Thread Steve Muckle

On 08/23/2018 11:54 PM, Juri Lelli wrote:

I tried to catch this issue on my Arm64 Juno board using pi_test (and a
slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
rt-tests but wasn't able to do so.

# pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs

Starting PI Stress Test
Number of thread groups: 1
Duration of test run: 1 seconds
Number of inversions per group: 1
  Admin thread SCHED_FIFO priority 4
1 groups of 3 threads will be created
   High thread SCHED_FIFO priority 3
Med thread SCHED_FIFO priority 2
Low thread SCHED_OTHER nice 0

# ./pip_stress

In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime
from se->vruntime, is called on it before it gets the rt prio.

Maybe it requires a very specific use of the pthread library to provoke this
issue by making sure that the cfs tasks really blocks/sleeps?


Maybe one could play with rt-app to recreate such specific use case?

https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459


This was reproduced for me on tip of mainline by using the program at 
the end of this mail. It was run in a 2 CPU virtualbox VM. Relevant 
annotated bits of the trace:


low-prio thread vruntime is 752ms
  pi-vruntime-tes-598   [001] d...   520.572459: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=29953 [ns] vruntime=752888705 [ns]


low-prio thread waits on a_sem
  pi-vruntime-tes-598   [001] d...   520.572465: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=598 prev_prio=120 prev_state=D ==> 
next_comm=swapper/1 next_pid=0 next_prio=120


high prio thread finishes wakeup, then sleeps for 1ms
   <idle>-0 [000] dNh.   520.572483: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
   <idle>-0 [000] d...   520.572486: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
  pi-vruntime-tes-597   [000] d...   520.572498: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120


high prio thread wakes up after 1ms sleep, posts a_sem which starts to 
wake low-prio thread, then tries to grab pi_mutex, which low-prio thread has
   <idle>-0 [000] d.h.   520.573876: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
   <idle>-0 [000] dNh.   520.573879: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
   <idle>-0 [000] d...   520.573887: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
  pi-vruntime-tes-597   [000] d...   520.573895: sched_waking: 
comm=pi-vruntime-tes pid=598 prio=120 target_cpu=001


low-prio thread pid 598 gets pi_mutex priority inheritance, this happens 
while low-prio thread is in waking state
  pi-vruntime-tes-597   [000] d...   520.573911: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=120 newprio=19


high-prio thread sleeps on pi_mutex
  pi-vruntime-tes-597   [000] d...   520.573919: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120


low-prio thread finishes wakeup
   <idle>-0 [001] dNh.   520.573932: sched_wakeup: 
comm=pi-vruntime-tes pid=598 prio=19 target_cpu=001
   <idle>-0 [001] d...   520.573936: sched_switch: 
prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=598 next_prio=19


low-prio thread releases pi-mutex, loses pi boost, high-prio thread 
wakes for pi-mutex
  pi-vruntime-tes-598   [001] d...   520.573946: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=19 newprio=120
  pi-vruntime-tes-598   [001] dN..   520.573954: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000


low-prio thread vruntime is 1505ms
  pi-vruntime-tes-598   [001] dN..   520.573966: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=20150 [ns] vruntime=1505797560 [ns]


The timing is quite sensitive since the task being boosted has to be 
caught in the TASK_WAKING state. The program:


/*
  * Test case for vruntime management during rtmutex priority inheritance
  * promotion and demotion.
  *
  * build with -lpthread
  */

#define _GNU_SOURCE
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <sched.h>
#include <unistd.h>

#define ERROR_CHECK(x) \
 if (x) \
 fprintf(stderr, "Error at line %d", __LINE__);

pthread_mutex_t pi_mutex;
sem_t a_sem;
sem_t b_sem;

void *rt_thread_func(void *arg) {
 int policy;
 int i = 0;
 cpu_set_t cpuset;

 CPU_ZERO(&cpuset);
 CPU_SET(0, &cpuset);
 ERROR_CHECK(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset));

 while(i++ < 100) {
 sem_wait(&b_sem);

Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-24 Thread Peter Zijlstra
> > On 08/17/2018 11:27 AM, Steve Muckle wrote:

> > > When rt_mutex_setprio changes a task's scheduling class to RT,
> > > we're seeing cases where the task's vruntime is not updated
> > > correctly upon return to the fair class.

> > > Specifically, the following is being observed:
> > > - task is deactivated while still in the fair class
> > > - task is boosted to RT via rt_mutex_setprio, which changes
> > >the task to RT and calls check_class_changed.
> > > - check_class_changed leads to detach_task_cfs_rq, at which point
> > >the vruntime_normalized check sees that the task's state is 
> > > TASK_WAKING,
> > >which results in skipping the subtraction of the rq's min_vruntime
> > >from the task's vruntime
> > > - later, when the prio is deboosted and the task is moved back
> > >to the fair class, the fair rq's min_vruntime is added to
> > >the task's vruntime, even though it wasn't subtracted earlier.

I'm thinking that is an incomplete scenario; where do we get to
TASK_WAKING.


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-24 Thread Peter Zijlstra
On Mon, Aug 20, 2018 at 04:54:25PM -0700, Miguel de Dios wrote:
> On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > From: John Dias 
> > 
> > When rt_mutex_setprio changes a task's scheduling class to RT,
> > we're seeing cases where the task's vruntime is not updated
> > correctly upon return to the fair class.
> > Specifically, the following is being observed:
> > - task is deactivated while still in the fair class
> > - task is boosted to RT via rt_mutex_setprio, which changes
> >the task to RT and calls check_class_changed.
> > - check_class_changed leads to detach_task_cfs_rq, at which point
> >the vruntime_normalized check sees that the task's state is TASK_WAKING,
> >which results in skipping the subtraction of the rq's min_vruntime
> >from the task's vruntime
> > - later, when the prio is deboosted and the task is moved back
> >to the fair class, the fair rq's min_vruntime is added to
> >the task's vruntime, even though it wasn't subtracted earlier.
> > The immediate result is inflation of the task's vruntime, giving
> > it lower priority (starving it if there's enough available work).
> > The longer-term effect is inflation of all vruntimes because the
> > task's vruntime becomes the rq's min_vruntime when the higher
> > priority tasks go idle. That leads to a vicious cycle, where
> > the vruntime inflation repeatedly doubles.
> > 
> > The change here is to detect when vruntime_normalized is being
> > called when the task is waking but is waking in another class,
> > and to conclude that this is a case where vruntime has not
> > been normalized.
> > 
> > Signed-off-by: John Dias 
> > Signed-off-by: Steve Muckle 
> > ---
> >   kernel/sched/fair.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b39fb596f6c1..14011d7929d8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
> >  * - A task which has been woken up by try_to_wake_up() and
> >  *   waiting for actually being woken up by sched_ttwu_pending().
> >  */
> > -   if (!se->sum_exec_runtime || p->state == TASK_WAKING)
> > +   if (!se->sum_exec_runtime ||
> > +   (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
> > return true;
> > return false;

> The normalization of vruntime used to exist in task_waking but it was
> removed and the normalization was moved into migrate_task_rq_fair. The
> reasoning being that task_waking_fair was only hit when a task is queued
> onto a different core and migrate_task_rq_fair should do the same work.
> 
> However, we're finding that there's one case which migrate_task_rq_fair
> doesn't hit: that being the case where rt_mutex_setprio changes a task's
> scheduling class to RT when it's scheduled out. The task never hits
> migrate_task_rq_fair because it is switched to RT and migrates as an RT
> task. Because of this we're getting an unbounded addition of min_vruntime
> when the task is re-attached to the CFS runqueue when it loses the inherited
> priority. The patch above works because now the kernel specifically checks
> for this case and normalizes accordingly.
> 
> Here's the patch I was talking about:
> https://lore.kernel.org/patchwork/patch/677689/. In our testing we were
> seeing vruntimes nearly double every time after rt_mutex_setprio boosts the
> task to RT.

Bah, patchwork is such shit... how do you get to the previous patch from
there? Because I think 2/3 is the actual commit that changed things, 3/3
just cleans up a bit.

That would be commit:

  b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")

But I'm still somewhat confused; how would task_waking_fair() have
helped if we're already changed to a different class?




Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-23 Thread Juri Lelli
On 23/08/18 18:52, Dietmar Eggemann wrote:
> Hi,
> 
> On 08/21/2018 01:54 AM, Miguel de Dios wrote:
> > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > > From: John Dias 
> > > 
> > > When rt_mutex_setprio changes a task's scheduling class to RT,
> > > we're seeing cases where the task's vruntime is not updated
> > > correctly upon return to the fair class.
> > > Specifically, the following is being observed:
> > > - task is deactivated while still in the fair class
> > > - task is boosted to RT via rt_mutex_setprio, which changes
> > >    the task to RT and calls check_class_changed.
> > > - check_class_changed leads to detach_task_cfs_rq, at which point
> > >    the vruntime_normalized check sees that the task's state is
> > > TASK_WAKING,
> > >    which results in skipping the subtraction of the rq's min_vruntime
> > >    from the task's vruntime
> > > - later, when the prio is deboosted and the task is moved back
> > >    to the fair class, the fair rq's min_vruntime is added to
> > >    the task's vruntime, even though it wasn't subtracted earlier.
> > > The immediate result is inflation of the task's vruntime, giving
> > > it lower priority (starving it if there's enough available work).
> > > The longer-term effect is inflation of all vruntimes because the
> > > task's vruntime becomes the rq's min_vruntime when the higher
> > > priority tasks go idle. That leads to a vicious cycle, where
> > > the vruntime inflation repeatedly doubles.
> > > 
> > > The change here is to detect when vruntime_normalized is being
> > > called when the task is waking but is waking in another class,
> > > and to conclude that this is a case where vruntime has not
> > > been normalized.
> > > 
> > > Signed-off-by: John Dias 
> > > Signed-off-by: Steve Muckle 
> > > ---
> > >   kernel/sched/fair.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index b39fb596f6c1..14011d7929d8 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
> > >    * - A task which has been woken up by try_to_wake_up() and
> > >    *   waiting for actually being woken up by sched_ttwu_pending().
> > >    */
> > > -    if (!se->sum_exec_runtime || p->state == TASK_WAKING)
> > > +    if (!se->sum_exec_runtime ||
> > > +    (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
> > >   return true;
> > >   return false;
> > The normalization of vruntime used to exist in task_waking but it was
> > removed and the normalization was moved into migrate_task_rq_fair. The
> > reasoning being that task_waking_fair was only hit when a task is queued
> > onto a different core and migrate_task_rq_fair should do the same work.
> > 
> > However, we're finding that there's one case which migrate_task_rq_fair
> > doesn't hit: that being the case where rt_mutex_setprio changes a task's
> > scheduling class to RT when it's scheduled out. The task never hits
> > migrate_task_rq_fair because it is switched to RT and migrates as an RT
> > task. Because of this we're getting an unbounded addition of
> > min_vruntime when the task is re-attached to the CFS runqueue when it
> > loses the inherited priority. The patch above works because now the
> > kernel specifically checks for this case and normalizes accordingly.
> > 
> > Here's the patch I was talking about:
> > https://lore.kernel.org/patchwork/patch/677689/. In our testing we were
> > seeing vruntimes nearly double every time after rt_mutex_setprio boosts
> > the task to RT.
> > 
> > Signed-off-by: Miguel de Dios 
> > Tested-by: Miguel de Dios 
> 
> I tried to catch this issue on my Arm64 Juno board using pi_test (and a
> slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
> rt-tests but wasn't able to do so.
> 
> # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs
> 
> Starting PI Stress Test
> Number of thread groups: 1
> Duration of test run: 1 seconds
> Number of inversions per group: 1
>  Admin thread SCHED_FIFO priority 4
> 1 groups of 3 threads will be created
>   High thread SCHED_FIFO priority 3
>Med thread SCHED_FIFO priority 2
>Low thread SCHED_OTHER nice 0
> 
> # ./pip_stress
> 
> In both cases, the cfs task entering rt_mutex_setprio() is queued, so
> dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime
> from se->vruntime, is called on it before it gets the rt prio.
> 
> Maybe it requires a very specific use of the pthread library to provoke this
> issue by making sure that the cfs task really blocks/sleeps?

Maybe one could play with rt-app to recreate such specific use case?

https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459
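
For what it's worth, such a use case might be sketched roughly as below. This is an untested illustration only: the key names ("pi_enabled", the "lock"/"unlock" and "sleep" events, the per-task "priority") are assumptions that should be double-checked against the tutorial linked above. The idea is a low CFS task that blocks while holding a PI mutex, so a SCHED_FIFO contender boosts it while it is actually sleeping:

```json
{
    "global": {
        "duration": 10,
        "pi_enabled": true
    },
    "tasks": {
        "low_cfs": {
            "policy": "SCHED_OTHER",
            "phases": {
                "critical": {
                    "lock": "pi_mutex",
                    "run": 1000,
                    "sleep": 1500,
                    "unlock": "pi_mutex"
                }
            }
        },
        "high_rt": {
            "policy": "SCHED_FIFO",
            "priority": 10,
            "phases": {
                "contend": {
                    "sleep": 500,
                    "lock": "pi_mutex",
                    "run": 100,
                    "unlock": "pi_mutex"
                }
            }
        }
    }
}
```

The 1500 usec sleep inside the critical section mirrors the adapted pip_test (usleep_val = 1500), so the boost should hit the CFS task while it is blocked rather than queued.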

Best,

- Juri


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-23 Thread Dietmar Eggemann

Hi,

On 08/21/2018 01:54 AM, Miguel de Dios wrote:

On 08/17/2018 11:27 AM, Steve Muckle wrote:

From: John Dias 

When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.
Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
   the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
   the vruntime_normalized check sees that the task's state is TASK_WAKING,
   which results in skipping the subtraction of the rq's min_vruntime
   from the task's vruntime
- later, when the prio is deboosted and the task is moved back
   to the fair class, the fair rq's min_vruntime is added to
   the task's vruntime, even though it wasn't subtracted earlier.
The immediate result is inflation of the task's vruntime, giving
it lower priority (starving it if there's enough available work).
The longer-term effect is inflation of all vruntimes because the
task's vruntime becomes the rq's min_vruntime when the higher
priority tasks go idle. That leads to a vicious cycle, where
the vruntime inflation repeatedly doubles.

The change here is to detect when vruntime_normalized is being
called while the task is waking but is no longer in the fair class,
and to conclude that this is a case where vruntime has not
been normalized.

Signed-off-by: John Dias 
Signed-off-by: Steve Muckle 
---
  kernel/sched/fair.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..14011d7929d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
   * - A task which has been woken up by try_to_wake_up() and
   *   waiting for actually being woken up by sched_ttwu_pending().
   */
-    if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+    if (!se->sum_exec_runtime ||
+    (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
  return true;
  return false;
The normalization of vruntime used to exist in task_waking but it was 
removed and the normalization was moved into migrate_task_rq_fair. The 
reasoning being that task_waking_fair was only hit when a task is queued 
onto a different core and migrate_task_rq_fair should do the same work.


However, we're finding that there's one case which migrate_task_rq_fair 
doesn't hit: that being the case where rt_mutex_setprio changes a task's 
scheduling class to RT when it's scheduled out. The task never hits 
migrate_task_rq_fair because it is switched to RT and migrates as an RT 
task. Because of this we're getting an unbounded addition of 
min_vruntime when the task is re-attached to the CFS runqueue when it 
loses the inherited priority. The patch above works because now the 
kernel specifically checks for this case and normalizes accordingly.


Here's the patch I was talking about: 
https://lore.kernel.org/patchwork/patch/677689/. In our testing we were 
seeing vruntimes nearly double every time after rt_mutex_setprio boosts 
the task to RT.


Signed-off-by: Miguel de Dios 
Tested-by: Miguel de Dios 


I tried to catch this issue on my Arm64 Juno board using pi_test (and a 
slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from 
rt-tests but wasn't able to do so.


# pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs

Starting PI Stress Test
Number of thread groups: 1
Duration of test run: 1 seconds
Number of inversions per group: 1
 Admin thread SCHED_FIFO priority 4
1 groups of 3 threads will be created
  High thread SCHED_FIFO priority 3
   Med thread SCHED_FIFO priority 2
   Low thread SCHED_OTHER nice 0

# ./pip_stress

In both cases, the cfs task entering rt_mutex_setprio() is queued, so 
dequeue_task_fair()->dequeue_entity(), which subtracts 
cfs_rq->min_vruntime from se->vruntime, is called on it before it gets 
the rt prio.


Maybe it requires a very specific use of the pthread library to provoke 
this issue by making sure that the cfs task really blocks/sleeps?


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-20 Thread Miguel de Dios

On 08/17/2018 11:27 AM, Steve Muckle wrote:

From: John Dias 

When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.
Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
   the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
   the vruntime_normalized check sees that the task's state is TASK_WAKING,
   which results in skipping the subtraction of the rq's min_vruntime
   from the task's vruntime
- later, when the prio is deboosted and the task is moved back
   to the fair class, the fair rq's min_vruntime is added to
   the task's vruntime, even though it wasn't subtracted earlier.
The immediate result is inflation of the task's vruntime, giving
it lower priority (starving it if there's enough available work).
The longer-term effect is inflation of all vruntimes because the
task's vruntime becomes the rq's min_vruntime when the higher
priority tasks go idle. That leads to a vicious cycle, where
the vruntime inflation repeatedly doubles.

The change here is to detect when vruntime_normalized is being
called while the task is waking but is no longer in the fair class,
and to conclude that this is a case where vruntime has not
been normalized.

Signed-off-by: John Dias 
Signed-off-by: Steve Muckle 
---
  kernel/sched/fair.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..14011d7929d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 * - A task which has been woken up by try_to_wake_up() and
 *   waiting for actually being woken up by sched_ttwu_pending().
 */
-   if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+   if (!se->sum_exec_runtime ||
+   (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
return true;
  
  	return false;
The normalization of vruntime used to exist in task_waking but it was 
removed and the normalization was moved into migrate_task_rq_fair. The 
reasoning being that task_waking_fair was only hit when a task is queued 
onto a different core and migrate_task_rq_fair should do the same work.


However, we're finding that there's one case which migrate_task_rq_fair 
doesn't hit: that being the case where rt_mutex_setprio changes a task's 
scheduling class to RT when it's scheduled out. The task never hits 
migrate_task_rq_fair because it is switched to RT and migrates as an RT 
task. Because of this we're getting an unbounded addition of 
min_vruntime when the task is re-attached to the CFS runqueue when it 
loses the inherited priority. The patch above works because now the 
kernel specifically checks for this case and normalizes accordingly.


Here's the patch I was talking about: 
https://lore.kernel.org/patchwork/patch/677689/. In our testing we were 
seeing vruntimes nearly double every time after rt_mutex_setprio boosts 
the task to RT.


Signed-off-by: Miguel de Dios 
Tested-by: Miguel de Dios