Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list

2016-09-21 Thread Vincent Guittot
On 21 September 2016 at 19:25, Dietmar Eggemann wrote:
> On 21/09/16 13:34, Vincent Guittot wrote:
>> Hi Dietmar,
>>
>> On 21 September 2016 at 12:14, Dietmar Eggemann wrote:
>>> Hi Vincent,
>>>
>>> On 12/09/16 08:47, Vincent Guittot wrote:
>
> [...]
>
>>> I guess in the meantime we lost the functionality to remove a cfs_rq from the
>>> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
>>> away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
>>> owned by Cpu1 stay in the leaf_cfs_rq list.
>>>
>>> Shouldn't we reintegrate it? Following patch goes on top of this patch:
>>
>> I see one problem: once a cfs_rq has been removed, it will not be
>> periodically updated anymore until the next enqueue. A sleeping task
>> is only attached, not enqueued, when it moves into a cfs_rq, so its
>> load is added to the cfs_rq's load, which has to be periodically
>> updated/decayed. So adding a cfs_rq to the list only during an enqueue
>> is not enough in this case.
>
> There was more in the original patch (commit 82958366cfea): the removal of the
> cfs_rq from the list was only done once se->avg.runnable_avg_sum had
> decayed to 0. Couldn't we use something similar with se->avg.load_avg instead
> to make sure that these blocked contributions have been decayed before we do
> the removal?

Yes, we can use se->avg.load_avg, but that's not enough. Once
se->avg.load_avg becomes null and the group cfs_rq->nr_running == 0, we
remove the cfs_rq from the list and this cfs_rq will not be updated
until a new sched_entity is enqueued. But when you move a sleeping
task between two task groups, the load of the task is attached to the
destination cfs_rq but not enqueued, so cfs_rq->avg.load_avg is no
longer null, yet it will not be updated by update_blocked_averages()
because the cfs_rq is no longer in the list. So waiting for the enqueue
of a sched_entity to add the cfs_rq back to the list is no longer
enough; you will have to do that during attach too.

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4ac1718560e2..3595b0623b37 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6566,6 +6566,8 @@ static void update_blocked_averages(int cpu)
>  * list_add_leaf_cfs_rq() for details.
>  */
> for_each_leaf_cfs_rq(rq, cfs_rq) {
> +   struct sched_entity *se = cfs_rq->tg->se[cpu];
> +
> /* throttled entities do not contribute to load */
> if (throttled_hierarchy(cfs_rq))
> continue;
> @@ -6574,8 +6576,12 @@ static void update_blocked_averages(int cpu)
> update_tg_load_avg(cfs_rq, 0);
>
> /* Propagate pending load changes to the parent */
> -   if (cfs_rq->tg->se[cpu])
> -   update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
> +   if (se) {
> +   update_load_avg(se, 0, 0);
> +
> +   if (!se->avg.load_avg && !cfs_rq->nr_running)
> +   list_del_leaf_cfs_rq(cfs_rq);
> +   }
> }
> raw_spin_unlock_irqrestore(&rq->lock, flags);
>  }
>
>
>> Then, only the highest child level of task group will be removed,
>> because cfs_rq->nr_running will be > 0 as soon as a child task group
>> is created and enqueued into a task group. Or you should use
>> cfs.h_nr_running instead; I don't know all the implications.
>
> In my tests all cfs_rq's (tg_id=6,4,2) on the source CPU (CPU1) get removed? 
> Do I miss something?

No, you don't miss anything; I replied a bit too quickly on that
point. The sched_entity of a task group is dequeued when its group
cfs_rq becomes idle, so the parent cfs_rq->nr_running can be null.

>
> migration/1-16 [001] 67.235292: sched_migrate_task: comm=task0 pid=2210 
> prio=120 orig_cpu=1 dest_cpu=2
> migration/1-16 [001] 67.235295: bprint: enqueue_task_fair: 
> list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x80097550d700 tg_id=6 on_list=0
> migration/1-16 [001] 67.235298: bprint: enqueue_task_fair: 
> list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x800975903700 tg_id=4 on_list=0
> migration/1-16 [001] 67.235300: bprint: enqueue_task_fair: 
> list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x800974e0fe00 tg_id=2 on_list=0
> migration/1-16 [001] 67.235302: bprint: enqueue_task_fair: 
> list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x80097fecb6e0 tg_id=1 on_list=1
> migration/1-16 [001] 67.235309: bprint: 
> update_blocked_averages: update_blocked_averages: cpu=1 
> cfs_rq=0x80097550d800 tg_id=6 on_list=1
>  -> migration/1-16 [001] 67.235312: bprint: 
> update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 
> cfs_rq=0x80097550d800 tg_id=6 on_list=1
> migration/1-16 [001] 67.235314: bprint: 
> update_blocked_averages: update_blocked_averages: cpu=1 
> cfs_rq=0x800975903600 tg_id=4 on_list=1
>  

Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list

2016-09-21 Thread Dietmar Eggemann
On 21/09/16 13:34, Vincent Guittot wrote:
> Hi Dietmar,
> 
> On 21 September 2016 at 12:14, Dietmar Eggemann wrote:
>> Hi Vincent,
>>
>> On 12/09/16 08:47, Vincent Guittot wrote:

[...]

>> I guess in the meantime we lost the functionality to remove a cfs_rq from the
>> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
>> away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
>> owned by Cpu1 stay in the leaf_cfs_rq list.
>>
>> Shouldn't we reintegrate it? Following patch goes on top of this patch:
> 
> I see one problem: once a cfs_rq has been removed, it will not be
> periodically updated anymore until the next enqueue. A sleeping task
> is only attached, not enqueued, when it moves into a cfs_rq, so its
> load is added to the cfs_rq's load, which has to be periodically
> updated/decayed. So adding a cfs_rq to the list only during an enqueue
> is not enough in this case.

There was more in the original patch (commit 82958366cfea): the removal of the
cfs_rq from the list was only done once se->avg.runnable_avg_sum had
decayed to 0. Couldn't we use something similar with se->avg.load_avg instead
to make sure that these blocked contributions have been decayed before we do the
removal?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ac1718560e2..3595b0623b37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6566,6 +6566,8 @@ static void update_blocked_averages(int cpu)
 * list_add_leaf_cfs_rq() for details.
 */
for_each_leaf_cfs_rq(rq, cfs_rq) {
+   struct sched_entity *se = cfs_rq->tg->se[cpu];
+
/* throttled entities do not contribute to load */
if (throttled_hierarchy(cfs_rq))
continue;
@@ -6574,8 +6576,12 @@ static void update_blocked_averages(int cpu)
update_tg_load_avg(cfs_rq, 0); 
 
/* Propagate pending load changes to the parent */
-   if (cfs_rq->tg->se[cpu])
-   update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
+   if (se) {
+   update_load_avg(se, 0, 0);
+
+   if (!se->avg.load_avg && !cfs_rq->nr_running)
+   list_del_leaf_cfs_rq(cfs_rq);
+   }
}
raw_spin_unlock_irqrestore(&rq->lock, flags);
 }


> Then, only the highest child level of task group will be removed,
> because cfs_rq->nr_running will be > 0 as soon as a child task group
> is created and enqueued into a task group. Or you should use
> cfs.h_nr_running instead; I don't know all the implications.

In my tests all cfs_rq's (tg_id=6,4,2) on the source CPU (CPU1) get removed? Do 
I miss something?

migration/1-16 [001] 67.235292: sched_migrate_task: comm=task0 pid=2210 
prio=120 orig_cpu=1 dest_cpu=2
migration/1-16 [001] 67.235295: bprint: enqueue_task_fair: 
list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x80097550d700 tg_id=6 on_list=0
migration/1-16 [001] 67.235298: bprint: enqueue_task_fair: 
list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x800975903700 tg_id=4 on_list=0
migration/1-16 [001] 67.235300: bprint: enqueue_task_fair: 
list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x800974e0fe00 tg_id=2 on_list=0
migration/1-16 [001] 67.235302: bprint: enqueue_task_fair: 
list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x80097fecb6e0 tg_id=1 on_list=1
migration/1-16 [001] 67.235309: bprint: 
update_blocked_averages: update_blocked_averages: cpu=1 
cfs_rq=0x80097550d800 tg_id=6 on_list=1
 -> migration/1-16 [001] 67.235312: bprint: 
update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0x80097550d800 
tg_id=6 on_list=1
migration/1-16 [001] 67.235314: bprint: 
update_blocked_averages: update_blocked_averages: cpu=1 
cfs_rq=0x800975903600 tg_id=4 on_list=1
 -> migration/1-16 [001] 67.235315: bprint: 
update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0x800975903600 
tg_id=4 on_list=1
migration/1-16 [001] 67.235316: bprint: 
update_blocked_averages: update_blocked_averages: cpu=1 
cfs_rq=0x800974e0f600 tg_id=2 on_list=1
 -> migration/1-16 [001] 67.235318: bprint: 
update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0x800974e0f600 
tg_id=2 on_list=1
migration/1-16 [001] 67.235319: bprint: 
update_blocked_averages: update_blocked_averages: cpu=1 
cfs_rq=0x80097feb56e0 tg_id=1 on_list=1

If we don't remove these cfs_rq's, IMHO they stay in the list as long as the tg 
exists.

[...]


Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list

2016-09-21 Thread Vincent Guittot
Hi Dietmar,

On 21 September 2016 at 12:14, Dietmar Eggemann wrote:
> Hi Vincent,
>
> On 12/09/16 08:47, Vincent Guittot wrote:
>> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that
>> a child will always be called before its parent.
>>
>> The hierarchical order in shares update list has been introduced by
>> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update 
>> list")
>>
>> With the current implementation a child can still be put after its parent.
>>
>> Let's take the example of
>>         root
>>           \
>>            b
>>           / \
>>          c   d*
>>              |
>>              e*
>>
>> with root -> b -> c already enqueued but not d -> e so the leaf_cfs_rq_list
>> looks like: head -> c -> b -> root -> tail
>>
>> The branch d -> e will be added the first time that they are enqueued,
>> starting with e then d.
>>
>> When e is added, its parent is not already on the list, so e is put at the
>> tail : head -> c -> b -> root -> e -> tail
>>
>> Then, d is added at the head because its parent is already on the list:
>> head -> d -> c -> b -> root -> e -> tail
>>
>> e is not placed at the right position and will be called last whereas
>> it should be called at the beginning.
>>
>> Because it follows the bottom-up enqueue sequence, we are sure that we
>> will finish by adding either a cfs_rq without a parent or a cfs_rq with a parent
>> that is already on the list. We can use this event to detect when we have
>> finished adding a new branch. For the others, whose parents are not already
>> added, we have to ensure that they will be added after their children that
>> have just been inserted in the steps before, and after any potential parents
>> that are already in the list. The easiest way is to put the cfs_rq just after the
>> last inserted one and to keep track of it until the branch is fully added.
>
> [...]
>
> I tested it with a multi-level task group hierarchy and it does the
> right thing for this testcase (task t runs alternately on Cpu0 and Cpu1
> in tg w/ tg_css_id=6) in a multi-level task group hierarchy (tg_css_id=2,4,6).

Thanks for testing

>
> I wonder if this patch is related to the comment "TODO: fix up out-of-order
> children on enqueue." in update_shares_cpu() of commit 82958366cfea
> ("sched: Replace update_shares weight distribution with per-entity
> computation")?

I didn't have that comment in mind when I did this patch; the intent was
only to ensure that the propagation of load and utilization into the
children is done before their parents are examined.
That being said, I agree that the comment probably points to the same
ordering issue as the one this patch solves.

>
> I guess in the meantime we lost the functionality to remove a cfs_rq from the
> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
> away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
> owned by Cpu1 stay in the leaf_cfs_rq list.
>
> Shouldn't we reintegrate it? Following patch goes on top of this patch:

I see one problem: once a cfs_rq has been removed, it will not be
periodically updated anymore until the next enqueue. A sleeping task
is only attached, not enqueued, when it moves into a cfs_rq, so its
load is added to the cfs_rq's load, which has to be periodically
updated/decayed. So adding a cfs_rq to the list only during an enqueue
is not enough in this case.

Then, only the highest child level of task group will be removed,
because cfs_rq->nr_running will be > 0 as soon as a child task group
is created and enqueued into a task group. Or you should use
cfs.h_nr_running instead; I don't know all the implications.

Regards,
Vincent

>
> -- >8 --
>
> From: Dietmar Eggemann 
> Date: Tue, 20 Sep 2016 17:30:09 +0100
> Subject: [PATCH] Re-integrate list_del_leaf_cfs_rq() into
>  update_blocked_averages()
>
> There is no functionality anymore to delete a cfs_rq from the leaf_cfs_rq
> list in case there are no se's enqueued on it.
>
> The functionality had been initially introduced by commit 82958366cfea
> ("sched: Replace update_shares weight distribution with per-entity
> computation") but has been taken out by commit 9d89c257dfb9 ("sched/fair:
> Rewrite runnable load and utilization average tracking").
>
> Signed-off-by: Dietmar Eggemann 
> ---
>  kernel/sched/fair.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd530359ef84..951c83337e2b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6584,8 +6584,12 @@ static void update_blocked_averages(int cpu)
> update_tg_load_avg(cfs_rq, 0);
>
> /* Propagate pending load changes to the parent */
> -   if (cfs_rq->tg->se[cpu])
> +   if (cfs_rq->tg->se[cpu]) {
> update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
> +
> +   if (!cfs_rq->nr_running)
> +   list_del_leaf_cfs_rq(cfs_rq);
> +   }

Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list

2016-09-21 Thread Dietmar Eggemann
Hi Vincent,

On 12/09/16 08:47, Vincent Guittot wrote:
> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that
> a child will always be called before its parent.
> 
> The hierarchical order in shares update list has been introduced by
> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update 
> list")
> 
> With the current implementation a child can still be put after its parent.
> 
> Let's take the example of
>         root
>           \
>            b
>           / \
>          c   d*
>              |
>              e*
> 
> with root -> b -> c already enqueued but not d -> e so the leaf_cfs_rq_list
> looks like: head -> c -> b -> root -> tail
> 
> The branch d -> e will be added the first time that they are enqueued,
> starting with e then d.
> 
> When e is added, its parent is not already on the list, so e is put at the
> tail : head -> c -> b -> root -> e -> tail
> 
> Then, d is added at the head because its parent is already on the list:
> head -> d -> c -> b -> root -> e -> tail
> 
> e is not placed at the right position and will be called last whereas
> it should be called at the beginning.
> 
> Because it follows the bottom-up enqueue sequence, we are sure that we
> will finish by adding either a cfs_rq without a parent or a cfs_rq with a parent
> that is already on the list. We can use this event to detect when we have
> finished adding a new branch. For the others, whose parents are not already
> added, we have to ensure that they will be added after their children that
> have just been inserted in the steps before, and after any potential parents that
> are already in the list. The easiest way is to put the cfs_rq just after the
> last inserted one and to keep track of it until the branch is fully added.

[...]

I tested it with a multi-level task group hierarchy and it does the
right thing for this testcase (task t runs alternately on Cpu0 and Cpu1
in tg w/ tg_css_id=6) in a multi-level task group hierarchy (tg_css_id=2,4,6).

I wonder if this patch is related to the comment "TODO: fix up out-of-order
children on enqueue." in update_shares_cpu() of commit 82958366cfea
("sched: Replace update_shares weight distribution with per-entity
computation")?

I guess in the meantime we lost the functionality to remove a cfs_rq from the
leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t migrates
away from Cpu1, all the cfs_rq's of the task hierarchy (for tg_css_id=2,4,6)
owned by Cpu1 stay in the leaf_cfs_rq list.

Shouldn't we reintegrate it? Following patch goes on top of this patch:

-- >8 --

From: Dietmar Eggemann 
Date: Tue, 20 Sep 2016 17:30:09 +0100
Subject: [PATCH] Re-integrate list_del_leaf_cfs_rq() into
 update_blocked_averages()

There is no functionality anymore to delete a cfs_rq from the leaf_cfs_rq
list in case there are no se's enqueued on it. 

The functionality had been initially introduced by commit 82958366cfea
("sched: Replace update_shares weight distribution with per-entity
computation") but has been taken out by commit 9d89c257dfb9 ("sched/fair:
Rewrite runnable load and utilization average tracking").

Signed-off-by: Dietmar Eggemann 
---
 kernel/sched/fair.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd530359ef84..951c83337e2b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6584,8 +6584,12 @@ static void update_blocked_averages(int cpu)
update_tg_load_avg(cfs_rq, 0); 
 
/* Propagate pending load changes to the parent */
-   if (cfs_rq->tg->se[cpu])
+   if (cfs_rq->tg->se[cpu]) {
update_load_avg(cfs_rq->tg->se[cpu], 0, 0); 
+
+   if (!cfs_rq->nr_running)
+   list_del_leaf_cfs_rq(cfs_rq);
+   }
}
raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
-- 
1.9.1

