Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
On 21 September 2016 at 19:25, Dietmar Eggemann wrote:
> On 21/09/16 13:34, Vincent Guittot wrote:
>> Hi Dietmar,
>>
>> On 21 September 2016 at 12:14, Dietmar Eggemann wrote:
>>> Hi Vincent,
>>>
>>> On 12/09/16 08:47, Vincent Guittot wrote:
>
> [...]
>
>>> I guess in the meantime we lost the functionality to remove a cfs_rq from the
>>> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t
>>> migrates away from Cpu1, all the cfs_rq's of the task hierarchy (for
>>> tg_css_id=2,4,6) owned by Cpu1 stay in the leaf_cfs_rq list.
>>>
>>> Shouldn't we reintegrate it? Following patch goes on top of this patch:
>>
>> I see one problem: once a cfs_rq has been removed, it will not be
>> periodically updated anymore until the next enqueue. A sleeping task
>> is only attached but not enqueued when it moves into a cfs_rq, so its
>> load is added to the cfs_rq's load, which has to be periodically
>> updated/decayed. So adding a cfs_rq to the list only during an enqueue
>> is not enough in this case.
>
> There was more in the original patch (commit 82958366cfea): the removal of the
> cfs_rq from the list was only done in case the se->avg.runnable_avg_sum had
> decayed to 0. Couldn't we use something similar with se->avg.load_avg instead
> to make sure that these blocked contributions have been decayed before we do
> the removal?

Yes, we can use se->avg.load_avg, but that's not enough. Once se->avg.load_avg
becomes zero and the group cfs_rq->nr_running == 0, we remove the cfs_rq from
the list, and this cfs_rq will not be updated until a new sched_entity is
enqueued. But when you move a sleeping task between 2 task groups, the load of
the task is attached to the destination cfs_rq but not enqueued, so
cfs_rq->avg.load_avg is no longer zero, yet it will not be updated by
update_blocked_averages() because the cfs_rq is no longer in the list. So
waiting for the enqueue of the sched_entity to add the cfs_rq back to the list
is not enough; you will have to do that during the attach too.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4ac1718560e2..3595b0623b37 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6566,6 +6566,8 @@ static void update_blocked_averages(int cpu)
>  	 * list_add_leaf_cfs_rq() for details.
>  	 */
>  	for_each_leaf_cfs_rq(rq, cfs_rq) {
> +		struct sched_entity *se = cfs_rq->tg->se[cpu];
> +
>  		/* throttled entities do not contribute to load */
>  		if (throttled_hierarchy(cfs_rq))
>  			continue;
> @@ -6574,8 +6576,12 @@ static void update_blocked_averages(int cpu)
>  			update_tg_load_avg(cfs_rq, 0);
>  
>  		/* Propagate pending load changes to the parent */
> -		if (cfs_rq->tg->se[cpu])
> -			update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
> +		if (se) {
> +			update_load_avg(se, 0, 0);
> +
> +			if (!se->avg.load_avg && !cfs_rq->nr_running)
> +				list_del_leaf_cfs_rq(cfs_rq);
> +		}
>  	}
>  	raw_spin_unlock_irqrestore(&rq->lock, flags);
>  }

>> Then, only the highest child level of task group will be removed
>> because cfs_rq->nr_running will be > 0 as soon as a child task group
>> is created and enqueued into a task group. Or you should use
>> cfs.h_nr_running instead, I don't know all the implications.
>
> In my tests all cfs_rq's (tg_id=6,4,2) on the source CPU (CPU1) get removed.
> Do I miss something?

No, you don't miss anything. I replied a bit too quickly on that point: the
sched_entity of a task group is dequeued when its group cfs_rq becomes idle,
so the parent cfs_rq->nr_running can be zero.

> migration/1-16 [001] 67.235292: sched_migrate_task: comm=task0 pid=2210 prio=120 orig_cpu=1 dest_cpu=2
> migration/1-16 [001] 67.235295: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x80097550d700 tg_id=6 on_list=0
> migration/1-16 [001] 67.235298: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x800975903700 tg_id=4 on_list=0
> migration/1-16 [001] 67.235300: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x800974e0fe00 tg_id=2 on_list=0
> migration/1-16 [001] 67.235302: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x80097fecb6e0 tg_id=1 on_list=1
> migration/1-16 [001] 67.235309: bprint: update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0x80097550d800 tg_id=6 on_list=1
> -> migration/1-16 [001] 67.235312: bprint: update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0x80097550d800 tg_id=6 on_list=1
> migration/1-16 [001] 67.235314: bprint: update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0x800975903600 tg_id=4 on_list=1
Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
On 21/09/16 13:34, Vincent Guittot wrote:
> Hi Dietmar,
>
> On 21 September 2016 at 12:14, Dietmar Eggemann wrote:
>> Hi Vincent,
>>
>> On 12/09/16 08:47, Vincent Guittot wrote:

[...]

>> I guess in the meantime we lost the functionality to remove a cfs_rq from the
>> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t
>> migrates away from Cpu1, all the cfs_rq's of the task hierarchy (for
>> tg_css_id=2,4,6) owned by Cpu1 stay in the leaf_cfs_rq list.
>>
>> Shouldn't we reintegrate it? Following patch goes on top of this patch:
>
> I see one problem: once a cfs_rq has been removed, it will not be
> periodically updated anymore until the next enqueue. A sleeping task
> is only attached but not enqueued when it moves into a cfs_rq, so its
> load is added to the cfs_rq's load, which has to be periodically
> updated/decayed. So adding a cfs_rq to the list only during an enqueue
> is not enough in this case.

There was more in the original patch (commit 82958366cfea): the removal of the
cfs_rq from the list was only done in case the se->avg.runnable_avg_sum had
decayed to 0. Couldn't we use something similar with se->avg.load_avg instead
to make sure that these blocked contributions have been decayed before we do
the removal?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ac1718560e2..3595b0623b37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6566,6 +6566,8 @@ static void update_blocked_averages(int cpu)
 	 * list_add_leaf_cfs_rq() for details.
 	 */
 	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct sched_entity *se = cfs_rq->tg->se[cpu];
+
 		/* throttled entities do not contribute to load */
 		if (throttled_hierarchy(cfs_rq))
 			continue;
@@ -6574,8 +6576,12 @@ static void update_blocked_averages(int cpu)
 			update_tg_load_avg(cfs_rq, 0);
 
 		/* Propagate pending load changes to the parent */
-		if (cfs_rq->tg->se[cpu])
-			update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
+		if (se) {
+			update_load_avg(se, 0, 0);
+
+			if (!se->avg.load_avg && !cfs_rq->nr_running)
+				list_del_leaf_cfs_rq(cfs_rq);
+		}
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }

> Then, only the highest child level of task group will be removed
> because cfs_rq->nr_running will be > 0 as soon as a child task group
> is created and enqueued into a task group. Or you should use
> cfs.h_nr_running instead, I don't know all the implications.

In my tests all cfs_rq's (tg_id=6,4,2) on the source CPU (CPU1) get removed.
Do I miss something?

migration/1-16 [001] 67.235292: sched_migrate_task: comm=task0 pid=2210 prio=120 orig_cpu=1 dest_cpu=2
migration/1-16 [001] 67.235295: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x80097550d700 tg_id=6 on_list=0
migration/1-16 [001] 67.235298: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x800975903700 tg_id=4 on_list=0
migration/1-16 [001] 67.235300: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x800974e0fe00 tg_id=2 on_list=0
migration/1-16 [001] 67.235302: bprint: enqueue_task_fair: list_add_leaf_cfs_rq: cpu=2 cfs_rq=0x80097fecb6e0 tg_id=1 on_list=1
migration/1-16 [001] 67.235309: bprint: update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0x80097550d800 tg_id=6 on_list=1
-> migration/1-16 [001] 67.235312: bprint: update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0x80097550d800 tg_id=6 on_list=1
migration/1-16 [001] 67.235314: bprint: update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0x800975903600 tg_id=4 on_list=1
-> migration/1-16 [001] 67.235315: bprint: update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0x800975903600 tg_id=4 on_list=1
migration/1-16 [001] 67.235316: bprint: update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0x800974e0f600 tg_id=2 on_list=1
-> migration/1-16 [001] 67.235318: bprint: update_blocked_averages: list_del_leaf_cfs_rq: cpu=1 cfs_rq=0x800974e0f600 tg_id=2 on_list=1
migration/1-16 [001] 67.235319: bprint: update_blocked_averages: update_blocked_averages: cpu=1 cfs_rq=0x80097feb56e0 tg_id=1 on_list=1

If we don't remove these cfs_rq's, IMHO they stay in the list as long as the
tg exists.

[...]
Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
Hi Dietmar,

On 21 September 2016 at 12:14, Dietmar Eggemann wrote:
> Hi Vincent,
>
> On 12/09/16 08:47, Vincent Guittot wrote:
>> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that
>> a child will always be called before its parent.
>>
>> The hierarchical order in the shares update list was introduced by
>> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update
>> list").
>>
>> With the current implementation a child can still be put after its parent.
>>
>> Let's take the example of
>>
>>        root
>>           \
>>            b
>>           / \
>>          c   d*
>>              |
>>              e*
>>
>> with root -> b -> c already enqueued but not d -> e, so the leaf_cfs_rq_list
>> looks like: head -> c -> b -> root -> tail
>>
>> The branch d -> e will be added the first time they are enqueued,
>> starting with e, then d.
>>
>> When e is added, its parent is not yet on the list, so e is put at the
>> tail: head -> c -> b -> root -> e -> tail
>>
>> Then, d is added at the head because its parent is already on the list:
>> head -> d -> c -> b -> root -> e -> tail
>>
>> e is not placed at the right position and will be called last, whereas
>> it should be called at the beginning.
>>
>> Because it follows the bottom-up enqueue sequence, we are sure that we
>> will finish by adding either a cfs_rq without a parent or a cfs_rq whose
>> parent is already on the list. We can use this event to detect when we
>> have finished adding a new branch. For the others, whose parents are not
>> already added, we have to ensure that they will be added after their
>> children that have just been inserted in the steps before, and after any
>> potential parents that are already in the list. The easiest way is to put
>> the cfs_rq just after the last inserted one and to keep track of it until
>> the branch is fully added.
>
> [...]
>
> I tested it with a multi-level task group hierarchy and it does the
> right thing for this testcase (task t runs alternately on Cpu0 and Cpu1
> in tg w/ tg_css_id=6) in a multi-level task group hierarchy (tg_css_id=2,4,6).

Thanks for testing.

> I wonder if this patch is related to the comment "TODO: fix up out-of-order
> children on enqueue." in update_shares_cpu() of commit 82958366cfea
> ("sched: Replace update_shares weight distribution with per-entity
> computation")?

I didn't have that comment in mind when I wrote this patch; I only wanted to
ensure that the propagation of load and utilization in children is done
before their parents are tested. That being said, I agree that the comment
probably points to the same ordering issue as the one this patch solves.

> I guess in the meantime we lost the functionality to remove a cfs_rq from the
> leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t
> migrates away from Cpu1, all the cfs_rq's of the task hierarchy (for
> tg_css_id=2,4,6) owned by Cpu1 stay in the leaf_cfs_rq list.
>
> Shouldn't we reintegrate it? Following patch goes on top of this patch:

I see one problem: once a cfs_rq has been removed, it will not be periodically
updated anymore until the next enqueue. A sleeping task is only attached but
not enqueued when it moves into a cfs_rq, so its load is added to the cfs_rq's
load, which has to be periodically updated/decayed. So adding a cfs_rq to the
list only during an enqueue is not enough in this case.

Then, only the highest child level of task group will be removed because
cfs_rq->nr_running will be > 0 as soon as a child task group is created and
enqueued into a task group. Or you should use cfs.h_nr_running instead; I
don't know all the implications.

Regards,
Vincent

> -- >8 --
>
> From: Dietmar Eggemann
> Date: Tue, 20 Sep 2016 17:30:09 +0100
> Subject: [PATCH] Re-integrate list_del_leaf_cfs_rq() into
>  update_blocked_averages()
>
> There is no functionality anymore to delete a cfs_rq from the leaf_cfs_rq
> list in case there are no se's enqueued on it.
>
> The functionality had been initially introduced by commit 82958366cfea
> ("sched: Replace update_shares weight distribution with per-entity
> computation") but has been taken out by commit 9d89c257dfb9 ("sched/fair:
> Rewrite runnable load and utilization average tracking").
>
> Signed-off-by: Dietmar Eggemann
> ---
>  kernel/sched/fair.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd530359ef84..951c83337e2b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6584,8 +6584,12 @@ static void update_blocked_averages(int cpu)
>  			update_tg_load_avg(cfs_rq, 0);
>  
>  		/* Propagate pending load changes to the parent */
> -		if (cfs_rq->tg->se[cpu])
> +		if (cfs_rq->tg->se[cpu]) {
>  			update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
> +
> +			if (!cfs_rq->nr_running)
> +				list_del_leaf_cfs_rq(cfs_rq);
> +		}
Re: [PATCH 2/7 v3] sched: fix hierarchical order in rq->leaf_cfs_rq_list
Hi Vincent,

On 12/09/16 08:47, Vincent Guittot wrote:
> Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that
> a child will always be called before its parent.
>
> The hierarchical order in the shares update list was introduced by
> commit 67e86250f8ea ("sched: Introduce hierarchal order on shares update
> list").
>
> With the current implementation a child can still be put after its parent.
>
> Let's take the example of
>
>        root
>           \
>            b
>           / \
>          c   d*
>              |
>              e*
>
> with root -> b -> c already enqueued but not d -> e, so the leaf_cfs_rq_list
> looks like: head -> c -> b -> root -> tail
>
> The branch d -> e will be added the first time they are enqueued,
> starting with e, then d.
>
> When e is added, its parent is not yet on the list, so e is put at the
> tail: head -> c -> b -> root -> e -> tail
>
> Then, d is added at the head because its parent is already on the list:
> head -> d -> c -> b -> root -> e -> tail
>
> e is not placed at the right position and will be called last, whereas
> it should be called at the beginning.
>
> Because it follows the bottom-up enqueue sequence, we are sure that we
> will finish by adding either a cfs_rq without a parent or a cfs_rq whose
> parent is already on the list. We can use this event to detect when we
> have finished adding a new branch. For the others, whose parents are not
> already added, we have to ensure that they will be added after their
> children that have just been inserted in the steps before, and after any
> potential parents that are already in the list. The easiest way is to put
> the cfs_rq just after the last inserted one and to keep track of it until
> the branch is fully added.

[...]

I tested it with a multi-level task group hierarchy and it does the
right thing for this testcase (task t runs alternately on Cpu0 and Cpu1
in tg w/ tg_css_id=6) in a multi-level task group hierarchy (tg_css_id=2,4,6).

I wonder if this patch is related to the comment "TODO: fix up out-of-order
children on enqueue." in update_shares_cpu() of commit 82958366cfea
("sched: Replace update_shares weight distribution with per-entity
computation")?

I guess in the meantime we lost the functionality to remove a cfs_rq from the
leaf_cfs_rq list once there are no se's enqueued on it anymore. If e.g. t
migrates away from Cpu1, all the cfs_rq's of the task hierarchy (for
tg_css_id=2,4,6) owned by Cpu1 stay in the leaf_cfs_rq list.

Shouldn't we reintegrate it? Following patch goes on top of this patch:

-- >8 --

From: Dietmar Eggemann
Date: Tue, 20 Sep 2016 17:30:09 +0100
Subject: [PATCH] Re-integrate list_del_leaf_cfs_rq() into
 update_blocked_averages()

There is no functionality anymore to delete a cfs_rq from the leaf_cfs_rq
list in case there are no se's enqueued on it.

The functionality had been initially introduced by commit 82958366cfea
("sched: Replace update_shares weight distribution with per-entity
computation") but has been taken out by commit 9d89c257dfb9 ("sched/fair:
Rewrite runnable load and utilization average tracking").

Signed-off-by: Dietmar Eggemann
---
 kernel/sched/fair.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd530359ef84..951c83337e2b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6584,8 +6584,12 @@ static void update_blocked_averages(int cpu)
 			update_tg_load_avg(cfs_rq, 0);
 
 		/* Propagate pending load changes to the parent */
-		if (cfs_rq->tg->se[cpu])
+		if (cfs_rq->tg->se[cpu]) {
 			update_load_avg(cfs_rq->tg->se[cpu], 0, 0);
+
+			if (!cfs_rq->nr_running)
+				list_del_leaf_cfs_rq(cfs_rq);
+		}
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
-- 
1.9.1