Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-26 Thread Yuyang Du
Hi Peter,

What about this patch?

Adding:
Tested-by: Stefan Ekenberg 

Thanks,
Yuyang

On Mon, Jul 06, 2015 at 06:11:51AM +0800, Yuyang Du wrote:
> On Fri, Jul 03, 2015 at 06:39:28PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 40a7fcb..f7cc1ef 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
> > >   return 0;
> > >  
> > >   while (!list_empty(tasks)) {
> > > +
> > > + if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
> > 
> > Should we make that ->idle != CPU_NOT_IDLE ?
> 
> I think including CPU_IDLE is good.
> 
> --
> Subject: [PATCH] sched: Avoid pulling all tasks in idle balancing
> 
> In idle balancing where a CPU going idle pulls tasks from another CPU,
> a livelock may happen if the CPU pulls all tasks from another, makes
> it idle, and this iterates. So just avoid this.
> 
> Reported-by: Rabin Vincent 
> Signed-off-by: Yuyang Du 
> ---
>  kernel/sched/fair.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 40a7fcb..769d591 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5898,6 +5898,13 @@ static int detach_tasks(struct lb_env *env)
>   return 0;
>  
>   while (!list_empty(tasks)) {
> + /*
> +  * We don't want to steal all, otherwise we may be treated likewise,
> +  * which could at worst lead to a livelock crash.
> +  */
> + if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> + break;
> +
>   p = list_first_entry(tasks, struct task_struct, se.group_node);
>  
>   env->loop++;


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-13 Thread Dietmar Eggemann
On 07/07/15 12:17, Rabin Vincent wrote:
> On Mon, Jul 06, 2015 at 07:36:56PM +0200, Dietmar Eggemann wrote:
>> Rabin, could you share the content of your
>> /sys/fs/cgroup/cpu/system.slice directory and of /proc/cgroups ?
> 
> Here's /proc/cgroups,
> 
> # cat /proc/cgroups 
> #subsys_name  hierarchy   num_cgroups enabled
> cpu   2   98  1
> cpuacct   2   98  1
> 
> and the contents of /sys/fs/cgroup/cpu/system.slice are available here:
> https://drive.google.com/file/d/0B4tMLbMvJ-l6ZVBvZ09QOE15MU0/view
> 
> /Rabin
> 

So why not maintain a runnable signal for the task group se's?
At least to figure out if the 118 is coming from blocked load.

-- >8 --
Subject: [PATCH] sched: Maintain a runnable version of tg->load_avg and
 cfs_rq->tg_load_contrib

Including blocked load in the load average contribution of sched
entities (se->avg.load_avg_contrib) representing task groups can
lead to scenarios where the imbalance is greater than
sum(task_h_load(p)) for all tasks p on the src rq.

To avoid this use cfs_rq->runnable_tg_load_contrib and
tg->runnable_load_avg to calculate se->avg.load_avg_contrib for
sched entities representing task groups.

Both runnable based values are updated in cadence with the
existing values.

The existing tg->load_avg and cfs_rq->tg_load_contrib are still
used to calculate task group weight.

Signed-off-by: Dietmar Eggemann 
---
 kernel/sched/fair.c  | 11 ---
 kernel/sched/sched.h |  2 ++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 587a2f67ceb1..f2cfbaaf5700 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2647,7 +2647,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 int force_update)
 {
struct task_group *tg = cfs_rq->tg;
-   long tg_contrib;
+   long tg_contrib, runnable_tg_contrib;
 
tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
tg_contrib -= cfs_rq->tg_load_contrib;
@@ -2655,9 +2655,14 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
if (!tg_contrib)
return;
 
+   runnable_tg_contrib = cfs_rq->runnable_load_avg;
+   runnable_tg_contrib -= cfs_rq->runnable_tg_load_contrib;
+
if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
                atomic_long_add(tg_contrib, &tg->load_avg);
cfs_rq->tg_load_contrib += tg_contrib;
+               atomic_long_add(runnable_tg_contrib, &tg->runnable_load_avg);
+   cfs_rq->runnable_tg_load_contrib += runnable_tg_contrib;
}
 }
 
@@ -2690,9 +2695,9 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
u64 contrib;
 
-   contrib = cfs_rq->tg_load_contrib * tg->shares;
+   contrib = cfs_rq->runnable_tg_load_contrib * tg->shares;
se->avg.load_avg_contrib = div_u64(contrib,
-                                       atomic_long_read(&tg->load_avg) + 1);
+                                       atomic_long_read(&tg->runnable_load_avg) + 1);
 
/*
 * For group entities we need to compute a correction term in the case
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 84d48790bb6d..eed74e5efe91 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -245,6 +245,7 @@ struct task_group {
 
 #ifdef CONFIG_SMP
atomic_long_t load_avg;
+   atomic_long_t runnable_load_avg;
atomic_t runnable_avg;
 #endif
 #endif
@@ -386,6 +387,7 @@ struct cfs_rq {
/* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
unsigned long tg_load_contrib;
+   unsigned long runnable_tg_load_contrib;
 
/*
 *   h_load = weight * f(tg)
-- 
1.9.1
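
As a rough worked example of the intended effect, plugging in the numbers from
the reported scenario (tg->shares assumed to be 1024 and the group's runnable
contribution assumed to be ~0, since its only runnable task contributes 0):

  before: contrib = tg_load_contrib * shares / (tg->load_avg + 1)
                  ~= 11.5% * 1024 ~= 118          (runnable + blocked)

  after:  contrib = runnable_tg_load_contrib * shares / (tg->runnable_load_avg + 1)
                  ~= 0                            (runnable only)

so the group se's load_avg_contrib, and therefore the imbalance derived from
it, tracks what is actually runnable on the src rq.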



Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-10 Thread Yuyang Du
Hi,

On Thu, Jul 09, 2015 at 03:32:20PM +0100, Morten Rasmussen wrote:
> On Mon, Jul 06, 2015 at 06:31:44AM +0800, Yuyang Du wrote:
> > On Fri, Jul 03, 2015 at 06:38:31PM +0200, Peter Zijlstra wrote:
> > > > I'm not against having a policy that sits somewhere in between, we just
> > > > have to agree it is the right policy and clean up the load-balance code
> > > > such that the implemented policy is clear.
> > > 
> > > Right, for balancing its a tricky question, but mixing them without
> > > intent is, as you say, a bit of a mess.
> > > 
> > > So clearly blocked load doesn't make sense for (new)idle balancing. OTOH
> > > it does make some sense for the regular periodic balancing, because
> > > there we really do care mostly about the averages, esp. so when we're
> > > overloaded -- but there are issues there too.
> > > 
> > > Now we can't track them both (or rather we could, but overhead).
> > > 
> > > I like Yuyang's load tracking rewrite, but it changes exactly this part,
> > > and I'm not sure I understand the full ramifications of that yet.
> 
> I don't think anybody does ;-) But I think we should try to make it
> work.
> 
> > Thanks. It would be a pure average policy, which is non-perfect like now,
> > and certainly needs a mixing like now, but it is worth a starter, because
> it is simple and reasonable, and based on it, the other parts can be simple
> > and reasonable.
> 
> I think we all agree on the benefits of taking blocked load into
> account but also that there are some policy questions to be addressed.
> 
> > > One way out would be to split the load balancer into 3 distinct regions;
> > > 
> > >  1) get a task on every CPU, screw everything else.
> > >  2) get each CPU fully utilized, still ignoring 'load'
> > >  3) when everybody is fully utilized, consider load.
> 
> Seems very reasonable to me. We more or less follow that idea in the
> energy-model driven scheduling patches, at least 2) and 3).
> 
> The difficult bit is detecting when to transition between 2) and 3). If
> you want to enforce smp_nice you have to start worrying about task
> priority as soon as one cpu is fully utilized.
> 
> For example, a fully utilized cpu has two high priority tasks while all
> other cpus are running low priority tasks and are not fully utilized.
> The utilization imbalance may be too small to cause any tasks to be
> migrated, so we end up giving fewer cycles to the high priority tasks.
> 
> > > If we make find_busiest_foo() select one of these 3, and make
> > > calculate_imbalance() invariant to the metric passed in, and have things
> > > like cpu_load() and task_load() return different, but coherent, numbers
> > > depending on which region we're in, this almost sounds 'simple'.
> > > 
> > > The devil is in the details, and the balancer is a hairy nest of details
> > > which will make the above non-trivial.
> 
> Yes, but if we have an overall policy like the one you propose we can at
> least make it complicated and claim that we think we know what it is
> supposed to do ;-)
> 
> I agree that there is some work to be done in find_busiest_*() and
> calculate_imbalance() + friends. Maybe step one should be to clean them
> up a bit.

The consensus looks to be that we move step-by-step and start working right now:

1) Based on the "Rewrite" patch, let me add cfs_rq->runnable_load_avg. Then
   we will have up-to-date everything: load.weight, runnable_load_avg, and
   load_avg (including runnable + blocked), from pure now to pure average.
   The runnable_load_avg will be used the same as now, so we will not have
   any new ramifications. As long as the code is cleaner and simpler, it is
   a win.

2) Let's clean up the load balancing code a bit and, if needed, make
   changes to the obvious things; otherwise leave it unchanged.

3) Polish/complicate the policies, :)

What do you think?

Thanks,
Yuyang


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-10 Thread Yuyang Du
On Thu, Jul 09, 2015 at 02:53:14PM +0100, Morten Rasmussen wrote:
> On Mon, Jul 06, 2015 at 04:12:41AM +0800, Yuyang Du wrote:
> > Hi Morten,
> > 
> > On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
> > > > > IOW, since task groups include blocked load in the load_avg_contrib 
> > > > > (see
> > > > > __update_group_entity_contrib() and 
> > > > > __update_cfs_rq_tg_load_contrib()) the
> > > > > imbalance includes blocked load and hence env->imbalance >=
> > > > > sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> > > > > detach_tasks() emptying the rq completely in the reported scenario 
> > > > > where
> > > > > blocked load > runnable load.
> > > > 
> > > > Whenever I want to know the load avg concerning task group, I need to
> > > > walk through the complete codes again, I prefer not to do it this time.
> > > > But it should not be that simple to say "the 118 comes from the blocked load".
> > > 
> > > But the whole hierarchy of group entities is updated each time we enqueue
> > > or dequeue a task. I don't see how the group entity load_avg_contrib is
> > > not up to date? Why do you need to update it again?
> > > 
> > > In any case, we have one task in the group hierarchy which has a
> > > load_avg_contrib of 0 and the grand-grand parent group entity has a
> > > load_avg_contrib of 118 and no additional tasks. That load contribution
> > > must be from tasks which are no longer around on the rq? No?
> > 
> > load_avg_contrib has WEIGHT inside, so the most I can say is:
> > SE: 8f456e00's load_avg_contrib 118 = (its cfs_rq's runnable + blocked) / 
> > (tg->load_avg + 1) * tg->shares
> > 
> > The tg->shares is probably 1024 (at least 911). So we are just left with:
> > 
> > cfs_rq / tg = 11.5%
> 
> Yes, we also know that there is only one runnable task in the task group
> hierarchy and its contribution is 0. Hence the rest must be from
> non-runnable tasks belonging to some child group.

Agreed.
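
For completeness, spelling the quoted formula out with numbers (tg->shares
assumed to be 1024, as said above):

  118 = (cfs_rq runnable + blocked) / (tg->load_avg + 1) * 1024
  =>  (cfs_rq runnable + blocked) / (tg->load_avg + 1) = 118 / 1024 ~= 11.5%

i.e. this cfs_rq accounts for roughly 11.5% of the whole task group's load,
and since its only runnable task contributes 0, essentially all of it must be
blocked load from somewhere in the child hierarchy.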


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-09 Thread Morten Rasmussen
On Mon, Jul 06, 2015 at 06:31:44AM +0800, Yuyang Du wrote:
> On Fri, Jul 03, 2015 at 06:38:31PM +0200, Peter Zijlstra wrote:
> > > I'm not against having a policy that sits somewhere in between, we just
> > > have to agree it is the right policy and clean up the load-balance code
> > > such that the implemented policy is clear.
> > 
> > Right, for balancing its a tricky question, but mixing them without
> > intent is, as you say, a bit of a mess.
> > 
> > So clearly blocked load doesn't make sense for (new)idle balancing. OTOH
> > it does make some sense for the regular periodic balancing, because
> > there we really do care mostly about the averages, esp. so when we're
> > overloaded -- but there are issues there too.
> > 
> > Now we can't track them both (or rather we could, but overhead).
> > 
> > I like Yuyang's load tracking rewrite, but it changes exactly this part,
> > and I'm not sure I understand the full ramifications of that yet.

I don't think anybody does ;-) But I think we should try to make it
work.

> Thanks. It would be a pure average policy, which is non-perfect like now,
> and certainly needs a mixing like now, but it is worth a starter, because
> it is simple and reasonable, and based on it, the other parts can be simple
> and reasonable.

I think we all agree on the benefits of taking blocked load into
account but also that there are some policy questions to be addressed.

> > One way out would be to split the load balancer into 3 distinct regions;
> > 
> >  1) get a task on every CPU, screw everything else.
> >  2) get each CPU fully utilized, still ignoring 'load'
> >  3) when everybody is fully utilized, consider load.

Seems very reasonable to me. We more or less follow that idea in the
energy-model driven scheduling patches, at least 2) and 3).

The difficult bit is detecting when to transition between 2) and 3). If
you want to enforce smp_nice you have to start worrying about task
priority as soon as one cpu is fully utilized.

For example, a fully utilized cpu has two high priority tasks while all
other cpus are running low priority tasks and are not fully utilized.
The utilization imbalance may be too small to cause any tasks to be
migrated, so we end up giving fewer cycles to the high priority tasks.

> > If we make find_busiest_foo() select one of these 3, and make
> > calculate_imbalance() invariant to the metric passed in, and have things
> > like cpu_load() and task_load() return different, but coherent, numbers
> > depending on which region we're in, this almost sounds 'simple'.
> > 
> > The devil is in the details, and the balancer is a hairy nest of details
> > which will make the above non-trivial.

Yes, but if we have an overall policy like the one you propose we can at
least make it complicated and claim that we think we know what it is
supposed to do ;-)

I agree that there is some work to be done in find_busiest_*() and
calculate_imbalance() + friends. Maybe step one should be to clean them
up a bit.
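
To make the three regions concrete, here is a very rough sketch of the metric
selection (all names below are made up for illustration, this is not actual
kernel code):

        enum lb_region {
                LB_SPREAD_TASKS,        /* 1) get a task on every CPU */
                LB_FILL_CAPACITY,       /* 2) get each CPU fully utilized */
                LB_BALANCE_LOAD,        /* 3) everybody utilized, consider load */
        };

        static enum lb_region pick_lb_region(int some_cpu_is_idle,
                                             int some_cpu_has_spare_capacity)
        {
                if (some_cpu_is_idle)
                        return LB_SPREAD_TASKS;         /* 'balance' on nr_running */
                if (some_cpu_has_spare_capacity)
                        return LB_FILL_CAPACITY;        /* 'balance' on runnable_avg */
                return LB_BALANCE_LOAD;                 /* 'balance' on load_avg (incl. blocked) */
        }

find_busiest_*() would pick one of these, and calculate_imbalance() would then
have to be invariant to whichever metric is passed in.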


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-09 Thread Morten Rasmussen
On Mon, Jul 06, 2015 at 04:12:41AM +0800, Yuyang Du wrote:
> Hi Morten,
> 
> On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
> > > > IOW, since task groups include blocked load in the load_avg_contrib (see
> > > > __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) 
> > > > the
> > > > imbalance includes blocked load and hence env->imbalance >=
> > > > sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> > > > detach_tasks() emptying the rq completely in the reported scenario where
> > > > blocked load > runnable load.
> > > 
> > > Whenever I want to know the load avg concerning task group, I need to
> > > walk through the complete codes again, I prefer not to do it this time.
> > > But it should not be that simple to say "the 118 comes from the blocked load".
> > 
> > But the whole hierarchy of group entities is updated each time we enqueue
> > or dequeue a task. I don't see how the group entity load_avg_contrib is
> > not up to date? Why do you need to update it again?
> > 
> > In any case, we have one task in the group hierarchy which has a
> > load_avg_contrib of 0 and the grand-grand parent group entity has a
> > load_avg_contrib of 118 and no additional tasks. That load contribution
> > must be from tasks which are no longer around on the rq? No?
> 
> load_avg_contrib has WEIGHT inside, so the most I can say is:
> SE: 8f456e00's load_avg_contrib 118 = (its cfs_rq's runnable + blocked) / 
> (tg->load_avg + 1) * tg->shares
> 
> The tg->shares is probably 1024 (at least 911). So we are just left with:
> 
> cfs_rq / tg = 11.5%

Yes, we also know that there is only one runnable task in the task group
hierarchy and its contribution is 0. Hence the rest must be from
non-runnable tasks belonging to some child group.

> I myself did question the sudden jump from 0 to 118 (see a previous reply).
> 
> But anyway, this really is irrelevant to the discussion.

Agreed.

>  
> > > Anyway, with blocked load, yes, we definitely can't move (or even find) some
> > > amount of the imbalance if we only look at the tasks on the queue. But this
> > > may or may not be a problem.
> > > 
> > > Firstly, the question comes to whether we want blocked load anywhere.
> > > This is just about a "now vs. average" question.
> > 
> > That is what I meant in the paragraph below. It is a scheduling policy
> > question.
> > 
> > > Secondly, if we stick to average, we just need to treat the blocked load
> > > consistently, not that group SE has it, but task SE does not, or somewhere
> > > has it, others not.
> > 
> > I agree that inconsistent use of blocked load will lead us into trouble.
> > The problem is that none of the load-balance logic was designed for
> > blocked load. It was written to deal with load that is currently on the rq
> > and slightly biased by average cpu load, not dealing with load
> > contribution of tasks which we can't migrate at the moment because they
> > are blocked. The load-balance code has to be updated to deal with
> > blocked load. We will run into all sorts of issues if we don't and roll
> > out use of blocked load everywhere.
> > 
> > However, before we can rework the load-balance code, we have to agree on
> > the now vs average balance policy.
> > 
> > Your proposed patch implements a policy somewhere in between. We try to
> > balance based on average, but we don't allow idle_balance() to empty a
> > cpu completely. A pure average balance policy would allow a cpu to go
> > idle even if we could have tasks waiting on other rqs, if the blocked
> > load indicates that other tasks will show up shortly on the cpu. A pure
> > "now" balance would balance based on runnable_load_avg for all entities
> > including groups, ignoring all blocked load, but that goes against the
> > PELT group balancing design.
> > 
> > I'm not against having a policy that sits somewhere in between, we just
> > have to agree it is the right policy and clean up the load-balance code
> > such that the implemented policy is clear.
>  
> The proposed patch sits in between? I agree, but would rather see it from
> another perspective.
> 
> First, I don't think it merits a solution/policy. It is just a cheap
> "last guard" to protect the "king" - no crash.

It's fine for me to have checks that make sure we don't crash if we hit
some corner case in the load-balance code. I was more interested in
figuring out why we ended up in this situation and how we can implement
a more consistent policy.

> Second, a "pure average" policy is pretty fine in general, but it does not
> mean we would simply allow a CPU to be pulled empty, that is because we are
> making a bet from a prediction (average) here. By prediction, it basically
> means sometimes it fails. As the failure could lead to a disaster, without
> blaming the prediction, it is reasonable that we make a sort of "damage control".

I like the idea of balancing based on an average load/utilization that
includes blocked load/utilization. It ties in very nicely with driving DVFS,
for example. It is a fundamental change to how we perceive load for
load-balancing decisions though. It basically requires an update to the
load-balancing code.

Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-09 Thread Stefan Ekenberg
Hi,

I tested the patch on a setup with 7 devices, all running the same troublesome 
use-case in parallel (same use-case as we used to produce the crash dumps). 
This use-case was previously able to reproduce the problem about 21 times 
during 24 hours. After including the patch the setup ran perfectly for 48 
hours. So to summarize, patch tested OK.

Tested-by: Stefan Ekenberg 

On Mon, Jul 06, 2015 at 12:11:51AM +0200, Yuyang Du wrote:
> On Fri, Jul 03, 2015 at 06:39:28PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 40a7fcb..f7cc1ef 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
> > > return 0;
> > >
> > > while (!list_empty(tasks)) {
> > > +
> > > +   if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
> >
> > Should we make that ->idle != CPU_NOT_IDLE ?
> 
> I think including CPU_IDLE is good.
> 
> --
> Subject: [PATCH] sched: Avoid pulling all tasks in idle balancing
> 
> In idle balancing where a CPU going idle pulls tasks from another CPU,
> a livelock may happen if the CPU pulls all tasks from another, makes
> it idle, and this iterates. So just avoid this.
> 
> Reported-by: Rabin Vincent 
> Signed-off-by: Yuyang Du 
> ---
>  kernel/sched/fair.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 40a7fcb..769d591 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5898,6 +5898,13 @@ static int detach_tasks(struct lb_env *env)
> return 0;
> 
> while (!list_empty(tasks)) {
> +   /*
> +* We don't want to steal all, otherwise we may be treated likewise,
> +* which could at worst lead to a livelock crash.
> +*/
> +   if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> +   break;
> +
> p = list_first_entry(tasks, struct task_struct, se.group_node);
> 
> env->loop++;




Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-07 Thread Rabin Vincent
On Mon, Jul 06, 2015 at 07:36:56PM +0200, Dietmar Eggemann wrote:
> Rabin, could you share the content of your
> /sys/fs/cgroup/cpu/system.slice directory and of /proc/cgroups ?

Here's /proc/cgroups,

# cat /proc/cgroups 
#subsys_name  hierarchy   num_cgroups enabled
cpu 2   98  1
cpuacct 2   98  1

and the contents of /sys/fs/cgroup/cpu/system.slice are available here:
https://drive.google.com/file/d/0B4tMLbMvJ-l6ZVBvZ09QOE15MU0/view

/Rabin


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-06 Thread Dietmar Eggemann
Hi Yuyang,

On 05/07/15 21:12, Yuyang Du wrote:
> Hi Morten,
> 
> On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
>>>> IOW, since task groups include blocked load in the load_avg_contrib (see
>>>> __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
>>>> imbalance includes blocked load and hence env->imbalance >=
>>>> sum(task_h_load(p)) for all tasks p on the rq. Which leads to
>>>> detach_tasks() emptying the rq completely in the reported scenario where
>>>> blocked load > runnable load.
>>>
>>> Whenever I want to know the load avg concerning task group, I need to
>>> walk through the complete codes again, I prefer not to do it this time.
>>> But it should not be that simple to say "the 118 comes from the blocked load".
>>
>> But the whole hierarchy of group entities is updated each time we enqueue
>> or dequeue a task. I don't see how the group entity load_avg_contrib is
>> not up to date? Why do you need to update it again?
>>
>> In any case, we have one task in the group hierarchy which has a
>> load_avg_contrib of 0 and the grand-grand parent group entity has a
>> load_avg_contrib of 118 and no additional tasks. That load contribution
>> must be from tasks which are no longer around on the rq? No?
> 
> load_avg_contrib has WEIGHT inside, so the most I can say is:
> SE: 8f456e00's load_avg_contrib 118 = (its cfs_rq's runnable + blocked) / 
> (tg->load_avg + 1) * tg->shares
> 
> The tg->shares is probably 1024 (at least 911). So we are just left with:
> 
> cfs_rq / tg = 11.5%
> 
> I myself did question the sudden jump from 0 to 118 (see a previous reply).

Do you mean the jump from system-rngd.slice (0) (tg.css.id=3) to
system.slice (118) (tg.css.id=2)?

Maybe, the 118 might come from another tg hierarchy (w/ tg.css.id >= 3)
inside the system.slice group representing another service.

Rabin, could you share the content of your
/sys/fs/cgroup/cpu/system.slice directory and of /proc/cgroups ?

Whether 118 comes from the cfs_rq->blocked_load_avg of one of the tg
levels of one of the other system.slice tg hierarchies or it results
from not updating the se.avg.load_avg_contrib values of se's
representing tg's immediately is not that important I guess.
Even if we're able to sync both things (task en/dequeue and tg
se.avg.load_avg_contrib update) perfectly (by calling
update_cfs_rq_blocked_load() always w/ force_update=1 and immediately
after that update_entity_load_avg() for all tg se's in one hierarchy), we
would still have to deal w/ the blocked load part if the tg se
representing system.slice contributes to
cpu_rq(cpu)->cfs.runnable_load_avg.
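
A minimal sketch of what that "perfect sync" would look like, purely for
illustration (the helper name is made up, the call would have to live in the
en/dequeue path, and update_cfs_rq_blocked_load()/update_entity_load_avg()
are the existing fair.c functions):

        static void force_sync_tg_contrib(struct sched_entity *se)
        {
                for_each_sched_entity(se) {
                        update_cfs_rq_blocked_load(cfs_rq_of(se), 1);   /* force_update=1 */
                        update_entity_load_avg(se, 1);                  /* refresh se->avg.load_avg_contrib */
                }
        }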

-- Dietmar

> 
> But anyway, this really is irrelevant to the discussion.
>  
[...]



Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-06 Thread Yuyang Du
On Fri, Jul 03, 2015 at 06:38:31PM +0200, Peter Zijlstra wrote:
> > I'm not against having a policy that sits somewhere in between, we just
> > have to agree it is the right policy and clean up the load-balance code
> > such that the implemented policy is clear.
> 
> Right, for balancing its a tricky question, but mixing them without
> intent is, as you say, a bit of a mess.
> 
> So clearly blocked load doesn't make sense for (new)idle balancing. OTOH
> it does make some sense for the regular periodic balancing, because
> there we really do care mostly about the averages, esp. so when we're
> overloaded -- but there are issues there too.
> 
> Now we can't track them both (or rather we could, but overhead).
> 
> I like Yuyang's load tracking rewrite, but it changes exactly this part,
> and I'm not sure I understand the full ramifications of that yet.
 
Thanks. It would be a pure average policy, which is non-perfect like now,
and certainly needs a mixing like now, but it is worth a starter, because
it is simple and reasonable, and based on it, the other parts can be simple
and reasonable.

> One way out would be to split the load balancer into 3 distinct regions;
> 
>  1) get a task on every CPU, screw everything else.
>  2) get each CPU fully utilized, still ignoring 'load'
>  3) when everybody is fully utilized, consider load.
> 
> If we make find_busiest_foo() select one of these 3, and make
> calculate_imbalance() invariant to the metric passed in, and have things
> like cpu_load() and task_load() return different, but coherent, numbers
> depending on which region we're in, this almost sounds 'simple'.
> 
> The devil is in the details, and the balancer is a hairy nest of details
> which will make the above non-trivial.
> 
> But for 1) we could simply 'balance' on nr_running, for 2) we can
> 'balance' on runnable_avg and for 3) we'll 'balance' on load_avg (which
> will then include blocked load).
> 
> Let me go play outside for a bit so that it can sink in what kind of
> nonsense my heat addled brain has just sprouted :-)


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-06 Thread Yuyang Du
On Fri, Jul 03, 2015 at 06:39:28PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 40a7fcb..f7cc1ef 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
> > return 0;
> >  
> > while (!list_empty(tasks)) {
> > +
> > +   if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
> 
> Should we make that ->idle != CPU_NOT_IDLE ?

I think including CPU_IDLE is good.

--
Subject: [PATCH] sched: Avoid pulling all tasks in idle balancing

In idle balancing where a CPU going idle pulls tasks from another CPU,
a livelock may happen if the CPU pulls all tasks from another, makes
it idle, and this iterates. So just avoid this.

Reported-by: Rabin Vincent 
Signed-off-by: Yuyang Du 
---
 kernel/sched/fair.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40a7fcb..769d591 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5898,6 +5898,13 @@ static int detach_tasks(struct lb_env *env)
return 0;
 
while (!list_empty(tasks)) {
+   /*
> +* We don't want to steal all, otherwise we may be treated likewise,
+* which could at worst lead to a livelock crash.
+*/
+   if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+   break;
+
p = list_first_entry(tasks, struct task_struct, se.group_node);
 
env->loop++;
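
To illustrate the scenario the check guards against (a hypothetical two-CPU
trace, not taken from an actual log):

        /*
         * CPU0 goes (newly) idle, idle_balance() -> detach_tasks() pulls
         * every task off CPU1, so CPU1 now has nr_running == 0.
         * CPU1 then goes (newly) idle and pulls every task back off CPU0,
         * and so on: both CPUs keep migrating tasks instead of running them.
         * With the check above, the source rq is always left with at least
         * one task when the puller is (going) idle, so the ping-pong stops.
         */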
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-06 Thread Dietmar Eggemann
Hi Yuyang,

On 05/07/15 21:12, Yuyang Du wrote:
 Hi Morten,
 
 On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
 IOW, since task groups include blocked load in the load_avg_contrib (see
 __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
 imbalance includes blocked load and hence env-imbalance =
 sum(task_h_load(p)) for all tasks p on the rq. Which leads to
 detach_tasks() emptying the rq completely in the reported scenario where
 blocked load  runnable load.

 Whenever I want to know the load avg concerning task group, I need to
 walk through the complete codes again, I prefer not to do it this time.
 But it should not be that simply to say the 118 comes from the blocked 
 load.

 But the whole hierarchy of group entities is updated each time we enqueue
 or dequeue a task. I don't see how the group entity load_avg_contrib is
 not up to date? Why do you need to update it again?

 In any case, we have one task in the group hierarchy which has a
 load_avg_contrib of 0 and the grand-grand parent group entity has a
 load_avg_contrib of 118 and no additional tasks. That load contribution
 must be from tasks which are no longer around on the rq? No?
 
 load_avg_contrib has WEIGHT inside, so the most I can say is:
 SE: 8f456e00's load_avg_contrib 118 = (its cfs_rq's runnable + blocked) / 
 (tg-load_avg + 1) * tg-shares
 
 The tg-shares is probably 1024 (at least 911). So we are just left with:
 
 cfs_rq / tg = 11.5%
 
 I myself did question the sudden jump from 0 to 118 (see a previous reply).

Do you mean the jump from system-rngd.slice (0) (tg.css.id=3) to
system.slice (118) (tg.css.id=2)?

Maybe, the 118 might come from another tg hierarchy (w/ tg.css.id = 3)
inside the system.slice group representing another service.

Rabin, could you share the content of your
/sys/fs/cgroup/cpu/system.slice directory and of /proc/cgroups ?

Whether 118 comes from the cfs_rq->blocked_load_avg of one of the tg
levels of one of the other system.slice tg hierarchies or it results
from not updating the se.avg.load_avg_contrib values of se's
representing tg's immediately is not that important, I guess.
Even if we're able to sync both things (task en/dequeue and tg
se.avg.load_avg_contrib update) perfectly (by calling
update_cfs_rq_blocked_load() always w/ force_update=1 and immediately
after that update_entity_load_avg() for all tg se's in one hierarchy), we
would still have to deal w/ the blocked load part if the tg se
representing system.slice contributes to
cpu_rq(cpu)->cfs.runnable_load_avg.

-- Dietmar

 
> But anyway, this really is irrelevant to the discussion.
> 
[...]

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-06 Thread Yuyang Du
On Fri, Jul 03, 2015 at 06:38:31PM +0200, Peter Zijlstra wrote:
> > I'm not against having a policy that sits somewhere in between, we just
> > have to agree it is the right policy and clean up the load-balance code
> > such that the implemented policy is clear.
> 
> Right, for balancing its a tricky question, but mixing them without
> intent is, as you say, a bit of a mess.
> 
> So clearly blocked load doesn't make sense for (new)idle balancing. OTOH
> it does make some sense for the regular periodic balancing, because
> there we really do care mostly about the averages, esp. so when we're
> overloaded -- but there are issues there too.
> 
> Now we can't track them both (or rather we could, but overhead).
> 
> I like Yuyang's load tracking rewrite, but it changes exactly this part,
> and I'm not sure I understand the full ramifications of that yet.
> 
Thanks. It would be a pure average policy, which is non-perfect like now,
and certainly needs a mixing like now, but it is worth a starter, because
it is simple and reasonable, and based on it, the other parts can be simple
and reasonable.

> One way out would be to split the load balancer into 3 distinct regions;
> 
>  1) get a task on every CPU, screw everything else.
>  2) get each CPU fully utilized, still ignoring 'load'
>  3) when everybody is fully utilized, consider load.
> 
> If we make find_busiest_foo() select one of these 3, and make
> calculate_imbalance() invariant to the metric passed in, and have things
> like cpu_load() and task_load() return different, but coherent, numbers
> depending on which region we're in, this almost sounds 'simple'.
> 
> The devil is in the details, and the balancer is a hairy nest of details
> which will make the above non-trivial.
> 
> But for 1) we could simply 'balance' on nr_running, for 2) we can
> 'balance' on runnable_avg and for 3) we'll 'balance' on load_avg (which
> will then include blocked load).
> 
> Let me go play outside for a bit so that it can sink in what kind of
> nonsense my heat addled brain has just sprouted :-)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-05 Thread Yuyang Du
Hi Morten,

On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
> > > IOW, since task groups include blocked load in the load_avg_contrib (see
> > > __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
> > > imbalance includes blocked load and hence env->imbalance >=
> > > sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> > > detach_tasks() emptying the rq completely in the reported scenario where
> > > blocked load > runnable load.
> > 
> > Whenever I want to know the load avg concerning task group, I need to
> > walk through the complete codes again, I prefer not to do it this time.
> > But it should not be that simply to say "the 118 comes from the blocked 
> > load".
> 
> But the whole hierarchy of group entities is updated each time we enqueue
> or dequeue a task. I don't see how the group entity load_avg_contrib is
> not up to date? Why do you need to update it again?
> 
> In any case, we have one task in the group hierarchy which has a
> load_avg_contrib of 0 and the grand-grand parent group entity has a
> load_avg_contrib of 118 and no additional tasks. That load contribution
> must be from tasks which are no longer around on the rq? No?

load_avg_contrib has WEIGHT inside, so the most I can say is:
SE: 8f456e00's load_avg_contrib 118 = (its cfs_rq's runnable + blocked) / 
(tg->load_avg + 1) * tg->shares

The tg->shares is probably 1024 (at least 911). So we are just left with:

cfs_rq / tg = 11.5%
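
(As a quick standalone sanity check of that ratio, assuming tg->shares is indeed 1024; this is only arithmetic, not kernel code:)

#include <stdio.h>

int main(void)
{
    /* group SE contribution from the dump, and the assumed tg->shares */
    double contrib = 118.0, shares = 1024.0;

    /* contrib = cfs_rq(runnable + blocked) / (tg->load_avg + 1) * tg->shares,
     * so contrib / shares approximates the cfs_rq / tg ratio */
    printf("cfs_rq / tg ~= %.1f%%\n", 100.0 * contrib / shares);  /* ~11.5% */
    return 0;
}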

I myself did question the sudden jump from 0 to 118 (see a previous reply).

But anyway, this really is irrelevant to the discussion.
 
> > Anyway, with blocked load, yes, we definitely can't move (or even find) some
> > amount of the imbalance if we only look at the tasks on the queue. But this
> > may or may not be a problem.
> > 
> > Firstly, the question comes to whether we want blocked load anywhere.
> > This is just about a "now vs. average" question.
> 
> That is what I meant in the paragraph below. It is a scheduling policy
> question.
> 
> > Secondly, if we stick to average, we just need to treat the blocked load
> > consistently, not that group SE has it, but task SE does not, or somewhere
> > has it, others not.
> 
> I agree that inconsistent use of blocked load will lead us into trouble.
> The problem is that none of the load-balance logic was designed for
> blocked load. It was written to deal with load that is currently on the rq
> and slightly biased by average cpu load, not dealing with load
> contribution of tasks which we can't migrate at the moment because they
> are blocked. The load-balance code has to be updated to deal with
> blocked load. We will run into all sorts of issues if we don't and roll
> out use of blocked load everywhere.
> 
> However, before we can rework the load-balance code, we have to agree on
> the now vs average balance policy.
> 
> Your proposed patch implements a policy somewhere in between. We try to
> balance based on average, but we don't allow idle_balance() to empty a
> cpu completely. A pure average balance policy would allow a cpu to go
> idle even if we could do have tasks waiting on other rqs if the blocked
> load indicates that other task will show up shortly one the cpu. A pure
> "now" balance would balance based on runnable_load_avg for all entities
> including groups ignoring all blocked load, but that goes against the
> PELT group balancing design.
> 
> I'm not against having a policy that sits somewhere in between, we just
> have to agree it is the right policy and clean up the load-balance code
> such that the implemented policy is clear.
 
The proposed patch sits in between? I agree, but would rather see it from
another perspective.

First, I don't think it merits a solution/policy. It is just a cheap
"last guard" to protect the "king" - no crash.

Second, a "pure average" policy is pretty fine in general, but it does not
mean we would simply allow a CPU to be pulled empty, that is because we are
making a bet from a prediction (average) here. By prediction, it basically
means sometimes it fails. As the failure could lead to a disaster, without
blaming the prediction, it is reasonable we make a sort of "damage control".

Thanks,
Yuyang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-03 Thread Peter Zijlstra
On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 40a7fcb..f7cc1ef 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
>   return 0;
>  
>   while (!list_empty(tasks)) {
> +
> + if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)

Should we make that ->idle != CPU_NOT_IDLE ?

> + break;
> +
>   p = list_first_entry(tasks, struct task_struct, se.group_node);
>  
>   env->loop++;
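
(For context, a paraphrase of the relevant enum and what each check would cover; illustration only, not the kernel header itself:)

/* paraphrase of enum cpu_idle_type from the scheduler code */
enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE, CPU_NEWLY_IDLE };

/*
 * env->idle == CPU_NEWLY_IDLE: only the idle_balance() pull path bails out.
 * env->idle != CPU_NOT_IDLE:   the periodic balance run on an already idle
 *                              CPU (CPU_IDLE) bails out as well.
 */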
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-03 Thread Peter Zijlstra
On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
> > > IOW, since task groups include blocked load in the load_avg_contrib (see
> > > __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
> > > imbalance includes blocked load and hence env->imbalance >=
> > > sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> > > detach_tasks() emptying the rq completely in the reported scenario where
> > > blocked load > runnable load.

So IIRC we need the blocked load for groups for computing the per-cpu
slices of the total weight, avg works really well for that.

> I'm not against having a policy that sits somewhere in between, we just
> have to agree it is the right policy and clean up the load-balance code
> such that the implemented policy is clear.

Right, for balancing its a tricky question, but mixing them without
intent is, as you say, a bit of a mess.

So clearly blocked load doesn't make sense for (new)idle balancing. OTOH
it does make some sense for the regular periodic balancing, because
there we really do care mostly about the averages, esp. so when we're
overloaded -- but there are issues there too.

Now we can't track them both (or rather we could, but overhead).

I like Yuyang's load tracking rewrite, but it changes exactly this part,
and I'm not sure I understand the full ramifications of that yet.

One way out would be to split the load balancer into 3 distinct regions;

 1) get a task on every CPU, screw everything else.
 2) get each CPU fully utilized, still ignoring 'load'
 3) when everybody is fully utilized, consider load.

If we make find_busiest_foo() select one of these 3, and make
calculate_imbalance() invariant to the metric passed in, and have things
like cpu_load() and task_load() return different, but coherent, numbers
depending on which region we're in, this almost sounds 'simple'.

The devil is in the details, and the balancer is a hairy nest of details
which will make the above non-trivial.

But for 1) we could simply 'balance' on nr_running, for 2) we can
'balance' on runnable_avg and for 3) we'll 'balance' on load_avg (which
will then include blocked load).
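
(To make the three regions a bit more concrete, a hypothetical standalone sketch; every name in it is made up and it is not meant as the actual balancer code:)

#include <stdio.h>

/* hypothetical regions mirroring 1), 2) and 3) above */
enum lb_region { REGION_SPREAD, REGION_FILL, REGION_LOAD };

struct toy_cpu {
    int nr_running;     /* region 1 metric */
    int runnable_avg;   /* region 2 metric, 0..1024 */
    int load_avg;       /* region 3 metric, includes blocked load */
};

static enum lb_region pick_region(const struct toy_cpu *cpu, int n)
{
    int i;

    for (i = 0; i < n; i++)
        if (cpu[i].nr_running == 0)
            return REGION_SPREAD;   /* someone is idle: just spread tasks */
    for (i = 0; i < n; i++)
        if (cpu[i].runnable_avg < 1024)
            return REGION_FILL;     /* not everyone fully utilized yet */
    return REGION_LOAD;             /* all busy: now consider load */
}

static int cpu_metric(const struct toy_cpu *c, enum lb_region r)
{
    switch (r) {
    case REGION_SPREAD:
        return c->nr_running;
    case REGION_FILL:
        return c->runnable_avg;
    default:
        return c->load_avg;
    }
}

int main(void)
{
    struct toy_cpu cpu[2] = { { 0, 0, 0 }, { 3, 1024, 133 } };
    enum lb_region r = pick_region(cpu, 2);

    printf("region %d: metric cpu0=%d cpu1=%d\n",
           r, cpu_metric(&cpu[0], r), cpu_metric(&cpu[1], r));
    return 0;
}

The point would then be that calculate_imbalance() only ever sees one coherent metric at a time.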

Let me go play outside for a bit so that it can sink in what kind of
nonsense my heat addled brain has just sprouted :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-03 Thread Morten Rasmussen
On Fri, Jul 03, 2015 at 03:37:02AM +0800, Yuyang Du wrote:
> Hi Morten,
> 
> On Thu, Jul 02, 2015 at 12:40:32PM +0100, Morten Rasmussen wrote:
> > > detach_tasks() will attempt to pull 62 based on the tasks' task_h_load() but
> > the task_h_load() sum is only 5 + 10 + 0 and hence detach_tasks() will
> > empty the src_rq.
> > 
> > IOW, since task groups include blocked load in the load_avg_contrib (see
> > __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
> > imbalance includes blocked load and hence env->imbalance >=
> > sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> > detach_tasks() emptying the rq completely in the reported scenario where
> > blocked load > runnable load.
> 
> Whenever I want to know the load avg concerning task group, I need to
> walk through the complete codes again, I prefer not to do it this time.
> But it should not be that simply to say "the 118 comes from the blocked load".

But the whole hierarchy of group entities is updated each time we enqueue
or dequeue a task. I don't see how the group entity load_avg_contrib is
not up to date? Why do you need to update it again?

In any case, we have one task in the group hierarchy which has a
load_avg_contrib of 0 and the grand-grand parent group entity has a
load_avg_contrib of 118 and no additional tasks. That load contribution
must be from tasks which are no longer around on the rq? No?

> Anyway, with blocked load, yes, we definitely can't move (or even find) some
> amount of the imbalance if we only look at the tasks on the queue. But this
> may or may not be a problem.
> 
> Firstly, the question comes to whether we want blocked load anywhere.
> This is just about a "now vs. average" question.

That is what I meant in the paragraph below. It is a scheduling policy
question.

> Secondly, if we stick to average, we just need to treat the blocked load
> consistently, not that group SE has it, but task SE does not, or somewhere
> has it, others not.

I agree that inconsistent use of blocked load will lead us into trouble.
The problem is that none of the load-balance logic was designed for
blocked load. It was written to deal with load that is currently on the rq
and slightly biased by average cpu load, not dealing with load
contribution of tasks which we can't migrate at the moment because they
are blocked. The load-balance code has to be updated to deal with
blocked load. We will run into all sorts of issues if we don't and roll
out use of blocked load everywhere.

However, before we can rework the load-balance code, we have to agree on
the now vs average balance policy.

Your proposed patch implements a policy somewhere in between. We try to
balance based on average, but we don't allow idle_balance() to empty a
cpu completely. A pure average balance policy would allow a cpu to go
idle even if we do have tasks waiting on other rqs if the blocked
load indicates that another task will show up shortly on the cpu. A pure
"now" balance would balance based on runnable_load_avg for all entities
including groups ignoring all blocked load, but that goes against the
PELT group balancing design.
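
(Plugging in the numbers reported earlier in the thread, the gap between the two policies is large; a trivial standalone illustration, not kernel code:)

#include <stdio.h>

int main(void)
{
    /* cpu1 in the report: two runnable tasks (5 + 10) plus a group
     * entity whose 118 appears to be mostly blocked load */
    int now_policy = 5 + 10;        /* balance on what is runnable right now */
    int avg_policy = 5 + 10 + 118;  /* balance on averages incl. blocked load */

    printf("now = %d, average = %d\n", now_policy, avg_policy);
    return 0;
}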

I'm not against having a policy that sits somewhere in between, we just
have to agree it is the right policy and clean up the load-balance code
such that the implemented policy is clear.

Morten

>  
> Thanks,
> Yuyang
> 
> > Whether emptying the src_rq is the right thing to do depends on on your
> > point of view. Does balanced load (runnable+blocked) take priority over
> > keeping cpus busy or not? For idle_balance() it seems intuitively
> > correct to not empty the rq and hence you could consider env->imbalance
> > to be too big.
> > 
> > I think we will see more of this kind of problems if we include
> > weighted_cpuload() as well. Parts of the imbalance calculation code is
> > quite old and could use some attention first.
> > 
> > A short term fix could be what Yuyang propose, stop pulling tasks when
> > there is only one left in detach_tasks(). It won't affect active load
> > balance where we may want to migrate the last task as it active load
> > balance doesn't use detach_tasks().
> > 
> > Morten
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Mike Galbraith
On Fri, 2015-07-03 at 02:42 +0800, Yuyang Du wrote:

> But still, I think, even with the above, in idle balancing, pulling until the
> source rq's nr_running == 1 is not just "a short term fix", but should be there
> permanently acting like a last guard with no overhead, why not.

Yeah, seems so.  Searching for steal all samples...
(this is all with autogroup)

load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 3  imb: 23  det_tasks: 2  det_load: 3  zeros: 1
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 0  imb: 32  det_tasks: 2  det_load: 0  zeros: 2
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 1  imb: 17  det_tasks: 2  det_load: 1  zeros: 1
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 37  imb: 22  det_tasks: 2  det_load: 37  zeros: 1
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 0  imb: 102  det_tasks: 2  det_load: 0  zeros: 2


load_balance: idle - s_run: 0  d_run: 1 s_load: 0  d_load: 93  imb: 47  det_tasks: 1  det_load: 93  zeros: 0
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 202  imb: 125  det_tasks: 2  det_load: 202  zeros: 0
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 243  imb: 188  det_tasks: 2  det_load: 243  zeros: 0
load_balance: idle - s_run: 0  d_run: 1 s_load: 0  d_load: 145  imb: 73  det_tasks: 1  det_load: 145  zeros: 0
load_balance: idle - s_run: 0  d_run: 1 s_load: 0  d_load: 46  imb: 24  det_tasks: 1  det_load: 46  zeros: 0

Both varieties of total pilferage (w/wo 0 load tasks involved) seem to
happen only during idle balance, never periodic (yet).

Oddity: make -j8 occasionally stacks/pulls piles of load=dinky.

homer:/sys/kernel/debug/tracing # for i in `seq 1 10`; do cat trace|grep "s_run: 1.*det_tasks: $i.*zeros: 0"|wc -l; done
71634
1567
79
15
1
3
0
2
3
0
homer:/sys/kernel/debug/tracing # cat trace|grep "s_run: 1.*det_tasks: 8.*zeros: 0"
  <idle>-0 [002] dNs.   594.973783: load_balance: norm - s_run: 1  d_run: 9 s_load: 67  d_load: 1110  imb: 86  det_tasks: 8  det_load: 86  zeros: 0
   <...>-10367 [007] d...  1456.477281: load_balance: idle - s_run: 1  d_run: 8 s_load: 805  d_load: 22  imb: 45  det_tasks: 8  det_load: 22  zeros: 0
homer:/sys/kernel/debug/tracing # cat trace|grep "s_run: 1.*det_tasks: 9.*zeros: 0"
   <...>-23317 [004] d...   486.677925: load_balance: idle - s_run: 1  d_run: 9 s_load: 888  d_load: 27  imb: 47  det_tasks: 9  det_load: 27  zeros: 0
   <...>-11485 [002] d...   573.411095: load_balance: idle - s_run: 1  d_run: 9 s_load: 124  d_load: 78  imb: 82  det_tasks: 9  det_load: 78  zeros: 0
   <...>-23286 [000] d...  1510.378740: load_balance: idle - s_run: 1  d_run: 9 s_load: 102  d_load: 58  imb: 63  det_tasks: 9  det_load: 58  zeros: 0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Yuyang Du
Hi Morten,

On Thu, Jul 02, 2015 at 12:40:32PM +0100, Morten Rasmussen wrote:
> detach_tasks() will attempt to pull 62 based on the tasks' task_h_load() but
> the task_h_load() sum is only 5 + 10 + 0 and hence detach_tasks() will
> empty the src_rq.
> 
> IOW, since task groups include blocked load in the load_avg_contrib (see
> __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
> imbalance includes blocked load and hence env->imbalance >=
> sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> detach_tasks() emptying the rq completely in the reported scenario where
> blocked load > runnable load.

Whenever I want to know the load avg concerning task group, I need to
walk through the complete codes again, I prefer not to do it this time.
But it should not be that simply to say "the 118 comes from the blocked load".

Anyway, with blocked load, yes, we definitely can't move (or even find) some
amount of the imbalance if we only look at the tasks on the queue. But this
may or may not be a problem.

Firstly, the question comes to whether we want blocked load anywhere.
This is just about a "now vs. average" question.

Secondly, if we stick to average, we just need to treat the blocked load
consistently, not that group SE has it, but task SE does not, or somewhere
has it, others not.
 
Thanks,
Yuyang

> Whether emptying the src_rq is the right thing to do depends on your
> point of view. Does balanced load (runnable+blocked) take priority over
> keeping cpus busy or not? For idle_balance() it seems intuitively
> correct to not empty the rq and hence you could consider env->imbalance
> to be too big.
> 
> I think we will see more of this kind of problems if we include
> weighted_cpuload() as well. Parts of the imbalance calculation code is
> quite old and could use some attention first.
> 
> A short term fix could be what Yuyang proposes, stop pulling tasks when
> there is only one left in detach_tasks(). It won't affect active load
> balance where we may want to migrate the last task as active load
> balance doesn't use detach_tasks().
> 
> Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Yuyang Du
On Thu, Jul 02, 2015 at 12:44:55PM +0100, Morten Rasmussen wrote:
> On Thu, Jul 02, 2015 at 12:53:59PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> > > And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
> > > tasks from the other, makes the other idle, and this iterates...
> > > 
> > > That being said, it is also obvious to prevent the livelock from happening:
> > > idle pulling until the source rq's nr_running is 1, because otherwise we
> > > just avoid idleness by making another idleness.
> > 
> > Well, ideally the imbalance calculation would be so that it would avoid
> > this from happening in the first place. Its a 'balance' operation, not a
> > 'steal everything'.
> > 
> > We want to take work -- as we have none -- but we want to ensure that
> > afterwards we have equal work, ie we're balanced.
> 
> Agreed, I think this is the true problem. See my other reply.

Yes, this is agreed at all times. Like I said, load_balance() (for idle balancing)
should compute the right imbalance and move a fair amount, so that we end up
balanced. Whatever is wrong in how the "right imbalance" is computed and moved
should be fixed anyway.

But still, I think, even with the above, in idle balancing, pulling until the
source rq's nr_running == 1 is not just "a short term fix", but should be there
permanently acting like a last guard with no overhead, why not.
 
> 
> > 
> > So clearly that all is hosed. Now Morten was looking into simplifying
> > calculate_imbalance() recently.
> 
> Yes. I'm held up doing other stuff at the moment, but I think
> calculate_imbalance() needs some attention and I'm planning on looking at
> that next.

Thanks,
Yuyang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Morten Rasmussen
On Thu, Jul 02, 2015 at 12:53:59PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> > And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
> > tasks from the other, makes the other idle, and this iterates...
> > 
> > That being said, it is also obvious to prevent the livelock from happening:
> > idle pulling until the source rq's nr_running is 1, because otherwise we
> > just avoid idleness by making another idleness.
> 
> Well, ideally the imbalance calculation would be so that it would avoid
> this from happening in the first place. Its a 'balance' operation, not a
> 'steal everything'.
> 
> We want to take work -- as we have none -- but we want to ensure that
> afterwards we have equal work, ie we're balanced.

Agreed, I think this is the true problem. See my other reply.

> 
> So clearly that all is hosed. Now Morten was looking into simplifying
> calculate_imbalance() recently.

Yes. I'm held up doing other stuff at the moment, but I think
calculate_imbalance() needs some attention and I'm planning on looking at
that next.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Morten Rasmussen
On Thu, Jul 02, 2015 at 09:05:39AM +0800, Yuyang Du wrote:
> Hi Mike,
> 
> On Thu, Jul 02, 2015 at 10:05:47AM +0200, Mike Galbraith wrote:
> > On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:
> > 
> > > That being said, it is also obvious to prevent the livelock from happening:
> > > idle pulling until the source rq's nr_running is 1, because otherwise we
> > > just avoid idleness by making another idleness.
> > 
> > Yeah, but that's just the symptom, not the disease.  Better for the idle
> > balance symptom may actually be to only pull one when idle balancing.
> > After all, the immediate goal is to find something better to do than
> > idle, not to achieve continual perfect (is the enemy of good) balance.
> > 
> Symptom? :)
> 
> You mean "pull one and stop, can't be greedy"? Right, but still need to
> assure you don't make another idle CPU (meaning until nr_running == 1), which
> is the cure to disease.
> 
> I am ok with at most "pull one", but probably we stick to the load_balance()
>  by pulling a fair amount, assuming load_balance() magically computes the
> right imbalance, otherwise you may have to do multiple "pull one"s.

Talking about the disease and looking at the debug data that Rabin has
provided I think the problem is due to the way blocked load is handled
(or not handled) in calculate_imbalance().

We have three entities in the root cfs_rq on cpu1:

1. Task entity pid 7, load_avg_contrib = 5.
2. Task entity pid 30, load_avg_contrib = 10.
3. Group entity, load_avg_contrib = 118, but contains task entity pid
   413 further down the hierarchy with task_h_load() = 0. The 118 comes
   from the blocked load contribution in the system.slice task group.

calculate_imbalance() figures out the average loads are:

cpu0: load/capacity = 0*1024/1024 = 0
cpu1: load/capacity = (5 + 10 + 118)*1024/1024 = 133

domain: load/capacity = (0 + 133)*1024/(2*1024) = 62

env->imbalance = 62

Rabin reported env->imbalance = 60 after pulling the rcu task with
load_avg_contrib = 5. It doesn't match my numbers exactly, but it's pretty
close ;-)

detach_tasks() will attempt to pull 62 based on the tasks' task_h_load() but
the task_h_load() sum is only 5 + 10 + 0 and hence detach_tasks() will
empty the src_rq.
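
(The emptying is easy to reproduce standalone with the numbers above; a toy loop, not the real detach_tasks():)

#include <stdio.h>

int main(void)
{
    /* task_h_load() of the three entities on cpu1, per the analysis above */
    int h_load[3] = { 5, 10, 0 };
    int imbalance = 62;     /* includes the blocked-load part of the 118 */
    int nr_running = 3, i;

    /* the real loop keeps detaching while env->imbalance stays positive */
    for (i = 0; i < 3 && imbalance > 0; i++) {
        imbalance -= h_load[i];
        nr_running--;
        printf("pulled h_load %d, imbalance left %d\n", h_load[i], imbalance);
    }
    printf("src_rq nr_running = %d\n", nr_running);  /* 0: the rq is empty */
    return 0;
}

With the proposed nr_running check the loop would stop with one task still on the source rq instead.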

IOW, since task groups include blocked load in the load_avg_contrib (see
__update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
imbalance includes blocked load and hence env->imbalance >=
sum(task_h_load(p)) for all tasks p on the rq. Which leads to
detach_tasks() emptying the rq completely in the reported scenario where
blocked load > runnable load.

Whether emptying the src_rq is the right thing to do depends on your
point of view. Does balanced load (runnable+blocked) take priority over
keeping cpus busy or not? For idle_balance() it seems intuitively
correct to not empty the rq and hence you could consider env->imbalance
to be too big.

I think we will see more of this kind of problems if we include
weighted_cpuload() as well. Parts of the imbalance calculation code is
quite old and could use some attention first.

A short term fix could be what Yuyang proposes, stop pulling tasks when
there is only one left in detach_tasks(). It won't affect active load
balance where we may want to migrate the last task as active load
balance doesn't use detach_tasks().

Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Peter Zijlstra
On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
> tasks from the other, makes the other idle, and this iterates...
> 
> That being said, it is also obvious to prevent the livelock from happening:
> idle pulling until the source rq's nr_running is 1, because otherwise we
> just avoid idleness by making another idleness.

Well, ideally the imbalance calculation would be so that it would avoid
this from happening in the first place. Its a 'balance' operation, not a
'steal everything'.

We want to take work -- as we have none -- but we want to ensure that
afterwards we have equal work, ie we're balanced.
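
(As a sketch of that intent only, nothing like the real imbalance code: an idle destination that takes more than half of the difference just recreates the imbalance on the other side.)

#include <stdio.h>

/* illustration: cap the pull so src and dst end up roughly equal */
static long balanced_pull(long src_load, long dst_load)
{
    long diff = src_load - dst_load;

    return diff > 0 ? diff / 2 : 0;
}

int main(void)
{
    /* with the cpu1 load of 133 from this thread and an idle cpu0 */
    printf("pull at most %ld, not all 133\n", balanced_pull(133, 0));
    return 0;
}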

So clearly that all is hosed. Now Morten was looking into simplifying
calculate_imbalance() recently.

> On Wed, Jul 01, 2015 at 10:44:04PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
> > >  PID: 413  TASK: 8edda408  CPU: 1   COMMAND: "rngd"
> > >   task_h_load(): 0 [ = (load_avg_contrib {0} * cfs_rq->h_load {0}) / (cfs_rq->runnable_load_avg {0} + 1) ]
> > >   SE: 8edda450 load_avg_contrib: 0 load.weight:  1024 PARENT: 8fffbd00 GROUPNAME: (null)
> > >   SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 GROUPNAME: rngd@hwrng.service
> > >   SE: 8f531f80 load_avg_contrib: 0 load.weight:  1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
> > >   SE: 8f456e00 load_avg_contrib:   118 load.weight:   911 PARENT:  GROUPNAME: system.slice
> > 
> > Firstly, a group (parent) load_avg_contrib should never be less than
> > that of its constituent parts, therefore the top 3 SEs should have at
> > least 118 too.
> 
> I think the downward is parent,

Ugh, I cannot read. Let me blame it on the heat.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Mike Galbraith
On Thu, 2015-07-02 at 09:05 +0800, Yuyang Du wrote:
>  
> Symptom? :)

Yeah, my fingers have a collection of those :)

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Yuyang Du
Hi Mike,

On Thu, Jul 02, 2015 at 10:05:47AM +0200, Mike Galbraith wrote:
> On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:
> 
> > That being said, it is also obvious to prevent the livelock from happening:
> > idle pulling until the source rq's nr_running is 1, because otherwise we
> > just avoid idleness by making another idleness.
> 
> Yeah, but that's just the symptom, not the disease.  Better for the idle
> balance symptom may actually be to only pull one when idle balancing.
> After all, the immediate goal is to find something better to do than
> idle, not to achieve continual perfect (is the enemy of good) balance.
> 
Symptom? :)

You mean "pull one and stop, can't be greedy"? Right, but still need to
assure you don't make another idle CPU (meaning until nr_running == 1), which
is the cure to disease.

I am ok with at most "pull one", but probably we stick to the load_balance()
by pulling a fair amount, assuming load_balance() magically computes the
right imbalance, otherwise you may have to do multiple "pull one"s.

Thanks,
Yuyang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Mike Galbraith
On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:

> That being said, it is also obvious to prevent the livelock from happening:
> idle pulling until the source rq's nr_running is 1, because otherwise we
> just avoid idleness by making another idleness.

Yeah, but that's just the symptom, not the disease.  Better for the idle
balance symptom may actually be to only pull one when idle balancing.
After all, the immediate goal is to find something better to do than
idle, not to achieve continual perfect (is the enemy of good) balance.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Yuyang Du
Hello,

We have two basic kinds of load balancing: idle balancing and periodic balancing.

It takes a tick for periodic balancing to happen again, so the livelock
could not be from it.

On the contrary, the idle balancing occurs as needed at arbitrary time,
so the livelock could happen.

And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
tasks from the other, makes the other idle, and this iterates...

That being said, it is also obvious to prevent the livelock from happening:
idle pulling until the source rq's nr_running is 1, because otherwise we
just avoid idleness by making another idleness.
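
(A toy model of that iteration, just to spell it out; not scheduler code:)

#include <stdio.h>

int main(void)
{
    int nr[2] = { 0, 3 };   /* CPU0 just went idle, CPU1 still has 3 tasks */
    int round, src, dst;

    for (round = 0; round < 4; round++) {
        dst = (nr[0] == 0) ? 0 : 1;     /* the idle CPU runs idle balancing...  */
        src = !dst;
        nr[dst] += nr[src];             /* ...and pulls every task across,      */
        nr[src] = 0;                    /* which leaves the source idle instead */
        printf("round %d: cpu0=%d cpu1=%d\n", round, nr[0], nr[1]);
    }
    return 0;
}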

Hope the patch at the end should work.

On Wed, Jul 01, 2015 at 10:44:04PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
> >  PID: 413  TASK: 8edda408  CPU: 1   COMMAND: "rngd"
> >   task_h_load(): 0 [ = (load_avg_contrib {0} * cfs_rq->h_load {0}) / (cfs_rq->runnable_load_avg {0} + 1) ]
> >   SE: 8edda450 load_avg_contrib: 0 load.weight:  1024 PARENT: 8fffbd00 GROUPNAME: (null)
> >   SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 GROUPNAME: rngd@hwrng.service
> >   SE: 8f531f80 load_avg_contrib: 0 load.weight:  1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
> >   SE: 8f456e00 load_avg_contrib:   118 load.weight:   911 PARENT:  GROUPNAME: system.slice
> 
> Firstly, a group (parent) load_avg_contrib should never be less than
> that of its constituent parts, therefore the top 3 SEs should have at
> least 118 too.

I think the downward direction is the parent, so with this case, "parent is bigger
than child" is ok.

But if the child is the only contributor to the parent's load (which is probably
the case here), then the load_avg_contrib should not jump suddenly from 0 to 118.

So this might be due to __update_group_entity_contrib(). I did not look into
detail in this complex function, but a glimpse suggests it is at least not
consistent, if not arbitrary, so it likely will not satisfy "parent is bigger
than child".

But this patch series http://comments.gmane.org/gmane.linux.kernel/1981970
should be very consistent in computing all SEs' load average, thus satisfying
the universal truth...

> Now its been a while since I looked at the per entity load tracking
> stuff so some of the details have left me, but while it looks like we
> add the se->avg.load_avg_contrib to its cfs->runnable_load, we do not
> propagate that into the corresponding (group) se.
> 
> This means the se->avg.load_avg_contrib is accounted per cpu without
> migration benefits. So if our task just got migrated onto a cpu that
> hasn't ran the group in a while, the group will not have accumulated
> runtime.

Yes, this is true, the migration has nothing to do with either the source
or the destination group SEs. Down this road, if we want the addition and
subtraction after migration, we need to do it bottom-up along the SE, and
do without rq->lock (?). Need to think about it for a while.

Thanks,
Yuyang
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40a7fcb..f7cc1ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
return 0;
 
while (!list_empty(tasks)) {
+
+   if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
+   break;
+
p = list_first_entry(tasks, struct task_struct, se.group_node);
 
env->loop++;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Yuyang Du
Hello,

We have two basic load balancing: idle blancing and periodic balancing.

It takes a tick for periodic balancing to happen again, so the livelock
could not be from it.

On the contrary, the idle balancing occurs as needed at arbitrary time,
so the livelock could happen.

And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
tasks from the other, makes the other idle, and this iterates...

That being said, it is also obvious to prevent the livelock from happening:
idle pulling until the source rq's nr_running is 1, becuase otherwise we
just avoid idleness by making another idleness.

Hope the patch at the end should work.

On Wed, Jul 01, 2015 at 10:44:04PM +0200, Peter Zijlstra wrote:
 On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
   PID: 413TASK: 8edda408  CPU: 1   COMMAND: rngd
task_h_load(): 0 [ = (load_avg_contrib {0} * cfs_rq-h_load {
  0}) / (cfs_rq-runnable_load_avg {0} + 1) ]
SE: 8edda450 load_avg_contrib: 0 load.weight:  1024 PARENT: 8fffbd00 
  GROUPNAME: (null)
SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 
  GROUPNAME: rngd@hwrng.service
SE: 8f531f80 load_avg_contrib: 0 load.weight:  1024 PARENT: 8f456e00 
  GROUPNAME: system-rngd.slice
SE: 8f456e00 load_avg_contrib:   118 load.weight:   911 PARENT:  
  GROUPNAME: system.slice
 
 Firstly, a group (parent) load_avg_contrib should never be less than
 that of its constituent parts, therefore the top 3 SEs should have at
 least 118 too.

I think the downward is parent, so with this case, parent is bigger than
child is ok. 

But if the child is the only contributor to the parent's load (probably is this 
case),
then the load_avg_contrib should not jump suddenly from 0 to 118.

So this might be due to the __update_group_entity_contrib(), I did not look into
detail in this complex function, but a glimpse suggests it is at least not 
consistent
if not arbitray, so likely will not satisfy parent is bigger than child.

But this patch series http://comments.gmane.org/gmane.linux.kernel/1981970 
should
be very consistent in computing all SE's load avergage, thus safisfy the 
universal
truth...

 Now its been a while since I looked at the per entity load tracking
 stuff so some of the details have left me, but while it looks like we
 add the se-avg.load_avg_contrib to its cfs-runnable_load, we do not
 propagate that into the corresponding (group) se.
 
 This means the se-avg.load_avg_contrib is accounted per cpu without
 migration benefits. So if our task just got migrated onto a cpu that
 hasn't ran the group in a while, the group will not have accumulated
 runtime.

Yes, this is true, the migration has nothing to do with either the source
or the destination group SEs. Down this road, if we want the addition and
subtraction after migation, we need to do it bottom-up along the SE, and 
do without rq-lock (?). Need to think about it for a while.

Thanks,
Yuyang
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40a7fcb..f7cc1ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
return 0;
 
while (!list_empty(tasks)) {
+
+   if (env-idle == CPU_NEWLY_IDLE  env-src_rq-nr_running = 1)
+   break;
+
p = list_first_entry(tasks, struct task_struct, se.group_node);
 
env-loop++;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Mike Galbraith
On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:

 That being said, it is also obvious to prevent the livelock from happening:
 idle pulling until the source rq's nr_running is 1, becuase otherwise we
 just avoid idleness by making another idleness.

Yeah, but that's just the symptom, not the disease.  Better for the idle
balance symptom may actually be to only pull one when idle balancing.
After all, the immediate goal is to find something better to do than
idle, not to achieve continual perfect (is the enemy of good) balance.

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Mike Galbraith
On Thu, 2015-07-02 at 09:05 +0800, Yuyang Du wrote:
  
 Symptom? :)

Yeah, my fingers have a collection of those :)

-Mike

--


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Yuyang Du
Hi Mike,

On Thu, Jul 02, 2015 at 10:05:47AM +0200, Mike Galbraith wrote:
 On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:
 
  That being said, it is also obvious to prevent the livelock from happening:
  idle pulling until the source rq's nr_running is 1, because otherwise we
  just avoid idleness by making another idleness.
 
 Yeah, but that's just the symptom, not the disease.  Better for the idle
 balance symptom may actually be to only pull one when idle balancing.
 After all, the immediate goal is to find something better to do than
 idle, not to achieve continual perfect (is the enemy of good) balance.
 
Symptom? :)

You mean pull one and stop, can't be greedy? Right, but still need to
assure you don't make another idle CPU (meaning until nr_running == 1), which
is the cure to disease.

I am ok with at most pull one, but probably we stick to the load_balance()
by pulling a fair amount, assuming load_balance() magically computes the
right imbalance, otherwise you may have to do multiple pull ones.

Thanks,
Yuyang
--


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Morten Rasmussen
On Thu, Jul 02, 2015 at 09:05:39AM +0800, Yuyang Du wrote:
 Hi Mike,
 
 On Thu, Jul 02, 2015 at 10:05:47AM +0200, Mike Galbraith wrote:
  On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:
  
   That being said, it is also obvious to prevent the livelock from 
   happening:
   idle pulling until the source rq's nr_running is 1, because otherwise we
   just avoid idleness by making another idleness.
  
  Yeah, but that's just the symptom, not the disease.  Better for the idle
  balance symptom may actually be to only pull one when idle balancing.
  After all, the immediate goal is to find something better to do than
  idle, not to achieve continual perfect (is the enemy of good) balance.
  
 Symptom? :)
 
 You mean pull one and stop, can't be greedy? Right, but still need to
 assure you don't make another idle CPU (meaning until nr_running == 1), which
 is the cure to disease.
 
 I am ok with at most pull one, but probably we stick to the load_balance()
 by pulling a fair amount, assuming load_balance() magically computes the
 right imbalance, otherwise you may have to do multiple pull ones.

Talking about the disease and looking at the debug data that Rabin has
provided I think the problem is due to the way blocked load is handled
(or not handled) in calculate_imbalance().

We have three entities in the root cfs_rq on cpu1:

1. Task entity pid 7, load_avg_contrib = 5.
2. Task entity pid 30, load_avg_contrib = 10.
3. Group entity, load_avg_contrib = 118, but contains task entity pid
   413 further down the hierarchy with task_h_load() = 0. The 118 comes
   from the blocked load contribution in the system.slice task group.

calculate_imbalance() figures out the average loads are:

cpu0: load/capacity = 0*1024/1024 = 0
cpu1: load/capacity = (5 + 10 + 118)*1024/1024 = 133

domain: load/capacity = (0 + 133)*1024/(2*1024) = 62

env->imbalance = 62

Rabin reported env->imbalance = 60 after pulling the rcu task with
load_avg_contrib = 5. It doesn't match my numbers exactly, but it's pretty
close ;-)

detach_tasks() will attempt to pull 62 based on the tasks' task_h_load(), but
the task_h_load() sum is only 5 + 10 + 0 and hence detach_tasks() will
empty the src_rq.

IOW, since task groups include blocked load in the load_avg_contrib (see
__update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
imbalance includes blocked load and hence env->imbalance >=
sum(task_h_load(p)) for all tasks p on the rq. Which leads to
detach_tasks() emptying the rq completely in the reported scenario where
blocked load > runnable load.
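
To make the mismatch concrete, here is a toy calculation with the numbers from
the dump (illustrative userspace C, not the real calculate_imbalance() path):

  #include <stdio.h>

  int main(void)
  {
          /* per-entity load_avg_contrib on cpu1, from the dump */
          unsigned long contribs[] = { 5, 10, 118 };  /* 118 is mostly blocked load */
          unsigned long h_loads[]  = { 5, 10, 0 };    /* what detach_tasks() can move */
          unsigned long rq_load = 0, movable = 0;
          int i;

          for (i = 0; i < 3; i++) {
                  rq_load += contribs[i];
                  movable += h_loads[i];
          }

          /*
           * The imbalance is derived from rq_load (133) and ends up around
           * 60, but the movable task_h_load() sum is only 15, so the detach
           * loop runs the queue dry before the imbalance is met.
           */
          printf("rq load %lu, movable %lu\n", rq_load, movable);
          return 0;
  }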

Whether emptying the src_rq is the right thing to do depends on your
point of view. Does balanced load (runnable+blocked) take priority over
keeping cpus busy or not? For idle_balance() it seems intuitively
correct to not empty the rq and hence you could consider env->imbalance
to be too big.

I think we will see more of this kind of problems if we include
weighted_cpuload() as well. Parts of the imbalance calculation code is
quite old and could use some attention first.

A short term fix could be what Yuyang proposes: stop pulling tasks when
there is only one left in detach_tasks(). It won't affect active load
balance, where we may want to migrate the last task, as active load
balance doesn't use detach_tasks().

Morten
--


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Morten Rasmussen
On Thu, Jul 02, 2015 at 12:53:59PM +0200, Peter Zijlstra wrote:
 On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
  And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
  tasks from the other, makes the other idle, and this iterates...
  
  That being said, it is also obvious to prevent the livelock from happening:
  idle pulling until the source rq's nr_running is 1, because otherwise we
  just avoid idleness by making another idleness.
 
 Well, ideally the imbalance calculation would be so that it would avoid
 this from happening in the first place. It's a 'balance' operation, not a
 'steal everything'.
 
 We want to take work -- as we have none -- but we want to ensure that
 afterwards we have equal work, ie we're balanced.

Agreed, I think this is the true problem. See my other reply.

 
 So clearly that all is hosed. Now Morten was looking into simplifying
 calculate_imbalance() recently.

Yes. I'm held up doing other stuff at the moment, but I think
calculate_imbalance() needs some attention and I'm planning on looking at
that next.
--


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Peter Zijlstra
On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
 And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
 tasks from the other, makes the other idle, and this iterates...
 
 That being said, it is also obvious to prevent the livelock from happening:
 idle pulling until the source rq's nr_running is 1, because otherwise we
 just avoid idleness by making another idleness.

Well, ideally the imbalance calculation would be so that it would avoid
this from happening in the first place. It's a 'balance' operation, not a
'steal everything'.

We want to take work -- as we have none -- but we want to ensure that
afterwards we have equal work, ie we're balanced.

So clearly that all is hosed. Now Morten was looking into simplifying
calculate_imbalance() recently.

 On Wed, Jul 01, 2015 at 10:44:04PM +0200, Peter Zijlstra wrote:
  On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
PID: 413TASK: 8edda408  CPU: 1   COMMAND: rngd
 task_h_load(): 0 [ = (load_avg_contrib {0} * cfs_rq->h_load {
0}) / (cfs_rq->runnable_load_avg {0} + 1) ]
 SE: 8edda450 load_avg_contrib: 0 load.weight:  1024 PARENT: 
   8fffbd00 GROUPNAME: (null)
 SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 
   8f531f80 GROUPNAME: rngd@hwrng.service
 SE: 8f531f80 load_avg_contrib: 0 load.weight:  1024 PARENT: 
   8f456e00 GROUPNAME: system-rngd.slice
 SE: 8f456e00 load_avg_contrib:   118 load.weight:   911 PARENT: 
    GROUPNAME: system.slice
  
  Firstly, a group (parent) load_avg_contrib should never be less than
  that of its constituent parts, therefore the top 3 SEs should have at
  least 118 too.
 
 I think downward here is toward the parent,

Ugh, I cannot read. Let me blame it on the heat.
--


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Yuyang Du
Hi Morten,

On Thu, Jul 02, 2015 at 12:40:32PM +0100, Morten Rasmussen wrote:
 detach_tasks() will attempt to pull 62 based on the tasks' task_h_load(), but
 the task_h_load() sum is only 5 + 10 + 0 and hence detach_tasks() will
 empty the src_rq.
 
 IOW, since task groups include blocked load in the load_avg_contrib (see
 __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
 imbalance includes blocked load and hence env->imbalance >=
 sum(task_h_load(p)) for all tasks p on the rq. Which leads to
 detach_tasks() emptying the rq completely in the reported scenario where
 blocked load > runnable load.

Whenever I want to know how the load avg treats task groups, I need to
walk through the complete code again, and I prefer not to do it this time.
But it should not be that simple to say the 118 comes from the blocked load.

Anyway, with blocked load, yes, we definitely can't move (or even find) some
amount of the imbalance if we only look at the tasks on the queue. But this
may or may not be a problem.

Firstly, the question comes down to whether we want blocked load anywhere.
This is just a now vs. average question.

Secondly, if we stick to the average, we just need to treat the blocked load
consistently: not have the group SE include it while the task SE does not,
or have some places include it and others not.
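
For reference, the inconsistency being discussed is roughly the one sketched
below (illustrative only, using the 3.18-era field names; these helpers do not
exist in the kernel):

  /*
   * Illustrative only: group contributions are fed from runnable plus
   * blocked load, while task_h_load() distributes cfs_rq->h_load over
   * the runnable load alone.
   */
  static inline unsigned long group_contrib_input(struct cfs_rq *cfs_rq)
  {
          /* roughly what __update_cfs_rq_tg_load_contrib() works from */
          return cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
  }

  static inline unsigned long task_h_load_divisor(struct cfs_rq *cfs_rq)
  {
          /* what task_h_load() divides by */
          return cfs_rq->runnable_load_avg + 1;
  }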
 
Thanks,
Yuyang

 Whether emptying the src_rq is the right thing to do depends on your
 point of view. Does balanced load (runnable+blocked) take priority over
 keeping cpus busy or not? For idle_balance() it seems intuitively
 correct to not empty the rq and hence you could consider env->imbalance
 to be too big.
 
 I think we will see more of this kind of problems if we include
 weighted_cpuload() as well. Parts of the imbalance calculation code is
 quite old and could use some attention first.
 
 A short term fix could be what Yuyang proposes: stop pulling tasks when
 there is only one left in detach_tasks(). It won't affect active load
 balance, where we may want to migrate the last task, as active load
 balance doesn't use detach_tasks().
 
 Morten
--


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Yuyang Du
On Thu, Jul 02, 2015 at 12:44:55PM +0100, Morten Rasmussen wrote:
 On Thu, Jul 02, 2015 at 12:53:59PM +0200, Peter Zijlstra wrote:
  On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
   And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
   tasks from the other, makes the other idle, and this iterates...
   
   That being said, it is also obvious to prevent the livelock from 
   happening:
   idle pulling until the source rq's nr_running is 1, because otherwise we
   just avoid idleness by making another idleness.
  
  Well, ideally the imbalance calculation would be so that it would avoid
  this from happening in the first place. It's a 'balance' operation, not a
  'steal everything'.
  
  We want to take work -- as we have none -- but we want to ensure that
  afterwards we have equal work, ie we're balanced.
 
 Agreed, I think this is the true problem. See my other reply.

Yes, this has been agreed all along. Like I said, load_balance() (for idle
balancing) should compute the right imbalance and move a fair amount, so that
we end up balanced. Whatever is wrong in how the imbalance is computed and
moved should be fixed anyway.

But still, I think, even with the above, in idle balancing, pulling only until
the source rq's nr_running == 1 is not just a short term fix, but should be
there permanently, acting like a last guard with no overhead, why not.
 
 
  
  So clearly that all is hosed. Now Morten was looking into simplifying
  calculate_imbalance() recently.
 
 Yes. I'm held up doing other stuff at the moment, but I think
 calculate_imbalance() needs some attention and I'm planning on looking at
 that next.

Thanks,
Yuyang
--


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-02 Thread Mike Galbraith
On Fri, 2015-07-03 at 02:42 +0800, Yuyang Du wrote:

 But still, I think, even with the above, in idle balancing, pulling only until
 the source rq's nr_running == 1 is not just a short term fix, but should be
 there permanently, acting like a last guard with no overhead, why not.

Yeah, seems so.  Searching for "steal all" samples...
(this is all with autogroup)

load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 3  imb: 23  
det_tasks: 2  det_load: 3  zeros: 1
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 0  imb: 32  
det_tasks: 2  det_load: 0  zeros: 2
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 1  imb: 17  
det_tasks: 2  det_load: 1  zeros: 1
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 37  imb: 22  
det_tasks: 2  det_load: 37  zeros: 1
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 0  imb: 102  
det_tasks: 2  det_load: 0  zeros: 2


load_balance: idle - s_run: 0  d_run: 1 s_load: 0  d_load: 93  imb: 47  
det_tasks: 1  det_load: 93  zeros: 0
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 202  imb: 125  
det_tasks: 2  det_load: 202  zeros: 0
load_balance: idle - s_run: 0  d_run: 2 s_load: 0  d_load: 243  imb: 188  
det_tasks: 2  det_load: 243  zeros: 0
load_balance: idle - s_run: 0  d_run: 1 s_load: 0  d_load: 145  imb: 73  
det_tasks: 1  det_load: 145  zeros: 0
load_balance: idle - s_run: 0  d_run: 1 s_load: 0  d_load: 46  imb: 24  
det_tasks: 1  det_load: 46  zeros: 0

Both varieties of total pilferage (w/wo 0 load tasks involved) seem to
happen only during idle balance, never periodic (yet).

Oddity: make -j8 occasionally stacks/pulls piles of load=dinky.

homer:/sys/kernel/debug/tracing # for i in `seq 1 10`; do cat trace|grep "s_run: 1.*det_tasks: $i.*zeros: 0"|wc -l; done
71634
1567
79
15
1
3
0
2
3
0
homer:/sys/kernel/debug/tracing # cat trace|grep "s_run: 1.*det_tasks: 8.*zeros: 0"
  idle-0 [002] dNs.   594.973783: load_balance: norm - s_run: 1  
d_run: 9 s_load: 67  d_load: 1110  imb: 86  det_tasks: 8  det_load: 86  zeros: 0
   ...-10367 [007] d...  1456.477281: load_balance: idle - s_run: 1  
d_run: 8 s_load: 805  d_load: 22  imb: 45  det_tasks: 8  det_load: 22  zeros: 0
homer:/sys/kernel/debug/tracing # cat trace|grep "s_run: 1.*det_tasks: 9.*zeros: 0"
   ...-23317 [004] d...   486.677925: load_balance: idle - s_run: 1  
d_run: 9 s_load: 888  d_load: 27  imb: 47  det_tasks: 9  det_load: 27  zeros: 0
   ...-11485 [002] d...   573.411095: load_balance: idle - s_run: 1  
d_run: 9 s_load: 124  d_load: 78  imb: 82  det_tasks: 9  det_load: 78  zeros: 0
   ...-23286 [000] d...  1510.378740: load_balance: idle - s_run: 1  
d_run: 9 s_load: 102  d_load: 58  imb: 63  det_tasks: 9  det_load: 58  zeros: 0

--


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-01 Thread Peter Zijlstra
On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
>  PID: 413TASK: 8edda408  CPU: 1   COMMAND: "rngd"
>   task_h_load(): 0 [ = (load_avg_contrib {0} * cfs_rq->h_load {
> 0}) / (cfs_rq->runnable_load_avg {0} + 1) ]
>   SE: 8edda450 load_avg_contrib: 0 load.weight:  1024 PARENT: 8fffbd00 
> GROUPNAME: (null)
>   SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 
> GROUPNAME: rngd@hwrng.service
>   SE: 8f531f80 load_avg_contrib: 0 load.weight:  1024 PARENT: 8f456e00 
> GROUPNAME: system-rngd.slice
>   SE: 8f456e00 load_avg_contrib:   118 load.weight:   911 PARENT:  
> GROUPNAME: system.slice

So there's two problems there... the first we can (and should) fix, the
second I'm not sure there's anything we can do about.

Firstly, a group (parent) load_avg_contrib should never be less than
that of its constituent parts, therefore the top 3 SEs should have at
least 118 too.

Now its been a while since I looked at the per entity load tracking
stuff so some of the details have left me, but while it looks like we
add the se->avg.load_avg_contrib to its cfs->runnable_load, we do not
propagate that into the corresponding (group) se.

This means the se->avg.load_avg_contrib is accounted per cpu without
migration benefits. So if our task just got migrated onto a cpu that
hasn't ran the group in a while, the group will not have accumulated
runtime.

A quick fix would be something like the below; although I think we want
to do something else, like maybe propagate the load_avg_contrib up the
hierarchy etc.. But I need to think more about that.

The second problem is that your second SE has a weight of 2; that'll get
the task_h_load() a factor of 1/512 in, which will flatten pretty much
anything down to small. This is per configuration, so there's really not
something we can or should do about that.
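
(To spell that out with a made-up number: weight 2 against the default 1024
scales the group's share by roughly 2/1024 = 1/512, so even if the level above
carried an h_load of, say, 250, integer math gives 250 * 2 / 1024 = 0, and
everything below it, including rngd's task_h_load(), flattens to 0.)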

Untested, uncompiled hackery following, mostly for discussion.

---
 kernel/sched/fair.c  | 6 --
 kernel/sched/sched.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0ca0a6..95d0ba249c8b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6082,7 +6082,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
unsigned long now = jiffies;
-   unsigned long load;
+   unsigned long load, load_avg_contrib = 0;
 
if (cfs_rq->last_h_load_update == now)
return;
@@ -6090,6 +6090,8 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
cfs_rq->h_load_next = NULL;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
+   cfs_rq->h_load_avg_contrib = load_avg_contrib =
+   max(load_avg_contrib, se->avg.load_avg_contrib);
cfs_rq->h_load_next = se;
if (cfs_rq->last_h_load_update == now)
break;
@@ -6102,7 +6104,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
while ((se = cfs_rq->h_load_next) != NULL) {
load = cfs_rq->h_load;
-   load = div64_ul(load * se->avg.load_avg_contrib,
+   load = div64_ul(load * cfs_rq->h_load_avg_contrib,
cfs_rq->runnable_load_avg + 1);
cfs_rq = group_cfs_rq(se);
cfs_rq->h_load = load;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 885889190a1f..7738e3b301b7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -394,6 +394,7 @@ struct cfs_rq {
 * this group.
 */
unsigned long h_load;
+   unsigned long h_load_avg_contrib;
u64 last_h_load_update;
struct sched_entity *h_load_next;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
--


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-01 Thread Mike Galbraith
On Wed, 2015-07-01 at 16:55 +0200, Rabin Vincent wrote:

> So, we will not hit the "if (env->src_rq->load.weight <=
> env->dst_rq->load.weight + d_load)" condition to break out of the loop until 
> we
> actualy move all tasks.  So the patch will not have any effect on this case.
> Or am I missing something?

Probably not.  I did have it breaking if dst_rq would pass
src_rq->nr_running, which would certainly stop it, but thought I'd try to
let it watch weights.

Either way, task_h_load(p) returning 0 is not very wonderful.

-Mike

--


Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-07-01 Thread Rabin Vincent
On Wed, Jul 01, 2015 at 07:36:35AM +0200, Mike Galbraith wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5897,7 +5897,7 @@ static int detach_tasks(struct lb_env *e
>  {
>   struct list_head *tasks = &env->src_rq->cfs_tasks;
>   struct task_struct *p;
> - unsigned long load;
> + unsigned long load, d_load = 0, s_load = env->src_rq->load.weight;
>   int detached = 0;
>  
>   lockdep_assert_held(&env->src_rq->lock);
> @@ -5936,6 +5936,11 @@ static int detach_tasks(struct lb_env *e
>  
>   detached++;
>   env->imbalance -= load;
> + if (!load) {
> + load = min_t(unsigned long, env->imbalance, 
> p->se.load.weight);
> + trace_printk("%s:%d is non-contributor - count as 
> %ld\n", p->comm, p->pid, load);
> + }
> + d_load += load;
>  
>  #ifdef CONFIG_PREEMPT
>   /*
> @@ -5954,6 +5959,18 @@ static int detach_tasks(struct lb_env *e
>   if (env->imbalance <= 0)
>   break;
>  
> + /*
> +  * We don't want to bleed busiest_rq dry either.  Weighted load
> +  * and/or imbalance may be dinky, load contribution can even be
> +  * zero, perhaps causing us to over balance had we not assigned
> +  * it above.
> +  */
> + if (env->src_rq->load.weight <= env->dst_rq->load.weight + 
> d_load) {
> + trace_printk("OINK - imbal: %ld  load: %ld  run: %d  
> det: %d  sload_was: %ld sload_is: %ld  dload: %ld\n",
> + env->imbalance, load, env->src_rq->nr_running, 
> detached, s_load, env->src_rq->load.weight, env->dst_rq->load.weight+d_load);
> + break;
> + }
> +
>   continue;
>  next:
> list_move_tail(&p->se.group_node, tasks);
> 

I've tried to analyse how your patch would affect the situation in one
of the crash dumps which I have of the problem.

In this dump, cpu0 is in the middle of dequeuing all tasks from cpu1.
rcu_sched has already been detached and there are two tasks left, one of them
which is being processed by dequeue_entity_load_avg() called from
dequeue_task() at the time the watchdog hits.  lb_env is the following.
imbalance is, as you can see, 60.

 crash> struct lb_env 8054fd50
 struct lb_env {
   sd = 0x8fc13e00, 
   src_rq = 0x81297200, 
   src_cpu = 1, 
   dst_cpu = 0, 
   dst_rq = 0x8128e200, 
   dst_grpmask = 0x0, 
   new_dst_cpu = 0, 
   idle = CPU_NEWLY_IDLE, 
   imbalance = 60, 
   cpus = 0x8128d238, 
   flags = 0, 
   loop = 2, 
   loop_break = 32, 
   loop_max = 3, 
   fbq_type = all, 
   tasks = {
 next = 0x8fc4c6ec, 
 prev = 0x8fc4c6ec
   }
 }

Weights of the runqueues:

 crash> struct rq.load.weight runqueues:0,1
 [0]: 8128e200
   load.weight = 0,
 [1]: 81297200
   load.weight = 1935,

The only running tasks on the system are these three:

 crash> foreach RU ps
PID    PPID  CPU   TASK      ST  %MEM   VSZ   RSS  COMM
 > 0  0   0  8056d8b0  RU   0.0   0  0  [swapper/0]
 > 0  0   1  8fc5cd18  RU   0.0   0  0  [swapper/1]
 > 0  0   2  8fc5c6b0  RU   0.0   0  0  [swapper/2]
 > 0  0   3  8fc5c048  RU   0.0   0  0  [swapper/3]
   7  2   0  8fc4c690  RU   0.0   0  0  [rcu_sched]
  30  2   1  8fd26108  RU   0.0   0  0  [kswapd0]
 413  1   1  8edda408  RU   0.61900416  rngd

And the load.weight and load_avg_contribs for them and their parent SEs:

 crash> foreach 7 30 413 load
 PID: 7  TASK: 8fc4c690  CPU: 0   COMMAND: "rcu_sched"
  task_h_load():   325 [ = (load_avg_contrib {5} * cfs_rq->h_load {   65}) 
/ (cfs_rq->runnable_load_avg {0} + 1) ]
  SE: 8fc4c6d8 load_avg_contrib: 5 load.weight:  1024 PARENT:  
GROUPNAME: (null)
 
 PID: 30 TASK: 8fd26108  CPU: 1   COMMAND: "kswapd0"
  task_h_load():10 [ = (load_avg_contrib {   10} * cfs_rq->h_load {  133}) 
/ (cfs_rq->runnable_load_avg {  128} + 1) ]
  SE: 8fd26150 load_avg_contrib:10 load.weight:  1024 PARENT:  
GROUPNAME: (null)
 
 PID: 413TASK: 8edda408  CPU: 1   COMMAND: "rngd"
  task_h_load(): 0 [ = (load_avg_contrib {0} * cfs_rq->h_load {0}) 
/ (cfs_rq->runnable_load_avg {0} + 1) ]
  SE: 8edda450 load_avg_contrib: 0 load.weight:  1024 PARENT: 8fffbd00 
GROUPNAME: (null)
  SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 
GROUPNAME: rngd@hwrng.service
  SE: 8f531f80 load_avg_contrib: 0 load.weight:  1024 PARENT: 8f456e00 
GROUPNAME: system-rngd.slice
  SE: 8f456e00 load_avg_contrib:   118 load.weight:   911 PARENT:  
GROUPNAME: system.slice
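
Before looking at the patch, here is a toy simulation of the plain
detach_tasks() accounting with these numbers (illustrative userspace C, not
kernel code; the h_load values are the task_h_load() results above):

  #include <stdio.h>

  int main(void)
  {
          long imbalance = 60;            /* from the lb_env above */
          long h_load[] = { 10, 0 };      /* kswapd0, rngd; rcu_sched already detached */
          int nr_running = 2, i;

          for (i = 0; i < 2 && imbalance > 0; i++) {
                  imbalance -= h_load[i]; /* rngd subtracts nothing */
                  nr_running--;
                  printf("detached %d: imbalance %ld, src nr_running %d\n",
                         i, imbalance, nr_running);
          }
          return 0;
  }

Since task_h_load(rngd) is 0, the imbalance never reaches 0 and the source rq
is emptied.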

Given the above, we can see that with your patch:

 - dst_rq->load.weight is 0 and will not change in this loop.

 - src_rq->load.weight was 1935 + 1024 before the loop.  It will go down
   to 1935 (already has), 1024, and then 0.
 
 - d_load will be 325*, 335, and then 395.

(* - probably not exactly since rcu_sched has already 



Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-06-30 Thread Mike Galbraith
On Tue, 2015-06-30 at 16:30 +0200, Rabin Vincent wrote:
> Hi,
> 
> We're seeing a livelock where two CPUs both loop with interrupts
> disabled in pick_next_task_fair() / idle_balance() and continuously
> fetch all tasks from each other...

Hm.  Does the below help?  Looks to me like we are over-balancing.

homer:/sys/kernel/debug/tracing # echo > trace;massive_intr 12 10 > 
/dev/null;cat trace|grep OINK | wc -l;tail trace
480
massive_intr-8967  [003] d...   404.285682: load_balance: OINK - imbal: 1  
load: 83  run: 1  det: 1  sload_was: 180 sload_is: 83  dload: 83
massive_intr-8960  [001] d.s.   404.293331: load_balance: OINK - imbal: 1  
load: 84  run: 1  det: 1  sload_was: 180 sload_is: 99  dload: 174
massive_intr-8962  [002] d.s.   404.317572: load_balance: OINK - imbal: 1  
load: 83  run: 1  det: 1  sload_was: 180 sload_is: 83  dload: 161
massive_intr-8967  [003] d...   404.318296: load_balance: OINK - imbal: 1  
load: 83  run: 1  det: 1  sload_was: 180 sload_is: 83  dload: 83
massive_intr-8960  [005] d...   404.341049: load_balance: OINK - imbal: 1  
load: 84  run: 1  det: 1  sload_was: 180 sload_is: 83  dload: 84
massive_intr-8962  [002] d.s.   404.381549: load_balance: OINK - imbal: 1  
load: 83  run: 1  det: 1  sload_was: 180 sload_is: 83  dload: 161
massive_intr-8971  [005] d...   404.417148: load_balance: OINK - imbal: 1  
load: 84  run: 1  det: 1  sload_was: 180 sload_is: 83  dload: 84
massive_intr-8964  [006] d.s.   404.418536: load_balance: OINK - imbal: 1  
load: 72  run: 1  det: 1  sload_was: 144 sload_is: 83  dload: 149
massive_intr-8968  [003] d...   404.437861: load_balance: OINK - imbal: 1  
load: 83  run: 1  det: 1  sload_was: 180 sload_is: 83  dload: 83
massive_intr-8970  [001] d.s.   404.485263: load_balance: OINK - imbal: 1  
load: 71  run: 1  det: 1  sload_was: 154 sload_is: 83  dload: 148
homer:/sys/kernel/debug/tracing # echo > trace;hackbench > /dev/null;cat 
trace|grep OINK | wc -l;tail trace
51
   hackbench-9079  [006] d...   425.982722: load_balance: OINK - imbal: 54  
load: 42  run: 4  det: 4  sload_was: 130 sload_is: 62  dload: 109
   hackbench-9231  [002] d...   425.982974: load_balance: OINK - imbal: 14  
load: 23  run: 1  det: 1  sload_was: 30 sload_is: 15  dload: 23
   hackbench-9328  [006] d...   425.983037: load_balance: OINK - imbal: 16  
load: 72  run: 2  det: 1  sload_was: 44 sload_is: 32  dload: 72
   hackbench-9197  [002] d...   425.984416: load_balance: OINK - imbal: 62  
load: 21  run: 3  det: 8  sload_was: 232 sload_is: 78  dload: 119
   hackbench-9196  [004] d...   425.984507: load_balance: OINK - imbal: 45  
load: 43  run: 1  det: 1  sload_was: 44 sload_is: 22  dload: 43
   hackbench-9201  [004] d...   425.984648: load_balance: OINK - imbal: 15  
load: 44  run: 1  det: 2  sload_was: 71 sload_is: 25  dload: 73
   hackbench-9235  [002] d...   425.984789: load_balance: OINK - imbal: 5  
load: 32  run: 2  det: 1  sload_was: 65 sload_is: 42  dload: 54
   hackbench-9327  [000] d...   425.985424: load_balance: OINK - imbal: 1  
load: 95  run: 1  det: 1  sload_was: 49 sload_is: 25  dload: 95
   hackbench-9193  [003] d...   425.988701: load_balance: OINK - imbal: 22  
load: 94  run: 1  det: 1  sload_was: 128 sload_is: 66  dload: 94
   hackbench-9197  [003] d...   425.988712: load_balance: OINK - imbal: 56  
load: 92  run: 1  det: 1  sload_was: 118 sload_is: 66  dload: 92
homer:/sys/kernel/debug/tracing # tail trace
kwin-4654  [002] d...   460.627311: load_balance: kglobalaccel:4597 
is non-contributor - count as 37
 konsole-4707  [005] d...   460.627639: load_balance: kded4:4594 is 
non-contributor - count as 40
 konsole-4707  [005] d...   460.627640: load_balance: 
kactivitymanage:4611 is non-contributor - count as 40
kwin-4654  [005] d...   460.627712: load_balance: 
kactivitymanage:4611 is non-contributor - count as 41
 kactivitymanage-4611  [005] d...   460.627726: load_balance: kded4:4594 is 
non-contributor - count as 40
kwin-4654  [005] d...   460.828628: load_balance: OINK - imbal: 3  
load: 618  run: 1  det: 1  sload_was: 1024 sload_is: 0  dload: 618
  plasma-desktop-4665  [000] d...   461.886746: load_balance: baloo_file:4666 
is non-contributor - count as 141
 systemd-journal-397   [001] d...   466.209790: load_balance: pulseaudio:4718 
is non-contributor - count as 5
 systemd-1 [007] d...   466.209868: load_balance: kmix:4704 is 
non-contributor - count as 13
 gnome-keyring-d-9455  [002] d...   466.209902: load_balance: krunner:4702 is 
non-contributor - count as 8

---
 kernel/sched/fair.c |   19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5897,7 +5897,7 @@ static int detach_tasks(struct lb_env *e
 {
struct list_head *tasks = &env->src_rq->cfs_tasks;
struct task_struct *p;
-   unsigned long load;
+   unsigned long 

[PATCH?] Livelock in pick_next_task_fair() / idle_balance()

2015-06-30 Thread Rabin Vincent
Hi,

We're seeing a livelock where two CPUs both loop with interrupts
disabled in pick_next_task_fair() / idle_balance() and continuously
fetch all tasks from each other.  This has been observed on a 3.18
kernel but the relevant code paths appear to be the same in current
mainline.

The problem situation appears to be this:

cpu0                                    cpu1


pick_next_task
 take rq0->lock
 pick_next_task_fair
  running=0
  idle_balance()
   drop rq0->lock

                                        wakeup A,B on cpu0

                                        pick_next_task
                                         take rq1->lock
                                         pick_next_task_fair
                                          running=0
                                          idle_balance()
                                           drop rq1->lock
                                           load_balance()
                                           busiest=rq0
                                           take rq0->lock
                                           detach A,B
                                           drop rq0->lock
                                           take rq1->lock
                                           attach A,B
                                           pulled_task = 2
                                           drop rq1->lock

   load_balance()
   busiest=rq1
   take rq1->lock
   detach A,B
   drop rq1->lock
   take rq0->lock
   attach A,B
   pulled_task = 2
   drop rq0->lock

                                          running=0
                                          idle_balance()
                                          busiest=rq0, pull A,B, etc.

  running=0
  load_balance()
  busiest=rq1, pull A,B, etc.

And this goes on, with interrupts disabled on both CPUs, for at least a
100 ms (which is when a hardware watchdog hits).

The conditions needed, apart from the right timing, are:

 - cgroups. One of the tasks, say A, needs to be in a CPU cgroup.  When
   the problem occurs, A's ->se has zero load_avg_contrib and
   task_h_load(A) is zero.  However, the se->parent->parent of A has a
   (relatively) high load_avg_contrib.  cpu0's cfs_rq has therefore a
   relatively high runnable_load_avg.  find_busiest_group() therefore
   detects imbalance, and detach_tasks() detaches all tasks.

 - PREEMPT=n.  Otherwise, the code under #ifdef in detach_tasks() would
   ensure that we'd only ever pull a maximum of one task during idle
   balancing.

 - cpu0 and cpu1 are threads on the same core (cpus_share_cache() ==
   true).  otherwise, cpu1 will not be able to wakeup tasks on cpu0
   while cpu0 has interrupts disabled (since an IPI would be required).
   Turning off the default TTWU_QUEUE feature would also provide the
   same effect.

I see two simple ways to prevent the livelock.  One is to just to remove
the #ifdef:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..74c94dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5937,7 +5937,6 @@ static int detach_tasks(struct lb_env *env)
detached++;
env->imbalance -= load;
 
-#ifdef CONFIG_PREEMPT
/*
 * NEWIDLE balancing is a source of latency, so preemptible
 * kernels will stop after the first task is detached to minimize
@@ -5945,7 +5944,6 @@ static int detach_tasks(struct lb_env *env)
 */
if (env->idle == CPU_NEWLY_IDLE)
break;
-#endif
 
/*
 * We only want to steal up to the prescribed amount of

Or this should work too:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..13358cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7487,6 +7487,8 @@ static int idle_balance(struct rq *this_rq)
 */
if (this_rq->cfs.h_nr_running && !pulled_task)
pulled_task = 1;
+   else if (!this_rq->cfs.h_nr_running && pulled_task)
+   pulled_task = 0;
 
 out:
/* Move the next balance forward */

But it seems that the real problem is that detach_tasks() can remove all
tasks, when cgroups are involved as described above?

Thanks.

/Rabin
--

