Re: weakness of runnable load tracking?

2012-12-08 Thread Alex Shi
On 12/06/2012 11:10 PM, Alex Shi wrote:
> 
>>> Hi Paul & Ingo:
>>>
>>> In short: burst forking/waking tasks have no time to accumulate any load
>>> contribution, so their runnable load is taken as zero. That makes
>>> select_task_rq reach a wrong decision about which group is idlest.
>>
>> So these aren't strictly comparable; bursting and forking tasks have
>> fairly different characteristics here.
> 
> Many thanks for looking into this. :)
>>
>> When we fork a task we intentionally reset the previous history.  This
>> means that a forked task that immediately runs is going to show up as
>> 100% runnable and then converge to its true value.  This was fairly
>> intentionally chosen so that tasks would "start" fast rather than
>> having to worry about ramp up.
> 
> I am sorry, but I didn't see the 100% runnable value for a new forked task.
> I believe the code needs the following patch to initialize decay_count
> and load_avg_contrib; otherwise they hold random values.
> In enqueue_entity_load_avg(), p->se.avg.runnable_avg_sum for a new forked
> task is always zero, either because se.avg.last_runnable_update is set
> to clock_task due to decay_count <= 0, or because we only do
> __synchronize_entity_decay(), not update_entity_load_avg().

Paul:
Would you like to give some comments for the following patches?

> 
> ===
> From a161000dbece6e95bf3b81e9246d51784589d393 Mon Sep 17 00:00:00 2001
> From: Alex Shi 
> Date: Mon, 3 Dec 2012 17:30:39 +0800
> Subject: [PATCH 05/12] sched: load tracking bug fix
> 
> We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
> after a new task is forked.
> Otherwise the random values of these variables give incorrect statistics
> when the new task is enqueued:
> enqueue_task_fair
> enqueue_entity
> enqueue_entity_load_avg
> 
> Signed-off-by: Alex Shi 
> ---
>  kernel/sched/core.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5dae0d2..e6533e1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1534,6 +1534,8 @@ static void __sched_fork(struct task_struct *p)
>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>   p->se.avg.runnable_avg_period = 0;
>   p->se.avg.runnable_avg_sum = 0;
> + p->se.avg.decay_count = 0;
> + p->se.avg.load_avg_contrib = 0;
>  #endif
>  #ifdef CONFIG_SCHEDSTATS
>   memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> 
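To see why this matters: in the kernel, a child task_struct begins life as a byte copy of its parent (via dup_task_struct()), so any sched_avg field that __sched_fork() does not reset inherits the parent's stale value. A minimal user-space analogy of the copy-then-reset pattern the patch completes (all names below are illustrative, not kernel code):

```c
/* Simplified stand-ins for the kernel's task_struct / sched_avg fields. */
struct sched_avg_sketch {
	long decay_count;
	unsigned long load_avg_contrib;
	unsigned long runnable_avg_sum;
};

struct task_sketch {
	struct sched_avg_sketch avg;
};

/*
 * Mimics dup_task_struct() + __sched_fork(): the child starts as a copy
 * of the parent, then selected fields are zeroed.  With do_reset == 0
 * (the situation before the patch) the child keeps the parent's stale
 * decay_count and load_avg_contrib.
 */
static struct task_sketch fork_task(const struct task_sketch *parent,
				    int do_reset)
{
	struct task_sketch child = *parent;	/* child inherits everything */

	child.avg.runnable_avg_sum = 0;		/* already reset before the patch */
	if (do_reset) {				/* the two lines the patch adds */
		child.avg.decay_count = 0;
		child.avg.load_avg_contrib = 0;
	}
	return child;
}
```

A child forked from a parent whose decay_count happens to be negative would, without the reset, take the decay_count <= 0 path in enqueue_entity_load_avg() described above.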


-- 
Thanks
Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: weakness of runnable load tracking?

2012-12-06 Thread Alex Shi
> 
> The treatment of a burst wake-up however is a little more interesting.
>  There are two reasonable trains of thought one can follow, the first
> is that:
> - If it IS truly bursty you don't really want it factoring into long
> term averages since steady state is not going to include that task;
> hence a low average is ok.  Anything that's more frequent than this is
> going to show up by definition of being within the periods.
> - The other is that if it truly is idle for _enormous_ amounts of time
> we want to give some cognizance to the fact that it might be more
> bursty when it wakes up.
> 
> It is my intuition that the greatest carnage here is actually caused
> by wake-up load-balancing getting in the way of periodic in
> establishing a steady state.  That these entities happen to not be
> runnable very often is just a red herring; they don't contribute
> enough load average to matter in the periodic case.  Increasing their
> load isn't going to really help this -- stronger, you don't want them
> affecting the steady state.  I suspect more mileage would result from
> reducing the interference wake-up load-balancing has with steady
> state.
> 
> e.g. One thing you can think about is considering tasks moved by
> wake-up load balance as "stolen", and allow periodic load-balance to
> re-settle things as if select_idle_sibling had not ignored it :-)

Considering that the periodic load balance should spread tasks well, we
can assume the burst-waking tasks were spread widely among all cpus
before they slept. That will relieve the burst-waking imbalance.

And given the uncertain utilization of waking tasks, I guess we can
forget the extra treatment for burst waking: if the woken tasks keep
running a long time, the periodic load balance will handle them again.
> 
>>
>> There are still 3 kinds of solutions that could help with this issue.
>>
>> a, set a nonzero minimum value for long-time-sleeping tasks. But it
>> seems unfair to other tasks that just slept a short while.
>>
> 
> I think this is reasonable to do regardless; we set such a cap in the
> cgroup case already.  Although you still obviously want this
> threshold to be fairly low.  I suspect this is a weak improvement.

Agreed.
> 
>> b, just use the runnable load contrib in load balance, while still using
>> nr_running to judge the idlest group in select_task_rq_fair. But that may
>> cause a few more migrations in future load balance.
> 
> I don't think this is a good approach.  The whole point of using
> blocked load is so that you can converge on a steady state where you
> don't NEED to move tasks.  What disrupts this is we naturally prefer
> idle cpus on wake-up balance to reduce wake-up latency.  As above, I
> think the better answer is making these two processes more
> co-operative.

Sure, the instantaneous load at fork/exec/wake will interfere with the
later periodic load balance.
> 
>>
>> c, consider both runnable load and nr_running in the group: if, in the
>> searched domain, nr_running increased by a certain number, like double
>> the domain span, within a certain time, we consider that a burst of
>> forking/waking has happened, and then just use nr_running as the idlest
>> group criterion.
> 
> This feels like a bit of a hack.  I suspect this is more binary:
> 
> If there's already something running on all the cpus then we should
> let the periodic load balancer do placement taking averages into
> account.
> 
> Otherwise, we're in wake-idle and we throw the cat in the bathwater.

As Mike said, a simple time window is never enough to capture a 'burst'.
So it could well turn out to be a poor optimization in some cases. :)
> 
>>
>> IMHO, I like the 3rd one a bit more. As for the time window for judging
>> whether a burst happened: since we calculate the runnable average at every
>> tick, if nr_running increases beyond sd->span_weight within 2 ticks, that
>> means a burst is happening. What's your opinion of this?
>>
> 
> What are you defining as the "right" behavior for a group of tasks
> waking up that want to use only a short burst?
> 
> This seems to suggest you think spreading them is the best answer?
> What's the motivation for that?  Also: What does your topology look
> like that's preventing select_idle_sibling from pushing tasks (and
> then new-idle subsequently continuing to pull)?

Actually, I have no real-world case of burst waking causing imbalance
among cpus.

Considering the points above, I will give up optimizing for this. It's
not too late to reconsider if some case really triggers it.
> 
> Certainly putting a lower bound on a task's weight would help new-idle
> find the right cpu to pull these from.
> 
>> Any comments are appreciated!
>>
>> Regards!
>> Alex
>>>
>>>
>>



Re: weakness of runnable load tracking?

2012-12-06 Thread Alex Shi
On 12/06/2012 05:12 PM, Mike Galbraith wrote:
> On Thu, 2012-12-06 at 16:06 +0800, Alex Shi wrote: 

 Hi Paul & Ingo:

 In short: burst forking/waking tasks have no time to accumulate any load
 contribution, so their runnable load is taken as zero. That makes
 select_task_rq reach a wrong decision about which group is idlest.
>>>
>>> As you pointed out above, new tasks can (and imho should) be born with
>>> full weight.  Tasks _may_ become thin, but they're all born hungry.
>>
>> Thanks for comments. I think so. :)
>>>
 There are still 3 kinds of solutions that could help with this issue.

 a, set a nonzero minimum value for long-time-sleeping tasks. But it
 seems unfair to other tasks that just slept a short while.

 b, just use the runnable load contrib in load balance, while still using
 nr_running to judge the idlest group in select_task_rq_fair. But that
 may cause a few more migrations in future load balance.

 c, consider both runnable load and nr_running in the group: if, in the
 searched domain, nr_running increased by a certain number, like double
 the domain span, within a certain time, we consider that a burst of
 forking/waking has happened, and then just use nr_running as the idlest
 group criterion.

 IMHO, I like the 3rd one a bit more. As for the time window for judging
 whether a burst happened: since we calculate the runnable average at
 every tick, if nr_running increases beyond sd->span_weight within 2
 ticks, that means a burst is happening. What's your opinion of this?

 Any comments are appreciated!
>>>
>>> IMHO, for fork and bursty wake balancing, the only thing meaningful is
>>> the here and now state of runqueues tasks are being dumped into.
>>>
>>> Just because tasks are historically short running, you don't necessarily
>>> want to take a gaggle and wedge them into a too small group just to even
>>> out load averages.  If there was a hole available that you passed up by
>>> using average load, you lose utilization.  I can see how this load
>>> tracking stuff can average out to a win on a ~heavily loaded box, but
>>> bursty stuff I don't see how it can do anything but harm, so imho, the
>>> user should choose which is best for his box, instantaneous or history.
>>
>> Do you mean the system administrator needs to make this choice?
> 
> That's my gut feeling just from pondering potential pitfalls.
> 
>> It may be a hard decision.  :)
> 
> Yup, very hard.
> 
>> Any suggestions for a decision basis?
> 
> Same as most buttons.. poke it and  see what happens :) 

:D
> 
>>> WRT burst detection: any window you define can be longer than the burst.
>>
>> Maybe we can define two wakeups on the same cpu within 1 tick as a burst,
>> and if a cpu has already taken a waking task, we'd better skip that cpu. :)
>> Anyway, the hard point is that we cannot predict the future.
> 
> No matter what the metric, you'll be reacting after the fact.
> 
> Somebody needs to code up that darn omniscience algorithm.  In a pinch,
> a simple undo the past will suffice :)

Yes. I see.
> 
> -Mike
> 


-- 
Thanks
Alex


Re: weakness of runnable load tracking?

2012-12-06 Thread Alex Shi

>> Hi Paul & Ingo:
>>
>> In short: burst forking/waking tasks have no time to accumulate any load
>> contribution, so their runnable load is taken as zero. That makes
>> select_task_rq reach a wrong decision about which group is idlest.
> 
> So these aren't strictly comparable; bursting and forking tasks have
> fairly different characteristics here.

Many thanks for looking into this. :)
> 
> When we fork a task we intentionally reset the previous history.  This
> means that a forked task that immediately runs is going to show up as
100% runnable and then converge to its true value.  This was fairly
> intentionally chosen so that tasks would "start" fast rather than
> having to worry about ramp up.

I am sorry, but I didn't see the 100% runnable value for a new forked task.
I believe the code needs the following patch to initialize decay_count
and load_avg_contrib; otherwise they hold random values.
In enqueue_entity_load_avg(), p->se.avg.runnable_avg_sum for a new forked
task is always zero, either because se.avg.last_runnable_update is set
to clock_task due to decay_count <= 0, or because we only do
__synchronize_entity_decay(), not update_entity_load_avg().

===
>From a161000dbece6e95bf3b81e9246d51784589d393 Mon Sep 17 00:00:00 2001
From: Alex Shi 
Date: Mon, 3 Dec 2012 17:30:39 +0800
Subject: [PATCH 05/12] sched: load tracking bug fix

We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
after a new task is forked.
Otherwise the random values of these variables give incorrect statistics
when the new task is enqueued:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

Signed-off-by: Alex Shi 
---
 kernel/sched/core.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..e6533e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1534,6 +1534,8 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
+   p->se.avg.decay_count = 0;
+   p->se.avg.load_avg_contrib = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
-- 
1.7.5.4


Yes, the runnable load will quickly become 100% after se->on_rq is set
and one tick has passed.
But if we fork many tasks within one tick, I didn't find a useful chance
to update the load average for the new tasks.
So I guess we need the following patch:

==
>From 6de8083606b6e0b55e201eadbb945c60fe7631b6 Mon Sep 17 00:00:00 2001
From: Alex Shi 
Date: Thu, 6 Dec 2012 22:48:24 +0800
Subject: [PATCH 12/12] sched: set initial load avg of new forked task as its
 load weight

A new task has no runnable sum when it first becomes runnable, which
makes burst forking select only a few idle cpus to place tasks on.
Set the initial load average of a newly forked task to its load weight
to resolve this issue.

Signed-off-by: Alex Shi 
---
 include/linux/sched.h |1 +
 kernel/sched/core.c   |2 +-
 kernel/sched/fair.c   |   18 --
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e483ccb..12063fa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1043,6 +1043,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING 0
 #endif
+#define ENQUEUE_NEWTASK8
 
 #define DEQUEUE_SLEEP  1
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 980904d..6a4d225 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1643,7 +1643,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
rq = __task_rq_lock(p);
-   activate_task(rq, p, 0);
+   activate_task(rq, p, ENQUEUE_NEWTASK);
p->on_rq = 1;
trace_sched_wakeup_new(p, true);
check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d6eb91..f15b0ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1289,8 +1289,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
  struct sched_entity *se,
- int wakeup)
+ int flags)
 {
+   int wakeup = flags & ENQUEUE_WAKEUP;
/*
 * We track migrations using entity decay_count <= 0, on a wake-up
 * migration we use a negative decay count to track the remote decays
@@ -1324,6 +1325,13 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
update_entity_load_avg(se, 0);
}
 
+   /*
+* set the initial load avg of a new task the same as its load
+* in order to avoid burst forks making a few cpus too heavy
+*/
+   if (flags & 

Re: weakness of runnable load tracking?

2012-12-06 Thread Paul Turner
On Wed, Dec 5, 2012 at 7:13 PM, Alex Shi  wrote:
> On 12/05/2012 11:19 PM, Alex Shi wrote:
>> Hi Paul:
>>
>> The runnable load tracking patch set introduces a good way of tracking
>> each entity/rq's running time.
>> But when I tried to enable it in load balance, I found that burst forking
>> of many new tasks makes just a few CPUs heavy while other CPUs have few
>> tasks assigned. That is because a newly forked task's load_avg_contrib
>> is zero just after creation; then no matter how many tasks are assigned
>> to a CPU, they cannot increase cfs_rq->runnable_load_avg or
>> rq->avg.load_avg_contrib if this cpu is idle.
>> Actually, if it were just a new-task issue, we could set a new task's
>> initial load_avg to its load_weight. But if we burst wake up many
>> long-sleeping tasks, we have the same issue, since their load has
>> decayed to zero. The only solution I could think of is recording the
>> se's load_avg_contrib just before dequeue and not decaying the value;
>> when the task is woken up, add this value to the new cfs_rq. But then
>> the runnable load tracking is totally meaningless.
>> So do you have any ideas for burst-wakeup balancing with runnable load
>> tracking?
>
> Hi Paul & Ingo:
>
> In short: burst forking/waking tasks have no time to accumulate any load
> contribution, so their runnable load is taken as zero. That makes
> select_task_rq reach a wrong decision about which group is idlest.

So these aren't strictly comparable; bursting and forking tasks have
fairly different characteristics here.

When we fork a task we intentionally reset the previous history.  This
means that a forked task that immediately runs is going to show up as
100% runnable and then converge to its true value.  This was fairly
intentionally chosen so that tasks would "start" fast rather than
having to worry about ramp up.

The treatment of a burst wake-up however is a little more interesting.
 There are two reasonable trains of thought one can follow, the first
is that:
- If it IS truly bursty you don't really want it factoring into long
term averages since steady state is not going to include that task;
hence a low average is ok.  Anything that's more frequent than this is
going to show up by definition of being within the periods.
- The other is that if it truly is idle for _enormous_ amounts of time
we want to give some cognizance to the fact that it might be more
bursty when it wakes up.

It is my intuition that the greatest carnage here is actually caused
by wake-up load-balancing getting in the way of periodic in
establishing a steady state.  That these entities happen to not be
runnable very often is just a red herring; they don't contribute
enough load average to matter in the periodic case.  Increasing their
load isn't going to really help this -- stronger, you don't want them
affecting the steady state.  I suspect more mileage would result from
reducing the interference wake-up load-balancing has with steady
state.

e.g. One thing you can think about is considering tasks moved by
wake-up load balance as "stolen", and allow periodic load-balance to
re-settle things as if select_idle_sibling had not ignored it :-)

>
> There are still 3 kinds of solutions that could help with this issue.
>
> a, set a nonzero minimum value for long-time-sleeping tasks. But it
> seems unfair to other tasks that just slept a short while.
>

I think this is reasonable to do regardless; we set such a cap in the
cgroup case already.  Although you still obviously want this
threshold to be fairly low.  I suspect this is a weak improvement.

> b, just use the runnable load contrib in load balance, while still using
> nr_running to judge the idlest group in select_task_rq_fair. But that may
> cause a few more migrations in future load balance.

I don't think this is a good approach.  The whole point of using
blocked load is so that you can converge on a steady state where you
don't NEED to move tasks.  What disrupts this is we naturally prefer
idle cpus on wake-up balance to reduce wake-up latency.  As above, I
think the better answer is making these two processes more
co-operative.

>
> c, consider both runnable load and nr_running in the group: if, in the
> searched domain, nr_running increased by a certain number, like double
> the domain span, within a certain time, we consider that a burst of
> forking/waking has happened, and then just use nr_running as the idlest
> group criterion.

This feels like a bit of a hack.  I suspect this is more binary:

If there's already something running on all the cpus then we should
let the periodic load balancer do placement taking averages into
account.

Otherwise, we're in wake-idle and we throw the cat in the bathwater.

>
> IMHO, I like the 3rd one a bit more. As for the time window for judging
> whether a burst happened: since we calculate the runnable average at every
> tick, if nr_running increases beyond sd->span_weight within 2 ticks, that
> means a burst is happening. What's your opinion of this?
>

What are you defining as the 

Re: weakness of runnable load tracking?

2012-12-06 Thread Mike Galbraith
On Thu, 2012-12-06 at 16:06 +0800, Alex Shi wrote: 
> >>
> >> Hi Paul & Ingo:
> >>
> >> In short: burst forking/waking tasks have no time to accumulate any load
> >> contribution, so their runnable load is taken as zero. That makes
> >> select_task_rq reach a wrong decision about which group is idlest.
> > 
> > As you pointed out above, new tasks can (and imho should) be born with
> > full weight.  Tasks _may_ become thin, but they're all born hungry.
> 
> Thanks for comments. I think so. :)
> > 
> >> There are still 3 kinds of solutions that could help with this issue.
> >>
> >> a, set a nonzero minimum value for long-time-sleeping tasks. But it
> >> seems unfair to other tasks that just slept a short while.
> >>
> >> b, just use the runnable load contrib in load balance, while still
> >> using nr_running to judge the idlest group in select_task_rq_fair.
> >> But that may cause a few more migrations in future load balance.
> >>
> >> c, consider both runnable load and nr_running in the group: if, in
> >> the searched domain, nr_running increased by a certain number, like
> >> double the domain span, within a certain time, we consider that a
> >> burst of forking/waking has happened, and then just use nr_running
> >> as the idlest group criterion.
> >>
> >> IMHO, I like the 3rd one a bit more. As for the time window for
> >> judging whether a burst happened: since we calculate the runnable
> >> average at every tick, if nr_running increases beyond sd->span_weight
> >> within 2 ticks, that means a burst is happening. What's your opinion
> >> of this?
> >>
> >> Any comments are appreciated!
> > 
> > IMHO, for fork and bursty wake balancing, the only thing meaningful is
> > the here and now state of runqueues tasks are being dumped into.
> > 
> > Just because tasks are historically short running, you don't necessarily
> > want to take a gaggle and wedge them into a too small group just to even
> > out load averages.  If there was a hole available that you passed up by
> > using average load, you lose utilization.  I can see how this load
> > tracking stuff can average out to a win on a ~heavily loaded box, but
> > bursty stuff I don't see how it can do anything but harm, so imho, the
> > user should choose which is best for his box, instantaneous or history.
> 
> Do you mean the system administrator needs to make this choice?

That's my gut feeling just from pondering potential pitfalls.

> It may be a hard decision.  :)

Yup, very hard.

> Any suggestions for a decision basis?

Same as most buttons.. poke it and  see what happens :) 

> > WRT burst detection: any window you define can be longer than the burst.
> 
> Maybe we can define two wakeups on the same cpu within 1 tick as a burst,
> and if a cpu has already taken a waking task, we'd better skip that cpu. :)
> Anyway, the hard point is that we cannot predict the future.

No matter what the metric, you'll be reacting after the fact.

Somebody needs to code up that darn omniscience algorithm.  In a pinch,
a simple undo the past will suffice :)

-Mike



Re: weakness of runnable load tracking?

2012-12-06 Thread Alex Shi
On 12/06/2012 02:52 PM, Preeti U Murthy wrote:
> Hi Alex,
>> Hi Paul & Ingo:
>>
>> In short: burst forking/waking tasks have no time to accumulate any load
>> contribution, so their runnable load is taken as zero.
> 
> On performing certain experiments on the way PJT's metric calculates the
> load, I observed a few things. Based on these observations, let me see if
> I can address the issue of why PJT's metric is calculating the load of
> bursty tasks as 0.
> 
> When we speak about a burst-waking task (I will not go into forking
> here), we should also speak about its duty cycle: it burst-wakes for 1ms
> of a 10ms duty cycle, or for 1s out of a 10s duty cycle, both being 10%
> tasks wrt their duty cycles. Let us see how load is calculated by PJT's
> metric in each of the above cases.
>--
>   |  |
>   |  |
> __|  |
>   A  B
>   1ms
>   <->
>   10ms
> <>
>   Example 1
> 
> When the task wakes up at A, it is not yet runnable, and an update of the
> task load takes place. Its runtime so far is 0, and its existence time is
> 10ms. Hence the load is 0/10 * 1024. Since a scheduler tick happens at B
> (a scheduler tick happens every 1ms, 10ms or 4ms; let us assume 1ms), an
> update of the load takes place. PJT's metric divides the time elapsed
> into 1ms windows. There is just one 1ms window, and hence the runtime is
> 1ms and the load is 1ms/10ms * 1024.
> 
> *If the time elapsed between A and B were to be < 1ms, then PJT's metric
> will not capture it*.

A nice description of this issue. :)
> 
> And under these circumstances the load remains 0/10ms * 1024 = 0. This is
> the situation you are pointing out. Let us assume that this cycle
> continues throughout the lifetime of the load; then the load remains at 0.
> The question is whether such tasks, which run for periods < 1ms, are ok
> to be termed 0 workloads. If that is fine, then what PJT's metric is doing
> is right. Maybe we should ignore such workloads because they hardly
> contribute to the load. Otherwise we will need to reduce the window of
> load update to < 1ms to capture such loads.
> 
> 
> Just for some additional info so that we know what happens to different
> kinds of loads with PJT's metric, consider the below situation:
>  --
> |  |
> |  |
> |  |
> A  B
>1s
> <-->
> <--->
>   10s
> <>
>Example 2
> 
> Here at A the task wakes, just like in Example 1, and the load is termed 0.
> Between A and B, if we consider the load to get updated at every
> scheduler tick, then the load slowly increases from 0 to 1024 at B. It is
> 1024 here, although this is also a 10% task, whereas in Example 1 the
> load is 102.4, also a 10% task. So what is fishy?
> 
> In my opinion, PJT's metric gives tasks some time to prove their
> activeness after they wake up. In Example 2 the task has stayed awake too
> long (1s), irrespective of what % of the total run time that is.
> Therefore it calculates the load to be big enough to balance.
> 
> In the example that you have quoted, the tasks may not have run long
> enough to be considered as candidates for load balance.
> 
> So, essentially, what PJT's metric is doing is characterising a task by
> the amount it has run so far.
> 
> 
>> That makes select_task_rq reach a wrong decision about which group is idlest.
>>
>> There are still 3 kinds of solutions that could help with this issue.
>>
>> a, set a nonzero minimum value for long-time-sleeping tasks. But it
>> seems unfair to other tasks that just slept a short while.
>>
>> b, just use the runnable load contrib in load balance, while still using
>> nr_running to judge the idlest group in select_task_rq_fair. But that
>> may cause a few more migrations in future load balance.
>>
>> c, consider both runnable load and nr_running in the group: if, in the
>> searched domain, nr_running increased by a certain number, like double
>> the domain span, within a certain time, we consider that a burst of
>> forking/waking has happened, and then just use nr_running as the idlest
>> group criterion.
>>
>> IMHO, I like the 3rd one a bit more. As for the time window for judging
>> whether a burst happened: since we calculate the runnable average at
>> every tick, if nr_running increases beyond sd->span_weight within 2
>> ticks, that means a burst is happening. What's your opinion of this?
>>
>> Any comments are appreciated!
> 
> 
> So PJT's metric rightly seems to be capturing the load of these bursty
> tasks, but you are right in pointing out that when too many such loads
> queue up on a cpu, this metric will consider the load on the cpu to be
> 0, which might not be such a good idea.
> 
> It is true that we need to bring in nr_running somewhere. Let me now go
> through your suggestions on 
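Preeti's 1ms-window arithmetic above can be sanity-checked with a small user-space simulation of the decayed sums. The sketch below is only an approximation of the scheme she describes, not the kernel's fixed-point code; the decay factor y with y^32 = 1/2 and the 1024 per-window scale are the constants published with the per-entity load-tracking patches, while the function name and structure here are made up for illustration:

```c
#include <math.h>

/*
 * Approximate per-entity load contribution (0..1024) of a periodic task
 * that is runnable for run_ms out of every period_ms, simulated over
 * total_ms one-millisecond windows.  Each fully runnable window adds
 * 1024; older windows are decayed by y per window, with y^32 == 0.5.
 */
static double load_contrib(int run_ms, int period_ms, int total_ms)
{
	const double y = pow(0.5, 1.0 / 32.0);
	double runnable_sum = 0.0, period_sum = 0.0;
	int ms;

	for (ms = 0; ms < total_ms; ms++) {
		int runnable = (ms % period_ms) < run_ms;

		runnable_sum = runnable_sum * y + runnable * 1024.0;
		period_sum = period_sum * y + 1024.0;
	}
	return runnable_sum / period_sum * 1024.0;
}
```

For Example 1 (1ms out of every 10ms) this lands in the neighbourhood of the ~102/1024 (10%) figure, oscillating with where in the duty cycle you sample; a task that never sleeps converges to exactly 1024; and a task whose runnable stretches are shorter than the 1ms window resolution indeed shows up as 0, as described above.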

Re: weakness of runnable load tracking?

2012-12-06 Thread Alex Shi
>>
>> Hi Paul & Ingo:
>>
>> In short: burst forking/waking tasks have no time to accumulate any load
>> contribution, so their runnable load is taken as zero. That makes
>> select_task_rq reach a wrong decision about which group is idlest.
> 
> As you pointed out above, new tasks can (and imho should) be born with
> full weight.  Tasks _may_ become thin, but they're all born hungry.

Thanks for comments. I think so. :)
> 
>> There are still 3 kinds of solutions that could help with this issue.
>>
>> a, set a nonzero minimum value for long-time-sleeping tasks. But it
>> seems unfair to other tasks that just slept a short while.
>>
>> b, just use the runnable load contrib in load balance, while still using
>> nr_running to judge the idlest group in select_task_rq_fair. But that
>> may cause a few more migrations in future load balance.
>>
>> c, consider both runnable load and nr_running in the group: if, in the
>> searched domain, nr_running increased by a certain number, like double
>> the domain span, within a certain time, we consider that a burst of
>> forking/waking has happened, and then just use nr_running as the idlest
>> group criterion.
>>
>> IMHO, I like the 3rd one a bit more. As for the time window for judging
>> whether a burst happened: since we calculate the runnable average at
>> every tick, if nr_running increases beyond sd->span_weight within 2
>> ticks, that means a burst is happening. What's your opinion of this?
>>
>> Any comments are appreciated!
> 
> IMHO, for fork and bursty wake balancing, the only thing meaningful is
> the here and now state of runqueues tasks are being dumped into.
> 
> Just because tasks are historically short running, you don't necessarily
> want to take a gaggle and wedge them into a too small group just to even
> out load averages.  If there was a hole available that you passed up by
> using average load, you lose utilization.  I can see how this load
> tracking stuff can average out to a win on a ~heavily loaded box, but
> bursty stuff I don't see how it can do anything but harm, so imho, the
> user should choose which is best for his box, instantaneous or history.

Do you mean the system administrator needs to make this choice?
That may be a hard decision.  :)
Any suggestions for a decision basis?
> 
> WRT burst detection: any window you define can be longer than the burst.

Maybe we can define 2 wakings on the same cpu in 1 tick as a burst, and
if a cpu has already taken a waking task, we'd better skip that cpu. :)
Anyway, the hard point is that we cannot predict the future.


> 
> $.02
> 
> -Mike
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: weakness of runnable load tracking?

2012-12-06 Thread Alex Shi
On 12/06/2012 02:52 PM, Preeti U Murthy wrote:
 Hi Alex,
 Hi Paul & Ingo:

 In a short word of this issue: burst forking/waking tasks have no time
 accumulate the load contribute, their runnable load are taken as zero.
 
 On performing certain experiments on the way PJT's metric calculates the
 load,I observed a few things.Based on these observations let me see if i
 can address the issue of why PJT's metric is calculating the load of
 bursty tasks as 0.
 
 When we speak about a burst waking task(I will not go into forking
 here),we should also speak about its duty cycle-it burst wakes for 1ms
 for a 10ms duty cycle or burst wakes 9s out of a 10s duty cycle-both
 being 10% tasks wrt their duty cycles.Lets see how load is calculated by
 PJT's metric in each of the above cases.
      ________
     |        |
     |        |
   __|        |__________
     A        B
     <--1ms-->
     <-------10ms------->
          Example 1
 
 When the task wakes up at A,it is not yet runnable,and an update of the
 task load takes place.Its runtime so far is 0,and its existing time is
 10ms.Hence the load is 0/10*1024.Since a scheduler tick happens at B( a
 scheduler tick happens for every 1ms,10ms or 4ms.Let us assume 1ms),an
 update of the load takes place.PJT's metric divides the time elapsed
 into 1ms windows.There is just 1ms window,and hence the runtime is 1ms
 and the load is 1ms/10ms*1024.
 
 *If the time elapsed between A and B were to be < 1ms,then PJT's metric
 will not capture it*.

A nice description that shows this issue. :)
 
 And under these circumstances the load remains 0/10ms*1024=0.This is the
 situation you are pointing out.Let us assume that these cycle continues
 throughout the lifetime of the load,then the load remains at 0.The
 question is if such tasks which run for periods < 1ms is ok to be termed
 as 0 workloads.If it is fine,then what PJT's metric is doing is
 right.Maybe we should ignore such workloads because they hardly
 contribute to the load.Otherwise we will need to reduce the window of
 load update to < 1ms to capture such loads.
 
 
 Just for some additional info so that we know what happens to different
 kinds of loads with PJT's metric,consider the below situation:
   __________
  |          |
  |          |
  |          |
  A          B
  <---1s---->
  <-------------10s------------->
          Example 2
 
 Here at A,the task wakes,just like in Example1 and the load is termed 0.
 In between A and B for every scheduler tick if we consider the load to
 get updated,then the load slowly increases from 0 to 1024 at B.It is
 1024 here,although this is also a 10% task,whereas in Example1 the load
 is 102.4 - a 10% task.So what is fishy?
 
 In my opinion,PJT's metric gives the tasks some time to prove their
 activeness after they wake up.In Example2 the task has stayed awake too
 long-1s; irrespective of what % of the total run time it is.Therefore it
 calculates the load to be big enough to balance.
 
 In the example that you have quoted,the tasks may not have run long
 enough to consider them as candidates for load balance.
 
 So,essentially what PJT's metric is doing is characterising a task by
 the amount it has run so far.
 
 
 that make select_task_rq do a wrong decision on which group is idlest.

 There is still 3 kinds of solution is helpful for this issue.

 a, set a unzero minimum value for the long time sleeping task. but it
 seems unfair for other tasks these just sleep a short while.

 b, just use runnable load contrib in load balance. Still using
 nr_running to judge idlest group in select_task_rq_fair. but that may
 cause a bit more migrations in future load balance.

 c, consider both runnable load and nr_running in the group: like in the
 searching domain, the nr_running number increased a certain number, like
 double of the domain span, in a certain time. we will think it's a burst
 forking/waking happened, then just count the nr_running as the idlest
 group criteria.

 IMHO, I like the 3rd one a bit more. as to the certain time to judge if
 a burst happened, since we will calculate the runnable avg at very tick,
 so if increased nr_running is beyond sd->span_weight in 2 ticks, means
 burst happening. What's your opinion of this?

 Any comments are appreciated!
 
 
 So Pjt's metric rightly seems to be capturing the load of these bursty
 tasks but you are right in pointing out that when too many such loads
 queue up on the cpu,this metric will consider the load on the cpu as
 0,which might not be such a good idea.
 
 It is true that we need to bring in nr_running somewhere.Let me now go
 through your suggestions on where to include nr_running and get back on
 this.I had planned on including nr_running while selecting the busy
 group in update_sd_lb_stats,but 

Re: weakness of runnable load tracking?

2012-12-06 Thread Mike Galbraith
On Thu, 2012-12-06 at 16:06 +0800, Alex Shi wrote: 
 
  Hi Paul & Ingo:
 
  In a short word of this issue: burst forking/waking tasks have no time
  accumulate the load contribute, their runnable load are taken as zero.
  that make select_task_rq do a wrong decision on which group is idlest.
  
  As you pointed out above, new tasks can (and imho should) be born with
  full weight.  Tasks _may_ become thin, but they're all born hungry.
 
 Thanks for comments. I think so. :)
  
  There is still 3 kinds of solution is helpful for this issue.
 
  a, set a unzero minimum value for the long time sleeping task. but it
  seems unfair for other tasks these just sleep a short while.
 
  b, just use runnable load contrib in load balance. Still using
  nr_running to judge idlest group in select_task_rq_fair. but that may
  cause a bit more migrations in future load balance.
 
  c, consider both runnable load and nr_running in the group: like in the
  searching domain, the nr_running number increased a certain number, like
  double of the domain span, in a certain time. we will think it's a burst
  forking/waking happened, then just count the nr_running as the idlest
  group criteria.
 
  IMHO, I like the 3rd one a bit more. as to the certain time to judge if
  a burst happened, since we will calculate the runnable avg at very tick,
  so if increased nr_running is beyond sd->span_weight in 2 ticks, means
  burst happening. What's your opinion of this?
 
  Any comments are appreciated!
  
  IMHO, for fork and bursty wake balancing, the only thing meaningful is
  the here and now state of runqueues tasks are being dumped into.
  
  Just because tasks are historically short running, you don't necessarily
  want to take a gaggle and wedge them into a too small group just to even
  out load averages.  If there was a hole available that you passed up by
  using average load, you lose utilization.  I can see how this load
  tracking stuff can average out to a win on a ~heavily loaded box, but
  bursty stuff I don't see how it can do anything but harm, so imho, the
  user should choose which is best for his box, instantaneous or history.
 
 Do you mean the system administrator need to do this choice?

That's my gut feeling just from pondering potential pitfalls.

 It's may a hard decision.  :)

Yup, very hard.

 Any suggestions of decision basis?

Same as most buttons.. poke it and cringe see what happens :) 

  WRT burst detection: any window you define can be longer than the burst.
 
 Maybe we can define 2 waking on same cpu in 1 tick is a burst happened,
 and if the cpu had taken a waking task. we'd better skip this cpu. :)
 Anyway, the hard point is we can not predict future.

No matter what the metric, you'll be reacting after the fact.

Somebody needs to code up that darn omniscience algorithm.  In a pinch,
a simple undo the past will suffice :)

-Mike



Re: weakness of runnable load tracking?

2012-12-06 Thread Paul Turner
On Wed, Dec 5, 2012 at 7:13 PM, Alex Shi <alex@intel.com> wrote:
 On 12/05/2012 11:19 PM, Alex Shi wrote:
 Hi Paul & Ingo:

 Runnable load tracking patch set introduce a good way to tracking each
 entity/rq's running time.
 But when I try to enable it in load balance, I found burst forking
 many new tasks will make just few cpu heavy while other cpu has no
 much task assigned. That is due to the new forked task's
 load_avg_contrib is zero after just created. then no matter how many
 tasks assigned to a CPU can not increase the cfs_rq->runnable_load_avg
 or rq->avg.load_avg_contrib if this cpu idle.
 Actually, if just for new task issue, we can set new task's initial
 load_avg same as load_weight. but if we want to burst wake up many
 long time sleeping tasks, it has the same issue here since their were
 decayed to zero. So what solution I can thought is recording the se's
 load_avg_contrib just before dequeue, and don't decay the value, when
 it was waken up, add this value to new cfs_rq. but if so, the runnable
 load tracking is total meaningless.
 So do you have some idea of burst wakeup balancing with runnable load 
 tracking?

 Hi Paul & Ingo:

 In a short word of this issue: burst forking/waking tasks have no time
 accumulate the load contribute, their runnable load are taken as zero.
 that make select_task_rq do a wrong decision on which group is idlest.

So these aren't strictly comparable; bursting and forking tasks have
fairly different characteristics here.

When we fork a task we intentionally reset the previous history.  This
means that a forked task that immediately runs is going to show up as
100% runnable and then converge to its true value.  This was fairly
intentionally chosen so that tasks would "start" fast rather than
having to worry about ramp up.

The treatment of a burst wake-up however is a little more interesting.
 There are two reasonable trains of thought one can follow, the first
is that:
- If it IS truly bursty you don't really want it factoring into long
term averages since steady state is not going to include that task;
hence a low average is ok.  Anything that's more frequent than this is
going to show up by definition of being within the periods.
- The other is that if it truly is idle for _enormous_ amounts of time
we want to give some cognizance to the fact that it might be more
bursty when it wakes up.

It is my intuition that the greatest carnage here is actually caused
by wake-up load-balancing getting in the way of periodic in
establishing a steady state.  That these entities happen to not be
runnable very often is just a red herring; they don't contribute
enough load average to matter in the periodic case.  Increasing their
load isn't going to really help this -- stronger, you don't want them
affecting the steady state.  I suspect more mileage would result from
reducing the interference wake-up load-balancing has with steady
state.

e.g. One thing you can think about is considering tasks moved by
wake-up load balance as stolen, and allow periodic load-balance to
re-settle things as if select_idle_sibling had not ignored it :-)


 There is still 3 kinds of solution is helpful for this issue.

 a, set a unzero minimum value for the long time sleeping task. but it
 seems unfair for other tasks these just sleep a short while.


I think this is reasonable to do regardless, we set such a cap in the
cgroup case already.  Although you still obviously want this
threshold to be fairly low.  I suspect this is a weak improvement.

 b, just use runnable load contrib in load balance. Still using
 nr_running to judge idlest group in select_task_rq_fair. but that may
 cause a bit more migrations in future load balance.

I don't think this is a good approach.  The whole point of using
blocked load is so that you can converge on a steady state where you
don't NEED to move tasks.  What disrupts this is we naturally prefer
idle cpus on wake-up balance to reduce wake-up latency.  As above, I
think the better answer is making these two processes more
co-operative.


 c, consider both runnable load and nr_running in the group: like in the
 searching domain, the nr_running number increased a certain number, like
 double of the domain span, in a certain time. we will think it's a burst
 forking/waking happened, then just count the nr_running as the idlest
 group criteria.

This feels like a bit of a hack.  I suspect this is more binary:

If there's already something running on all the cpus then we should
let the periodic load balancer do placement taking averages into
account.

Otherwise, we're in wake-idle and we throw the cat in the bathwater.


 IMHO, I like the 3rd one a bit more. as to the certain time to judge if
 a burst happened, since we will calculate the runnable avg at very tick,
 so if increased nr_running is beyond sd->span_weight in 2 ticks, means
 burst happening. What's your opinion of this?


What are you defining as the right behavior for a group of tasks
waking up that want 

Re: weakness of runnable load tracking?

2012-12-06 Thread Alex Shi

 Hi Paul & Ingo:

 In a short word of this issue: burst forking/waking tasks have no time
 accumulate the load contribute, their runnable load are taken as zero.
 that make select_task_rq do a wrong decision on which group is idlest.
 
 So these aren't strictly comparable; bursting and forking tasks have
 fairly different characteristics here.

Many thanks for looking into this. :)
 
 When we fork a task we intentionally reset the previous history.  This
 means that a forked task that immediately runs is going to show up as
 100% runnable and then converge to it's true value.  This was fairly
 intentionally chosen so that tasks would start fast rather than
 having to worry about ramp up.

I am sorry, but I didn't see the 100% runnable for a new forked task.
I believe the code needs the following patch to initialize decay_count
and load_avg_contrib; otherwise they are random values.
In enqueue_entity_load_avg(), p->se.avg.runnable_avg_sum for a new forked
task is always zero, either because se.avg.last_runnable_update is set
as clock_task due to decay_count <= 0, or we just do
__synchronize_entity_decay, not update_entity_load_avg.

===
From a161000dbece6e95bf3b81e9246d51784589d393 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex@intel.com>
Date: Mon, 3 Dec 2012 17:30:39 +0800
Subject: [PATCH 05/12] sched: load tracking bug fix

We need initialize the se.avg.{decay_count, load_avg_contrib} to zero
after a new task forked.
Otherwise random values of above variable give a incorrect statistic
data when do new task enqueue:
enqueue_task_fair
enqueue_entity
enqueue_entity_load_avg

Signed-off-by: Alex Shi <alex@intel.com>
---
 kernel/sched/core.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..e6533e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1534,6 +1534,8 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.decay_count = 0;
+	p->se.avg.load_avg_contrib = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
-- 
1.7.5.4


Yes, the runnable load will quickly become 100% after se->on_rq is set
and one tick has passed.
But if we are forking many tasks within one tick, I didn't find a useful
chance to update the load avg for the new tasks.
So I guess we need the following patch:

==
From 6de8083606b6e0b55e201eadbb945c60fe7631b6 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex@intel.com>
Date: Thu, 6 Dec 2012 22:48:24 +0800
Subject: [PATCH 12/12] sched: set initial load avg of new forked task as its
 load weight

New task has no runnable sum at its first runnable time, that make
burst forking just select few idle cpus to put tasks.
Set initial load avg of new forked task as its load weight to resolve
this issue.

Signed-off-by: Alex Shi <alex@intel.com>
---
 include/linux/sched.h |1 +
 kernel/sched/core.c   |2 +-
 kernel/sched/fair.c   |   18 --
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e483ccb..12063fa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1043,6 +1043,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING 0
 #endif
 #define ENQUEUE_NEWTASK	8
 
 #define DEQUEUE_SLEEP  1
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 980904d..6a4d225 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1643,7 +1643,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
rq = __task_rq_lock(p);
-   activate_task(rq, p, 0);
+   activate_task(rq, p, ENQUEUE_NEWTASK);
 	p->on_rq = 1;
trace_sched_wakeup_new(p, true);
check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d6eb91..f15b0ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1289,8 +1289,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, 
int runnable)
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
  struct sched_entity *se,
- int wakeup)
+ int flags)
 {
+	int wakeup = flags & ENQUEUE_WAKEUP;
/*
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 * migration we use a negative decay count to track the remote decays
@@ -1324,6 +1325,13 @@ static inline void enqueue_entity_load_avg(struct cfs_rq 
*cfs_rq,
update_entity_load_avg(se, 0);
}
 
+   /*
+* set the initial load avg of new task same as its load
+	 * in order to avoid burst fork make few cpu too heavier
+ 

Re: weakness of runnable load tracking?

2012-12-06 Thread Alex Shi
On 12/06/2012 05:12 PM, Mike Galbraith wrote:
 On Thu, 2012-12-06 at 16:06 +0800, Alex Shi wrote: 

 Hi Paul & Ingo:

 In a short word of this issue: burst forking/waking tasks have no time
 accumulate the load contribute, their runnable load are taken as zero.
 that make select_task_rq do a wrong decision on which group is idlest.

 As you pointed out above, new tasks can (and imho should) be born with
 full weight.  Tasks _may_ become thin, but they're all born hungry.

 Thanks for comments. I think so. :)

 There is still 3 kinds of solution is helpful for this issue.

 a, set a unzero minimum value for the long time sleeping task. but it
 seems unfair for other tasks these just sleep a short while.

 b, just use runnable load contrib in load balance. Still using
 nr_running to judge idlest group in select_task_rq_fair. but that may
 cause a bit more migrations in future load balance.

 c, consider both runnable load and nr_running in the group: like in the
 searching domain, the nr_running number increased a certain number, like
 double of the domain span, in a certain time. we will think it's a burst
 forking/waking happened, then just count the nr_running as the idlest
 group criteria.

 IMHO, I like the 3rd one a bit more. as to the certain time to judge if
 a burst happened, since we will calculate the runnable avg at very tick,
 so if increased nr_running is beyond sd->span_weight in 2 ticks, means
 burst happening. What's your opinion of this?

 Any comments are appreciated!

 IMHO, for fork and bursty wake balancing, the only thing meaningful is
 the here and now state of runqueues tasks are being dumped into.

 Just because tasks are historically short running, you don't necessarily
 want to take a gaggle and wedge them into a too small group just to even
 out load averages.  If there was a hole available that you passed up by
 using average load, you lose utilization.  I can see how this load
 tracking stuff can average out to a win on a ~heavily loaded box, but
 bursty stuff I don't see how it can do anything but harm, so imho, the
 user should choose which is best for his box, instantaneous or history.

 Do you mean the system administrator need to do this choice?
 
 That's my gut feeling just from pondering potential pitfalls.
 
 It's may a hard decision.  :)
 
 Yup, very hard.
 
 Any suggestions of decision basis?
 
 Same as most buttons.. poke it and cringe see what happens :) 

:D
 
 WRT burst detection: any window you define can be longer than the burst.

 Maybe we can define 2 waking on same cpu in 1 tick is a burst happened,
 and if the cpu had taken a waking task. we'd better skip this cpu. :)
 Anyway, the hard point is we can not predict future.
 
 No matter what the metric, you'll be reacting after the fact.
 
 Somebody needs to code up that darn omniscience algorithm.  In a pinch,
 a simple undo the past will suffice :)

Yes. I see.
 
 -Mike
 


-- 
Thanks
Alex


Re: weakness of runnable load tracking?

2012-12-06 Thread Alex Shi
 
 The treatment of a burst wake-up however is a little more interesting.
  There are two reasonable trains of thought one can follow, the first
 is that:
 - If it IS truly bursty you don't really want it factoring into long
 term averages since steady state is not going to include that task;
 hence a low average is ok.  Anything that's more frequent then this is
 going to show up by definition of being within the periods.
 - The other is that if it truly is idle for _enormous_ amounts of time
 we want to give some cognizance to the fact that it might be more
 bursty when it wakes up.
 
 It is my intuition that the greatest carnage here is actually caused
 by wake-up load-balancing getting in the way of periodic in
 establishing a steady state.  That these entities happen to not be
 runnable very often is just a red herring; they don't contribute
 enough load average to matter in the periodic case.  Increasing their
 load isn't going to really help this -- stronger, you don't want them
 affecting the steady state.  I suspect more mileage would result from
 reducing the interference wake-up load-balancing has with steady
 state.
 
 e.g. One thing you can think about is considering tasks moved by
 wake-up load balance as stolen, and allow periodic load-balance to
 re-settle things as if select_idle_sibling had not ignored it :-)

Considering that periodic load balance should spread tasks well, we can
assume the burst-waking tasks were spread widely among all cpus before
they slept. That will relieve the burst-waking imbalance.

Plus, given the uncertain utilization of waking tasks, I guess we can
forget the extra treatment for burst waking. If the waking tasks keep
running for a long time, let periodic LB handle them again.
 

 There is still 3 kinds of solution is helpful for this issue.

 a, set a unzero minimum value for the long time sleeping task. but it
 seems unfair for other tasks these just sleep a short while.

 
 I think this is  reasonable to do regardless, we set such a cap in the
 cgroup case already.  Although you still obviously want this
 threshhold to be fairly low.  I suspect this is a weak improvement.

Agreed.
 
 b, just use runnable load contrib in load balance. Still using
 nr_running to judge idlest group in select_task_rq_fair. but that may
 cause a bit more migrations in future load balance.
 
 I don't think this is a good approach.  The whole point of using
 blocked load is so that you can converge on a steady state where you
 don't NEED to move tasks.  What disrupts this is we naturally prefer
 idle cpus on wake-up balance to reduce wake-up latency.  As above, I
 think the better answer is making these two processes more
 co-operative.

Sure, instantaneous load at fork/exec/wake does interfere with the
later periodic LB.
 

 c, consider both runnable load and nr_running in the group: like in the
 searching domain, the nr_running number increased a certain number, like
 double of the domain span, in a certain time. we will think it's a burst
 forking/waking happened, then just count the nr_running as the idlest
 group criteria.
 
 This feels like a bit of a hack.  I suspect this is more binary:
 
 If there's already something running on all the cpus then we should
 let the periodic load balancer do placement taking averages into
 account.
 
 Otherwise, we're in wake-idle and we throw the cat in the bathwater.

As Mike said, a simple time window is never enough to define a 'burst'.
So it would likely prove to be a misguided optimization in some cases. :)
 

 IMHO, I like the 3rd one a bit more. as to the certain time to judge if
 a burst happened, since we will calculate the runnable avg at very tick,
 so if increased nr_running is beyond sd->span_weight in 2 ticks, means
 burst happening. What's your opinion of this?

 
 What are you defining as the right behavior for a group of tasks
 waking up that want to use only a short burst?
 
 This seems to suggest you think spreading them is the best answer?
 What's the motivation for that?  Also: What does your topology look
 like that's preventing select_idle_sibling from pushing tasks (and
 then new-idle subsequently continuing to pull)?

Actually, I have no real-world case of burst-waking imbalance among cpus.

Considering the above, I will give up on this optimization. It's not
too late to reconsider if some case really triggers it.
 
 Certainly putting a lower bound on a tasks weight would help new-idle
 find the right cpu to pull these from.
 
 Any comments are appreciated!

 Regards!
 Alex






Re: weakness of runnable load tracking?

2012-12-05 Thread Preeti U Murthy
Hi Alex,
> Hi Paul & Ingo:
> 
> In a short word of this issue: burst forking/waking tasks have no time
> accumulate the load contribute, their runnable load are taken as zero.

On performing certain experiments on the way PJT's metric calculates the
load,I observed a few things.Based on these observations let me see if i
can address the issue of why PJT's metric is calculating the load of
bursty tasks as 0.

When we speak about a burst waking task(I will not go into forking
here),we should also speak about its duty cycle-it burst wakes for 1ms
for a 10ms duty cycle or burst wakes 9s out of a 10s duty cycle-both
being 10% tasks wrt their duty cycles.Lets see how load is calculated by
PJT's metric in each of the above cases.
     ________
    |        |
    |        |
  __|        |__________
    A        B
    <--1ms-->
    <-------10ms------->
         Example 1

When the task wakes up at A, it is not yet runnable, and an update of
the task load takes place. Its runtime so far is 0, and its existence
time is 10ms; hence the load is 0/10 * 1024. Since a scheduler tick
happens at B (a scheduler tick fires every 1ms, 4ms or 10ms; let us
assume 1ms), an update of the load takes place. PJT's metric divides the
elapsed time into 1ms windows. There is just one 1ms window, hence the
runtime is 1ms and the load is 1ms/10ms * 1024.

*If the time elapsed between A and B were < 1ms, PJT's metric would not
capture it.*

Under those circumstances the load remains 0/10ms * 1024 = 0. This is
the situation you are pointing out. If this cycle continues throughout
the lifetime of the task, the load remains at 0. The question is whether
tasks which run for periods < 1ms can reasonably be termed zero
workloads. If that is fine, then what PJT's metric is doing is right;
maybe we should ignore such workloads because they hardly contribute to
the load. Otherwise we will need to reduce the load-update window to
< 1ms to capture such loads.


Just for some additional info, so that we know what happens to different
kinds of loads with PJT's metric, consider the situation below:
    _____________
   |             |
   |             |
 __|             |_______________
   A             B
   <------------->
          1s
   <------------------------------>
                 10s
               Example 2

Here at A the task wakes, just like in Example 1, and the load is termed
0. Between A and B, if we consider the load to get updated at every
scheduler tick, the load slowly increases from 0 towards 1024 at B. It
is 1024 here, although this is also a 10% task, whereas in Example 1 the
load of the 10% task is 102.4. So what is fishy?

In my opinion, PJT's metric gives tasks some time to prove their
activeness after they wake up. In Example 2 the task has stayed awake
too long (1s), irrespective of what percentage of the total run time
that is; therefore the metric calculates a load big enough to trigger
balancing.

In the example that you have quoted, the tasks may not have run long
enough to be considered candidates for load balancing.

So, essentially, what PJT's metric is doing is characterising a task by
the amount it has run so far.


> so select_task_rq() makes a wrong decision about which group is idlest.
> 
> There are still 3 kinds of solutions that could help with this issue.
> 
> a, set a nonzero minimum value for long-sleeping tasks; but that seems
> unfair to other tasks that just slept for a short while.
> 
> b, just use the runnable load contribution in load balancing, but still
> use nr_running to judge the idlest group in select_task_rq_fair(); that
> may cause a few more migrations in later load balancing.
> 
> c, consider both runnable load and nr_running in the group: if, in the
> domain being searched, nr_running increased by a certain number (say
> double the domain span) within a certain time, we conclude that a burst
> of forking/waking has happened and use nr_running alone as the
> idlest-group criterion.
> 
> IMHO, I like the 3rd one a bit more. As for the time window used to
> judge whether a burst happened: since we calculate the runnable avg at
> every tick, nr_running increasing by more than sd->span_weight within
> 2 ticks means a burst is happening. What's your opinion of this?
> 
> Any comments are appreciated!


So PJT's metric does seem to be rightly capturing the load of these
bursty tasks, but you are right in pointing out that when too many such
loads queue up on a cpu, this metric will consider the load on that cpu
to be 0, which might not be such a good idea.

It is true that we need to bring in nr_running somewhere. Let me now go
through your suggestions on where to include nr_running and get back on
this. I had planned on including nr_running while selecting the busy
group in update_sd_lb_stats(), but select_task_rq_fair() is yet another
place to do this, that's right. Good that this issue was brought up :)

> Regards!
> Alex


Re: weakness of runnable load tracking?

2012-12-05 Thread Mike Galbraith
On Thu, 2012-12-06 at 11:13 +0800, Alex Shi wrote: 
> On 12/05/2012 11:19 PM, Alex Shi wrote:
> > Hi Paul:
> > 
> > The runnable load tracking patch set introduces a good way of
> > tracking each entity's/rq's running time.
> > But when I tried to enable it in load balancing, I found that
> > burst-forking many new tasks makes just a few cpus heavy while other
> > cpus get hardly any tasks. That is because a newly forked task's
> > load_avg_contrib is zero right after creation; then, no matter how
> > many tasks are assigned to an idle cpu, they cannot increase that
> > cpu's cfs_rq->runnable_load_avg or rq->avg.load_avg_contrib.
> > Actually, if it were only a new-task issue, we could set a new task's
> > initial load_avg equal to its load_weight. But if we burst-wake many
> > long-sleeping tasks we have the same issue, since their loads were
> > decayed to zero. The only solution I could think of is recording the
> > se's load_avg_contrib just before dequeue and not decaying it; when
> > the task is woken up, add this value to the new cfs_rq. But then the
> > runnable load tracking would be largely meaningless.
> > So do you have some idea for burst-wakeup balancing with runnable
> > load tracking?
> 
> Hi Paul & Ingo:
> 
> In short: burst forking/waking tasks have had no time to accumulate a
> load contribution, so their runnable load is taken as zero; that makes
> select_task_rq() take a wrong decision about which group is idlest.

As you pointed out above, new tasks can (and imho should) be born with
full weight.  Tasks _may_ become thin, but they're all born hungry.

> There are still 3 kinds of solutions that could help with this issue.
> 
> a, set a nonzero minimum value for long-sleeping tasks; but that seems
> unfair to other tasks that just slept for a short while.
> 
> b, just use the runnable load contribution in load balancing, but still
> use nr_running to judge the idlest group in select_task_rq_fair(); that
> may cause a few more migrations in later load balancing.
> 
> c, consider both runnable load and nr_running in the group: if, in the
> domain being searched, nr_running increased by a certain number (say
> double the domain span) within a certain time, we conclude that a burst
> of forking/waking has happened and use nr_running alone as the
> idlest-group criterion.
> 
> IMHO, I like the 3rd one a bit more. As for the time window used to
> judge whether a burst happened: since we calculate the runnable avg at
> every tick, nr_running increasing by more than sd->span_weight within
> 2 ticks means a burst is happening. What's your opinion of this?
> 
> Any comments are appreciated!

IMHO, for fork and bursty wake balancing, the only thing meaningful is
the here and now state of runqueues tasks are being dumped into.

Just because tasks are historically short running, you don't necessarily
want to take a gaggle and wedge them into a too small group just to even
out load averages.  If there was a hole available that you passed up by
using average load, you lose utilization.  I can see how this load
tracking stuff can average out to a win on a ~heavily loaded box, but
bursty stuff I don't see how it can do anything but harm, so imho, the
user should choose which is best for his box, instantaneous or history.

WRT burst detection: any window you define can be longer than the burst.

$.02

-Mike



Re: weakness of runnable load tracking?

2012-12-05 Thread Alex Shi
On 12/05/2012 11:19 PM, Alex Shi wrote:
> Hi Paul:
> 
> The runnable load tracking patch set introduces a good way of tracking
> each entity's/rq's running time.
> But when I tried to enable it in load balancing, I found that
> burst-forking many new tasks makes just a few cpus heavy while other
> cpus get hardly any tasks. That is because a newly forked task's
> load_avg_contrib is zero right after creation; then, no matter how many
> tasks are assigned to an idle cpu, they cannot increase that cpu's
> cfs_rq->runnable_load_avg or rq->avg.load_avg_contrib.
> Actually, if it were only a new-task issue, we could set a new task's
> initial load_avg equal to its load_weight. But if we burst-wake many
> long-sleeping tasks we have the same issue, since their loads were
> decayed to zero. The only solution I could think of is recording the
> se's load_avg_contrib just before dequeue and not decaying it; when the
> task is woken up, add this value to the new cfs_rq. But then the
> runnable load tracking would be largely meaningless.
> So do you have some idea for burst-wakeup balancing with runnable load
> tracking?

Hi Paul & Ingo:

In short: burst forking/waking tasks have had no time to accumulate a
load contribution, so their runnable load is taken as zero; that makes
select_task_rq() take a wrong decision about which group is idlest.

There are still 3 kinds of solutions that could help with this issue.

a, set a nonzero minimum value for long-sleeping tasks; but that seems
unfair to other tasks that just slept for a short while.

b, just use the runnable load contribution in load balancing, but still
use nr_running to judge the idlest group in select_task_rq_fair(); that
may cause a few more migrations in later load balancing.

c, consider both runnable load and nr_running in the group: if, in the
domain being searched, nr_running increased by a certain number (say
double the domain span) within a certain time, we conclude that a burst
of forking/waking has happened and use nr_running alone as the
idlest-group criterion.

IMHO, I like the 3rd one a bit more. As for the time window used to
judge whether a burst happened: since we calculate the runnable avg at
every tick, nr_running increasing by more than sd->span_weight within
2 ticks means a burst is happening. What's your opinion of this?

Any comments are appreciated!

Regards!
Alex


