[patch 09/16] sched: normalize tg load contributions against runnable time

2012-08-23 Thread pjt
From: Paul Turner p...@google.com

Entities of equal weight should receive equitable distribution of cpu time.
This is challenging in the case of a task_group's shares as execution may be
occurring on multiple cpus simultaneously.

To handle this we divide up the shares into weights proportionate with the load
on each cfs_rq.  This does not however, account for the fact that the sum of
the parts may be less than one cpu and so we need to normalize:
  load(tg) = min(runnable_avg(tg), 1) * tg->shares
Where runnable_avg is the aggregate time in which the task_group had runnable
children.
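
For illustration, here is a minimal userspace sketch of that fixed-point
scaling (not part of the patch; the constant and helper names are made up,
with one full cpu represented as 1 << 10 in the spirit of NICE_0_SHIFT):

  #include <stdio.h>
  #include <stdint.h>

  #define FP_SHIFT 10                     /* mirrors NICE_0_SHIFT */
  #define FP_ONE   (1ULL << FP_SHIFT)     /* fixed-point "one cpu" */

  /* load(tg) = min(runnable_avg(tg), 1) * tg->shares, in fixed point */
  static uint64_t tg_load(uint64_t runnable_avg, uint64_t shares)
  {
          if (runnable_avg > FP_ONE)
                  runnable_avg = FP_ONE;  /* clamp at one full cpu */
          return (shares * runnable_avg) >> FP_SHIFT;
  }

  int main(void)
  {
          /* a group with default shares (1024), runnable 25% of the time */
          printf("%llu\n", (unsigned long long)tg_load(FP_ONE / 4, 1024));
          return 0;
  }

A group that is runnable a quarter of the time thus contributes a quarter
of its shares (256 here), while a fully loaded group contributes them all.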

Signed-off-by: Paul Turner p...@google.com
Reviewed-by: Ben Segall bseg...@google.com
---
 kernel/sched/debug.c |    4 ++++
 kernel/sched/fair.c  |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    2 ++
 3 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2908923..71b0ea3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -234,6 +234,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
atomic64_read(&cfs_rq->tg->load_avg));
SEQ_printf(m, "  .%-30s: %lld\n", "tg_load_contrib",
cfs_rq->tg_load_contrib);
+   SEQ_printf(m, "  .%-30s: %d\n", "tg_runnable_contrib",
+   cfs_rq->tg_runnable_contrib);
+   SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
+   atomic_read(&cfs_rq->tg->runnable_avg));
 #endif
 
print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 92ef5f1..47a7998 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1112,19 +1112,73 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
}
 }
 
+/*
+ * Aggregate cfs_rq runnable averages into an equivalent task_group
+ * representation for computing load contributions.
+ */
+static inline void __update_tg_runnable_avg(struct sched_avg *sa,
+ struct cfs_rq *cfs_rq)
+{
+   struct task_group *tg = cfs_rq->tg;
+   long contrib;
+
+   /* The fraction of a cpu used by this cfs_rq */
+   contrib = div_u64(sa->runnable_avg_sum << NICE_0_SHIFT,
+ sa->runnable_avg_period + 1);
+   contrib -= cfs_rq->tg_runnable_contrib;
+
+   if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
+   atomic_add(contrib, &tg->runnable_avg);
+   cfs_rq->tg_runnable_contrib += contrib;
+   }
+}
+
 static inline void __update_group_entity_contrib(struct sched_entity *se)
 {
struct cfs_rq *cfs_rq = group_cfs_rq(se);
struct task_group *tg = cfs_rq->tg;
+   int runnable_avg;
+
u64 contrib;
 
contrib = cfs_rq->tg_load_contrib * tg->shares;
se->avg.load_avg_contrib = div64_u64(contrib,
atomic64_read(&tg->load_avg) + 1);
+
+   /*
+* For group entities we need to compute a correction term in the case
+* that they are consuming <1 cpu so that we would contribute the same
+* load as a task of equal weight.
+*
+* Explicitly co-ordinating this measurement would be expensive, but
+* fortunately the sum of each cpus contribution forms a usable
+* lower-bound on the true value.
+*
+* Consider the aggregate of 2 contributions.  Either they are disjoint
+* (and the sum represents the true value) or they overlap and we are
+* understating by the aggregate of their overlap.
+*
+* Extending this to N cpus, for a given overlap, the maximum amount we
+* understate is then n_i(n_i+1)/2 * w_i where n_i is the number of
+* cpus that overlap for this interval and w_i is the interval width.
+*
+* On a small machine the first term is well-bounded, which bounds the
+* total error since w_i is a subset of the period.  On a larger
+* machine this first term can be larger, but if w_i is of
+* consequential size we are guaranteed to see n_i*w_i quickly
+* converge to our upper bound of 1 cpu.
+*/
+   runnable_avg = atomic_read(>runnable_avg);
+   if (runnable_avg < NICE_0_LOAD) {
+   se->avg.load_avg_contrib *= runnable_avg;
+   se->avg.load_avg_contrib >>= NICE_0_SHIFT;
+   }
 }
 #else
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 int force_update) {}
+static inline void __update_tg_runnable_avg(struct sched_avg *sa,
+ struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
 #endif
 
@@ -1146,6 +1200,7 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
if (entity_is_task(se)) {
__update_task_entity_contrib(se);
  

Re: [PATCH 09/16] sched: normalize tg load contributions against runnable time

2012-07-11 Thread Andre Noll
On Fri, Jul 06, 13:52, Peter Zijlstra wrote:
> This then yields:
> 
>   P(\Union_{i=1..n} u_i) ~= \Sum_{k=1..n} (-1)^(k-1) (n choose k) u^k
> 
> Which unfortunately isn't a series I found a sane solution for, but
> numerically (see below) we can see it very quickly approaches 1 when n
> >> 1.

Isn't this series just 1 - (1 - u)^n? So yes, it converges quickly to 1
if u is a probability.
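
Spelling that out (it is just the binomial theorem, nothing
scheduler-specific is assumed):

  (1 - u)^n = \sum_{k=0}^{n} \binom{n}{k} (-u)^k
            = 1 + \sum_{k=1}^{n} (-1)^k \binom{n}{k} u^k

so

  1 - (1 - u)^n = \sum_{k=1}^{n} (-1)^{k-1} \binom{n}{k} u^k,

which matches the series above, and for any fixed 0 < u <= 1 it tends to 1
as n grows (e.g. u = 0.25 gives ~0.68 at n = 4 and ~0.99 at n = 16).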

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe




Re: [PATCH 09/16] sched: normalize tg load contributions against runnable time

2012-07-11 Thread Paul Turner
On Wed, Jul 4, 2012 at 12:48 PM, Peter Zijlstra a.p.zijls...@chello.nl wrote:
>
> On Wed, 2012-06-27 at 19:24 -0700, Paul Turner wrote:
> > Entities of equal weight should receive equitable distribution of cpu time.
> > This is challenging in the case of a task_group's shares as execution may be
> > occurring on multiple cpus simultaneously.
> >
> > To handle this we divide up the shares into weights proportionate with the 
> > load
> > on each cfs_rq.  This does not however, account for the fact that the sum of
> > the parts may be less than one cpu and so we need to normalize:
> >   load(tg) = min(runnable_avg(tg), 1) * tg->shares
> > Where runnable_avg is the aggregate time in which the task_group had 
> > runnable
> > children.
>
> I remember we had a bit of a discussion on this last time, I thought you
> were going to convince me this approximation was 'right'.


I thought I had :-)

Here's what I wrote previously:

"""
So, as an estimate it has the nice property that it's always a lower
bound on the true value.

Consider the two cases for runnable contributions across a pair of cpus:

Either they are disjoint by wall time, in which case they would have
accounted for the same amount were they actually on the same cpu.
Or they overlap, in which case that over-lap period would be an
additional run-delay incurred on one of them.

Then, the maximum amount we understate for a given overlap is n(n+1)/2
* k where k is the width of the overlap and n is the number of
cpus we over-lapped on.

But then, n is bounded by num_cpus -- so on a small machine our error
is bounded (e.g. we can be no more than 1/3 less than true on 2
cpus). On a larger machine this term can be larger, however, if k is
of consequential size we know we accounted at least n*k so we're still
going to approach the ceiling of 1 very quickly at which point it
doesn't matter that we under-estimated. The flip side on a larger
machine is that if k is of inconsequential size then n*2 * k is still
tiny relative to the size of the machine and the accuracy of the
approximation matters less.
"""

I approach it from a slightly different angle, nonetheless, it's
coherent with your analysis in your next reply.

What I forgot to do here was add the comments to this effect.  Will do.
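
A rough numeric sketch of that bound, for illustration only (plain
userspace C rather than kernel code; the overlap width k and the cpu
counts n below are made-up example values):

  #include <stdio.h>

  /*
   * For an overlap of width k (as a fraction of the averaging period)
   * seen on n cpus: the per-cpu sum accounts at least n*k of runnable
   * time, while the worst-case understatement is n*(n+1)/2 * k.  Both
   * the estimate and the true value are capped at one full cpu, so once
   * n*k reaches 1 the understatement stops mattering.
   */
  int main(void)
  {
          double k = 0.05;        /* example overlap: 5% of the period */
          int n;

          for (n = 2; n <= 64; n *= 2) {
                  double accounted = n * k;              /* what we do see */
                  double missed = n * (n + 1) / 2.0 * k; /* worst-case understatement */

                  if (accounted > 1.0)
                          accounted = 1.0;
                  if (missed > 1.0 - accounted)
                          missed = 1.0 - accounted;

                  printf("n=%2d  accounted >= %.2f  missed <= %.2f\n",
                         n, accounted, missed);
          }
          return 0;
  }

With k = 0.05 the estimate pins to the ceiling of 1 by n = 32, at which
point the worst-case understatement has already shrunk to zero.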


>
>
> Care to still do so.. the rationale used should at least live in a
> comment somewhere, otherwise someone will go silly trying to understand
> things later on.
>

