Re: [PATCH 03/14] sched: pack small tasks

2013-04-26 Thread Vincent Guittot
On 26 April 2013 14:38, Peter Zijlstra  wrote:
> On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:
>
>> +static bool is_buddy_busy(int cpu)
>> +{
>> + struct rq *rq = cpu_rq(cpu);
>> + u32 sum = rq->avg.runnable_avg_sum;
>> + u32 period = rq->avg.runnable_avg_period;
>> +
>> + /*
>> +  * If a CPU accesses the runnable_avg_sum and runnable_avg_period
>> +  * fields of its buddy CPU while the latter updates it, it can get the
>> +  * new version of a field and the old version of the other one. This
>> +  * can generate erroneous decisions. We don't want to use a lock
>> +  * mechanism for ensuring the coherency because of the overhead in
>> +  * this critical path.
>> +  * The runnable_avg_period of a runqueue tends to the max value in
>> +  * less than 345ms after plugging a CPU, which implies that we could
>> +  * use the max value instead of reading runnable_avg_period after
>> +  * 345ms. During the starting phase, we must ensure a minimum of
>> +  * coherency between the fields. A simple rule is runnable_avg_sum <=
>> +  * runnable_avg_period.
>> +  */
>> + sum = min(sum, period);
>> +
>> + /*
>> +  * A busy buddy is a CPU with a high load or a small load with a lot of
>> +  * running tasks.
>> +  */
>> + return (sum > (period / (rq->nr_running + 2)));
>> +}
>
>
> I'm still not sold on the entire nr_running thing and the comment doesn't say
> why you're doing it.

The main goal is really to minimize the scheduling latency of tasks.
If the CPU already has dozens of tasks in its runqueue, the scheduling
latency can become quite large. In that case it's worth using another
core.
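
To illustrate, the utilization above which the buddy is reported busy
shrinks as nr_running grows. Below is a standalone sketch of the check
with illustrative numbers (plain C, not the kernel code itself; 1024
stands in for a fully elapsed tracking period):

	#include <stdio.h>
	#include <stdbool.h>

	/* Same shape as is_buddy_busy(): busy on a high load, or on a
	 * small load shared by many running tasks.
	 */
	static bool buddy_busy(unsigned int sum, unsigned int period,
			       unsigned int nr_running)
	{
		return sum > period / (nr_running + 2);
	}

	int main(void)
	{
		unsigned int period = 1024;	/* fully elapsed window */
		unsigned int sum = 307;		/* ~30% runnable */
		unsigned int nr;

		for (nr = 0; nr <= 4; nr++)
			printf("nr_running=%u threshold=%u%% busy=%d\n", nr,
			       100 / (nr + 2), buddy_busy(sum, period, nr));
		return 0;
	}

With a ~30% load the buddy turns busy once two tasks are already
running: the threshold falls from 50% at nr_running=0 to 25% at
nr_running=2, which is what bounds the queue depth and hence the
latency.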

>
> I'm fairly sure there's software out there that wakes a gazillion threads at a
> time only for a gazillion-1 to go back to sleep immediately. Patterns like
> that completely defeat your heuristic.
>
> Does that matter... I don't know.
>
> What happens if you keep this thing simple and only put a cap on utilization
> (say 80%) and drop the entire nr_running magic? Have you seen it make an
> actual difference or did it just seem like a good (TM) thing to do?

It was an efficient way.

>
>> +static bool is_light_task(struct task_struct *p)
>> +{
>> + /* A light task runs less than 20% in average */
>> + return ((p->se.avg.runnable_avg_sum  * 5) <
>> + (p->se.avg.runnable_avg_period));
>> +}
>
> There are superfluous () and ' ' in there. Also why 20%.. seemed like a good
> number? Do we want a SCHED_DEBUG sysctl for that?

I have chosen 20% because it ensures that the packing mechanism will
only apply to tasks that match the 2 following conditions:
- they ran for less than 10 ms during their last single run
- and they run less than 20% of the time on average

These tasks are almost never caught by the periodic load balance, and
they are so small that the overall scheduling latency is barely
impacted as long as the CPU is not too busy.
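
As a rough model (ignoring PELT's geometric decay), a periodic task
that runs r ms out of every p ms has runnable_avg_sum/runnable_avg_period
tending towards r/p, so the sum * 5 < period test approximates
"runnable less than 20% of the time". A minimal standalone sketch,
with illustrative numbers rather than real rq statistics:

	#include <stdio.h>
	#include <stdbool.h>
	#include <stdint.h>

	/* Mirror of the patch's 20% test; plain C, not kernel code. */
	static bool light(uint32_t sum, uint32_t period)
	{
		return sum * 5 < period;
	}

	int main(void)
	{
		/* 2 ms of work every 20 ms -> ~10% runnable -> light */
		printf("2/20: %s\n", light(2, 20) ? "light" : "not light");
		/* 6 ms of work every 20 ms -> ~30% runnable -> not light */
		printf("6/20: %s\n", light(6, 20) ? "light" : "not light");
		return 0;
	}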


Re: [PATCH 03/14] sched: pack small tasks

2013-04-26 Thread Vincent Guittot
On 26 April 2013 14:30, Peter Zijlstra  wrote:
> On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:
>> During the creation of sched_domain, we define a pack buddy CPU for each CPU
>> when one is available. We want to pack at all levels where a group of CPUs
>> can be power gated independently from others.
>> On a system that can't power gate a group of CPUs independently, the flag is
>> set at all sched_domain levels and the buddy is set to -1. This is the
>> default behavior.
>>
>> On a dual-cluster / dual-core system which can power gate each core and
>> cluster independently, the buddy configuration will be:
>>
>>       | Cluster 0   | Cluster 1   |
>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>
>> If the cores in a cluster can't be power gated independently, the buddy
>> configuration becomes:
>>
>>       | Cluster 0   | Cluster 1   |
>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU1 | CPU0 | CPU0 |
>>
>> Small tasks tend to slip out of the periodic load balance, so the best place
>> to migrate them is during their wake-up. The decision is in O(1) as we only
>> check against one buddy CPU.
>
>
> So I really don't get the point of this buddy stuff, even for the light-load,
> non-performance-impact stuff you want to do.
>
> The moment you judge cpu0 busy you'll bail, even though it's perfectly doable
> (and desirable afaict) to continue stacking light tasks on cpu1 instead of
> waking up cpu2/3.
>
> So what's wrong with keeping a single light-wake target cpu selection and
> updating it appropriately?

I have tried to follow the same kind of task migration as during load
balance: one CPU in a power-domain group migrates tasks with other
groups.

>
> Also where/how does the nohz balance cpu criteria not match the light-wake
> target criteria?

The nohz balance CPU is an idle CPU, but that doesn't mean it's the
target CPU, which will sometimes be busy with the light tasks.


Re: [PATCH 03/14] sched: pack small tasks

2013-04-26 Thread Peter Zijlstra
On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:

> +static bool is_buddy_busy(int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> + u32 sum = rq->avg.runnable_avg_sum;
> + u32 period = rq->avg.runnable_avg_period;
> +
> + /*
> +  * If a CPU accesses the runnable_avg_sum and runnable_avg_period
> +  * fields of its buddy CPU while the latter updates it, it can get the
> +  * new version of a field and the old version of the other one. This
> +  * can generate erroneous decisions. We don't want to use a lock
> +  * mechanism for ensuring the coherency because of the overhead in
> +  * this critical path.
> +  * The runnable_avg_period of a runqueue tends to the max value in
> +  * less than 345ms after plugging a CPU, which implies that we could
> +  * use the max value instead of reading runnable_avg_period after
> +  * 345ms. During the starting phase, we must ensure a minimum of
> +  * coherency between the fields. A simple rule is runnable_avg_sum <=
> +  * runnable_avg_period.
> +  */
> + sum = min(sum, period);
> +
> + /*
> +  * A busy buddy is a CPU with a high load or a small load with a lot of
> +  * running tasks.
> +  */
> + return (sum > (period / (rq->nr_running + 2)));
> +}


I'm still not sold on the entire nr_running thing and the comment doesn't say
why you're doing it.

I'm fairly sure there's software out there that wakes a gazillion threads at a
time only for a gazillion-1 to go back to sleep immediately. Patterns like that
completely defeat your heuristic.

Does that matter... I don't know.

What happens if you keep this thing simple and only put a cap on utilization
(say 80%) and drop the entire nr_running magic? Have you seen it make an actual
difference or did it just seem like a good (TM) thing to do?
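
(For reference, the simpler check being suggested would look roughly
like the sketch below, using the same rq fields as the patch; the 80%
figure is the example above, not a tuned value:)

	static bool is_buddy_busy(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		u32 sum = min(rq->avg.runnable_avg_sum,
			      rq->avg.runnable_avg_period);

		/* Busy above a fixed 80% utilization cap. */
		return sum * 5 > rq->avg.runnable_avg_period * 4;
	}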

> +static bool is_light_task(struct task_struct *p)
> +{
> + /* A light task runs less than 20% in average */
> + return ((p->se.avg.runnable_avg_sum  * 5) <
> + (p->se.avg.runnable_avg_period));
> +}

There are superfluous () and ' ' in there. Also why 20%.. seemed like a good
number? Do we want a SCHED_DEBUG sysctl for that?
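
(If the 20% threshold were made tunable under SCHED_DEBUG, the knob
could look roughly like the following; the sysctl name and variable
are illustrative, not part of the series:)

	/* kernel/sched/fair.c */
	int sysctl_sched_pack_task_load_pct = 20;

	static bool is_light_task(struct task_struct *p)
	{
		return p->se.avg.runnable_avg_sum * 100 <
			p->se.avg.runnable_avg_period *
			sysctl_sched_pack_task_load_pct;
	}

	/* kernel/sysctl.c, in the CONFIG_SCHED_DEBUG section */
	{
		.procname	= "sched_pack_task_load_pct",
		.data		= &sysctl_sched_pack_task_load_pct,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,
		.extra2		= &one_hundred,
	},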


Re: [PATCH 03/14] sched: pack small tasks

2013-04-26 Thread Peter Zijlstra
On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:
> During the creation of sched_domain, we define a pack buddy CPU for each CPU
> when one is available. We want to pack at all levels where a group of CPUs
> can be power gated independently from others.
> On a system that can't power gate a group of CPUs independently, the flag is
> set at all sched_domain levels and the buddy is set to -1. This is the
> default behavior.
>
> On a dual-cluster / dual-core system which can power gate each core and
> cluster independently, the buddy configuration will be:
>
>       | Cluster 0   | Cluster 1   |
>       | CPU0 | CPU1 | CPU2 | CPU3 |
> -----------------------------------
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> If the cores in a cluster can't be power gated independently, the buddy
> configuration becomes:
>
>       | Cluster 0   | Cluster 1   |
>       | CPU0 | CPU1 | CPU2 | CPU3 |
> -----------------------------------
> buddy | CPU0 | CPU1 | CPU0 | CPU0 |
>
> Small tasks tend to slip out of the periodic load balance, so the best place
> to migrate them is during their wake-up. The decision is in O(1) as we only
> check against one buddy CPU.


So I really don't get the point of this buddy stuff, even for the light-load,
non-performance-impact stuff you want to do.

The moment you judge cpu0 busy you'll bail, even though it's perfectly doable
(and desirable afaict) to continue stacking light tasks on cpu1 instead of
waking up cpu2/3.

So what's wrong with keeping a single light-wake target cpu selection and
updating it appropriately?

Also where/how does the nohz balance cpu criteria not match the light-wake
target criteria?


[PATCH 03/14] sched: pack small tasks

2013-04-25 Thread Vincent Guittot
During the creation of sched_domain, we define a pack buddy CPU for each CPU
when one is available. We want to pack at all levels where a group of CPUs can
be power gated independently from others.
On a system that can't power gate a group of CPUs independently, the flag is
set at all sched_domain levels and the buddy is set to -1. This is the default
behavior.

On a dual-cluster / dual-core system which can power gate each core and
cluster independently, the buddy configuration will be:

      | Cluster 0   | Cluster 1   |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

If the cores in a cluster can't be power gated independently, the buddy
configuration becomes:

      | Cluster 0   | Cluster 1   |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU1 | CPU0 | CPU0 |

Small tasks tend to slip out of the periodic load balance, so the best place
to migrate them is during their wake-up. The decision is in O(1) as we only
check against one buddy CPU.

Signed-off-by: Vincent Guittot 
Reviewed-by: Morten Rasmussen 
---
 kernel/sched/core.c  |1 +
 kernel/sched/fair.c  |  132 ++
 kernel/sched/sched.h |5 ++
 3 files changed, 138 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b9861f..c5ef170 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5664,6 +5664,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
 
+   update_packing_domain(cpu);
update_top_cache_domain(cpu);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..6adc57c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -160,6 +160,76 @@ void sched_init_granularity(void)
update_sysctl();
 }
 
+
+#ifdef CONFIG_SMP
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/*
+ * Look for the best buddy CPU that can be used to pack small tasks.
+ * We make the assumption that it is not worth packing on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core, based on the assumption that their power efficiency is
+ * better.
+ */
+void update_packing_domain(int cpu)
+{
+   struct sched_domain *sd;
+   int id = -1;
+
+   sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
+   if (!sd)
+   sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+   else
+   sd = sd->parent;
+
+   while (sd && (sd->flags & SD_LOAD_BALANCE)
+   && !(sd->flags & SD_SHARE_POWERDOMAIN)) {
+   struct sched_group *sg = sd->groups;
+   struct sched_group *pack = sg;
+   struct sched_group *tmp;
+
+   /*
+* The sched_domain of a CPU points to the local sched_group,
+* and this CPU of this local group is a good candidate.
+*/
+   id = cpu;
+
+   /* loop the sched groups to find the best one */
+   for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
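+   /*
+    * Compare power/weight ratios by cross-multiplying, which
+    * avoids a division: tmp beats pack when tmp->sgp->power /
+    * tmp->group_weight is lower.
+    */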
+   if (tmp->sgp->power * pack->group_weight >
+   pack->sgp->power * tmp->group_weight)
+   continue;
+
+   if ((tmp->sgp->power * pack->group_weight ==
+   pack->sgp->power * tmp->group_weight)
+&& (cpumask_first(sched_group_cpus(tmp)) >= id))
+   continue;
+
+   /* we have found a better group */
+   pack = tmp;
+
+   /* Take the 1st CPU of the new group */
+   id = cpumask_first(sched_group_cpus(pack));
+   }
+
+   /* Look for a CPU other than itself */
+   if (id != cpu)
+   break;
+
+   sd = sd->parent;
+   }
+
+   pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+   per_cpu(sd_pack_buddy, cpu) = id;
+}
+
+#endif /* CONFIG_SMP */
+
 #if BITS_PER_LONG == 32
 # define WMULT_CONST   (~0UL)
 #else
@@ -3291,6 +3361,64 @@ done:
return target;
 }
 
+static bool is_buddy_busy(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+   u32 sum = rq->avg.runnable_avg_sum;
+   u32 period = rq->avg.runnable_avg_period;
+
+   /*
+    * If a CPU accesses the runnable_avg_sum and runnable_avg_period
+    * fields of its buddy CPU while the latter updates it, it can get the
+    * new version of a field and the old version of the other one. This
+    * can generate erroneous decisions. We don't want to use a lock
+    * mechanism for ensuring the coherency because of the overhead in
+    * this critical path.
+    * The runnable_avg_period of a runqueue tends to the max value in
+    * less than 345ms after plugging a CPU, which implies that we could
+    * use the max value instead of reading runnable_avg_period after
+    * 345ms. During the starting phase, we must ensure a minimum of
+    * coherency between the fields. A simple rule is runnable_avg_sum <=
+    * runnable_avg_period.
+    */
+   sum = min(sum, period);
+
+   /*
+    * A busy buddy is a CPU with a high load or a small load with a lot of
+    * running tasks.
+    */
+   return (sum > (period / (rq->nr_running + 2)));
+}
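
(The remainder of the patch is truncated in this archive. For
orientation, the wake-up side that consults the buddy looks roughly
like the sketch below; the function name and exact conditions are
illustrative, reconstructed from the helpers above, not the patch's
verbatim code:)

	static bool try_pack_buddy(struct task_struct *p, int cpu, int *target)
	{
		int buddy = per_cpu(sd_pack_buddy, cpu);

		/* No buddy, or we already are the pack CPU */
		if (buddy == -1 || buddy == cpu)
			return false;

		/* Only pack small tasks, and only while the buddy has room */
		if (!is_light_task(p) || is_buddy_busy(buddy))
			return false;

		if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
			return false;

		*target = buddy;
		return true;
	}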
