Re: HMP patches v2

2013-01-02 Thread Morten Rasmussen

On 02/01/13 10:29, Vincent Guittot wrote:

On 2 January 2013 06:28, Viresh Kumar viresh.ku...@linaro.org wrote:

On 20 December 2012 13:41, Vincent Guittot vincent.guit...@linaro.org wrote:

On 19 December 2012 11:57, Morten Rasmussen morten.rasmus...@arm.com wrote:

If I understand the new version of sched: secure access to other CPU
statistics correctly, the effect of the patch is:

Without the patch the cpu will appear to be busy if sum/period are not
coherent (sum > period). The same is true with the patch except in the
case where nr_running is 0. In this particular case the cpu will appear
not to be busy. I assume there is good reason why this particular case
is important?


Sorry for this late reply.

It's not really more important than the others, but it's one case we can
safely detect to prevent spurious spreading of tasks.
In addition, the incoherency only occurs if both values are close, so
nr_running == 0 was the only condition left to be tested.



In any case the patch is fine by me.


Hmm... I am still confused :(

We have two patches from ARM, do let me know if I can drop these:


I think you can drop them as they don't apply anymore for V2.
Morten, do you confirm?


Confirmed. I don't see any problems with the v2 patch. The overhead of
the check should be minimal.
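
For reference, the kind of guarded access being discussed could look roughly
like the sketch below. This is only an illustration based on the description in
this thread, not the actual v2 patch, and the rq->avg field names are assumed
from the per-entity load-tracking series:

static inline bool cpu_stats_look_busy(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	u32 sum = rq->avg.runnable_avg_sum;	/* assumed field names */
	u32 period = rq->avg.runnable_avg_period;

	/* No runnable tasks: safe to report the cpu as not busy. */
	if (rq->nr_running == 0)
		return false;

	/* Incoherent snapshot (sum > period): conservatively report busy. */
	if (sum > period)
		return true;

	/* Otherwise fall back to a normal utilization test, e.g. above 50%. */
	return sum > period / 2;
}

The point of the extra nr_running check is only to avoid reporting an idle cpu
as busy when the unsynchronized sum/period pair happens to be incoherent.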

Morten



Vincent



commit 3f1dff11ac95eda2772bef577e368bc124bfe087
Author: Morten Rasmussen morten.rasmus...@arm.com
Date:   Fri Nov 16 18:32:40 2012 +

 ARM: TC2: Re-enable SD_SHARE_POWERLINE

 Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.

  arch/arm/kernel/topology.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

commit e8cceacd3913e3a3e955614bacc1bc81866bc243
Author: Liviu Dudau liviu.du...@arm.com
Date:   Fri Nov 16 18:32:38 2012 +

 Revert "sched: secure access to other CPU statistics"

 This reverts commit 2aa14d0379cc54bc0ec44adb7a2e0ad02ae293d0.

 The way this functionality is implemented is under review and the
 current implementation is considered not safe.

 Signed-off-by: Liviu Dudau liviu.du...@arm.com

  kernel/sched/fair.c | 19 ++-
  1 file changed, 2 insertions(+), 17 deletions(-)







Re: HMP patches v2

2012-12-19 Thread Morten Rasmussen

On 19/12/12 09:34, Viresh Kumar wrote:

On 19 December 2012 14:53, Vincent Guittot vincent.guit...@linaro.org wrote:

Le 19 déc. 2012 07:34, Viresh Kumar viresh.ku...@linaro.org a écrit :

Can we resolve this issue now? I don't want anything during the release
period
this time.


The new version of the patchset should solve the concerns of everybody


Morten,

Can you confirm or cross-check that? Branch is: sched-pack-small-tasks-v2



If I understand the new version of sched: secure access to other CPU
statistics correctly, the effect of the patch is:

Without the patch the cpu will appear to be busy if sum/period are not
coherent (sum > period). The same is true with the patch except in the
case where nr_running is 0. In this particular case the cpu will appear
not to be busy. I assume there is good reason why this particular case
is important?

In any case the patch is fine by me.

Morten



[HMP][PATCH 0/1] Global balance

2012-12-07 Thread Morten Rasmussen
Hi Viresh,

Here is a patch that introduces global load balancing on top of the existing HMP
patch set. It depends on the HMP patches already present in your task-placement-v2
branch. It can be applied on top of the HMP sysfs patches if needed. The fix should
be trivial.

Could you include it in the MP branch for the 12.12 release? Testing with sysbench
and coremark shows significant performance improvements for parallel workloads as
all cpus can now be used for cpu intensive tasks.

Thanks,
Morten

Morten Rasmussen (1):
  sched: Basic global balancing support for HMP

 kernel/sched/fair.c |  101 +--
 1 file changed, 97 insertions(+), 4 deletions(-)

-- 
1.7.9.5





[HMP][PATCH 1/1] sched: Basic global balancing support for HMP

2012-12-07 Thread Morten Rasmussen
This patch introduces an extra check at task up-migration to
prevent overloading the cpus in the faster hmp_domain while the
slower hmp_domain is not fully utilized. The patch also introduces
a periodic balance check that can down-migrate tasks if the faster
domain is oversubscribed and the slower is under-utilized.

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 kernel/sched/fair.c |  101 +--
 1 file changed, 97 insertions(+), 4 deletions(-)
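
(For reference on the thresholds used in the code below, assuming NICE_0_LOAD
is 1024 as on 32-bit ARM: NICE_0_LOAD-64 = 960 corresponds to roughly 94% load,
2*NICE_0_LOAD-64 = 1984 to roughly 194%, and NICE_0_LOAD/2 = 512 to 50%. A
starvation value above 768 means the task already runs for more than 75% of the
time it is runnable, i.e. it waits less than 25% and is not considered
starving.)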

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1cfe112..7ac47c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3249,6 +3249,80 @@ static inline void hmp_next_down_delay(struct sched_entity *se, int cpu)
	se->avg.hmp_last_down_migration = cfs_rq_clock_task(cfs_rq);
	se->avg.hmp_last_up_migration = 0;
 }
+
+static inline unsigned int hmp_domain_min_load(struct hmp_domain *hmpd,
+   int *min_cpu)
+{
+   int cpu;
+   int min_load = INT_MAX;
+   int min_cpu_temp = NR_CPUS;
+
+   for_each_cpu_mask(cpu, hmpd->cpus) {
+   if (cpu_rq(cpu)->cfs.tg_load_contrib < min_load) {
+   min_load = cpu_rq(cpu)->cfs.tg_load_contrib;
+   min_cpu_temp = cpu;
+   }
+   }
+
+   if (min_cpu)
+   *min_cpu = min_cpu_temp;
+
+   return min_load;
+}
+
+/*
+ * Calculate the task starvation
+ * This is the ratio of actually running time vs. runnable time.
+ * If the two are equal the task is getting the cpu time it needs or
+ * it is alone on the cpu and the cpu is fully utilized.
+ */
+static inline unsigned int hmp_task_starvation(struct sched_entity *se)
+{
+   u32 starvation;
+
+   starvation = se->avg.usage_avg_sum * scale_load_down(NICE_0_LOAD);
+   starvation /= (se->avg.runnable_avg_sum + 1);
+
+   return scale_load(starvation);
+}
+
+static inline unsigned int hmp_offload_down(int cpu, struct sched_entity *se)
+{
+   int min_usage;
+   int dest_cpu = NR_CPUS;
+
+   if (hmp_cpu_is_slowest(cpu))
+   return NR_CPUS;
+
+   /* Is the current domain fully loaded? */
+   /* load < ~94% */
+   min_usage = hmp_domain_min_load(hmp_cpu_domain(cpu), NULL);
+   if (min_usage < NICE_0_LOAD-64)
+   return NR_CPUS;
+
+   /* Is the cpu oversubscribed? */
+   /* load < ~194% */
+   if (cpu_rq(cpu)->cfs.tg_load_contrib < 2*NICE_0_LOAD-64)
+   return NR_CPUS;
+
+   /* Is the task alone on the cpu? */
+   if (cpu_rq(cpu)->cfs.nr_running < 2)
+   return NR_CPUS;
+
+   /* Is the task actually starving? */
+   if (hmp_task_starvation(se) > 768) /* < 25% waiting */
+   return NR_CPUS;
+
+   /* Does the slower domain have spare cycles? */
+   min_usage = hmp_domain_min_load(hmp_slower_domain(cpu), &dest_cpu);
+   /* load > 50% */
+   if (min_usage > NICE_0_LOAD/2)
+   return NR_CPUS;
+
+   if (cpumask_test_cpu(dest_cpu, &hmp_slower_domain(cpu)->cpus))
+   return dest_cpu;
+   return NR_CPUS;
+}
 #endif /* CONFIG_SCHED_HMP */
 
 /*
@@ -5643,10 +5717,14 @@ static unsigned int hmp_up_migration(int cpu, struct sched_entity *se)
			hmp_next_up_threshold)
		return 0;

-   if (se->avg.load_avg_ratio > hmp_up_threshold &&
-   cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
-   tsk_cpus_allowed(p))) {
-   return 1;
+   if (se->avg.load_avg_ratio > hmp_up_threshold) {
+   /* Target domain load > ~94% */
+   if (hmp_domain_min_load(hmp_faster_domain(cpu), NULL)
+   > NICE_0_LOAD-64)
+   return 0;
+   if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
+   tsk_cpus_allowed(p)))
+   return 1;
}
return 0;
 }
@@ -5868,6 +5946,21 @@ static void hmp_force_up_migration(int this_cpu)
			hmp_next_up_delay(&p->se, target->push_cpu);
		}
	}
+   if (!force && !target->active_balance) {
+   /*
+    * For now we just check the currently running task.
+    * Selecting the lightest task for offloading will
+    * require extensive book keeping.
+    */
+   target->push_cpu = hmp_offload_down(cpu, curr);
+   if (target->push_cpu < NR_CPUS) {
+   target->active_balance = 1;
+   target->migrate_task = p;
+   force = 1;
+   trace_sched_hmp_migrate(p, target->push_cpu, 2);
+   hmp_next_down_delay(&p->se, target

Re: [HMP][PATCH 0/1] Global balance

2012-12-07 Thread Morten Rasmussen
Hi Amit,

I should have included the numbers in the cover letter. Here are
numbers for TC2.

sysbench (normalized execution time, lower is better)
threads     2     4     8
HMP      1.00  1.00  1.00
HMP+GB   1.00  0.67  0.58

coremark (normalized iterations per second, higher is better)
threads     2     4     8
HMP      1.00  1.00  1.00
HMP+GB   1.00  1.39  1.73

So there is a clear benefit to utilizing the A7s. It actually saves
energy too, as the whole benchmark completes faster.

Regards,
Morten

On Fri, Dec 7, 2012 at 12:14 PM, Amit Kucheria amit.kuche...@linaro.org wrote:

 On Fri, Dec 7, 2012 at 5:33 PM, Morten Rasmussen
 morten.rasmus...@arm.com wrote:
  Hi Viresh,
 
  Here is a patch that introduces global load balancing on top of the 
  existing HMP
  patch set. It depends on the HMP patches already present in your 
  task-placement-v2
  branch. It can be applied on top of the HMP sysfs patches if needed. The 
  fix should
  be trivial.
 
  Could you include in the MP branch for the 12.12 release? Testing with 
  sysbench and
  coremark show significant performance improvements for parallel workloads 
  as all
  cpus can now be used for cpu intensive tasks.

 Morten,

 Can you share some performance number improvements and/or
 kernelshark-type graphs with and without this patch? It'd be very
 interesting to see the changes.

 Monday is the deadline to get this merged into the MP tree to make it
 to the release. It is end of week now. Not sure how much testing and
 review can be done before Monday. Your numbers might make a compelling
 argument.

 Regards,
 Amit

  Thanks,
  Morten
 
  Morten Rasmussen (1):
sched: Basic global balancing support for HMP
 
   kernel/sched/fair.c |  101 
  +--
   1 file changed, 97 insertions(+), 4 deletions(-)
 
  --
  1.7.9.5
 
 
 


Re: [HMP][PATCH 0/1] Global balance

2012-12-07 Thread Morten Rasmussen

On 07/12/12 14:54, Viresh Kumar wrote:

On 7 December 2012 18:43, Morten Rasmussen morten.rasmus...@arm.com wrote:

I should have included the numbers in the cover letter. Here are
numbers for TC2.

sysbench (normalized execution time, lower is better)
threads   2   4  8
HMP  1.00  1.00  1.00
HMP+GB   1.00  0.67  0.58

coremark (normalized iterations per second, higher is better)
threads   2   4  8
HMP  1.00  1.00  1.00
HMP+GB   1.00  1.39  1.73

So there is clear benefit of utilizing the A7s. It actually saves
energy too as the whole benchmark completes faster.


Hi Morten,

I have applied your patch now and pushed v13. Please cross-check v13
to see if everything is correct.



It looks right to me.

Morten



Re: HMP patches v2

2012-12-05 Thread Morten Rasmussen

On 05/12/12 11:01, Viresh Kumar wrote:

On 5 December 2012 16:28, Liviu Dudau liviu.du...@arm.com wrote:

The revert request came at Morten's suggestion. He has comments on the code and
technical reasons why he believes that the approach is not the best one, as well
as some scenarios where possible race conditions can occur.

Morten, what is the latest update in this area? I'm not sure I have followed
your discussion with Vincent on the subject.


Just to make it clearer: there are two reverts now. Please look
at the latest tree/branches. Vincent has provided another fixup patch
after which he commented that we no longer need Morten's fix.

I have reverted that too, for the moment, to keep things the same as the
last release. Can Morten test with the latest patches from Vincent (from his
branch) and provide fixups again?



Hi,

I tested Vincent's fix ("sched: pack small tasks: fix update packing
domain") for the buddy selection some weeks ago and confirmed that it
works. So my quick fixes are no longer necessary.

The issues around the reverted "sched: secure access to other CPU
statistics" have not yet been resolved. I don't think that we should
re-enable it until we are clear about what it is doing.

Morten



Re: HMP patches v2

2012-12-05 Thread Morten Rasmussen

On 05/12/12 11:35, Viresh Kumar wrote:

On 5 December 2012 16:58, Morten Rasmussen morten.rasmus...@arm.com wrote:

I tested Vincent's fix (sched: pack small tasks: fix update packing
domain) for the buddy selection some weeks ago and confirmed that it
works. So my quick fixes are no longer necessary.

The issues around the reverted sched: secure access to other CPU
statistics have not yet been resolved. I don't think that we should
re-enable it until we are clear about what it is doing.


There are four patches I am carrying from ARM:

4a29297 ARM: TC2: Re-enable SD_SHARE_POWERLINE
a1924a4 sched: SD_SHARE_POWERLINE buddy selection fix
39b0e77 Revert "sched: secure access to other CPU statistics"
eed72c8 Revert "sched: pack small tasks: fix update packing domain"

You want me to drop eed72c8 and a1924a4, correct?


Yes.

Morten



--
viresh






Re: [RFC 3/6] sched: pack small tasks

2012-11-20 Thread Morten Rasmussen
Hi Vincent,

On Mon, Nov 12, 2012 at 01:51:00PM +, Vincent Guittot wrote:
 On 9 November 2012 18:13, Morten Rasmussen morten.rasmus...@arm.com wrote:
  Hi Vincent,
 
  I have experienced suboptimal buddy selection on a dual cluster setup
  (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at
  CPU level. This seems to be the correct flag settings for a system with
  only cluster level power gating.
 
  To me it looks like update_packing_domain() is not doing the right
  thing. See inline comments below.
 
 Hi Morten,
 
 Thanks for testing the patches.
 
 It seems that I have over-optimized the loop and removed some use cases.
 
 
  On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:
  During sched_domain creation, we define a pack buddy CPU if available.
 
  On a system that share the powerline at all level, the buddy is set to -1
 
  On a dual clusters / dual cores system which can powergate each core and
  cluster independantly, the buddy configuration will be :
| CPU0 | CPU1 | CPU2 | CPU3 |
  ---
  buddy | CPU0 | CPU0 | CPU0 | CPU2 |
 
  Small tasks tend to slip out of the periodic load balance.
  The best place to choose to migrate them is at their wake up.
 
  Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
  ---
   kernel/sched/core.c  |1 +
   kernel/sched/fair.c  |  109 
  ++
   kernel/sched/sched.h |1 +
   3 files changed, 111 insertions(+)
 
  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
  index dab7908..70cadbe 100644
  --- a/kernel/sched/core.c
  +++ b/kernel/sched/core.c
  @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
  root_domain *rd, int cpu)
rcu_assign_pointer(rq-sd, sd);
destroy_sched_domains(tmp, cpu);
 
  + update_packing_domain(cpu);
update_top_cache_domain(cpu);
   }
 
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index 4f4a4f6..8c9d3ed 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -157,6 +157,63 @@ void sched_init_granularity(void)
update_sysctl();
   }
 
  +
  +/*
  + * Save the id of the optimal CPU that should be used to pack small tasks
  + * The value -1 is used when no buddy has been found
  + */
  +DEFINE_PER_CPU(int, sd_pack_buddy);
  +
  +/* Look for the best buddy CPU that can be used to pack small tasks
  + * We make the assumption that it doesn't wort to pack on CPU that share 
  the
  + * same powerline. We looks for the 1st sched_domain without the
  + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the 
  lowest
  + * power per core based on the assumption that their power efficiency is
  + * better */
  +void update_packing_domain(int cpu)
  +{
  + struct sched_domain *sd;
  + int id = -1;
  +
  + sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
  + if (!sd)
  + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
  + else
  + sd = sd->parent;
  sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched
  groups of the parent level would represent the power domains. If I get it
  right, we want to pack inside the cluster first and only let the first cpu
 
 You probably wanted to use sched_group instead of cluster because
 cluster is only a special use case, didn't you ?
 
  of the cluster do packing on another cluster. So all cpus - except the
  first one - in the current sched domain should find its buddy within the
  domain and only the first one should go to the parent sched domain to
  find its buddy.
 
 We don't want to pack in the current sched_domain because it shares
 power domain. We want to pack at the parent level
 

Yes. I think we mean the same thing. The packing takes place at the
parent sched_domain but the sched_group that we are looking at only
contains the cpus of the level below.

 
  I propose the following fix:
 
  -   sd = sd->parent;
  +   if (cpumask_first(sched_domain_span(sd)) == cpu
  +   || !sd->parent)
  +   sd = sd->parent;
 
 We always look for the buddy in the parent level whatever the cpu
 position in the mask is.
 
 
 
  +
  + while (sd) {
  + struct sched_group *sg = sd->groups;
  + struct sched_group *pack = sg;
  + struct sched_group *tmp = sg->next;
  +
  + /* 1st CPU of the sched domain is a good candidate */
  + if (id == -1)
  + id = cpumask_first(sched_domain_span(sd));
 
  There is no guarantee that id is in the sched group pointed to by
  sd->groups, which is implicitly assumed later in the search loop. We
  need to find the sched group that contains id and point sg to that
  instead. I haven't found an elegant way to find that group, but the fix
  below should at least give the right result.
 
  +   /* Find sched group of candidate */
  +   tmp

Re: [HMP tunables v2][PATCH 3/7] ARM: TC2: Re-enable SD_SHARE_POWERLINE

2012-11-19 Thread Morten Rasmussen

Hi Vincent,

On 19/11/12 09:20, Vincent Guittot wrote:

Hi,

On 16 November 2012 19:32, Liviu Dudau liviu.du...@arm.com wrote:

From: Morten Rasmussen morten.rasmus...@arm.com

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.
---
  arch/arm/kernel/topology.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 317dac6..4d34e0e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -228,7 +228,7 @@ struct cputopo_arm cpu_topology[NR_CPUS];

  int arch_sd_share_power_line(void)
  {
-   return 0*SD_SHARE_POWERLINE;
+   return 1*SD_SHARE_POWERLINE;


I'm not sure I catch your goal. With this modification, the power
line (or power domain) is shared at all levels, which should disable the
packing mechanism. But in a previous patch you fixed the update packing
loop, so I assume that you want to use it. Which kind of configuration
would you like to have among the proposals below?

cpu   : CPU0 | CPU1 | CPU2 | CPU3 | CPU4
buddy conf 1 : CPU2 | CPU0 | CPU2 | CPU2 | CPU2
buddy conf 2 : CPU2 | CPU2 | CPU2 | CPU2 | CPU2
buddy conf 3 :   -1 |   -1 |   -1 |   -1 |   -1

When we look at git://git.linaro.org/arm/big.LITTLE/mp.git
big-LITTLE-MP-master-v12, we can see that you have defined a custom
sched_domain which hasn't been updated with the SD_SHARE_POWERLINE flag, so
the flag is cleared at CPU level. Based on this, I would say that you
want buddy conf 2? But I would say that buddy conf 1 should give
better results. Have you tried both?



My goal with this fix is to set up the SD_SHARE_POWERLINE flags as they
really are on TC2. It could have been done more elegantly. Since the HMP
patches override the sched_domain flags at CPU level, the
SD_SHARE_POWERLINE is not being set by arch_sd_share_power_line(). With
this fix we will get SD_SHARE_POWERLINE at MC level and no
SD_SHARE_POWERLINE at CPU level, which I believe is the correct set up
for TC2.


For the buddy configuration the goal is to get configuration 1 in your
list above. You should get that when using the other patch to fix the
buddy selection algorithm.
I'm not sure if conf 1 or 2 is best. I think it depends on the
power/performance trade-off of the specific platform. conf 1 may lead to
CPU1->CPU0->CPU2 migrations which may be undesirable. If your cpus are
very leaky it might make sense to not do packing at all inside a high
performance cluster and always do packing directly on another low
power cluster like conf 2. I think this needs further investigation.


I have only tested with conf 1 on TC2.

Regards,
Morten


Regards,
Vincent


  }

  const struct cpumask *cpu_coregroup_mask(int cpu)
--
1.7.9.5





Re: [HMP tunables v2][PATCH 3/7] ARM: TC2: Re-enable SD_SHARE_POWERLINE

2012-11-19 Thread Morten Rasmussen

On 19/11/12 12:23, Vincent Guittot wrote:

On 19 November 2012 13:08, Morten Rasmussen morten.rasmus...@arm.com wrote:

Hi Vincent,


On 19/11/12 09:20, Vincent Guittot wrote:


Hi,

On 16 November 2012 19:32, Liviu Dudau liviu.du...@arm.com wrote:


From: Morten Rasmussen morten.rasmus...@arm.com

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.
---
   arch/arm/kernel/topology.c |2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 317dac6..4d34e0e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -228,7 +228,7 @@ struct cputopo_arm cpu_topology[NR_CPUS];

   int arch_sd_share_power_line(void)
   {
-   return 0*SD_SHARE_POWERLINE;
+   return 1*SD_SHARE_POWERLINE;



I'm not sure to catch your goal. With this modification, the power
line (or power domain) is shared at all level which should disable the
packing mechanism. But in a previous patch you fix the update packing
loop so I assume that you want to use it. Which kind of configuration
you would like to have among the proposal below ?

cpu   : CPU0 | CPU1 | CPU2 | CPU3 | CPU4
buddy conf 1 : CPU2 | CPU0 | CPU2 | CPU2 | CPU2
buddy conf 2 : CPU2 | CPU2 | CPU2 | CPU2 | CPU2
buddy conf 3 :   -1 |   -1 |   -1 |   -1 |   -1

When we look at the  git://git.linaro.org/arm/big.LITTLE/mp.git
big-LITTLE-MP-master-v12, we can see that you have defined a custom
sched_domain which hasn't been updated with SD_SHARE_POWERLINE flag so
the flag is cleared at CPU level. Based on this, I would say that you
want buddy conf 2 ? but I would say that buddy conf 1 should give
better result. Have you tried both ?



My goal with this fix is to set up the SD_SHARE_POWERLINE flags as they
really are on TC2. It could have been done more elegantly. Since the HMP
patches overrides the sched_domain flags at CPU level the SD_SHARE_POWERLINE
is not being set by arch_sd_share_power_line(). With this fix we will get
SD_SHARE_POWERLINE at MC level and no SD_SHARE_POWERLINE at CPU level, which
I believe is the correct set up for TC2.

For the buddy configuration the goal is to get configuration 1 in your list
above. You should get that when using the other patch to fix the buddy
selection algorithm.
I'm not sure if conf 1 or 2 is best. I think it depends on the
power/performance trade-off of the specific platform. conf 1 may lead to
CPU1-CPU0-CPU2 migrations which may be undesirable. If your cpus are very
leaky it might make sense to not do packing at all inside a high performance
cluster and always do packing directly on a another low power cluster like
conf 2. I think this needs further investigation.

I have only tested with conf 1 on TC2.


Hi Morten,

Conf1 is the default configuration for the ARM platform because
SD_SHARE_POWERLINE is cleared at all levels for this architecture.

Conf2 should be used if you can't powergate the cores independently, but
several tests have demonstrated that even if you can't powergate each
core independently, it is worth packing small tasks on a few CPUs of a
cluster, so it's worth using conf1 on TC2 as well.

Based on your explanation, we should use the original configuration of
SD_SHARE_POWERLINE (cleared at all levels for the ARM platform).


I agree that the result is the same, but I don't like disabling
SD_SHARE_POWERLINE at all levels when the cpus in each cluster actually
are in the same power domain, as is the case on TC2. The name
SHARE_POWERLINE implies a clear relation to the actual hardware design,
thus setting the flags differently from the actual hardware design is
misleading in my opinion. If the buddy selection algorithm doesn't
select appropriate buddies when the flags are set to reflect the actual
hardware design, I would suggest changing the buddy selection algorithm
instead of changing the sched_domain flags.


If it is chosen to not have a direct relation between the flags and the 
hardware design, I think that the flag should be renamed so it doesn't 
give the wrong impression.


Morten



Regards
Vincent




Regards,
Morten



Regards,
Vincent


   }

   const struct cpumask *cpu_coregroup_mask(int cpu)
--
1.7.9.5





Re: [HMP tunables v2][PATCH 3/7] ARM: TC2: Re-enable SD_SHARE_POWERLINE

2012-11-19 Thread Morten Rasmussen

On 19/11/12 14:09, Vincent Guittot wrote:

On 19 November 2012 14:36, Morten Rasmussen morten.rasmus...@arm.com wrote:

On 19/11/12 12:23, Vincent Guittot wrote:


On 19 November 2012 13:08, Morten Rasmussen morten.rasmus...@arm.com
wrote:


Hi Vincent,


On 19/11/12 09:20, Vincent Guittot wrote:



Hi,

On 16 November 2012 19:32, Liviu Dudau liviu.du...@arm.com wrote:



From: Morten Rasmussen morten.rasmus...@arm.com

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.
---
arch/arm/kernel/topology.c |2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 317dac6..4d34e0e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -228,7 +228,7 @@ struct cputopo_arm cpu_topology[NR_CPUS];

int arch_sd_share_power_line(void)
{
-   return 0*SD_SHARE_POWERLINE;
+   return 1*SD_SHARE_POWERLINE;




I'm not sure to catch your goal. With this modification, the power
line (or power domain) is shared at all level which should disable the
packing mechanism. But in a previous patch you fix the update packing
loop so I assume that you want to use it. Which kind of configuration
you would like to have among the proposal below ?

cpu   : CPU0 | CPU1 | CPU2 | CPU3 | CPU4
buddy conf 1 : CPU2 | CPU0 | CPU2 | CPU2 | CPU2
buddy conf 2 : CPU2 | CPU2 | CPU2 | CPU2 | CPU2
buddy conf 3 :   -1 |   -1 |   -1 |   -1 |   -1

When we look at the  git://git.linaro.org/arm/big.LITTLE/mp.git
big-LITTLE-MP-master-v12, we can see that you have defined a custom
sched_domain which hasn't been updated with SD_SHARE_POWERLINE flag so
the flag is cleared at CPU level. Based on this, I would say that you
want buddy conf 2 ? but I would say that buddy conf 1 should give
better result. Have you tried both ?



My goal with this fix is to set up the SD_SHARE_POWERLINE flags as they
really are on TC2. It could have been done more elegantly. Since the HMP
patches overrides the sched_domain flags at CPU level the
SD_SHARE_POWERLINE
is not being set by arch_sd_share_power_line(). With this fix we will get
SD_SHARE_POWERLINE at MC level and no SD_SHARE_POWERLINE at CPU level,
which
I believe is the correct set up for TC2.

For the buddy configuration the goal is to get configuration 1 in your
list
above. You should get that when using the other patch to fix the buddy
selection algorithm.
I'm not sure if conf 1 or 2 is best. I think it depends on the
power/performance trade-off of the specific platform. conf 1 may lead to
CPU1-CPU0-CPU2 migrations which may be undesirable. If your cpus are
very
leaky it might make sense to not do packing at all inside a high
performance
cluster and always do packing directly on a another low power cluster
like
conf 2. I think this needs further investigation.

I have only tested with conf 1 on TC2.



Hi Morten,

Conf1 is the default configuration for ARM platform because
SD_SHARE_POWERLINE is cleared at all levels for this architecture.

Conf2 should be used if you can't powergate the core independently but
several tests have demonstrated that even if you can't powergate each
core independently, it worth packing small task on few CPUs in a core
so it's worth using conf1 on TC2 as well.

Based on your explanation, we should use the original configuration of
SD_SHARE_POWERLINE (cleared at all level for ARM platform)



I agree that the result is the same, but I don't like disabling
SD_SHARE_POWERLINE for all level when the cpus in each cluster actually are
in the same power domain as it is the case on TC2. The name SHARE_POWERLINE
implies a clear relation to the actual hardware design, thus setting the
flags differently than the actual hardware design is misleading in my
opinion. If the buddy selection algorithm doesn't select appropriate buddies
when flags are set to reflect the actual hardware design I would suggest
changing the buddy selection algorithm instead of changing the sched_domain
flags.

If it is chosen to not have a direct relation between the flags and the
hardware design, I think that the flag should be renamed so it doesn't give
the wrong impression.


There is a direct link between the powergating and SHARE_POWERLINE,
and if you want the buddy selection to strictly reflect your HW
configuration, you must use conf2 and not conf1.


I just want the buddy selection to be reasonable when the
SHARE_POWERLINE flags reflect the true hardware configuration. I
haven't tested whether conf 1 or 2 is best yet. As long as I am getting
one of them it is definitely an improvement over not having task packing at
all :)



Now, besides the packing small tasks patch and the TC2 configuration, it
has been proven that packing small tasks on an ARM platform (dual
Cortex-A9) which can only powergate the cluster improves the power
consumption of some low cpu load use cases like MP3 playback (we
had used cpu hotplug at that time). This assumption has been proven
only

Re: [RFC 3/6] sched: pack small tasks

2012-11-09 Thread Morten Rasmussen
On Fri, Nov 02, 2012 at 10:53:47AM +, Santosh Shilimkar wrote:
 On Monday 29 October 2012 06:42 PM, Vincent Guittot wrote:
  On 24 October 2012 17:20, Santosh Shilimkar santosh.shilim...@ti.com 
  wrote:
  Vincent,
 
  Few comments/questions.
 
 
  On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:
 
  During sched_domain creation, we define a pack buddy CPU if available.
 
  On a system that share the powerline at all level, the buddy is set to -1
 
  On a dual clusters / dual cores system which can powergate each core and
  cluster independantly, the buddy configuration will be :
  | CPU0 | CPU1 | CPU2 | CPU3 |
  ---
  buddy | CPU0 | CPU0 | CPU0 | CPU2 |
 
   ^
  Is that a typo ? Should it be CPU2 instead of
  CPU0 ?
 
  No it's not a typo.
  The system packs at each scheduling level. It starts to pack in
  cluster because each core can power gate independently so CPU1 tries
  to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU
  level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in
  itself
 
 I get it. Though in the above example a task may migrate from say
 CPU3->CPU2->CPU0 as part of packing. I was just thinking whether
 moving such a task from say CPU3 to CPU0 might be best instead.

To me it seems suboptimal to pack the task twice, but the alternative is
not good either. If you try to move the task directly to CPU0 you may
miss packing opportunities if CPU0 is already busy, while CPU2 might
have enough capacity to take it. It would probably be better to check
the busyness of CPU0 and then back off and try CPU2 if CPU0 is busy. This
would require a buddy list for each CPU rather than just a single buddy and
thus might become expensive.
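
For illustration only, one way to approximate that behaviour without full
per-cpu buddy lists is to chase the existing single-buddy chain and then back
off from the far end. cpu_has_spare_capacity() is a hypothetical helper and
none of this is from the actual patch set:

static int hmp_pack_target(int cpu)
{
	int chain[NR_CPUS];
	int depth = 0;
	int next = per_cpu(sd_pack_buddy, cpu);

	/* Collect the buddy chain, e.g. CPU3 -> CPU2 -> CPU0. */
	while (next != -1 && next != cpu && depth < NR_CPUS) {
		chain[depth++] = next;
		cpu = next;
		next = per_cpu(sd_pack_buddy, cpu);
	}

	/* Prefer the buddy furthest down the chain, backing off if busy. */
	while (depth--) {
		if (cpu_has_spare_capacity(chain[depth]))	/* hypothetical */
			return chain[depth];
	}

	return -1;	/* nowhere suitable to pack */
}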

 
 
  Small tasks tend to slip out of the periodic load balance.
  The best place to choose to migrate them is at their wake up.
 
  I have tried this series since I was looking at some of these packing
  bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary,
  I did see some additional filtering of threads with this series
  but its not making much difference in power. More on this below.
 
  Can I ask you which configuration you have used ? how many cores and
  cluster ?  Can they be power gated independently ?
 
 I have been trying with couple of setups. Dual Core ARM machine and
 Quad core X86 box with single package thought most of the mobile
 workload analysis I was doing on ARM machine. On both setups
 CPUs can be gated independently.
 
 
 
  Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
  ---
 kernel/sched/core.c  |1 +
 kernel/sched/fair.c  |  109
  ++
 kernel/sched/sched.h |1 +
 3 files changed, 111 insertions(+)
 
  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
  index dab7908..70cadbe 100644
  --- a/kernel/sched/core.c
  +++ b/kernel/sched/core.c
  @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct
  root_domain *rd, int cpu)
   rcu_assign_pointer(rq-sd, sd);
   destroy_sched_domains(tmp, cpu);
 
  +   update_packing_domain(cpu);
   update_top_cache_domain(cpu);
 }
 
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index 4f4a4f6..8c9d3ed 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -157,6 +157,63 @@ void sched_init_granularity(void)
   update_sysctl();
 }
 
  +
  +/*
  + * Save the id of the optimal CPU that should be used to pack small tasks
  + * The value -1 is used when no buddy has been found
  + */
  +DEFINE_PER_CPU(int, sd_pack_buddy);
  +
  +/* Look for the best buddy CPU that can be used to pack small tasks
  + * We make the assumption that it doesn't wort to pack on CPU that share
  the
 
  s/wort/worth
 
  yes
 
 
  + * same powerline. We looks for the 1st sched_domain without the
  + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the
  lowest
  + * power per core based on the assumption that their power efficiency is
  + * better */
 
  Commenting style..
  /*
*
*/
 
 
  yes
 
   Can you please expand on why the assumption is right ?
   it doesn't wort to pack on CPU that share the same powerline

   By share the same power-line, I mean that the CPUs can't power off
   independently. So if some CPUs can't power off independently, it's
   worth trying to use most of them to race to idle.
 
 In that case I suggest we use a different word here. Power line can be
 treated as a voltage line or a power domain.
 Maybe SD_SHARE_CPU_POWERDOMAIN ?
 

How about just SD_SHARE_POWERDOMAIN ?

 
   Think about a scenario where you have a quad core, dual cluster system
 
   |Cluster1|  |cluster 2|
  | CPU0 | CPU1 | CPU2 | CPU3 |   | CPU0 | CPU1 | CPU2 | CPU3 |
 
 
   Both clusters run from the same voltage rail and have the same PLL
   clocking them. But the clusters have their own power domains
   and all CPUs can power gate themselves to 

Re: [RFC 3/6] sched: pack small tasks

2012-11-09 Thread Morten Rasmussen
Hi Vincent,

I have experienced suboptimal buddy selection on a dual cluster setup
(ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at
CPU level. This seems to be the correct flag settings for a system with
only cluster level power gating.

To me it looks like update_packing_domain() is not doing the right
thing. See inline comments below.

On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:
 During sched_domain creation, we define a pack buddy CPU if available.
 
 On a system that share the powerline at all level, the buddy is set to -1
 
 On a dual clusters / dual cores system which can powergate each core and
 cluster independantly, the buddy configuration will be :
   | CPU0 | CPU1 | CPU2 | CPU3 |
 ---
 buddy | CPU0 | CPU0 | CPU0 | CPU2 |
 
 Small tasks tend to slip out of the periodic load balance.
 The best place to choose to migrate them is at their wake up.
 
 Signed-off-by: Vincent Guittot vincent.guit...@linaro.org
 ---
  kernel/sched/core.c  |1 +
  kernel/sched/fair.c  |  109 
 ++
  kernel/sched/sched.h |1 +
  3 files changed, 111 insertions(+)
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index dab7908..70cadbe 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
 root_domain *rd, int cpu)
   rcu_assign_pointer(rq-sd, sd);
   destroy_sched_domains(tmp, cpu);
  
 + update_packing_domain(cpu);
   update_top_cache_domain(cpu);
  }
  
 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
 index 4f4a4f6..8c9d3ed 100644
 --- a/kernel/sched/fair.c
 +++ b/kernel/sched/fair.c
 @@ -157,6 +157,63 @@ void sched_init_granularity(void)
   update_sysctl();
  }
  
 +
 +/*
 + * Save the id of the optimal CPU that should be used to pack small tasks
 + * The value -1 is used when no buddy has been found
 + */
 +DEFINE_PER_CPU(int, sd_pack_buddy);
 +
 +/* Look for the best buddy CPU that can be used to pack small tasks
 + * We make the assumption that it doesn't wort to pack on CPU that share the
 + * same powerline. We looks for the 1st sched_domain without the
 + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest
 + * power per core based on the assumption that their power efficiency is
 + * better */
 +void update_packing_domain(int cpu)
 +{
 + struct sched_domain *sd;
 + int id = -1;
 +
 + sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
 + if (!sd)
 + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
 + else
 + sd = sd->parent;
sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched
groups of the parent level would represent the power domains. If I get it
right, we want to pack inside the cluster first and only let the first cpu
of the cluster do packing on another cluster. So all cpus - except the
first one - in the current sched domain should find their buddy within the
domain and only the first one should go to the parent sched domain to
find its buddy.

I propose the following fix:

-   sd = sd->parent;
+   if (cpumask_first(sched_domain_span(sd)) == cpu
+   || !sd->parent)
+   sd = sd->parent;


 +
 + while (sd) {
 + struct sched_group *sg = sd->groups;
 + struct sched_group *pack = sg;
 + struct sched_group *tmp = sg->next;
 +
 + /* 1st CPU of the sched domain is a good candidate */
 + if (id == -1)
 + id = cpumask_first(sched_domain_span(sd));

There is no guarantee that id is in the sched group pointed to by
sd->groups, which is implicitly assumed later in the search loop. We
need to find the sched group that contains id and point sg to that
instead. I haven't found an elegant way to find that group, but the fix
below should at least give the right result.

+   /* Find sched group of candidate */
+   tmp = sd->groups;
+   do {
+   if (cpumask_test_cpu(id, sched_group_cpus(tmp)))
+   {
+   sg = tmp;
+   break;
+   }
+   } while (tmp = tmp->next, tmp != sd->groups);
+
+   pack = sg;
+   tmp = sg->next;

Regards,
Morten

 +
 + /* loop the sched groups to find the best one */
 + while (tmp != sg) {
 + if (tmp->sgp->power * sg->group_weight <
 + sg->sgp->power * tmp->group_weight)
 + pack = tmp;
 + tmp = tmp->next;
 + }
 +
 + /* we have found a better group */
 + if (pack != sg)
 + id = cpumask_first(sched_group_cpus(pack));
 +
 + /* Look for another CPU than itself */
 + 

Re: Fix for HMP scheduler crash [ Re: [GIT PULL]: big LITTLE MP v10]

2012-10-12 Thread Morten Rasmussen
Hi Tixy,

Thanks for the patch. I think this patch is the right way to solve this
issue.

There is still a problem with the priority filter in
hmp_down_migration() which Viresh pointed out earlier. There is no
checking of whether the task is actually allowed to run on any of the
slower cpus. Solving that would actually also fix the issue that you are
observing as a side effect. I have attached a patch.

I think we should apply both.

Thanks,
Morten

On Fri, Oct 12, 2012 at 02:33:40PM +0100, Jon Medhurst (Tixy) wrote:
 On Fri, 2012-10-12 at 14:19 +0100, Jon Medhurst (Tixy) wrote:
  The attached patch fixes the immediate problem by avoiding the empty
  domain (which is probably a good thing anyway)
 
 Oops, my last patch included some extra junk, the one attached to this
 mail fixes this...

 From 7365076675b851355d48e9b1157e223d7719e3ac Mon Sep 17 00:00:00 2001
 From: Jon Medhurst t...@linaro.org
 Date: Fri, 12 Oct 2012 13:45:35 +0100
 Subject: [PATCH] ARM: sched: Avoid empty 'slow' HMP domain
 
 On homogeneous (non-heterogeneous) systems all CPUs will be declared
 'fast' and the slow cpu list will be empty. In this situation we need to
 avoid adding an empty slow HMP domain otherwise the scheduler code will
 blow up when it attempts to move a task to the slow domain.
 
 Signed-off-by: Jon Medhurst t...@linaro.org
 ---
  arch/arm/kernel/topology.c |   10 ++
  1 file changed, 6 insertions(+), 4 deletions(-)
 
 diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
 index 58dac7a..0b51233 100644
 --- a/arch/arm/kernel/topology.c
 +++ b/arch/arm/kernel/topology.c
@@ -396,10 +396,12 @@ void __init arch_get_hmp_domains(struct list_head *hmp_domains_list)
* Must be ordered with respect to compute capacity.
* Fastest domain at head of list.
*/
 - domain = (struct hmp_domain *)
 - kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
 - cpumask_copy(&domain->cpus, &hmp_slow_cpu_mask);
 - list_add(&domain->hmp_domains, hmp_domains_list);
 + if(!cpumask_empty(&hmp_slow_cpu_mask)) {
 + domain = (struct hmp_domain *)
 + kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
 + cpumask_copy(&domain->cpus, &hmp_slow_cpu_mask);
 + list_add(&domain->hmp_domains, hmp_domains_list);
 + }
   domain = (struct hmp_domain *)
   kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
   cpumask_copy(&domain->cpus, &hmp_fast_cpu_mask);
 -- 
 1.7.10.4
From 9f241c37bb7316eeea56e6c93541352cf5c9b8a8 Mon Sep 17 00:00:00 2001
From: Morten Rasmussen morten.rasmus...@arm.com
Date: Fri, 12 Oct 2012 15:25:02 +0100
Subject: [PATCH] sched: Only down migrate low priority tasks if allowed by
 affinity mask

Adds an extra check of the intersection of the task affinity mask and the slower
hmp_domain cpumask before down-migrating low priority tasks.

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 kernel/sched/fair.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56cbda1..edcf922 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5562,8 +5562,11 @@ static unsigned int hmp_down_migration(int cpu, struct sched_entity *se)
 
 #ifdef CONFIG_SCHED_HMP_PRIO_FILTER
 	/* Filter by task priority */
-	if (p->prio >= hmp_up_prio)
+	if ((p->prio >= hmp_up_prio) &&
+		cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
+			tsk_cpus_allowed(p))) {
 		return 1;
+	}
 #endif
 
 	/* Let the task load settle before doing another down migration */
-- 
1.7.9.5


Re: Fix for HMP scheduler crash [ Re: [GIT PULL]: big LITTLE MP v10]

2012-10-12 Thread Morten Rasmussen
On Fri, Oct 12, 2012 at 04:33:19PM +0100, Jon Medhurst (Tixy) wrote:
 On Fri, 2012-10-12 at 16:11 +0100, Morten Rasmussen wrote:
  Hi Tixy,
  
  Thanks for the patch. I think this patch is the right way to solve this
  issue.
  
  There is still a problem with the priority filter in
  hmp_down_migration() which Viresh pointed out earlier. There is no
  checking of whether the task is actually allowed to run on any of the
  slower cpus. Solving that would actually also fix the issue that you are
  observing as a side effect. I have attached a patch.
 
 The patch looks reasonable. I've just run it on TC2 and A9 with the
 addition of a pr_err("$"); before the return 1; and can see the
 occasional '$' on TC2 and none on A9, as we would expect. So I guess
 that counts as:
 
 Reviewed-by: Jon Medhurst t...@linaro.org
 Tested-by: Jon Medhurst t...@linaro.org


Thanks for reviewing and testing.

My comments to your patch in the previous reply would count as:

Reviewed-by: Morten Rasmussen morten.rasmus...@arm.com

I have only tested it on TC2.

Morten
 
 -- 
 Tixy
 
 
  I think we should apply both.
  
  Thanks,
  Morten
  
  On Fri, Oct 12, 2012 at 02:33:40PM +0100, Jon Medhurst (Tixy) wrote:
   On Fri, 2012-10-12 at 14:19 +0100, Jon Medhurst (Tixy) wrote:
The attached patch fixes the immediate problem by avoiding the empty
domain (which is probably a good thing anyway)
   
   Oops, my last patch included some extra junk, the one attached to this
   mail fixes this...
  
   From 7365076675b851355d48e9b1157e223d7719e3ac Mon Sep 17 00:00:00 2001
   From: Jon Medhurst t...@linaro.org
   Date: Fri, 12 Oct 2012 13:45:35 +0100
   Subject: [PATCH] ARM: sched: Avoid empty 'slow' HMP domain
   
   On homogeneous (non-heterogeneous) systems all CPUs will be declared
   'fast' and the slow cpu list will be empty. In this situation we need to
   avoid adding an empty slow HMP domain otherwise the scheduler code will
   blow up when it attempts to move a task to the slow domain.
   
   Signed-off-by: Jon Medhurst t...@linaro.org
   ---
arch/arm/kernel/topology.c |   10 ++
1 file changed, 6 insertions(+), 4 deletions(-)
   
   diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
   index 58dac7a..0b51233 100644
   --- a/arch/arm/kernel/topology.c
   +++ b/arch/arm/kernel/topology.c
   @@ -396,10 +396,12 @@ void __init arch_get_hmp_domains(struct list_head *hmp_domains_list)
  * Must be ordered with respect to compute capacity.
  * Fastest domain at head of list.
  */
   - domain = (struct hmp_domain *)
   - kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
   - cpumask_copy(&domain->cpus, &hmp_slow_cpu_mask);
   - list_add(&domain->hmp_domains, hmp_domains_list);
   + if(!cpumask_empty(&hmp_slow_cpu_mask)) {
   + domain = (struct hmp_domain *)
   + kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
   + cpumask_copy(&domain->cpus, &hmp_slow_cpu_mask);
   + list_add(&domain->hmp_domains, hmp_domains_list);
   + }
 domain = (struct hmp_domain *)
 kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
 cpumask_copy(&domain->cpus, &hmp_fast_cpu_mask);
   -- 
   1.7.10.4
 
 
 




Re: [RFC PATCH 06/10] ARM: sched: Use device-tree to provide fast/slow CPU list for HMP

2012-10-10 Thread Morten Rasmussen
On Thu, Oct 04, 2012 at 07:49:32AM +0100, Viresh Kumar wrote:
 On 22 September 2012 00:02,  morten.rasmus...@arm.com wrote:
  From: Morten Rasmussen morten.rasmus...@arm.com
 
  We can't rely on Kconfig options to set the fast and slow CPU lists for
  HMP scheduling if we want a single kernel binary to support multiple
  devices with different CPU topology. E.g. TC2 (ARM's Test-Chip-2
  big.LITTLE system), Fast Models, or even non big.LITTLE devices.
 
  This patch adds the function arch_get_fast_and_slow_cpus() to generate
  the lists at run-time by parsing the CPU nodes in device-tree; it
  assumes slow cores are A7s and everything else is fast. The function
  still supports the old Kconfig options as this is useful for testing the
  HMP scheduler on devices without big.LITTLE.
 
 But this code is handling this case too at the end, with the following logic:
 
  +   cpumask_setall(fast);
  +   cpumask_clear(slow);
 
 Am i missing something?
 

The HMP setup can be defined using Kconfig or DT. If both fail, it will
set all cpus to be fast cpus and effectively disable SCHED_HMP. The
Kconfig option is kept to allow testing of alternative HMP setups
without having to change the DT or use DT at all, which might be handy
for non-ARM platforms. I hope that answers your question.

  This patch is reuse of a patch by Jon Medhurst t...@linaro.org with a
  few bits left out.
 
 Then probably he must be the author of this commit? Also a SOB is required
 from him here.
 

I don't know what the correct procedure is for this sort of partial
patch reuse. Since I didn't know better, I adopted Tixy's own reference
style that he used in one of his patches which is an extension of a
previous patch by me. I will of course fix it to follow normal procedure
if there is one.

  Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
  ---
   arch/arm/Kconfig   |4 ++-
   arch/arm/kernel/topology.c |   69 
  
   2 files changed, 72 insertions(+), 1 deletion(-)
 
  diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
  index cb80846..f1271bc 100644
  --- a/arch/arm/Kconfig
  +++ b/arch/arm/Kconfig
  @@ -1588,13 +1588,15 @@ config HMP_FAST_CPU_MASK
   string "HMP scheduler fast CPU mask"
  depends on SCHED_HMP
  help
  -  Specify the cpuids of the fast CPUs in the system as a list 
  string,
  +  Leave empty to use device tree information.
  + Specify the cpuids of the fast CPUs in the system as a list 
  string,
e.g. cpuid 0+1 should be specified as 0-1.
 
   config HMP_SLOW_CPU_MASK
   string "HMP scheduler slow CPU mask"
  depends on SCHED_HMP
  help
  + Leave empty to use device tree information.
Specify the cpuids of the slow CPUs in the system as a list 
  string,
e.g. cpuid 0+1 should be specified as 0-1.
 
  diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
  index 26c12c6..7682e12 100644
  --- a/arch/arm/kernel/topology.c
  +++ b/arch/arm/kernel/topology.c
  @@ -317,6 +317,75 @@ void store_cpu_topology(unsigned int cpuid)
  cpu_topology[cpuid].socket_id, mpidr);
   }
 
  +
  +#ifdef CONFIG_SCHED_HMP
  +
  +static const char * const little_cores[] = {
+   "arm,cortex-a7",
  +   NULL,
  +};
  +
  +static bool is_little_cpu(struct device_node *cn)
  +{
  +   const char * const *lc;
  +   for (lc = little_cores; *lc; lc++)
  +   if (of_device_is_compatible(cn, *lc))
  +   return true;
  +   return false;
  +}
  +
  +void __init arch_get_fast_and_slow_cpus(struct cpumask *fast,
  +   struct cpumask *slow)
  +{
  +   struct device_node *cn = NULL;
  +   int cpu = 0;
  +
  +   cpumask_clear(fast);
  +   cpumask_clear(slow);
  +
  +   /*
  +* Use the config options if they are given. This helps testing
  +* HMP scheduling on systems without a big.LITTLE architecture.
  +*/
   +   if (strlen(CONFIG_HMP_FAST_CPU_MASK) && strlen(CONFIG_HMP_SLOW_CPU_MASK)) {
   +   if (cpulist_parse(CONFIG_HMP_FAST_CPU_MASK, fast))
   +   WARN(1, "Failed to parse HMP fast cpu mask!\n");
   +   if (cpulist_parse(CONFIG_HMP_SLOW_CPU_MASK, slow))
   +   WARN(1, "Failed to parse HMP slow cpu mask!\n");
  +   return;
  +   }
  +
  +   /*
  +* Else, parse device tree for little cores.
  +*/
   +   while ((cn = of_find_node_by_type(cn, "cpu"))) {
   +
   +   if (cpu >= num_possible_cpus())
  +   break;
  +
  +   if (is_little_cpu(cn))
  +   cpumask_set_cpu(cpu, slow);
  +   else
  +   cpumask_set_cpu(cpu, fast);
  +
  +   cpu++;
  +   }
  +
   +   if (!cpumask_empty(fast) && !cpumask_empty(slow))
   +   return

Re: [RFC PATCH 06/10] ARM: sched: Use device-tree to provide fast/slow CPU list for HMP

2012-10-10 Thread Morten Rasmussen
Hi Tixy,

Could you have a look at my code stealing patch below? Since it is
basically a trimmed version of one of your patches I would prefer to
put you as author and have your SOB on it. What is your opinion?

Thanks,
Morten

On Fri, Sep 21, 2012 at 07:32:21PM +0100, Morten Rasmussen wrote:
 From: Morten Rasmussen morten.rasmus...@arm.com
 
 We can't rely on Kconfig options to set the fast and slow CPU lists for
 HMP scheduling if we want a single kernel binary to support multiple
 devices with different CPU topology. E.g. TC2 (ARM's Test-Chip-2
 big.LITTLE system), Fast Models, or even non big.LITTLE devices.
 
 This patch adds the function arch_get_fast_and_slow_cpus() to generate
 the lists at run-time by parsing the CPU nodes in device-tree; it
 assumes slow cores are A7s and everything else is fast. The function
 still supports the old Kconfig options as this is useful for testing the
 HMP scheduler on devices without big.LITTLE.
 
 This patch is reuse of a patch by Jon Medhurst t...@linaro.org with a
 few bits left out.
 
 Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
 ---
  arch/arm/Kconfig   |4 ++-
  arch/arm/kernel/topology.c |   69 
 
  2 files changed, 72 insertions(+), 1 deletion(-)
 
 diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
 index cb80846..f1271bc 100644
 --- a/arch/arm/Kconfig
 +++ b/arch/arm/Kconfig
 @@ -1588,13 +1588,15 @@ config HMP_FAST_CPU_MASK
   string "HMP scheduler fast CPU mask"
   depends on SCHED_HMP
   help
 -  Specify the cpuids of the fast CPUs in the system as a list string,
 +  Leave empty to use device tree information.
 +   Specify the cpuids of the fast CPUs in the system as a list string,
 e.g. cpuid 0+1 should be specified as 0-1.
  
  config HMP_SLOW_CPU_MASK
   string "HMP scheduler slow CPU mask"
   depends on SCHED_HMP
   help
 +   Leave empty to use device tree information.
 Specify the cpuids of the slow CPUs in the system as a list string,
 e.g. cpuid 0+1 should be specified as 0-1.
  
 diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
 index 26c12c6..7682e12 100644
 --- a/arch/arm/kernel/topology.c
 +++ b/arch/arm/kernel/topology.c
 @@ -317,6 +317,75 @@ void store_cpu_topology(unsigned int cpuid)
   cpu_topology[cpuid].socket_id, mpidr);
  }
  
 +
 +#ifdef CONFIG_SCHED_HMP
 +
 +static const char * const little_cores[] = {
 + "arm,cortex-a7",
 + NULL,
 +};
 +
 +static bool is_little_cpu(struct device_node *cn)
 +{
 + const char * const *lc;
 + for (lc = little_cores; *lc; lc++)
 + if (of_device_is_compatible(cn, *lc))
 + return true;
 + return false;
 +}
 +
 +void __init arch_get_fast_and_slow_cpus(struct cpumask *fast,
 + struct cpumask *slow)
 +{
 + struct device_node *cn = NULL;
 + int cpu = 0;
 +
 + cpumask_clear(fast);
 + cpumask_clear(slow);
 +
 + /*
 +  * Use the config options if they are given. This helps testing
 +  * HMP scheduling on systems without a big.LITTLE architecture.
 +  */
 + if (strlen(CONFIG_HMP_FAST_CPU_MASK) && strlen(CONFIG_HMP_SLOW_CPU_MASK)) {
 + if (cpulist_parse(CONFIG_HMP_FAST_CPU_MASK, fast))
 + WARN(1, "Failed to parse HMP fast cpu mask!\n");
 + if (cpulist_parse(CONFIG_HMP_SLOW_CPU_MASK, slow))
 + WARN(1, "Failed to parse HMP slow cpu mask!\n");
 + return;
 + }
 +
 + /*
 +  * Else, parse device tree for little cores.
 +  */
 + while ((cn = of_find_node_by_type(cn, "cpu"))) {
 +
 + if (cpu >= num_possible_cpus())
 + break;
 +
 + if (is_little_cpu(cn))
 + cpumask_set_cpu(cpu, slow);
 + else
 + cpumask_set_cpu(cpu, fast);
 +
 + cpu++;
 + }
 +
 + if (!cpumask_empty(fast) && !cpumask_empty(slow))
 + return;
 +
 + /*
 +  * We didn't find both big and little cores so let's call all cores
 +  * fast as this will keep the system running, with all cores being
 +  * treated equal.
 +  */
 + cpumask_setall(fast);
 + cpumask_clear(slow);
 +}
 +
 +#endif /* CONFIG_SCHED_HMP */
 +
 +
  /*
   * init_cpu_topology is called at boot when only one cpu is running
   * which prevent simultaneous write access to cpu_topology array
 -- 
 1.7.9.5
 


___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev


Re: [RFC PATCH 07/10] ARM: sched: Setup SCHED_HMP domains

2012-10-10 Thread Morten Rasmussen
On Thu, Oct 04, 2012 at 07:58:45AM +0100, Viresh Kumar wrote:
 On 22 September 2012 00:02,  morten.rasmus...@arm.com wrote:
  diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
 
  +void __init arch_get_hmp_domains(struct list_head *hmp_domains_list)
  +{
  +   struct cpumask hmp_fast_cpu_mask;
  +   struct cpumask hmp_slow_cpu_mask;
 
 can be merged to single line.
 
  +   struct hmp_domain *domain;
  +
  +   arch_get_fast_and_slow_cpus(hmp_fast_cpu_mask, hmp_slow_cpu_mask);
  +
  +   /*
  +* Initialize hmp_domains
  +* Must be ordered with respect to compute capacity.
  +* Fastest domain at head of list.
  +*/
  +   domain = (struct hmp_domain *)
  +   kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
 
 should be:
 
 domain = kmalloc(sizeof(*domain), GFP_KERNEL);
 
  +   cpumask_copy(&domain->cpus, &hmp_slow_cpu_mask);
 
 what if kmalloc failed?
 
  +   list_add(&domain->hmp_domains, hmp_domains_list);
  +   domain = (struct hmp_domain *)
  +   kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
 
 would be better to kmalloc only once with size 2* sizeof(*domain)
 
  +   cpumask_copy(&domain->cpus, &hmp_fast_cpu_mask);
  +   list_add(&domain->hmp_domains, hmp_domains_list);
 
 Also would be better to create a macro for above two lines to remove
 code redundancy.
 

Agree on all of the above.
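
For reference, a minimal untested sketch of how those changes could look. It keeps two separate allocations rather than the single 2 * sizeof(*domain) allocation suggested, but checks each one, drops the casts, and replaces the duplicated lines with a small helper (hmp_register_domain() is an invented name, not something from the patch set):

/* Sketch only, not the code that was merged. */
static int __init hmp_register_domain(struct list_head *list,
				      const struct cpumask *mask)
{
	struct hmp_domain *domain = kmalloc(sizeof(*domain), GFP_KERNEL);

	if (!domain)
		return -ENOMEM;

	cpumask_copy(&domain->cpus, mask);
	list_add(&domain->hmp_domains, list);
	return 0;
}

void __init arch_get_hmp_domains(struct list_head *hmp_domains_list)
{
	struct cpumask hmp_fast_cpu_mask;
	struct cpumask hmp_slow_cpu_mask;

	arch_get_fast_and_slow_cpus(&hmp_fast_cpu_mask, &hmp_slow_cpu_mask);

	/* Slow domain added first so the fast one ends up at the list head. */
	if (hmp_register_domain(hmp_domains_list, &hmp_slow_cpu_mask))
		return;
	hmp_register_domain(hmp_domains_list, &hmp_fast_cpu_mask);
}

An empty list caused by a failed allocation is already handled gracefully by hmp_cpu_mask_setup(), which just reports that the HMP domain list is empty and bails out.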

Thanks,
Morten


___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev


Re: [RFC PATCH 02/10] sched: Task placement for heterogeneous systems based on task load-tracking

2012-10-09 Thread Morten Rasmussen
Hi Viresh,

On Thu, Oct 04, 2012 at 07:02:03AM +0100, Viresh Kumar wrote:
 Hi Morten,
 
 On 22 September 2012 00:02,  morten.rasmus...@arm.com wrote:
  From: Morten Rasmussen morten.rasmus...@arm.com
 
  This patch introduces the basic SCHED_HMP infrastructure. Each class of
  cpus is represented by a hmp_domain and tasks will only be moved between
  these domains when their load profiles suggest it is beneficial.
 
  SCHED_HMP relies heavily on the task load-tracking introduced in Paul
  Turners fair group scheduling patch set:
 
  https://lkml.org/lkml/2012/8/23/267
 
  SCHED_HMP requires that the platform implements arch_get_hmp_domains()
  which should set up the platform specific list of hmp_domains. It is
  also assumed that the platform disables SD_LOAD_BALANCE for the
  appropriate sched_domains.
 
 An explanation of this requirement would be helpful here.
 

Yes. This is to prevent the load-balancer from moving tasks between
hmp_domains. This will be done exclusively by SCHED_HMP instead to
implement a strict task migration policy and avoid changing the
load-balancer behaviour. The load-balancer will take care of
load-balancing within each hmp_domain.
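
As an illustration, this is exactly what the CPU-level SD_CPU_INIT override in patch 05 does: its flags initializer leaves SD_LOAD_BALANCE cleared while keeping the wake/fork/exec placement bits (abridged fragment of that patch, comment added here):

	.flags	= 0*SD_LOAD_BALANCE	/* no fair-class balancing across hmp_domains */
		| 1*SD_BALANCE_NEWIDLE
		| 1*SD_BALANCE_EXEC
		| 1*SD_BALANCE_FORK
		| 1*SD_WAKE_AFFINE
		| ...			/* remaining flags as in the patch */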

  Tasks placement takes place every time a task is to be inserted into
  a runqueue based on its load history. The task placement decision is
  based on load thresholds.
 
  There are no restrictions on the number of hmp_domains, however,
  multiple (>2) has not been tested and the up/down migration policy is
  rather simple.
 
  Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
  ---
   arch/arm/Kconfig  |   17 +
   include/linux/sched.h |6 ++
   kernel/sched/fair.c   |  168 
  +
   kernel/sched/sched.h  |6 ++
   4 files changed, 197 insertions(+)
 
  diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
  index f4a5d58..5b09684 100644
  --- a/arch/arm/Kconfig
  +++ b/arch/arm/Kconfig
  @@ -1554,6 +1554,23 @@ config SCHED_SMT
MultiThreading at a cost of slightly increased overhead in some
places. If unsure say N here.
 
  +config DISABLE_CPU_SCHED_DOMAIN_BALANCE
  +   bool (EXPERIMENTAL) Disable CPU level scheduler load-balancing
  +   help
  + Disables scheduler load-balancing at CPU sched domain level.
 
 Shouldn't this depend on EXPERIMENTAL?
 

It should. The ongoing discussion about CONFIG_EXPERIMENTAL that Amit is
referring to hasn't come to a conclusion yet.

  +config SCHED_HMP
  +   bool (EXPERIMENTAL) Heterogenous multiprocessor scheduling
 
 ditto.
 
  +   depends on DISABLE_CPU_SCHED_DOMAIN_BALANCE && SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP
  +   help
  + Experimental scheduler optimizations for heterogeneous platforms.
  + Attempts to introspectively select task affinity to optimize power
  + and performance. Basic support for multiple (>2) cpu types is in place,
  + but it has only been tested with two types of cpus.
  + There is currently no support for migration of task groups, hence
  + !SCHED_AUTOGROUP. Furthermore, normal load-balancing must be 
  disabled
  + between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
  +
   config HAVE_ARM_SCU
  bool
  help
  diff --git a/include/linux/sched.h b/include/linux/sched.h
  index 81e4e82..df971a3 100644
  --- a/include/linux/sched.h
  +++ b/include/linux/sched.h
  @@ -1039,6 +1039,12 @@ unsigned long default_scale_smt_power(struct 
  sched_domain *sd, int cpu);
 
   bool cpus_share_cache(int this_cpu, int that_cpu);
 
  +#ifdef CONFIG_SCHED_HMP
  +struct hmp_domain {
  +   struct cpumask cpus;
  +   struct list_head hmp_domains;
 
 Probably need a better name here. domain_list?
 

Yes. hmp_domain_list would be better and stick with the hmp_* naming
convention.

  +};
  +#endif /* CONFIG_SCHED_HMP */
   #else /* CONFIG_SMP */
 
   struct sched_domain_attr;
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index 3e17dd5..d80de46 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -3077,6 +3077,125 @@ static int select_idle_sibling(struct task_struct 
  *p, int target)
  return target;
   }
 
  +#ifdef CONFIG_SCHED_HMP
  +/*
  + * Heterogenous multiprocessor (HMP) optimizations
  + *
  + * The cpu types are distinguished using a list of hmp_domains
  + * which each represent one cpu type using a cpumask.
  + * The list is assumed ordered by compute capacity with the
  + * fastest domain first.
  + */
  +DEFINE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
  +
  +extern void __init arch_get_hmp_domains(struct list_head 
  *hmp_domains_list);
  +
  +/* Setup hmp_domains */
  +static int __init hmp_cpu_mask_setup(void)
 
 How should we interpret its return value? Can you mention what does 0  1 mean
 here?
 

Returns 0 if domain setup failed, i.e. the domain list is empty, and 1
otherwise.

  +{
  +   char buf[64];
  +   struct hmp_domain

Re: [RFC PATCH 04/10] sched: Introduce priority-based task migration filter

2012-10-09 Thread Morten Rasmussen
On Thu, Oct 04, 2012 at 07:27:00AM +0100, Viresh Kumar wrote:
 On 22 September 2012 00:02,  morten.rasmus...@arm.com wrote:
 
  +config SCHED_HMP_PRIO_FILTER
  +   bool (EXPERIMENTAL) Filter HMP migrations by task priority
  +   depends on SCHED_HMP
 
 Should it depend on EXPERIMENTAL?
 
  +   help
  + Enables task priority based HMP migration filter. Any task with
  + a NICE value above the threshold will always be on low-power cpus
  + with less compute capacity.
  +
  +config SCHED_HMP_PRIO_FILTER_VAL
  +   int NICE priority threshold
  +   default 5
  +   depends on SCHED_HMP_PRIO_FILTER
  +
   config HAVE_ARM_SCU
  bool
  help
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index 490f1f0..8f0f3b9 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -3129,9 +3129,12 @@ static int __init hmp_cpu_mask_setup(void)
* hmp_down_threshold: max. load allowed for tasks migrating to a slower 
  cpu
* The default values (512, 256) offer good responsiveness, but may need
* tweaking suit particular needs.
  + *
  + * hmp_up_prio: Only up migrate task with high priority (hmp_up_prio)
*/
   unsigned int hmp_up_threshold = 512;
   unsigned int hmp_down_threshold = 256;
  +unsigned int hmp_up_prio = NICE_TO_PRIO(CONFIG_SCHED_HMP_PRIO_FILTER_VAL);
 
   static unsigned int hmp_up_migration(int cpu, struct sched_entity *se);
   static unsigned int hmp_down_migration(int cpu, struct sched_entity *se);
  @@ -5491,6 +5494,12 @@ static unsigned int hmp_up_migration(int cpu, struct 
  sched_entity *se)
  if (hmp_cpu_is_fastest(cpu))
  return 0;
 
  +#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
  +   /* Filter by task priority */
  +   if (p->prio >= hmp_up_prio)
  +   return 0;
  +#endif
  +
  if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
  tsk_cpus_allowed(p))
   && se->avg.load_avg_ratio > hmp_up_threshold) {
  @@ -5507,6 +5516,12 @@ static unsigned int hmp_down_migration(int cpu, 
  struct sched_entity *se)
  if (hmp_cpu_is_slowest(cpu))
  return 0;
 
  +#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
  +   /* Filter by task priority */
  +   if (p->prio >= hmp_up_prio)
  +   return 1;
  +#endif
 
 Even if below cpumask_intersects() fails?
 

No. Good catch :)

  if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
  tsk_cpus_allowed(p))
   && se->avg.load_avg_ratio < hmp_down_threshold) {
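
For what it is worth, a sketch of the corrected check (illustration only, not the code that was eventually applied): keep the affinity test authoritative and apply the priority filter inside it, so a low-priority task is only reported as down-migratable when the slower domain actually intersects its cpus_allowed:

	if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
				tsk_cpus_allowed(p))) {
#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
		/* Low-priority tasks always go (or stay) down */
		if (p->prio >= hmp_up_prio)
			return 1;
#endif
		if (se->avg.load_avg_ratio < hmp_down_threshold)
			return 1;
	}

	return 0;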
 
 --
 viresh
 

Thanks,
Morten


___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev


[RFC PATCH 09/10] sched: Add HMP task migration ftrace event

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

Adds ftrace event for tracing task migrations using HMP
optimized scheduling.
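
With the event enabled, each migration decision shows up in the trace buffer roughly as a line of the following form (field layout taken from the TP_printk() below; the comm, pid and cpu values are invented for illustration):

    sched_hmp_migrate: comm=SurfaceFlinger pid=1234 dest=1 force=0

dest is the destination cpu; force is 1 only for migrations initiated by hmp_force_up_migration(), and 0 for wake-up placement via select_task_rq_fair().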

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 include/trace/events/sched.h |   28 
 kernel/sched/fair.c  |   15 +++
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 847eb76..501aa32 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -555,6 +555,34 @@ TRACE_EVENT(sched_task_usage_ratio,
__entry-comm, __entry-pid,
__entry-ratio)
 );
+
+/*
+ * Tracepoint for HMP (CONFIG_SCHED_HMP) task migrations.
+ */
+TRACE_EVENT(sched_hmp_migrate,
+
+   TP_PROTO(struct task_struct *tsk, int dest, int force),
+
+   TP_ARGS(tsk, dest, force),
+
+   TP_STRUCT__entry(
+   __array(char, comm, TASK_COMM_LEN)
+   __field(pid_t, pid)
+   __field(int,  dest)
+   __field(int,  force)
+   ),
+
+   TP_fast_assign(
+   memcpy(__entry-comm, tsk-comm, TASK_COMM_LEN);
+   __entry-pid   = tsk-pid;
+   __entry-dest  = dest;
+   __entry-force = force;
+   ),
+
+   TP_printk(comm=%s pid=%d dest=%d force=%d,
+   __entry-comm, __entry-pid,
+   __entry-dest, __entry-force)
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0be53be..811b2b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -,10 +,16 @@ unlock:
rcu_read_unlock();
 
 #ifdef CONFIG_SCHED_HMP
-   if (hmp_up_migration(prev_cpu, p-se))
-   return hmp_select_faster_cpu(p, prev_cpu);
-   if (hmp_down_migration(prev_cpu, p-se))
-   return hmp_select_slower_cpu(p, prev_cpu);
+   if (hmp_up_migration(prev_cpu, p-se)) {
+   new_cpu = hmp_select_faster_cpu(p, prev_cpu);
+   trace_sched_hmp_migrate(p, new_cpu, 0);
+   return new_cpu;
+   }
+   if (hmp_down_migration(prev_cpu, p-se)) {
+   new_cpu = hmp_select_slower_cpu(p, prev_cpu);
+   trace_sched_hmp_migrate(p, new_cpu, 0);
+   return new_cpu;
+   }
/* Make sure that the task stays in its previous hmp domain */
if (!cpumask_test_cpu(new_cpu, hmp_cpu_domain(prev_cpu)-cpus))
return prev_cpu;
@@ -5718,6 +5724,7 @@ static void hmp_force_up_migration(int this_cpu)
target-push_cpu = hmp_select_faster_cpu(p, 
cpu);
target-migrate_task = p;
force = 1;
+   trace_sched_hmp_migrate(p, target-push_cpu, 1);
}
}
raw_spin_unlock_irqrestore(target-lock, flags);
-- 
1.7.9.5



___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev


[RFC PATCH 05/10] ARM: Add HMP scheduling support for ARM architecture

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

Adds Kconfig entries to enable HMP scheduling on ARM platforms.
Currently, it disables CPU level sched_domain load-balancing in order
to simplify things. This needs fixing in a later revision. HMP
scheduling will do the load-balancing at this level instead.

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 arch/arm/Kconfig|   14 ++
 arch/arm/include/asm/topology.h |   32 
 2 files changed, 46 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 05de193..cb80846 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1584,6 +1584,20 @@ config SCHED_HMP_PRIO_FILTER_VAL
default 5
depends on SCHED_HMP_PRIO_FILTER
 
+config HMP_FAST_CPU_MASK
+   string HMP scheduler fast CPU mask
+   depends on SCHED_HMP
+   help
+  Specify the cpuids of the fast CPUs in the system as a list string,
+ e.g. cpuid 0+1 should be specified as 0-1.
+
+config HMP_SLOW_CPU_MASK
+   string HMP scheduler slow CPU mask
+   depends on SCHED_HMP
+   help
+ Specify the cpuids of the slow CPUs in the system as a list string,
+ e.g. cpuid 0+1 should be specified as 0-1.
+
 config HAVE_ARM_SCU
bool
help
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 58b8b84..13a03de 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -27,6 +27,38 @@ void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
 
+#ifdef CONFIG_DISABLE_CPU_SCHED_DOMAIN_BALANCE
+/* Common values for CPUs */
+#ifndef SD_CPU_INIT
+#define SD_CPU_INIT (struct sched_domain) {\
+   .min_interval   = 1,\
+   .max_interval   = 4,\
+   .busy_factor= 64,   \
+   .imbalance_pct  = 125,  \
+   .cache_nice_tries   = 1,\
+   .busy_idx   = 2,\
+   .idle_idx   = 1,\
+   .newidle_idx= 0,\
+   .wake_idx   = 0,\
+   .forkexec_idx   = 0,\
+   \
+   .flags  = 0*SD_LOAD_BALANCE \
+   | 1*SD_BALANCE_NEWIDLE  \
+   | 1*SD_BALANCE_EXEC \
+   | 1*SD_BALANCE_FORK \
+   | 0*SD_BALANCE_WAKE \
+   | 1*SD_WAKE_AFFINE  \
+   | 0*SD_PREFER_LOCAL \
+   | 0*SD_SHARE_CPUPOWER   \
+   | 0*SD_SHARE_PKG_RESOURCES  \
+   | 0*SD_SERIALIZE\
+   ,   \
+   .last_balance= jiffies, \
+   .balance_interval   = 1,\
+}
+#endif
+#endif /* CONFIG_DISABLE_CPU_SCHED_DOMAIN_BALANCE */
+
 #else
 
 static inline void init_cpu_topology(void) { }
-- 
1.7.9.5



___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev


[RFC PATCH 08/10] sched: Add ftrace events for entity load-tracking

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

Adds ftrace events for key variables related to the entity
load-tracking to help debugging scheduler behaviour. Allows tracing
of load contribution and runqueue residency ratio for both entities
and runqueues as well as entity CPU usage ratio.

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 include/trace/events/sched.h |  125 ++
 kernel/sched/fair.c  |7 +++
 2 files changed, 132 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 5a8671e..847eb76 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -430,6 +430,131 @@ TRACE_EVENT(sched_pi_setprio,
__entry-oldprio, __entry-newprio)
 );
 
+/*
+ * Tracepoint for showing tracked load contribution.
+ */
+TRACE_EVENT(sched_task_load_contrib,
+
+   TP_PROTO(struct task_struct *tsk, unsigned long load_contrib),
+
+   TP_ARGS(tsk, load_contrib),
+
+   TP_STRUCT__entry(
+   __array(char, comm, TASK_COMM_LEN)
+   __field(pid_t, pid)
+   __field(unsigned long, load_contrib)
+   ),
+
+   TP_fast_assign(
+   memcpy(__entry-comm, tsk-comm, TASK_COMM_LEN);
+   __entry-pid= tsk-pid;
+   __entry-load_contrib   = load_contrib;
+   ),
+
+   TP_printk(comm=%s pid=%d load_contrib=%lu,
+   __entry-comm, __entry-pid,
+   __entry-load_contrib)
+);
+
+/*
+ * Tracepoint for showing tracked task runnable ratio [0..1023].
+ */
+TRACE_EVENT(sched_task_runnable_ratio,
+
+   TP_PROTO(struct task_struct *tsk, unsigned long ratio),
+
+   TP_ARGS(tsk, ratio),
+
+   TP_STRUCT__entry(
+   __array(char, comm, TASK_COMM_LEN)
+   __field(pid_t, pid)
+   __field(unsigned long, ratio)
+   ),
+
+   TP_fast_assign(
+   memcpy(__entry-comm, tsk-comm, TASK_COMM_LEN);
+   __entry-pid   = tsk-pid;
+   __entry-ratio = ratio;
+   ),
+
+   TP_printk(comm=%s pid=%d ratio=%lu,
+   __entry-comm, __entry-pid,
+   __entry-ratio)
+);
+
+/*
+ * Tracepoint for showing tracked rq runnable ratio [0..1023].
+ */
+TRACE_EVENT(sched_rq_runnable_ratio,
+
+   TP_PROTO(int cpu, unsigned long ratio),
+
+   TP_ARGS(cpu, ratio),
+
+   TP_STRUCT__entry(
+   __field(int, cpu)
+   __field(unsigned long, ratio)
+   ),
+
+   TP_fast_assign(
+   __entry-cpu   = cpu;
+   __entry-ratio = ratio;
+   ),
+
+   TP_printk(cpu=%d ratio=%lu,
+   __entry-cpu,
+   __entry-ratio)
+);
+
+/*
+ * Tracepoint for showing tracked rq runnable load.
+ */
+TRACE_EVENT(sched_rq_runnable_load,
+
+   TP_PROTO(int cpu, u64 load),
+
+   TP_ARGS(cpu, load),
+
+   TP_STRUCT__entry(
+   __field(int, cpu)
+   __field(u64, load)
+   ),
+
+   TP_fast_assign(
+   __entry-cpu  = cpu;
+   __entry-load = load;
+   ),
+
+   TP_printk(cpu=%d load=%llu,
+   __entry-cpu,
+   __entry-load)
+);
+
+/*
+ * Tracepoint for showing tracked task cpu usage ratio [0..1023].
+ */
+TRACE_EVENT(sched_task_usage_ratio,
+
+   TP_PROTO(struct task_struct *tsk, unsigned long ratio),
+
+   TP_ARGS(tsk, ratio),
+
+   TP_STRUCT__entry(
+   __array(char, comm, TASK_COMM_LEN)
+   __field(pid_t, pid)
+   __field(unsigned long, ratio)
+   ),
+
+   TP_fast_assign(
+   memcpy(__entry-comm, tsk-comm, TASK_COMM_LEN);
+   __entry-pid   = tsk-pid;
+   __entry-ratio = ratio;
+   ),
+
+   TP_printk(comm=%s pid=%d ratio=%lu,
+   __entry-comm, __entry-pid,
+   __entry-ratio)
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f0f3b9..0be53be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1192,9 +1192,11 @@ static inline void __update_task_entity_contrib(struct 
sched_entity *se)
contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
contrib /= (se->avg.runnable_avg_period + 1);
se->avg.load_avg_contrib = scale_load(contrib);
+   trace_sched_task_load_contrib(task_of(se), se->avg.load_avg_contrib);
contrib = se->avg.runnable_avg_sum * scale_load_down(NICE_0_LOAD);
contrib /= (se->avg.runnable_avg_period + 1);
se->avg.load_avg_ratio = scale_load(contrib);
+   trace_sched_task_runnable_ratio(task_of(se), se->avg.load_avg_ratio);
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
@@ -1286,9 +1288,14 @@ static void update_cfs_rq_blocked_load(struct cfs_rq 
*cfs_rq, int

[RFC PATCH 00/10] sched: Task placement for heterogeneous MP systems

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

Hi Paul, Paul, Peter, Suresh, linaro-sched-sig, and LKML,

As a follow-up on my Linux Plumbers Conference talk about my experiments with
scheduling on heterogeneous systems I'm posting a proof-of-concept patch set
with my modifications. The intention behind the modifications is to tweak
scheduling behaviour to only use fast (and power hungry) cores when it is
necessary and also improve performance consistency. Without the modifications
it is more or less random where tasks are scheduled and so is the execution
time.

I'm seeing good improvements on performance consistency for web browsing on
Android using Bbench http://www.gem5.org/Bbench on the ARM big.LITTLE TC2
chip, which has two fast cores (Cortex-A15) and three power-efficient cores
(Cortex-A7). The total execution time numbers below are for Android's
SurfaceFlinger process, which is key for page rendering performance. The average
execution time is lower with the patches enabled and the standard deviation is
much smaller. Similar improvements can be seen for the Android.Browser and
WebViewCoreThread processes.

Total execution time statistics based on 50 runs.

SurfaceFlinger     SMP kernel [s]    HMP modifications [s]
----------------------------------------------------------
Average                    14.617                   11.012
St. Dev.                    4.577                    0.902
10% Pctl.                   9.343                   10.783
90% Pctl.                  18.743                   11.695

Unfortunately, I cannot share power-efficiency numbers at this stage.

This patch set introduces proof-of-concept scheduler modifications which
attempt to improve scheduling decisions on heterogeneous multi-processor
systems (HMP) such as ARM big.LITTLE systems. The patch set relies on the
entity load-tracking re-work patch set by Paul Turner:

https://lkml.org/lkml/2012/8/23/267

The modifications attempt to migrate tasks between cores with different
compute capacity depending on the tracked load and priority. The aim is
to only use fast cores for tasks which really need the extra performance
and thereby improve power consumption by running everything else on the
slow cores.

The patch introduces hmp_domains to represent the different types of cores
that are available on the given platform. Multiple (>2) hmp_domains are
supported but not tested. hmp_domains must be set up by platform code and
the patch set includes patches for ARM platforms using device-tree.

The patches intentionally try to avoid modifying the existing code paths
as much as possible. The aim is to experiment with HMP scheduling and get
the overall policy right before integrating it properly with the existing
load-balancer.

Morten

Morten Rasmussen (10):
  sched: entity load-tracking load_avg_ratio
  sched: Task placement for heterogeneous systems based on task
load-tracking
  sched: Forced task migration on heterogeneous systems
  sched: Introduce priority-based task migration filter
  ARM: Add HMP scheduling support for ARM architecture
  ARM: sched: Use device-tree to provide fast/slow CPU list for HMP
  ARM: sched: Setup SCHED_HMP domains
  sched: Add ftrace events for entity load-tracking
  sched: Add HMP task migration ftrace event
  sched: SCHED_HMP multi-domain task migration control

 arch/arm/Kconfig|   46 +
 arch/arm/include/asm/topology.h |   32 +++
 arch/arm/kernel/topology.c  |   91 
 include/linux/sched.h   |   11 +
 include/trace/events/sched.h|  153 ++
 kernel/sched/core.c |4 +
 kernel/sched/fair.c |  434 ++-
 kernel/sched/sched.h|9 +
 8 files changed, 779 insertions(+), 1 deletion(-)

-- 
1.7.9.5



___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev


[RFC PATCH 07/10] ARM: sched: Setup SCHED_HMP domains

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

SCHED_HMP requires the different cpu types to be represented by an
ordered list of hmp_domains. Each hmp_domain represents all cpus of
a particular type using a cpumask.

The list is platform specific and therefore must be generated by
platform code by implementing arch_get_hmp_domains().

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 arch/arm/kernel/topology.c |   22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 7682e12..ec8ad5c 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -383,6 +383,28 @@ void __init arch_get_fast_and_slow_cpus(struct cpumask 
*fast,
cpumask_clear(slow);
 }
 
+void __init arch_get_hmp_domains(struct list_head *hmp_domains_list)
+{
+   struct cpumask hmp_fast_cpu_mask;
+   struct cpumask hmp_slow_cpu_mask;
+   struct hmp_domain *domain;
+
+   arch_get_fast_and_slow_cpus(hmp_fast_cpu_mask, hmp_slow_cpu_mask);
+
+   /*
+* Initialize hmp_domains
+* Must be ordered with respect to compute capacity.
+* Fastest domain at head of list.
+*/
+   domain = (struct hmp_domain *)
+   kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
+   cpumask_copy(&domain->cpus, &hmp_slow_cpu_mask);
+   list_add(&domain->hmp_domains, hmp_domains_list);
+   domain = (struct hmp_domain *)
+   kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
+   cpumask_copy(&domain->cpus, &hmp_fast_cpu_mask);
+   list_add(&domain->hmp_domains, hmp_domains_list);
+}
 #endif /* CONFIG_SCHED_HMP */
 
 
-- 
1.7.9.5



___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev


[RFC PATCH 06/10] ARM: sched: Use device-tree to provide fast/slow CPU list for HMP

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

We can't rely on Kconfig options to set the fast and slow CPU lists for
HMP scheduling if we want a single kernel binary to support multiple
devices with different CPU topology. E.g. TC2 (ARM's Test-Chip-2
big.LITTLE system), Fast Models, or even non big.LITTLE devices.

This patch adds the function arch_get_fast_and_slow_cpus() to generate
the lists at run-time by parsing the CPU nodes in device-tree; it
assumes slow cores are A7s and everything else is fast. The function
still supports the old Kconfig options as this is useful for testing the
HMP scheduler on devices without big.LITTLE.

This patch is reuse of a patch by Jon Medhurst t...@linaro.org with a
few bits left out.

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 arch/arm/Kconfig   |4 ++-
 arch/arm/kernel/topology.c |   69 
 2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index cb80846..f1271bc 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1588,13 +1588,15 @@ config HMP_FAST_CPU_MASK
string HMP scheduler fast CPU mask
depends on SCHED_HMP
help
-  Specify the cpuids of the fast CPUs in the system as a list string,
+  Leave empty to use device tree information.
+ Specify the cpuids of the fast CPUs in the system as a list string,
  e.g. cpuid 0+1 should be specified as 0-1.
 
 config HMP_SLOW_CPU_MASK
string HMP scheduler slow CPU mask
depends on SCHED_HMP
help
+ Leave empty to use device tree information.
  Specify the cpuids of the slow CPUs in the system as a list string,
  e.g. cpuid 0+1 should be specified as 0-1.
 
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 26c12c6..7682e12 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -317,6 +317,75 @@ void store_cpu_topology(unsigned int cpuid)
cpu_topology[cpuid].socket_id, mpidr);
 }
 
+
+#ifdef CONFIG_SCHED_HMP
+
+static const char * const little_cores[] = {
+   "arm,cortex-a7",
+   NULL,
+};
+
+static bool is_little_cpu(struct device_node *cn)
+{
+   const char * const *lc;
+   for (lc = little_cores; *lc; lc++)
+   if (of_device_is_compatible(cn, *lc))
+   return true;
+   return false;
+}
+
+void __init arch_get_fast_and_slow_cpus(struct cpumask *fast,
+   struct cpumask *slow)
+{
+   struct device_node *cn = NULL;
+   int cpu = 0;
+
+   cpumask_clear(fast);
+   cpumask_clear(slow);
+
+   /*
+* Use the config options if they are given. This helps testing
+* HMP scheduling on systems without a big.LITTLE architecture.
+*/
+   if (strlen(CONFIG_HMP_FAST_CPU_MASK) && strlen(CONFIG_HMP_SLOW_CPU_MASK)) {
+   if (cpulist_parse(CONFIG_HMP_FAST_CPU_MASK, fast))
+   WARN(1, "Failed to parse HMP fast cpu mask!\n");
+   if (cpulist_parse(CONFIG_HMP_SLOW_CPU_MASK, slow))
+   WARN(1, "Failed to parse HMP slow cpu mask!\n");
+   return;
+   }
+
+   /*
+* Else, parse device tree for little cores.
+*/
+   while ((cn = of_find_node_by_type(cn, "cpu"))) {
+
+   if (cpu >= num_possible_cpus())
+   break;
+
+   if (is_little_cpu(cn))
+   cpumask_set_cpu(cpu, slow);
+   else
+   cpumask_set_cpu(cpu, fast);
+
+   cpu++;
+   }
+
+   if (!cpumask_empty(fast) && !cpumask_empty(slow))
+   return;
+
+   /*
+* We didn't find both big and little cores so let's call all cores
+* fast as this will keep the system running, with all cores being
+* treated equal.
+*/
+   cpumask_setall(fast);
+   cpumask_clear(slow);
+}
+
+#endif /* CONFIG_SCHED_HMP */
+
+
 /*
  * init_cpu_topology is called at boot when only one cpu is running
  * which prevent simultaneous write access to cpu_topology array
-- 
1.7.9.5



___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev


[RFC PATCH 04/10] sched: Introduce priority-based task migration filter

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

Introduces a priority threshold which prevents low priority tasks
from migrating to faster hmp_domains (cpus). This is useful for
user-space software which assigns lower task priority to background
tasks.

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 arch/arm/Kconfig|   13 +
 kernel/sched/fair.c |   15 +++
 2 files changed, 28 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 5b09684..05de193 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1571,6 +1571,19 @@ config SCHED_HMP
  !SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled
  between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
 
+config SCHED_HMP_PRIO_FILTER
+   bool (EXPERIMENTAL) Filter HMP migrations by task priority
+   depends on SCHED_HMP
+   help
+ Enables task priority based HMP migration filter. Any task with
+ a NICE value above the threshold will always be on low-power cpus
+ with less compute capacity.
+
+config SCHED_HMP_PRIO_FILTER_VAL
+   int NICE priority threshold
+   default 5
+   depends on SCHED_HMP_PRIO_FILTER
+
 config HAVE_ARM_SCU
bool
help
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 490f1f0..8f0f3b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3129,9 +3129,12 @@ static int __init hmp_cpu_mask_setup(void)
  * hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
  * The default values (512, 256) offer good responsiveness, but may need
  * tweaking suit particular needs.
+ *
+ * hmp_up_prio: Only up migrate task with high priority (hmp_up_prio)
  */
 unsigned int hmp_up_threshold = 512;
 unsigned int hmp_down_threshold = 256;
+unsigned int hmp_up_prio = NICE_TO_PRIO(CONFIG_SCHED_HMP_PRIO_FILTER_VAL);
 
 static unsigned int hmp_up_migration(int cpu, struct sched_entity *se);
 static unsigned int hmp_down_migration(int cpu, struct sched_entity *se);
@@ -5491,6 +5494,12 @@ static unsigned int hmp_up_migration(int cpu, struct 
sched_entity *se)
if (hmp_cpu_is_fastest(cpu))
return 0;
 
+#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
+   /* Filter by task priority */
+   if (p->prio >= hmp_up_prio)
+   return 0;
+#endif
+
if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
tsk_cpus_allowed(p))
 && se->avg.load_avg_ratio > hmp_up_threshold) {
@@ -5507,6 +5516,12 @@ static unsigned int hmp_down_migration(int cpu, struct 
sched_entity *se)
if (hmp_cpu_is_slowest(cpu))
return 0;
 
+#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
+   /* Filter by task priority */
+   if (p->prio >= hmp_up_prio)
+   return 1;
+#endif
+
if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
tsk_cpus_allowed(p))
 && se->avg.load_avg_ratio < hmp_down_threshold) {
-- 
1.7.9.5



___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev


[RFC PATCH 03/10] sched: Forced task migration on heterogeneous systems

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

This patch introduces forced task migration for moving suitable
currently running tasks between hmp_domains. Task behaviour is likely
to change over time. Tasks running in a less capable hmp_domain may
change to become more demanding and should therefore be migrated up.
They are unlikely to go through the select_task_rq_fair() path anytime
soon and therefore need special attention.

This patch introduces a periodic check (SCHED_TICK) of the currently
running task on all runqueues and sets up a forced migration using
stop_machine_no_wait() if the task needs to be migrated.

Ideally, this should not be implemented by polling all runqueues.

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 kernel/sched/fair.c  |  196 +-
 kernel/sched/sched.h |3 +
 2 files changed, 198 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d80de46..490f1f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3744,7 +3744,6 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 * 1) task is cache cold, or
 * 2) too many balance attempts have failed.
 */
-
tsk_cache_hot = task_hot(p, env-src_rq-clock_task, env-sd);
if (!tsk_cache_hot ||
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
@@ -5516,6 +5515,199 @@ static unsigned int hmp_down_migration(int cpu, struct 
sched_entity *se)
return 0;
 }
 
+/*
+ * hmp_can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
+ * Ideally this function should be merged with can_migrate_task() to avoid
+ * redundant code.
+ */
+static int hmp_can_migrate_task(struct task_struct *p, struct lb_env *env)
+{
+   int tsk_cache_hot = 0;
+
+   /*
+* We do not migrate tasks that are:
+* 1) running (obviously), or
+* 2) cannot be migrated to this CPU due to cpus_allowed
+*/
+   if (!cpumask_test_cpu(env-dst_cpu, tsk_cpus_allowed(p))) {
+   schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+   return 0;
+   }
+   env->flags &= ~LBF_ALL_PINNED;
+
+   if (task_running(env-src_rq, p)) {
+   schedstat_inc(p, se.statistics.nr_failed_migrations_running);
+   return 0;
+   }
+
+   /*
+* Aggressive migration if:
+* 1) task is cache cold, or
+* 2) too many balance attempts have failed.
+*/
+
+   tsk_cache_hot = task_hot(p, env-src_rq-clock_task, env-sd);
+   if (!tsk_cache_hot ||
+   env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
+#ifdef CONFIG_SCHEDSTATS
+   if (tsk_cache_hot) {
+   schedstat_inc(env-sd, lb_hot_gained[env-idle]);
+   schedstat_inc(p, se.statistics.nr_forced_migrations);
+   }
+#endif
+   return 1;
+   }
+
+   return 1;
+}
+
+/*
+ * move_specific_task tries to move a specific task.
+ * Returns 1 if successful and 0 otherwise.
+ * Called with both runqueues locked.
+ */
+static int move_specific_task(struct lb_env *env, struct task_struct *pm)
+{
+   struct task_struct *p, *n;
+
+   list_for_each_entry_safe(p, n, env-src_rq-cfs_tasks, se.group_node) {
+   if (throttled_lb_pair(task_group(p), env-src_rq-cpu,
+   env-dst_cpu))
+   continue;
+
+   if (!hmp_can_migrate_task(p, env))
+   continue;
+   /* Check if we found the right task */
+   if (p != pm)
+   continue;
+
+   move_task(p, env);
+   /*
+* Right now, this is only the third place move_task()
+* is called, so we can safely collect move_task()
+* stats here rather than inside move_task().
+*/
+   schedstat_inc(env-sd, lb_gained[env-idle]);
+   return 1;
+   }
+   return 0;
+}
+
+/*
+ * hmp_active_task_migration_cpu_stop is run by cpu stopper and used to
+ * migrate a specific task from one runqueue to another.
+ * hmp_force_up_migration uses this to push a currently running task
+ * off a runqueue.
+ * Based on active_load_balance_stop_cpu and can potentially be merged.
+ */
+static int hmp_active_task_migration_cpu_stop(void *data)
+{
+   struct rq *busiest_rq = data;
+   struct task_struct *p = busiest_rq-migrate_task;
+   int busiest_cpu = cpu_of(busiest_rq);
+   int target_cpu = busiest_rq-push_cpu;
+   struct rq *target_rq = cpu_rq(target_cpu);
+   struct sched_domain *sd;
+
+   raw_spin_lock_irq(busiest_rq-lock);
+   /* make sure the requested cpu hasn't gone down in the meantime */
+   if (unlikely(busiest_cpu != smp_processor_id() ||
+   !busiest_rq-active_balance)) {
+   goto out_unlock;
+   }
+   /* Is there any

[RFC PATCH 02/10] sched: Task placement for heterogeneous systems based on task load-tracking

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

This patch introduces the basic SCHED_HMP infrastructure. Each class of
cpus is represented by a hmp_domain and tasks will only be moved between
these domains when their load profiles suggest it is beneficial.

SCHED_HMP relies heavily on the task load-tracking introduced in Paul
Turners fair group scheduling patch set:

https://lkml.org/lkml/2012/8/23/267

SCHED_HMP requires that the platform implements arch_get_hmp_domains()
which should set up the platform specific list of hmp_domains. It is
also assumed that the platform disables SD_LOAD_BALANCE for the
appropriate sched_domains.
Tasks placement takes place every time a task is to be inserted into
a runqueue based on its load history. The task placement decision is
based on load thresholds.

There are no restrictions on the number of hmp_domains, however,
multiple (>2) has not been tested and the up/down migration policy is
rather simple.

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 arch/arm/Kconfig  |   17 +
 include/linux/sched.h |6 ++
 kernel/sched/fair.c   |  168 +
 kernel/sched/sched.h  |6 ++
 4 files changed, 197 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f4a5d58..5b09684 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1554,6 +1554,23 @@ config SCHED_SMT
  MultiThreading at a cost of slightly increased overhead in some
  places. If unsure say N here.
 
+config DISABLE_CPU_SCHED_DOMAIN_BALANCE
+   bool (EXPERIMENTAL) Disable CPU level scheduler load-balancing
+   help
+ Disables scheduler load-balancing at CPU sched domain level.
+
+config SCHED_HMP
+   bool (EXPERIMENTAL) Heterogenous multiprocessor scheduling
+   depends on DISABLE_CPU_SCHED_DOMAIN_BALANCE && SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP
+   help
+ Experimental scheduler optimizations for heterogeneous platforms.
+ Attempts to introspectively select task affinity to optimize power
+ and performance. Basic support for multiple (>2) cpu types is in place,
+ but it has only been tested with two types of cpus.
+ There is currently no support for migration of task groups, hence
+ !SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled
+ between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
+
 config HAVE_ARM_SCU
bool
help
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 81e4e82..df971a3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1039,6 +1039,12 @@ unsigned long default_scale_smt_power(struct 
sched_domain *sd, int cpu);
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
 
+#ifdef CONFIG_SCHED_HMP
+struct hmp_domain {
+   struct cpumask cpus;
+   struct list_head hmp_domains;
+};
+#endif /* CONFIG_SCHED_HMP */
 #else /* CONFIG_SMP */
 
 struct sched_domain_attr;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e17dd5..d80de46 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3077,6 +3077,125 @@ static int select_idle_sibling(struct task_struct *p, 
int target)
return target;
 }
 
+#ifdef CONFIG_SCHED_HMP
+/*
+ * Heterogenous multiprocessor (HMP) optimizations
+ *
+ * The cpu types are distinguished using a list of hmp_domains
+ * which each represent one cpu type using a cpumask.
+ * The list is assumed ordered by compute capacity with the
+ * fastest domain first.
+ */
+DEFINE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
+
+extern void __init arch_get_hmp_domains(struct list_head *hmp_domains_list);
+
+/* Setup hmp_domains */
+static int __init hmp_cpu_mask_setup(void)
+{
+   char buf[64];
+   struct hmp_domain *domain;
+   struct list_head *pos;
+   int dc, cpu;
+
+   pr_debug("Initializing HMP scheduler:\n");
+
+   /* Initialize hmp_domains using platform code */
+   arch_get_hmp_domains(&hmp_domains);
+   if (list_empty(&hmp_domains)) {
+   pr_debug("HMP domain list is empty!\n");
+   return 0;
+   }
+
+   /* Print hmp_domains */
+   dc = 0;
+   list_for_each(pos, &hmp_domains) {
+   domain = list_entry(pos, struct hmp_domain, hmp_domains);
+   cpulist_scnprintf(buf, 64, &domain->cpus);
+   pr_debug("  HMP domain %d: %s\n", dc, buf);
+
+   for_each_cpu_mask(cpu, domain->cpus) {
+   per_cpu(hmp_cpu_domain, cpu) = domain;
+   }
+   dc++;
+   }
+
+   return 1;
+}
+
+/*
+ * Migration thresholds should be in the range [0..1023]
+ * hmp_up_threshold: min. load required for migrating tasks to a faster cpu
+ * hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
+ * The default values (512, 256) offer good responsiveness, but may need
+ * tweaking suit particular needs.
+ */
+unsigned int hmp_up_threshold = 512;
+unsigned

[RFC PATCH 10/10] sched: SCHED_HMP multi-domain task migration control

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

We need a way to prevent tasks that are migrating up and down the
hmp_domains from migrating straight on through before the load has
adapted to the new compute capacity of the CPU on the new hmp_domain.
This patch adds a next up/down migration delay that prevents the task
from doing another migration in the same direction until the delay
has expired.

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 include/linux/sched.h |4 
 kernel/sched/core.c   |4 
 kernel/sched/fair.c   |   38 ++
 3 files changed, 46 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index df971a3..ca3890a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1158,6 +1158,10 @@ struct sched_avg {
s64 decay_count;
unsigned long load_avg_contrib;
unsigned long load_avg_ratio;
+#ifdef CONFIG_SCHED_HMP
+   u64 hmp_last_up_migration;
+   u64 hmp_last_down_migration;
+#endif
u32 usage_avg_sum;
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 652b86b..a3b1ff6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1723,6 +1723,10 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP)  defined(CONFIG_FAIR_GROUP_SCHED)
p-se.avg.runnable_avg_period = 0;
p-se.avg.runnable_avg_sum = 0;
+#ifdef CONFIG_SCHED_HMP
+   p-se.avg.hmp_last_up_migration = 0;
+   p-se.avg.hmp_last_down_migration = 0;
+#endif
 #endif
 #ifdef CONFIG_SCHEDSTATS
memset(p-se.statistics, 0, sizeof(p-se.statistics));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 811b2b9..56cbda1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3138,10 +3138,14 @@ static int __init hmp_cpu_mask_setup(void)
  * tweaking suit particular needs.
  *
  * hmp_up_prio: Only up migrate task with high priority (hmp_up_prio)
+ * hmp_next_up_threshold: Delay before next up migration (1024 ~= 1 ms)
+ * hmp_next_down_threshold: Delay before next down migration (1024 ~= 1 ms)
  */
 unsigned int hmp_up_threshold = 512;
 unsigned int hmp_down_threshold = 256;
 unsigned int hmp_up_prio = NICE_TO_PRIO(CONFIG_SCHED_HMP_PRIO_FILTER_VAL);
+unsigned int hmp_next_up_threshold = 4096;
+unsigned int hmp_next_down_threshold = 4096;
 
 static unsigned int hmp_up_migration(int cpu, struct sched_entity *se);
 static unsigned int hmp_down_migration(int cpu, struct sched_entity *se);
@@ -3204,6 +3208,21 @@ static inline unsigned int hmp_select_slower_cpu(struct 
task_struct *tsk,
tsk_cpus_allowed(tsk));
 }
 
+static inline void hmp_next_up_delay(struct sched_entity *se, int cpu)
+{
+   struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+
+   se->avg.hmp_last_up_migration = cfs_rq_clock_task(cfs_rq);
+   se->avg.hmp_last_down_migration = 0;
+}
+
+static inline void hmp_next_down_delay(struct sched_entity *se, int cpu)
+{
+   struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+
+   se->avg.hmp_last_down_migration = cfs_rq_clock_task(cfs_rq);
+   se->avg.hmp_last_up_migration = 0;
+}
 #endif /* CONFIG_SCHED_HMP */
 
 /*
@@ -3335,11 +3354,13 @@ unlock:
 #ifdef CONFIG_SCHED_HMP
if (hmp_up_migration(prev_cpu, p-se)) {
new_cpu = hmp_select_faster_cpu(p, prev_cpu);
+   hmp_next_up_delay(p-se, new_cpu);
trace_sched_hmp_migrate(p, new_cpu, 0);
return new_cpu;
}
if (hmp_down_migration(prev_cpu, p-se)) {
new_cpu = hmp_select_slower_cpu(p, prev_cpu);
+   hmp_next_down_delay(p-se, new_cpu);
trace_sched_hmp_migrate(p, new_cpu, 0);
return new_cpu;
}
@@ -5503,6 +5524,8 @@ static void nohz_idle_balance(int this_cpu, enum 
cpu_idle_type idle) { }
 static unsigned int hmp_up_migration(int cpu, struct sched_entity *se)
 {
struct task_struct *p = task_of(se);
+   struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+   u64 now;
 
if (hmp_cpu_is_fastest(cpu))
return 0;
@@ -5513,6 +5536,12 @@ static unsigned int hmp_up_migration(int cpu, struct 
sched_entity *se)
return 0;
 #endif
 
+   /* Let the task load settle before doing another up migration */
+   now = cfs_rq_clock_task(cfs_rq);
+   if (((now - se->avg.hmp_last_up_migration) >> 10)
+    < hmp_next_up_threshold)
+   return 0;
+
if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
tsk_cpus_allowed(p))
 && se->avg.load_avg_ratio > hmp_up_threshold) {
@@ -5525,6 +5554,8 @@ static unsigned int hmp_up_migration(int cpu, struct 
sched_entity *se)
 static unsigned int hmp_down_migration(int cpu, struct sched_entity *se)
 {
struct task_struct *p = task_of(se);
+   struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+   u64 now;
 
if (hmp_cpu_is_slowest(cpu

[RFC PATCH 01/10] sched: entity load-tracking load_avg_ratio

2012-09-21 Thread morten . rasmussen
From: Morten Rasmussen morten.rasmus...@arm.com

This patch adds load_avg_ratio to each task. The load_avg_ratio is a
variant of load_avg_contrib which is not scaled by the task priority. It
is calculated like this:

runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1).
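
For a rough feel for the scale (illustration only): on the [0..1023] range used by the HMP migration thresholds later in the series, a task that has been runnable for about half of its tracked period ends up with a load_avg_ratio of roughly 512 regardless of its nice level, which happens to be the default hmp_up_threshold, i.e. such a task sits right at the up-migration boundary.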

Signed-off-by: Morten Rasmussen morten.rasmus...@arm.com
---
 include/linux/sched.h |1 +
 kernel/sched/fair.c   |3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4dc4990..81e4e82 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1151,6 +1151,7 @@ struct sched_avg {
u64 last_runnable_update;
s64 decay_count;
unsigned long load_avg_contrib;
+   unsigned long load_avg_ratio;
u32 usage_avg_sum;
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095d86c..3e17dd5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1192,6 +1192,9 @@ static inline void __update_task_entity_contrib(struct 
sched_entity *se)
contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
contrib /= (se->avg.runnable_avg_period + 1);
se->avg.load_avg_contrib = scale_load(contrib);
+   contrib = se->avg.runnable_avg_sum * scale_load_down(NICE_0_LOAD);
+   contrib /= (se->avg.runnable_avg_period + 1);
+   se->avg.load_avg_ratio = scale_load(contrib);
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
-- 
1.7.9.5



___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev


Re: [GIT PULL] bit-LITTLE-MP-v7 - IMPORTANT

2012-09-05 Thread Morten Rasmussen
Hi Viresh,

On Mon, Sep 03, 2012 at 06:21:26AM +0100, Viresh Kumar wrote:
 On 28 August 2012 10:37, Viresh Kumar viresh.ku...@linaro.org wrote:
  I have updated
 
  https://wiki.linaro.org/WorkingGroups/PowerManagement/Process/bigLittleMPTree
 
  as per our last discussion. Please see if i have missed something.
 
 Hi Guys,
 
 I will be sending PULL request of big-LITTLE-MP-v7 today as per schedule.
 Do let me know if you want anything to be included in it before that.
 
 @Morten: What should i do with patch reported by Santosh:
 
 ARM-Add-HMP-scheduling-support-for-ARM-architecture
 
 Do i need to apply it over your branch?

The patch is already in the original patch set, so I'm not sure why it
is missing.

http://linux-arm.org/git?p=arm-bls.git;a=commit;h=1416200dd62551aa9ac4aa207b0c66651ccbff2c

It needs to be there for the HMP scheduling to work.

Regards,
Morten


___
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev