Re: [PATCH] sched: support dynamiQ cluster

2018-04-13 Thread Joel Fernandes (Google)
On Fri, Apr 6, 2018 at 5:58 AM, Morten Rasmussen
 wrote:
> On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On 5 April 2018 at 17:46, Morten Rasmussen  wrote:
>> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> >> On 4 April 2018 at 12:44, Valentin Schneider  
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > On 03/04/18 13:17, Vincent Guittot wrote:
>> >> >> Hi Valentin,
>> >> >>
>> >> > [...]
>> >> >>>
>> >> >>> I believe ASYM_PACKING behaves better here because the workload is 
>> >> >>> only
>> >> >>> sysbench threads. As stated above, since task utilization is 
>> >> >>> disregarded, I
>> >> >>
>> >> >> It behaves better because it doesn't wait for the task's utilization
>> >> >> to reach a level before assuming the task needs high compute capacity.
>> >> >> The utilization gives an idea of the running time of the task not the
>> >> >> performance level that is needed
>> >> >>
>> >> >
>> >> > [
>> >> > That's my point actually. ASYM_PACKING disregards utilization and moves 
>> >> > those
>> >> > threads to the big cores ASAP, which is good here because it's just 
>> >> > sysbench
>> >> > threads.
>> >> >
>> >> > What I meant was that if the task composition changes, IOW we mix 
>> >> > "small"
>> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive 
>> >> > stuff like
>> >> > sysbench threads), we shouldn't assume all of those require to run on a 
>> >> > big
>> >> > CPU. The thing is, ASYM_PACKING can't make the difference between 
>> >> > those, so
>> > [Morten]
>> >>
>> >> That's the 1st point where I tend to disagree: why big cores are only
>> >> for long running task and periodic stuff can't need to run on big
>> >> cores to get max compute capacity ?
>> >> You make the assumption that only long running tasks need high compute
>> >> capacity. This patch wants to always provide max compute capacity to
>> >> the system and not only long running task
>> >
>> > There is no way we can tell if a periodic or short-running tasks
>> > requires the compute capacity of a big core or not based on utilization
>> > alone. The utilization can only tell us if a task could potentially use
>> > more compute capacity, i.e. the utilization approaches the compute
>> > capacity of its current cpu.
>> >
>> > How we handle low utilization tasks comes down to how we define
>> > "performance" and if we care about the cost of "performance" (e.g.
>> > energy consumption).
>> >
>> > Placing a low utilization task on a little cpu should always be fine
>> > from _throughput_ point of view. As long as the cpu has spare cycles it
>>
>> [Vincent]
>> I disagree, throughput is not only a matter of spare cycle it's also a
>> matter of how fast you compute the work like with IO activity as an
>> example
>
> [Morten]
> From a cpu centric point of view it is, but I agree that from a
> application/user point of view completion time might impact throughput
> too. For example of if your throughput depends on how fast you can
> offload work to some peripheral device (GPU for example).
>
> However, as I said in the beginning we don't know what the task does.

[Joel]
Just wanted to say about Vincent point of IO loads throughput -
remembering from when I was playing with the iowait boost stuff, that
- say you have a little task that does some IO and blocks and does so
periodically. In the scenario the task will run for little time and is
a little task by way of looking at utilization. However, if we were to
run it on the BIG CPUs, the overall throughput of the I/O activity
would be higher.

For this case, it seems its impossible to specify the "default"
behavior correctly. Like, do we care about performance or energy more?
This seems more like a policy-decision from userspace and not
something the scheduler should necessarily have to decide. Like if I/O
activity is background and not affecting the user experience.

thanks,

- Joel


Re: [PATCH] sched: support dynamiQ cluster

2018-04-13 Thread Joel Fernandes (Google)
On Fri, Apr 6, 2018 at 5:58 AM, Morten Rasmussen
 wrote:
> On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On 5 April 2018 at 17:46, Morten Rasmussen  wrote:
>> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> >> On 4 April 2018 at 12:44, Valentin Schneider  
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > On 03/04/18 13:17, Vincent Guittot wrote:
>> >> >> Hi Valentin,
>> >> >>
>> >> > [...]
>> >> >>>
>> >> >>> I believe ASYM_PACKING behaves better here because the workload is 
>> >> >>> only
>> >> >>> sysbench threads. As stated above, since task utilization is 
>> >> >>> disregarded, I
>> >> >>
>> >> >> It behaves better because it doesn't wait for the task's utilization
>> >> >> to reach a level before assuming the task needs high compute capacity.
>> >> >> The utilization gives an idea of the running time of the task not the
>> >> >> performance level that is needed
>> >> >>
>> >> >
>> >> > [
>> >> > That's my point actually. ASYM_PACKING disregards utilization and moves 
>> >> > those
>> >> > threads to the big cores ASAP, which is good here because it's just 
>> >> > sysbench
>> >> > threads.
>> >> >
>> >> > What I meant was that if the task composition changes, IOW we mix 
>> >> > "small"
>> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive 
>> >> > stuff like
>> >> > sysbench threads), we shouldn't assume all of those require to run on a 
>> >> > big
>> >> > CPU. The thing is, ASYM_PACKING can't make the difference between 
>> >> > those, so
>> > [Morten]
>> >>
>> >> That's the 1st point where I tend to disagree: why big cores are only
>> >> for long running task and periodic stuff can't need to run on big
>> >> cores to get max compute capacity ?
>> >> You make the assumption that only long running tasks need high compute
>> >> capacity. This patch wants to always provide max compute capacity to
>> >> the system and not only long running task
>> >
>> > There is no way we can tell if a periodic or short-running tasks
>> > requires the compute capacity of a big core or not based on utilization
>> > alone. The utilization can only tell us if a task could potentially use
>> > more compute capacity, i.e. the utilization approaches the compute
>> > capacity of its current cpu.
>> >
>> > How we handle low utilization tasks comes down to how we define
>> > "performance" and if we care about the cost of "performance" (e.g.
>> > energy consumption).
>> >
>> > Placing a low utilization task on a little cpu should always be fine
>> > from _throughput_ point of view. As long as the cpu has spare cycles it
>>
>> [Vincent]
>> I disagree, throughput is not only a matter of spare cycle it's also a
>> matter of how fast you compute the work like with IO activity as an
>> example
>
> [Morten]
> From a cpu centric point of view it is, but I agree that from a
> application/user point of view completion time might impact throughput
> too. For example of if your throughput depends on how fast you can
> offload work to some peripheral device (GPU for example).
>
> However, as I said in the beginning we don't know what the task does.

[Joel]
Just wanted to say about Vincent point of IO loads throughput -
remembering from when I was playing with the iowait boost stuff, that
- say you have a little task that does some IO and blocks and does so
periodically. In the scenario the task will run for little time and is
a little task by way of looking at utilization. However, if we were to
run it on the BIG CPUs, the overall throughput of the I/O activity
would be higher.

For this case, it seems its impossible to specify the "default"
behavior correctly. Like, do we care about performance or energy more?
This seems more like a policy-decision from userspace and not
something the scheduler should necessarily have to decide. Like if I/O
activity is background and not affecting the user experience.

thanks,

- Joel


Re: [PATCH] sched: support dynamiQ cluster

2018-04-13 Thread Vincent Guittot
On 12 April 2018 at 20:22, Peter Zijlstra  wrote:
> On Tue, Apr 10, 2018 at 02:19:50PM +0100, Morten Rasmussen wrote:
>> As said above, I see your point about completion time might suffer in
>> some cases for low utilization tasks, but I don't see how you can fix
>> that automagically. ASYM_PACKING has a lot of problematic side-effects.
>> If use-space knows that completion time is important for a task, there
>> are already ways to improve that somewhat in mainline (task priority and
>> pinning), and more powerful solutions in the Android kernel which
>> Patrick is currently pushing upstream.
>
> So I tend to side with Morten on this one. I don't particularly like
> ASYM_PACKING much, but we already had it for PPC and it works for the
> small difference in performance ITMI has.
>
> At the time Morten already objected to using it for ITMI, and I just
> haven't had time to look into his proposal for using capacity.
>
> But I don't see it working right for big.litte/dynamiq, simply because
> it is a very strong always big preference, which is against the whole
> design premisis of big.little (as Morten has been trying to argue).

In fact, Little not only gives some better power efficiency but it
also handles far better some stuff like interrupt handling as an
example
Nevertheless, whatever the solution, it will never fit with
big.Little/dynamiQ system without some EAS as soon as the power
efficiency is involved in the equation.
I have planned to test more deeply how ASYM_PACKING works with EAS
when i will have finished others on going activity.

>


Re: [PATCH] sched: support dynamiQ cluster

2018-04-13 Thread Vincent Guittot
On 12 April 2018 at 20:22, Peter Zijlstra  wrote:
> On Tue, Apr 10, 2018 at 02:19:50PM +0100, Morten Rasmussen wrote:
>> As said above, I see your point about completion time might suffer in
>> some cases for low utilization tasks, but I don't see how you can fix
>> that automagically. ASYM_PACKING has a lot of problematic side-effects.
>> If use-space knows that completion time is important for a task, there
>> are already ways to improve that somewhat in mainline (task priority and
>> pinning), and more powerful solutions in the Android kernel which
>> Patrick is currently pushing upstream.
>
> So I tend to side with Morten on this one. I don't particularly like
> ASYM_PACKING much, but we already had it for PPC and it works for the
> small difference in performance ITMI has.
>
> At the time Morten already objected to using it for ITMI, and I just
> haven't had time to look into his proposal for using capacity.
>
> But I don't see it working right for big.litte/dynamiq, simply because
> it is a very strong always big preference, which is against the whole
> design premisis of big.little (as Morten has been trying to argue).

In fact, Little not only gives some better power efficiency but it
also handles far better some stuff like interrupt handling as an
example
Nevertheless, whatever the solution, it will never fit with
big.Little/dynamiQ system without some EAS as soon as the power
efficiency is involved in the equation.
I have planned to test more deeply how ASYM_PACKING works with EAS
when i will have finished others on going activity.

>


Re: [PATCH] sched: support dynamiQ cluster

2018-04-13 Thread Morten Rasmussen
On Thu, Apr 12, 2018 at 08:22:11PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 10, 2018 at 02:19:50PM +0100, Morten Rasmussen wrote:
> > As said above, I see your point about completion time might suffer in
> > some cases for low utilization tasks, but I don't see how you can fix
> > that automagically. ASYM_PACKING has a lot of problematic side-effects.
> > If use-space knows that completion time is important for a task, there
> > are already ways to improve that somewhat in mainline (task priority and
> > pinning), and more powerful solutions in the Android kernel which
> > Patrick is currently pushing upstream.
> 
> So I tend to side with Morten on this one. I don't particularly like
> ASYM_PACKING much, but we already had it for PPC and it works for the
> small difference in performance ITMI has.
> 
> At the time Morten already objected to using it for ITMI, and I just
> haven't had time to look into his proposal for using capacity.
> 
> But I don't see it working right for big.litte/dynamiq, simply because
> it is a very strong always big preference, which is against the whole
> design premisis of big.little (as Morten has been trying to argue).

In Vincent's defence, vendors do sometimes make design decisions that I
don't quite understand. So there could be users that really want a
non-energy-aware big-first policy, but as I said earlier in this thread,
that could be implemented better with a small tweak to wake_cap() and
using the misfit patches.

We would have to disable big-first policy and go with the current
migrate-big-task-to-big-cpus policy as soon as we care about energy. I'm
happy to give that try and come up with a patch.


Re: [PATCH] sched: support dynamiQ cluster

2018-04-13 Thread Morten Rasmussen
On Thu, Apr 12, 2018 at 08:22:11PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 10, 2018 at 02:19:50PM +0100, Morten Rasmussen wrote:
> > As said above, I see your point about completion time might suffer in
> > some cases for low utilization tasks, but I don't see how you can fix
> > that automagically. ASYM_PACKING has a lot of problematic side-effects.
> > If use-space knows that completion time is important for a task, there
> > are already ways to improve that somewhat in mainline (task priority and
> > pinning), and more powerful solutions in the Android kernel which
> > Patrick is currently pushing upstream.
> 
> So I tend to side with Morten on this one. I don't particularly like
> ASYM_PACKING much, but we already had it for PPC and it works for the
> small difference in performance ITMI has.
> 
> At the time Morten already objected to using it for ITMI, and I just
> haven't had time to look into his proposal for using capacity.
> 
> But I don't see it working right for big.litte/dynamiq, simply because
> it is a very strong always big preference, which is against the whole
> design premisis of big.little (as Morten has been trying to argue).

In Vincent's defence, vendors do sometimes make design decisions that I
don't quite understand. So there could be users that really want a
non-energy-aware big-first policy, but as I said earlier in this thread,
that could be implemented better with a small tweak to wake_cap() and
using the misfit patches.

We would have to disable big-first policy and go with the current
migrate-big-task-to-big-cpus policy as soon as we care about energy. I'm
happy to give that try and come up with a patch.


Re: [PATCH] sched: support dynamiQ cluster

2018-04-12 Thread Peter Zijlstra
On Tue, Apr 10, 2018 at 02:19:50PM +0100, Morten Rasmussen wrote:
> As said above, I see your point about completion time might suffer in
> some cases for low utilization tasks, but I don't see how you can fix
> that automagically. ASYM_PACKING has a lot of problematic side-effects.
> If use-space knows that completion time is important for a task, there
> are already ways to improve that somewhat in mainline (task priority and
> pinning), and more powerful solutions in the Android kernel which
> Patrick is currently pushing upstream.

So I tend to side with Morten on this one. I don't particularly like
ASYM_PACKING much, but we already had it for PPC and it works for the
small difference in performance ITMI has.

At the time Morten already objected to using it for ITMI, and I just
haven't had time to look into his proposal for using capacity.

But I don't see it working right for big.litte/dynamiq, simply because
it is a very strong always big preference, which is against the whole
design premisis of big.little (as Morten has been trying to argue).



Re: [PATCH] sched: support dynamiQ cluster

2018-04-12 Thread Peter Zijlstra
On Tue, Apr 10, 2018 at 02:19:50PM +0100, Morten Rasmussen wrote:
> As said above, I see your point about completion time might suffer in
> some cases for low utilization tasks, but I don't see how you can fix
> that automagically. ASYM_PACKING has a lot of problematic side-effects.
> If use-space knows that completion time is important for a task, there
> are already ways to improve that somewhat in mainline (task priority and
> pinning), and more powerful solutions in the Android kernel which
> Patrick is currently pushing upstream.

So I tend to side with Morten on this one. I don't particularly like
ASYM_PACKING much, but we already had it for PPC and it works for the
small difference in performance ITMI has.

At the time Morten already objected to using it for ITMI, and I just
haven't had time to look into his proposal for using capacity.

But I don't see it working right for big.litte/dynamiq, simply because
it is a very strong always big preference, which is against the whole
design premisis of big.little (as Morten has been trying to argue).



Re: [PATCH] sched: support dynamiQ cluster

2018-04-10 Thread Morten Rasmussen
On Mon, Apr 09, 2018 at 09:34:00AM +0200, Vincent Guittot wrote:
> Hi Morten,
> 
> On 6 April 2018 at 14:58, Morten Rasmussen  wrote:
> > On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
> >> Hi Morten,
> >>
> >> On 5 April 2018 at 17:46, Morten Rasmussen  
> >> wrote:
> >> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
> >> >> On 4 April 2018 at 12:44, Valentin Schneider 
> >> >>  wrote:
> 
> [snip]
> 
> >> >> > What I meant was that if the task composition changes, IOW we mix 
> >> >> > "small"
> >> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive 
> >> >> > stuff like
> >> >> > sysbench threads), we shouldn't assume all of those require to run on 
> >> >> > a big
> >> >> > CPU. The thing is, ASYM_PACKING can't make the difference between 
> >> >> > those, so
> >> >>
> >> >> That's the 1st point where I tend to disagree: why big cores are only
> >> >> for long running task and periodic stuff can't need to run on big
> >> >> cores to get max compute capacity ?
> >> >> You make the assumption that only long running tasks need high compute
> >> >> capacity. This patch wants to always provide max compute capacity to
> >> >> the system and not only long running task
> >> >
> >> > There is no way we can tell if a periodic or short-running tasks
> >> > requires the compute capacity of a big core or not based on utilization
> >> > alone. The utilization can only tell us if a task could potentially use
> >> > more compute capacity, i.e. the utilization approaches the compute
> >> > capacity of its current cpu.
> >> >
> >> > How we handle low utilization tasks comes down to how we define
> >> > "performance" and if we care about the cost of "performance" (e.g.
> >> > energy consumption).
> >> >
> >> > Placing a low utilization task on a little cpu should always be fine
> >> > from _throughput_ point of view. As long as the cpu has spare cycles it
> >>
> >> I disagree, throughput is not only a matter of spare cycle it's also a
> >> matter of how fast you compute the work like with IO activity as an
> >> example
> >
> > From a cpu centric point of view it is, but I agree that from a
> > application/user point of view completion time might impact throughput
> > too. For example of if your throughput depends on how fast you can
> > offload work to some peripheral device (GPU for example).
> >
> > However, as I said in the beginning we don't know what the task does.
> 
> I agree but that's not what you do with misfit as you assume long
> running task has higher priority but not shorter running tasks

Not really, as I said in the previous replies it comes down what you see
as the goal of the CFS scheduler. With the misfit patches I'm just
trying to make sure that no task is overutilizing a cpu unnecessarily as
this is in line with what load-balancing does for SMP systems. Compute
capacity is distributed as evenly as possible based on utilization just
like it is for load-balancing when task priorities are the same. From
that point of view the misfit patches don't give long running tasks
preferential treatment. However, I do agree that from a completion time
point of view, low utilization tasks could suffer unnecessarily in some
scenarios.

I don't see optimizing for completion time of low utilization tasks as a
primary goal of CFS. Wake-up balancing does try to minimize wake-up
latency, but that is about it. Fork and exec balancing and the
load-balancing code is all based on load and utilization.

Even if we wanted to optimize for completion time it is more tricky for
asymmetric cpu capacity systems than it is for SMP. Just keeping the big
cpus busy all the time isn't going to do it for many scenarios.

Firstly, migrating running tasks is quite expensive so force-migrating a
short-running task could end up taking longer time than letting it
complete on a little cpu.

Secondly, by keeping big cpus busy at all cost you risk that longer
running tasks will either end up queueing on the big cpus if you choose
to enqueue them there anyway, or they could end up running on a little
cpu if you go for the first available cpu in which case you end up
harming the completion time of that task instead.  I'm not sure how you
balance which task's completion time is more important differently than
we do today based on load or utilization. The misfit patches use the
latter. We could let it use load instead although I think we have agreed
in the past the comparing load to capacity isn't great idea.

Finally, keeping big cpus busy will increase the number of active
migrations a lot.

As said above, I see your point about completion time might suffer in
some cases for low utilization tasks, but I don't see how you can fix
that automagically. ASYM_PACKING has a lot of problematic side-effects.
If use-space knows that completion time is important for a task, there
are already ways to improve that somewhat 

Re: [PATCH] sched: support dynamiQ cluster

2018-04-10 Thread Morten Rasmussen
On Mon, Apr 09, 2018 at 09:34:00AM +0200, Vincent Guittot wrote:
> Hi Morten,
> 
> On 6 April 2018 at 14:58, Morten Rasmussen  wrote:
> > On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
> >> Hi Morten,
> >>
> >> On 5 April 2018 at 17:46, Morten Rasmussen  
> >> wrote:
> >> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
> >> >> On 4 April 2018 at 12:44, Valentin Schneider 
> >> >>  wrote:
> 
> [snip]
> 
> >> >> > What I meant was that if the task composition changes, IOW we mix 
> >> >> > "small"
> >> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive 
> >> >> > stuff like
> >> >> > sysbench threads), we shouldn't assume all of those require to run on 
> >> >> > a big
> >> >> > CPU. The thing is, ASYM_PACKING can't make the difference between 
> >> >> > those, so
> >> >>
> >> >> That's the 1st point where I tend to disagree: why big cores are only
> >> >> for long running task and periodic stuff can't need to run on big
> >> >> cores to get max compute capacity ?
> >> >> You make the assumption that only long running tasks need high compute
> >> >> capacity. This patch wants to always provide max compute capacity to
> >> >> the system and not only long running task
> >> >
> >> > There is no way we can tell if a periodic or short-running tasks
> >> > requires the compute capacity of a big core or not based on utilization
> >> > alone. The utilization can only tell us if a task could potentially use
> >> > more compute capacity, i.e. the utilization approaches the compute
> >> > capacity of its current cpu.
> >> >
> >> > How we handle low utilization tasks comes down to how we define
> >> > "performance" and if we care about the cost of "performance" (e.g.
> >> > energy consumption).
> >> >
> >> > Placing a low utilization task on a little cpu should always be fine
> >> > from _throughput_ point of view. As long as the cpu has spare cycles it
> >>
> >> I disagree, throughput is not only a matter of spare cycle it's also a
> >> matter of how fast you compute the work like with IO activity as an
> >> example
> >
> > From a cpu centric point of view it is, but I agree that from a
> > application/user point of view completion time might impact throughput
> > too. For example of if your throughput depends on how fast you can
> > offload work to some peripheral device (GPU for example).
> >
> > However, as I said in the beginning we don't know what the task does.
> 
> I agree but that's not what you do with misfit as you assume long
> running task has higher priority but not shorter running tasks

Not really, as I said in the previous replies it comes down what you see
as the goal of the CFS scheduler. With the misfit patches I'm just
trying to make sure that no task is overutilizing a cpu unnecessarily as
this is in line with what load-balancing does for SMP systems. Compute
capacity is distributed as evenly as possible based on utilization just
like it is for load-balancing when task priorities are the same. From
that point of view the misfit patches don't give long running tasks
preferential treatment. However, I do agree that from a completion time
point of view, low utilization tasks could suffer unnecessarily in some
scenarios.

I don't see optimizing for completion time of low utilization tasks as a
primary goal of CFS. Wake-up balancing does try to minimize wake-up
latency, but that is about it. Fork and exec balancing and the
load-balancing code is all based on load and utilization.

Even if we wanted to optimize for completion time it is more tricky for
asymmetric cpu capacity systems than it is for SMP. Just keeping the big
cpus busy all the time isn't going to do it for many scenarios.

Firstly, migrating running tasks is quite expensive so force-migrating a
short-running task could end up taking longer time than letting it
complete on a little cpu.

Secondly, by keeping big cpus busy at all cost you risk that longer
running tasks will either end up queueing on the big cpus if you choose
to enqueue them there anyway, or they could end up running on a little
cpu if you go for the first available cpu in which case you end up
harming the completion time of that task instead.  I'm not sure how you
balance which task's completion time is more important differently than
we do today based on load or utilization. The misfit patches use the
latter. We could let it use load instead although I think we have agreed
in the past the comparing load to capacity isn't great idea.

Finally, keeping big cpus busy will increase the number of active
migrations a lot.

As said above, I see your point about completion time might suffer in
some cases for low utilization tasks, but I don't see how you can fix
that automagically. ASYM_PACKING has a lot of problematic side-effects.
If use-space knows that completion time is important for a task, there
are already ways to improve that somewhat in mainline (task priority and
pinning), and more powerful solutions in the 

Re: [PATCH] sched: support dynamiQ cluster

2018-04-09 Thread Vincent Guittot
Hi Morten,

On 6 April 2018 at 14:58, Morten Rasmussen  wrote:
> On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On 5 April 2018 at 17:46, Morten Rasmussen  wrote:
>> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> >> On 4 April 2018 at 12:44, Valentin Schneider  
>> >> wrote:

[snip]

>> >> > What I meant was that if the task composition changes, IOW we mix 
>> >> > "small"
>> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive 
>> >> > stuff like
>> >> > sysbench threads), we shouldn't assume all of those require to run on a 
>> >> > big
>> >> > CPU. The thing is, ASYM_PACKING can't make the difference between 
>> >> > those, so
>> >>
>> >> That's the 1st point where I tend to disagree: why big cores are only
>> >> for long running task and periodic stuff can't need to run on big
>> >> cores to get max compute capacity ?
>> >> You make the assumption that only long running tasks need high compute
>> >> capacity. This patch wants to always provide max compute capacity to
>> >> the system and not only long running task
>> >
>> > There is no way we can tell if a periodic or short-running tasks
>> > requires the compute capacity of a big core or not based on utilization
>> > alone. The utilization can only tell us if a task could potentially use
>> > more compute capacity, i.e. the utilization approaches the compute
>> > capacity of its current cpu.
>> >
>> > How we handle low utilization tasks comes down to how we define
>> > "performance" and if we care about the cost of "performance" (e.g.
>> > energy consumption).
>> >
>> > Placing a low utilization task on a little cpu should always be fine
>> > from _throughput_ point of view. As long as the cpu has spare cycles it
>>
>> I disagree, throughput is not only a matter of spare cycle it's also a
>> matter of how fast you compute the work like with IO activity as an
>> example
>
> From a cpu centric point of view it is, but I agree that from a
> application/user point of view completion time might impact throughput
> too. For example of if your throughput depends on how fast you can
> offload work to some peripheral device (GPU for example).
>
> However, as I said in the beginning we don't know what the task does.

I agree but that's not what you do with misfit as you assume long
running task has higher priority but not shorter running tasks

>
>> > means that work isn't piling up faster than it can be processed.
>> > However, from a _latency_ (completion time) point of view it might be a
>> > problem, and for latency sensitive tasks I can agree that going for max
>> > capacity might be better choice.
>> >
>> > The misfit patches places tasks based on utilization to ensure that
>> > tasks get the _throughput_ they need if possible. This is in line with
>> > the placement policy we have in select_task_rq_fair() already.
>> >
>> > We shouldn't forget that what we are discussing here is the default
>> > behaviour when we don't have sufficient knowledge about the tasks in the
>> > scheduler. So we are looking a reasonable middle-of-the-road policy that
>> > doesn't kill your performance or the battery. If user-space has its own
>>
>> But misfit task kills performance and might also kills your battery as
>> it doesn't prevent small task to run on big cores
>
> As I said it is not perfect for all use-cases, it is middle-of-the-road
> approach. But I strongly disagree that it is always a bad choice for

mmh ... I never said that it's always a bad choice; I said that it can
also easily make bad choice and kills performance and / or battery. In
fact, we can't really predict the behavior of the system as short
running tasks can be randomly put on big or little cores and random
behavior are impossible to predict and mitigate.

> both energy and performance as you suggest. ASYM_PACKING doesn't
> guarantee max "throughput" (by your definition) either as you may fill
> up your big cores with smaller tasks leaving the big tasks behind on
> little cpus.

You didn't understand the point here. Asym ensures the max throughput
to the system because it will provide the max compute capacity per
seconds to the whole system and not only to some specific tasks. You
assume that long running tasks must run on big cores and not short
running tasks. But why filling a big core with long running task and
filling a little core with short running tasks is the best choice ?
Why the opposite should not be better as long as the big core is fully
used ? The goal is to keep big CPU used whatever the type of tasks.
then, there are other mechanism like cgroup to help sorting groups of
tasks.

You try to partially do 2 things at the same time

>
>> The default behavior of the scheduler is to provide max _throughput_
>> not middle performance and then side activity can mitigate the power
>> impact like frequency scaling or like EAS which 

Re: [PATCH] sched: support dynamiQ cluster

2018-04-09 Thread Vincent Guittot
Hi Morten,

On 6 April 2018 at 14:58, Morten Rasmussen  wrote:
> On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On 5 April 2018 at 17:46, Morten Rasmussen  wrote:
>> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> >> On 4 April 2018 at 12:44, Valentin Schneider  
>> >> wrote:

[snip]

>> >> > What I meant was that if the task composition changes, IOW we mix 
>> >> > "small"
>> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive 
>> >> > stuff like
>> >> > sysbench threads), we shouldn't assume all of those require to run on a 
>> >> > big
>> >> > CPU. The thing is, ASYM_PACKING can't make the difference between 
>> >> > those, so
>> >>
>> >> That's the 1st point where I tend to disagree: why big cores are only
>> >> for long running task and periodic stuff can't need to run on big
>> >> cores to get max compute capacity ?
>> >> You make the assumption that only long running tasks need high compute
>> >> capacity. This patch wants to always provide max compute capacity to
>> >> the system and not only long running task
>> >
>> > There is no way we can tell if a periodic or short-running tasks
>> > requires the compute capacity of a big core or not based on utilization
>> > alone. The utilization can only tell us if a task could potentially use
>> > more compute capacity, i.e. the utilization approaches the compute
>> > capacity of its current cpu.
>> >
>> > How we handle low utilization tasks comes down to how we define
>> > "performance" and if we care about the cost of "performance" (e.g.
>> > energy consumption).
>> >
>> > Placing a low utilization task on a little cpu should always be fine
>> > from _throughput_ point of view. As long as the cpu has spare cycles it
>>
>> I disagree, throughput is not only a matter of spare cycle it's also a
>> matter of how fast you compute the work like with IO activity as an
>> example
>
> From a cpu centric point of view it is, but I agree that from a
> application/user point of view completion time might impact throughput
> too. For example of if your throughput depends on how fast you can
> offload work to some peripheral device (GPU for example).
>
> However, as I said in the beginning we don't know what the task does.

I agree but that's not what you do with misfit as you assume long
running task has higher priority but not shorter running tasks

>
>> > means that work isn't piling up faster than it can be processed.
>> > However, from a _latency_ (completion time) point of view it might be a
>> > problem, and for latency sensitive tasks I can agree that going for max
>> > capacity might be better choice.
>> >
>> > The misfit patches places tasks based on utilization to ensure that
>> > tasks get the _throughput_ they need if possible. This is in line with
>> > the placement policy we have in select_task_rq_fair() already.
>> >
>> > We shouldn't forget that what we are discussing here is the default
>> > behaviour when we don't have sufficient knowledge about the tasks in the
>> > scheduler. So we are looking a reasonable middle-of-the-road policy that
>> > doesn't kill your performance or the battery. If user-space has its own
>>
>> But misfit task kills performance and might also kills your battery as
>> it doesn't prevent small task to run on big cores
>
> As I said it is not perfect for all use-cases, it is middle-of-the-road
> approach. But I strongly disagree that it is always a bad choice for

mmh ... I never said that it's always a bad choice; I said that it can
also easily make bad choice and kills performance and / or battery. In
fact, we can't really predict the behavior of the system as short
running tasks can be randomly put on big or little cores and random
behavior are impossible to predict and mitigate.

> both energy and performance as you suggest. ASYM_PACKING doesn't
> guarantee max "throughput" (by your definition) either as you may fill
> up your big cores with smaller tasks leaving the big tasks behind on
> little cpus.

You didn't understand the point here. Asym ensures the max throughput
to the system because it will provide the max compute capacity per
seconds to the whole system and not only to some specific tasks. You
assume that long running tasks must run on big cores and not short
running tasks. But why filling a big core with long running task and
filling a little core with short running tasks is the best choice ?
Why the opposite should not be better as long as the big core is fully
used ? The goal is to keep big CPU used whatever the type of tasks.
then, there are other mechanism like cgroup to help sorting groups of
tasks.

You try to partially do 2 things at the same time

>
>> The default behavior of the scheduler is to provide max _throughput_
>> not middle performance and then side activity can mitigate the power
>> impact like frequency scaling or like EAS which tries to optimize the
>> usage of energy when system is not overloaded.
>
> That 

Re: [PATCH] sched: support dynamiQ cluster

2018-04-06 Thread Morten Rasmussen
On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
> Hi Morten,
> 
> On 5 April 2018 at 17:46, Morten Rasmussen  wrote:
> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
> >> On 4 April 2018 at 12:44, Valentin Schneider  
> >> wrote:
> >> > Hi,
> >> >
> >> > On 03/04/18 13:17, Vincent Guittot wrote:
> >> >> Hi Valentin,
> >> >>
> >> > [...]
> >> >>>
> >> >>> I believe ASYM_PACKING behaves better here because the workload is only
> >> >>> sysbench threads. As stated above, since task utilization is 
> >> >>> disregarded, I
> >> >>
> >> >> It behaves better because it doesn't wait for the task's utilization
> >> >> to reach a level before assuming the task needs high compute capacity.
> >> >> The utilization gives an idea of the running time of the task not the
> >> >> performance level that is needed
> >> >>
> >> >
> >> > That's my point actually. ASYM_PACKING disregards utilization and moves 
> >> > those
> >> > threads to the big cores ASAP, which is good here because it's just 
> >> > sysbench
> >> > threads.
> >> >
> >> > What I meant was that if the task composition changes, IOW we mix "small"
> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff 
> >> > like
> >> > sysbench threads), we shouldn't assume all of those require to run on a 
> >> > big
> >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, 
> >> > so
> >>
> >> That's the 1st point where I tend to disagree: why big cores are only
> >> for long running task and periodic stuff can't need to run on big
> >> cores to get max compute capacity ?
> >> You make the assumption that only long running tasks need high compute
> >> capacity. This patch wants to always provide max compute capacity to
> >> the system and not only long running task
> >
> > There is no way we can tell if a periodic or short-running tasks
> > requires the compute capacity of a big core or not based on utilization
> > alone. The utilization can only tell us if a task could potentially use
> > more compute capacity, i.e. the utilization approaches the compute
> > capacity of its current cpu.
> >
> > How we handle low utilization tasks comes down to how we define
> > "performance" and if we care about the cost of "performance" (e.g.
> > energy consumption).
> >
> > Placing a low utilization task on a little cpu should always be fine
> > from _throughput_ point of view. As long as the cpu has spare cycles it
> 
> I disagree, throughput is not only a matter of spare cycle it's also a
> matter of how fast you compute the work like with IO activity as an
> example

>From a cpu centric point of view it is, but I agree that from a
application/user point of view completion time might impact throughput
too. For example of if your throughput depends on how fast you can
offload work to some peripheral device (GPU for example).

However, as I said in the beginning we don't know what the task does.

> > means that work isn't piling up faster than it can be processed.
> > However, from a _latency_ (completion time) point of view it might be a
> > problem, and for latency sensitive tasks I can agree that going for max
> > capacity might be better choice.
> >
> > The misfit patches places tasks based on utilization to ensure that
> > tasks get the _throughput_ they need if possible. This is in line with
> > the placement policy we have in select_task_rq_fair() already.
> >
> > We shouldn't forget that what we are discussing here is the default
> > behaviour when we don't have sufficient knowledge about the tasks in the
> > scheduler. So we are looking a reasonable middle-of-the-road policy that
> > doesn't kill your performance or the battery. If user-space has its own
> 
> But misfit task kills performance and might also kills your battery as
> it doesn't prevent small task to run on big cores

As I said it is not perfect for all use-cases, it is middle-of-the-road
approach. But I strongly disagree that it is always a bad choice for
both energy and performance as you suggest. ASYM_PACKING doesn't
guarantee max "throughput" (by your definition) either as you may fill
up your big cores with smaller tasks leaving the big tasks behind on
little cpus.

> The default behavior of the scheduler is to provide max _throughput_
> not middle performance and then side activity can mitigate the power
> impact like frequency scaling or like EAS which tries to optimize the
> usage of energy when system is not overloaded.

That view doesn't fit very well with all activities around integrating
cpufreq and the scheduler. Frequency scaling is an important factor in
optimizing the throughput.


> With misfit task, you
> make the assumption that short task on little core is the best
> placement to do even for a performance PoV.

I never said it was the best placement, I said it was a reasonable
default policy for big.LITTLE systems.

> It seems that you make
> some 

Re: [PATCH] sched: support dynamiQ cluster

2018-04-06 Thread Morten Rasmussen
On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
> Hi Morten,
> 
> On 5 April 2018 at 17:46, Morten Rasmussen  wrote:
> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
> >> On 4 April 2018 at 12:44, Valentin Schneider  
> >> wrote:
> >> > Hi,
> >> >
> >> > On 03/04/18 13:17, Vincent Guittot wrote:
> >> >> Hi Valentin,
> >> >>
> >> > [...]
> >> >>>
> >> >>> I believe ASYM_PACKING behaves better here because the workload is only
> >> >>> sysbench threads. As stated above, since task utilization is 
> >> >>> disregarded, I
> >> >>
> >> >> It behaves better because it doesn't wait for the task's utilization
> >> >> to reach a level before assuming the task needs high compute capacity.
> >> >> The utilization gives an idea of the running time of the task not the
> >> >> performance level that is needed
> >> >>
> >> >
> >> > That's my point actually. ASYM_PACKING disregards utilization and moves 
> >> > those
> >> > threads to the big cores ASAP, which is good here because it's just 
> >> > sysbench
> >> > threads.
> >> >
> >> > What I meant was that if the task composition changes, IOW we mix "small"
> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff 
> >> > like
> >> > sysbench threads), we shouldn't assume all of those require to run on a 
> >> > big
> >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, 
> >> > so
> >>
> >> That's the 1st point where I tend to disagree: why big cores are only
> >> for long running task and periodic stuff can't need to run on big
> >> cores to get max compute capacity ?
> >> You make the assumption that only long running tasks need high compute
> >> capacity. This patch wants to always provide max compute capacity to
> >> the system and not only long running task
> >
> > There is no way we can tell if a periodic or short-running tasks
> > requires the compute capacity of a big core or not based on utilization
> > alone. The utilization can only tell us if a task could potentially use
> > more compute capacity, i.e. the utilization approaches the compute
> > capacity of its current cpu.
> >
> > How we handle low utilization tasks comes down to how we define
> > "performance" and if we care about the cost of "performance" (e.g.
> > energy consumption).
> >
> > Placing a low utilization task on a little cpu should always be fine
> > from _throughput_ point of view. As long as the cpu has spare cycles it
> 
> I disagree, throughput is not only a matter of spare cycle it's also a
> matter of how fast you compute the work like with IO activity as an
> example

>From a cpu centric point of view it is, but I agree that from a
application/user point of view completion time might impact throughput
too. For example of if your throughput depends on how fast you can
offload work to some peripheral device (GPU for example).

However, as I said in the beginning we don't know what the task does.

> > means that work isn't piling up faster than it can be processed.
> > However, from a _latency_ (completion time) point of view it might be a
> > problem, and for latency sensitive tasks I can agree that going for max
> > capacity might be better choice.
> >
> > The misfit patches places tasks based on utilization to ensure that
> > tasks get the _throughput_ they need if possible. This is in line with
> > the placement policy we have in select_task_rq_fair() already.
> >
> > We shouldn't forget that what we are discussing here is the default
> > behaviour when we don't have sufficient knowledge about the tasks in the
> > scheduler. So we are looking a reasonable middle-of-the-road policy that
> > doesn't kill your performance or the battery. If user-space has its own
> 
> But misfit task kills performance and might also kills your battery as
> it doesn't prevent small task to run on big cores

As I said it is not perfect for all use-cases, it is middle-of-the-road
approach. But I strongly disagree that it is always a bad choice for
both energy and performance as you suggest. ASYM_PACKING doesn't
guarantee max "throughput" (by your definition) either as you may fill
up your big cores with smaller tasks leaving the big tasks behind on
little cpus.

> The default behavior of the scheduler is to provide max _throughput_
> not middle performance and then side activity can mitigate the power
> impact like frequency scaling or like EAS which tries to optimize the
> usage of energy when system is not overloaded.

That view doesn't fit very well with all activities around integrating
cpufreq and the scheduler. Frequency scaling is an important factor in
optimizing the throughput.


> With misfit task, you
> make the assumption that short task on little core is the best
> placement to do even for a performance PoV.

I never said it was the best placement, I said it was a reasonable
default policy for big.LITTLE systems.

> It seems that you make
> some power/performance assumption without using an energy model which

Re: [PATCH] sched: support dynamiQ cluster

2018-04-05 Thread Vincent Guittot
Hi Morten,

On 5 April 2018 at 17:46, Morten Rasmussen  wrote:
> On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> On 4 April 2018 at 12:44, Valentin Schneider  
>> wrote:
>> > Hi,
>> >
>> > On 03/04/18 13:17, Vincent Guittot wrote:
>> >> Hi Valentin,
>> >>
>> > [...]
>> >>>
>> >>> I believe ASYM_PACKING behaves better here because the workload is only
>> >>> sysbench threads. As stated above, since task utilization is 
>> >>> disregarded, I
>> >>
>> >> It behaves better because it doesn't wait for the task's utilization
>> >> to reach a level before assuming the task needs high compute capacity.
>> >> The utilization gives an idea of the running time of the task not the
>> >> performance level that is needed
>> >>
>> >
>> > That's my point actually. ASYM_PACKING disregards utilization and moves 
>> > those
>> > threads to the big cores ASAP, which is good here because it's just 
>> > sysbench
>> > threads.
>> >
>> > What I meant was that if the task composition changes, IOW we mix "small"
>> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff 
>> > like
>> > sysbench threads), we shouldn't assume all of those require to run on a big
>> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
>>
>> That's the 1st point where I tend to disagree: why big cores are only
>> for long running task and periodic stuff can't need to run on big
>> cores to get max compute capacity ?
>> You make the assumption that only long running tasks need high compute
>> capacity. This patch wants to always provide max compute capacity to
>> the system and not only long running task
>
> There is no way we can tell if a periodic or short-running tasks
> requires the compute capacity of a big core or not based on utilization
> alone. The utilization can only tell us if a task could potentially use
> more compute capacity, i.e. the utilization approaches the compute
> capacity of its current cpu.
>
> How we handle low utilization tasks comes down to how we define
> "performance" and if we care about the cost of "performance" (e.g.
> energy consumption).
>
> Placing a low utilization task on a little cpu should always be fine
> from _throughput_ point of view. As long as the cpu has spare cycles it

I disagree, throughput is not only a matter of spare cycle it's also a
matter of how fast you compute the work like with IO activity as an
example

> means that work isn't piling up faster than it can be processed.
> However, from a _latency_ (completion time) point of view it might be a
> problem, and for latency sensitive tasks I can agree that going for max
> capacity might be better choice.
>
> The misfit patches places tasks based on utilization to ensure that
> tasks get the _throughput_ they need if possible. This is in line with
> the placement policy we have in select_task_rq_fair() already.
>
> We shouldn't forget that what we are discussing here is the default
> behaviour when we don't have sufficient knowledge about the tasks in the
> scheduler. So we are looking a reasonable middle-of-the-road policy that
> doesn't kill your performance or the battery. If user-space has its own

But misfit task kills performance and might also kills your battery as
it doesn't prevent small task to run on big cores
The default behavior of the scheduler is to provide max _throughput_
not middle performance and then side activity can mitigate the power
impact like frequency scaling or like EAS which tries to optimize the
usage of energy when system is not overloaded. With misfit task, you
make the assumption that short task on little core is the best
placement to do even for a performance PoV. It seems that you make
some power/performance assumption without using an energy model which
can make such decision. This is all the interest of EAS.

> opinion about performance requirements it is free to use task affinity
> to control which cpu the task end up on and ensure that the task gets
> max capacity always. On top of that we have had interfaces in Android
> for years to specify performance requirements for task (groups) to allow
> small tasks to be placed on big cpus and big task to be placed on little
> cpus depending on their requirements. It is even tied into cpufreq as
> well. A lot of effort has gone into Android to get this balance right.
> Patrick is working hard on upstreaming some of those features.
>
> In the bigger picture always going for max capacity is not desirable for
> well-configured big.LITTLE system. You would never exploit the advantage
> of the little cpus as you always use big first and only use little when
> the bigs are overloaded at which point having little cpus at all makes

If i'm not wrong misfit task patchset doesn't prevent little task to
run on big core

> little sense. Vendors build big.LITTLE systems because they want a
> better performance/energy trade-off, if they wanted max capacity always,
> 

Re: [PATCH] sched: support dynamiQ cluster

2018-04-05 Thread Vincent Guittot
Hi Morten,

On 5 April 2018 at 17:46, Morten Rasmussen  wrote:
> On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> On 4 April 2018 at 12:44, Valentin Schneider  
>> wrote:
>> > Hi,
>> >
>> > On 03/04/18 13:17, Vincent Guittot wrote:
>> >> Hi Valentin,
>> >>
>> > [...]
>> >>>
>> >>> I believe ASYM_PACKING behaves better here because the workload is only
>> >>> sysbench threads. As stated above, since task utilization is 
>> >>> disregarded, I
>> >>
>> >> It behaves better because it doesn't wait for the task's utilization
>> >> to reach a level before assuming the task needs high compute capacity.
>> >> The utilization gives an idea of the running time of the task not the
>> >> performance level that is needed
>> >>
>> >
>> > That's my point actually. ASYM_PACKING disregards utilization and moves 
>> > those
>> > threads to the big cores ASAP, which is good here because it's just 
>> > sysbench
>> > threads.
>> >
>> > What I meant was that if the task composition changes, IOW we mix "small"
>> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff 
>> > like
>> > sysbench threads), we shouldn't assume all of those require to run on a big
>> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
>>
>> That's the 1st point where I tend to disagree: why big cores are only
>> for long running task and periodic stuff can't need to run on big
>> cores to get max compute capacity ?
>> You make the assumption that only long running tasks need high compute
>> capacity. This patch wants to always provide max compute capacity to
>> the system and not only long running task
>
> There is no way we can tell if a periodic or short-running tasks
> requires the compute capacity of a big core or not based on utilization
> alone. The utilization can only tell us if a task could potentially use
> more compute capacity, i.e. the utilization approaches the compute
> capacity of its current cpu.
>
> How we handle low utilization tasks comes down to how we define
> "performance" and if we care about the cost of "performance" (e.g.
> energy consumption).
>
> Placing a low utilization task on a little cpu should always be fine
> from _throughput_ point of view. As long as the cpu has spare cycles it

I disagree, throughput is not only a matter of spare cycle it's also a
matter of how fast you compute the work like with IO activity as an
example

> means that work isn't piling up faster than it can be processed.
> However, from a _latency_ (completion time) point of view it might be a
> problem, and for latency sensitive tasks I can agree that going for max
> capacity might be better choice.
>
> The misfit patches places tasks based on utilization to ensure that
> tasks get the _throughput_ they need if possible. This is in line with
> the placement policy we have in select_task_rq_fair() already.
>
> We shouldn't forget that what we are discussing here is the default
> behaviour when we don't have sufficient knowledge about the tasks in the
> scheduler. So we are looking a reasonable middle-of-the-road policy that
> doesn't kill your performance or the battery. If user-space has its own

But misfit task kills performance and might also kills your battery as
it doesn't prevent small task to run on big cores
The default behavior of the scheduler is to provide max _throughput_
not middle performance and then side activity can mitigate the power
impact like frequency scaling or like EAS which tries to optimize the
usage of energy when system is not overloaded. With misfit task, you
make the assumption that short task on little core is the best
placement to do even for a performance PoV. It seems that you make
some power/performance assumption without using an energy model which
can make such decision. This is all the interest of EAS.

> opinion about performance requirements it is free to use task affinity
> to control which cpu the task end up on and ensure that the task gets
> max capacity always. On top of that we have had interfaces in Android
> for years to specify performance requirements for task (groups) to allow
> small tasks to be placed on big cpus and big task to be placed on little
> cpus depending on their requirements. It is even tied into cpufreq as
> well. A lot of effort has gone into Android to get this balance right.
> Patrick is working hard on upstreaming some of those features.
>
> In the bigger picture always going for max capacity is not desirable for
> well-configured big.LITTLE system. You would never exploit the advantage
> of the little cpus as you always use big first and only use little when
> the bigs are overloaded at which point having little cpus at all makes

If i'm not wrong misfit task patchset doesn't prevent little task to
run on big core

> little sense. Vendors build big.LITTLE systems because they want a
> better performance/energy trade-off, if they wanted max capacity always,
> they would just built big-only systems.

And that's 

Re: [PATCH] sched: support dynamiQ cluster

2018-04-05 Thread Morten Rasmussen
On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
> On 4 April 2018 at 12:44, Valentin Schneider  
> wrote:
> > Hi,
> >
> > On 03/04/18 13:17, Vincent Guittot wrote:
> >> Hi Valentin,
> >>
> > [...]
> >>>
> >>> I believe ASYM_PACKING behaves better here because the workload is only
> >>> sysbench threads. As stated above, since task utilization is disregarded, 
> >>> I
> >>
> >> It behaves better because it doesn't wait for the task's utilization
> >> to reach a level before assuming the task needs high compute capacity.
> >> The utilization gives an idea of the running time of the task not the
> >> performance level that is needed
> >>
> >
> > That's my point actually. ASYM_PACKING disregards utilization and moves 
> > those
> > threads to the big cores ASAP, which is good here because it's just sysbench
> > threads.
> >
> > What I meant was that if the task composition changes, IOW we mix "small"
> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff 
> > like
> > sysbench threads), we shouldn't assume all of those require to run on a big
> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
> 
> That's the 1st point where I tend to disagree: why big cores are only
> for long running task and periodic stuff can't need to run on big
> cores to get max compute capacity ?
> You make the assumption that only long running tasks need high compute
> capacity. This patch wants to always provide max compute capacity to
> the system and not only long running task

There is no way we can tell if a periodic or short-running tasks
requires the compute capacity of a big core or not based on utilization
alone. The utilization can only tell us if a task could potentially use
more compute capacity, i.e. the utilization approaches the compute
capacity of its current cpu.

How we handle low utilization tasks comes down to how we define
"performance" and if we care about the cost of "performance" (e.g.
energy consumption).

Placing a low utilization task on a little cpu should always be fine
from _throughput_ point of view. As long as the cpu has spare cycles it
means that work isn't piling up faster than it can be processed.
However, from a _latency_ (completion time) point of view it might be a
problem, and for latency sensitive tasks I can agree that going for max
capacity might be better choice.

The misfit patches places tasks based on utilization to ensure that
tasks get the _throughput_ they need if possible. This is in line with
the placement policy we have in select_task_rq_fair() already.

We shouldn't forget that what we are discussing here is the default
behaviour when we don't have sufficient knowledge about the tasks in the
scheduler. So we are looking a reasonable middle-of-the-road policy that
doesn't kill your performance or the battery. If user-space has its own
opinion about performance requirements it is free to use task affinity
to control which cpu the task end up on and ensure that the task gets
max capacity always. On top of that we have had interfaces in Android
for years to specify performance requirements for task (groups) to allow
small tasks to be placed on big cpus and big task to be placed on little
cpus depending on their requirements. It is even tied into cpufreq as
well. A lot of effort has gone into Android to get this balance right.
Patrick is working hard on upstreaming some of those features.

In the bigger picture always going for max capacity is not desirable for
well-configured big.LITTLE system. You would never exploit the advantage
of the little cpus as you always use big first and only use little when
the bigs are overloaded at which point having little cpus at all makes
little sense. Vendors build big.LITTLE systems because they want a
better performance/energy trade-off, if they wanted max capacity always,
they would just built big-only systems.

If we would be that concerned about latency, DVFS would be a problem too
and we would use nothing but the performance governor. So seen in the
bigger picture I have to disagree that blindly going for max capacity is
the right default policy for big.LITTLE. As soon as we involve a energy
model in the task placement decisions, it definitely isn't.

Morten


Re: [PATCH] sched: support dynamiQ cluster

2018-04-05 Thread Morten Rasmussen
On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
> On 4 April 2018 at 12:44, Valentin Schneider  
> wrote:
> > Hi,
> >
> > On 03/04/18 13:17, Vincent Guittot wrote:
> >> Hi Valentin,
> >>
> > [...]
> >>>
> >>> I believe ASYM_PACKING behaves better here because the workload is only
> >>> sysbench threads. As stated above, since task utilization is disregarded, 
> >>> I
> >>
> >> It behaves better because it doesn't wait for the task's utilization
> >> to reach a level before assuming the task needs high compute capacity.
> >> The utilization gives an idea of the running time of the task not the
> >> performance level that is needed
> >>
> >
> > That's my point actually. ASYM_PACKING disregards utilization and moves 
> > those
> > threads to the big cores ASAP, which is good here because it's just sysbench
> > threads.
> >
> > What I meant was that if the task composition changes, IOW we mix "small"
> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff 
> > like
> > sysbench threads), we shouldn't assume all of those require to run on a big
> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
> 
> That's the 1st point where I tend to disagree: why big cores are only
> for long running task and periodic stuff can't need to run on big
> cores to get max compute capacity ?
> You make the assumption that only long running tasks need high compute
> capacity. This patch wants to always provide max compute capacity to
> the system and not only long running task

There is no way we can tell if a periodic or short-running tasks
requires the compute capacity of a big core or not based on utilization
alone. The utilization can only tell us if a task could potentially use
more compute capacity, i.e. the utilization approaches the compute
capacity of its current cpu.

How we handle low utilization tasks comes down to how we define
"performance" and if we care about the cost of "performance" (e.g.
energy consumption).

Placing a low utilization task on a little cpu should always be fine
from _throughput_ point of view. As long as the cpu has spare cycles it
means that work isn't piling up faster than it can be processed.
However, from a _latency_ (completion time) point of view it might be a
problem, and for latency sensitive tasks I can agree that going for max
capacity might be better choice.

The misfit patches places tasks based on utilization to ensure that
tasks get the _throughput_ they need if possible. This is in line with
the placement policy we have in select_task_rq_fair() already.

We shouldn't forget that what we are discussing here is the default
behaviour when we don't have sufficient knowledge about the tasks in the
scheduler. So we are looking a reasonable middle-of-the-road policy that
doesn't kill your performance or the battery. If user-space has its own
opinion about performance requirements it is free to use task affinity
to control which cpu the task end up on and ensure that the task gets
max capacity always. On top of that we have had interfaces in Android
for years to specify performance requirements for task (groups) to allow
small tasks to be placed on big cpus and big task to be placed on little
cpus depending on their requirements. It is even tied into cpufreq as
well. A lot of effort has gone into Android to get this balance right.
Patrick is working hard on upstreaming some of those features.

In the bigger picture always going for max capacity is not desirable for
well-configured big.LITTLE system. You would never exploit the advantage
of the little cpus as you always use big first and only use little when
the bigs are overloaded at which point having little cpus at all makes
little sense. Vendors build big.LITTLE systems because they want a
better performance/energy trade-off, if they wanted max capacity always,
they would just built big-only systems.

If we would be that concerned about latency, DVFS would be a problem too
and we would use nothing but the performance governor. So seen in the
bigger picture I have to disagree that blindly going for max capacity is
the right default policy for big.LITTLE. As soon as we involve a energy
model in the task placement decisions, it definitely isn't.

Morten


Re: [PATCH] sched: support dynamiQ cluster

2018-04-04 Thread Vincent Guittot
On 4 April 2018 at 12:44, Valentin Schneider  wrote:
> Hi,
>
> On 03/04/18 13:17, Vincent Guittot wrote:
>> Hi Valentin,
>>
> [...]
>>>
>>> I believe ASYM_PACKING behaves better here because the workload is only
>>> sysbench threads. As stated above, since task utilization is disregarded, I
>>
>> It behaves better because it doesn't wait for the task's utilization
>> to reach a level before assuming the task needs high compute capacity.
>> The utilization gives an idea of the running time of the task not the
>> performance level that is needed
>>
>
> That's my point actually. ASYM_PACKING disregards utilization and moves those
> threads to the big cores ASAP, which is good here because it's just sysbench
> threads.
>
> What I meant was that if the task composition changes, IOW we mix "small"
> tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
> sysbench threads), we shouldn't assume all of those require to run on a big
> CPU. The thing is, ASYM_PACKING can't make the difference between those, so

That's the 1st point where I tend to disagree: why big cores are only
for long running task and periodic stuff can't need to run on big
cores to get max compute capacity ?
You make the assumption that only long running tasks need high compute
capacity. This patch wants to always provide max compute capacity to
the system and not only long running task

> it'll all come down to which task spawned first.
>
> Furthermore, ASYM_PACKING will forcefully move tasks via active balance
> regardless of the imbalance as long as a big CPU is idle.
>
> So we could have a scenario where loads of "small" tasks spawn, and they all
> get moved to a big CPU until they're all full (because they're periodic tasks
> so the big CPUs will eventually be idle and will pull another task as long as
> they get some idle time).
>
> Then, before the load tracking signals of those tasks ramp up high enough
> that the load balancer would try to move those to LITTLE CPUs, some "big"
> tasks spawn. They get scheduled on LITTLE CPUs, and now the system will look
> balanced so nothing will be done.

As explained above, as long as the big CPUs are always used,I don't
think it's a problem. What is a problem is if a task stays on a little
CPU whereas a big CPU is idle because we can provide more throughput

>
>
> I acknowledge this all sounds convoluted but I hope it highlights what I
> think could go wrong with ASYM_PACKING on asymmetric systems.
>
> Regards,
> Valentin


Re: [PATCH] sched: support dynamiQ cluster

2018-04-04 Thread Vincent Guittot
On 4 April 2018 at 12:44, Valentin Schneider  wrote:
> Hi,
>
> On 03/04/18 13:17, Vincent Guittot wrote:
>> Hi Valentin,
>>
> [...]
>>>
>>> I believe ASYM_PACKING behaves better here because the workload is only
>>> sysbench threads. As stated above, since task utilization is disregarded, I
>>
>> It behaves better because it doesn't wait for the task's utilization
>> to reach a level before assuming the task needs high compute capacity.
>> The utilization gives an idea of the running time of the task not the
>> performance level that is needed
>>
>
> That's my point actually. ASYM_PACKING disregards utilization and moves those
> threads to the big cores ASAP, which is good here because it's just sysbench
> threads.
>
> What I meant was that if the task composition changes, IOW we mix "small"
> tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
> sysbench threads), we shouldn't assume all of those require to run on a big
> CPU. The thing is, ASYM_PACKING can't make the difference between those, so

That's the 1st point where I tend to disagree: why big cores are only
for long running task and periodic stuff can't need to run on big
cores to get max compute capacity ?
You make the assumption that only long running tasks need high compute
capacity. This patch wants to always provide max compute capacity to
the system and not only long running task

> it'll all come down to which task spawned first.
>
> Furthermore, ASYM_PACKING will forcefully move tasks via active balance
> regardless of the imbalance as long as a big CPU is idle.
>
> So we could have a scenario where loads of "small" tasks spawn, and they all
> get moved to a big CPU until they're all full (because they're periodic tasks
> so the big CPUs will eventually be idle and will pull another task as long as
> they get some idle time).
>
> Then, before the load tracking signals of those tasks ramp up high enough
> that the load balancer would try to move those to LITTLE CPUs, some "big"
> tasks spawn. They get scheduled on LITTLE CPUs, and now the system will look
> balanced so nothing will be done.

As explained above, as long as the big CPUs are always used,I don't
think it's a problem. What is a problem is if a task stays on a little
CPU whereas a big CPU is idle because we can provide more throughput

>
>
> I acknowledge this all sounds convoluted but I hope it highlights what I
> think could go wrong with ASYM_PACKING on asymmetric systems.
>
> Regards,
> Valentin


Re: [PATCH] sched: support dynamiQ cluster

2018-04-04 Thread Valentin Schneider
Hi,

On 03/04/18 13:17, Vincent Guittot wrote:
> Hi Valentin,
> 
[...]
>>
>> I believe ASYM_PACKING behaves better here because the workload is only
>> sysbench threads. As stated above, since task utilization is disregarded, I
> 
> It behaves better because it doesn't wait for the task's utilization
> to reach a level before assuming the task needs high compute capacity.
> The utilization gives an idea of the running time of the task not the
> performance level that is needed
> 

That's my point actually. ASYM_PACKING disregards utilization and moves those
threads to the big cores ASAP, which is good here because it's just sysbench
threads.

What I meant was that if the task composition changes, IOW we mix "small"
tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
sysbench threads), we shouldn't assume all of those require to run on a big
CPU. The thing is, ASYM_PACKING can't make the difference between those, so
it'll all come down to which task spawned first.

Furthermore, ASYM_PACKING will forcefully move tasks via active balance
regardless of the imbalance as long as a big CPU is idle.

So we could have a scenario where loads of "small" tasks spawn, and they all
get moved to a big CPU until they're all full (because they're periodic tasks
so the big CPUs will eventually be idle and will pull another task as long as
they get some idle time).

Then, before the load tracking signals of those tasks ramp up high enough 
that the load balancer would try to move those to LITTLE CPUs, some "big"
tasks spawn. They get scheduled on LITTLE CPUs, and now the system will look
balanced so nothing will be done.


I acknowledge this all sounds convoluted but I hope it highlights what I
think could go wrong with ASYM_PACKING on asymmetric systems.

Regards,
Valentin


Re: [PATCH] sched: support dynamiQ cluster

2018-04-04 Thread Valentin Schneider
Hi,

On 03/04/18 13:17, Vincent Guittot wrote:
> Hi Valentin,
> 
[...]
>>
>> I believe ASYM_PACKING behaves better here because the workload is only
>> sysbench threads. As stated above, since task utilization is disregarded, I
> 
> It behaves better because it doesn't wait for the task's utilization
> to reach a level before assuming the task needs high compute capacity.
> The utilization gives an idea of the running time of the task not the
> performance level that is needed
> 

That's my point actually. ASYM_PACKING disregards utilization and moves those
threads to the big cores ASAP, which is good here because it's just sysbench
threads.

What I meant was that if the task composition changes, IOW we mix "small"
tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
sysbench threads), we shouldn't assume all of those require to run on a big
CPU. The thing is, ASYM_PACKING can't make the difference between those, so
it'll all come down to which task spawned first.

Furthermore, ASYM_PACKING will forcefully move tasks via active balance
regardless of the imbalance as long as a big CPU is idle.

So we could have a scenario where loads of "small" tasks spawn, and they all
get moved to a big CPU until they're all full (because they're periodic tasks
so the big CPUs will eventually be idle and will pull another task as long as
they get some idle time).

Then, before the load tracking signals of those tasks ramp up high enough 
that the load balancer would try to move those to LITTLE CPUs, some "big"
tasks spawn. They get scheduled on LITTLE CPUs, and now the system will look
balanced so nothing will be done.


I acknowledge this all sounds convoluted but I hope it highlights what I
think could go wrong with ASYM_PACKING on asymmetric systems.

Regards,
Valentin


Re: [PATCH] sched: support dynamiQ cluster

2018-04-03 Thread Vincent Guittot
Hi Valentin,

On 3 April 2018 at 00:27, Valentin Schneider  wrote:
> Hi,
>
> On 30/03/18 13:34, Vincent Guittot wrote:
>> Hi Morten,
>>
> [..]
>>>
>>> As I see it, the main differences is that ASYM_PACKING attempts to pack
>>> all tasks regardless of task utilization on the higher capacity cpus
>>> whereas the "misfit task" series carefully picks cpus with tasks they
>>> can't handle so we don't risk migrating tasks which are perfectly
>>
>> That's one main difference because misfit task will let middle range
>> load task on little CPUs which will not provide maximum performance.
>> I have put an example below
>>
>>> suitable to for a little cpu to a big cpu unnecessarily. Also it is
>>> based directly on utilization and cpu capacity like the capacity
>>> awareness we already have to deal with big.LITTLE in the wake-up path.
>
> I think that bit is quite important. AFAICT, ASYM_PACKING disregards
> task utilization, it only makes sure that (with your patch) tasks will be
> migrated to big CPUS if those ever go idle (pulls at NEWLY_IDLE balance or
> later on during nohz balance). I didn't see anything related to ASYM_PACKING
> in the wake path.
>
>>> Have to tried taking the misfit patches for a spin on your setup? I
>>> expect them give you the same behaviour as you report above.
>>
>> So I have tried both your tests and mine on both patchset and they
>> provide same results which is somewhat expected as the benches are run
>> for several seconds.
>> In other to highlight the main difference between misfit task and
>> ASYM_PACKING, I have reused your test and reduced the number of
>> max-request for sysbench so that the test duration was in the range of
>> hundreds ms.
>>
>> Hikey960 (emulate dynamiq topology)
>>min avg(stdev)  max
>> misfit 0.0975000.114911(+- 10%)0.138500
>> asym   0.0925000.106072(+-  6%)0.122900
>>
>> In this case, we can see that ASYM_PACKING is doing better( 8%)
>> because it migrates sysbench threads on big core as soon as they are
>> available whereas misfit task has to wait for the utilization to
>> increase above the 80% which takes around 70ms when starting with an
>> utilization that is null
>>
>
> I believe ASYM_PACKING behaves better here because the workload is only
> sysbench threads. As stated above, since task utilization is disregarded, I

It behaves better because it doesn't wait for the task's utilization
to reach a level before assuming the task needs high compute capacity.
The utilization gives an idea of the running time of the task not the
performance level that is needed

> think we could have a scenario where the big CPUs are filled with "small"
> tasks and the LITTLE CPUs hold a few "big" tasks - because what mostly
> matters here is the order in which the tasks spawn, not their utilization -
> which is potentially broken.
>
> There's that bit in *update_sd_pick_busiest()*:
>
> /* No ASYM_PACKING if target CPU is already busy */
> if (env->idle == CPU_NOT_IDLE)
> return true;
>
> So I'm not entirely sure how realistic that scenario is, but I suppose it
> could still happen. Food for thought in any case.
>
> Regards,
> Valentin


Re: [PATCH] sched: support dynamiQ cluster

2018-04-03 Thread Vincent Guittot
Hi Valentin,

On 3 April 2018 at 00:27, Valentin Schneider  wrote:
> Hi,
>
> On 30/03/18 13:34, Vincent Guittot wrote:
>> Hi Morten,
>>
> [..]
>>>
>>> As I see it, the main differences is that ASYM_PACKING attempts to pack
>>> all tasks regardless of task utilization on the higher capacity cpus
>>> whereas the "misfit task" series carefully picks cpus with tasks they
>>> can't handle so we don't risk migrating tasks which are perfectly
>>
>> That's one main difference because misfit task will let middle range
>> load task on little CPUs which will not provide maximum performance.
>> I have put an example below
>>
>>> suitable to for a little cpu to a big cpu unnecessarily. Also it is
>>> based directly on utilization and cpu capacity like the capacity
>>> awareness we already have to deal with big.LITTLE in the wake-up path.
>
> I think that bit is quite important. AFAICT, ASYM_PACKING disregards
> task utilization, it only makes sure that (with your patch) tasks will be
> migrated to big CPUS if those ever go idle (pulls at NEWLY_IDLE balance or
> later on during nohz balance). I didn't see anything related to ASYM_PACKING
> in the wake path.
>
>>> Have to tried taking the misfit patches for a spin on your setup? I
>>> expect them give you the same behaviour as you report above.
>>
>> So I have tried both your tests and mine on both patchset and they
>> provide same results which is somewhat expected as the benches are run
>> for several seconds.
>> In other to highlight the main difference between misfit task and
>> ASYM_PACKING, I have reused your test and reduced the number of
>> max-request for sysbench so that the test duration was in the range of
>> hundreds ms.
>>
>> Hikey960 (emulate dynamiq topology)
>>min avg(stdev)  max
>> misfit 0.0975000.114911(+- 10%)0.138500
>> asym   0.0925000.106072(+-  6%)0.122900
>>
>> In this case, we can see that ASYM_PACKING is doing better( 8%)
>> because it migrates sysbench threads on big core as soon as they are
>> available whereas misfit task has to wait for the utilization to
>> increase above the 80% which takes around 70ms when starting with an
>> utilization that is null
>>
>
> I believe ASYM_PACKING behaves better here because the workload is only
> sysbench threads. As stated above, since task utilization is disregarded, I

It behaves better because it doesn't wait for the task's utilization
to reach a level before assuming the task needs high compute capacity.
The utilization gives an idea of the running time of the task not the
performance level that is needed

> think we could have a scenario where the big CPUs are filled with "small"
> tasks and the LITTLE CPUs hold a few "big" tasks - because what mostly
> matters here is the order in which the tasks spawn, not their utilization -
> which is potentially broken.
>
> There's that bit in *update_sd_pick_busiest()*:
>
> /* No ASYM_PACKING if target CPU is already busy */
> if (env->idle == CPU_NOT_IDLE)
> return true;
>
> So I'm not entirely sure how realistic that scenario is, but I suppose it
> could still happen. Food for thought in any case.
>
> Regards,
> Valentin


Re: [PATCH] sched: support dynamiQ cluster

2018-04-02 Thread Valentin Schneider
Hi,

On 30/03/18 13:34, Vincent Guittot wrote:
> Hi Morten,
> 
[..]
>>
>> As I see it, the main differences is that ASYM_PACKING attempts to pack
>> all tasks regardless of task utilization on the higher capacity cpus
>> whereas the "misfit task" series carefully picks cpus with tasks they
>> can't handle so we don't risk migrating tasks which are perfectly
> 
> That's one main difference because misfit task will let middle range
> load task on little CPUs which will not provide maximum performance.
> I have put an example below
> 
>> suitable to for a little cpu to a big cpu unnecessarily. Also it is
>> based directly on utilization and cpu capacity like the capacity
>> awareness we already have to deal with big.LITTLE in the wake-up path.

I think that bit is quite important. AFAICT, ASYM_PACKING disregards
task utilization, it only makes sure that (with your patch) tasks will be
migrated to big CPUS if those ever go idle (pulls at NEWLY_IDLE balance or
later on during nohz balance). I didn't see anything related to ASYM_PACKING
in the wake path.

>> Have to tried taking the misfit patches for a spin on your setup? I
>> expect them give you the same behaviour as you report above.
> 
> So I have tried both your tests and mine on both patchset and they
> provide same results which is somewhat expected as the benches are run
> for several seconds.
> In other to highlight the main difference between misfit task and
> ASYM_PACKING, I have reused your test and reduced the number of
> max-request for sysbench so that the test duration was in the range of
> hundreds ms.
> 
> Hikey960 (emulate dynamiq topology)
>min avg(stdev)  max
> misfit 0.0975000.114911(+- 10%)0.138500
> asym   0.0925000.106072(+-  6%)0.122900
> 
> In this case, we can see that ASYM_PACKING is doing better( 8%)
> because it migrates sysbench threads on big core as soon as they are
> available whereas misfit task has to wait for the utilization to
> increase above the 80% which takes around 70ms when starting with an
> utilization that is null
> 

I believe ASYM_PACKING behaves better here because the workload is only
sysbench threads. As stated above, since task utilization is disregarded, I
think we could have a scenario where the big CPUs are filled with "small"
tasks and the LITTLE CPUs hold a few "big" tasks - because what mostly
matters here is the order in which the tasks spawn, not their utilization -
which is potentially broken.

There's that bit in *update_sd_pick_busiest()*:

/* No ASYM_PACKING if target CPU is already busy */ 

if (env->idle == CPU_NOT_IDLE)  

return true;

So I'm not entirely sure how realistic that scenario is, but I suppose it
could still happen. Food for thought in any case.

Regards,
Valentin


Re: [PATCH] sched: support dynamiQ cluster

2018-04-02 Thread Valentin Schneider
Hi,

On 30/03/18 13:34, Vincent Guittot wrote:
> Hi Morten,
> 
[..]
>>
>> As I see it, the main differences is that ASYM_PACKING attempts to pack
>> all tasks regardless of task utilization on the higher capacity cpus
>> whereas the "misfit task" series carefully picks cpus with tasks they
>> can't handle so we don't risk migrating tasks which are perfectly
> 
> That's one main difference because misfit task will let middle range
> load task on little CPUs which will not provide maximum performance.
> I have put an example below
> 
>> suitable to for a little cpu to a big cpu unnecessarily. Also it is
>> based directly on utilization and cpu capacity like the capacity
>> awareness we already have to deal with big.LITTLE in the wake-up path.

I think that bit is quite important. AFAICT, ASYM_PACKING disregards
task utilization, it only makes sure that (with your patch) tasks will be
migrated to big CPUS if those ever go idle (pulls at NEWLY_IDLE balance or
later on during nohz balance). I didn't see anything related to ASYM_PACKING
in the wake path.

>> Have to tried taking the misfit patches for a spin on your setup? I
>> expect them give you the same behaviour as you report above.
> 
> So I have tried both your tests and mine on both patchset and they
> provide same results which is somewhat expected as the benches are run
> for several seconds.
> In other to highlight the main difference between misfit task and
> ASYM_PACKING, I have reused your test and reduced the number of
> max-request for sysbench so that the test duration was in the range of
> hundreds ms.
> 
> Hikey960 (emulate dynamiq topology)
>min avg(stdev)  max
> misfit 0.0975000.114911(+- 10%)0.138500
> asym   0.0925000.106072(+-  6%)0.122900
> 
> In this case, we can see that ASYM_PACKING is doing better( 8%)
> because it migrates sysbench threads on big core as soon as they are
> available whereas misfit task has to wait for the utilization to
> increase above the 80% which takes around 70ms when starting with an
> utilization that is null
> 

I believe ASYM_PACKING behaves better here because the workload is only
sysbench threads. As stated above, since task utilization is disregarded, I
think we could have a scenario where the big CPUs are filled with "small"
tasks and the LITTLE CPUs hold a few "big" tasks - because what mostly
matters here is the order in which the tasks spawn, not their utilization -
which is potentially broken.

There's that bit in *update_sd_pick_busiest()*:

/* No ASYM_PACKING if target CPU is already busy */ 

if (env->idle == CPU_NOT_IDLE)  

return true;

So I'm not entirely sure how realistic that scenario is, but I suppose it
could still happen. Food for thought in any case.

Regards,
Valentin


Re: [PATCH] sched: support dynamiQ cluster

2018-03-30 Thread Vincent Guittot
Hi Morten,

On 29 March 2018 at 14:53, Morten Rasmussen  wrote:
> On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
>> Arm DynamiQ system can integrate cores with different micro architecture
>> or max OPP under the same DSU so we can have cores with different compute
>> capacity at the LLC (which was not the case with legacy big/LITTLE
>> architecture). Such configuration is similar in some way to ITMT on intel
>> platform which allows some cores to be boosted to higher turbo frequency
>> than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
>> highest capacity, will always be used in priortiy in order to provide
>> maximum throughput.
>>
>> Add arch_asym_cpu_priority() for arm64 as this function is used to
>> differentiate CPUs in the scheduler. The CPU's capacity is used to order
>> CPUs in the same DSU.
>>
>> Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
>> at MC level.
>>
>> Some tests have been done on a hikey960 platform (quad cortex-A53,
>> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
>> has been modified so the 8 heterogeneous cores are described as being part
>> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.
>>
>> Results below show the time in seconds to run sysbench --test=cpu with an
>> increasing number of threads. The sysbench test run 32 times
>>
>>  without patch with patchdiff
>> 1 threads11.04(+/- 30%)8.86(+/- 0%)  -19%
>> 2 threads 5.59(+/- 14%)4.43(+/- 0%)  -20%
>> 3 threads 3.80(+/- 13%)2.95(+/- 0%)  -22%
>> 4 threads 3.10(+/- 12%)2.22(+/- 0%)  -28%
>> 5 threads 2.47(+/-  5%)1.95(+/- 0%)  -21%
>> 6 threads 2.09(+/-  0%)1.73(+/- 0%)  -17%
>> 7 threads 1.64(+/-  0%)1.56(+/- 0%)  - 7%
>> 8 threads 1.42(+/-  0%)1.42(+/- 0%)0%
>>
>> Results show a better and stable results across iteration with the patch
>> compared to mainline because we are always using big cores in priority 
>> whereas
>> with mainline, the scheduler randomly choose a big or a little cores when
>> there are more cores than number of threads.
>> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
>> mainline whereas it stays in the range [8.85..8.87] with the patch
>
> Using ASYM_PACKING is essentially an easier but somewhat less accurate
> way to achieve the same behaviour for big.LITTLE system as with the
> "misfit task" series that been under review here for the last couple of
> months.

I think that it's not exactly the same goal although if it's probably
close but ASYM_PACKING ensures that the maximum compute capacity is
used.

>
> As I see it, the main differences is that ASYM_PACKING attempts to pack
> all tasks regardless of task utilization on the higher capacity cpus
> whereas the "misfit task" series carefully picks cpus with tasks they
> can't handle so we don't risk migrating tasks which are perfectly

That's one main difference because misfit task will let middle range
load task on little CPUs which will not provide maximum performance.
I have put an example below

> suitable to for a little cpu to a big cpu unnecessarily. Also it is
> based directly on utilization and cpu capacity like the capacity
> awareness we already have to deal with big.LITTLE in the wake-up path.
> Furthermore, it should work for all big.LITTLE systems regardless of the
> topology, where I think ASYM_PACKING might not work well for systems
> with separate big and little sched_domains.

I haven't look in details if ASYM_PACKING can work correctly on legacy
big/little as I was mainly focus on dynamiQ config but I guess that
might also work

>
> Have to tried taking the misfit patches for a spin on your setup? I
> expect them give you the same behaviour as you report above.

So I have tried both your tests and mine on both patchset and they
provide same results which is somewhat expected as the benches are run
for several seconds.
In other to highlight the main difference between misfit task and
ASYM_PACKING, I have reused your test and reduced the number of
max-request for sysbench so that the test duration was in the range of
hundreds ms.

Hikey960 (emulate dynamiq topology)
   min avg(stdev)  max
misfit 0.0975000.114911(+- 10%)0.138500
asym   0.0925000.106072(+-  6%)0.122900

In this case, we can see that ASYM_PACKING is doing better( 8%)
because it migrates sysbench threads on big core as soon as they are
available whereas misfit task has to wait for the utilization to
increase above the 80% which takes around 70ms when starting with an
utilization that is null

Regards,
Vincent

>
> Morten


Re: [PATCH] sched: support dynamiQ cluster

2018-03-30 Thread Vincent Guittot
Hi Morten,

On 29 March 2018 at 14:53, Morten Rasmussen  wrote:
> On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
>> Arm DynamiQ system can integrate cores with different micro architecture
>> or max OPP under the same DSU so we can have cores with different compute
>> capacity at the LLC (which was not the case with legacy big/LITTLE
>> architecture). Such configuration is similar in some way to ITMT on intel
>> platform which allows some cores to be boosted to higher turbo frequency
>> than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
>> highest capacity, will always be used in priortiy in order to provide
>> maximum throughput.
>>
>> Add arch_asym_cpu_priority() for arm64 as this function is used to
>> differentiate CPUs in the scheduler. The CPU's capacity is used to order
>> CPUs in the same DSU.
>>
>> Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
>> at MC level.
>>
>> Some tests have been done on a hikey960 platform (quad cortex-A53,
>> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
>> has been modified so the 8 heterogeneous cores are described as being part
>> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.
>>
>> Results below show the time in seconds to run sysbench --test=cpu with an
>> increasing number of threads. The sysbench test run 32 times
>>
>>  without patch with patchdiff
>> 1 threads11.04(+/- 30%)8.86(+/- 0%)  -19%
>> 2 threads 5.59(+/- 14%)4.43(+/- 0%)  -20%
>> 3 threads 3.80(+/- 13%)2.95(+/- 0%)  -22%
>> 4 threads 3.10(+/- 12%)2.22(+/- 0%)  -28%
>> 5 threads 2.47(+/-  5%)1.95(+/- 0%)  -21%
>> 6 threads 2.09(+/-  0%)1.73(+/- 0%)  -17%
>> 7 threads 1.64(+/-  0%)1.56(+/- 0%)  - 7%
>> 8 threads 1.42(+/-  0%)1.42(+/- 0%)0%
>>
>> Results show a better and stable results across iteration with the patch
>> compared to mainline because we are always using big cores in priority 
>> whereas
>> with mainline, the scheduler randomly choose a big or a little cores when
>> there are more cores than number of threads.
>> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
>> mainline whereas it stays in the range [8.85..8.87] with the patch
>
> Using ASYM_PACKING is essentially an easier but somewhat less accurate
> way to achieve the same behaviour for big.LITTLE system as with the
> "misfit task" series that been under review here for the last couple of
> months.

I think that it's not exactly the same goal although if it's probably
close but ASYM_PACKING ensures that the maximum compute capacity is
used.

>
> As I see it, the main differences is that ASYM_PACKING attempts to pack
> all tasks regardless of task utilization on the higher capacity cpus
> whereas the "misfit task" series carefully picks cpus with tasks they
> can't handle so we don't risk migrating tasks which are perfectly

That's one main difference because misfit task will let middle range
load task on little CPUs which will not provide maximum performance.
I have put an example below

> suitable to for a little cpu to a big cpu unnecessarily. Also it is
> based directly on utilization and cpu capacity like the capacity
> awareness we already have to deal with big.LITTLE in the wake-up path.
> Furthermore, it should work for all big.LITTLE systems regardless of the
> topology, where I think ASYM_PACKING might not work well for systems
> with separate big and little sched_domains.

I haven't look in details if ASYM_PACKING can work correctly on legacy
big/little as I was mainly focus on dynamiQ config but I guess that
might also work

>
> Have to tried taking the misfit patches for a spin on your setup? I
> expect them give you the same behaviour as you report above.

So I have tried both your tests and mine on both patchset and they
provide same results which is somewhat expected as the benches are run
for several seconds.
In other to highlight the main difference between misfit task and
ASYM_PACKING, I have reused your test and reduced the number of
max-request for sysbench so that the test duration was in the range of
hundreds ms.

Hikey960 (emulate dynamiq topology)
   min avg(stdev)  max
misfit 0.0975000.114911(+- 10%)0.138500
asym   0.0925000.106072(+-  6%)0.122900

In this case, we can see that ASYM_PACKING is doing better( 8%)
because it migrates sysbench threads on big core as soon as they are
available whereas misfit task has to wait for the utilization to
increase above the 80% which takes around 70ms when starting with an
utilization that is null

Regards,
Vincent

>
> Morten


Re: [PATCH] sched: support dynamiQ cluster

2018-03-29 Thread Morten Rasmussen
On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
> Arm DynamiQ system can integrate cores with different micro architecture
> or max OPP under the same DSU so we can have cores with different compute
> capacity at the LLC (which was not the case with legacy big/LITTLE
> architecture). Such configuration is similar in some way to ITMT on intel
> platform which allows some cores to be boosted to higher turbo frequency
> than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
> highest capacity, will always be used in priortiy in order to provide
> maximum throughput.
> 
> Add arch_asym_cpu_priority() for arm64 as this function is used to
> differentiate CPUs in the scheduler. The CPU's capacity is used to order
> CPUs in the same DSU.
> 
> Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
> at MC level.
> 
> Some tests have been done on a hikey960 platform (quad cortex-A53,
> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
> has been modified so the 8 heterogeneous cores are described as being part
> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.
> 
> Results below show the time in seconds to run sysbench --test=cpu with an
> increasing number of threads. The sysbench test run 32 times
> 
>  without patch with patchdiff
> 1 threads11.04(+/- 30%)8.86(+/- 0%)  -19%
> 2 threads 5.59(+/- 14%)4.43(+/- 0%)  -20%
> 3 threads 3.80(+/- 13%)2.95(+/- 0%)  -22%
> 4 threads 3.10(+/- 12%)2.22(+/- 0%)  -28%
> 5 threads 2.47(+/-  5%)1.95(+/- 0%)  -21%
> 6 threads 2.09(+/-  0%)1.73(+/- 0%)  -17%
> 7 threads 1.64(+/-  0%)1.56(+/- 0%)  - 7%
> 8 threads 1.42(+/-  0%)1.42(+/- 0%)0%
> 
> Results show a better and stable results across iteration with the patch
> compared to mainline because we are always using big cores in priority whereas
> with mainline, the scheduler randomly choose a big or a little cores when
> there are more cores than number of threads.
> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
> mainline whereas it stays in the range [8.85..8.87] with the patch

Using ASYM_PACKING is essentially an easier but somewhat less accurate
way to achieve the same behaviour for big.LITTLE system as with the
"misfit task" series that been under review here for the last couple of
months.

As I see it, the main differences is that ASYM_PACKING attempts to pack
all tasks regardless of task utilization on the higher capacity cpus
whereas the "misfit task" series carefully picks cpus with tasks they
can't handle so we don't risk migrating tasks which are perfectly
suitable to for a little cpu to a big cpu unnecessarily. Also it is
based directly on utilization and cpu capacity like the capacity
awareness we already have to deal with big.LITTLE in the wake-up path.
Furthermore, it should work for all big.LITTLE systems regardless of the
topology, where I think ASYM_PACKING might not work well for systems
with separate big and little sched_domains.

Have to tried taking the misfit patches for a spin on your setup? I
expect them give you the same behaviour as you report above.

Morten


Re: [PATCH] sched: support dynamiQ cluster

2018-03-29 Thread Morten Rasmussen
On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
> Arm DynamiQ system can integrate cores with different micro architecture
> or max OPP under the same DSU so we can have cores with different compute
> capacity at the LLC (which was not the case with legacy big/LITTLE
> architecture). Such configuration is similar in some way to ITMT on intel
> platform which allows some cores to be boosted to higher turbo frequency
> than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
> highest capacity, will always be used in priortiy in order to provide
> maximum throughput.
> 
> Add arch_asym_cpu_priority() for arm64 as this function is used to
> differentiate CPUs in the scheduler. The CPU's capacity is used to order
> CPUs in the same DSU.
> 
> Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
> at MC level.
> 
> Some tests have been done on a hikey960 platform (quad cortex-A53,
> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
> has been modified so the 8 heterogeneous cores are described as being part
> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.
> 
> Results below show the time in seconds to run sysbench --test=cpu with an
> increasing number of threads. The sysbench test run 32 times
> 
>  without patch with patchdiff
> 1 threads11.04(+/- 30%)8.86(+/- 0%)  -19%
> 2 threads 5.59(+/- 14%)4.43(+/- 0%)  -20%
> 3 threads 3.80(+/- 13%)2.95(+/- 0%)  -22%
> 4 threads 3.10(+/- 12%)2.22(+/- 0%)  -28%
> 5 threads 2.47(+/-  5%)1.95(+/- 0%)  -21%
> 6 threads 2.09(+/-  0%)1.73(+/- 0%)  -17%
> 7 threads 1.64(+/-  0%)1.56(+/- 0%)  - 7%
> 8 threads 1.42(+/-  0%)1.42(+/- 0%)0%
> 
> Results show a better and stable results across iteration with the patch
> compared to mainline because we are always using big cores in priority whereas
> with mainline, the scheduler randomly choose a big or a little cores when
> there are more cores than number of threads.
> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
> mainline whereas it stays in the range [8.85..8.87] with the patch

Using ASYM_PACKING is essentially an easier but somewhat less accurate
way to achieve the same behaviour for big.LITTLE system as with the
"misfit task" series that been under review here for the last couple of
months.

As I see it, the main differences is that ASYM_PACKING attempts to pack
all tasks regardless of task utilization on the higher capacity cpus
whereas the "misfit task" series carefully picks cpus with tasks they
can't handle so we don't risk migrating tasks which are perfectly
suitable to for a little cpu to a big cpu unnecessarily. Also it is
based directly on utilization and cpu capacity like the capacity
awareness we already have to deal with big.LITTLE in the wake-up path.
Furthermore, it should work for all big.LITTLE systems regardless of the
topology, where I think ASYM_PACKING might not work well for systems
with separate big and little sched_domains.

Have to tried taking the misfit patches for a spin on your setup? I
expect them give you the same behaviour as you report above.

Morten


Re: [PATCH] sched: support dynamiQ cluster

2018-03-28 Thread Vincent Guittot
On 28 March 2018 at 11:12, Will Deacon  wrote:
> On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:

>>
>> The SD_ASYM_PACKING flag is disabled by default and I'm preparing another 
>> patch
>> to enable this dynamically at boot time by detecting the system topology.
>>
>>  arch/arm64/kernel/topology.c | 30 ++
>>  1 file changed, 30 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
>> index 2186853..cb6705e5 100644
>> --- a/arch/arm64/kernel/topology.c
>> +++ b/arch/arm64/kernel/topology.c
>> @@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
>>   }
>>  }
>>
>> +#ifdef CONFIG_SCHED_MC
>> +unsigned int __read_mostly arm64_sched_asym_enabled;
>> +
>> +int arch_asym_cpu_priority(int cpu)
>> +{
>> + return topology_get_cpu_scale(NULL, cpu);
>> +}
>> +
>> +static inline int arm64_sched_dynamiq(void)
>> +{
>> + return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
>> +}
>> +
>> +static int arm64_core_flags(void)
>> +{
>> + return cpu_core_flags() | arm64_sched_dynamiq();
>> +}
>> +#endif
>> +
>> +static struct sched_domain_topology_level arm64_topology[] = {
>> +#ifdef CONFIG_SCHED_MC
>> + { cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },
>
> Maybe stick this in a macro to avoid the double #ifdef?

ok, I will do that in next version

Vincent

>
> Will


Re: [PATCH] sched: support dynamiQ cluster

2018-03-28 Thread Vincent Guittot
On 28 March 2018 at 11:12, Will Deacon  wrote:
> On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:

>>
>> The SD_ASYM_PACKING flag is disabled by default and I'm preparing another 
>> patch
>> to enable this dynamically at boot time by detecting the system topology.
>>
>>  arch/arm64/kernel/topology.c | 30 ++
>>  1 file changed, 30 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
>> index 2186853..cb6705e5 100644
>> --- a/arch/arm64/kernel/topology.c
>> +++ b/arch/arm64/kernel/topology.c
>> @@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
>>   }
>>  }
>>
>> +#ifdef CONFIG_SCHED_MC
>> +unsigned int __read_mostly arm64_sched_asym_enabled;
>> +
>> +int arch_asym_cpu_priority(int cpu)
>> +{
>> + return topology_get_cpu_scale(NULL, cpu);
>> +}
>> +
>> +static inline int arm64_sched_dynamiq(void)
>> +{
>> + return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
>> +}
>> +
>> +static int arm64_core_flags(void)
>> +{
>> + return cpu_core_flags() | arm64_sched_dynamiq();
>> +}
>> +#endif
>> +
>> +static struct sched_domain_topology_level arm64_topology[] = {
>> +#ifdef CONFIG_SCHED_MC
>> + { cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },
>
> Maybe stick this in a macro to avoid the double #ifdef?

ok, I will do that in next version

Vincent

>
> Will


Re: [PATCH] sched: support dynamiQ cluster

2018-03-28 Thread Will Deacon
On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
> Arm DynamiQ system can integrate cores with different micro architecture
> or max OPP under the same DSU so we can have cores with different compute
> capacity at the LLC (which was not the case with legacy big/LITTLE
> architecture). Such configuration is similar in some way to ITMT on intel
> platform which allows some cores to be boosted to higher turbo frequency
> than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
> highest capacity, will always be used in priortiy in order to provide
> maximum throughput.
> 
> Add arch_asym_cpu_priority() for arm64 as this function is used to
> differentiate CPUs in the scheduler. The CPU's capacity is used to order
> CPUs in the same DSU.
> 
> Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
> at MC level.
> 
> Some tests have been done on a hikey960 platform (quad cortex-A53,
> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
> has been modified so the 8 heterogeneous cores are described as being part
> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.
> 
> Results below show the time in seconds to run sysbench --test=cpu with an
> increasing number of threads. The sysbench test run 32 times
> 
>  without patch with patchdiff
> 1 threads11.04(+/- 30%)8.86(+/- 0%)  -19%
> 2 threads 5.59(+/- 14%)4.43(+/- 0%)  -20%
> 3 threads 3.80(+/- 13%)2.95(+/- 0%)  -22%
> 4 threads 3.10(+/- 12%)2.22(+/- 0%)  -28%
> 5 threads 2.47(+/-  5%)1.95(+/- 0%)  -21%
> 6 threads 2.09(+/-  0%)1.73(+/- 0%)  -17%
> 7 threads 1.64(+/-  0%)1.56(+/- 0%)  - 7%
> 8 threads 1.42(+/-  0%)1.42(+/- 0%)0%
> 
> Results show a better and stable results across iteration with the patch
> compared to mainline because we are always using big cores in priority whereas
> with mainline, the scheduler randomly choose a big or a little cores when
> there are more cores than number of threads.
> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
> mainline whereas it stays in the range [8.85..8.87] with the patch
> 
> Signed-off-by: Vincent Guittot 
> 
> ---
> 
> The SD_ASYM_PACKING flag is disabled by default and I'm preparing another 
> patch
> to enable this dynamically at boot time by detecting the system topology.
> 
>  arch/arm64/kernel/topology.c | 30 ++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 2186853..cb6705e5 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
>   }
>  }
>  
> +#ifdef CONFIG_SCHED_MC
> +unsigned int __read_mostly arm64_sched_asym_enabled;
> +
> +int arch_asym_cpu_priority(int cpu)
> +{
> + return topology_get_cpu_scale(NULL, cpu);
> +}
> +
> +static inline int arm64_sched_dynamiq(void)
> +{
> + return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
> +}
> +
> +static int arm64_core_flags(void)
> +{
> + return cpu_core_flags() | arm64_sched_dynamiq();
> +}
> +#endif
> +
> +static struct sched_domain_topology_level arm64_topology[] = {
> +#ifdef CONFIG_SCHED_MC
> + { cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },

Maybe stick this in a macro to avoid the double #ifdef?

Will


Re: [PATCH] sched: support dynamiQ cluster

2018-03-28 Thread Will Deacon
On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
> Arm DynamiQ system can integrate cores with different micro architecture
> or max OPP under the same DSU so we can have cores with different compute
> capacity at the LLC (which was not the case with legacy big/LITTLE
> architecture). Such configuration is similar in some way to ITMT on intel
> platform which allows some cores to be boosted to higher turbo frequency
> than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
> highest capacity, will always be used in priortiy in order to provide
> maximum throughput.
> 
> Add arch_asym_cpu_priority() for arm64 as this function is used to
> differentiate CPUs in the scheduler. The CPU's capacity is used to order
> CPUs in the same DSU.
> 
> Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
> at MC level.
> 
> Some tests have been done on a hikey960 platform (quad cortex-A53,
> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
> has been modified so the 8 heterogeneous cores are described as being part
> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.
> 
> Results below show the time in seconds to run sysbench --test=cpu with an
> increasing number of threads. The sysbench test run 32 times
> 
>  without patch with patchdiff
> 1 threads11.04(+/- 30%)8.86(+/- 0%)  -19%
> 2 threads 5.59(+/- 14%)4.43(+/- 0%)  -20%
> 3 threads 3.80(+/- 13%)2.95(+/- 0%)  -22%
> 4 threads 3.10(+/- 12%)2.22(+/- 0%)  -28%
> 5 threads 2.47(+/-  5%)1.95(+/- 0%)  -21%
> 6 threads 2.09(+/-  0%)1.73(+/- 0%)  -17%
> 7 threads 1.64(+/-  0%)1.56(+/- 0%)  - 7%
> 8 threads 1.42(+/-  0%)1.42(+/- 0%)0%
> 
> Results show a better and stable results across iteration with the patch
> compared to mainline because we are always using big cores in priority whereas
> with mainline, the scheduler randomly choose a big or a little cores when
> there are more cores than number of threads.
> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
> mainline whereas it stays in the range [8.85..8.87] with the patch
> 
> Signed-off-by: Vincent Guittot 
> 
> ---
> 
> The SD_ASYM_PACKING flag is disabled by default and I'm preparing another 
> patch
> to enable this dynamically at boot time by detecting the system topology.
> 
>  arch/arm64/kernel/topology.c | 30 ++
>  1 file changed, 30 insertions(+)
> 
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 2186853..cb6705e5 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
>   }
>  }
>  
> +#ifdef CONFIG_SCHED_MC
> +unsigned int __read_mostly arm64_sched_asym_enabled;
> +
> +int arch_asym_cpu_priority(int cpu)
> +{
> + return topology_get_cpu_scale(NULL, cpu);
> +}
> +
> +static inline int arm64_sched_dynamiq(void)
> +{
> + return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
> +}
> +
> +static int arm64_core_flags(void)
> +{
> + return cpu_core_flags() | arm64_sched_dynamiq();
> +}
> +#endif
> +
> +static struct sched_domain_topology_level arm64_topology[] = {
> +#ifdef CONFIG_SCHED_MC
> + { cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },

Maybe stick this in a macro to avoid the double #ifdef?

Will


[PATCH] sched: support dynamiQ cluster

2018-03-28 Thread Vincent Guittot
Arm DynamiQ system can integrate cores with different micro architecture
or max OPP under the same DSU so we can have cores with different compute
capacity at the LLC (which was not the case with legacy big/LITTLE
architecture). Such configuration is similar in some way to ITMT on intel
platform which allows some cores to be boosted to higher turbo frequency
than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
highest capacity, will always be used in priortiy in order to provide
maximum throughput.

Add arch_asym_cpu_priority() for arm64 as this function is used to
differentiate CPUs in the scheduler. The CPU's capacity is used to order
CPUs in the same DSU.

Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
at MC level.

Some tests have been done on a hikey960 platform (quad cortex-A53,
quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
has been modified so the 8 heterogeneous cores are described as being part
of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.

Results below show the time in seconds to run sysbench --test=cpu with an
increasing number of threads. The sysbench test run 32 times

 without patch with patchdiff
1 threads11.04(+/- 30%)8.86(+/- 0%)  -19%
2 threads 5.59(+/- 14%)4.43(+/- 0%)  -20%
3 threads 3.80(+/- 13%)2.95(+/- 0%)  -22%
4 threads 3.10(+/- 12%)2.22(+/- 0%)  -28%
5 threads 2.47(+/-  5%)1.95(+/- 0%)  -21%
6 threads 2.09(+/-  0%)1.73(+/- 0%)  -17%
7 threads 1.64(+/-  0%)1.56(+/- 0%)  - 7%
8 threads 1.42(+/-  0%)1.42(+/- 0%)0%

Results show a better and stable results across iteration with the patch
compared to mainline because we are always using big cores in priority whereas
with mainline, the scheduler randomly choose a big or a little cores when
there are more cores than number of threads.
With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
mainline whereas it stays in the range [8.85..8.87] with the patch

Signed-off-by: Vincent Guittot 

---

The SD_ASYM_PACKING flag is disabled by default and I'm preparing another patch
to enable this dynamically at boot time by detecting the system topology.

 arch/arm64/kernel/topology.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 2186853..cb6705e5 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
}
 }
 
+#ifdef CONFIG_SCHED_MC
+unsigned int __read_mostly arm64_sched_asym_enabled;
+
+int arch_asym_cpu_priority(int cpu)
+{
+   return topology_get_cpu_scale(NULL, cpu);
+}
+
+static inline int arm64_sched_dynamiq(void)
+{
+   return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
+}
+
+static int arm64_core_flags(void)
+{
+   return cpu_core_flags() | arm64_sched_dynamiq();
+}
+#endif
+
+static struct sched_domain_topology_level arm64_topology[] = {
+#ifdef CONFIG_SCHED_MC
+   { cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },
+#endif
+   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+   { NULL, },
+};
+
 void __init init_cpu_topology(void)
 {
reset_cpu_topology();
@@ -306,4 +333,7 @@ void __init init_cpu_topology(void)
 */
if (of_have_populated_dt() && parse_dt_topology())
reset_cpu_topology();
+
+   /* Set scheduler topology descriptor */
+   set_sched_topology(arm64_topology);
 }
-- 
2.7.4



[PATCH] sched: support dynamiQ cluster

2018-03-28 Thread Vincent Guittot
Arm DynamiQ system can integrate cores with different micro architecture
or max OPP under the same DSU so we can have cores with different compute
capacity at the LLC (which was not the case with legacy big/LITTLE
architecture). Such configuration is similar in some way to ITMT on intel
platform which allows some cores to be boosted to higher turbo frequency
than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with
highest capacity, will always be used in priortiy in order to provide
maximum throughput.

Add arch_asym_cpu_priority() for arm64 as this function is used to
differentiate CPUs in the scheduler. The CPU's capacity is used to order
CPUs in the same DSU.

Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING
at MC level.

Some tests have been done on a hikey960 platform (quad cortex-A53,
quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
has been modified so the 8 heterogeneous cores are described as being part
of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.

Results below show the time in seconds to run sysbench --test=cpu with an
increasing number of threads. The sysbench test run 32 times

 without patch with patchdiff
1 threads11.04(+/- 30%)8.86(+/- 0%)  -19%
2 threads 5.59(+/- 14%)4.43(+/- 0%)  -20%
3 threads 3.80(+/- 13%)2.95(+/- 0%)  -22%
4 threads 3.10(+/- 12%)2.22(+/- 0%)  -28%
5 threads 2.47(+/-  5%)1.95(+/- 0%)  -21%
6 threads 2.09(+/-  0%)1.73(+/- 0%)  -17%
7 threads 1.64(+/-  0%)1.56(+/- 0%)  - 7%
8 threads 1.42(+/-  0%)1.42(+/- 0%)0%

Results show a better and stable results across iteration with the patch
compared to mainline because we are always using big cores in priority whereas
with mainline, the scheduler randomly choose a big or a little cores when
there are more cores than number of threads.
With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
mainline whereas it stays in the range [8.85..8.87] with the patch

Signed-off-by: Vincent Guittot 

---

The SD_ASYM_PACKING flag is disabled by default and I'm preparing another patch
to enable this dynamically at boot time by detecting the system topology.

 arch/arm64/kernel/topology.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 2186853..cb6705e5 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
}
 }
 
+#ifdef CONFIG_SCHED_MC
+unsigned int __read_mostly arm64_sched_asym_enabled;
+
+int arch_asym_cpu_priority(int cpu)
+{
+   return topology_get_cpu_scale(NULL, cpu);
+}
+
+static inline int arm64_sched_dynamiq(void)
+{
+   return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
+}
+
+static int arm64_core_flags(void)
+{
+   return cpu_core_flags() | arm64_sched_dynamiq();
+}
+#endif
+
+static struct sched_domain_topology_level arm64_topology[] = {
+#ifdef CONFIG_SCHED_MC
+   { cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },
+#endif
+   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+   { NULL, },
+};
+
 void __init init_cpu_topology(void)
 {
reset_cpu_topology();
@@ -306,4 +333,7 @@ void __init init_cpu_topology(void)
 */
if (of_have_populated_dt() && parse_dt_topology())
reset_cpu_topology();
+
+   /* Set scheduler topology descriptor */
+   set_sched_topology(arm64_topology);
 }
-- 
2.7.4