subject:"\[RFC v3 1\/5\] sched\/core\: add capacity constraints to CPU controller"

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-04-01 Thread Patrick Bellasi

Hi Paul,

On 30-Mar 14:15, Paul Turner wrote:
> On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
>  wrote:
> > On 20-Mar 13:15, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
> >> > This patch extends the CPU controller by adding a couple of new
> >> > attributes, capacity_min and capacity_max, which can be used to enforce
> >> > bandwidth boosting and capping. More specifically:
> >> >
> >> > - capacity_min: defines the minimum capacity which should be granted
> >> > (by schedutil) when a task in this group is running,
> >> > i.e. the task will run at least at that capacity
> >> >
> >> > - capacity_max: defines the maximum capacity which can be granted
> >> > (by schedutil) when a task in this group is running,
> >> > i.e. the task can run up to that capacity
> >>
> >> cpu.capacity.min and cpu.capacity.max are the more conventional names.
> >
> > Ok, should be an easy renaming.
> >
> >> I'm not sure about the name capacity as it doesn't encode what it does
> >> and is difficult to tell apart from cpu bandwidth limits.  I think
> >> it'd be better to represent what it controls more explicitly.
> >
> > In the scheduler jargon, capacity represents the amount of computation
> > that a CPU can provide and it's usually defined to be 1024 for the
> > biggest CPU (on non SMP systems) running at the highest OPP (i.e.
> > maximum frequency).
> >
> > It's true that it kind of overlaps with the concept of "bandwidth".
> > However, the main difference here is that "bandwidth" is not frequency
> > (and architecture) scaled.
> > Thus, for example, assuming we have only one CPU with these two OPPs:
> >
> >OPP | Frequency | Capacity
> >  1 |500MHz |  512
> >  2 |  1GHz | 1024
> 
> I think exposing capacity in this manner is extremely challenging.
> It's not normalized in any way between architectures, which places a
> lot of the ABI in the API.

Capacities of CPUs are already exposed, at least for ARM platforms, using
a platform independent definition which is documented here:

http://lxr.free-electrons.com/source/Documentation/devicetree/bindings/arm/cpus.txt#L245

As the notes in the documentation highlight, it's not a perfect
metrics but still it allows to distinguish between computational
capabilities of different (micro)architectures and/or OPPs.

Within the scheduler we use SCHED_CAPACITY_SCALE:
   http://lxr.free-electrons.com/ident?i=SCHED_CAPACITY_SCALE
for everything related to actual CPU computational capabilities.

That's why in the current implementation we expose the same metric to
define capacity constraints.

We considered also the idea to expose a more generic percentage value
[0..100], do you think that could be better?
Consider that at the end we will still have to scale 100% to 1024.

> Have you considered any schemes for normalizing this in a reasonable fashion?

For each specific target platform, capacities are already normalized
to 1024, which is the capacity of the most capable CPU running at the
highest OPP. Thus, 1024 always represents 100% of the available
computational capabilities of the most preforming system's CPU.

Perhaps I cannot completely get what what you mean when you say that
it should be "normalized between architectures".
Can you explain better, maybe with an example?

[...]

Cheers Patrick

-- 
#include 

Patrick Bellasi

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-04-01 Thread Patrick Bellasi

Hi Paul,

On 30-Mar 14:15, Paul Turner wrote:
> On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
>  wrote:
> > On 20-Mar 13:15, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
> >> > This patch extends the CPU controller by adding a couple of new
> >> > attributes, capacity_min and capacity_max, which can be used to enforce
> >> > bandwidth boosting and capping. More specifically:
> >> >
> >> > - capacity_min: defines the minimum capacity which should be granted
> >> > (by schedutil) when a task in this group is running,
> >> > i.e. the task will run at least at that capacity
> >> >
> >> > - capacity_max: defines the maximum capacity which can be granted
> >> > (by schedutil) when a task in this group is running,
> >> > i.e. the task can run up to that capacity
> >>
> >> cpu.capacity.min and cpu.capacity.max are the more conventional names.
> >
> > Ok, should be an easy renaming.
> >
> >> I'm not sure about the name capacity as it doesn't encode what it does
> >> and is difficult to tell apart from cpu bandwidth limits.  I think
> >> it'd be better to represent what it controls more explicitly.
> >
> > In the scheduler jargon, capacity represents the amount of computation
> > that a CPU can provide and it's usually defined to be 1024 for the
> > biggest CPU (on non SMP systems) running at the highest OPP (i.e.
> > maximum frequency).
> >
> > It's true that it kind of overlaps with the concept of "bandwidth".
> > However, the main difference here is that "bandwidth" is not frequency
> > (and architecture) scaled.
> > Thus, for example, assuming we have only one CPU with these two OPPs:
> >
> >OPP | Frequency | Capacity
> >  1 |500MHz |  512
> >  2 |  1GHz | 1024
> 
> I think exposing capacity in this manner is extremely challenging.
> It's not normalized in any way between architectures, which places a
> lot of the ABI in the API.

Capacities of CPUs are already exposed, at least for ARM platforms, using
a platform independent definition which is documented here:

http://lxr.free-electrons.com/source/Documentation/devicetree/bindings/arm/cpus.txt#L245

As the notes in the documentation highlight, it's not a perfect
metrics but still it allows to distinguish between computational
capabilities of different (micro)architectures and/or OPPs.

Within the scheduler we use SCHED_CAPACITY_SCALE:
   http://lxr.free-electrons.com/ident?i=SCHED_CAPACITY_SCALE
for everything related to actual CPU computational capabilities.

That's why in the current implementation we expose the same metric to
define capacity constraints.

We considered also the idea to expose a more generic percentage value
[0..100], do you think that could be better?
Consider that at the end we will still have to scale 100% to 1024.

> Have you considered any schemes for normalizing this in a reasonable fashion?

For each specific target platform, capacities are already normalized
to 1024, which is the capacity of the most capable CPU running at the
highest OPP. Thus, 1024 always represents 100% of the available
computational capabilities of the most preforming system's CPU.

Perhaps I cannot completely get what what you mean when you say that
it should be "normalized between architectures".
Can you explain better, maybe with an example?

[...]

Cheers Patrick

-- 
#include 

Patrick Bellasi

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-30 Thread Paul Turner

On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
 wrote:
> On 20-Mar 13:15, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
>> > This patch extends the CPU controller by adding a couple of new
>> > attributes, capacity_min and capacity_max, which can be used to enforce
>> > bandwidth boosting and capping. More specifically:
>> >
>> > - capacity_min: defines the minimum capacity which should be granted
>> > (by schedutil) when a task in this group is running,
>> > i.e. the task will run at least at that capacity
>> >
>> > - capacity_max: defines the maximum capacity which can be granted
>> > (by schedutil) when a task in this group is running,
>> > i.e. the task can run up to that capacity
>>
>> cpu.capacity.min and cpu.capacity.max are the more conventional names.
>
> Ok, should be an easy renaming.
>
>> I'm not sure about the name capacity as it doesn't encode what it does
>> and is difficult to tell apart from cpu bandwidth limits.  I think
>> it'd be better to represent what it controls more explicitly.
>
> In the scheduler jargon, capacity represents the amount of computation
> that a CPU can provide and it's usually defined to be 1024 for the
> biggest CPU (on non SMP systems) running at the highest OPP (i.e.
> maximum frequency).
>
> It's true that it kind of overlaps with the concept of "bandwidth".
> However, the main difference here is that "bandwidth" is not frequency
> (and architecture) scaled.
> Thus, for example, assuming we have only one CPU with these two OPPs:
>
>OPP | Frequency | Capacity
>  1 |500MHz |  512
>  2 |  1GHz | 1024

I think exposing capacity in this manner is extremely challenging.
It's not normalized in any way between architectures, which places a
lot of the ABI in the API.

Have you considered any schemes for normalizing this in a reasonable fashion?
`
>
> a task running 60% of the time on that CPU when configured to run at
> 500MHz, from the bandwidth standpoint it's using 60% bandwidth but, from
> the capacity standpoint, is using only 30% of the available capacity.
>
> IOW, bandwidth is purely temporal based while capacity factors in both
> frequency and architectural differences.
> Thus, while a "bandwidth" constraint limits the amount of time a task
> can use a CPU, independently from the "actual computation" performed,
> with the new "capacity" constraints we can enforce much "actual
> computation" a task can perform in the "unit of time".
>
>> > These attributes:
>> > a) are tunable at all hierarchy levels, i.e. root group too
>>
>> This usually is problematic because there should be a non-cgroup way
>> of configuring the feature in case cgroup isn't configured or used,
>> and it becomes awkward to have two separate mechanisms configuring the
>> same thing.  Maybe the feature is cgroup specific enough that it makes
>> sense here but this needs more explanation / justification.
>
> In the previous proposal I used to expose global tunables under
> procfs, e.g.:
>
>  /proc/sys/kernel/sched_capacity_min
>  /proc/sys/kernel/sched_capacity_max
>
> which can be used to defined tunable root constraints when CGroups are
> not available, and becomes RO when CGroups are.
>
> Can this be eventually an acceptable option?
>
> In any case I think that this feature will be mainly targeting CGroup
> based systems. Indeed, one of the main goals is to collect
> "application specific" information from "informed run-times". Being
> "application specific" means that we need a way to classify
> applications depending on the runtime context... and that capability
> in Linux is ultimately provided via the CGroup interface.
>
>> > b) allow to create subgroups of tasks which are not violating the
>> >capacity constraints defined by the parent group.
>> >Thus, tasks on a subgroup can only be more boosted and/or more
>>
>> For both limits and protections, the parent caps the maximum the
>> children can get.  At least that's what memcg does for memory.low.
>> Doing that makes sense for memcg because for memory the parent can
>> still do protections regardless of what its children are doing and it
>> makes delegation safe by default.
>
> Just to be more clear, the current proposal enforces:
>
> - capacity_max_child <= capacity_max_parent
>
>   Since, if a task is constrained to get only up to a certain amount
>   of capacity, than its childs cannot use more than that... eventually
>   they can only be further constrained.
>
> - capacity_min_child >= capacity_min_parent
>
>   Since, if a task has been boosted to run at least as much fast, than
>   its childs cannot be constrained to go slower without eventually
>   impacting parent performance.
>
>> I understand why you would want a property like capacity to be the
>> other direction as that way you get more specific as you walk down the
>> tree for both limits and

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-30 Thread Paul Turner

On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
 wrote:
> On 20-Mar 13:15, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
>> > This patch extends the CPU controller by adding a couple of new
>> > attributes, capacity_min and capacity_max, which can be used to enforce
>> > bandwidth boosting and capping. More specifically:
>> >
>> > - capacity_min: defines the minimum capacity which should be granted
>> > (by schedutil) when a task in this group is running,
>> > i.e. the task will run at least at that capacity
>> >
>> > - capacity_max: defines the maximum capacity which can be granted
>> > (by schedutil) when a task in this group is running,
>> > i.e. the task can run up to that capacity
>>
>> cpu.capacity.min and cpu.capacity.max are the more conventional names.
>
> Ok, should be an easy renaming.
>
>> I'm not sure about the name capacity as it doesn't encode what it does
>> and is difficult to tell apart from cpu bandwidth limits.  I think
>> it'd be better to represent what it controls more explicitly.
>
> In the scheduler jargon, capacity represents the amount of computation
> that a CPU can provide and it's usually defined to be 1024 for the
> biggest CPU (on non SMP systems) running at the highest OPP (i.e.
> maximum frequency).
>
> It's true that it kind of overlaps with the concept of "bandwidth".
> However, the main difference here is that "bandwidth" is not frequency
> (and architecture) scaled.
> Thus, for example, assuming we have only one CPU with these two OPPs:
>
>OPP | Frequency | Capacity
>  1 |500MHz |  512
>  2 |  1GHz | 1024

I think exposing capacity in this manner is extremely challenging.
It's not normalized in any way between architectures, which places a
lot of the ABI in the API.

Have you considered any schemes for normalizing this in a reasonable fashion?
`
>
> a task running 60% of the time on that CPU when configured to run at
> 500MHz, from the bandwidth standpoint it's using 60% bandwidth but, from
> the capacity standpoint, is using only 30% of the available capacity.
>
> IOW, bandwidth is purely temporal based while capacity factors in both
> frequency and architectural differences.
> Thus, while a "bandwidth" constraint limits the amount of time a task
> can use a CPU, independently from the "actual computation" performed,
> with the new "capacity" constraints we can enforce much "actual
> computation" a task can perform in the "unit of time".
>
>> > These attributes:
>> > a) are tunable at all hierarchy levels, i.e. root group too
>>
>> This usually is problematic because there should be a non-cgroup way
>> of configuring the feature in case cgroup isn't configured or used,
>> and it becomes awkward to have two separate mechanisms configuring the
>> same thing.  Maybe the feature is cgroup specific enough that it makes
>> sense here but this needs more explanation / justification.
>
> In the previous proposal I used to expose global tunables under
> procfs, e.g.:
>
>  /proc/sys/kernel/sched_capacity_min
>  /proc/sys/kernel/sched_capacity_max
>
> which can be used to defined tunable root constraints when CGroups are
> not available, and becomes RO when CGroups are.
>
> Can this be eventually an acceptable option?
>
> In any case I think that this feature will be mainly targeting CGroup
> based systems. Indeed, one of the main goals is to collect
> "application specific" information from "informed run-times". Being
> "application specific" means that we need a way to classify
> applications depending on the runtime context... and that capability
> in Linux is ultimately provided via the CGroup interface.
>
>> > b) allow to create subgroups of tasks which are not violating the
>> >capacity constraints defined by the parent group.
>> >Thus, tasks on a subgroup can only be more boosted and/or more
>>
>> For both limits and protections, the parent caps the maximum the
>> children can get.  At least that's what memcg does for memory.low.
>> Doing that makes sense for memcg because for memory the parent can
>> still do protections regardless of what its children are doing and it
>> makes delegation safe by default.
>
> Just to be more clear, the current proposal enforces:
>
> - capacity_max_child <= capacity_max_parent
>
>   Since, if a task is constrained to get only up to a certain amount
>   of capacity, than its childs cannot use more than that... eventually
>   they can only be further constrained.
>
> - capacity_min_child >= capacity_min_parent
>
>   Since, if a task has been boosted to run at least as much fast, than
>   its childs cannot be constrained to go slower without eventually
>   impacting parent performance.
>
>> I understand why you would want a property like capacity to be the
>> other direction as that way you get more specific as you walk down the
>> tree for both limits and protections;
>
> Right, the

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-30 Thread Paul Turner

There is one important, fundamental difference here:
 {cfs,rt}_{period,runtime}_us is a property that applies to a group of
threads, it can be sub-divided.
We can consume 100ms of quota either by having one thread run for
100ms, or 2 threads running for 50ms.

This is not true for capacity.  It's a tag that affects the individual
threads it's applied to.
I'm also not sure if it's a hard constraint.  For example, suppose we
set a max that is smaller than a "big" cpu on an asymmetric system.
In the case that the faster CPU is relatively busy, but still
opportunistically available, we would still want to schedule it there.

This definitely seems to make more sense as a per-thread interface in
its current form.

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-30 Thread Paul Turner

There is one important, fundamental difference here:
 {cfs,rt}_{period,runtime}_us is a property that applies to a group of
threads, it can be sub-divided.
We can consume 100ms of quota either by having one thread run for
100ms, or 2 threads running for 50ms.

This is not true for capacity.  It's a tag that affects the individual
threads it's applied to.
I'm also not sure if it's a hard constraint.  For example, suppose we
set a max that is smaller than a "big" cpu on an asymmetric system.
In the case that the faster CPU is relatively busy, but still
opportunistically available, we would still want to schedule it there.

This definitely seems to make more sense as a per-thread interface in
its current form.

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-24 Thread Tejun Heo

Hello,

On Thu, Mar 23, 2017 at 11:37:50PM -0700, Joel Fernandes (Google) wrote:
> > That's also why we're gonna have problem if we later decide we need a
> > thread based API for it.  Once we make cgroup the primary owner of the
> > attribute, it's not straight forward to add another owner.
> 
> Sorry I don't immediately see why it is not straight forward to have a
> per-task API later once CGroup interface is added? Maybe if you don't
> mind giving an example that will help?
> 
> I can start with an example, say you have a single level hierarchy
> (Top-app in Android terms is the set of tasks that are user facing and
> we'd like to enforce some capacity minimums, background on the other
> hand is the opposite):
> 
>ROOT (min = 0, max = 1024)
>/ \
>   /   \
>   TOP-APP (min = 200, max = 1024)  BACKGROUND (min = 0, max = 500)
> 
> If in the future, if we want to have a per-task API to individually
> configure the task with these limits, it seems it will be straight
> forward to implement IMO.

Ah, you're right.  I got fixated on controllers which control single
value attributes (the net ones).  Yeah, we can extend the same range
interace to thread level.  Sorry about the confusion.

Not necessarliy specific to the API discussion but in general lower
level in the configuration hierarchy shouldn't be able to restrict
what the parents can do, so we need to think more about how delegation
should work.

> As Patrick mentioned, all of the usecases needing this right now is an
> informed runtime placing a task in a group of tasks and not needing to
> set attributes for each individual task. We are already placing tasks
> in individual CGroups in Android based on the information the runtime
> has so adding in the capacity constraints will make it fit naturally
> while leaving the door open for any future per-task API additions IMO.

But the question is that while the use case can be served by cgroup as
the primary API, it doesn't have to be.  This still is an attribute
range restriction controller rather than an actual hierarchical
resource distributor.  All the controller does is restricting per-task
configuration which is the only thing operative.  This is different
from bandwidth which is an actual resource controller which
distributes cpu cycles hierarchically.

There can be some benefits to having cgroup restrictions on capacity
but they won't add anything functionally fundamental.  So, I still
don't see why making cgroup the primary interface would be a good idea
here.

Thanks.

-- 
tejun

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-24 Thread Tejun Heo

Hello,

On Thu, Mar 23, 2017 at 11:37:50PM -0700, Joel Fernandes (Google) wrote:
> > That's also why we're gonna have problem if we later decide we need a
> > thread based API for it.  Once we make cgroup the primary owner of the
> > attribute, it's not straight forward to add another owner.
> 
> Sorry I don't immediately see why it is not straight forward to have a
> per-task API later once CGroup interface is added? Maybe if you don't
> mind giving an example that will help?
> 
> I can start with an example, say you have a single level hierarchy
> (Top-app in Android terms is the set of tasks that are user facing and
> we'd like to enforce some capacity minimums, background on the other
> hand is the opposite):
> 
>ROOT (min = 0, max = 1024)
>/ \
>   /   \
>   TOP-APP (min = 200, max = 1024)  BACKGROUND (min = 0, max = 500)
> 
> If in the future, if we want to have a per-task API to individually
> configure the task with these limits, it seems it will be straight
> forward to implement IMO.

Ah, you're right.  I got fixated on controllers which control single
value attributes (the net ones).  Yeah, we can extend the same range
interace to thread level.  Sorry about the confusion.

Not necessarliy specific to the API discussion but in general lower
level in the configuration hierarchy shouldn't be able to restrict
what the parents can do, so we need to think more about how delegation
should work.

> As Patrick mentioned, all of the usecases needing this right now is an
> informed runtime placing a task in a group of tasks and not needing to
> set attributes for each individual task. We are already placing tasks
> in individual CGroups in Android based on the information the runtime
> has so adding in the capacity constraints will make it fit naturally
> while leaving the door open for any future per-task API additions IMO.

But the question is that while the use case can be served by cgroup as
the primary API, it doesn't have to be.  This still is an attribute
range restriction controller rather than an actual hierarchical
resource distributor.  All the controller does is restricting per-task
configuration which is the only thing operative.  This is different
from bandwidth which is an actual resource controller which
distributes cpu cycles hierarchically.

There can be some benefits to having cgroup restrictions on capacity
but they won't add anything functionally fundamental.  So, I still
don't see why making cgroup the primary interface would be a good idea
here.

Thanks.

-- 
tejun

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-24 Thread Joel Fernandes (Google)

Hi Patrick,

On Thu, Mar 23, 2017 at 3:32 AM, Patrick Bellasi
 wrote:
[..]
>> > which can be used to defined tunable root constraints when CGroups are
>> > not available, and becomes RO when CGroups are.
>> >
>> > Can this be eventually an acceptable option?
>> >
>> > In any case I think that this feature will be mainly targeting CGroup
>> > based systems. Indeed, one of the main goals is to collect
>> > "application specific" information from "informed run-times". Being
>> > "application specific" means that we need a way to classify
>> > applications depending on the runtime context... and that capability
>> > in Linux is ultimately provided via the CGroup interface.
>>
>> I think the concern raised is more about whether CGroups is the right
>> interface to use for attaching capacity constraints to task or groups
>> of tasks, or is there a better way to attach such constraints?
>
> Notice that CGroups based classification allows to easily enforce
> the concept of "delegation containment". I think this feature should
> be nice to have whatever interface we choose.
>
> However, potentially we can define a proper per-task API; are you
> thinking to something specifically?
>

I was thinking how about adding per-task constraints to the resource
limits API if it makes sense to? There's already RLIMIT_CPU and
RLIMIT_NICE. An informed-runtime could then modify the limits of tasks
using prlimit.

>> The other advantage of such interface is we don't have to
>> create a separate CGroup for every new constraint limit and can have
>> several tasks with different unique constraints.
>
> That's still possible using CGroups and IMO it will not be the "most
> common case".
> Don't you think that in general we will need to set constraints at
> applications level, thus group of tasks?

Some applications could be a single task, also not all tasks in an
application may need constraints right?

> As a general rule we should probably go for an interface which makes
> easy the most common case.

I agree.

Thanks,
Joel

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-24 Thread Joel Fernandes (Google)

Hi Patrick,

On Thu, Mar 23, 2017 at 3:32 AM, Patrick Bellasi
 wrote:
[..]
>> > which can be used to defined tunable root constraints when CGroups are
>> > not available, and becomes RO when CGroups are.
>> >
>> > Can this be eventually an acceptable option?
>> >
>> > In any case I think that this feature will be mainly targeting CGroup
>> > based systems. Indeed, one of the main goals is to collect
>> > "application specific" information from "informed run-times". Being
>> > "application specific" means that we need a way to classify
>> > applications depending on the runtime context... and that capability
>> > in Linux is ultimately provided via the CGroup interface.
>>
>> I think the concern raised is more about whether CGroups is the right
>> interface to use for attaching capacity constraints to task or groups
>> of tasks, or is there a better way to attach such constraints?
>
> Notice that CGroups based classification allows to easily enforce
> the concept of "delegation containment". I think this feature should
> be nice to have whatever interface we choose.
>
> However, potentially we can define a proper per-task API; are you
> thinking to something specifically?
>

I was thinking how about adding per-task constraints to the resource
limits API if it makes sense to? There's already RLIMIT_CPU and
RLIMIT_NICE. An informed-runtime could then modify the limits of tasks
using prlimit.

>> The other advantage of such interface is we don't have to
>> create a separate CGroup for every new constraint limit and can have
>> several tasks with different unique constraints.
>
> That's still possible using CGroups and IMO it will not be the "most
> common case".
> Don't you think that in general we will need to set constraints at
> applications level, thus group of tasks?

Some applications could be a single task, also not all tasks in an
application may need constraints right?

> As a general rule we should probably go for an interface which makes
> easy the most common case.

I agree.

Thanks,
Joel

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-24 Thread Joel Fernandes (Google)

Hi Tejun,

>> That's also why the proposed interface has now been defined as a extension of
>> the CPU controller in such a way to keep a consistent view.
>>
>> This controller is already used by run-times like Android to "scope" apps by
>> constraining the amount of CPUs resource they are getting.
>> Is that not a legitimate usage of the cpu controller?
>>
>> What we are doing here is just extending it a bit in such a way that, while:
>>
>>   {cfs,rt}_{period,runtime}_us limits the amount of TIME we can use a CPU
>>
>> we can also use:
>>
>>   capacity_{min,max} to limit the actual COMPUTATIONAL BANDWIDTH we can use
>>  during that time.
>
> Yes, we do have bandwidth restriction as a cgroup only feature, which
> is different from how we handle nice levels and weights.  Given the
> nature of bandwidth limits, if necessary, it is straight-forward to
> expose per-task interface.
>
> capacity min/max isn't the same thing.  It isn't a limit on countable
> units of a specific resource and that's why the interface you
> suggested for .min is different.  It's restricting attribute set which
> can be picked in the subhierarchy rather than controlling distribution
> of atoms of the resource.
>
> That's also why we're gonna have problem if we later decide we need a
> thread based API for it.  Once we make cgroup the primary owner of the
> attribute, it's not straight forward to add another owner.

Sorry I don't immediately see why it is not straight forward to have a
per-task API later once CGroup interface is added? Maybe if you don't
mind giving an example that will help?

I can start with an example, say you have a single level hierarchy
(Top-app in Android terms is the set of tasks that are user facing and
we'd like to enforce some capacity minimums, background on the other
hand is the opposite):

   ROOT (min = 0, max = 1024)
   / \
  /   \
  TOP-APP (min = 200, max = 1024)  BACKGROUND (min = 0, max = 500)

If in the future, if we want to have a per-task API to individually
configure the task with these limits, it seems it will be straight
forward to implement IMO.

As Patrick mentioned, all of the usecases needing this right now is an
informed runtime placing a task in a group of tasks and not needing to
set attributes for each individual task. We are already placing tasks
in individual CGroups in Android based on the information the runtime
has so adding in the capacity constraints will make it fit naturally
while leaving the door open for any future per-task API additions IMO.

Thanks,

Joel

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-24 Thread Joel Fernandes (Google)

Hi Tejun,

>> That's also why the proposed interface has now been defined as a extension of
>> the CPU controller in such a way to keep a consistent view.
>>
>> This controller is already used by run-times like Android to "scope" apps by
>> constraining the amount of CPUs resource they are getting.
>> Is that not a legitimate usage of the cpu controller?
>>
>> What we are doing here is just extending it a bit in such a way that, while:
>>
>>   {cfs,rt}_{period,runtime}_us limits the amount of TIME we can use a CPU
>>
>> we can also use:
>>
>>   capacity_{min,max} to limit the actual COMPUTATIONAL BANDWIDTH we can use
>>  during that time.
>
> Yes, we do have bandwidth restriction as a cgroup only feature, which
> is different from how we handle nice levels and weights.  Given the
> nature of bandwidth limits, if necessary, it is straight-forward to
> expose per-task interface.
>
> capacity min/max isn't the same thing.  It isn't a limit on countable
> units of a specific resource and that's why the interface you
> suggested for .min is different.  It's restricting attribute set which
> can be picked in the subhierarchy rather than controlling distribution
> of atoms of the resource.
>
> That's also why we're gonna have problem if we later decide we need a
> thread based API for it.  Once we make cgroup the primary owner of the
> attribute, it's not straight forward to add another owner.

Sorry I don't immediately see why it is not straight forward to have a
per-task API later once CGroup interface is added? Maybe if you don't
mind giving an example that will help?

I can start with an example, say you have a single level hierarchy
(Top-app in Android terms is the set of tasks that are user facing and
we'd like to enforce some capacity minimums, background on the other
hand is the opposite):

   ROOT (min = 0, max = 1024)
   / \
  /   \
  TOP-APP (min = 200, max = 1024)  BACKGROUND (min = 0, max = 500)

If in the future, if we want to have a per-task API to individually
configure the task with these limits, it seems it will be straight
forward to implement IMO.

As Patrick mentioned, all of the usecases needing this right now is an
informed runtime placing a task in a group of tasks and not needing to
set attributes for each individual task. We are already placing tasks
in individual CGroups in Android based on the information the runtime
has so adding in the capacity constraints will make it fit naturally
while leaving the door open for any future per-task API additions IMO.

Thanks,

Joel

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-23 Thread Tejun Heo

Hello, Patrick.

On Thu, Mar 23, 2017 at 06:15:33PM +, Patrick Bellasi wrote:
> What is important to notice is that there is a middleware, in between
> the kernel and the applications. This is a special kind of user-space
> where it is still safe for the kernel to delegate some "decisions".
> 
> The ultimate user of the proposed interface will be such a middleware, not 
> each
> and every application. That's why the "containment" feature provided by 
> CGroups
> I think is a good fitting for the kind of design.

cgroup isn't required for this type of uses.  We've always had this
sort of usages in combination with mechanisms to restrict what
non-priv applications can do.  The usage is perfectly valid but
whether to use cgroup as the sole interface is a different issue.

Yes, cgroup interface can be used this way; however, it does exclude,
or at least makes pretty cumbersome, different use cases which can be
served by a regular API.  And that isn't the case when we approach it
from the other direction.

> I like this concept of "CGroups being a scoping mechanism" and I think it
> perfectly matches this use-case as well...
> 
> >  It shows up here too.  If you take out the cgroup part,
> > you're left with an interface which is hardly useful.  cgroup isn't
> > scoping the global system here.
> 
> It is, indeed:
> 
> 1) Applications do not see CGroups, never.
>They use whatever resources are available when CGroups are not in use.
> 
> 2) When an "Informed Run-time Resource Manager" schema is used, then the same
>applications are scoped in the sense that they becomes "managed 
> applications".
> 
>Managed applications are still completely "unaware" about the CGroup
>interface, they do not relay on that interface for what they have to do.
>However, in this scenario, there is a supervisor which know how much an
>application can get each and every instant.

But it isn't useful if you take cgroup out of the picture.  cgroup
isn't scoping a feature.  The feature is buried in the cgroup itself.
I don't think it's useful to argue over the fine semantics.  Please
see below.

> > It's becoming the primary interface
> > for this feature which most likely isn't a good sign.
> 
> It's a primary interface yes, but not for apps, only for an (optional)
> run-time resource manager.
> 
> What we want to enable with this interface is exactly the possibility for a
> privileged user-space entity to "scope" different applications.
> 
> Described like that we can argue that we can still implement this model using 
> a
> custom per-task API. However, this proposal is about "tuning/partitioning" a
> resource which is already (would say only) controllable using the CPU
> controller.
> That's also why the proposed interface has now been defined as a extension of
> the CPU controller in such a way to keep a consistent view.
> 
> This controller is already used by run-times like Android to "scope" apps by
> constraining the amount of CPUs resource they are getting.
> Is that not a legitimate usage of the cpu controller?
> 
> What we are doing here is just extending it a bit in such a way that, while:
> 
>   {cfs,rt}_{period,runtime}_us limits the amount of TIME we can use a CPU
> 
> we can also use:
> 
>   capacity_{min,max} to limit the actual COMPUTATIONAL BANDWIDTH we can use
>  during that time.

Yes, we do have bandwidth restriction as a cgroup only feature, which
is different from how we handle nice levels and weights.  Given the
nature of bandwidth limits, if necessary, it is straight-forward to
expose per-task interface.

capacity min/max isn't the same thing.  It isn't a limit on countable
units of a specific resource and that's why the interface you
suggested for .min is different.  It's restricting attribute set which
can be picked in the subhierarchy rather than controlling distribution
of atoms of the resource.

That's also why we're gonna have problem if we later decide we need a
thread based API for it.  Once we make cgroup the primary owner of the
attribute, it's not straight forward to add another owner.

> > So, my suggestion is to implement it as a per-task API.  If the
> > feature calls for scoped restrictions, we definitely can add cgroup
> > support for that but I'm really not convinced about using cgroup as
> > the primary interface for this.
> 
> Given this viewpoint, I can definitively see a "scoped restrictions" usage, as
> well as the idea that this can be a unique and primary interface.
> Again, not exposed generically to apps but targeting a proper integration
> of user-space run-time resource managers.
> 
> I hope this contributed to clarify better the scope.  Do you still see the
> CGroup API not as the best fit for such a usage?

Yes, I still think so.  It'd be best to first figure out how the
attribute should be configured, inherited and restricted using the
normal APIs and then layer scoped restrictions on top with cgroup.
cgroup shouldn't be used as a way

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-23 Thread Tejun Heo

Hello, Patrick.

On Thu, Mar 23, 2017 at 06:15:33PM +, Patrick Bellasi wrote:
> What is important to notice is that there is a middleware, in between
> the kernel and the applications. This is a special kind of user-space
> where it is still safe for the kernel to delegate some "decisions".
> 
> The ultimate user of the proposed interface will be such a middleware, not 
> each
> and every application. That's why the "containment" feature provided by 
> CGroups
> I think is a good fitting for the kind of design.

cgroup isn't required for this type of uses.  We've always had this
sort of usages in combination with mechanisms to restrict what
non-priv applications can do.  The usage is perfectly valid but
whether to use cgroup as the sole interface is a different issue.

Yes, cgroup interface can be used this way; however, it does exclude,
or at least makes pretty cumbersome, different use cases which can be
served by a regular API.  And that isn't the case when we approach it
from the other direction.

> I like this concept of "CGroups being a scoping mechanism" and I think it
> perfectly matches this use-case as well...
> 
> >  It shows up here too.  If you take out the cgroup part,
> > you're left with an interface which is hardly useful.  cgroup isn't
> > scoping the global system here.
> 
> It is, indeed:
> 
> 1) Applications do not see CGroups, never.
>They use whatever resources are available when CGroups are not in use.
> 
> 2) When an "Informed Run-time Resource Manager" schema is used, then the same
>applications are scoped in the sense that they becomes "managed 
> applications".
> 
>Managed applications are still completely "unaware" about the CGroup
>interface, they do not relay on that interface for what they have to do.
>However, in this scenario, there is a supervisor which know how much an
>application can get each and every instant.

But it isn't useful if you take cgroup out of the picture.  cgroup
isn't scoping a feature.  The feature is buried in the cgroup itself.
I don't think it's useful to argue over the fine semantics.  Please
see below.

> > It's becoming the primary interface
> > for this feature which most likely isn't a good sign.
> 
> It's a primary interface yes, but not for apps, only for an (optional)
> run-time resource manager.
> 
> What we want to enable with this interface is exactly the possibility for a
> privileged user-space entity to "scope" different applications.
> 
> Described like that we can argue that we can still implement this model using 
> a
> custom per-task API. However, this proposal is about "tuning/partitioning" a
> resource which is already (would say only) controllable using the CPU
> controller.
> That's also why the proposed interface has now been defined as a extension of
> the CPU controller in such a way to keep a consistent view.
> 
> This controller is already used by run-times like Android to "scope" apps by
> constraining the amount of CPUs resource they are getting.
> Is that not a legitimate usage of the cpu controller?
> 
> What we are doing here is just extending it a bit in such a way that, while:
> 
>   {cfs,rt}_{period,runtime}_us limits the amount of TIME we can use a CPU
> 
> we can also use:
> 
>   capacity_{min,max} to limit the actual COMPUTATIONAL BANDWIDTH we can use
>  during that time.

Yes, we do have bandwidth restriction as a cgroup only feature, which
is different from how we handle nice levels and weights.  Given the
nature of bandwidth limits, if necessary, it is straight-forward to
expose per-task interface.

capacity min/max isn't the same thing.  It isn't a limit on countable
units of a specific resource and that's why the interface you
suggested for .min is different.  It's restricting attribute set which
can be picked in the subhierarchy rather than controlling distribution
of atoms of the resource.

That's also why we're gonna have problem if we later decide we need a
thread based API for it.  Once we make cgroup the primary owner of the
attribute, it's not straight forward to add another owner.

> > So, my suggestion is to implement it as a per-task API.  If the
> > feature calls for scoped restrictions, we definitely can add cgroup
> > support for that but I'm really not convinced about using cgroup as
> > the primary interface for this.
> 
> Given this viewpoint, I can definitively see a "scoped restrictions" usage, as
> well as the idea that this can be a unique and primary interface.
> Again, not exposed generically to apps but targeting a proper integration
> of user-space run-time resource managers.
> 
> I hope this contributed to clarify better the scope.  Do you still see the
> CGroup API not as the best fit for such a usage?

Yes, I still think so.  It'd be best to first figure out how the
attribute should be configured, inherited and restricted using the
normal APIs and then layer scoped restrictions on top with cgroup.
cgroup shouldn't be used as a way

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-23 Thread Patrick Bellasi

On 23-Mar 12:01, Tejun Heo wrote:
> Hello,

Hi Tejun,

> On Thu, Mar 23, 2017 at 10:32:54AM +, Patrick Bellasi wrote:
> > > But then we would lose out on being able to attach capacity
> > > constraints to specific tasks or groups of tasks?
> > 
> > Yes, right. If CGroups are not available than you cannot specify
> > per-task constraints. This is just a system-wide global tunable.
> > 
> > Question is: does this overall proposal makes sense outside the scope
> > of task groups classification? (more on that afterwards)
> 
> I think it does, given that it's a per-thread property which requires
> internal application knowledge to tune.

Yes and no... perhaps I'm biased on some specific usage scenarios, but where I
find this interface more useful is not when apps tune themselves but instead
when an "external actor" (which I usually call an "informed run-time") controls
these apps.

> > > I think the concern raised is more about whether CGroups is the right
> > > interface to use for attaching capacity constraints to task or groups
> > > of tasks, or is there a better way to attach such constraints?
> > 
> > Notice that CGroups based classification allows to easily enforce
> > the concept of "delegation containment". I think this feature should
> > be nice to have whatever interface we choose.
> > 
> > However, potentially we can define a proper per-task API; are you
> > thinking to something specifically?
> 
> I don't think the overall outcome was too good when we used cgroup as
> the direct way of configuring certain attributes - it either excludes
> the possibility of easily accessible API from application side or

That's actually one of the main point: does it make sense to expose
such an API to applications at all?

What we are after is a properly defined interface where kernel-space and
user-space can potentially close this control loop:

a) a "privileged" user-space, which has much more a-priori information about
   tasks requirements, can feed some constraints to kernel-space

b) kernel-space, which has optimized and efficient mechanisms,
   enforces these constraints on a per task basis

Here is a graphical representation of these concepts:

  +-++-+  +-+
  | App1 Tasks  ++   | App2 Tasks  ++ | App3 Tasks  ++
  | ||   | || | ||
  +--|   +--| +--|
   +-++-+  +-+
|   |  |
  +--+
  |  |
  |  ++  |
  |  |  +-+   |  |
  |  |  |  Run-Time Optimized Services|   |  |
  |  |  |(e.g. execution model)   |   |  |
  |  |  +-+   |  |
  |  ||  |
  |  | Informed Run-Time Resource Manager |  |
  |  |   (Android, ChromeOS, Kubernets, etc...)   |  |
  |  +--^-+  |
  ||||
  ||Constraints ||
  ||(OPP and Task Placement biasing)||
  ||||
  || Monitoring ||
  |  +-v--+  |
  |  |   Linux Kernel |  |
  |  | (Scheduler, schedutil, ...)|  |
  |  ++  |
  |  |
  | Closed control and optimization loop |
  +--+

What is important to notice is that there is a middleware, in between
the kernel and the applications. This is a special kind of user-space
where it is still safe for the kernel to delegate some "decisions".

The ultimate user of the proposed interface will be such a middleware, not each
and every application. That's why the "containment" feature provided by CGroups
I think is a good fitting for the kind of design.

> conflicts with the attributes set through such API.

In this "run-time resource management" schema, generic applications do not
access the proposed API, which is reserved to the privileged user-space.

Applications eventually can request better services to the middleware, using a
completely different and more abstract API, which can also be domain specific.

> It's a lot clearer when cgroup just sets what's allowed under the hierarchy.
> This is also in line with the aspect that cgroup for the most part is
> a scoping mechanism - it's the most straight-forward to implement and
> use when

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-23 Thread Patrick Bellasi

On 23-Mar 12:01, Tejun Heo wrote:
> Hello,

Hi Tejun,

> On Thu, Mar 23, 2017 at 10:32:54AM +, Patrick Bellasi wrote:
> > > But then we would lose out on being able to attach capacity
> > > constraints to specific tasks or groups of tasks?
> > 
> > Yes, right. If CGroups are not available than you cannot specify
> > per-task constraints. This is just a system-wide global tunable.
> > 
> > Question is: does this overall proposal makes sense outside the scope
> > of task groups classification? (more on that afterwards)
> 
> I think it does, given that it's a per-thread property which requires
> internal application knowledge to tune.

Yes and no... perhaps I'm biased on some specific usage scenarios, but where I
find this interface more useful is not when apps tune themselves but instead
when an "external actor" (which I usually call an "informed run-time") controls
these apps.

> > > I think the concern raised is more about whether CGroups is the right
> > > interface to use for attaching capacity constraints to task or groups
> > > of tasks, or is there a better way to attach such constraints?
> > 
> > Notice that CGroups based classification allows to easily enforce
> > the concept of "delegation containment". I think this feature should
> > be nice to have whatever interface we choose.
> > 
> > However, potentially we can define a proper per-task API; are you
> > thinking to something specifically?
> 
> I don't think the overall outcome was too good when we used cgroup as
> the direct way of configuring certain attributes - it either excludes
> the possibility of easily accessible API from application side or

That's actually one of the main point: does it make sense to expose
such an API to applications at all?

What we are after is a properly defined interface where kernel-space and
user-space can potentially close this control loop:

a) a "privileged" user-space, which has much more a-priori information about
   tasks requirements, can feed some constraints to kernel-space

b) kernel-space, which has optimized and efficient mechanisms,
   enforces these constraints on a per task basis

Here is a graphical representation of these concepts:

  +-++-+  +-+
  | App1 Tasks  ++   | App2 Tasks  ++ | App3 Tasks  ++
  | ||   | || | ||
  +--|   +--| +--|
   +-++-+  +-+
|   |  |
  +--+
  |  |
  |  ++  |
  |  |  +-+   |  |
  |  |  |  Run-Time Optimized Services|   |  |
  |  |  |(e.g. execution model)   |   |  |
  |  |  +-+   |  |
  |  ||  |
  |  | Informed Run-Time Resource Manager |  |
  |  |   (Android, ChromeOS, Kubernets, etc...)   |  |
  |  +--^-+  |
  ||||
  ||Constraints ||
  ||(OPP and Task Placement biasing)||
  ||||
  || Monitoring ||
  |  +-v--+  |
  |  |   Linux Kernel |  |
  |  | (Scheduler, schedutil, ...)|  |
  |  ++  |
  |  |
  | Closed control and optimization loop |
  +--+

What is important to notice is that there is a middleware, in between
the kernel and the applications. This is a special kind of user-space
where it is still safe for the kernel to delegate some "decisions".

The ultimate user of the proposed interface will be such a middleware, not each
and every application. That's why the "containment" feature provided by CGroups
I think is a good fitting for the kind of design.

> conflicts with the attributes set through such API.

In this "run-time resource management" schema, generic applications do not
access the proposed API, which is reserved to the privileged user-space.

Applications eventually can request better services to the middleware, using a
completely different and more abstract API, which can also be domain specific.

> It's a lot clearer when cgroup just sets what's allowed under the hierarchy.
> This is also in line with the aspect that cgroup for the most part is
> a scoping mechanism - it's the most straight-forward to implement and
> use when

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-23 Thread Tejun Heo

Hello,

On Thu, Mar 23, 2017 at 10:32:54AM +, Patrick Bellasi wrote:
> > But then we would lose out on being able to attach capacity
> > constraints to specific tasks or groups of tasks?
> 
> Yes, right. If CGroups are not available than you cannot specify
> per-task constraints. This is just a system-wide global tunable.
> 
> Question is: does this overall proposal makes sense outside the scope
> of task groups classification? (more on that afterwards)

I think it does, given that it's a per-thread property which requires
internal application knowledge to tune.

> > I think the concern raised is more about whether CGroups is the right
> > interface to use for attaching capacity constraints to task or groups
> > of tasks, or is there a better way to attach such constraints?
> 
> Notice that CGroups based classification allows to easily enforce
> the concept of "delegation containment". I think this feature should
> be nice to have whatever interface we choose.
> 
> However, potentially we can define a proper per-task API; are you
> thinking to something specifically?

I don't think the overall outcome was too good when we used cgroup as
the direct way of configuring certain attributes - it either excludes
the possibility of easily accessible API from application side or
conflicts with the attributes set through such API.  It's a lot
clearer when cgroup just sets what's allowed under the hierarchy.

This is also in line with the aspect that cgroup for the most part is
a scoping mechanism - it's the most straight-forward to implement and
use when the behavior inside cgroup matches a system without cgroup,
just scoped.  It shows up here too.  If you take out the cgroup part,
you're left with an interface which is hardly useful.  cgroup isn't
scoping the global system here.  It's becoming the primary interface
for this feature which most likely isn't a good sign.

So, my suggestion is to implement it as a per-task API.  If the
feature calls for scoped restrictions, we definitely can add cgroup
support for that but I'm really not convinced about using cgroup as
the primary interface for this.

Thanks.

-- 
tejun

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-23 Thread Tejun Heo

Hello,

On Thu, Mar 23, 2017 at 10:32:54AM +, Patrick Bellasi wrote:
> > But then we would lose out on being able to attach capacity
> > constraints to specific tasks or groups of tasks?
> 
> Yes, right. If CGroups are not available than you cannot specify
> per-task constraints. This is just a system-wide global tunable.
> 
> Question is: does this overall proposal makes sense outside the scope
> of task groups classification? (more on that afterwards)

I think it does, given that it's a per-thread property which requires
internal application knowledge to tune.

> > I think the concern raised is more about whether CGroups is the right
> > interface to use for attaching capacity constraints to task or groups
> > of tasks, or is there a better way to attach such constraints?
> 
> Notice that CGroups based classification allows to easily enforce
> the concept of "delegation containment". I think this feature should
> be nice to have whatever interface we choose.
> 
> However, potentially we can define a proper per-task API; are you
> thinking to something specifically?

I don't think the overall outcome was too good when we used cgroup as
the direct way of configuring certain attributes - it either excludes
the possibility of easily accessible API from application side or
conflicts with the attributes set through such API.  It's a lot
clearer when cgroup just sets what's allowed under the hierarchy.

This is also in line with the aspect that cgroup for the most part is
a scoping mechanism - it's the most straight-forward to implement and
use when the behavior inside cgroup matches a system without cgroup,
just scoped.  It shows up here too.  If you take out the cgroup part,
you're left with an interface which is hardly useful.  cgroup isn't
scoping the global system here.  It's becoming the primary interface
for this feature which most likely isn't a good sign.

So, my suggestion is to implement it as a per-task API.  If the
feature calls for scoped restrictions, we definitely can add cgroup
support for that but I'm really not convinced about using cgroup as
the primary interface for this.

Thanks.

-- 
tejun

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-23 Thread Patrick Bellasi

On 22-Mar 17:28, Joel Fernandes (Google) wrote:
> Hi,
> 
> On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
>  wrote:
> > On 20-Mar 13:15, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
> [..]
> >> > These attributes:
> >> > a) are tunable at all hierarchy levels, i.e. root group too
> >>
> >> This usually is problematic because there should be a non-cgroup way
> >> of configuring the feature in case cgroup isn't configured or used,
> >> and it becomes awkward to have two separate mechanisms configuring the
> >> same thing.  Maybe the feature is cgroup specific enough that it makes
> >> sense here but this needs more explanation / justification.
> >
> > In the previous proposal I used to expose global tunables under
> > procfs, e.g.:
> >
> >  /proc/sys/kernel/sched_capacity_min
> >  /proc/sys/kernel/sched_capacity_max
> >
> 
> But then we would lose out on being able to attach capacity
> constraints to specific tasks or groups of tasks?

Yes, right. If CGroups are not available than you cannot specify
per-task constraints. This is just a system-wide global tunable.

Question is: does this overall proposal makes sense outside the scope
of task groups classification? (more on that afterwards)

> > which can be used to defined tunable root constraints when CGroups are
> > not available, and becomes RO when CGroups are.
> >
> > Can this be eventually an acceptable option?
> >
> > In any case I think that this feature will be mainly targeting CGroup
> > based systems. Indeed, one of the main goals is to collect
> > "application specific" information from "informed run-times". Being
> > "application specific" means that we need a way to classify
> > applications depending on the runtime context... and that capability
> > in Linux is ultimately provided via the CGroup interface.
> 
> I think the concern raised is more about whether CGroups is the right
> interface to use for attaching capacity constraints to task or groups
> of tasks, or is there a better way to attach such constraints?

Notice that CGroups based classification allows to easily enforce
the concept of "delegation containment". I think this feature should
be nice to have whatever interface we choose.

However, potentially we can define a proper per-task API; are you
thinking to something specifically?

> I am actually looking at a workload where its desirable to attach such
> constraints to only 1 thread or task, in this case it would be a bit
> overkill to use CGroups to attach such property just for 1 task with
> specific constraints

Well, perhaps it depends on how and when CGroups are created.
If we think about using a proper "Organize Once and Control" model,
(i.e. every app gets its own CGroup at creation time and there it
lives, for the rest of its time, while the user-space run-time
eventually tunes the assigned resources) than run-time overheads
should not be a major concerns.

AFAIK, Cgroups main overheads are associated to tasks migration and
tuning. Tuning will not be an issue for the kind of actions required
by capacity clamping. Regarding migrations, they should be avoided as
much as possible when CGroups are in use.

> and it would be beneficial that along with the
> CGroup interface, there's also an interface to attach it to individual
> tasks.

IMO a dual interface to do the same things will be at least confusing
and also more complicated to maintain.

> The other advantage of such interface is we don't have to
> create a separate CGroup for every new constraint limit and can have
> several tasks with different unique constraints.

That's still possible using CGroups and IMO it will not be the "most
common case".
Don't you think that in general we will need to set constraints at
applications level, thus group of tasks?

As a general rule we should probably go for an interface which makes
easy the most common case.

> Regards,
> Joel

-- 
#include 

Patrick Bellasi

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-23 Thread Patrick Bellasi

On 22-Mar 17:28, Joel Fernandes (Google) wrote:
> Hi,
> 
> On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
>  wrote:
> > On 20-Mar 13:15, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
> [..]
> >> > These attributes:
> >> > a) are tunable at all hierarchy levels, i.e. root group too
> >>
> >> This usually is problematic because there should be a non-cgroup way
> >> of configuring the feature in case cgroup isn't configured or used,
> >> and it becomes awkward to have two separate mechanisms configuring the
> >> same thing.  Maybe the feature is cgroup specific enough that it makes
> >> sense here but this needs more explanation / justification.
> >
> > In the previous proposal I used to expose global tunables under
> > procfs, e.g.:
> >
> >  /proc/sys/kernel/sched_capacity_min
> >  /proc/sys/kernel/sched_capacity_max
> >
> 
> But then we would lose out on being able to attach capacity
> constraints to specific tasks or groups of tasks?

Yes, right. If CGroups are not available than you cannot specify
per-task constraints. This is just a system-wide global tunable.

Question is: does this overall proposal makes sense outside the scope
of task groups classification? (more on that afterwards)

> > which can be used to defined tunable root constraints when CGroups are
> > not available, and becomes RO when CGroups are.
> >
> > Can this be eventually an acceptable option?
> >
> > In any case I think that this feature will be mainly targeting CGroup
> > based systems. Indeed, one of the main goals is to collect
> > "application specific" information from "informed run-times". Being
> > "application specific" means that we need a way to classify
> > applications depending on the runtime context... and that capability
> > in Linux is ultimately provided via the CGroup interface.
> 
> I think the concern raised is more about whether CGroups is the right
> interface to use for attaching capacity constraints to task or groups
> of tasks, or is there a better way to attach such constraints?

Notice that CGroups based classification allows to easily enforce
the concept of "delegation containment". I think this feature should
be nice to have whatever interface we choose.

However, potentially we can define a proper per-task API; are you
thinking to something specifically?

> I am actually looking at a workload where its desirable to attach such
> constraints to only 1 thread or task, in this case it would be a bit
> overkill to use CGroups to attach such property just for 1 task with
> specific constraints

Well, perhaps it depends on how and when CGroups are created.
If we think about using a proper "Organize Once and Control" model,
(i.e. every app gets its own CGroup at creation time and there it
lives, for the rest of its time, while the user-space run-time
eventually tunes the assigned resources) than run-time overheads
should not be a major concerns.

AFAIK, Cgroups main overheads are associated to tasks migration and
tuning. Tuning will not be an issue for the kind of actions required
by capacity clamping. Regarding migrations, they should be avoided as
much as possible when CGroups are in use.

> and it would be beneficial that along with the
> CGroup interface, there's also an interface to attach it to individual
> tasks.

IMO a dual interface to do the same things will be at least confusing
and also more complicated to maintain.

> The other advantage of such interface is we don't have to
> create a separate CGroup for every new constraint limit and can have
> several tasks with different unique constraints.

That's still possible using CGroups and IMO it will not be the "most
common case".
Don't you think that in general we will need to set constraints at
applications level, thus group of tasks?

As a general rule we should probably go for an interface which makes
easy the most common case.

> Regards,
> Joel

-- 
#include 

Patrick Bellasi

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-22 Thread Joel Fernandes (Google)

Hi,

On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
 wrote:
> On 20-Mar 13:15, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
[..]
>> > These attributes:
>> > a) are tunable at all hierarchy levels, i.e. root group too
>>
>> This usually is problematic because there should be a non-cgroup way
>> of configuring the feature in case cgroup isn't configured or used,
>> and it becomes awkward to have two separate mechanisms configuring the
>> same thing.  Maybe the feature is cgroup specific enough that it makes
>> sense here but this needs more explanation / justification.
>
> In the previous proposal I used to expose global tunables under
> procfs, e.g.:
>
>  /proc/sys/kernel/sched_capacity_min
>  /proc/sys/kernel/sched_capacity_max
>

But then we would lose out on being able to attach capacity
constraints to specific tasks or groups of tasks?

> which can be used to defined tunable root constraints when CGroups are
> not available, and becomes RO when CGroups are.
>
> Can this be eventually an acceptable option?
>
> In any case I think that this feature will be mainly targeting CGroup
> based systems. Indeed, one of the main goals is to collect
> "application specific" information from "informed run-times". Being
> "application specific" means that we need a way to classify
> applications depending on the runtime context... and that capability
> in Linux is ultimately provided via the CGroup interface.

I think the concern raised is more about whether CGroups is the right
interface to use for attaching capacity constraints to task or groups
of tasks, or is there a better way to attach such constraints?

I am actually looking at a workload where its desirable to attach such
constraints to only 1 thread or task, in this case it would be a bit
overkill to use CGroups to attach such property just for 1 task with
specific constraints and it would be beneficial that along with the
CGroup interface, there's also an interface to attach it to individual
tasks. The other advantage of such interface is we don't have to
create a separate CGroup for every new constraint limit and can have
several tasks with different unique constraints.

Regards,
Joel

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-22 Thread Joel Fernandes (Google)

Hi,

On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
 wrote:
> On 20-Mar 13:15, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
[..]
>> > These attributes:
>> > a) are tunable at all hierarchy levels, i.e. root group too
>>
>> This usually is problematic because there should be a non-cgroup way
>> of configuring the feature in case cgroup isn't configured or used,
>> and it becomes awkward to have two separate mechanisms configuring the
>> same thing.  Maybe the feature is cgroup specific enough that it makes
>> sense here but this needs more explanation / justification.
>
> In the previous proposal I used to expose global tunables under
> procfs, e.g.:
>
>  /proc/sys/kernel/sched_capacity_min
>  /proc/sys/kernel/sched_capacity_max
>

But then we would lose out on being able to attach capacity
constraints to specific tasks or groups of tasks?

> which can be used to defined tunable root constraints when CGroups are
> not available, and becomes RO when CGroups are.
>
> Can this be eventually an acceptable option?
>
> In any case I think that this feature will be mainly targeting CGroup
> based systems. Indeed, one of the main goals is to collect
> "application specific" information from "informed run-times". Being
> "application specific" means that we need a way to classify
> applications depending on the runtime context... and that capability
> in Linux is ultimately provided via the CGroup interface.

I think the concern raised is more about whether CGroups is the right
interface to use for attaching capacity constraints to task or groups
of tasks, or is there a better way to attach such constraints?

I am actually looking at a workload where its desirable to attach such
constraints to only 1 thread or task, in this case it would be a bit
overkill to use CGroups to attach such property just for 1 task with
specific constraints and it would be beneficial that along with the
CGroup interface, there's also an interface to attach it to individual
tasks. The other advantage of such interface is we don't have to
create a separate CGroup for every new constraint limit and can have
several tasks with different unique constraints.

Regards,
Joel

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-20 Thread Tejun Heo

Hello,

On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
> This patch extends the CPU controller by adding a couple of new
> attributes, capacity_min and capacity_max, which can be used to enforce
> bandwidth boosting and capping. More specifically:
> 
> - capacity_min: defines the minimum capacity which should be granted
> (by schedutil) when a task in this group is running,
> i.e. the task will run at least at that capacity
> 
> - capacity_max: defines the maximum capacity which can be granted
> (by schedutil) when a task in this group is running,
> i.e. the task can run up to that capacity

cpu.capacity.min and cpu.capacity.max are the more conventional names.
I'm not sure about the name capacity as it doesn't encode what it does
and is difficult to tell apart from cpu bandwidth limits.  I think
it'd be better to represent what it controls more explicitly.

> These attributes:
> a) are tunable at all hierarchy levels, i.e. root group too

This usually is problematic because there should be a non-cgroup way
of configuring the feature in case cgroup isn't configured or used,
and it becomes awkward to have two separate mechanisms configuring the
same thing.  Maybe the feature is cgroup specific enough that it makes
sense here but this needs more explanation / justification.

> b) allow to create subgroups of tasks which are not violating the
>capacity constraints defined by the parent group.
>Thus, tasks on a subgroup can only be more boosted and/or more

For both limits and protections, the parent caps the maximum the
children can get.  At least that's what memcg does for memory.low.
Doing that makes sense for memcg because for memory the parent can
still do protections regardless of what its children are doing and it
makes delegation safe by default.

I understand why you would want a property like capacity to be the
other direction as that way you get more specific as you walk down the
tree for both limits and protections; however, I think we need to
think a bit more about it and ensure that the resulting interface
isn't confusing.  Would it work for capacity to behave the other
direction - ie. a parent's min restricting the highest min that its
descendants can get?  It's completely fine if that's weird.

Thanks.

-- 
tejun

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-20 Thread Tejun Heo

Hello,

On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
> This patch extends the CPU controller by adding a couple of new
> attributes, capacity_min and capacity_max, which can be used to enforce
> bandwidth boosting and capping. More specifically:
> 
> - capacity_min: defines the minimum capacity which should be granted
> (by schedutil) when a task in this group is running,
> i.e. the task will run at least at that capacity
> 
> - capacity_max: defines the maximum capacity which can be granted
> (by schedutil) when a task in this group is running,
> i.e. the task can run up to that capacity

cpu.capacity.min and cpu.capacity.max are the more conventional names.
I'm not sure about the name capacity as it doesn't encode what it does
and is difficult to tell apart from cpu bandwidth limits.  I think
it'd be better to represent what it controls more explicitly.

> These attributes:
> a) are tunable at all hierarchy levels, i.e. root group too

This usually is problematic because there should be a non-cgroup way
of configuring the feature in case cgroup isn't configured or used,
and it becomes awkward to have two separate mechanisms configuring the
same thing.  Maybe the feature is cgroup specific enough that it makes
sense here but this needs more explanation / justification.

> b) allow to create subgroups of tasks which are not violating the
>capacity constraints defined by the parent group.
>Thus, tasks on a subgroup can only be more boosted and/or more

For both limits and protections, the parent caps the maximum the
children can get.  At least that's what memcg does for memory.low.
Doing that makes sense for memcg because for memory the parent can
still do protections regardless of what its children are doing and it
makes delegation safe by default.

I understand why you would want a property like capacity to be the
other direction as that way you get more specific as you walk down the
tree for both limits and protections; however, I think we need to
think a bit more about it and ensure that the resulting interface
isn't confusing.  Would it work for capacity to behave the other
direction - ie. a parent's min restricting the highest min that its
descendants can get?  It's completely fine if that's weird.

Thanks.

-- 
tejun

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-20 Thread Patrick Bellasi

On 20-Mar 13:15, Tejun Heo wrote:
> Hello,
> 
> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
> > This patch extends the CPU controller by adding a couple of new
> > attributes, capacity_min and capacity_max, which can be used to enforce
> > bandwidth boosting and capping. More specifically:
> > 
> > - capacity_min: defines the minimum capacity which should be granted
> > (by schedutil) when a task in this group is running,
> > i.e. the task will run at least at that capacity
> > 
> > - capacity_max: defines the maximum capacity which can be granted
> > (by schedutil) when a task in this group is running,
> > i.e. the task can run up to that capacity
> 
> cpu.capacity.min and cpu.capacity.max are the more conventional names.

Ok, should be an easy renaming.

> I'm not sure about the name capacity as it doesn't encode what it does
> and is difficult to tell apart from cpu bandwidth limits.  I think
> it'd be better to represent what it controls more explicitly.

In the scheduler jargon, capacity represents the amount of computation
that a CPU can provide and it's usually defined to be 1024 for the
biggest CPU (on non SMP systems) running at the highest OPP (i.e.
maximum frequency).

It's true that it kind of overlaps with the concept of "bandwidth".
However, the main difference here is that "bandwidth" is not frequency
(and architecture) scaled.
Thus, for example, assuming we have only one CPU with these two OPPs:

   OPP | Frequency | Capacity
 1 |500MHz |  512
 2 |  1GHz | 1024

a task running 60% of the time on that CPU when configured to run at
500MHz, from the bandwidth standpoint it's using 60% bandwidth but, from
the capacity standpoint, is using only 30% of the available capacity.

IOW, bandwidth is purely temporal based while capacity factors in both
frequency and architectural differences.
Thus, while a "bandwidth" constraint limits the amount of time a task
can use a CPU, independently from the "actual computation" performed,
with the new "capacity" constraints we can enforce much "actual
computation" a task can perform in the "unit of time".

> > These attributes:
> > a) are tunable at all hierarchy levels, i.e. root group too
> 
> This usually is problematic because there should be a non-cgroup way
> of configuring the feature in case cgroup isn't configured or used,
> and it becomes awkward to have two separate mechanisms configuring the
> same thing.  Maybe the feature is cgroup specific enough that it makes
> sense here but this needs more explanation / justification.

In the previous proposal I used to expose global tunables under
procfs, e.g.:

 /proc/sys/kernel/sched_capacity_min
 /proc/sys/kernel/sched_capacity_max

which can be used to defined tunable root constraints when CGroups are
not available, and becomes RO when CGroups are.

Can this be eventually an acceptable option?

In any case I think that this feature will be mainly targeting CGroup
based systems. Indeed, one of the main goals is to collect
"application specific" information from "informed run-times". Being
"application specific" means that we need a way to classify
applications depending on the runtime context... and that capability
in Linux is ultimately provided via the CGroup interface.

> > b) allow to create subgroups of tasks which are not violating the
> >capacity constraints defined by the parent group.
> >Thus, tasks on a subgroup can only be more boosted and/or more
> 
> For both limits and protections, the parent caps the maximum the
> children can get.  At least that's what memcg does for memory.low.
> Doing that makes sense for memcg because for memory the parent can
> still do protections regardless of what its children are doing and it
> makes delegation safe by default.

Just to be more clear, the current proposal enforces:

- capacity_max_child <= capacity_max_parent

  Since, if a task is constrained to get only up to a certain amount
  of capacity, than its childs cannot use more than that... eventually
  they can only be further constrained.

- capacity_min_child >= capacity_min_parent

  Since, if a task has been boosted to run at least as much fast, than
  its childs cannot be constrained to go slower without eventually
  impacting parent performance.

> I understand why you would want a property like capacity to be the
> other direction as that way you get more specific as you walk down the
> tree for both limits and protections;

Right, the protection schema is defined in such a way to never affect
parent constraints.

> however, I think we need to
> think a bit more about it and ensure that the resulting interface
> isn't confusing.

Sure.

> Would it work for capacity to behave the other
> direction - ie. a parent's min restricting the highest min that its
> descendants can get?  It's completely fine if that's weird.

I had a thought about that possibility and it was not convincing

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-20 Thread Patrick Bellasi

On 20-Mar 13:15, Tejun Heo wrote:
> Hello,
> 
> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
> > This patch extends the CPU controller by adding a couple of new
> > attributes, capacity_min and capacity_max, which can be used to enforce
> > bandwidth boosting and capping. More specifically:
> > 
> > - capacity_min: defines the minimum capacity which should be granted
> > (by schedutil) when a task in this group is running,
> > i.e. the task will run at least at that capacity
> > 
> > - capacity_max: defines the maximum capacity which can be granted
> > (by schedutil) when a task in this group is running,
> > i.e. the task can run up to that capacity
> 
> cpu.capacity.min and cpu.capacity.max are the more conventional names.

Ok, should be an easy renaming.

> I'm not sure about the name capacity as it doesn't encode what it does
> and is difficult to tell apart from cpu bandwidth limits.  I think
> it'd be better to represent what it controls more explicitly.

In the scheduler jargon, capacity represents the amount of computation
that a CPU can provide and it's usually defined to be 1024 for the
biggest CPU (on non SMP systems) running at the highest OPP (i.e.
maximum frequency).

It's true that it kind of overlaps with the concept of "bandwidth".
However, the main difference here is that "bandwidth" is not frequency
(and architecture) scaled.
Thus, for example, assuming we have only one CPU with these two OPPs:

   OPP | Frequency | Capacity
 1 |500MHz |  512
 2 |  1GHz | 1024

a task running 60% of the time on that CPU when configured to run at
500MHz, from the bandwidth standpoint it's using 60% bandwidth but, from
the capacity standpoint, is using only 30% of the available capacity.

IOW, bandwidth is purely temporal based while capacity factors in both
frequency and architectural differences.
Thus, while a "bandwidth" constraint limits the amount of time a task
can use a CPU, independently from the "actual computation" performed,
with the new "capacity" constraints we can enforce much "actual
computation" a task can perform in the "unit of time".

> > These attributes:
> > a) are tunable at all hierarchy levels, i.e. root group too
> 
> This usually is problematic because there should be a non-cgroup way
> of configuring the feature in case cgroup isn't configured or used,
> and it becomes awkward to have two separate mechanisms configuring the
> same thing.  Maybe the feature is cgroup specific enough that it makes
> sense here but this needs more explanation / justification.

In the previous proposal I used to expose global tunables under
procfs, e.g.:

 /proc/sys/kernel/sched_capacity_min
 /proc/sys/kernel/sched_capacity_max

which can be used to defined tunable root constraints when CGroups are
not available, and becomes RO when CGroups are.

Can this be eventually an acceptable option?

In any case I think that this feature will be mainly targeting CGroup
based systems. Indeed, one of the main goals is to collect
"application specific" information from "informed run-times". Being
"application specific" means that we need a way to classify
applications depending on the runtime context... and that capability
in Linux is ultimately provided via the CGroup interface.

> > b) allow to create subgroups of tasks which are not violating the
> >capacity constraints defined by the parent group.
> >Thus, tasks on a subgroup can only be more boosted and/or more
> 
> For both limits and protections, the parent caps the maximum the
> children can get.  At least that's what memcg does for memory.low.
> Doing that makes sense for memcg because for memory the parent can
> still do protections regardless of what its children are doing and it
> makes delegation safe by default.

Just to be more clear, the current proposal enforces:

- capacity_max_child <= capacity_max_parent

  Since, if a task is constrained to get only up to a certain amount
  of capacity, than its childs cannot use more than that... eventually
  they can only be further constrained.

- capacity_min_child >= capacity_min_parent

  Since, if a task has been boosted to run at least as much fast, than
  its childs cannot be constrained to go slower without eventually
  impacting parent performance.

> I understand why you would want a property like capacity to be the
> other direction as that way you get more specific as you walk down the
> tree for both limits and protections;

Right, the protection schema is defined in such a way to never affect
parent constraints.

> however, I think we need to
> think a bit more about it and ensure that the resulting interface
> isn't confusing.

Sure.

> Would it work for capacity to behave the other
> direction - ie. a parent's min restricting the highest min that its
> descendants can get?  It's completely fine if that's weird.

I had a thought about that possibility and it was not convincing

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-20 Thread Tejun Heo

On Mon, Mar 20, 2017 at 01:15:11PM -0400, Tejun Heo wrote:
> > a) are tunable at all hierarchy levels, i.e. root group too
> 
> This usually is problematic because there should be a non-cgroup way
> of configuring the feature in case cgroup isn't configured or used,
> and it becomes awkward to have two separate mechanisms configuring the
> same thing.  Maybe the feature is cgroup specific enough that it makes
> sense here but this needs more explanation / justification.

A related issue here is that what the non-cgroup interface and its
interaction with cgroup should be.  In the long term, I think it's
better to have a generic non-cgroup interface for these new features,
and we've gotten it wrong, or at least inconsistent, across different
settings - most don't affect API accessible settings and just confine
the configuration requested by the application inside the cgroup
constraints; however, cpuset does it the other way and overwrites
configurations set by individual applications.

If we agree that exposing this only through cgroup is fine, this isn't
a concern, but, given that this is a thread property and can obviously
be useful outside cgroups, that seems debatable at the very least.

Thanks.

-- 
tejun

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-20 Thread Tejun Heo

On Mon, Mar 20, 2017 at 01:15:11PM -0400, Tejun Heo wrote:
> > a) are tunable at all hierarchy levels, i.e. root group too
> 
> This usually is problematic because there should be a non-cgroup way
> of configuring the feature in case cgroup isn't configured or used,
> and it becomes awkward to have two separate mechanisms configuring the
> same thing.  Maybe the feature is cgroup specific enough that it makes
> sense here but this needs more explanation / justification.

A related issue here is that what the non-cgroup interface and its
interaction with cgroup should be.  In the long term, I think it's
better to have a generic non-cgroup interface for these new features,
and we've gotten it wrong, or at least inconsistent, across different
settings - most don't affect API accessible settings and just confine
the configuration requested by the application inside the cgroup
constraints; however, cpuset does it the other way and overwrites
configurations set by individual applications.

If we agree that exposing this only through cgroup is fine, this isn't
a concern, but, given that this is a thread property and can obviously
be useful outside cgroups, that seems debatable at the very least.

Thanks.

-- 
tejun

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Patrick Bellasi

On 15-Mar 10:24, Paul E. McKenney wrote:
> On Wed, Mar 15, 2017 at 04:44:39PM +, Patrick Bellasi wrote:
> > On 15-Mar 09:10, Paul E. McKenney wrote:
> > > On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > > > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> > > >  wrote:
> > > > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > > > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > > > >>  wrote:
> > > > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > > > >> > bandwidth to tasks within a group, however it does not enforce any
> > > > >> > constraint on how such bandwidth can be consumed.
> > > > >> > With the integration of schedutil, the scheduler has now the proper
> > > > >> > information about a task to select  the most suitable frequency to
> > > > >> > satisfy tasks needs.
> > > > >> [..]
> > > > >>
> > > > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state 
> > > > >> > *css,
> > > > >> > +struct cftype *cft)
> > > > >> > +{
> > > > >> > +   struct task_group *tg;
> > > > >> > +   u64 min_capacity;
> > > > >> > +
> > > > >> > +   rcu_read_lock();
> > > > >> > +   tg = css_tg(css);
> > > > >> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > > > >>
> > > > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > > > >> the write path) to avoid load-tearing?
> > > > >
> > > > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > > > compiler "to mess" with these accesses.
> > > > 
> > > > This depends on compiler and arch. I'm not sure if its in practice
> > > > these days an issue, but see section on 'load tearing' in
> > > > Documentation/memory-barriers.txt . If compiler decided to break down
> > > > the access to multiple accesses due to some reason, then might be a
> > > > problem.
> > > 
> > > The compiler might also be able to inline cpu_capacity_min_read_u64()
> > > fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> > > If min_capacity is used several times in the ensuing code, the compiler
> > > could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> > > best might be a bit confusing.
> > 
> > That's actually an interesting case, however I don't think it applies
> > in this case since cpu_capacity_min_read_u64() is called only via
> > a function poninter and thus it will never be inlined. isn't it?
> > 
> > > > Adding Paul for his expert opinion on the matter ;)
> > > 
> > > My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> > > I can absolutely prove that the compiler cannot do any destructive
> > > optimizations.  And I not-infrequently find unsuspected opportunities
> > > for destructive optimization in my own code.  Your mileage may vary.  ;-)
> > 
> > I guess here the main concern from Joel is that such a pattern:
> > 
> >u64 var = unsigned_int_value_from_memory;
> > 
> > can result is a couple of "load from memory" operations.
> 
> Indeed it can.  I first learned this the hard way in the early 1990s,
> so 20-year-old compiler optimizations are quite capable of making this
> sort of thing happen.
> 
> > In that case a similar:
> > 
> >   unsigned_int_left_value = new_unsigned_int_value;
> > 
> > executed on a different thread can overlap with the previous memory
> > read operations and ending up in "var" containing a not consistent
> > value.
> > 
> > Question is: can this really happen, given the data types in use?
> 
> So we have an updater changing the value of unsigned_int_left_value,
> while readers in other threads are accessing it, correct?  And you
> are asking whether the compiler can optimize the updater so as to
> mess up the readers, right?
> 
> One such optimization would be a byte-wise write, though I have no
> idea why a compiler would do such a thing assuming that the variable
> is reasonably sized and aligned.  Another is that the compiler could
> use the variable as temporary storage just before the assignment.
> (You haven't told the compiler that anyone else is reading it, though
> I don't know of this being done by production compilers.)  A third is
> that the compiler could fuse successive stores, which might or might
> not be a problem, depending.
> 
> Probably more, but that should be a start.  ;-)

Definitively yes! :)

Thanks for the detailed explanation, I'll add the READ_ONCE as
Joel suggested! +1

-- 
#include 

Patrick Bellasi

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Patrick Bellasi

On 15-Mar 10:24, Paul E. McKenney wrote:
> On Wed, Mar 15, 2017 at 04:44:39PM +, Patrick Bellasi wrote:
> > On 15-Mar 09:10, Paul E. McKenney wrote:
> > > On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > > > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> > > >  wrote:
> > > > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > > > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > > > >>  wrote:
> > > > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > > > >> > bandwidth to tasks within a group, however it does not enforce any
> > > > >> > constraint on how such bandwidth can be consumed.
> > > > >> > With the integration of schedutil, the scheduler has now the proper
> > > > >> > information about a task to select  the most suitable frequency to
> > > > >> > satisfy tasks needs.
> > > > >> [..]
> > > > >>
> > > > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state 
> > > > >> > *css,
> > > > >> > +struct cftype *cft)
> > > > >> > +{
> > > > >> > +   struct task_group *tg;
> > > > >> > +   u64 min_capacity;
> > > > >> > +
> > > > >> > +   rcu_read_lock();
> > > > >> > +   tg = css_tg(css);
> > > > >> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > > > >>
> > > > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > > > >> the write path) to avoid load-tearing?
> > > > >
> > > > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > > > compiler "to mess" with these accesses.
> > > > 
> > > > This depends on compiler and arch. I'm not sure if its in practice
> > > > these days an issue, but see section on 'load tearing' in
> > > > Documentation/memory-barriers.txt . If compiler decided to break down
> > > > the access to multiple accesses due to some reason, then might be a
> > > > problem.
> > > 
> > > The compiler might also be able to inline cpu_capacity_min_read_u64()
> > > fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> > > If min_capacity is used several times in the ensuing code, the compiler
> > > could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> > > best might be a bit confusing.
> > 
> > That's actually an interesting case, however I don't think it applies
> > in this case since cpu_capacity_min_read_u64() is called only via
> > a function poninter and thus it will never be inlined. isn't it?
> > 
> > > > Adding Paul for his expert opinion on the matter ;)
> > > 
> > > My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> > > I can absolutely prove that the compiler cannot do any destructive
> > > optimizations.  And I not-infrequently find unsuspected opportunities
> > > for destructive optimization in my own code.  Your mileage may vary.  ;-)
> > 
> > I guess here the main concern from Joel is that such a pattern:
> > 
> >u64 var = unsigned_int_value_from_memory;
> > 
> > can result is a couple of "load from memory" operations.
> 
> Indeed it can.  I first learned this the hard way in the early 1990s,
> so 20-year-old compiler optimizations are quite capable of making this
> sort of thing happen.
> 
> > In that case a similar:
> > 
> >   unsigned_int_left_value = new_unsigned_int_value;
> > 
> > executed on a different thread can overlap with the previous memory
> > read operations and ending up in "var" containing a not consistent
> > value.
> > 
> > Question is: can this really happen, given the data types in use?
> 
> So we have an updater changing the value of unsigned_int_left_value,
> while readers in other threads are accessing it, correct?  And you
> are asking whether the compiler can optimize the updater so as to
> mess up the readers, right?
> 
> One such optimization would be a byte-wise write, though I have no
> idea why a compiler would do such a thing assuming that the variable
> is reasonably sized and aligned.  Another is that the compiler could
> use the variable as temporary storage just before the assignment.
> (You haven't told the compiler that anyone else is reading it, though
> I don't know of this being done by production compilers.)  A third is
> that the compiler could fuse successive stores, which might or might
> not be a problem, depending.
> 
> Probably more, but that should be a start.  ;-)

Definitively yes! :)

Thanks for the detailed explanation, I'll add the READ_ONCE as
Joel suggested! +1

-- 
#include 

Patrick Bellasi

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Paul E. McKenney

On Wed, Mar 15, 2017 at 04:44:39PM +, Patrick Bellasi wrote:
> On 15-Mar 09:10, Paul E. McKenney wrote:
> > On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> > >  wrote:
> > > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > > >>  wrote:
> > > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > > >> > bandwidth to tasks within a group, however it does not enforce any
> > > >> > constraint on how such bandwidth can be consumed.
> > > >> > With the integration of schedutil, the scheduler has now the proper
> > > >> > information about a task to select  the most suitable frequency to
> > > >> > satisfy tasks needs.
> > > >> [..]
> > > >>
> > > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state 
> > > >> > *css,
> > > >> > +struct cftype *cft)
> > > >> > +{
> > > >> > +   struct task_group *tg;
> > > >> > +   u64 min_capacity;
> > > >> > +
> > > >> > +   rcu_read_lock();
> > > >> > +   tg = css_tg(css);
> > > >> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > > >>
> > > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > > >> the write path) to avoid load-tearing?
> > > >
> > > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > > compiler "to mess" with these accesses.
> > > 
> > > This depends on compiler and arch. I'm not sure if its in practice
> > > these days an issue, but see section on 'load tearing' in
> > > Documentation/memory-barriers.txt . If compiler decided to break down
> > > the access to multiple accesses due to some reason, then might be a
> > > problem.
> > 
> > The compiler might also be able to inline cpu_capacity_min_read_u64()
> > fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> > If min_capacity is used several times in the ensuing code, the compiler
> > could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> > best might be a bit confusing.
> 
> That's actually an interesting case, however I don't think it applies
> in this case since cpu_capacity_min_read_u64() is called only via
> a function poninter and thus it will never be inlined. isn't it?
> 
> > > Adding Paul for his expert opinion on the matter ;)
> > 
> > My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> > I can absolutely prove that the compiler cannot do any destructive
> > optimizations.  And I not-infrequently find unsuspected opportunities
> > for destructive optimization in my own code.  Your mileage may vary.  ;-)
> 
> I guess here the main concern from Joel is that such a pattern:
> 
>u64 var = unsigned_int_value_from_memory;
> 
> can result is a couple of "load from memory" operations.

Indeed it can.  I first learned this the hard way in the early 1990s,
so 20-year-old compiler optimizations are quite capable of making this
sort of thing happen.

> In that case a similar:
> 
>   unsigned_int_left_value = new_unsigned_int_value;
> 
> executed on a different thread can overlap with the previous memory
> read operations and ending up in "var" containing a not consistent
> value.
> 
> Question is: can this really happen, given the data types in use?

So we have an updater changing the value of unsigned_int_left_value,
while readers in other threads are accessing it, correct?  And you
are asking whether the compiler can optimize the updater so as to
mess up the readers, right?

One such optimization would be a byte-wise write, though I have no
idea why a compiler would do such a thing assuming that the variable
is reasonably sized and aligned.  Another is that the compiler could
use the variable as temporary storage just before the assignment.
(You haven't told the compiler that anyone else is reading it, though
I don't know of this being done by production compilers.)  A third is
that the compiler could fuse successive stores, which might or might
not be a problem, depending.

Probably more, but that should be a start.  ;-)

Thanx, Paul

> Thanks! ;-)
> 
> -- 
> #include 
> 
> Patrick Bellasi
>

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Paul E. McKenney

On Wed, Mar 15, 2017 at 04:44:39PM +, Patrick Bellasi wrote:
> On 15-Mar 09:10, Paul E. McKenney wrote:
> > On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> > >  wrote:
> > > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > > >>  wrote:
> > > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > > >> > bandwidth to tasks within a group, however it does not enforce any
> > > >> > constraint on how such bandwidth can be consumed.
> > > >> > With the integration of schedutil, the scheduler has now the proper
> > > >> > information about a task to select  the most suitable frequency to
> > > >> > satisfy tasks needs.
> > > >> [..]
> > > >>
> > > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state 
> > > >> > *css,
> > > >> > +struct cftype *cft)
> > > >> > +{
> > > >> > +   struct task_group *tg;
> > > >> > +   u64 min_capacity;
> > > >> > +
> > > >> > +   rcu_read_lock();
> > > >> > +   tg = css_tg(css);
> > > >> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > > >>
> > > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > > >> the write path) to avoid load-tearing?
> > > >
> > > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > > compiler "to mess" with these accesses.
> > > 
> > > This depends on compiler and arch. I'm not sure if its in practice
> > > these days an issue, but see section on 'load tearing' in
> > > Documentation/memory-barriers.txt . If compiler decided to break down
> > > the access to multiple accesses due to some reason, then might be a
> > > problem.
> > 
> > The compiler might also be able to inline cpu_capacity_min_read_u64()
> > fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> > If min_capacity is used several times in the ensuing code, the compiler
> > could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> > best might be a bit confusing.
> 
> That's actually an interesting case, however I don't think it applies
> in this case since cpu_capacity_min_read_u64() is called only via
> a function poninter and thus it will never be inlined. isn't it?
> 
> > > Adding Paul for his expert opinion on the matter ;)
> > 
> > My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> > I can absolutely prove that the compiler cannot do any destructive
> > optimizations.  And I not-infrequently find unsuspected opportunities
> > for destructive optimization in my own code.  Your mileage may vary.  ;-)
> 
> I guess here the main concern from Joel is that such a pattern:
> 
>u64 var = unsigned_int_value_from_memory;
> 
> can result is a couple of "load from memory" operations.

Indeed it can.  I first learned this the hard way in the early 1990s,
so 20-year-old compiler optimizations are quite capable of making this
sort of thing happen.

> In that case a similar:
> 
>   unsigned_int_left_value = new_unsigned_int_value;
> 
> executed on a different thread can overlap with the previous memory
> read operations and ending up in "var" containing a not consistent
> value.
> 
> Question is: can this really happen, given the data types in use?

So we have an updater changing the value of unsigned_int_left_value,
while readers in other threads are accessing it, correct?  And you
are asking whether the compiler can optimize the updater so as to
mess up the readers, right?

One such optimization would be a byte-wise write, though I have no
idea why a compiler would do such a thing assuming that the variable
is reasonably sized and aligned.  Another is that the compiler could
use the variable as temporary storage just before the assignment.
(You haven't told the compiler that anyone else is reading it, though
I don't know of this being done by production compilers.)  A third is
that the compiler could fuse successive stores, which might or might
not be a problem, depending.

Probably more, but that should be a start.  ;-)

Thanx, Paul

> Thanks! ;-)
> 
> -- 
> #include 
> 
> Patrick Bellasi
>

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Patrick Bellasi

On 15-Mar 09:10, Paul E. McKenney wrote:
> On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> >  wrote:
> > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > >>  wrote:
> > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > >> > bandwidth to tasks within a group, however it does not enforce any
> > >> > constraint on how such bandwidth can be consumed.
> > >> > With the integration of schedutil, the scheduler has now the proper
> > >> > information about a task to select  the most suitable frequency to
> > >> > satisfy tasks needs.
> > >> [..]
> > >>
> > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > >> > +struct cftype *cft)
> > >> > +{
> > >> > +   struct task_group *tg;
> > >> > +   u64 min_capacity;
> > >> > +
> > >> > +   rcu_read_lock();
> > >> > +   tg = css_tg(css);
> > >> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > >>
> > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > >> the write path) to avoid load-tearing?
> > >
> > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > compiler "to mess" with these accesses.
> > 
> > This depends on compiler and arch. I'm not sure if its in practice
> > these days an issue, but see section on 'load tearing' in
> > Documentation/memory-barriers.txt . If compiler decided to break down
> > the access to multiple accesses due to some reason, then might be a
> > problem.
> 
> The compiler might also be able to inline cpu_capacity_min_read_u64()
> fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> If min_capacity is used several times in the ensuing code, the compiler
> could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> best might be a bit confusing.

That's actually an interesting case, however I don't think it applies
in this case since cpu_capacity_min_read_u64() is called only via
a function poninter and thus it will never be inlined. isn't it?

> > Adding Paul for his expert opinion on the matter ;)
> 
> My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> I can absolutely prove that the compiler cannot do any destructive
> optimizations.  And I not-infrequently find unsuspected opportunities
> for destructive optimization in my own code.  Your mileage may vary.  ;-)

I guess here the main concern from Joel is that such a pattern:

   u64 var = unsigned_int_value_from_memory;

can result is a couple of "load from memory" operations.

In that case a similar:

  unsigned_int_left_value = new_unsigned_int_value;

executed on a different thread can overlap with the previous memory
read operations and ending up in "var" containing a not consistent
value.

Question is: can this really happen, given the data types in use?


>   Thanx, Paul

Thanks! ;-)

-- 
#include 

Patrick Bellasi

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Patrick Bellasi

On 15-Mar 09:10, Paul E. McKenney wrote:
> On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> > On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
> >  wrote:
> > > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> > >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> > >>  wrote:
> > >> > The CPU CGroup controller allows to assign a specified (maximum)
> > >> > bandwidth to tasks within a group, however it does not enforce any
> > >> > constraint on how such bandwidth can be consumed.
> > >> > With the integration of schedutil, the scheduler has now the proper
> > >> > information about a task to select  the most suitable frequency to
> > >> > satisfy tasks needs.
> > >> [..]
> > >>
> > >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > >> > +struct cftype *cft)
> > >> > +{
> > >> > +   struct task_group *tg;
> > >> > +   u64 min_capacity;
> > >> > +
> > >> > +   rcu_read_lock();
> > >> > +   tg = css_tg(css);
> > >> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> > >>
> > >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> > >> the write path) to avoid load-tearing?
> > >
> > > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > > memory access to write/read it, isn't it? I mean: I do not expect the
> > > compiler "to mess" with these accesses.
> > 
> > This depends on compiler and arch. I'm not sure if its in practice
> > these days an issue, but see section on 'load tearing' in
> > Documentation/memory-barriers.txt . If compiler decided to break down
> > the access to multiple accesses due to some reason, then might be a
> > problem.
> 
> The compiler might also be able to inline cpu_capacity_min_read_u64()
> fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
> If min_capacity is used several times in the ensuing code, the compiler
> could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
> best might be a bit confusing.

That's actually an interesting case, however I don't think it applies
in this case since cpu_capacity_min_read_u64() is called only via
a function poninter and thus it will never be inlined. isn't it?

> > Adding Paul for his expert opinion on the matter ;)
> 
> My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
> I can absolutely prove that the compiler cannot do any destructive
> optimizations.  And I not-infrequently find unsuspected opportunities
> for destructive optimization in my own code.  Your mileage may vary.  ;-)

I guess here the main concern from Joel is that such a pattern:

   u64 var = unsigned_int_value_from_memory;

can result is a couple of "load from memory" operations.

In that case a similar:

  unsigned_int_left_value = new_unsigned_int_value;

executed on a different thread can overlap with the previous memory
read operations and ending up in "var" containing a not consistent
value.

Question is: can this really happen, given the data types in use?


>   Thanx, Paul

Thanks! ;-)

-- 
#include 

Patrick Bellasi

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Paul E. McKenney

On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
>  wrote:
> > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> >>  wrote:
> >> > The CPU CGroup controller allows to assign a specified (maximum)
> >> > bandwidth to tasks within a group, however it does not enforce any
> >> > constraint on how such bandwidth can be consumed.
> >> > With the integration of schedutil, the scheduler has now the proper
> >> > information about a task to select  the most suitable frequency to
> >> > satisfy tasks needs.
> >> [..]
> >>
> >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> >> > +struct cftype *cft)
> >> > +{
> >> > +   struct task_group *tg;
> >> > +   u64 min_capacity;
> >> > +
> >> > +   rcu_read_lock();
> >> > +   tg = css_tg(css);
> >> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> >>
> >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> >> the write path) to avoid load-tearing?
> >
> > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > memory access to write/read it, isn't it? I mean: I do not expect the
> > compiler "to mess" with these accesses.
> 
> This depends on compiler and arch. I'm not sure if its in practice
> these days an issue, but see section on 'load tearing' in
> Documentation/memory-barriers.txt . If compiler decided to break down
> the access to multiple accesses due to some reason, then might be a
> problem.

The compiler might also be able to inline cpu_capacity_min_read_u64()
fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
If min_capacity is used several times in the ensuing code, the compiler
could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
best might be a bit confusing.

> Adding Paul for his expert opinion on the matter ;)

My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
I can absolutely prove that the compiler cannot do any destructive
optimizations.  And I not-infrequently find unsuspected opportunities
for destructive optimization in my own code.  Your mileage may vary.  ;-)

Thanx, Paul

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Paul E. McKenney

On Wed, Mar 15, 2017 at 06:20:28AM -0700, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
>  wrote:
> > On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> >> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> >>  wrote:
> >> > The CPU CGroup controller allows to assign a specified (maximum)
> >> > bandwidth to tasks within a group, however it does not enforce any
> >> > constraint on how such bandwidth can be consumed.
> >> > With the integration of schedutil, the scheduler has now the proper
> >> > information about a task to select  the most suitable frequency to
> >> > satisfy tasks needs.
> >> [..]
> >>
> >> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> >> > +struct cftype *cft)
> >> > +{
> >> > +   struct task_group *tg;
> >> > +   u64 min_capacity;
> >> > +
> >> > +   rcu_read_lock();
> >> > +   tg = css_tg(css);
> >> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> >>
> >> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> >> the write path) to avoid load-tearing?
> >
> > tg->cap_clamp is an "unsigned int" and thus I would expect a single
> > memory access to write/read it, isn't it? I mean: I do not expect the
> > compiler "to mess" with these accesses.
> 
> This depends on compiler and arch. I'm not sure if its in practice
> these days an issue, but see section on 'load tearing' in
> Documentation/memory-barriers.txt . If compiler decided to break down
> the access to multiple accesses due to some reason, then might be a
> problem.

The compiler might also be able to inline cpu_capacity_min_read_u64()
fuse the load from tg->cap_clamp[CAP_CLAMP_MIN] with other accesses.
If min_capacity is used several times in the ensuing code, the compiler
could reload multiple times from tg->cap_clamp[CAP_CLAMP_MIN], which at
best might be a bit confusing.

> Adding Paul for his expert opinion on the matter ;)

My personal approach is to use READ_ONCE() and WRITE_ONCE() unless
I can absolutely prove that the compiler cannot do any destructive
optimizations.  And I not-infrequently find unsuspected opportunities
for destructive optimization in my own code.  Your mileage may vary.  ;-)

Thanx, Paul

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Joel Fernandes

On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
 wrote:
> On 13-Mar 03:46, Joel Fernandes (Google) wrote:
>> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
>>  wrote:
>> > The CPU CGroup controller allows to assign a specified (maximum)
>> > bandwidth to tasks within a group, however it does not enforce any
>> > constraint on how such bandwidth can be consumed.
>> > With the integration of schedutil, the scheduler has now the proper
>> > information about a task to select  the most suitable frequency to
>> > satisfy tasks needs.
>> [..]
>>
>> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
>> > +struct cftype *cft)
>> > +{
>> > +   struct task_group *tg;
>> > +   u64 min_capacity;
>> > +
>> > +   rcu_read_lock();
>> > +   tg = css_tg(css);
>> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
>>
>> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
>> the write path) to avoid load-tearing?
>
> tg->cap_clamp is an "unsigned int" and thus I would expect a single
> memory access to write/read it, isn't it? I mean: I do not expect the
> compiler "to mess" with these accesses.

This depends on compiler and arch. I'm not sure if its in practice
these days an issue, but see section on 'load tearing' in
Documentation/memory-barriers.txt . If compiler decided to break down
the access to multiple accesses due to some reason, then might be a
problem.

Adding Paul for his expert opinion on the matter ;)

Thanks,
Joel

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Joel Fernandes

On Wed, Mar 15, 2017 at 4:20 AM, Patrick Bellasi
 wrote:
> On 13-Mar 03:46, Joel Fernandes (Google) wrote:
>> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
>>  wrote:
>> > The CPU CGroup controller allows to assign a specified (maximum)
>> > bandwidth to tasks within a group, however it does not enforce any
>> > constraint on how such bandwidth can be consumed.
>> > With the integration of schedutil, the scheduler has now the proper
>> > information about a task to select  the most suitable frequency to
>> > satisfy tasks needs.
>> [..]
>>
>> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
>> > +struct cftype *cft)
>> > +{
>> > +   struct task_group *tg;
>> > +   u64 min_capacity;
>> > +
>> > +   rcu_read_lock();
>> > +   tg = css_tg(css);
>> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
>>
>> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
>> the write path) to avoid load-tearing?
>
> tg->cap_clamp is an "unsigned int" and thus I would expect a single
> memory access to write/read it, isn't it? I mean: I do not expect the
> compiler "to mess" with these accesses.

This depends on compiler and arch. I'm not sure if its in practice
these days an issue, but see section on 'load tearing' in
Documentation/memory-barriers.txt . If compiler decided to break down
the access to multiple accesses due to some reason, then might be a
problem.

Adding Paul for his expert opinion on the matter ;)

Thanks,
Joel

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Patrick Bellasi

On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
>  wrote:
> > The CPU CGroup controller allows to assign a specified (maximum)
> > bandwidth to tasks within a group, however it does not enforce any
> > constraint on how such bandwidth can be consumed.
> > With the integration of schedutil, the scheduler has now the proper
> > information about a task to select  the most suitable frequency to
> > satisfy tasks needs.
> [..]
> 
> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > +struct cftype *cft)
> > +{
> > +   struct task_group *tg;
> > +   u64 min_capacity;
> > +
> > +   rcu_read_lock();
> > +   tg = css_tg(css);
> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> 
> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> the write path) to avoid load-tearing?

tg->cap_clamp is an "unsigned int" and thus I would expect a single
memory access to write/read it, isn't it? I mean: I do not expect the
compiler "to mess" with these accesses.

However, if your concerns are more about overlapping read/write for the
same capacity from different threads, then perhaps we should better
use a mutex to serialize these two functions... not entirely convinced...

> Thanks,
> Joel

-- 
#include 

Patrick Bellasi

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-15 Thread Patrick Bellasi

On 13-Mar 03:46, Joel Fernandes (Google) wrote:
> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
>  wrote:
> > The CPU CGroup controller allows to assign a specified (maximum)
> > bandwidth to tasks within a group, however it does not enforce any
> > constraint on how such bandwidth can be consumed.
> > With the integration of schedutil, the scheduler has now the proper
> > information about a task to select  the most suitable frequency to
> > satisfy tasks needs.
> [..]
> 
> > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> > +struct cftype *cft)
> > +{
> > +   struct task_group *tg;
> > +   u64 min_capacity;
> > +
> > +   rcu_read_lock();
> > +   tg = css_tg(css);
> > +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];
> 
> Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
> the write path) to avoid load-tearing?

tg->cap_clamp is an "unsigned int" and thus I would expect a single
memory access to write/read it, isn't it? I mean: I do not expect the
compiler "to mess" with these accesses.

However, if your concerns are more about overlapping read/write for the
same capacity from different threads, then perhaps we should better
use a mutex to serialize these two functions... not entirely convinced...

> Thanks,
> Joel

-- 
#include 

Patrick Bellasi

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-13 Thread Joel Fernandes (Google)

On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
 wrote:
> The CPU CGroup controller allows to assign a specified (maximum)
> bandwidth to tasks within a group, however it does not enforce any
> constraint on how such bandwidth can be consumed.
> With the integration of schedutil, the scheduler has now the proper
> information about a task to select  the most suitable frequency to
> satisfy tasks needs.
[..]

> +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> +struct cftype *cft)
> +{
> +   struct task_group *tg;
> +   u64 min_capacity;
> +
> +   rcu_read_lock();
> +   tg = css_tg(css);
> +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];

Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
the write path) to avoid load-tearing?

Thanks,
Joel

Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-13 Thread Joel Fernandes (Google)

On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
 wrote:
> The CPU CGroup controller allows to assign a specified (maximum)
> bandwidth to tasks within a group, however it does not enforce any
> constraint on how such bandwidth can be consumed.
> With the integration of schedutil, the scheduler has now the proper
> information about a task to select  the most suitable frequency to
> satisfy tasks needs.
[..]

> +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> +struct cftype *cft)
> +{
> +   struct task_group *tg;
> +   u64 min_capacity;
> +
> +   rcu_read_lock();
> +   tg = css_tg(css);
> +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];

Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
the write path) to avoid load-tearing?

Thanks,
Joel

[RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-02-28 Thread Patrick Bellasi

The CPU CGroup controller allows to assign a specified (maximum)
bandwidth to tasks within a group, however it does not enforce any
constraint on how such bandwidth can be consumed.
With the integration of schedutil, the scheduler has now the proper
information about a task to select  the most suitable frequency to
satisfy tasks needs.

This patch extends the CPU controller by adding a couple of new
attributes, capacity_min and capacity_max, which can be used to enforce
bandwidth boosting and capping. More specifically:

- capacity_min: defines the minimum capacity which should be granted
(by schedutil) when a task in this group is running,
i.e. the task will run at least at that capacity

- capacity_max: defines the maximum capacity which can be granted
(by schedutil) when a task in this group is running,
i.e. the task can run up to that capacity

These attributes:
a) are tunable at all hierarchy levels, i.e. root group too
b) allow to create subgroups of tasks which are not violating the
   capacity constraints defined by the parent group.
   Thus, tasks on a subgroup can only be more boosted and/or more
   capped, which is matching with the "limits" schema proposed by
   the "Resource Distribution Model (RDM)" suggested by the
   CGroups v2 documentation (Documentation/cgroup-v2.txt)

This patch provides the basic support to expose the two new attributes
and to validate their run-time update based on the "limits" schema of
the RDM.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: linux-kernel@vger.kernel.org
---
 init/Kconfig |  17 ++
 kernel/sched/core.c  | 145 +++
 kernel/sched/sched.h |   8 +++
 3 files changed, 170 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index e1a93734..71e46ce 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1044,6 +1044,23 @@ menuconfig CGROUP_SCHED
  bandwidth allocation to such task groups. It uses cgroups to group
  tasks.
 
+config CAPACITY_CLAMPING
+   bool "Capacity clamping per group of tasks"
+   depends on CPU_FREQ_GOV_SCHEDUTIL
+   depends on CGROUP_SCHED
+   default n
+   help
+ This feature allows the scheduler to enforce maximum and minimum
+ capacity on each CPU based on RUNNABLE tasks currently scheduled
+ on that CPU.
+ Minimum capacity can be used for example to "boost" the performance
+ of important tasks by running them on an OPP which can be higher than
+ the minimum one eventually selected by the schedutil governor.
+ Maximum capacity can be used for example to "restrict" the maximum
+ OPP which can be requested by background tasks.
+
+ If in doubt, say N.
+
 if CGROUP_SCHED
 config FAIR_GROUP_SCHED
bool "Group scheduling for SCHED_OTHER"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 34e2291..a171d49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6015,6 +6015,11 @@ void __init sched_init(void)
autogroup_init(_task);
 #endif /* CONFIG_CGROUP_SCHED */
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+   root_task_group.cap_clamp[CAP_CLAMP_MIN] = 0;
+   root_task_group.cap_clamp[CAP_CLAMP_MAX] = SCHED_CAPACITY_SCALE;
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
for_each_possible_cpu(i) {
struct rq *rq;
 
@@ -6310,6 +6315,11 @@ struct task_group *sched_create_group(struct task_group 
*parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+   tg->cap_clamp[CAP_CLAMP_MIN] = parent->cap_clamp[CAP_CLAMP_MIN];
+   tg->cap_clamp[CAP_CLAMP_MAX] = parent->cap_clamp[CAP_CLAMP_MAX];
+#endif
+
return tg;
 
 err:
@@ -6899,6 +6909,129 @@ static void cpu_cgroup_attach(struct cgroup_taskset 
*tset)
sched_move_task(task);
 }
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+
+static DEFINE_MUTEX(cap_clamp_mutex);
+
+static int cpu_capacity_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 value)
+{
+   struct cgroup_subsys_state *pos;
+   unsigned int min_value;
+   struct task_group *tg;
+   int ret = -EINVAL;
+
+   min_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
+
+   mutex_lock(_clamp_mutex);
+   rcu_read_lock();
+
+   tg = css_tg(css);
+
+   /* Already at the required value */
+   if (tg->cap_clamp[CAP_CLAMP_MIN] == min_value)
+   goto done;
+
+   /* Ensure to not exceed the maximum capacity */
+   if (tg->cap_clamp[CAP_CLAMP_MAX] < min_value)
+   goto out;
+
+   /* Ensure min cap fits within parent constraint */
+   if (tg->parent &&
+   tg->parent->cap_clamp[CAP_CLAMP_MIN] > min_value)
+   goto out;

[RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-02-28 Thread Patrick Bellasi

The CPU CGroup controller allows to assign a specified (maximum)
bandwidth to tasks within a group, however it does not enforce any
constraint on how such bandwidth can be consumed.
With the integration of schedutil, the scheduler has now the proper
information about a task to select  the most suitable frequency to
satisfy tasks needs.

This patch extends the CPU controller by adding a couple of new
attributes, capacity_min and capacity_max, which can be used to enforce
bandwidth boosting and capping. More specifically:

- capacity_min: defines the minimum capacity which should be granted
(by schedutil) when a task in this group is running,
i.e. the task will run at least at that capacity

- capacity_max: defines the maximum capacity which can be granted
(by schedutil) when a task in this group is running,
i.e. the task can run up to that capacity

These attributes:
a) are tunable at all hierarchy levels, i.e. root group too
b) allow to create subgroups of tasks which are not violating the
   capacity constraints defined by the parent group.
   Thus, tasks on a subgroup can only be more boosted and/or more
   capped, which is matching with the "limits" schema proposed by
   the "Resource Distribution Model (RDM)" suggested by the
   CGroups v2 documentation (Documentation/cgroup-v2.txt)

This patch provides the basic support to expose the two new attributes
and to validate their run-time update based on the "limits" schema of
the RDM.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: linux-kernel@vger.kernel.org
---
 init/Kconfig |  17 ++
 kernel/sched/core.c  | 145 +++
 kernel/sched/sched.h |   8 +++
 3 files changed, 170 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index e1a93734..71e46ce 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1044,6 +1044,23 @@ menuconfig CGROUP_SCHED
  bandwidth allocation to such task groups. It uses cgroups to group
  tasks.
 
+config CAPACITY_CLAMPING
+   bool "Capacity clamping per group of tasks"
+   depends on CPU_FREQ_GOV_SCHEDUTIL
+   depends on CGROUP_SCHED
+   default n
+   help
+ This feature allows the scheduler to enforce maximum and minimum
+ capacity on each CPU based on RUNNABLE tasks currently scheduled
+ on that CPU.
+ Minimum capacity can be used for example to "boost" the performance
+ of important tasks by running them on an OPP which can be higher than
+ the minimum one eventually selected by the schedutil governor.
+ Maximum capacity can be used for example to "restrict" the maximum
+ OPP which can be requested by background tasks.
+
+ If in doubt, say N.
+
 if CGROUP_SCHED
 config FAIR_GROUP_SCHED
bool "Group scheduling for SCHED_OTHER"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 34e2291..a171d49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6015,6 +6015,11 @@ void __init sched_init(void)
autogroup_init(_task);
 #endif /* CONFIG_CGROUP_SCHED */
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+   root_task_group.cap_clamp[CAP_CLAMP_MIN] = 0;
+   root_task_group.cap_clamp[CAP_CLAMP_MAX] = SCHED_CAPACITY_SCALE;
+#endif /* CONFIG_CAPACITY_CLAMPING */
+
for_each_possible_cpu(i) {
struct rq *rq;
 
@@ -6310,6 +6315,11 @@ struct task_group *sched_create_group(struct task_group 
*parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+   tg->cap_clamp[CAP_CLAMP_MIN] = parent->cap_clamp[CAP_CLAMP_MIN];
+   tg->cap_clamp[CAP_CLAMP_MAX] = parent->cap_clamp[CAP_CLAMP_MAX];
+#endif
+
return tg;
 
 err:
@@ -6899,6 +6909,129 @@ static void cpu_cgroup_attach(struct cgroup_taskset 
*tset)
sched_move_task(task);
 }
 
+#ifdef CONFIG_CAPACITY_CLAMPING
+
+static DEFINE_MUTEX(cap_clamp_mutex);
+
+static int cpu_capacity_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 value)
+{
+   struct cgroup_subsys_state *pos;
+   unsigned int min_value;
+   struct task_group *tg;
+   int ret = -EINVAL;
+
+   min_value = min_t(unsigned int, value, SCHED_CAPACITY_SCALE);
+
+   mutex_lock(_clamp_mutex);
+   rcu_read_lock();
+
+   tg = css_tg(css);
+
+   /* Already at the required value */
+   if (tg->cap_clamp[CAP_CLAMP_MIN] == min_value)
+   goto done;
+
+   /* Ensure to not exceed the maximum capacity */
+   if (tg->cap_clamp[CAP_CLAMP_MAX] < min_value)
+   goto out;
+
+   /* Ensure min cap fits within parent constraint */
+   if (tg->parent &&
+   tg->parent->cap_clamp[CAP_CLAMP_MIN] > min_value)
+   goto out;
+
+   /* Each child must be a subset of us */
+   css_for_each_child(pos,

44 matches

Mail list logo