Re: [PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy

2018-05-21 Thread Waiman Long
On 05/21/2018 11:09 AM, Patrick Bellasi wrote:
> On 21-May 09:55, Waiman Long wrote:
>
>> Changing cpuset.cpus will require searching for all the tasks in
>> the cpuset and changing their cpu masks.
> ... I'm wondering if that has to be the case. In principle there can
> be a different solution which is: update on demand. In the wakeup
> path, once we know a task really needs a CPU and we want to find one
> for it, at that point we can align the cpuset mask with the task's
> one. Sort of using the cpuset mask as a clamp on top of the task's
> affinity mask.
>
> The main downside of such an approach could be the overheads in the
> wakeup path... but, still... that should be measured.
> The advantage is that we do not spend time changing attributes of
> tasks which, potentially, could be sleeping for a long time.

We already have a linked list of tasks in a cgroup, so it isn't too hard
to find them. Doing update on demand would require adding a bunch of code
to the wakeup path. So unless there is a good reason to do it, I don't
see it as necessary at this point.

>
>> That isn't a fast operation, but it shouldn't be too bad either
>> depending on how many tasks are in the cpuset.
> Indeed, although it still seems a bit odd and overkill to update
> task affinity for tasks which are not currently RUNNABLE. Isn't it?
>
>> I would not suggest doing rapid changes to cpuset.cpus as a means to tune
>> the behavior of a task. So what exactly is the tuning you are thinking
>> about? Is it moving a task from a high-power cpu to a low-power one
>> or vice versa?
> That's definitely a possible use case. In Android for example we
> usually assign more resources to TOP_APP tasks (those belonging to the
> application you are currently using) while we restrict the resources
> once we switch an app to BACKGROUND.

Switching an app from foreground to background and vice versa shouldn't
happen that frequently. Maybe once every few seconds, at most. I am just
wondering what use cases will require changing cpuset attributes tens of
times per second.

> More generally, think about a generic Run-Time Resource Management
> framework which assigns resources to the tasks of multiple
> applications and wants to have fine-grained control.
>
>> If so, it is probably better to move the task from one cpuset of
>> high-power cpus to another cpuset of low-power cpus.
> This is what Android does now, but it is also what we would like to
> possibly change, for two main reasons:
>
> 1. it does not fit with the "number one guideline" for proper
>    CGroups usage, which is "Organize Once and Control":
>
>    https://elixir.bootlin.com/linux/latest/source/Documentation/cgroup-v2.txt#L518
>
>    where it says that:
>
>       migrating processes across cgroups frequently as a means to
>       apply different resource restrictions is discouraged.
>
>    Despite this guideline, it turns out that, in v1 at least, it seems
>    to be faster to move tasks across cpusets than to tune cpuset
>    attributes... even when all the tasks are sleeping.

It is probably similar in v2 as the core logic is almost the same.

> 2. it does not let us take advantage of accounting controllers such
>    as the memory controller where, by moving tasks around, we cannot
>    properly account for and control the amount of memory a task can use.

For v1, the memory controller and the cpuset controller can be in
different hierarchies. For v2, we have a unified hierarchy. However, we
don't need to enable all the controllers at every level of the
hierarchy. For example,

A (memory, cpuset) -- B1 (cpuset)
                   \-- B2 (cpuset)

Cgroup A has the memory and cpuset controllers enabled. The child
cgroups B1 and B2 only have cpuset enabled. You can move tasks between
B1 and B2 and they will still be subject to the same memory limit
imposed by the memory controller in A. So there are ways to work around
that.
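
Roughly, as a sketch (the mount point, cgroup names and the 512M limit
below are purely illustrative):

  # enable memory and cpuset for A from its parent (here the root)
  echo "+memory +cpuset" > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/A
  # A only passes cpuset down to its children
  echo "+cpuset" > /sys/fs/cgroup/A/cgroup.subtree_control
  mkdir /sys/fs/cgroup/A/B1 /sys/fs/cgroup/A/B2

  # one memory limit at A, different CPU placements in the leaves
  echo 512M > /sys/fs/cgroup/A/memory.max
  echo 0-3 > /sys/fs/cgroup/A/B1/cpuset.cpus
  echo 4-7 > /sys/fs/cgroup/A/B2/cpuset.cpus

  # moving a task from B1 to B2 changes its CPUs but not the memory limit
  echo $PID > /sys/fs/cgroup/A/B2/cgroup.procs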

> Thus, for these reasons, and also to possibly migrate to the unified
> hierarchy scheme proposed by CGroups v2... we would like a
> low-overhead mechanism for setting/tuning cpusets at run-time with
> whatever frequency you like.

We may be able to improve the performance of changing cpuset attributes
somewhat, but I don't believe there will be much improvement here.

 +
 +The "cpuset" controller is hierarchical.  That means the controller
 +cannot use CPUs or memory nodes not allowed in its parent.
 +
 +
 +Cpuset Interface Files
 +~~
 +
 +  cpuset.cpus
 +  A read-write multiple values file which exists on non-root
 +  cpuset-enabled cgroups.
 +
 +  It lists the CPUs allowed to be used by tasks within this
 +  cgroup.  The CPU numbers are comma-separated numbers or
 +  ranges.  For example:
 +
 +# cat cpuset.cpus
 +0-4,6,8-10
 +
 +  An empty value indicates that the cgroup is using the same
 +  setting as the nearest cgroup ancestor with a non-empty
 +  

Re: [PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy

2018-05-21 Thread Patrick Bellasi
On 21-May 09:55, Waiman Long wrote:
> On 05/21/2018 07:55 AM, Patrick Bellasi wrote:
> > Hi Waiman!

[...]

> >> +Cpuset
> >> +--
> >> +
> >> +The "cpuset" controller provides a mechanism for constraining
> >> +the CPU and memory node placement of tasks to only the resources
> >> +specified in the cpuset interface files in a task's current cgroup.
> >> +This is especially valuable on large NUMA systems where placing jobs
> >> +on properly sized subsets of the systems with careful processor and
> >> +memory placement to reduce cross-node memory access and contention
> >> +can improve overall system performance.
> > Another quite important use-case for cpuset is Android, where cpusets are
> > actively used for both power-saving and performance tuning.
> > For example, depending on the status of an application, its threads
> > can be allowed to run on all available CPUs (e.g. foreground apps) or
> > be restricted to only a few energy-efficient CPUs (e.g. background apps).
> >
> > Since here we are "rewriting" cpusets for v2, I think it's important
> > to take this mobile world scenario into consideration.
> >
> > For example, in this context, we are looking at the possibility of
> > updating/tuning cpuset.cpus at a relatively high rate, i.e. tens of
> > times per second. Not sure that's the same update rate usually
> > required for the large NUMA systems you cite above.  However, in this
> > case it's quite important to keep the overheads of these operations
> > really small.
> 
> The cgroup interface isn't designed for high update throughput.

Indeed, I had the same impression...

> Changing cpuset.cpus will require searching for all the tasks in
> the cpuset and changing their cpu masks.

... I'm wondering if that has to be the case. In principle there can
be a different solution which is: update on demand. In the wakeup
path, once we know a task really needs a CPU and we want to find one
for it, at that point we can align the cpuset mask with the task's
one. Sort of using the cpuset mask as a clamp on top of the task's
affinity mask.

The main downside of such an approach could be the overheads in the
wakeup path... but, still... that should be measured.
The advantage is that we do not spend time changing attributes of
tasks which, potentially, could be sleeping for a long time.


> That isn't a fast operation, but it shouldn't be too bad either
> depending on how many tasks are in the cpuset.

Indeed, although it still seems a bit odd and overkill to update
task affinity for tasks which are not currently RUNNABLE. Isn't it?

> I would not suggest doing rapid changes to cpuset.cpus as a means to tune
> the behavior of a task. So what exactly is the tuning you are thinking
> about? Is it moving a task from a high-power cpu to a low-power one
> or vice versa?

That's definitely a possible use case. In Android for example we
usually assign more resources to TOP_APP tasks (those belonging to the
application you are currently using) while we restrict the resources
once we switch an app to BACKGROUND.

More generally, think about a generic Run-Time Resource Management
framework which assigns resources to the tasks of multiple
applications and wants to have fine-grained control.

> If so, it is probably better to move the task from one cpuset of
> high-power cpus to another cpuset of low-power cpus.

This is what Android does now, but it is also what we would like to
possibly change, for two main reasons:

1. it does not fit with the "number one guideline" for proper
   CGroups usage, which is "Organize Once and Control":

   https://elixir.bootlin.com/linux/latest/source/Documentation/cgroup-v2.txt#L518

   where it says that:

      migrating processes across cgroups frequently as a means to
      apply different resource restrictions is discouraged.

   Despite this guideline, it turns out that, in v1 at least, it seems
   to be faster to move tasks across cpusets than to tune cpuset
   attributes... even when all the tasks are sleeping.


2. it does not let us take advantage of accounting controllers such
   as the memory controller where, by moving tasks around, we cannot
   properly account for and control the amount of memory a task can use.

Thus, for these reasons, and also to possibly migrate to the unified
hierarchy scheme proposed by CGroups v2... we would like a
low-overhead mechanism for setting/tuning cpusets at run-time with
whatever frequency you like.
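
Just to make the two alternatives concrete, here is a sketch (the cgroup
names are only illustrative, not what Android actually uses):

  # (a) migration-based tuning: move the app's tasks to another cpuset
  echo $APP_PID > /sys/fs/cgroup/background/cgroup.procs

  # (b) attribute-based tuning: keep the tasks where they are and retune
  #     the cpuset they live in
  echo 0-1 > /sys/fs/cgroup/app/cpuset.cpus   # clamp to the small cores
  echo 0-7 > /sys/fs/cgroup/app/cpuset.cpus   # later, open it up again

It is the writes in (b) that we would like to be cheap enough to issue
tens of times per second.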

> >> +
> >> +The "cpuset" controller is hierarchical.  That means the controller
> >> +cannot use CPUs or memory nodes not allowed in its parent.
> >> +
> >> +
> >> +Cpuset Interface Files
> >> +~~
> >> +
> >> +  cpuset.cpus
> >> +  A read-write multiple values file which exists on non-root
> >> +  cpuset-enabled cgroups.
> >> +
> >> +  It lists the CPUs allowed to be used by tasks within this
> >> +  cgroup.  The CPU numbers are comma-separated numbers or
> >> +  ranges.  For example:
> >> +
> >> +# cat 

Re: [PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy

2018-05-21 Thread Waiman Long
On 05/21/2018 07:55 AM, Patrick Bellasi wrote:
> Hi Waiman!
>
> I've started looking at the possibility to move Android to use cgroups
> v2 and the availability of the cpuset controller makes this even more
> promising.
>
> I'll try to give a run to this series on Android, meanwhile I have
> some (hopefully not too much dummy) questions below.
>
> On 17-May 16:55, Waiman Long wrote:
>> Given the fact that thread mode had been merged into 4.14, it is now
>> time to enable cpuset to be used in the default hierarchy (cgroup v2)
>> as it is clearly threaded.
>>
>> The cpuset controller had experienced feature creep since its
>> introduction more than a decade ago. Besides the core cpus and mems
>> control files to limit cpus and memory nodes, there are a bunch of
>> additional features that can be controlled from the userspace. Some of
>> the features are of doubtful usefulness and may not be actively used.
>>
>> This patch enables cpuset controller in the default hierarchy with
>> a minimal set of features, namely just the cpus and mems and their
>> effective_* counterparts.  We can certainly add more features to the
>> default hierarchy in the future if there is a real user need for them
>> later on.
>>
>> Alternatively, with the unified hierarchy, it may make more sense
>> to move some of those additional cpuset features, if desired, to
>> the memory controller or maybe to the cpu controller instead of staying
>> with cpuset.
>>
>> Signed-off-by: Waiman Long 
>> ---
>>  Documentation/cgroup-v2.txt | 90 ++---
>>  kernel/cgroup/cpuset.c  | 48 ++--
>>  2 files changed, 130 insertions(+), 8 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index 74cdeae..cf7bac6 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
>> 5-3-2. Writeback
>>   5-4. PID
>> 5-4-1. PID Interface Files
>> - 5-5. Device
>> - 5-6. RDMA
>> -   5-6-1. RDMA Interface Files
>> - 5-7. Misc
>> -   5-7-1. perf_event
>> + 5-5. Cpuset
>> +   5.5-1. Cpuset Interface Files
>> + 5-6. Device
>> + 5-7. RDMA
>> +   5-7-1. RDMA Interface Files
>> + 5-8. Misc
>> +   5-8-1. perf_event
>>   5-N. Non-normative information
>> 5-N-1. CPU controller root cgroup process behaviour
>> 5-N-2. IO controller root cgroup process behaviour
>> @@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN 
>> if the creation
>>  of a new process would cause a cgroup policy to be violated.
>>  
>>  
>> +Cpuset
>> +--
>> +
>> +The "cpuset" controller provides a mechanism for constraining
>> +the CPU and memory node placement of tasks to only the resources
>> +specified in the cpuset interface files in a task's current cgroup.
>> +This is especially valuable on large NUMA systems where placing jobs
>> +on properly sized subsets of the systems with careful processor and
>> +memory placement to reduce cross-node memory access and contention
>> +can improve overall system performance.
> Another quite important use-case for cpuset is Android, where cpusets are
> actively used for both power-saving and performance tuning.
> For example, depending on the status of an application, its threads
> can be allowed to run on all available CPUs (e.g. foreground apps) or
> be restricted to only a few energy-efficient CPUs (e.g. background apps).
>
> Since here we are "rewriting" cpusets for v2, I think it's important
> to take this mobile world scenario into consideration.
>
> For example, in this context, we are looking at the possibility of
> updating/tuning cpuset.cpus at a relatively high rate, i.e. tens of
> times per second. Not sure that's the same update rate usually
> required for the large NUMA systems you cite above.  However, in this
> case it's quite important to keep the overheads of these operations
> really small.

The cgroup interface isn't designed for high update throughput. Changing
cpuset.cpus will require searching for all the tasks in the cpuset
and changing their cpu masks. That isn't a fast operation, but it shouldn't
be too bad either depending on how many tasks are in the cpuset.

I would not suggest doing rapid changes to cpuset.cpus as a means to tune
the behavior of a task. So what exactly is the tuning you are thinking
about? Is it moving a task from a high-power cpu to a low-power one
or vice versa? If so, it is probably better to move the task from one
cpuset of high-power cpus to another cpuset of low-power cpus.

>> +
>> +The "cpuset" controller is hierarchical.  That means the controller
>> +cannot use CPUs or memory nodes not allowed in its parent.
>> +
>> +
>> +Cpuset Interface Files
>> +~~
>> +
>> +  cpuset.cpus
>> +A read-write multiple values file which exists on non-root
>> +

Re: [PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy

2018-05-21 Thread Patrick Bellasi
Hi Waiman!

I've started looking at the possibility to move Android to use cgroups
v2 and the availability of the cpuset controller makes this even more
promising.

I'll try to give a run to this series on Android, meanwhile I have
some (hopefully not too much dummy) questions below.

On 17-May 16:55, Waiman Long wrote:
> Given the fact that thread mode had been merged into 4.14, it is now
> time to enable cpuset to be used in the default hierarchy (cgroup v2)
> as it is clearly threaded.
> 
> The cpuset controller had experienced feature creep since its
> introduction more than a decade ago. Besides the core cpus and mems
> control files to limit cpus and memory nodes, there are a bunch of
> additional features that can be controlled from the userspace. Some of
> the features are of doubtful usefulness and may not be actively used.
> 
> This patch enables cpuset controller in the default hierarchy with
> a minimal set of features, namely just the cpus and mems and their
> effective_* counterparts.  We can certainly add more features to the
> default hierarchy in the future if there is a real user need for them
> later on.
> 
> Alternatively, with the unified hierarchy, it may make more sense
> to move some of those additional cpuset features, if desired, to
> the memory controller or maybe to the cpu controller instead of staying
> with cpuset.
> 
> Signed-off-by: Waiman Long 
> ---
>  Documentation/cgroup-v2.txt | 90 ++---
>  kernel/cgroup/cpuset.c  | 48 ++--
>  2 files changed, 130 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index 74cdeae..cf7bac6 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
> 5-3-2. Writeback
>   5-4. PID
> 5-4-1. PID Interface Files
> - 5-5. Device
> - 5-6. RDMA
> -   5-6-1. RDMA Interface Files
> - 5-7. Misc
> -   5-7-1. perf_event
> + 5-5. Cpuset
> +   5.5-1. Cpuset Interface Files
> + 5-6. Device
> + 5-7. RDMA
> +   5-7-1. RDMA Interface Files
> + 5-8. Misc
> +   5-8-1. perf_event
>   5-N. Non-normative information
> 5-N-1. CPU controller root cgroup process behaviour
> 5-N-2. IO controller root cgroup process behaviour
> @@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN 
> if the creation
>  of a new process would cause a cgroup policy to be violated.
>  
>  
> +Cpuset
> +--
> +
> +The "cpuset" controller provides a mechanism for constraining
> +the CPU and memory node placement of tasks to only the resources
> +specified in the cpuset interface files in a task's current cgroup.
> +This is especially valuable on large NUMA systems where placing jobs
> +on properly sized subsets of the systems with careful processor and
> +memory placement to reduce cross-node memory access and contention
> +can improve overall system performance.

Another quite important use-case for cpuset is Android, where cpusets are
actively used for both power-saving and performance tuning.
For example, depending on the status of an application, its threads
can be allowed to run on all available CPUs (e.g. foreground apps) or
be restricted to only a few energy-efficient CPUs (e.g. background apps).

Since here we are "rewriting" cpusets for v2, I think it's important
to take this mobile world scenario into consideration.

For example, in this context, we are looking at the possibility of
updating/tuning cpuset.cpus at a relatively high rate, i.e. tens of
times per second. Not sure that's the same update rate usually
required for the large NUMA systems you cite above.  However, in this
case it's quite important to keep the overheads of these operations
really small.

> +
> +The "cpuset" controller is hierarchical.  That means the controller
> +cannot use CPUs or memory nodes not allowed in its parent.
> +
> +
> +Cpuset Interface Files
> +~~
> +
> +  cpuset.cpus
> + A read-write multiple values file which exists on non-root
> + cpuset-enabled cgroups.
> +
> + It lists the CPUs allowed to be used by tasks within this
> + cgroup.  The CPU numbers are comma-separated numbers or
> + ranges.  For example:
> +
> +   # cat cpuset.cpus
> +   0-4,6,8-10
> +
> + An empty value indicates that the cgroup is using the same
> + setting as the nearest cgroup ancestor with a non-empty
> + "cpuset.cpus" or all the available CPUs if none is found.

Does that mean that we can move tasks into a newly created cgroup for
which we have not yet configured this value?
AFAIK, that's different behavior wrt v1... and I like it better.
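
Something like this is the scenario I have in mind (just a sketch from my
reading of the text above, assuming cpuset is already enabled for that
subtree; paths are illustrative):

  mkdir /sys/fs/cgroup/newgrp
  echo $PID > /sys/fs/cgroup/newgrp/cgroup.procs    # before writing cpuset.cpus
  cat /sys/fs/cgroup/newgrp/cpuset.cpus             # empty
  cat /sys/fs/cgroup/newgrp/cpuset.cpus.effective   # inherited from the ancestor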

> +
> + The value of "cpuset.cpus" stays constant until the next update
> + and won't be affected by any CPU hotplug events.

This also sounds interesting, does it mean 

[PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy

2018-05-17 Thread Waiman Long
Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to
the memory controller or maybe to the cpu controller instead of staying
with cpuset.

Signed-off-by: Waiman Long 
---
 Documentation/cgroup-v2.txt | 90 ++---
 kernel/cgroup/cpuset.c  | 48 ++--
 2 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..cf7bac6 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
5-3-2. Writeback
  5-4. PID
5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
-   5-6-1. RDMA Interface Files
- 5-7. Misc
-   5-7-1. perf_event
+ 5-5. Cpuset
+   5.5-1. Cpuset Interface Files
+ 5-6. Device
+ 5-7. RDMA
+   5-7-1. RDMA Interface Files
+ 5-8. Misc
+   5-8-1. perf_event
  5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if 
the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+   A read-write multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the CPUs allowed to be used by tasks within this
+   cgroup.  The CPU numbers are comma-separated numbers or
+   ranges.  For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.cpus" or all the available CPUs if none is found.
+
+   The value of "cpuset.cpus" stays constant until the next update
+   and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+   A read-only multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the onlined CPUs that are actually allowed to be
+   used by tasks within the current cgroup.  If "cpuset.cpus"
+   is empty, it shows all the CPUs from the parent cgroup that
+   will be available to be used by this cgroup.  Otherwise, it is
+   a subset of "cpuset.cpus".  Its value will be affected by CPU
+   hotplug events.
+
+  cpuset.mems
+   A read-write multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the memory nodes allowed to be used by tasks within
+   this cgroup.  The memory node numbers are comma-separated
+   numbers or ranges.  For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.mems" or all the available memory nodes if none
+   is found.
+
+   The value of "cpuset.mems" stays constant until the next update
+   and won't be affected by any memory node hotplug events.
+
+  cpuset.mems.effective
+   A read-only multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the onlined memory nodes that are actually allowed to
+   be used by tasks within the current cgroup.  If "cpuset.mems"
+   is empty, it shows all the memory nodes from the parent cgroup
+