Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-25 Thread Waiman Long
On 05/25/2018 05:40 AM, Patrick Bellasi wrote:
> On 24-May 11:22, Waiman Long wrote:
>> On 05/24/2018 11:16 AM, Juri Lelli wrote:
>>> On 24/05/18 11:09, Waiman Long wrote:
>>>> On 05/24/2018 10:36 AM, Juri Lelli wrote:
>>>>> On 17/05/18 16:55, Waiman Long wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>> +A parent cgroup cannot distribute all its CPUs to child
>>>>>> +scheduling domain cgroups unless its load balancing flag is
>>>>>> +turned off.
>>>>>> +
>>>>>> +  cpuset.sched.load_balance
>>>>>> +A read-write single value file which exists on non-root
>>>>>> +cpuset-enabled cgroups.  It is a binary value flag that accepts
>>>>>> +either "0" (off) or a non-zero value (on).  This flag is set
>>>>>> +by the parent and is not delegatable.
>>>>>> +
>>>>>> +When it is on, tasks within this cpuset will be load-balanced
>>>>>> +by the kernel scheduler.  Tasks will be moved from CPUs with
>>>>>> +high load to other CPUs within the same cpuset with less load
>>>>>> +periodically.
>>>>>> +
>>>>>> +When it is off, there will be no load balancing among CPUs on
>>>>>> +this cgroup.  Tasks will stay in the CPUs they are running on
>>>>>> +and will not be moved to other CPUs.
>>>>>> +
>>>>>> +The initial value of this flag is "1".  This flag is then
>>>>>> +inherited by child cgroups with cpuset enabled.  Its state
>>>>>> +can only be changed on a scheduling domain cgroup with no
>>>>>> +cpuset-enabled children.
>>>>> [...]
>>>>>
>>>>>> +  /*
>>>>>> +   * On default hierarchy, a load balance flag change is only allowed
>>>>>> +   * in a scheduling domain with no child cpuset.
>>>>>> +   */
>>>>>> +  if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
>>>>>> +     (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
>>>>>> +  err = -EINVAL;
>>>>>> +  goto out;
>>>>>> +  }
>>>>> The rule is actually
>>>>>
>>>>>  - no child cpuset
>>>>>  - and it must be a scheduling domain
> I'm always a bit confused by the usage of "scheduling domain", which
> overlaps with the SD concept from the scheduler standpoint.

It is supposed to mimic the SD concept of the scheduler.

>
> AFAIU a cpuset sched domain is not guaranteed to be turned into an
> actual scheduler SD, am I wrong?
>
> If that's the case, why not better disambiguate these two concepts by
> calling the cpuset one a "cpus partition" or eventually "cpuset domain"?

Good point. Peter had a similar comment. I will probably change the name
and clarify it better in the documentation.

Cheers,
Longman

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-24 Thread Waiman Long
On 05/24/2018 11:43 AM, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 04:55:42PM -0400, Waiman Long wrote:
>> The sched.load_balance flag is needed to enable CPU isolation similar to
>> what can be done with the "isolcpus" kernel boot parameter. Its value
>> can only be changed in a scheduling domain with no child cpusets. On
>> a non-scheduling domain cpuset, the value of sched.load_balance is
>> inherited from its parent.
>>
>> This flag is set by the parent and is not delegatable.
>>
>> Signed-off-by: Waiman Long <long...@redhat.com>
>> ---
>>  Documentation/cgroup-v2.txt | 24 ++++++++++++++++++++
>>  kernel/cgroup/cpuset.c      | 53 +++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index 54d9e22..071b634d 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1536,6 +1536,30 @@ Cpuset Interface Files
>>  CPUs of the parent cgroup. Once it is set, this flag cannot be
>>  cleared if there are any child cgroups with cpuset enabled.
>>  
>> +A parent cgroup cannot distribute all its CPUs to child
>> +scheduling domain cgroups unless its load balancing flag is
>> +turned off.
>> +
>> +  cpuset.sched.load_balance
>> +A read-write single value file which exists on non-root
>> +cpuset-enabled cgroups.  It is a binary value flag that accepts
>> +either "0" (off) or a non-zero value (on).  This flag is set
>> +by the parent and is not delegatable.
>> +
>> +When it is on, tasks within this cpuset will be load-balanced
>> +by the kernel scheduler.  Tasks will be moved from CPUs with
>> +high load to other CPUs within the same cpuset with less load
>> +periodically.
>> +
>> +When it is off, there will be no load balancing among CPUs on
>> +this cgroup.  Tasks will stay in the CPUs they are running on
>> +and will not be moved to other CPUs.
>> +
>> +The initial value of this flag is "1".  This flag is then
>> +inherited by child cgroups with cpuset enabled.  Its state
>> +can only be changed on a scheduling domain cgroup with no
>> +cpuset-enabled children.
> I'm confused... why exactly do we have both domain and load_balance ?

The domain is for partitioning the CPUs only. It doesn't change the load
balancing state. So the load_balance flag is still needed to turn load
balancing on and off.

Cheers,
Longman



Re: [PATCH v8 2/6] cpuset: Add new v2 cpuset.sched.domain flag

2018-05-24 Thread Waiman Long
On 05/24/2018 11:41 AM, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 04:55:41PM -0400, Waiman Long wrote:
>> A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
>> flag indicates that the CPUs in the current cpuset should be treated
>> as a separate scheduling domain.
> The traditional name for this is a partition.

Do you want to call it cpuset.sched.partition? That name sounds strange
to me.

>>  This new flag is owned by the parent
>> and will cause the CPUs in the cpuset to be removed from the effective
>> CPUs of its parent.
> This is a significant departure from existing behaviour, but one I can
> appreciate. I don't immediately see something terribly wrong with it.
>
>> This is implemented internally by adding a new isolated_cpus mask that
>> holds the CPUs belonging to child scheduling domain cpusets so that:
>>
>>  isolated_cpus | effective_cpus = cpus_allowed
>>  isolated_cpus & effective_cpus = 0
>>
>> This new flag can only be turned on in a cpuset if its parent is either
>> root or a scheduling domain itself with non-empty cpu list. The state
>> of this flag cannot be changed if the cpuset has children.
>>
>> Signed-off-by: Waiman Long <long...@redhat.com>
>> ---
>>  Documentation/cgroup-v2.txt |  22 ++++
>>  kernel/cgroup/cpuset.c      | 237 +++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 256 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index cf7bac6..54d9e22 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1514,6 +1514,28 @@ Cpuset Interface Files
>>  it is a subset of "cpuset.mems".  Its value will be affected
>>  by memory nodes hotplug events.
>>  
>> +  cpuset.sched.domain
>> +A read-write single value file which exists on non-root
>> +cpuset-enabled cgroups.  It is a binary value flag that accepts
>> +either "0" (off) or a non-zero value (on).
> I would be conservative and only allow 0/1.

I stated that because echoing another integer value like 2 into the flag
file won't return an error. I will modify it to say just 0 and 1.

>>  This flag is set
>> +by the parent and is not delegatable.
>> +
>> +If set, it indicates that the CPUs in the current cgroup will
>> +be the root of a scheduling domain.  The root cgroup is always
>> +a scheduling domain.  There are constraints on where this flag
>> +can be set.  It can only be set in a cgroup if all the following
>> +conditions are true.
>> +
>> +1) The parent cgroup is also a scheduling domain with a non-empty
>> +   cpu list.
> Ah, so initially I was confused by the requirement for root to have it
> always set, but you'll allow child domains to steal _all_ CPUs, such
> that root ends up with an empty effective set?
>
> What about the (kernel) threads that cannot be moved out of the root
> group?

Actually, the current code won't allow you to take all the CPUs from a
scheduling domain cpuset with load balancing on. So there must be at
least 1 CPU left. You can take them all away if load balancing is off.
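
That constraint can be sketched with plain Python sets. This is an illustrative model, not kernel code; the function name and arguments are hypothetical stand-ins for the cpumask checks described above:

```python
def can_take_cpus(parent_effective: set, requested: set,
                  parent_load_balance: bool) -> bool:
    """Model: may 'requested' CPUs be moved into a child scheduling domain?"""
    if not requested <= parent_effective:
        return False   # can only take CPUs the parent actually has
    if parent_load_balance and requested == parent_effective:
        return False   # must leave at least 1 CPU when load balancing is on
    return True

parent = {0, 1, 2, 3}
assert can_take_cpus(parent, {2, 3}, parent_load_balance=True)
assert not can_take_cpus(parent, {0, 1, 2, 3}, parent_load_balance=True)
assert can_take_cpus(parent, {0, 1, 2, 3}, parent_load_balance=False)
```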

>> +2) The list of CPUs are exclusive, i.e. they are not shared by
>> +   any of its siblings.
> Right.
>
>> +3) There are no child cgroups with cpuset enabled.
>> +
>> +Setting this flag will take the CPUs away from the effective
>> +CPUs of the parent cgroup. Once it is set, this flag cannot be
>> +cleared if there are any child cgroups with cpuset enabled.
> This I'm not clear on. Why?
>
That is for pragmatic reasons, as it is easier to code this way. We could
remove this restriction, but that would make the code more complex.

Cheers,
Longman




Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-24 Thread Waiman Long
On 05/24/2018 11:16 AM, Juri Lelli wrote:
> On 24/05/18 11:09, Waiman Long wrote:
>> On 05/24/2018 10:36 AM, Juri Lelli wrote:
>>> On 17/05/18 16:55, Waiman Long wrote:
>>>
>>> [...]
>>>
>>>> +  A parent cgroup cannot distribute all its CPUs to child
>>>> +  scheduling domain cgroups unless its load balancing flag is
>>>> +  turned off.
>>>> +
>>>> +  cpuset.sched.load_balance
>>>> +  A read-write single value file which exists on non-root
>>>> +  cpuset-enabled cgroups.  It is a binary value flag that accepts
>>>> +  either "0" (off) or a non-zero value (on).  This flag is set
>>>> +  by the parent and is not delegatable.
>>>> +
>>>> +  When it is on, tasks within this cpuset will be load-balanced
>>>> +  by the kernel scheduler.  Tasks will be moved from CPUs with
>>>> +  high load to other CPUs within the same cpuset with less load
>>>> +  periodically.
>>>> +
>>>> +  When it is off, there will be no load balancing among CPUs on
>>>> +  this cgroup.  Tasks will stay in the CPUs they are running on
>>>> +  and will not be moved to other CPUs.
>>>> +
>>>> +  The initial value of this flag is "1".  This flag is then
>>>> +  inherited by child cgroups with cpuset enabled.  Its state
>>>> +  can only be changed on a scheduling domain cgroup with no
>>>> +  cpuset-enabled children.
>>> [...]
>>>
>>>> +  /*
>>>> +   * On default hierarchy, a load balance flag change is only allowed
>>>> +   * in a scheduling domain with no child cpuset.
>>>> +   */
>>>> +  if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
>>>> +     (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
>>>> +  err = -EINVAL;
>>>> +  goto out;
>>>> +  }
>>> The rule is actually
>>>
>>>  - no child cpuset
>>>  - and it must be a scheduling domain
>>>
>>> Right?
>> Yes, because it doesn't make sense to have a cpu in one cpuset that has
>> load balancing off while, at the same time, in another cpuset with load
>> balancing turned on. This restriction is there to make sure that the
>> above condition will not happen. I may be wrong if there is a realistic
>> use case where the above condition is desired.
> Yep, makes sense to me.
>
> Maybe add the second condition to the comment and documentation.

Sure. Will do.

-Longman



Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus

2018-05-23 Thread Waiman Long
On 05/23/2018 01:34 PM, Patrick Bellasi wrote:
> Hi Waiman,
>
> On 17-May 16:55, Waiman Long wrote:
>
> [...]
>
>> @@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
>>  int ndoms = 0;  /* number of sched domains in result */
>>  int nslot;  /* next empty doms[] struct cpumask slot */
>>  struct cgroup_subsys_state *pos_css;
>> +bool root_load_balance = is_sched_load_balance(&top_cpuset);
>>  
>>  doms = NULL;
>>  dattr = NULL;
>>  csa = NULL;
>>  
>>  /* Special case for the 99% of systems with one, full, sched domain */
>> -if (is_sched_load_balance(&top_cpuset)) {
>> +if (root_load_balance && !top_cpuset.isolation_count) {
> Perhaps I'm missing something but, it seems to me that, when the two
> conditions above are true, then we are going to destroy and rebuild
> the exact same scheduling domains.
>
> IOW, on 99% of systems where:
>
>    is_sched_load_balance(&top_cpuset)
>    top_cpuset.isolation_count == 0
>
> since boot time and forever, then every time we update a value for
> cpuset.cpus we keep rebuilding the same SDs.
>
> It's not strictly related to this patch, the same already happens in
> mainline based just on the first condition, but since you are extending
> that optimization, perhaps you can tell me where I'm possibly wrong or
> which cases I'm not considering.
>
> I'm interested mainly because on Android systems those conditions
> are always true and we see SDs rebuilds every time we write
> something in cpuset.cpus, which ultimately accounts for almost all the
> 6-7[ms] time required for the write to return, depending on the CPU
> frequency.
>
> Cheers Patrick
>
Yes, that is true. I will look into how to further optimize this. Thanks
for the suggestion.

-Longman



Re: [PATCH v8 2/6] cpuset: Add new v2 cpuset.sched.domain flag

2018-05-22 Thread Waiman Long
On 05/22/2018 08:57 AM, Juri Lelli wrote:
> Hi,
>
> On 17/05/18 16:55, Waiman Long wrote:
>
> [...]
>
>>  /**
>> + * update_isolated_cpumask - update the isolated_cpus mask of parent cpuset
>> + * @cpuset:  The cpuset that requests CPU isolation
>> + * @oldmask: The old isolated cpumask to be removed from the parent
>> + * @newmask: The new isolated cpumask to be added to the parent
>> + * Return: 0 if successful, an error code otherwise
>> + *
>> + * Changes to the isolated CPUs are not allowed if any of CPUs changing
>> + * state are in any of the child cpusets of the parent except the requesting
>> + * child.
>> + *
>> + * If the sched_domain flag changes, either the oldmask (0=>1) or the
>> + * newmask (1=>0) will be NULL.
>> + *
>> + * Called with cpuset_mutex held.
>> + */
>> +static int update_isolated_cpumask(struct cpuset *cpuset,
>> +struct cpumask *oldmask, struct cpumask *newmask)
>> +{
>> +int retval;
>> +int adding, deleting;
>> +cpumask_var_t addmask, delmask;
>> +struct cpuset *parent = parent_cs(cpuset);
>> +struct cpuset *sibling;
>> +struct cgroup_subsys_state *pos_css;
>> +int old_count = parent->isolation_count;
>> +bool dying = cpuset->css.flags & CSS_DYING;
>> +
>> +/*
>> + * Parent must be a scheduling domain with non-empty cpus_allowed.
>> + */
>> +if (!is_sched_domain(parent) || cpumask_empty(parent->cpus_allowed))
>> +return -EINVAL;
>> +
>> +/*
>> + * The oldmask, if present, must be a subset of parent's isolated
>> + * CPUs.
>> + */
>> +if (oldmask && !cpumask_empty(oldmask) && (!parent->isolation_count ||
>> +!cpumask_subset(oldmask, parent->isolated_cpus))) {
>> +WARN_ON_ONCE(1);
>> +return -EINVAL;
>> +}
>> +
>> +/*
>> + * A sched_domain state change is not allowed if there are
>> + * online children and the cpuset is not dying.
>> + */
>> +if (!dying && (!oldmask || !newmask) &&
>> +css_has_online_children(&cpuset->css))
>> +return -EBUSY;
>> +
>> +if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
>> +return -ENOMEM;
>> +if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
>> +free_cpumask_var(addmask);
>> +return -ENOMEM;
>> +}
>> +
>> +if (!old_count) {
>> +if (!zalloc_cpumask_var(&parent->isolated_cpus, GFP_KERNEL)) {
>> +retval = -ENOMEM;
>> +goto out;
>> +}
>> +old_count = 1;
>> +}
>> +
>> +retval = -EBUSY;
>> +adding = deleting = false;
>> +if (newmask)
>> +cpumask_copy(addmask, newmask);
>> +if (oldmask)
>> +deleting = cpumask_andnot(delmask, oldmask, addmask);
>> +if (newmask)
>> +adding = cpumask_andnot(addmask, newmask, delmask);
>> +
>> +if (!adding && !deleting)
>> +goto out_ok;
>> +
>> +/*
>> + * The cpus to be added must be in the parent's effective_cpus mask
>> + * but not in the isolated_cpus mask.
>> + */
>> +if (!cpumask_subset(addmask, parent->effective_cpus))
>> +goto out;
>> +if (parent->isolation_count &&
>> +cpumask_intersects(parent->isolated_cpus, addmask))
>> +goto out;
>> +
>> +/*
>> + * Check if any CPUs in addmask or delmask are in a sibling cpuset.
>> + * An empty sibling cpus_allowed means it is the same as parent's
>> + * effective_cpus. This checking is skipped if the cpuset is dying.
>> + */
>> +if (dying)
>> +goto updated_isolated_cpus;
>> +
>> +cpuset_for_each_child(sibling, pos_css, parent) {
>> +if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
>> +continue;
>> +if (cpumask_empty(sibling->cpus_allowed))
>> +goto out;
>> +if (adding &&
>> +cpumask_intersects(sibling->cpus_allowed, addmask))
>> +goto out;
>> +if (deleting &&
>> +cpumask_intersects(sibling->cpus_allowed, delmask))
>> +goto out;
>> +}
> Just got the below by echoing 1 into cpuset.sched.domain o
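
The sibling loop quoted above can be modelled with ordinary Python sets. This is an illustrative sketch, not kernel code; an empty sibling cpus_allowed stands for "same as the parent's effective_cpus", so any add or delete is refused in that case:

```python
def sibling_check_ok(sibling_cpus_allowed, addmask, delmask):
    """Model the cpuset_for_each_child() loop: refuse the change if any
    CPU being added or deleted is claimed by a sibling cpuset."""
    for cpus_allowed in sibling_cpus_allowed:
        if not cpus_allowed:        # empty == same as parent's effective_cpus
            return False
        if addmask & cpus_allowed:  # sibling already owns a CPU we want
            return False
        if delmask & cpus_allowed:  # sibling owns a CPU being released
            return False
    return True

# Siblings own CPUs 4-5; adding CPUs 0-1 does not conflict.
assert sibling_check_ok([{4, 5}], addmask={0, 1}, delmask=set())
# CPU 0 is already owned by a sibling, so the change is refused.
assert not sibling_check_ok([{0, 5}], addmask={0, 1}, delmask=set())
```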

Re: [PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy

2018-05-21 Thread Waiman Long
On 05/21/2018 11:09 AM, Patrick Bellasi wrote:
> On 21-May 09:55, Waiman Long wrote:
>
>> Changing cpuset.cpus will require searching for all the tasks in
>> the cpuset and changing their cpu masks.
> ... I'm wondering if that has to be the case. In principle there can
> be a different solution which is: update on demand. In the wakeup
> path, once we know a task really need a CPU and we want to find one
> for it, at that point we can align the cpuset mask with the task's
> one. Sort of using the cpuset mask as a clamp on top of the task's
> affinity mask.
>
> The main downside of such an approach could be the overheads in the
> wakeup path... but, still... that should be measured.
> The advantage is that we do not spend time changing attributes of
> tasks which, potentially, could be sleeping for a long time.

We already have a linked list of tasks in a cgroup. So it isn't too hard
to find them. Doing update on demand will require adding a bunch of code
to the wakeup path. So unless there is a good reason to do it, I don't
see it as necessary at this point.

>
>> That isn't a fast operation, but it shouldn't be too bad either
>> depending on how many tasks are in the cpuset.
> Indeed, although it still seems a bit odd and overkill to update
> task affinity for tasks which are not currently RUNNABLE. Isn't it?
>
>> I would not suggest doing rapid changes to cpuset.cpus as a means to tune
>> the behavior of a task. So what exactly is the tuning you are thinking
>> about? Is it moving a task from the a high-power cpu to a low power one
>> or vice versa?
> That's definitely a possible use case. In Android for example we
> usually assign more resources to TOP_APP tasks (those belonging to the
> application you are currently using) while we restrict the resources
> once we switch an app to be in BACKGROUND.

Switching an app from foreground to background and vice versa shouldn't
happen that frequently. Maybe once every few seconds, at most. I am just
wondering what use cases would require changing cpuset attributes tens of
times per second.

> More generally, if you think about a generic Run-Time Resource
> Management framework, which assigns resources to the tasks of multiple
> applications and wants to have fine-grained control.
>
>> If so, it is probably better to move the task from one cpuset of
>> high-power cpus to another cpuset of low-power cpus.
> This is what Android does now but also what we want to possibly
> change, for two main reasons:
>
> 1. it does not fit with the "number one guideline" for proper
>CGroups usage, which is "Organize Once and Control":
>   
> https://elixir.bootlin.com/linux/latest/source/Documentation/cgroup-v2.txt#L518
>where it says that:
>   migrating processes across cgroups frequently as a means to
>   apply different resource restrictions is discouraged.
>
>    Despite this guideline, it turns out that in v1 at least, it seems
>    to be faster to move tasks across cpusets than tuning cpuset
>    attributes... also when all the tasks are sleeping.

It is probably similar in v2 as the core logic is almost the same.

> 2. it does not allow us to get the advantages of accounting controllers
>    such as the memory controller where, by moving tasks around, we cannot
>    properly account for and control the amount of memory a task can use.

For v1, memory controller and cpuset controller can be in different
hierarchy. For v2, we have a unified hierarchy. However, we don't need
to enable all the controllers in different levels of the hierarchy. For
example,

A (memory, cpuset) --- B1 (cpuset)
                   \-- B2 (cpuset)

Cgroup A has memory and cpuset controllers enabled. The child cgroups B1
and B2 only have cpuset enabled. You can move tasks between B1 and B2
and they will be subjected to the same memory limitation as imposed by
the memory controller in A. So there are ways to work around that.
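
The layout above can be mocked up with a scratch directory standing in for a real cgroup2 mount; this is an illustrative sketch only — on a live system the writes would go to files under /sys/fs/cgroup and be interpreted by the kernel, and the PID used here is hypothetical:

```python
import tempfile
from pathlib import Path

# Scratch directory in place of a cgroup2 mount point.
root = Path(tempfile.mkdtemp())
for d in ("A/B1", "A/B2"):
    (root / d).mkdir(parents=True)

# Enable memory and cpuset for A (written to the root's subtree_control).
(root / "cgroup.subtree_control").write_text("+memory +cpuset")

# Enable only cpuset for A's children B1 and B2.
(root / "A" / "cgroup.subtree_control").write_text("+cpuset")

# Moving a (hypothetical) PID between B1 and B2 keeps it under A's memory
# limits, since the memory controller is only enabled at A's level.
(root / "A" / "B1" / "cgroup.procs").write_text("1234")
(root / "A" / "B2" / "cgroup.procs").write_text("1234")
```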

> Thus, for these reasons and also to possibly migrate to the unified
> hierarchy schema proposed by CGroups v2... we would like a
> low-overhead mechanism for setting/tuning cpuset at run-time with
> whatever frequency you like.

We may be able to improve the performance of changing cpuset attributes
somewhat, but I don't believe there will be much improvement here.

>>>> +
>>>> +The "cpuset" controller is hierarchical.  That means the controller
>>>> +cannot use CPUs or memory nodes not allowed in its parent.
>>>> +
>>>> +
>>>> +Cpuset Interface Files
>>>> +~~
>>>> +
>>>> +  cpuset.cpus
>>>> +  A read-write multiple values file which exists on non-root
>>>> +  cpuset-enabled cgroups.

Re: [PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy

2018-05-21 Thread Waiman Long
On 05/21/2018 07:55 AM, Patrick Bellasi wrote:
> Hi Waiman!
>
> I've started looking at the possibility to move Android to use cgroups
> v2 and the availability of the cpuset controller makes this even more
> promising.
>
> I'll try to give a run to this series on Android, meanwhile I have
> some (hopefully not too much dummy) questions below.
>
> On 17-May 16:55, Waiman Long wrote:
>> Given the fact that thread mode had been merged into 4.14, it is now
>> time to enable cpuset to be used in the default hierarchy (cgroup v2)
>> as it is clearly threaded.
>>
>> The cpuset controller had experienced feature creep since its
>> introduction more than a decade ago. Besides the core cpus and mems
>> control files to limit cpus and memory nodes, there are a bunch of
>> additional features that can be controlled from the userspace. Some of
>> the features are of doubtful usefulness and may not be actively used.
>>
>> This patch enables cpuset controller in the default hierarchy with
>> a minimal set of features, namely just the cpus and mems and their
>> effective_* counterparts.  We can certainly add more features to the
>> default hierarchy in the future if there is a real user need for them
>> later on.
>>
>> Alternatively, with the unified hierarchy, it may make more sense
>> to move some of those additional cpuset features, if desired, to
>> memory controller or may be to the cpu controller instead of staying
>> with cpuset.
>>
>> Signed-off-by: Waiman Long <long...@redhat.com>
>> ---
>>  Documentation/cgroup-v2.txt | 90 ++++++++++++++++++++++++++++++++++++---
>>  kernel/cgroup/cpuset.c      | 48 ++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 130 insertions(+), 8 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index 74cdeae..cf7bac6 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
>> 5-3-2. Writeback
>>   5-4. PID
>> 5-4-1. PID Interface Files
>> - 5-5. Device
>> - 5-6. RDMA
>> -   5-6-1. RDMA Interface Files
>> - 5-7. Misc
>> -   5-7-1. perf_event
>> + 5-5. Cpuset
>> +   5.5-1. Cpuset Interface Files
>> + 5-6. Device
>> + 5-7. RDMA
>> +   5-7-1. RDMA Interface Files
>> + 5-8. Misc
>> +   5-8-1. perf_event
>>   5-N. Non-normative information
>> 5-N-1. CPU controller root cgroup process behaviour
>> 5-N-2. IO controller root cgroup process behaviour
>> @@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
>>  of a new process would cause a cgroup policy to be violated.
>>  
>>  
>> +Cpuset
>> +--
>> +
>> +The "cpuset" controller provides a mechanism for constraining
>> +the CPU and memory node placement of tasks to only the resources
>> +specified in the cpuset interface files in a task's current cgroup.
>> +This is especially valuable on large NUMA systems where placing jobs
>> +on properly sized subsets of the systems with careful processor and
>> +memory placement to reduce cross-node memory access and contention
>> +can improve overall system performance.
> Another quite important use-case for cpuset is Android, where they are
> actively used to do both power-saving as well as performance tunings.
> For example, depending on the status of an application, its threads
> can be allowed to run on all available CPUS (e.g. foreground apps) or
> be restricted to only a few energy-efficient CPUs (e.g. background apps).
>
> Since here we are at "rewriting" cpusets for v2, I think it's important
> to keep this mobile world scenario into consideration.
>
> For example, in this context, we are looking at the possibility to
> update/tune cpuset.cpus with a relatively high rate, i.e. tens of
> times per second. Not sure that's the same update rate usually
> required for the large NUMA systems you cite above.  However, in this
> case it's quite important to have really small overheads for these
> operations.

The cgroup interface isn't designed for high update throughput. Changing
cpuset.cpus will require searching for all the tasks in the cpuset and
changing their cpu masks. That isn't a fast operation, but it shouldn't
be too bad either depending on how many tasks are in the cpuset.

I would not suggest doing rapid changes to cpuset.cpus as a means to tune
the behavior of a task. So what exactly is the tuning you are thinking
about? Is it movi

[PATCH v8 2/6] cpuset: Add new v2 cpuset.sched.domain flag

2018-05-17 Thread Waiman Long
A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
flag indicates that the CPUs in the current cpuset should be treated
as a separate scheduling domain. This new flag is owned by the parent
and will cause the CPUs in the cpuset to be removed from the effective
CPUs of its parent.

This is implemented internally by adding a new isolated_cpus mask that
holds the CPUs belonging to child scheduling domain cpusets so that:

isolated_cpus | effective_cpus = cpus_allowed
isolated_cpus & effective_cpus = 0
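
These two identities can be checked with ordinary sets; the sketch below is illustrative (names mirror the fields above, but this is not kernel code):

```python
def check_partition_invariant(cpus_allowed, effective_cpus, isolated_cpus):
    """Verify the cpuset partition invariants stated above:
    union covers cpus_allowed, and the two masks are disjoint."""
    return (isolated_cpus | effective_cpus == cpus_allowed
            and not (isolated_cpus & effective_cpus))

# Example: CPUs 0-3 allowed, CPUs 2-3 handed to a child scheduling domain.
allowed = {0, 1, 2, 3}
isolated = {2, 3}
effective = allowed - isolated
assert check_partition_invariant(allowed, effective, isolated)
```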

This new flag can only be turned on in a cpuset if its parent is either
root or a scheduling domain itself with non-empty cpu list. The state
of this flag cannot be changed if the cpuset has children.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt |  22 ++++
 kernel/cgroup/cpuset.c      | 237 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 256 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index cf7bac6..54d9e22 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1514,6 +1514,28 @@ Cpuset Interface Files
it is a subset of "cpuset.mems".  Its value will be affected
by memory nodes hotplug events.
 
+  cpuset.sched.domain
+   A read-write single value file which exists on non-root
+   cpuset-enabled cgroups.  It is a binary value flag that accepts
+   either "0" (off) or a non-zero value (on).  This flag is set
+   by the parent and is not delegatable.
+
+   If set, it indicates that the CPUs in the current cgroup will
+   be the root of a scheduling domain.  The root cgroup is always
+   a scheduling domain.  There are constraints on where this flag
+   can be set.  It can only be set in a cgroup if all the following
+   conditions are true.
+
+   1) The parent cgroup is also a scheduling domain with a non-empty
+  cpu list.
+   2) The list of CPUs are exclusive, i.e. they are not shared by
+  any of its siblings.
+   3) There are no child cgroups with cpuset enabled.
+
+   Setting this flag will take the CPUs away from the effective
+   CPUs of the parent cgroup. Once it is set, this flag cannot be
+   cleared if there are any child cgroups with cpuset enabled.
+
 
 Device controller
 -
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 419b758..e1a1af0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -109,6 +109,9 @@ struct cpuset {
cpumask_var_t effective_cpus;
nodemask_t effective_mems;
 
+   /* Isolated CPUs for scheduling domain children */
+   cpumask_var_t isolated_cpus;
+
/*
 * This is old Memory Nodes tasks took on.
 *
@@ -134,6 +137,9 @@ struct cpuset {
 
/* for custom sched domain */
int relax_domain_level;
+
+   /* for isolated_cpus */
+   int isolation_count;
 };
 
 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -175,6 +181,7 @@ static inline bool task_has_mempolicy(struct task_struct *task)
CS_SCHED_LOAD_BALANCE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+   CS_SCHED_DOMAIN,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -203,6 +210,11 @@ static inline int is_sched_load_balance(const struct cpuset *cs)
	return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 }
 
+static inline int is_sched_domain(const struct cpuset *cs)
+{
+   return test_bit(CS_SCHED_DOMAIN, &cs->flags);
+}
+
 static inline int is_memory_migrate(const struct cpuset *cs)
 {
	return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
@@ -220,7 +232,7 @@ static inline int is_spread_slab(const struct cpuset *cs)
 
 static struct cpuset top_cpuset = {
.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
- (1 << CS_MEM_EXCLUSIVE)),
+ (1 << CS_MEM_EXCLUSIVE) | (1 << CS_SCHED_DOMAIN)),
 };
 
 /**
@@ -902,7 +914,19 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
cpuset_for_each_descendant_pre(cp, pos_css, cs) {
struct cpuset *parent = parent_cs(cp);
 
-   cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
+   /*
+* If parent has isolated CPUs, include them in the list
+* of allowable CPUs.
+*/
+   if (parent->isolation_count) {
+   cpumask_or(new_cpus, parent->effective_cpus,
+  parent->isolated_cpus);
+   cpumask_and(new_cpus, new_cpus, cpu_online_mask);
+   cpumask_and(new_cpus, new_cpus, cp->cpus_allowed);
+   } else {
+   cpumask_and

[PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-17 Thread Waiman Long
The sched.load_balance flag is needed to enable CPU isolation similar to
what can be done with the "isolcpus" kernel boot parameter. Its value
can only be changed in a scheduling domain with no child cpusets. On
a non-scheduling domain cpuset, the value of sched.load_balance is
inherited from its parent.

This flag is set by the parent and is not delegatable.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt | 24 ++++++++++++++++++++
 kernel/cgroup/cpuset.c      | 53 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 73 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 54d9e22..071b634d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1536,6 +1536,30 @@ Cpuset Interface Files
CPUs of the parent cgroup. Once it is set, this flag cannot be
cleared if there are any child cgroups with cpuset enabled.
 
+   A parent cgroup cannot distribute all its CPUs to child
+   scheduling domain cgroups unless its load balancing flag is
+   turned off.
+
+  cpuset.sched.load_balance
+   A read-write single value file which exists on non-root
+   cpuset-enabled cgroups.  It is a binary value flag that accepts
+   either "0" (off) or a non-zero value (on).  This flag is set
+   by the parent and is not delegatable.
+
+   When it is on, tasks within this cpuset will be load-balanced
+   by the kernel scheduler.  Tasks will be moved from CPUs with
+   high load to other CPUs within the same cpuset with less load
+   periodically.
+
+   When it is off, there will be no load balancing among CPUs on
+   this cgroup.  Tasks will stay in the CPUs they are running on
+   and will not be moved to other CPUs.
+
+   The initial value of this flag is "1".  This flag is then
+   inherited by child cgroups with cpuset enabled.  Its state
+   can only be changed on a scheduling domain cgroup with no
+   cpuset-enabled children.
+
 
 Device controller
 -
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e1a1af0..368e1b7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -510,7 +510,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 
par = parent_cs(cur);
 
-   /* On legacy hiearchy, we must be a subset of our parent cpuset. */
+   /* On legacy hierarchy, we must be a subset of our parent cpuset. */
ret = -EACCES;
if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
goto out;
@@ -1061,6 +1061,14 @@ static int update_isolated_cpumask(struct cpuset *cpuset,
goto out;
 
/*
+* A parent can't distribute all its CPUs to child scheduling
+* domain cpusets unless load balancing is off.
+*/
+   if (adding && !deleting && is_sched_load_balance(parent) &&
+   cpumask_equal(addmask, parent->effective_cpus))
+   goto out;
+
+   /*
 * Check if any CPUs in addmask or delmask are in a sibling cpuset.
 * An empty sibling cpus_allowed means it is the same as parent's
 * effective_cpus. This checking is skipped if the cpuset is dying.
@@ -1531,6 +1539,16 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 
domain_flag_changed = (is_sched_domain(cs) != is_sched_domain(trialcs));
 
+   /*
+* On the default hierarchy, a load balance flag change is only allowed
+* in a scheduling domain with no child cpuset.
+*/
+   if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
+  (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
+   err = -EINVAL;
+   goto out;
+   }
+
if (domain_flag_changed) {
err = turning_on
? update_isolated_cpumask(cs, NULL, cs->cpus_allowed)
@@ -2187,6 +2205,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
.flags = CFTYPE_NOT_ON_ROOT,
},
 
+   {
+   .name = "sched.load_balance",
+   .read_u64 = cpuset_read_u64,
+   .write_u64 = cpuset_write_u64,
+   .private = FILE_SCHED_LOAD_BALANCE,
+   .flags = CFTYPE_NOT_ON_ROOT,
+   },
+
{ } /* terminate */
 };
 
@@ -2200,19 +2226,38 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 cpuset_css_alloc(struct cgroup_subsys_state *parent_css)
 {
struct cpuset *cs;
+   struct cgroup_subsys_state *errptr = ERR_PTR(-ENOMEM);
 
if (!parent_css)
return &top_cpuset.css;
 
cs = kzalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
-   return ERR_PTR(-ENOMEM);
+   return errptr;

[PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus

2018-05-17 Thread Waiman Long
The generate_sched_domains() function and the hotplug code are modified
to make them use the newly introduced isolated_cpus mask for schedule
domains generation.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 kernel/cgroup/cpuset.c | 33 +
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 368e1b7..0e75f83 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
int ndoms = 0;  /* number of sched domains in result */
int nslot;  /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
+   bool root_load_balance = is_sched_load_balance(_cpuset);
 
doms = NULL;
dattr = NULL;
csa = NULL;
 
/* Special case for the 99% of systems with one, full, sched domain */
-   if (is_sched_load_balance(_cpuset)) {
+   if (root_load_balance && !top_cpuset.isolation_count) {
ndoms = 1;
doms = alloc_sched_domains(ndoms);
if (!doms)
@@ -701,6 +702,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
csn = 0;
 
rcu_read_lock();
+   if (root_load_balance)
+   csa[csn++] = &top_cpuset;
cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
if (cp == &top_cpuset)
continue;
@@ -711,6 +714,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
 * parent's cpus, so just skip them, and then we call
 * update_domain_attr_tree() to calc relax_domain_level of
 * the corresponding sched domain.
+*
+* If root is load-balancing, we can skip @cp if it
+* is a subset of the root's effective_cpus.
 */
if (!cpumask_empty(cp->cpus_allowed) &&
!(is_sched_load_balance(cp) &&
  cpumask_intersects(cp->cpus_allowed,
 housekeeping_cpumask(HK_FLAG_DOMAIN))))
continue;
 
+   if (root_load_balance &&
+   cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+   continue;
+
if (is_sched_load_balance(cp))
csa[csn++] = cp;
 
-   /* skip @cp's subtree */
-   pos_css = css_rightmost_descendant(pos_css);
+   /* skip @cp's subtree if not a scheduling domain */
+   if (!is_sched_domain(cp))
+   pos_css = css_rightmost_descendant(pos_css);
}
rcu_read_unlock();
 
@@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
 * passing doms with offlined cpu to partition_sched_domains().
 * Anyways, hotplug work item will rebuild sched domains.
 */
-   if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+   if (!top_cpuset.isolation_count &&
+   !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+   goto out;
+
+   if (top_cpuset.isolation_count &&
+  !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
goto out;
 
/* Generate domain masks and attrs */
@@ -2624,6 +2640,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
cpumask_copy(&new_cpus, cpu_active_mask);
new_mems = node_states[N_MEMORY];
 
+   /*
+* If isolated_cpus is populated, it is likely that the check below
+* will produce a false positive on cpus_updated when the cpu list
+* isn't changed. It is extra work, but it is better to be safe.
+*/
cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
@@ -2632,6 +2653,10 @@ static void cpuset_hotplug_workfn(struct work_struct 
*work)
spin_lock_irq(&callback_lock);
if (!on_dfl)
cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+
+   if (top_cpuset.isolation_count)
+   cpumask_andnot(&new_cpus, &new_cpus,
+   top_cpuset.isolated_cpus);
cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
spin_unlock_irq(&callback_lock);
/* we don't mess with cpumasks of tasks in top_cpuset */
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy

2018-05-17 Thread Waiman Long
Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to the
memory controller or the cpu controller instead of keeping them
with cpuset.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt | 90 ++---
 kernel/cgroup/cpuset.c  | 48 ++--
 2 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..cf7bac6 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
5-3-2. Writeback
  5-4. PID
5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
-   5-6-1. RDMA Interface Files
- 5-7. Misc
-   5-7-1. perf_event
+ 5-5. Cpuset
+   5-5-1. Cpuset Interface Files
+ 5-6. Device
+ 5-7. RDMA
+   5-7-1. RDMA Interface Files
+ 5-8. Misc
+   5-8-1. perf_event
  5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+--
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~
+
+  cpuset.cpus
+   A read-write multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the CPUs allowed to be used by tasks within this
+   cgroup.  The CPU numbers are comma-separated numbers or
+   ranges.  For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.cpus" or all the available CPUs if none is found.
+
+   The value of "cpuset.cpus" stays constant until the next update
+   and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+   A read-only multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the onlined CPUs that are actually allowed to be
+   used by tasks within the current cgroup.  If "cpuset.cpus"
+   is empty, it shows all the CPUs from the parent cgroup that
+   will be available to be used by this cgroup.  Otherwise, it is
+   a subset of "cpuset.cpus".  Its value will be affected by CPU
+   hotplug events.
+
+  cpuset.mems
+   A read-write multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the memory nodes allowed to be used by tasks within
+   this cgroup.  The memory node numbers are comma-separated
+   numbers or ranges.  For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.mems" or all the available memory nodes if none
+   is found.
+
+   The value of "cpuset.mems" stays constant until the next update
+   and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+   A read-only multiple values file which exists on non-root
+   cpuset-enabled cgroups.
+
+   It lists the onlined memory nodes that are actually allowed to
+   be used by tasks within the current cgroup.  I

[PATCH v8 6/6] cpuset: Allow reporting of sched domain generation info

2018-05-17 Thread Waiman Long
This patch enables us to report sched domain generation information.

If DYNAMIC_DEBUG is enabled, issuing the following command

  echo "file cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control

and setting loglevel to 8 will allow the kernel to show what scheduling
domain changes are being made.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 kernel/cgroup/cpuset.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index fb8aa82b..8f586e8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -820,6 +820,12 @@ static int generate_sched_domains(cpumask_var_t **domains,
}
BUG_ON(nslot != ndoms);
 
+#ifdef CONFIG_DEBUG_KERNEL
+   for (i = 0; i < ndoms; i++)
+   pr_debug("generate_sched_domains dom %d: %*pbl\n", i,
+cpumask_pr_args(doms[i]));
+#endif
+
 done:
kfree(csa);
 
-- 
1.8.3.1



[PATCH v8 0/6] Enable cpuset controller in default hierarchy

2018-05-17 Thread Waiman Long
v8:
 - Remove cpuset.cpus.isolated and add a new cpuset.sched.domain flag
   and rework the code accordingly.

v7:
 - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
 - Enforce that load_balancing can only be turned off on cpusets with
   CPUs from the isolated list.
 - Update sched domain generation to allow cpusets with CPUs only
   from the isolated CPU list to be in separate root domains.

v6:
 - Hide cpuset control knobs in root cgroup.
 - Rename effective_cpus and effective_mems to cpus.effective and
   mems.effective respectively.
 - Remove cpuset.flags and add cpuset.sched_load_balance instead
   as the behavior of sched_load_balance has changed and so is
   not a simple flag.
 - Update cgroup-v2.txt accordingly.

v5:
 - Add patch 2 to provide the cpuset.flags control knob for the
   sched_load_balance flag which should be the only feature that is
   essential as a replacement of the "isolcpus" kernel boot parameter.

v4:
 - Further minimize the feature set by removing the flags control knob.

v3:
 - Further trim the additional features down to just memory_migrate.
 - Update Documentation/cgroup-v2.txt.

v6 patch: https://lkml.org/lkml/2018/3/21/530
v7 patch: https://lkml.org/lkml/2018/4/19/448

The purpose of this patchset is to provide a basic set of cpuset control
files for cgroup v2. This basic set includes the non-root "cpus",
"mems", "sched.load_balance" and "sched.domain". The "cpus.effective"
and "mems.effective" will appear in all cpuset-enabled cgroups.

The new control file that is unique to v2 is "sched.domain". It is a
boolean flag file that designates whether a cgroup is a scheduling
domain with its own unique list of CPUs, disjoint from the CPUs of
other scheduling domains. The root cgroup is always a scheduling
domain. Multiple levels of scheduling domains are supported with some
limitations, so a container scheduling domain root can behave like a
real root.

When a scheduling domain cgroup is removed, its list of exclusive CPUs
will be returned to the parent's cpus.effective automatically.

The "sched.load_balance" flag can only be changed in a scheduling domain.
with no child cpuset-enabled cgroups.

This patchset supports isolated CPUs in a child scheduling domain with
load balancing off. It also allows easy setup of multiple scheduling
domains without requiring the trick of turning load balancing off in the
root cgroup.

This patchset does not exclude the possibility of adding more features
in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their
effective counterparts.

Patch 2 adds a new "sched.domain" control file for setting up multiple
scheduling domains. A scheduling domain implies cpu_exclusive.

Patch 3 adds a "sched.load_balance" flag to turn off load balancing in
a scheduling domain.

Patch 4 updates the scheduling domain generation code to work with
the new scheduling domain feature.

Patch 5 exposes cpus.effective and mems.effective to the root cgroup as
enabling child scheduling domains will take CPUs away from the root cgroup.
So it will be nice to monitor what CPUs are left there.

Patch 6 enables printing of debug information about scheduling
domain generation.

Waiman Long (6):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Add new v2 cpuset.sched.domain flag
  cpuset: Add cpuset.sched.load_balance flag to v2
  cpuset: Make generate_sched_domains() recognize isolated_cpus
  cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  cpuset: Allow reporting of sched domain generation info

 Documentation/cgroup-v2.txt | 136 +++-
 kernel/cgroup/cpuset.c  | 375 ++--
 2 files changed, 492 insertions(+), 19 deletions(-)

-- 
1.8.3.1



[PATCH v8 5/6] cpuset: Expose cpus.effective and mems.effective on cgroup v2 root

2018-05-17 Thread Waiman Long
Because of the fact that setting the "cpuset.sched.domain" in a direct
child of root can remove CPUs from the root's effective CPU list, it
makes sense to know what CPUs are left in the root cgroup for scheduling
purpose. So the "cpuset.cpus.effective" control file is now exposed in
the v2 cgroup root.

For consistency, the "cpuset.mems.effective" control file is exposed
as well.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt | 4 ++--
 kernel/cgroup/cpuset.c  | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 071b634d..8739b10 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1474,7 +1474,7 @@ Cpuset Interface Files
and won't be affected by any CPU hotplug events.
 
   cpuset.cpus.effective
-   A read-only multiple values file which exists on non-root
+   A read-only multiple values file which exists on all
cpuset-enabled cgroups.
 
It lists the onlined CPUs that are actually allowed to be
@@ -1504,7 +1504,7 @@ Cpuset Interface Files
and won't be affected by any memory nodes hotplug events.
 
   cpuset.mems.effective
-   A read-only multiple values file which exists on non-root
+   A read-only multiple values file which exists on all
cpuset-enabled cgroups.
 
It lists the onlined memory nodes that are actually allowed to
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0e75f83..fb8aa82b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2203,14 +2203,12 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
.name = "cpus.effective",
.seq_show = cpuset_common_seq_show,
.private = FILE_EFFECTIVE_CPULIST,
-   .flags = CFTYPE_NOT_ON_ROOT,
},
 
{
.name = "mems.effective",
.seq_show = cpuset_common_seq_show,
.private = FILE_EFFECTIVE_MEMLIST,
-   .flags = CFTYPE_NOT_ON_ROOT,
},
 
{
-- 
1.8.3.1



Re: [PATCH v7 3/5] cpuset: Add a root-only cpus.isolated v2 control file

2018-05-07 Thread Waiman Long
On 05/02/2018 10:08 AM, Peter Zijlstra wrote:
> On Thu, Apr 19, 2018 at 09:47:02AM -0400, Waiman Long wrote:
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index c970bd7..8d89dc2 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1484,6 +1484,31 @@ Cpuset Interface Files
>>  a subset of "cpuset.cpus".  Its value will be affected by CPU
>>  hotplug events.
>>  
>> +  cpuset.cpus.isolated
>> +A read-write multiple values file which exists on root cgroup
>> +only.
>> +
>> +It lists the CPUs that have been withdrawn from the root cgroup
>> +for load balancing.  These CPUs can still be allocated to child
>> +cpusets with load balancing enabled, if necessary.
>> +
>> +If a child cpuset contains only an exclusive set of CPUs that are
>> +a subset of the isolated CPUs and with load balancing enabled,
>> +these CPUs will be load balanced on a separate root domain from
>> +the one in the root cgroup.
>> +
>> +Just putting the CPUs into "cpuset.cpus.isolated" will be
>> +enough to disable load balancing on those CPUs as long as they
>> +do not appear in a child cpuset with load balancing enabled.
>> +Fine-grained control of cpu isolation can also be done by
>> +putting these isolated CPUs into child cpusets with load
>> +balancing disabled.
>> +
>> +The "cpuset.cpus.isolated" should be set up before child
>> +cpusets are created.  Once child cpusets are present, changes
>> +to "cpuset.cpus.isolated" will not be allowed if the CPUs that
>> +change their states are in any of the child cpusets.
>> +
> So I see why you did this, but it is _really_ ugly and breaks the
> container invariant.
>
> Ideally we'd make the root group less special, not more special.

Yes, I am planning to make the root cgroup less special by putting a new
isolation flag into all the non-root cgroups.

The container invariant thing, however, is a bit hard to do. Do we
really need a container root to behave exactly like the real root? I
guess we can make that happen if we really want to, but it will
certainly make the code more complex. So it is a trade-off between what
is worth doing and what is not.

Cheers,
Longman



Re: [PATCH v7 3/4] ipc: Allow boot time extension of IPCMNI from 32k to 2M

2018-05-07 Thread Waiman Long
On 05/07/2018 07:17 PM, Luis R. Rodriguez wrote:
> On Mon, May 07, 2018 at 04:59:11PM -0400, Waiman Long wrote:
>> diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
>> index 49f9bf4..d62335f 100644
>> --- a/ipc/ipc_sysctl.c
>> +++ b/ipc/ipc_sysctl.c
>> @@ -120,7 +120,8 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
>>  static int zero;
>>  static int one = 1;
>>  static int int_max = INT_MAX;
>> -static int ipc_mni = IPCMNI;
>> +int ipc_mni __read_mostly = IPCMNI;
>> +int ipc_mni_shift __read_mostly = IPCMNI_SHIFT;
>>  
>>  static struct ctl_table ipc_kern_table[] = {
>>  {
> Is use of ipc_mni and ipc_mni_shift a hot path? As per Christoph Lameter,
> its use should be reserved for data that is actually used frequently in hot
> paths, and typically this was done after performance traces reveal contention
> because a neighboring variable was frequently written to [0]. These would also
> be tightly packed, to reduce the number of cachelines needed to execute a
> critical path, so we should be selective about what variables use it.
>
> Your commit log does not describe why you'd use __read_mostly here. It would
> be useful if it did.
>
> [0] https://lkml.kernel.org/r/alpine.deb.2.11.1504301343190.28...@gentwo.org
I used __read_mostly to reduce the performance impact of transitioning
from a constant to a variable. But you are right, their use is probably
not in a hot path, so even regular variables shouldn't show any
noticeable performance difference. I can take that out in my next
version after I gather enough feedback.

Cheers,
Longman


Re: [PATCH v7 1/4] ipc: IPCMNI limit check for msgmni and shmmni

2018-05-07 Thread Waiman Long
On 05/07/2018 06:39 PM, Luis R. Rodriguez wrote:
> On Mon, May 07, 2018 at 04:59:09PM -0400, Waiman Long wrote:
>> A user can write arbitrary integer values to msgmni and shmmni sysctl
>> parameters without getting error, but the actual limit is really
>> IPCMNI (32k). This can mislead users as they think they can get a
>> value that is not real.
>>
>> The right limits are now set for msgmni and shmmni so that the users
>> will become aware if they set a value outside of the acceptable range.
>>
>> Signed-off-by: Waiman Long <long...@redhat.com>
>> ---
>>  ipc/ipc_sysctl.c | 7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
>> index 8ad93c2..f87cb29 100644
>> --- a/ipc/ipc_sysctl.c
>> +++ b/ipc/ipc_sysctl.c
>> @@ -99,6 +99,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
>>  static int zero;
>>  static int one = 1;
>>  static int int_max = INT_MAX;
>> +static int ipc_mni = IPCMNI;
>>  
>>  static struct ctl_table ipc_kern_table[] = {
>>  {
>> @@ -120,7 +121,9 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
>>  .data   = &init_ipc_ns.shm_ctlmni,
>>  .maxlen = sizeof(init_ipc_ns.shm_ctlmni),
>>  .mode   = 0644,
>> -.proc_handler   = proc_ipc_dointvec,
>> +.proc_handler   = proc_ipc_dointvec_minmax,
>> +.extra1 = &zero,
>> +.extra2 = &ipc_mni,
>>  },
>>  {
>>  .procname   = "shm_rmid_forced",
>> @@ -147,7 +150,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
>>  .mode   = 0644,
>>  .proc_handler   = proc_ipc_dointvec_minmax,
>>  .extra1 = &zero,
>> -.extra2 = &int_max,
>> +.extra2 = &ipc_mni,
>>  },
>>  {
>>  .procname   = "auto_msgmni",
>> -- 
>> 1.8.3.1
> It seems negative values are not allowed, if true then having
> a caller to use proc_douintvec_minmax() would help with ensuring
> no invalid negative input values are used as well.
>
>   Luis

A negative value doesn't make sense here. So it is true that we could
use proc_douintvec_minmax() instead. However, the data types themselves
are defined as "int", so I think it is better to keep using
proc_dointvec_minmax() to be consistent with the data type.

Cheers,
Longman



[PATCH v7 2/4] ipc: IPCMNI limit check for semmni

2018-05-07 Thread Waiman Long
For SysV semaphores, the semmni value is the last element of the
4-element sem number array. To make semmni behave in a similar way
to msgmni and shmmni, we can't directly use the proc_dointvec_minmax()
handler. Instead, a special sem-specific handler is added to check the
last element to make sure that it is limited to the [0, IPCMNI] range.
An error will be returned if this is not the case.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 ipc/ipc_sysctl.c | 23 ++-
 ipc/util.h   |  9 +
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index f87cb29..49f9bf4 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -88,12 +88,33 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
 }
 
+static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret, semmni;
+   struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+
+   semmni = ns->sem_ctls[3];
+   ret = proc_ipc_dointvec(table, write, buffer, lenp, ppos);
+
+   if (!ret)
+   ret = sem_check_semmni(current->nsproxy->ipc_ns);
+
+   /*
+* Reset the semmni value if an error happens.
+*/
+   if (ret)
+   ns->sem_ctls[3] = semmni;
+   return ret;
+}
+
 #else
 #define proc_ipc_doulongvec_minmax NULL
 #define proc_ipc_dointvec NULL
 #define proc_ipc_dointvec_minmax   NULL
 #define proc_ipc_dointvec_minmax_orphans   NULL
 #define proc_ipc_auto_msgmni  NULL
+#define proc_ipc_sem_dointvec NULL
 #endif
 
 static int zero;
@@ -175,7 +196,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
.data   = &init_ipc_ns.sem_ctls,
.maxlen = 4*sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_ipc_dointvec,
+   .proc_handler   = proc_ipc_sem_dointvec,
},
 #ifdef CONFIG_CHECKPOINT_RESTORE
{
diff --git a/ipc/util.h b/ipc/util.h
index acc5159..8b413f1 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -218,6 +218,15 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
 
+static inline int sem_check_semmni(struct ipc_namespace *ns) {
+   /*
+* Check semmni range [0, IPCMNI]
+* semmni is the last element of sem_ctls[4] array
+*/
+   return ((ns->sem_ctls[3] < 0) || (ns->sem_ctls[3] > IPCMNI))
+   ? -ERANGE : 0;
+}
+
 #ifdef CONFIG_COMPAT
 #include 
 struct compat_ipc_perm {
-- 
1.8.3.1



[PATCH v7 1/4] ipc: IPCMNI limit check for msgmni and shmmni

2018-05-07 Thread Waiman Long
A user can write arbitrary integer values to msgmni and shmmni sysctl
parameters without getting error, but the actual limit is really
IPCMNI (32k). This can mislead users as they think they can get a
value that is not real.

The right limits are now set for msgmni and shmmni so that the users
will become aware if they set a value outside of the acceptable range.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 ipc/ipc_sysctl.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 8ad93c2..f87cb29 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -99,6 +99,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
+static int ipc_mni = IPCMNI;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -120,7 +121,9 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
.data   = &init_ipc_ns.shm_ctlmni,
.maxlen = sizeof(init_ipc_ns.shm_ctlmni),
.mode   = 0644,
-   .proc_handler   = proc_ipc_dointvec,
+   .proc_handler   = proc_ipc_dointvec_minmax,
+   .extra1 = &zero,
+   .extra2 = &ipc_mni,
},
{
.procname   = "shm_rmid_forced",
@@ -147,7 +150,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
.mode   = 0644,
.proc_handler   = proc_ipc_dointvec_minmax,
.extra1 = &zero,
-   .extra2 = &int_max,
+   .extra2 = &ipc_mni,
},
{
.procname   = "auto_msgmni",
-- 
1.8.3.1



[PATCH v7 3/4] ipc: Allow boot time extension of IPCMNI from 32k to 2M

2018-05-07 Thread Waiman Long
The maximum number of unique System V IPC identifiers was limited to
32k.  That limit should be big enough for most use cases.

However, there are some users out there requesting for more. To satisfy
the need of those users, a new boot time kernel option "ipcmni_extend"
is added to extend the IPCMNI value to 2M. This is a 64X increase which
hopefully is big enough for them.

This new option does have the side effect of reducing the maximum
number of unique sequence numbers from 64k down to 1k. So it is
a trade-off.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  3 ++
 ipc/ipc_sysctl.c| 12 ++-
 ipc/util.c  | 12 +++
 ipc/util.h  | 42 +++--
 4 files changed, 52 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 11fc28e..00bc0cb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1735,6 +1735,9 @@
ip= [IP_PNP]
See Documentation/filesystems/nfs/nfsroot.txt.
 
+   ipcmni_extend   [KNL] Extend the maximum number of unique System V
+   IPC identifiers from 32768 to 2097152.
+
irqaffinity=[SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
 
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 49f9bf4..d62335f 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -120,7 +120,8 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, 
int write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
-static int ipc_mni = IPCMNI;
+int ipc_mni __read_mostly = IPCMNI;
+int ipc_mni_shift __read_mostly = IPCMNI_SHIFT;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -246,3 +247,12 @@ static int __init ipc_sysctl_init(void)
 }
 
 device_initcall(ipc_sysctl_init);
+
+static int __init ipc_mni_extend(char *str)
+{
+   ipc_mni = IPCMNI_EXTEND;
+   ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
+   pr_info("IPCMNI extended to %d.\n", ipc_mni);
+   return 0;
+}
+early_param("ipcmni_extend", ipc_mni_extend);
diff --git a/ipc/util.c b/ipc/util.c
index 4e81182..782a8d0 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -113,7 +113,7 @@ static int __init ipc_init(void)
  * @ids: ipc identifier set
  *
  * Set up the sequence range to use for the ipc identifier range (limited
- * below IPCMNI) then initialise the keys hashtable and ids idr.
+ * below ipc_mni) then initialise the keys hashtable and ids idr.
  */
 int ipc_init_ids(struct ipc_ids *ids)
 {
@@ -214,7 +214,7 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
ids->next_id = -1;
}
 
-   return SEQ_MULTIPLIER * new->seq + id;
+   return (new->seq << SEQ_SHIFT) + id;
 }
 
 #else
@@ -228,7 +228,7 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
 
-   return SEQ_MULTIPLIER * new->seq + id;
+   return (new->seq << SEQ_SHIFT) + id;
 }
 
 #endif /* CONFIG_CHECKPOINT_RESTORE */
@@ -252,8 +252,8 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
kgid_t egid;
int id, err;
 
-   if (limit > IPCMNI)
-   limit = IPCMNI;
+   if (limit > ipc_mni)
+   limit = ipc_mni;
 
if (!ids->tables_initialized || ids->in_use >= limit)
return -ENOSPC;
@@ -777,7 +777,7 @@ static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos,
if (total >= ids->in_use)
return NULL;
 
-   for (; pos < IPCMNI; pos++) {
+   for (; pos < ipc_mni; pos++) {
ipc = idr_find(&ids->ipcs_idr, pos);
if (ipc != NULL) {
*new_pos = pos + 1;
diff --git a/ipc/util.h b/ipc/util.h
index 8b413f1..9df177f 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -15,8 +15,30 @@
 #include 
 #include 
 
-#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
-#define SEQ_MULTIPLIER (IPCMNI)
+/*
+ * By default, the ipc arrays can have up to 32k (15 bits) entries.
+ * When IPCMNI extension mode is turned on, the ipc arrays can have up
+ * to 2M (21 bits) entries. However, the space for sequence number will
+ * be shrunk from 16 bits to 10 bits.
+ */
+#define IPCMNI_SHIFT   15
+#define IPCMNI_EXTEND_SHIFT21
+#define IPCMNI (1 << IPCMNI_SHIFT)
+#define IPCMNI_EXTEND  (1 << IPCMNI_EXTEND_SHIFT)
+
+#ifdef CONFIG_SYSVIPC_SYSCTL
+extern int ipc_mni;
+extern int ipc_mni_shift;
+
#define SEQ_SHIFT  ipc_mni_shift

[PATCH v7 4/4] ipc: Conserve sequence numbers in extended IPCMNI mode

2018-05-07 Thread Waiman Long
The mixing in of a sequence number into the IPC IDs is probably to
avoid ID reuse in userspace as much as possible. With extended IPCMNI
mode, the number of usable sequence numbers is greatly reduced leading
to higher chance of ID reuse.

To address this issue, we need to conserve the sequence number space
as much as possible. Right now, the sequence number is incremented
for every new ID created. In reality, we only need to increment the
sequence number when one or more IDs have been removed previously to
make sure that those IDs will not be reused when a new one is built.
This is being done in the extended IPCMNI mode.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 include/linux/ipc_namespace.h |  1 +
 ipc/ipc_sysctl.c  |  2 ++
 ipc/util.c| 29 ++---
 ipc/util.h|  2 ++
 4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8..9c86fd9 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -16,6 +16,7 @@
 struct ipc_ids {
int in_use;
unsigned short seq;
+   unsigned short deleted;
bool tables_initialized;
struct rw_semaphore rwsem;
struct idr ipcs_idr;
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index d62335f..1d32941 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -122,6 +122,7 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, 
int write,
 static int int_max = INT_MAX;
 int ipc_mni __read_mostly = IPCMNI;
 int ipc_mni_shift __read_mostly = IPCMNI_SHIFT;
+bool ipc_mni_extended __read_mostly;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -252,6 +253,7 @@ static int __init ipc_mni_extend(char *str)
 {
ipc_mni = IPCMNI_EXTEND;
ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
+   ipc_mni_extended = true;
pr_info("IPCMNI extended to %d.\n", ipc_mni);
return 0;
 }
diff --git a/ipc/util.c b/ipc/util.c
index 782a8d0..7c8e733 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -119,7 +119,8 @@ int ipc_init_ids(struct ipc_ids *ids)
 {
int err;
ids->in_use = 0;
-   ids->seq = 0;
+   ids->deleted = false;
+   ids->seq = ipc_mni_extended ? 0 : -1; /* seq # is pre-incremented */
init_rwsem(&ids->rwsem);
err = rhashtable_init(&ids->key_ht, &ipc_kht_params);
if (err)
@@ -193,6 +194,11 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
return NULL;
 }
 
+/*
+ * To conserve sequence number space with extended ipc_mni when new ID
+ * is built, the sequence number is incremented only when one or more
+ * IDs have been removed previously.
+ */
 #ifdef CONFIG_CHECKPOINT_RESTORE
 /*
  * Specify desired id for next allocated IPC object.
@@ -206,9 +212,13 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
  struct kern_ipc_perm *new)
 {
if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
+   if (!ipc_mni_extended || ids->deleted) {
+   ids->seq++;
+   if (ids->seq > IPCID_SEQ_MAX)
+   ids->seq = 0;
+   ids->deleted = false;
+   }
+   new->seq = ids->seq;
} else {
new->seq = ipcid_to_seqx(ids->next_id);
ids->next_id = -1;
@@ -224,9 +234,13 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
 static inline int ipc_buildid(int id, struct ipc_ids *ids,
  struct kern_ipc_perm *new)
 {
-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
+   if (!ipc_mni_extended || ids->deleted) {
+   ids->seq++;
+   if (ids->seq > IPCID_SEQ_MAX)
+   ids->seq = 0;
+   ids->deleted = false;
+   }
+   new->seq = ids->seq;
 
return (new->seq << SEQ_SHIFT) + id;
 }
@@ -436,6 +450,7 @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
idr_remove(&ids->ipcs_idr, lid);
ipc_kht_remove(ids, ipcp);
ids->in_use--;
+   ids->deleted = true;
ipcp->deleted = true;
 
if (unlikely(lid == ids->max_id)) {
diff --git a/ipc/util.h b/ipc/util.h
index 9df177f..0ef381c 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -29,6 +29,7 @@
 #ifdef CONFIG_SYSVIPC_SYSCTL
 extern int ipc_mni;
 extern int ipc_mni_shift;
+extern bool ipc_mni_extended;
 
 #define SEQ_SHIFT  ipc_mni_shift
 #define SEQ_MASK   ((1 << ipc_mni_shift) - 1)
@@ -36,6 +37,7 @@
 #else /* CONFIG_SYSVIPC_SYSCTL */
 
 #define ipc_mni

[PATCH v7 0/4] ipc: IPCMNI limit check for *mni & increase that limit

2018-05-07 Thread Waiman Long
v6->v7:
 - Drop the range clamping code and just return error instead for now
   until there is user request for clamping support.
 - Fix compilation error when CONFIG_SYSVIPC_SYSCTL isn't defined.

v5->v6:
 - Consolidate the 3 ctl_table flags into 2.
 - Make similar changes to proc_doulongvec_minmax() and its associates
   to complete the clamping change.
 - Remove the sysctl registration failure test patch for now for later
   consideration.
 - Add extra braces to patch 1 to reduce code diff in a later patch.

v4->v5:
 - Revert the flags back to 16-bit so that there will be no change to
   the size of ctl_table.
 - Enhance the sysctl_check_flags() as requested by Luis to perform more
   checks to spot incorrect ctl_table entries.
 - Change the sysctl selftest to use dummy sysctls instead of production
   ones & enhance it to do more checks.
 - Add one more sysctl selftest for registration failure.
 - Add 2 ipc patches to add an extended mode to increase IPCMNI from
   32k to 2M.
 - Miscellaneous change to incorporate feedback comments from
   reviewers.

v3->v4:
 - Remove v3 patches 1 & 2 as they have been merged into the mm tree.
 - Change flags from uint16_t to unsigned int.
 - Remove CTL_FLAGS_OOR_WARNED and use pr_warn_ratelimited() instead.
 - Simplify the warning message code.
 - Add a new patch to fail the ctl_table registration with invalid flag.
 - Add a test case for range clamping in sysctl selftest.

v2->v3:
 - Fix kdoc comment errors.
 - Incorporate comments and suggestions from Luis R. Rodriguez.
 - Add a patch to fix a typo error in fs/proc/proc_sysctl.c.

v1->v2:
 - Add kdoc comments to the do_proc_do{u}intvec_minmax_conv_param
   structures.
 - Add a new flags field to the ctl_table structure for specifying
   whether range clamping should be activated instead of adding new
   sysctl parameter handlers.
 - Clamp the semmni value embedded in the multi-value sem parameter.

v4 patch: https://lkml.org/lkml/2018/3/12/867
v5 patch: https://lkml.org/lkml/2018/3/16/1106
v6 patch: https://lkml.org/lkml/2018/4/27/1094

The sysctl parameters msgmni, shmmni and semmni have an inherent limit
of IPCMNI (32k). However, users may not be aware of that because they
can write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.

The real IPCMNI limit is now enforced to make sure that users won't
put in an unrealistic value. The first 2 patches enforce the limits.

There are also users out there requesting an increase in the IPCMNI value.
The last 2 patches attempt to do that by using a boot kernel parameter
"ipcmni_extend" to increase the IPCMNI limit from 32k to 2M if the users
really want the extended value.

Waiman Long (4):
  ipc: IPCMNI limit check for msgmni and shmmni
  ipc: IPCMNI limit check for semmni
  ipc: Allow boot time extension of IPCMNI from 32k to 2M
  ipc: Conserve sequence numbers in extended IPCMNI mode

 Documentation/admin-guide/kernel-parameters.txt |  3 ++
 include/linux/ipc_namespace.h   |  1 +
 ipc/ipc_sysctl.c| 42 +++--
 ipc/util.c  | 41 ++---
 ipc/util.h  | 49 +
 5 files changed, 112 insertions(+), 24 deletions(-)

-- 
1.8.3.1



Re: [PATCH v6 0/8] ipc: Clamp *mni to the real IPCMNI limit & increase that limit

2018-05-07 Thread Waiman Long
On 05/02/2018 11:06 AM, Eric W. Biederman wrote:
>
>>> and or users that may or may not exist.  If you can find something that
>>> will care sure.  We need to avoid breaking userspace and causing
>>> regressions.  However as this stands it looks you are making maintenance
>>> of the kernel more difficult to avoid having to look to see if there are
>>> monsters under the bed.
>> I shall admit that it can be hard to find applications that will
>> explicitly need that as we usually don't have access to the applications
>> that the customers have. It is more a correctness issue where the
>> existing code is kind of lying about what can actually be supported. I
>> just want to make the users more aware of what the right limits are.
> You presume the kernel is lying to applications.  I admit the kernel
> can lie to applications.  I don't see any evidence that the kernel is
> actually doing so.  So far (to me) it looks like a large number of sysv
> shared memory segments is not particularly common.
>
> So I would not be at all surprised if no regressions would be generated
> if you simply deny setting the value past the maximum.

Maybe you are right. I will update the patchset to fail the update if
the range is exceeded, since I had added the option of extending the limit if
the users choose to do so.

Cheers,
Longman


Re: [PATCH v7 2/5] cpuset: Add cpuset.sched_load_balance to v2

2018-05-02 Thread Waiman Long
On 05/02/2018 09:42 AM, Peter Zijlstra wrote:
> On Wed, May 02, 2018 at 09:29:54AM -0400, Waiman Long wrote:
>> On 05/02/2018 06:24 AM, Peter Zijlstra wrote:
>>> On Thu, Apr 19, 2018 at 09:47:01AM -0400, Waiman Long wrote:
>>>> +  cpuset.sched_load_balance
>>>> +  A read-write single value file which exists on non-root cgroups.
>>> Uhhm.. it should very much exist in the root group too. Otherwise you
>>> cannot disable it there, which is required to allow smaller groups to
>>> load-balance between themselves.
>>>
>>>> +  The default is "1" (on), and the other possible value is "0"
>>>> +  (off).
>>>> +
>>>> +  When it is on, tasks within this cpuset will be load-balanced
>>>> +  by the kernel scheduler.  Tasks will be moved from CPUs with
>>>> +  high load to other CPUs within the same cpuset with less load
>>>> +  periodically.
>>>> +
>>>> +  When it is off, there will be no load balancing among CPUs on
>>>> +  this cgroup.  Tasks will stay in the CPUs they are running on
>>>> +  and will not be moved to other CPUs.
>>>> +
>>>> +  This flag is hierarchical and is inherited by child cpusets. It
>>>> +  can be turned off only when the CPUs in this cpuset aren't
>>>> +  listed in the cpuset.cpus of other sibling cgroups, and all
>>>> +  the child cpusets, if present, have this flag turned off.
>>>> +
>>>> +  Once it is off, it cannot be turned back on as long as the
>>>> +  parent cgroup still has this flag in the off state.
>>> That too is wrong and broken. You explicitly want to turn it on for
>>> children.
>>>
>>> So the idea is that you can have:
>>>
>>> R
>>>   /   \
>>> A   B
>>>
>>> With:
>>>
>>> R cpus=0-3, load_balance=0
>>> A cpus=0-1, load_balance=1
>>> B cpus=2-3, load_balance=1
>>>
>>> Which will allow all tasks in A,B (and its children) to load-balance
>>> across 0-1 or 2-3 resp.
>>>
>>> If you don't allow the root group to disable load_balance, it will
>>> always be the largest group and load-balancing will always happen system
>>> wide.
>> If you look at the remaining patches in the series, I was proposing a
>> different way to support isolcpus and separate sched domains with
>> turning off load balancing in the root cgroup.
>>
>> For me, it doesn't feel right to have load balancing disabled in the
>> root cgroup as we probably cannot move all the tasks away from the root
>> cgroup anyway. I am going to update the current patchset to incorporate
>> suggestion from Tejun. It will probably be ready sometime next week.
>>
> I've read half of the next patch that adds the isolation thing. And
> while that kludges around the whole root cgorup is magic thing, it
> doesn't help if you move the above scenario on level down:
>
>
>   R
>  /\
>AB
>   /   \
> C   D
>
>
> R: cpus=0-7, load_balance=0
> A: cpus=0-1, load_balance=1
> B: cpus=2-7, load_balance=0
> C: cpus=2-3, load_balance=1
> D: cpus=4-7, load_balance=1
>
>
> Also, I feel we should strive to have a minimal amount of tasks that
> cannot be moved out of the root group; the current set is far too large.

What exactly is the use case you have in mind with load balancing
disabled in B, but enabled in C and D? We would like to support some
sensible use cases, but not every possible combinations.

Cheers,
Longman



Re: [PATCH v7 2/5] cpuset: Add cpuset.sched_load_balance to v2

2018-05-02 Thread Waiman Long
On 05/02/2018 06:24 AM, Peter Zijlstra wrote:
> On Thu, Apr 19, 2018 at 09:47:01AM -0400, Waiman Long wrote:
>> +  cpuset.sched_load_balance
>> +A read-write single value file which exists on non-root cgroups.
> Uhhm.. it should very much exist in the root group too. Otherwise you
> cannot disable it there, which is required to allow smaller groups to
> load-balance between themselves.
>
>> +The default is "1" (on), and the other possible value is "0"
>> +(off).
>> +
>> +When it is on, tasks within this cpuset will be load-balanced
>> +by the kernel scheduler.  Tasks will be moved from CPUs with
>> +high load to other CPUs within the same cpuset with less load
>> +periodically.
>> +
>> +When it is off, there will be no load balancing among CPUs on
>> +this cgroup.  Tasks will stay in the CPUs they are running on
>> +and will not be moved to other CPUs.
>> +
>> +This flag is hierarchical and is inherited by child cpusets. It
>> +can be turned off only when the CPUs in this cpuset aren't
>> +listed in the cpuset.cpus of other sibling cgroups, and all
>> +the child cpusets, if present, have this flag turned off.
>> +
>> +Once it is off, it cannot be turned back on as long as the
>> +parent cgroup still has this flag in the off state.
> That too is wrong and broken. You explicitly want to turn it on for
> children.
>
> So the idea is that you can have:
>
>   R
> /   \
> A   B
>
> With:
>
>   R cpus=0-3, load_balance=0
>   A cpus=0-1, load_balance=1
>   B cpus=2-3, load_balance=1
>
> Which will allow all tasks in A,B (and its children) to load-balance
> across 0-1 or 2-3 resp.
>
> If you don't allow the root group to disable load_balance, it will
> always be the largest group and load-balancing will always happen system
> wide.

If you look at the remaining patches in the series, I was proposing a
different way to support isolcpus and separate sched domains with
turning off load balancing in the root cgroup.

For me, it doesn't feel right to have load balancing disabled in the
root cgroup as we probably cannot move all the tasks away from the root
cgroup anyway. I am going to update the current patchset to incorporate
suggestion from Tejun. It will probably be ready sometime next week.

Cheers,
Longman




Re: [PATCH v6 0/8] ipc: Clamp *mni to the real IPCMNI limit & increase that limit

2018-05-02 Thread Waiman Long
On 05/01/2018 10:18 PM, Eric W. Biederman wrote:
>
>> The sysctl parameters msgmni, shmmni and semmni have an inherent limit
>> of IPC_MNI (32k). However, users may not be aware of that because they
>> can write a value much higher than that without getting any error or
>> notification. Reading the parameters back will show the newly written
>> values which are not real.
>>
>> Enforcing the limit by failing sysctl parameter write, however, may
>> cause regressions if existing user setup scripts set those parameters
>> above 32k as those scripts will now fail in this case.
> I have a serious problem with this approach.  Have you made any effort
> to identify any code that sets these values above 32k?  Have you looked
> to see if these applications actually care if you return an error when
> a value is set too large?

It is not that an application cares about if an error is returned or
not. Most applications don't care. It is that if an error is returned,
it means that the sysctl parameter isn't changed at all instead of being
set to a large value and then internally clamped to a smaller number
which is still bigger than the original value. That is what can break an
application because the sysctl parameters may be just too small for the
application.

> Right now this seems like a lot of work to avoid breaking applications
> and or users that may or may not exist.  If you can find something that
> will care sure.  We need to avoid breaking userspace and causing
> regressions.  However as this stands it looks you are making maintenance
> of the kernel more difficult to avoid having to look to see if there are
> monsters under the bed.

I shall admit that it can be hard to find applications that will
explicitly need that as we usually don't have access to the applications
that the customers have. It is more a correctness issue where the
existing code is kind of lying about what can actually be supported. I
just want to make the users more aware of what the right limits are.

Cheers,
Longman




Re: [PATCH v7 4/5] cpuset: Restrict load balancing off cpus to subset of cpus.isolated

2018-05-01 Thread Waiman Long
On 05/01/2018 04:58 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, May 01, 2018 at 04:33:45PM -0400, Waiman Long wrote:
>> I think that will work too. We currently don't have a flag to make a
>> file visible on first-level children only, but it shouldn't be hard to
>> make one.
> I think it'd be fine to make the flag file exist on all !root cgroups
> but only writable on the first level children.

Right. This flag will be inherited by child cgroups like the
sched_load_balance.

Cheers,
Longman


Re: [PATCH v7 4/5] cpuset: Restrict load balancing off cpus to subset of cpus.isolated

2018-05-01 Thread Waiman Long
On 05/01/2018 03:51 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> Sorry about the delay.
>
> On Thu, Apr 19, 2018 at 09:47:03AM -0400, Waiman Long wrote:
>> With the addition of "cpuset.cpus.isolated", it makes sense to add the
>> restriction that load balancing can only be turned off if the CPUs in
>> the isolated cpuset are subset of "cpuset.cpus.isolated".
>>
>> Signed-off-by: Waiman Long <long...@redhat.com>
>> ---
>>  Documentation/cgroup-v2.txt |  7 ---
>>  kernel/cgroup/cpuset.c  | 29 ++---
>>  2 files changed, 30 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index 8d89dc2..c4227ee 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1554,9 +1554,10 @@ Cpuset Interface Files
>>  and will not be moved to other CPUs.
>>  
>>  This flag is hierarchical and is inherited by child cpusets. It
>> -can be turned off only when the CPUs in this cpuset aren't
>> -listed in the cpuset.cpus of other sibling cgroups, and all
>> -the child cpusets, if present, have this flag turned off.
>> +can be explicitly turned off only when it is a direct child of
>> +the root cgroup and the CPUs in this cpuset are subset of the
>> +root's "cpuset.cpus.isolated".  Moreover, the CPUs cannot be
>> +listed in the "cpuset.cpus" of other sibling cgroups.
> It is a little bit convoluted that the isolation requires coordination
> among root's isolated file and the first-level children's cpus file
> and the flag.  Maybe I'm missing something but can't we do something
> like the following?
>
> * Add isolated flag file, which can only be modified on empty (in
>   terms of cpus) first level children.
>
> * Once isolated flag is set, CPUs can only be added to the cpus file
>   iff they aren't being used by anyone else and automatically become
>   isolated.
>
> The first level cpus file is owned by the root cgroup anyway, so
> there's no danger regarding delegation or whatever and the interface
> would be a lot simpler.

I think that will work too. We currently don't have a flag to make a
file visible on first-level children only, but it shouldn't be hard to
make one.

Putting CPUs into an isolated child cpuset means removing it from the
root's effective CPUs. So I would probably like to expose the read-only
cpus.effective in the root cgroup so that we can check changes in the
effective cpu list.

I will renew the patchset with your suggestion.

Thanks,
Longman



Re: [PATCH v6 3/8] sysctl: Warn when a clamped sysctl parameter is set out of range

2018-05-01 Thread Waiman Long
On 04/30/2018 06:40 PM, Kees Cook wrote:
> I like this series overall, thanks! No objections from me. One thing I
> noted, though:
>
> On Fri, Apr 27, 2018 at 2:00 PM, Waiman Long <long...@redhat.com> wrote:
>> if (param->min && *param->min > val) {
>> if (clamp) {
>> val = *param->min;
>> +   clamped = true;
>> } else {
>> return -EINVAL;
>> }
> This appears as a common bit of logic in many places in the series. It
> seems like it'd make sense to make this a helper of some kind?
>
> -Kees
>
We can't have an inline helper function because the types are different.
We may be able to use a helper macro, but a helper macro like that may
not be well accepted by the kernel community.

Cheers,
Longman



[PATCH v6 3/8] sysctl: Warn when a clamped sysctl parameter is set out of range

2018-04-27 Thread Waiman Long
Even with clamped sysctl parameters, it is still not that
straightforward to figure out the exact range of those parameters. One may
try to write extreme parameter values to see if they get clamped.
To make it easier, a warning with the expected range will now be
printed into the kernel ring buffer when a clamped sysctl parameter
receives an out of range value.

The pr_warn_ratelimited() macro is used to limit the number of warning
messages that can be printed within a given period of time.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 kernel/sysctl.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5b84c1d..76b2f1b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -17,6 +17,7 @@
  * The list_for_each() macro wasn't appropriate for the sysctl loop.
  *  Removed it and replaced it with older style, 03/23/00, Bill Wendling
  */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include 
 #include 
@@ -2516,6 +2517,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
  * @flags: pointer to flags
+ * @name: sysctl parameter name
  *
  * The do_proc_dointvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2525,6 +2527,7 @@ struct do_proc_dointvec_minmax_conv_param {
int *min;
int *max;
uint16_t *flags;
+   const char *name;
 };
 
 static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
@@ -2534,12 +2537,14 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
struct do_proc_dointvec_minmax_conv_param *param = data;
if (write) {
int val = *negp ? -*lvalp : *lvalp;
+   bool clamped = false;
bool clamp = param->flags &&
   (*param->flags & CTL_FLAGS_CLAMP_SIGNED_RANGE);
 
if (param->min && *param->min > val) {
if (clamp) {
val = *param->min;
+   clamped = true;
} else {
return -EINVAL;
}
@@ -2547,11 +2552,17 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
if (param->max && *param->max < val) {
if (clamp) {
val = *param->max;
+   clamped = true;
} else {
return -EINVAL;
}
}
*valp = val;
+   if (clamped && param->name)
+   pr_warn_ratelimited("\"%s\" was set out of range [%d, %d], clamped to %d.\n",
+   param->name,
+   param->min ? *param->min : -INT_MAX,
+   param->max ? *param->max :  INT_MAX, val);
} else {
int val = *valp;
if (val < 0) {
@@ -2589,6 +2600,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
.min = (int *) table->extra1,
.max = (int *) table->extra2,
.flags = &table->flags,
+   .name  = table->procname,
};
return do_proc_dointvec(table, write, buffer, lenp, ppos,
do_proc_dointvec_minmax_conv, &param);
@@ -2599,6 +2611,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
  * @flags: pointer to flags
+ * @name: sysctl parameter name
  *
  * The do_proc_douintvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2608,6 +2621,7 @@ struct do_proc_douintvec_minmax_conv_param {
unsigned int *min;
unsigned int *max;
uint16_t *flags;
+   const char *name;
 };
 
 static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
@@ -2618,6 +2632,7 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
 
if (write) {
unsigned int val = *lvalp;
+   bool clamped = false;
bool clamp = param->flags &&
   (*param->flags & CTL_FLAGS_CLAMP_UNSIGNED_RANGE);
 
@@ -2627,6 +2642,7 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
if (param->min && *param->min > val) {
if (clamp) {
val = *param->min;
+   clamped = true;

[PATCH v6 2/8] proc/sysctl: Provide additional ctl_table.flags checks

2018-04-27 Thread Waiman Long
Checking code is added to provide the following additional
ctl_table.flags checks:

 1) No unknown flag is allowed.
 2) Minimum of a range cannot be larger than the maximum value.
 3) The signed and unsigned flags are mutually exclusive.
 4) The proc_handler should be consistent with the signed or unsigned
flags.

The separation of signed and unsigned flags helps to provide more
comprehensive checking than it would have been if there is only one
flag available.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 fs/proc/proc_sysctl.c | 60 +++
 1 file changed, 60 insertions(+)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 8989936..fb09454 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -1092,6 +1092,64 @@ static int sysctl_check_table_array(const char *path, struct ctl_table *table)
return err;
 }
 
+/*
+ * This code assumes that only one integer value is allowed in an integer
+ * sysctl when one of the clamping flags is used. If that assumption is no
+ * longer true, we may need to add another flag to indicate the entry size.
+ */
+static int sysctl_check_flags(const char *path, struct ctl_table *table)
+{
+   int err = 0;
+
+   if ((table->flags & ~CTL_TABLE_FLAGS_ALL) ||
+  ((table->flags & CTL_FLAGS_CLAMP_RANGE) == CTL_FLAGS_CLAMP_RANGE))
+   err = sysctl_err(path, table, "invalid flags");
+
+   if (table->flags & CTL_FLAGS_CLAMP_RANGE) {
+   int range_err = 0;
+   bool is_int = (table->maxlen == sizeof(int));
+
+   if (!is_int && (table->maxlen != sizeof(long))) {
+   range_err++;
+   } else if (!table->extra1 || !table->extra2) {
+   /* No min > max checking needed */
+   } else if (table->flags & CTL_FLAGS_CLAMP_UNSIGNED_RANGE) {
+   unsigned long min, max;
+
+   min = is_int ? *(unsigned int *)table->extra1
+: *(unsigned long *)table->extra1;
+   max = is_int ? *(unsigned int *)table->extra2
+: *(unsigned long *)table->extra2;
+   range_err += (min > max);
+   } else { /* table->flags & CTL_FLAGS_CLAMP_SIGNED_RANGE */
+
+   long min, max;
+
+   min = is_int ? *(int *)table->extra1
+: *(long *)table->extra1;
+   max = is_int ? *(int *)table->extra2
+: *(long *)table->extra2;
+   range_err += (min > max);
+   }
+
+   /*
+* proc_handler and flag consistency check.
+*/
+   if (((table->proc_handler == proc_douintvec_minmax)   ||
+(table->proc_handler == proc_doulongvec_minmax)) &&
+   !(table->flags & CTL_FLAGS_CLAMP_UNSIGNED_RANGE))
+   range_err++;
+
+   if ((table->proc_handler == proc_dointvec_minmax) &&
+  !(table->flags & CTL_FLAGS_CLAMP_SIGNED_RANGE))
+   range_err++;
+
+   if (range_err)
+   err |= sysctl_err(path, table, "Invalid range");
+   }
+   return err;
+}
+
 static int sysctl_check_table(const char *path, struct ctl_table *table)
 {
int err = 0;
@@ -,6 +1169,8 @@ static int sysctl_check_table(const char *path, struct ctl_table *table)
	(table->proc_handler == proc_doulongvec_ms_jiffies_minmax)) {
if (!table->data)
err |= sysctl_err(path, table, "No data");
+   if (table->flags)
+   err |= sysctl_check_flags(path, table);
if (!table->maxlen)
err |= sysctl_err(path, table, "No maxlen");
else
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 4/8] ipc: Clamp msgmni and shmmni to the real IPCMNI limit

2018-04-27 Thread Waiman Long
A user can write arbitrary integer values to the msgmni and shmmni sysctl
parameters without getting an error, but the actual limit is really
IPCMNI (32k). This can mislead users into thinking they have set a
value that is not actually in effect.

Enforcing the limit by failing the sysctl parameter write, however,
can break existing user applications if they write a value greater
than 32k. Instead, the range clamping flag is set to enforce the
limit without failing existing user code. Users can easily figure
out whether the sysctl parameter value is out of range by either reading
back the parameter value or checking the kernel ring buffer for warnings.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 ipc/ipc_sysctl.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 8ad93c2..d71f949 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -99,6 +99,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
+static int ipc_mni = IPCMNI;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -120,7 +121,10 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
	.data   = &init_ipc_ns.shm_ctlmni,
	.maxlen = sizeof(init_ipc_ns.shm_ctlmni),
	.mode   = 0644,
-   .proc_handler   = proc_ipc_dointvec,
+   .proc_handler   = proc_ipc_dointvec_minmax,
+   .extra1 = &zero,
+   .extra2 = &ipc_mni,
+   .flags  = CTL_FLAGS_CLAMP_SIGNED_RANGE,
	},
{
.procname   = "shm_rmid_forced",
@@ -147,7 +151,8 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
	.mode   = 0644,
	.proc_handler   = proc_ipc_dointvec_minmax,
	.extra1 = &zero,
-   .extra2 = &int_max,
+   .extra2 = &ipc_mni,
+   .flags  = CTL_FLAGS_CLAMP_SIGNED_RANGE,
},
{
.procname   = "auto_msgmni",
-- 
1.8.3.1



[PATCH v6 1/8] sysctl: Add flags to support min/max range clamping

2018-04-27 Thread Waiman Long
When minimum/maximum values are specified for a sysctl parameter in
the ctl_table structure with the proc_dointvec_minmax() handler, an update
to that parameter will fail with an error if the given value is outside
of the required range.

There are use cases where it may be better to clamp the value of
the sysctl parameter to the given range without failing the update,
especially if the users are not aware of the actual range limits.
Reading the value back after the update is then good practice to see
whether the provided value exceeded the range limits.

To provide this less restrictive form of range checking, a new flags
field is added to the ctl_table structure. The new field is a 16-bit
value that just fits into the hole left by the 16-bit umode_t field
without increasing the size of the structure.

When either the CTL_FLAGS_CLAMP_SIGNED_RANGE or the
CTL_FLAGS_CLAMP_UNSIGNED_RANGE flag is set in a ctl_table entry, any
update from userspace will be clamped to the given range without
error if the proc_dointvec_minmax() or the proc_douintvec_minmax()
handler, respectively, is used.

In the case of proc_doulongvec_minmax(), the out-of-range input value
is either ignored or clamped if the CTL_FLAGS_CLAMP_UNSIGNED_RANGE flag
is set.

The clamped value is either the maximum or minimum value that is
closest to the input value provided by the user.

This patch, by itself, does not require the use of separate signed
and unsigned flags.  However, the use of separate flags allows us to
perform more comprehensive checking in a later patch.

Extra braces are also used in this patch to make a later patch easier
to read.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 include/linux/sysctl.h | 32 ++
 kernel/sysctl.c| 74 ++
 2 files changed, 94 insertions(+), 12 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index b769ecf..3a628cf 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -116,6 +116,7 @@ struct ctl_table
void *data;
int maxlen;
umode_t mode;
+   uint16_t flags;
struct ctl_table *child;/* Deprecated */
proc_handler *proc_handler; /* Callback for text formatting */
struct ctl_table_poll *poll;
@@ -123,6 +124,37 @@ struct ctl_table
void *extra2;
 } __randomize_layout;
 
+/**
+ * enum ctl_table_flags - flags for the ctl table (struct ctl_table.flags)
+ *
+ * @CTL_FLAGS_CLAMP_SIGNED_RANGE: Set to indicate that the entry holds a
+ * signed value and should be flexibly clamped to the provided
+ * min/max signed value in case the user provided a value outside
+ * of the given range.  The clamped value is either the provided
+ * minimum or maximum value that is closest to the input value.
+ * No lower bound or upper bound checking will be done if the
+ * corresponding minimum or maximum value isn't provided.
+ *
+ * @CTL_FLAGS_CLAMP_UNSIGNED_RANGE: Set to indicate that the entry holds
+ * an unsigned value and should be flexibly clamped to the provided
+ * min/max unsigned value in case the user provided a value outside
+ * of the given range.  The clamped value is either the provided
+ * minimum or maximum value that is closest to the input value.
+ * No lower bound or upper bound checking will be done if the
+ * corresponding minimum or maximum value isn't provided.
+ *
+ * At most 16 different flags are currently allowed.
+ */
+enum ctl_table_flags {
+   CTL_FLAGS_CLAMP_SIGNED_RANGE= BIT(0),
+   CTL_FLAGS_CLAMP_UNSIGNED_RANGE  = BIT(1),
+   __CTL_FLAGS_MAX = BIT(2),
+};
+
+#define CTL_FLAGS_CLAMP_RANGE  (CTL_FLAGS_CLAMP_SIGNED_RANGE|\
+CTL_FLAGS_CLAMP_UNSIGNED_RANGE)
+#define CTL_TABLE_FLAGS_ALL(__CTL_FLAGS_MAX - 1)
+
 struct ctl_node {
struct rb_node node;
struct ctl_table_header *header;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6a78cf7..5b84c1d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2515,6 +2515,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
  * struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range checking structure
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
+ * @flags: pointer to flags
  *
  * The do_proc_dointvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2523,6 +2524,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
 struct do_proc_dointvec_minmax_conv_param {
int *min;
int *max;
+   uint16_t *flags;
 };
 
 static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
@@ -2532,9 +2534,23 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
struct do_proc_dointvec_minmax_conv

[PATCH v6 6/8] test_sysctl: Add range clamping test

2018-04-27 Thread Waiman Long
Add a range clamping test to verify that the input value will be
clamped if it exceeds the builtin maximum or minimum value.

Below is the expected test run result:

Running test: sysctl_test_0006 - run #0
Checking range minimum clamping ... ok
Checking range maximum clamping ... ok
Checking range minimum clamping ... ok
Checking range maximum clamping ... ok

Signed-off-by: Waiman Long <long...@redhat.com>
---
 lib/test_sysctl.c| 29 ++
 tools/testing/selftests/sysctl/sysctl.sh | 52 
 2 files changed, 81 insertions(+)

diff --git a/lib/test_sysctl.c b/lib/test_sysctl.c
index 3dd801c..3c619b9 100644
--- a/lib/test_sysctl.c
+++ b/lib/test_sysctl.c
@@ -38,12 +38,18 @@
 
 static int i_zero;
 static int i_one_hundred = 100;
+static int signed_min = -10;
+static int signed_max = 10;
+static unsigned int unsigned_min = 10;
+static unsigned int unsigned_max = 30;
 
 struct test_sysctl_data {
int int_0001;
int int_0002;
int int_0003[4];
+   int range_0001;
 
+   unsigned int urange_0001;
unsigned int uint_0001;
 
char string_0001[65];
@@ -58,6 +64,9 @@ struct test_sysctl_data {
.int_0003[2] = 2,
.int_0003[3] = 3,
 
+   .range_0001 = 0,
+   .urange_0001 = 20,
+
.uint_0001 = 314,
 
.string_0001 = "(none)",
@@ -102,6 +111,26 @@ struct test_sysctl_data {
.mode   = 0644,
.proc_handler   = proc_dostring,
},
+   {
+   .procname   = "range_0001",
+   .data   = _data.range_0001,
+   .maxlen = sizeof(test_data.range_0001),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .flags  = CTL_FLAGS_CLAMP_SIGNED_RANGE,
+   .extra1 = _min,
+   .extra2 = _max,
+   },
+   {
+   .procname   = "urange_0001",
+   .data   = _data.urange_0001,
+   .maxlen = sizeof(test_data.urange_0001),
+   .mode   = 0644,
+   .proc_handler   = proc_douintvec_minmax,
+   .flags  = CTL_FLAGS_CLAMP_UNSIGNED_RANGE,
+   .extra1 = _min,
+   .extra2 = _max,
+   },
{ }
 };
 
diff --git a/tools/testing/selftests/sysctl/sysctl.sh b/tools/testing/selftests/sysctl/sysctl.sh
index ec232c3..1aa1bba 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -34,6 +34,7 @@ ALL_TESTS="$ALL_TESTS 0002:1:1"
 ALL_TESTS="$ALL_TESTS 0003:1:1"
 ALL_TESTS="$ALL_TESTS 0004:1:1"
 ALL_TESTS="$ALL_TESTS 0005:3:1"
+ALL_TESTS="$ALL_TESTS 0006:1:1"
 
 test_modprobe()
 {
@@ -543,6 +544,38 @@ run_stringtests()
test_rc
 }
 
+# TARGET, RANGE_MIN & RANGE_MAX need to be defined before running test.
+run_range_clamping_test()
+{
+   rc=0
+
+   echo -n "Checking range minimum clamping ... "
+   VAL=$((RANGE_MIN - 1))
+   echo -n $VAL > "${TARGET}" 2> /dev/null
+   EXITVAL=$?
+   NEWVAL=$(cat "${TARGET}")
+   if [[ $EXITVAL -ne 0 || $NEWVAL -ne $RANGE_MIN ]]; then
+   echo "FAIL" >&2
+   rc=1
+   else
+   echo "ok"
+   fi
+
+   echo -n "Checking range maximum clamping ... "
+   VAL=$((RANGE_MAX + 1))
+   echo -n $VAL > "${TARGET}" 2> /dev/null
+   EXITVAL=$?
+   NEWVAL=$(cat "${TARGET}")
+   if [[ $EXITVAL -ne 0 || $NEWVAL -ne $RANGE_MAX ]]; then
+   echo "FAIL" >&2
+   rc=1
+   else
+   echo "ok"
+   fi
+
+   test_rc
+}
+
 sysctl_test_0001()
 {
TARGET="${SYSCTL}/int_0001"
@@ -600,6 +633,25 @@ sysctl_test_0005()
run_limit_digit_int_array
 }
 
+sysctl_test_0006()
+{
+   TARGET="${SYSCTL}/range_0001"
+   ORIG=$(cat "${TARGET}")
+   RANGE_MIN=-10
+   RANGE_MAX=10
+
+   run_range_clamping_test
+   set_orig
+
+   TARGET="${SYSCTL}/urange_0001"
+   ORIG=$(cat "${TARGET}")
+   RANGE_MIN=10
+   RANGE_MAX=30
+
+   run_range_clamping_test
+   set_orig
+}
+
 list_tests()
 {
echo "Test ID list:"
-- 
1.8.3.1



[PATCH v6 5/8] ipc: Clamp semmni to the real IPCMNI limit

2018-04-27 Thread Waiman Long
For SysV semaphores, the semmni value is the last element of the 4-element
sem number array. To make semmni behave in a similar way to msgmni
and shmmni, we can't directly use the _minmax handler. Instead,
a special sem-specific handler is added to check the last element
to make sure that it is clamped to the [0, IPCMNI] range, printing
a rate-limited warning message when an out-of-range value is written.
This does require duplicating some of the code in the _minmax handlers.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 ipc/ipc_sysctl.c | 12 +++-
 ipc/sem.c| 25 +
 ipc/util.h   |  4 
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index d71f949..478e634 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -88,12 +88,22 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
	return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
 }
 
+static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret = proc_ipc_dointvec(table, write, buffer, lenp, ppos);
+
+   sem_check_semmni(table, current->nsproxy->ipc_ns);
+   return ret;
+}
+
 #else
 #define proc_ipc_doulongvec_minmax NULL
 #define proc_ipc_dointvec NULL
 #define proc_ipc_dointvec_minmax   NULL
 #define proc_ipc_dointvec_minmax_orphans   NULL
 #define proc_ipc_auto_msgmni  NULL
+#define proc_ipc_sem_dointvec NULL
 #endif
 
 static int zero;
@@ -177,7 +187,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
	.data   = &init_ipc_ns.sem_ctls,
.maxlen = 4*sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_ipc_dointvec,
+   .proc_handler   = proc_ipc_sem_dointvec,
},
 #ifdef CONFIG_CHECKPOINT_RESTORE
{
diff --git a/ipc/sem.c b/ipc/sem.c
index 06be75d..96bdec6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2397,3 +2397,28 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
return 0;
 }
 #endif
+
+#ifdef CONFIG_PROC_SYSCTL
+/*
+ * Check to see if semmni is out of range and clamp it if necessary.
+ */
+void sem_check_semmni(struct ctl_table *table, struct ipc_namespace *ns)
+{
+   bool clamped = false;
+
+   /*
+* Clamp semmni to the range [0, IPCMNI].
+*/
+   if (ns->sc_semmni < 0) {
+   ns->sc_semmni = 0;
+   clamped = true;
+   }
+   if (ns->sc_semmni > IPCMNI) {
+   ns->sc_semmni = IPCMNI;
+   clamped = true;
+   }
+   if (clamped)
+   pr_warn_ratelimited("sysctl: \"sem[3]\" was set out of range [%d, %d], clamped to %d.\n",
+   0, IPCMNI, ns->sc_semmni);
+}
+#endif
diff --git a/ipc/util.h b/ipc/util.h
index acc5159..7c20871 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -218,6 +218,10 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
 
+#ifdef CONFIG_PROC_SYSCTL
+extern void sem_check_semmni(struct ctl_table *table, struct ipc_namespace *ns);
+#endif
+
 #ifdef CONFIG_COMPAT
 #include 
 struct compat_ipc_perm {
-- 
1.8.3.1



[PATCH v6 7/8] ipc: Allow boot time extension of IPCMNI from 32k to 2M

2018-04-27 Thread Waiman Long
The maximum number of unique System V IPC identifiers was limited to
32k.  That limit should be big enough for most use cases.

However, there are some users out there asking for more. To satisfy
the need of those users, a new boot-time kernel option "ipcmni_extend"
is added to extend the IPCMNI value to 2M. This is a 64X increase,
which hopefully is big enough for them.

This new option does have the side effect of reducing the maximum
number of unique sequence numbers from 64k down to 1k. So it is
a trade-off.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 ipc/ipc_sysctl.c| 12 +-
 ipc/util.c  | 12 +-
 ipc/util.h  | 30 ++---
 4 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 11fc28e..00bc0cb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1735,6 +1735,9 @@
ip= [IP_PNP]
See Documentation/filesystems/nfs/nfsroot.txt.
 
+   ipcmni_extend   [KNL] Extend the maximum number of unique System V
+   IPC identifiers from 32768 to 2097152.
+
irqaffinity=[SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
 
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 478e634..4e2cb6d 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -109,7 +109,8 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
-static int ipc_mni = IPCMNI;
+int ipc_mni __read_mostly = IPCMNI;
+int ipc_mni_shift __read_mostly = IPCMNI_SHIFT;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -237,3 +238,12 @@ static int __init ipc_sysctl_init(void)
 }
 
 device_initcall(ipc_sysctl_init);
+
+static int __init ipc_mni_extend(char *str)
+{
+   ipc_mni = IPCMNI_EXTEND;
+   ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
+   pr_info("IPCMNI extended to %d.\n", ipc_mni);
+   return 0;
+}
+early_param("ipcmni_extend", ipc_mni_extend);
diff --git a/ipc/util.c b/ipc/util.c
index 4e81182..782a8d0 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -113,7 +113,7 @@ static int __init ipc_init(void)
  * @ids: ipc identifier set
  *
  * Set up the sequence range to use for the ipc identifier range (limited
- * below IPCMNI) then initialise the keys hashtable and ids idr.
+ * below ipc_mni) then initialise the keys hashtable and ids idr.
  */
 int ipc_init_ids(struct ipc_ids *ids)
 {
@@ -214,7 +214,7 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
ids->next_id = -1;
}
 
-   return SEQ_MULTIPLIER * new->seq + id;
+   return (new->seq << SEQ_SHIFT) + id;
 }
 
 #else
@@ -228,7 +228,7 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
 
-   return SEQ_MULTIPLIER * new->seq + id;
+   return (new->seq << SEQ_SHIFT) + id;
 }
 
 #endif /* CONFIG_CHECKPOINT_RESTORE */
@@ -252,8 +252,8 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
kgid_t egid;
int id, err;
 
-   if (limit > IPCMNI)
-   limit = IPCMNI;
+   if (limit > ipc_mni)
+   limit = ipc_mni;
 
if (!ids->tables_initialized || ids->in_use >= limit)
return -ENOSPC;
@@ -777,7 +777,7 @@ static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos,
if (total >= ids->in_use)
return NULL;
 
-   for (; pos < IPCMNI; pos++) {
+   for (; pos < ipc_mni; pos++) {
ipc = idr_find(&ids->ipcs_idr, pos);
if (ipc != NULL) {
*new_pos = pos + 1;
diff --git a/ipc/util.h b/ipc/util.h
index 7c20871..e4d14b6 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -15,8 +15,22 @@
 #include 
 #include 
 
-#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
-#define SEQ_MULTIPLIER (IPCMNI)
+/*
+ * By default, the ipc arrays can have up to 32k (15 bits) entries.
+ * When IPCMNI extension mode is turned on, the ipc arrays can have up
+ * to 2M (21 bits) entries. However, the space for sequence number will
+ * be shrunk from 16 bits to 10 bits.
+ */
+#define IPCMNI_SHIFT   15
+#define IPCMNI_EXTEND_SHIFT21
+#define IPCMNI (1 << IPCMNI_SHIFT)
+#define IPCMNI_EXTEND  (1 << IPCMNI_EXTEND_SHIFT)
+
+extern int ipc_mni;
+extern int ipc_mni_shift;
+
+#define SEQ_SHIFT  ipc_mni_shift
+#define SEQ_MASK   ((1 << ipc_mni_shift) - 1)

[PATCH v6 8/8] ipc: Conserve sequence numbers in extended IPCMNI mode

2018-04-27 Thread Waiman Long
The mixing in of a sequence number into the IPC IDs is probably to
avoid ID reuse in userspace as much as possible. With extended IPCMNI
mode, the number of usable sequence numbers is greatly reduced leading
to higher chance of ID reuse.

To address this issue, we need to conserve the sequence number space
as much as possible. Right now, the sequence number is incremented
for every new ID created. In reality, we only need to increment the
sequence number when one or more IDs have been removed previously,
to make sure that those IDs will not be reused when a new one is built.
This is done only in the extended IPCMNI mode.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 include/linux/ipc_namespace.h |  1 +
 ipc/ipc_sysctl.c  |  2 ++
 ipc/util.c| 29 ++---
 ipc/util.h|  1 +
 4 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8..9c86fd9 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -16,6 +16,7 @@
 struct ipc_ids {
int in_use;
unsigned short seq;
+   unsigned short deleted;
bool tables_initialized;
struct rw_semaphore rwsem;
struct idr ipcs_idr;
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 4e2cb6d..b7fb38c 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -111,6 +111,7 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 static int int_max = INT_MAX;
 int ipc_mni __read_mostly = IPCMNI;
 int ipc_mni_shift __read_mostly = IPCMNI_SHIFT;
+bool ipc_mni_extended __read_mostly;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -243,6 +244,7 @@ static int __init ipc_mni_extend(char *str)
 {
ipc_mni = IPCMNI_EXTEND;
ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
+   ipc_mni_extended = true;
pr_info("IPCMNI extended to %d.\n", ipc_mni);
return 0;
 }
diff --git a/ipc/util.c b/ipc/util.c
index 782a8d0..7c8e733 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -119,7 +119,8 @@ int ipc_init_ids(struct ipc_ids *ids)
 {
int err;
ids->in_use = 0;
-   ids->seq = 0;
+   ids->deleted = false;
+   ids->seq = ipc_mni_extended ? 0 : -1; /* seq # is pre-incremented */
	init_rwsem(&ids->rwsem);
	err = rhashtable_init(&ids->key_ht, &ipc_kht_params);
if (err)
@@ -193,6 +194,11 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
return NULL;
 }
 
+/*
+ * To conserve sequence number space with extended ipc_mni when new ID
+ * is built, the sequence number is incremented only when one or more
+ * IDs have been removed previously.
+ */
 #ifdef CONFIG_CHECKPOINT_RESTORE
 /*
  * Specify desired id for next allocated IPC object.
@@ -206,9 +212,13 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
  struct kern_ipc_perm *new)
 {
if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
+   if (!ipc_mni_extended || ids->deleted) {
+   ids->seq++;
+   if (ids->seq > IPCID_SEQ_MAX)
+   ids->seq = 0;
+   ids->deleted = false;
+   }
+   new->seq = ids->seq;
} else {
new->seq = ipcid_to_seqx(ids->next_id);
ids->next_id = -1;
@@ -224,9 +234,13 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
 static inline int ipc_buildid(int id, struct ipc_ids *ids,
  struct kern_ipc_perm *new)
 {
-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
+   if (!ipc_mni_extended || ids->deleted) {
+   ids->seq++;
+   if (ids->seq > IPCID_SEQ_MAX)
+   ids->seq = 0;
+   ids->deleted = false;
+   }
+   new->seq = ids->seq;
 
return (new->seq << SEQ_SHIFT) + id;
 }
@@ -436,6 +450,7 @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
idr_remove(&ids->ipcs_idr, lid);
ipc_kht_remove(ids, ipcp);
ids->in_use--;
+   ids->deleted = true;
ipcp->deleted = true;
 
if (unlikely(lid == ids->max_id)) {
diff --git a/ipc/util.h b/ipc/util.h
index e4d14b6..54a86fc 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -28,6 +28,7 @@
 
 extern int ipc_mni;
 extern int ipc_mni_shift;
+extern bool ipc_mni_extended;
 
 #define SEQ_SHIFT  ipc_mni_shift
 #define SEQ_MASK   ((1 << ipc_mni_shift) - 1)
-- 
1.8.3.1



[PATCH v6 0/8] ipc: Clamp *mni to the real IPCMNI limit & increase that limit

2018-04-27 Thread Waiman Long
v5->v6:
 - Consolidate the 3 ctl_table flags into 2.
 - Make similar changes to proc_doulongvec_minmax() and its associates
   to complete the clamping change.
 - Remove the sysctl registration failure test patch for now for later
   consideration.
 - Add extra braces to patch 1 to reduce code diff in a later patch.

v4->v5:
 - Revert the flags back to 16-bit so that there will be no change to
   the size of ctl_table.
 - Enhance the sysctl_check_flags() as requested by Luis to perform more
   checks to spot incorrect ctl_table entries.
 - Change the sysctl selftest to use dummy sysctls instead of production
   ones & enhance it to do more checks.
 - Add one more sysctl selftest for registration failure.
 - Add 2 ipc patches to add an extended mode to increase IPCMNI from
   32k to 2M.
 - Miscellaneous change to incorporate feedback comments from
   reviewers.

v3->v4:
 - Remove v3 patches 1 & 2 as they have been merged into the mm tree.
 - Change flags from uint16_t to unsigned int.
 - Remove CTL_FLAGS_OOR_WARNED and use pr_warn_ratelimited() instead.
 - Simplify the warning message code.
 - Add a new patch to fail the ctl_table registration with invalid flag.
 - Add a test case for range clamping in sysctl selftest.

v2->v3:
 - Fix kdoc comment errors.
 - Incorporate comments and suggestions from Luis R. Rodriguez.
 - Add a patch to fix a typo error in fs/proc/proc_sysctl.c.

v1->v2:
 - Add kdoc comments to the do_proc_do{u}intvec_minmax_conv_param
   structures.
 - Add a new flags field to the ctl_table structure for specifying
   whether range clamping should be activated instead of adding new
   sysctl parameter handlers.
 - Clamp the semmni value embedded in the multi-values sem parameter.

v1 patch: https://lkml.org/lkml/2018/2/19/453
v2 patch: https://lkml.org/lkml/2018/2/27/627
v3 patch: https://lkml.org/lkml/2018/3/1/716 
v4 patch: https://lkml.org/lkml/2018/3/12/867
v5 patch: https://lkml.org/lkml/2018/3/16/1106

The sysctl parameters msgmni, shmmni and semmni have an inherent limit
of IPC_MNI (32k). However, users may not be aware of that because they
can write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.

Enforcing the limit by failing sysctl parameter write, however, may
cause regressions if existing user setup scripts set those parameters
above 32k as those scripts will now fail in this case.

To address this dilemma, a new flags field is introduced into
the ctl_table. The value CTL_FLAGS_CLAMP_RANGE can be added to any
ctl_table entry to enable looser range clamping without returning
an error. For example,

  .flags = CTL_FLAGS_CLAMP_RANGE,

This flag value is now used for the range checking of shmmni,
msgmni and semmni without breaking existing applications. If an out-of-range
value is written to one of those sysctl parameters, the following
warning will be printed instead.

  sysctl: "shmmni" was set out of range [0, 32768], clamped to 32768.

Reading the values back will show 32768 instead of some fake values.

New sysctl selftests are added to exercise new code added by this
patchset.

There are users out there requesting increase in the IPCMNI value.
The last 2 patches attempt to do that by using a boot kernel parameter
"ipcmni_extend" to increase the IPCMNI limit from 32k to 2M.

Eric Biederman had posted an RFC patch to just scrap the IPCMNI limit
and open up the whole positive integer space for IPC IDs. A major
issue that I have with this approach is that SysV IPC has been in use
for over 20 years. We just don't know if there are user applications
that depend on the way the IDs are built. So a drastic change
like this has the potential of breaking some applications.

I prefer a more conservative approach where users will observe no
change in behavior unless they explicitly opt in to enable the extended
mode. I could open up the whole positive integer space in this case
like what Eric did, but that would make the code more complex.  So I
just extend IPCMNI to 2M in this case and keep similar ID generation
logic.


Waiman Long (8):
  sysctl: Add flags to support min/max range clamping
  proc/sysctl: Provide additional ctl_table.flags checks
  sysctl: Warn when a clamped sysctl parameter is set out of range
  ipc: Clamp msgmni and shmmni to the real IPCMNI limit
  ipc: Clamp semmni to the real IPCMNI limit
  test_sysctl: Add range clamping test
  ipc: Allow boot time extension of IPCMNI from 32k to 2M
  ipc: Conserve sequence numbers in extended IPCMNI mode

 Documentation/admin-guide/kernel-parameters.txt |   3 +
 fs/proc/proc_sysctl.c   |  60 ++
 include/linux/ipc_namespace.h   |   1 +
 include/linux/sysctl.h  |  32 
 ipc/ipc_sysctl.c|  33 +++-
 ipc/sem.c

Re: [PATCH v7 0/5] cpuset: Enable cpuset controller in default hierarchy

2018-04-23 Thread Waiman Long
On 04/20/2018 04:23 AM, Mike Galbraith wrote:
> On Thu, 2018-04-19 at 09:46 -0400, Waiman Long wrote:
>> v7:
>>  - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
>>  - Enforce that load_balancing can only be turned off on cpusets with
>>CPUs from the isolated list.
>>  - Update sched domain generation to allow cpusets with CPUs only
>>from the isolated CPU list to be in separate root domains.
> I haven't done much, but was able to do a q/d manual build, populate
> and teardown of system/critical sets on my desktop box, and it looked
> ok.  Thanks for getting this aboard the v2 boat.
>
>   -Mike

Thank for the testing.

Cheers,
Longman



Re: [PATCH v7 0/5] cpuset: Enable cpuset controller in default hierarchy

2018-04-23 Thread Waiman Long
On 04/23/2018 09:57 AM, Juri Lelli wrote:
> On 23/04/18 15:07, Juri Lelli wrote:
>> Hi Waiman,
>>
>> On 19/04/18 09:46, Waiman Long wrote:
>>> v7:
>>>  - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
>>>  - Enforce that load_balancing can only be turned off on cpusets with
>>>CPUs from the isolated list.
>>>  - Update sched domain generation to allow cpusets with CPUs only
>>>from the isolated CPU list to be in separate root domains.
> Guess I'll be adding comments as soon as I stumble on something unclear
> (to me :), hope that's OK (shout if I should do it differently).
>
> The below looked unexpected to me:
>
> root@debian-kvm:/sys/fs/cgroup# cat g1/cpuset.cpus
> 2-3
> root@debian-kvm:/sys/fs/cgroup# cat g1/cpuset.mems
>
> root@debian-kvm:~# echo $$ > /sys/fs/cgroup/g1/cgroup.threads
> root@debian-kvm:/sys/fs/cgroup# cat g1/cgroup.threads
> 2312
>
> So I can add tasks to groups with no mems? Or is it this only true in my
> case with a single mem node? Or maybe it's inherited from root group
> (slightly confusing IMHO if that's the case).

An empty mems means looking up the parents until we find one with a
non-empty mems. The mems.effective file will show you the actual memory
nodes used.

-Longman



[PATCH v2] proc/stat: Separate out individual irq counts into /proc/stat_irqs

2018-04-19 Thread Waiman Long
It was found that reading /proc/stat could be time-consuming on
systems with a lot of irqs. For example, reading /proc/stat on a
certain 2-socket Skylake server took about 4.6ms because it had over
5k irqs. In that particular case, the majority of the CPU cycles for
reading /proc/stat was spent in the kstat_irqs() function.  Therefore,
application performance can be impacted if the application reads
/proc/stat frequently.

The "intr" line within /proc/stat contains a sum total of all the
irqs that have been serviced followed by a list of irq counts for
each individual irq number. In many cases, the first number is good
enough. The individual irq counts may not provide that much more
information.

In order to avoid this kind of performance issue, all these individual
irq counts are now separated into a new /proc/stat_irqs file. The
sum total irq count will stay in /proc/stat and be duplicated in
/proc/stat_irqs. Applications that need to look up individual irq counts
will now have to look into /proc/stat_irqs instead of /proc/stat.
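
For applications that only need the aggregate count, parsing the new layout stays trivial. A small sketch (the sample line is taken from the patch below; the function itself is illustrative, not part of the patch):

```python
def intr_total(stat_text):
    """Return the aggregate interrupt count from /proc/stat-style text.

    Works for both the old layout (sum followed by per-irq counts)
    and the new layout (sum only), since only the first field after
    "intr" is read.
    """
    for line in stat_text.splitlines():
        if line.startswith("intr "):
            return int(line.split()[1])
    raise ValueError("no intr line found")

sample = ("cpu  2255 34 2290 22625563 6290 127 456 0 0 0\n"
          "intr 114930548 113199788 3 0 5 263 0 4\n"
          "ctxt 1990473\n")
print(intr_total(sample))  # 114930548
```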

v2: Update Documentation/filesystems/proc.txt accordingly.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/filesystems/proc.txt | 22 -
 fs/proc/stat.c | 48 --
 2 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 2a84bb3..15558ff 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1300,7 +1300,7 @@ since the system first booted.  For a quick look, simply cat the file:
   cpu  2255 34 2290 22625563 6290 127 456 0 0 0
   cpu0 1132 34 1441 11311718 3675 127 438 0 0 0
   cpu1 1123 0 849 11313845 2614 0 18 0 0 0
-  intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
+  intr 114930548
   ctxt 1990473
   btime 1062191376
   processes 2915
@@ -1333,11 +1333,10 @@ second).  The meanings of the columns are as follows, from left to right:
 - guest: running a normal guest
 - guest_nice: running a niced guest
 
-The "intr" line gives counts of interrupts  serviced since boot time, for each
-of the  possible system interrupts.   The first  column  is the  total of  all
-interrupts serviced  including  unnumbered  architecture specific  interrupts;
-each  subsequent column is the  total for that particular numbered interrupt.
-Unnumbered interrupts are not shown, only summed into the total.
+The "intr" line gives the total of all interrupts including unnumbered
+architecture specific interrupts serviced since boot time.  To see the
+number of interrupts serviced for a particular numbered interrupt,
+the /proc/stat_irqs file should be used instead.
 
 The "ctxt" line gives the total number of context switches across all CPUs.
 
@@ -1359,6 +1358,17 @@ of the possible system softirqs. The first column is the total of all
 softirqs serviced; each subsequent column is the total for that particular
 softirq.
 
+To see the number of interrupts serviced for each of the numbered
+interrupts, the /proc/stat_irqs file can be viewed.
+
+  > cat /proc/stat_irqs
+  intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
+
+The "intr" line gives counts of interrupts  serviced since boot time, for each
+of the  possible system interrupts.   The first  column  is the  total of  all
+interrupts serviced  including  unnumbered  architecture specific  interrupts;
+each  subsequent column is the  total for that particular numbered interrupt.
+Unnumbered interrupts are not shown, only summed into the total.
 
 1.9 Ext4 file system parameters
 ---
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 59749df..79e3c03 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -155,11 +155,6 @@ static int show_stat(struct seq_file *p, void *v)
seq_putc(p, '\n');
}
seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
-
-   /* sum again ? it could be updated? */
-   for_each_irq_nr(j)
-   seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
-
seq_printf(p,
"\nctxt %llu\n"
"btime %llu\n"
@@ -181,15 +176,46 @@ static int show_stat(struct seq_file *p, void *v)
return 0;
 }
 
+/*
+ * Showing individual irq counts can be expensive if there are a lot of
+ * irqs. So it is done in a separate procfs file to reduce performance
+ * overhead of reading other statistical counts.
+ */
+static int show_stat_irqs(struct seq_file *p, void *v)
+{
+   int i, j;
+   u64 sum = 0;
+
+   for_each_possible_cpu(i) {
+   sum += kstat_cpu_irqs_sum(i);
+   sum += arch_irq_stat_cpu(i);
+   }
+   sum += arch_irq_stat();
+
+   seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
+
+   for_each_irq_nr(j)
+ 

[PATCH v7 0/5] cpuset: Enable cpuset controller in default hierarchy

2018-04-19 Thread Waiman Long
v7:
 - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
 - Enforce that load_balancing can only be turned off on cpusets with
   CPUs from the isolated list.
 - Update sched domain generation to allow cpusets with CPUs only
   from the isolated CPU list to be in separate root domains.

v6:
 - Hide cpuset control knobs in root cgroup.
 - Rename effective_cpus and effective_mems to cpus.effective and
   mems.effective respectively.
 - Remove cpuset.flags and add cpuset.sched_load_balance instead
   as the behavior of sched_load_balance has changed and so is
   not a simple flag.
 - Update cgroup-v2.txt accordingly.

v5:
 - Add patch 2 to provide the cpuset.flags control knob for the
   sched_load_balance flag which should be the only feature that is
   essential as a replacement of the "isolcpus" kernel boot parameter.

v4:
 - Further minimize the feature set by removing the flags control knob.

v3:
 - Further trim the additional features down to just memory_migrate.
 - Update Documentation/cgroup-v2.txt.

v6 patch: https://lkml.org/lkml/2018/3/21/530

The purpose of this patchset is to provide a basic set of cpuset
features for cgroup v2. This basic set includes the non-root "cpus",
"mems", "cpus.effective" and "mems.effective", "sched_load_balance"
control files as well as a root-only "cpus.isolated".

The root-only "cpus.isolated" file is added to support use cases similar
to the "isolcpus" kernel parameter. CPUs from the isolated list can be
put into child cpusets where "sched_load_balance" can be disabled to
allow finer control of task-cpu mappings of those isolated CPUs.

On the other hand, enabling the "sched_load_balance" on a cpuset with
only CPUs from the isolated list will allow those CPUs to use a separate
root domain from that of the root cpuset.

This patchset does not exclude the possibility of adding more features
in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their
effective counterparts.

Patch 2 adds sched_load_balance whose behavior changes in v2 to become
hierarchical and includes an implicit !cpu_exclusive.

Patch 3 adds a new root-only "cpuset.cpus.isolated" control file for
CPU isolation purpose.

Patch 4 adds the limitation that "sched_load_balance" can only be turned
off in a cpuset if all the CPUs in the cpuset are already in the root's
"cpuset.cpus.isolated".

Patch 5 modifies the sched domain generation code to generate separate root
sched domains if all the CPUs in a cpuset come from "cpuset.cpus.isolated".

In other words, all the CPUs that need to be isolated or in separate
root domains have to be put into the "cpuset.cpus.isolated" first. Then
child cpusets can be created to partition those isolated CPUs into
either separate root domains with "sched_load_balance" on or really
isolated CPUs with "sched_load_balance" off. Load balancing cannot
be turned off at root.
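
Under the proposed interface, the workflow above boils down to a handful of file writes. The sketch below only mimics the sequence against a temporary directory; the real files live under the cgroup v2 mount point, require root, and their semantics are enforced by the kernel, none of which is modeled here (the "rt-domain" child name is made up for illustration):

```python
import os
import tempfile

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

root = tempfile.mkdtemp()                      # stand-in for /sys/fs/cgroup
os.makedirs(os.path.join(root, "rt-domain"))

# 1. Withdraw CPUs 2-3 from root-level load balancing (root-only file).
write(os.path.join(root, "cpuset.cpus.isolated"), "2-3")

# 2. Hand the isolated CPUs to a child cpuset.
write(os.path.join(root, "rt-domain", "cpuset.cpus"), "2-3")

# 3a. Leave sched_load_balance on -> separate root domain for 2-3, or
# 3b. turn it off (as here) -> fully isolated, userspace-managed CPUs.
write(os.path.join(root, "rt-domain", "cpuset.sched_load_balance"), "0")

print(open(os.path.join(root, "rt-domain", "cpuset.cpus")).read())  # 2-3
```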

Waiman Long (5):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Add cpuset.sched_load_balance to v2
  cpuset: Add a root-only cpus.isolated v2 control file
  cpuset: Restrict load balancing off cpus to subset of cpus.isolated
  cpuset: Make generate_sched_domains() recognize isolated_cpus

 Documentation/cgroup-v2.txt | 138 -
 kernel/cgroup/cpuset.c  | 287 +---
 2 files changed, 404 insertions(+), 21 deletions(-)

-- 
1.8.3.1



[PATCH v7 5/5] cpuset: Make generate_sched_domains() recognize isolated_cpus

2018-04-19 Thread Waiman Long
The generate_sched_domains() function and the hotplug code are modified
to make them use the newly introduced isolated_cpus mask for schedule
domains generation.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 kernel/cgroup/cpuset.c | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d05c4c8..a67c77a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -683,13 +683,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
int ndoms = 0;  /* number of sched domains in result */
int nslot;  /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
+   bool root_load_balance = is_sched_load_balance(&top_cpuset);
 
doms = NULL;
dattr = NULL;
csa = NULL;
 
/* Special case for the 99% of systems with one, full, sched domain */
-   if (is_sched_load_balance(&top_cpuset)) {
+   if (root_load_balance && !top_cpuset.isolation_count) {
ndoms = 1;
doms = alloc_sched_domains(ndoms);
if (!doms)
@@ -712,6 +713,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
csn = 0;
 
rcu_read_lock();
+   if (root_load_balance)
+   csa[csn++] = &top_cpuset;
cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
if (cp == &top_cpuset)
continue;
@@ -722,6 +725,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
 * parent's cpus, so just skip them, and then we call
 * update_domain_attr_tree() to calc relax_domain_level of
 * the corresponding sched domain.
+*
+* If root is load-balancing, we can skip @cp if it
+* is a subset of the root's effective_cpus.
 */
if (!cpumask_empty(cp->cpus_allowed) &&
!(is_sched_load_balance(cp) &&
@@ -729,6 +735,10 @@ static int generate_sched_domains(cpumask_var_t **domains,
 housekeeping_cpumask(HK_FLAG_DOMAIN))))
continue;
 
+   if (root_load_balance &&
+   cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+   continue;
+
if (is_sched_load_balance(cp))
csa[csn++] = cp;
 
@@ -820,6 +830,12 @@ static int generate_sched_domains(cpumask_var_t **domains,
}
BUG_ON(nslot != ndoms);
 
+#ifdef CONFIG_DEBUG_KERNEL
+   for (i = 0; i < ndoms; i++)
+   pr_info("rebuild_sched_domains dom %d: %*pbl\n", i,
+   cpumask_pr_args(doms[i]));
+#endif
+
 done:
kfree(csa);
 
@@ -860,7 +876,12 @@ static void rebuild_sched_domains_locked(void)
 * passing doms with offlined cpu to partition_sched_domains().
 * Anyways, hotplug work item will rebuild sched domains.
 */
-   if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+   if (!top_cpuset.isolation_count &&
+   !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+   goto out;
+
+   if (top_cpuset.isolation_count &&
+  !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
goto out;
 
/* Generate domain masks and attrs */
@@ -1102,6 +1123,7 @@ static int update_isolated_cpumask(const char *buf)
 
top_cpuset.isolation_count = cpumask_weight(top_cpuset.isolated_cpus);
spin_unlock_irq(&callback_lock);
+   rebuild_sched_domains_locked();
 
 out_ok:
retval = 0;
@@ -2530,6 +2552,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
cpumask_copy(&new_cpus, cpu_active_mask);
new_mems = node_states[N_MEMORY];
 
+   /*
+* If isolated_cpus is populated, it is likely that the check below
+* will produce a false positive on cpus_updated when the cpu list
+* isn't changed. It is extra work, but it is better to be safe.
+*/
cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
@@ -2538,6 +2565,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
spin_lock_irq(&callback_lock);
if (!on_dfl)
cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+
+   if (top_cpuset.isolation_count)
+   cpumask_andnot(&new_cpus, &new_cpus,
+   top_cpuset.isolated_cpus);
cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
spin_unlock_irq(&callback_lock);
/* we don't mess with cpumasks of tasks in top_cpuset */
-- 
1.8.3.1


[PATCH v7 2/5] cpuset: Add cpuset.sched_load_balance to v2

2018-04-19 Thread Waiman Long
The sched_load_balance flag is needed to enable CPU isolation similar
to what can be done with the "isolcpus" kernel boot parameter.

The sched_load_balance flag implies an implicit !cpu_exclusive as
it doesn't make sense to have an isolated CPU being load-balanced in
another cpuset.

For v2, this flag is hierarchical and is inherited by child cpusets. It
is not allowed to have this flag turned off in a parent cpuset but on
in a child cpuset.

This flag is set by the parent and is not delegatable.
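
The inheritance rule (off in the parent forbids on in the child) is the same ordering check the patch adds to is_cpuset_subset(), namely is_sched_load_balance(p) <= is_sched_load_balance(q). A minimal model of that check, in illustrative Python rather than the kernel's C:

```python
def balance_state_allowed(child_balance, parent_balance):
    """On the default hierarchy, a child may have sched_load_balance
    on only if its parent does too.

    Booleans model the flag; Python's <= on bools mirrors the
    is_sched_load_balance(p) <= is_sched_load_balance(q) test.
    """
    return child_balance <= parent_balance

print(balance_state_allowed(True, True))    # True: both balanced
print(balance_state_allowed(False, True))   # True: child may opt out
print(balance_state_allowed(True, False))   # False: parent off, child on
```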

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt | 22 ++
 kernel/cgroup/cpuset.c  | 56 +++--
 2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index ed8ec66..c970bd7 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1514,6 +1514,28 @@ Cpuset Interface Files
it is a subset of "cpuset.mems".  Its value will be affected
by memory nodes hotplug events.
 
+  cpuset.sched_load_balance
+   A read-write single value file which exists on non-root cgroups.
+   The default is "1" (on), and the other possible value is "0"
+   (off).
+
+   When it is on, tasks within this cpuset will be load-balanced
+   by the kernel scheduler.  Tasks will be moved from CPUs with
+   high load to other CPUs within the same cpuset with less load
+   periodically.
+
+   When it is off, there will be no load balancing among CPUs on
+   this cgroup.  Tasks will stay in the CPUs they are running on
+   and will not be moved to other CPUs.
+
+   This flag is hierarchical and is inherited by child cpusets. It
+   can be turned off only when the CPUs in this cpuset aren't
+   listed in the cpuset.cpus of other sibling cgroups, and all
+   the child cpusets, if present, have this flag turned off.
+
+   Once it is off, it cannot be turned back on as long as the
+   parent cgroup still has this flag in the off state.
+
 
 Device controller
 -
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 419b758..50c9254 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -407,15 +407,22 @@ static void cpuset_update_task_spread_flag(struct cpuset *cs,
  *
  * One cpuset is a subset of another if all its allowed CPUs and
  * Memory Nodes are a subset of the other, and its exclusive flags
- * are only set if the other's are set.  Call holding cpuset_mutex.
+ * are only set if the other's are set (on legacy hierarchy) or
+ * its sched_load_balance flag is only set if the other is set
+ * (on default hierarchy).  Caller holding cpuset_mutex.
  */
 
 static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
 {
-   return  cpumask_subset(p->cpus_allowed, q->cpus_allowed) &&
-   nodes_subset(p->mems_allowed, q->mems_allowed) &&
-   is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
-   is_mem_exclusive(p) <= is_mem_exclusive(q);
+   if (!cpumask_subset(p->cpus_allowed, q->cpus_allowed) ||
+   !nodes_subset(p->mems_allowed, q->mems_allowed))
+   return false;
+
+   if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys))
+   return is_sched_load_balance(p) <= is_sched_load_balance(q);
+   else
+   return is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
+  is_mem_exclusive(p) <= is_mem_exclusive(q);
 }
 
 /**
@@ -498,7 +505,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 
par = parent_cs(cur);
 
-   /* On legacy hiearchy, we must be a subset of our parent cpuset. */
+   /* On legacy hierarchy, we must be a subset of our parent cpuset. */
ret = -EACCES;
if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
goto out;
@@ -1327,6 +1334,19 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
else
clear_bit(bit, &cs->flags);
 
+   /*
+* On default hierarchy, turning off sched_load_balance flag implies
+* an implicit cpu_exclusive. Turning on sched_load_balance will
+* clear the cpu_exclusive flag.
+*/
+   if ((bit == CS_SCHED_LOAD_BALANCE) &&
+   cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+   if (turning_on)
+   clear_bit(CS_CPU_EXCLUSIVE, &cs->flags);
+   else
+   set_bit(CS_CPU_EXCLUSIVE, &cs->flags);
+   }
+
err = validate_change(cs, trialcs);
if (err < 0)
goto out;
@@ -1966,6 +1986,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
.flags = CFTYPE_NOT_ON_ROOT,
},
 
+   {
+   .

[PATCH v7 1/5] cpuset: Enable cpuset controller in default hierarchy

2018-04-19 Thread Waiman Long
Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to the
memory controller or perhaps to the cpu controller instead of staying
with cpuset.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt | 90 ++---
 kernel/cgroup/cpuset.c  | 48 ++--
 2 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..ed8ec66 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
5-3-2. Writeback
  5-4. PID
5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
-   5-6-1. RDMA Interface Files
- 5-7. Misc
-   5-7-1. perf_event
+ 5-5. Cpuset
+   5.5-1. Cpuset Interface Files
+ 5-6. Device
+ 5-7. RDMA
+   5-7-1. RDMA Interface Files
+ 5-8. Misc
+   5-8-1. perf_event
  5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+--
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~
+
+  cpuset.cpus
+   A read-write multiple values file which exists on non-root
+   cgroups.
+
+   It lists the CPUs allowed to be used by tasks within this
+   cgroup.  The CPU numbers are comma-separated numbers or
+   ranges.  For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.cpus" or all the available CPUs if none is found.
+
+   The value of "cpuset.cpus" stays constant until the next update
+   and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+   A read-only multiple values file which exists on non-root
+   cgroups.
+
+   It lists the onlined CPUs that are actually allowed to be
+   used by tasks within the current cgroup.  If "cpuset.cpus"
+   is empty, it shows all the CPUs from the parent cgroup that
+   will be available to be used by this cgroup.  Otherwise, it is
+   a subset of "cpuset.cpus".  Its value will be affected by CPU
+   hotplug events.
+
+  cpuset.mems
+   A read-write multiple values file which exists on non-root
+   cgroups.
+
+   It lists the memory nodes allowed to be used by tasks within
+   this cgroup.  The memory node numbers are comma-separated
+   numbers or ranges.  For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.mems" or all the available memory nodes if none
+   is found.
+
+   The value of "cpuset.mems" stays constant until the next update
+   and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+   A read-only multiple values file which exists on non-root
+   cgroups.
+
+   It lists the onlined memory nodes that are actually allowed to
+   be used by tasks within the current cgroup.  If "cpuset.mems"
+   is empty, it shows all the memory n

[PATCH v7 3/5] cpuset: Add a root-only cpus.isolated v2 control file

2018-04-19 Thread Waiman Long
In order to better support CPU isolation as well as multiple root
domains for deadline scheduling, the ability to carve out a set of CPUs
specifically for isolation or for another root domain will be useful.

A new root-only "cpuset.cpus.isolated" control file is now added for
holding the list of CPUs that will not be participating in load balancing
within the root cpuset. The root's effective cpu list will not contain
any CPUs that are in "cpuset.cpus.isolated" file.  These isolated CPUs,
however, can still be put into child cpusets and load balanced within
them if necessary.

For CPU isolation, putting the CPUs into this new control file and not
having them in any of the child cpusets should be enough. Those isolated
CPUs can also be put into a child cpuset with load balancing disabled
for finer-grained control.

For creating additional root domains for scheduling, a child cpuset
should only select an exclusive set of CPUs within the isolated set.

The "cpuset.cpus.isolated" control file should be set up before
any child cpusets are created. If child cpusets are present, changes
to this control file will not be allowed if any CPUs that will change
state are in any of the child cpusets.
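
The effective-cpus computation this enables is visible in the update_cpumasks_hier() hunk of the diff: when the parent holds isolated CPUs, they are added back to the pool a child may draw from, then clipped to the online mask and the child's own cpus_allowed. The cpumask arithmetic can be modeled with plain sets (illustrative only; CPU numbers are arbitrary):

```python
def child_effective_cpus(parent_effective, parent_isolated, online, child_cpus):
    """Effective CPUs for a child cpuset, modeled as sets of CPU numbers.

    Mirrors the update_cpumasks_hier() change in the patch:
    (parent effective | parent isolated) & online & child's cpus when
    the parent has isolated CPUs, otherwise parent effective & child's.
    """
    if parent_isolated:
        return (parent_effective | parent_isolated) & online & child_cpus
    return parent_effective & child_cpus

# Root keeps CPUs 0-1 for balancing, isolates 2-3; a child asks for 2-3.
print(child_effective_cpus({0, 1}, {2, 3}, {0, 1, 2, 3}, {2, 3}))  # {2, 3}
```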

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt |  25 ++
 kernel/cgroup/cpuset.c  | 119 +++-
 2 files changed, 143 insertions(+), 1 deletion(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index c970bd7..8d89dc2 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1484,6 +1484,31 @@ Cpuset Interface Files
a subset of "cpuset.cpus".  Its value will be affected by CPU
hotplug events.
 
+  cpuset.cpus.isolated
+   A read-write multiple values file which exists on root cgroup
+   only.
+
+   It lists the CPUs that have been withdrawn from the root cgroup
+   for load balancing.  These CPUs can still be allocated to child
+   cpusets with load balancing enabled, if necessary.
+
+   If a child cpuset contains only an exclusive set of CPUs that are
+   a subset of the isolated CPUs and with load balancing enabled,
+   these CPUs will be load balanced on a separate root domain from
+   the one in the root cgroup.
+
+   Just putting the CPUs into "cpuset.cpus.isolated" will be
+   enough to disable load balancing on those CPUs as long as they
+   do not appear in a child cpuset with load balancing enabled.
+   Fine-grained control of cpu isolation can also be done by
+   putting these isolated CPUs into child cpusets with load
+   balancing disabled.
+
+   The "cpuset.cpus.isolated" should be set up before child
+   cpusets are created.  Once child cpusets are present, changes
+   to "cpuset.cpus.isolated" will not be allowed if the CPUs that
+   change their states are in any of the child cpusets.
+
   cpuset.mems
A read-write multiple values file which exists on non-root
cgroups.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 50c9254..c746b18 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -109,6 +109,9 @@ struct cpuset {
cpumask_var_t effective_cpus;
nodemask_t effective_mems;
 
+   /* Isolated CPUs - root cpuset only */
+   cpumask_var_t isolated_cpus;
+
/*
 * This is old Memory Nodes tasks took on.
 *
@@ -134,6 +137,9 @@ struct cpuset {
 
/* for custom sched domain */
int relax_domain_level;
+
+   /* for isolated_cpus */
+   int isolation_count;
 };
 
 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -909,7 +915,19 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
cpuset_for_each_descendant_pre(cp, pos_css, cs) {
struct cpuset *parent = parent_cs(cp);
 
-   cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
+   /*
+* If parent has isolated CPUs, include them in the list
+* of allowable CPUs.
+*/
+   if (parent->isolation_count) {
+   cpumask_or(new_cpus, parent->effective_cpus,
+  parent->isolated_cpus);
+   cpumask_and(new_cpus, new_cpus, cpu_online_mask);
+   cpumask_and(new_cpus, new_cpus, cp->cpus_allowed);
+   } else {
+   cpumask_and(new_cpus, cp->cpus_allowed,
+   parent->effective_cpus);
+   }
 
/*
 * If it becomes empty, inherit the effective mask of the
@@ -1004,6 +1022,85 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
return 0;
 }
 
+/**
+ * update_isolat

[PATCH v7 4/5] cpuset: Restrict load balancing off cpus to subset of cpus.isolated

2018-04-19 Thread Waiman Long
With the addition of "cpuset.cpus.isolated", it makes sense to add the
restriction that load balancing can only be turned off if the CPUs in
the isolated cpuset are a subset of "cpuset.cpus.isolated".

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt |  7 ---
 kernel/cgroup/cpuset.c  | 29 ++---
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 8d89dc2..c4227ee 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1554,9 +1554,10 @@ Cpuset Interface Files
and will not be moved to other CPUs.
 
This flag is hierarchical and is inherited by child cpusets. It
-   can be turned off only when the CPUs in this cpuset aren't
-   listed in the cpuset.cpus of other sibling cgroups, and all
-   the child cpusets, if present, have this flag turned off.
+   can be explicitly turned off only when it is a direct child of
+   the root cgroup and the CPUs in this cpuset are a subset of the
+   root's "cpuset.cpus.isolated".  Moreover, the CPUs cannot be
+   listed in the "cpuset.cpus" of other sibling cgroups.
 
Once it is off, it cannot be turned back on as long as the
parent cgroup still has this flag in the off state.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c746b18..d05c4c8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -511,6 +511,16 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 
par = parent_cs(cur);
 
+   /*
+* On default hierarchy with sched_load_balance flag off, the cpu
+* list must be a subset of the parent's isolated CPU list, if
+* defined (root).
+*/
+   if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+   !is_sched_load_balance(trial) && par->isolation_count &&
+   !cpumask_subset(trial->cpus_allowed, par->isolated_cpus))
+   goto out;
+
/* On legacy hierarchy, we must be a subset of our parent cpuset. */
ret = -EACCES;
if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
@@ -1431,10 +1441,16 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
else
clear_bit(bit, &cs->flags);
 
+   balance_flag_changed = (is_sched_load_balance(cs) !=
+   is_sched_load_balance(trialcs));
+
/*
 * On default hierarchy, turning off sched_load_balance flag implies
 * an implicit cpu_exclusive. Turning on sched_load_balance will
 * clear the cpu_exclusive flag.
+*
+* sched_load_balance can only be turned off if all the CPUs are
+* in the parent's isolated CPU list.
 */
if ((bit == CS_SCHED_LOAD_BALANCE) &&
cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
@@ -1442,15 +1458,22 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
clear_bit(CS_CPU_EXCLUSIVE, &cs->flags);
else
set_bit(CS_CPU_EXCLUSIVE, &cs->flags);
+
+   if (balance_flag_changed && !turning_on) {
+   struct cpuset *parent = parent_cs(cs);
+
+   err = -EBUSY;
+   if (!parent->isolation_count ||
+   !cpumask_subset(trialcs->cpus_allowed,
+   parent->cpus_allowed))
+   goto out;
+   }
}
 
err = validate_change(cs, trialcs);
if (err < 0)
goto out;
 
-   balance_flag_changed = (is_sched_load_balance(cs) !=
-   is_sched_load_balance(trialcs));
-
spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
-- 
1.8.3.1



Re: [PATCH v5 1/9] sysctl: Add flags to support min/max range clamping

2018-03-29 Thread Waiman Long
On 03/29/2018 02:15 PM, Luis R. Rodriguez wrote:
> On Mon, Mar 19, 2018 at 11:39:19AM -0400, Waiman Long wrote:
>> On 03/16/2018 09:10 PM, Luis R. Rodriguez wrote:
>>> On Fri, Mar 16, 2018 at 02:13:42PM -0400, Waiman Long wrote:
>>>> When the CTL_FLAGS_CLAMP_RANGE flag is set in the ctl_table
>>>> entry, any update from the userspace will be clamped to the given
>>>> range without error if either the proc_dointvec_minmax() or the
>>>> proc_douintvec_minmax() handlers is used.
>>> I don't get it.  Why define a generic range flag when we can be more
>>> specific and
>>> you do that in your next patch. What's the point of this flag then?
>>>
>>>   Luis
>> I was thinking about using the signed/unsigned bits as just annotations
>> for ranges for future extension. For the purpose of this patchset alone,
>> I can merge the three bits into just two.
> Only introduce flags which you will actually use in the same patch series.
>
>   Luis

Yes, will do. Since the merge window is coming, should I wait until it
is over to send out the new patch?

Cheers,
Longman



Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2

2018-03-27 Thread Waiman Long
On 03/27/2018 10:02 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 26, 2018 at 04:28:49PM -0400, Waiman Long wrote:
>> Maybe we can have a different root level flag, say,
>> sched_partition_domain that is equivalent to !sched_load_balance.
>> However, I am still not sure if we should enforce that no task should be
>> in the root cgroup when the flag is set.
>>
>> Tejun and Peter, what are your thoughts on this?
> I haven't looked into the other issues too much but we for sure cannot
> empty the root cgroup.
>
> Thanks.
>
Now, I have a different idea. How about we add a special root-only knob,
say, "cpuset.cpus.isolated" that contains the list of CPUs that are
still owned by root, but not participating in load balancing. All the
tasks in the root are load-balanced among the remaining CPUs.

A child can then be created that hold some or all the CPUs in the
isolated set. It will then have a separate root domain if load balancing
is on, or an isolated cpuset if load balancing is off.

Will that idea work?

Cheers,
Longman




Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2

2018-03-26 Thread Waiman Long
On 03/26/2018 08:47 AM, Juri Lelli wrote:
> On 23/03/18 14:44, Waiman Long wrote:
>> On 03/23/2018 03:59 AM, Juri Lelli wrote:
> [...]
>
>>> OK, thanks for confirming. Can you tell again however why do you think
>>> we need to remove sched_load_balance from root level? Won't we end up
>>> having tasks put on isolated sets?
>> The root cgroup is special that it owns all the resources in the system.
>> We generally don't want restriction be put on the root cgroup. A child
>> cgroup has to be created to have constraints put on it. In fact, most of
>> the controller files don't show up in the v2 cgroup root at all.
>>
>> An isolated cgroup has to be put under root, e.g.
>>
>>   Root
>>  /\
>> isolated  balanced
>>
>>> Also, I guess children groups with more than one CPU will need to be
>>> able to load balance across their CPUs, no matter what their parent
>>> group does?
>> The purpose of an isolated cpuset is to have a dedicated set of CPUs to
>> be used by a certain application that makes its own scheduling decisions
>> by placing tasks explicitly on specific CPUs. It just doesn't make sense
>> to have a CPU in an isolated cpuset participate in load balancing in
>> another cpuset. If one wants load balancing in a child cpuset, the parent
>> cpuset should have load balancing turned on as well.
> Isolated with CPUs overlapping some other cpuset makes little sense, I
> agree. What I have in mind however is an isolated set of CPUs that don't
> overlap with any other cpuset (as your balanced set above). In this case
> I think it makes sense to let the sys admin decide if "automatic" load
> balancing has to be performed (by the scheduler) or no load balancing at
> all has to take place?
>
> Further extending your example:
>
>  Root [0-3]
>/\
> group1 [0-1] group2[2-3]
>
> Why should we prevent load balancing to be disabled at root level (so
> that for example tasks still residing in root group are not freely
> migrated around, potentially disturbing both sub-groups)?
>
> Then one can decide that group1 is a "userspace managed" group (no load
> balancing takes place) and group2 is balanced by the scheduler.
>
> And this is not DEADLINE specific, IMHO.
>
>> As I look into the code, it seems like root domain is probably somewhat
>> associated with cpu_exclusive only. Whether sched_load_balance is set
>> doesn't really matter.  I will need to look further on the conditions
>> where a new root domain is created.
> I checked again myself (sched domains code is always a maze :) and I
> believe that sched_load_balance flag indeed controls domains (sched and
> root) creation and configuration.  Changing the flag triggers a potential
> rebuild, and separate sched/root domains are generated if subgroups have
> non-overlapping cpumasks.  cpu_exclusive only enforces this latter
> condition.

Right, I ran some tests and figured out that to have root_domains at the
child cgroup level, we do need to disable load balancing at the root
cgroup level and enable it in child cgroups whose cpu lists are mutually
disjoint. The cpu_exclusive flag isn't really needed.
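For illustration, my reading of the rebuild rule can be modeled with a few lines of Python. This is a deliberate simplification of the actual sched-domain code, not kernel logic:

```python
# Rough model of when separate root domains get built (my reading of the
# cpuset/sched-domain rebuild behavior, greatly simplified): with load
# balancing off at the root, each load-balanced child whose cpumask does
# not overlap any sibling's gets its own root domain.
def root_domains(root_balance, children):
    """children: list of (cpumask, load_balance) tuples."""
    if root_balance:
        # one domain spanning everything the children cover
        return [set().union(*(cpus for cpus, _ in children))]
    domains = []
    for cpus, balance in children:
        if balance and all(cpus.isdisjoint(other)
                           for other, _ in children if other is not cpus):
            domains.append(cpus)
    return domains

kids = [({0, 1}, True), ({2, 3}, True)]
assert root_domains(False, kids) == [{0, 1}, {2, 3}]  # two root domains
assert root_domains(True, kids) == [{0, 1, 2, 3}]     # balancing at root
```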

I am not against doing that at the root cgroup, but it is kind of weird
in terms of semantics. If we disable load balancing in the root cgroup
but enable it in child cgroups, what does that mean for the processes
that are still in the root cgroup?

The sched_load_balance flag isn't something that is passed to the
scheduler. It only affects the CPU topology of the system. So I
suspect that a process in the root cgroup will be load balanced among
the CPUs of one of the child cgroups. That doesn't look right unless
we enforce that no process can be in the root cgroup in this case.

Real cpu isolation will then require that we disable load balancing at
root, and enable load balancing in child cgroups that only contain CPUs
outside of the isolated CPU list. Again, it is still possible that some
tasks in the root cgroup, if present, may be using some of the isolated
CPUs.

Maybe we can have a different root level flag, say,
sched_partition_domain that is equivalent to !sched_load_balance.
However, I am still not sure if we should enforce that no task should be
in the root cgroup when the flag is set.

Tejun and Peter, what are your thoughts on this?

Cheers,
Longman

 



Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2

2018-03-23 Thread Waiman Long
On 03/23/2018 03:59 AM, Juri Lelli wrote:
> On 22/03/18 17:50, Waiman Long wrote:
>> On 03/22/2018 04:41 AM, Juri Lelli wrote:
>>> On 21/03/18 12:21, Waiman Long wrote:
> [...]
>
>>>> +  cpuset.sched_load_balance
>>>> +  A read-write single value file which exists on non-root cgroups.
>>>> +  The default is "1" (on), and the other possible value is "0"
>>>> +  (off).
>>>> +
>>>> +  When it is on, tasks within this cpuset will be load-balanced
>>>> +  by the kernel scheduler.  Tasks will be moved from CPUs with
>>>> +  high load to other CPUs within the same cpuset with less load
>>>> +  periodically.
>>>> +
>>>> +  When it is off, there will be no load balancing among CPUs on
>>>> +  this cgroup.  Tasks will stay in the CPUs they are running on
>>>> +  and will not be moved to other CPUs.
>>>> +
>>>> +  This flag is hierarchical and is inherited by child cpusets. It
>>>> +  can be turned off only when the CPUs in this cpuset aren't
>>>> +  listed in the cpuset.cpus of other sibling cgroups, and all
>>>> +  the child cpusets, if present, have this flag turned off.
>>>> +
>>>> +  Once it is off, it cannot be turned back on as long as the
>>>> +  parent cgroup still has this flag in the off state.
>>>> +
>>> I'm afraid that this will not work for SCHED_DEADLINE (at least for how
>>> it is implemented today). As you can see in Documentation [1] the only
>>> way a user has to perform partitioned/clustered scheduling is to create
>>> subset of exclusive cpusets and then assign deadline tasks to them. The
>>> other thing to take into account here is that a root_domain is created
>>> for each exclusive set and we use such root_domain to keep information
>>> about admitted bandwidth and speed up load balancing decisions (there is
>>> a max heap tracking deadlines of active tasks on each root_domain).
>>> Now, AFAIR distinct root_domain(s) are created when parent group has
>>> sched_load_balance disabled and cpus_exclusive set (in cgroup v1 that
>>> is). So, what we normally do is create, say, cpus_exclusive groups for
>>> the different clusters and then disable sched_load_balance at root level
>>> (so that each cluster gets its own root_domain). Also,
>>> sched_load_balance is enabled in children groups (as load balancing
>>> inside clusters is what we actually needed :).
>> That looks like an undocumented side effect to me. I would rather see an
>> explicit control file that enables root_domain and breaks it free from
>> cpu_exclusive && !sched_load_balance, e.g. sched_root_domain(?).
> Mmm, it actually makes some sort of sense to me that as long as parent
> groups can't load balance (because !sched_load_balance) and this group
> can't have CPUs overlapping with some other group (because
> cpu_exclusive) a data structure (root_domain) is created to handle load
> balancing for this isolated subsystem. I agree that it should be better
> documented, though.

Yes, this needs to be documented.

>>> IIUC your proposal this will not be permitted with cgroup v2 because
>>> sched_load_balance won't be present at root level and children groups
>>> won't be able to set sched_load_balance back to 1 if that was set to 0
>>> in some parent. Is that true?
>> Yes, that is the current plan.
> OK, thanks for confirming. Can you tell again however why do you think
> we need to remove sched_load_balance from root level? Won't we end up
> having tasks put on isolated sets?

The root cgroup is special in that it owns all the resources in the system.
We generally don't want restrictions to be put on the root cgroup. A child
cgroup has to be created to have constraints put on it. In fact, most of
the controller files don't show up in the v2 cgroup root at all.

An isolated cgroup has to be put under root, e.g.

  Root
 /\
isolated  balanced

>
> Also, I guess children groups with more than one CPU will need to be
> able to load balance across their CPUs, no matter what their parent
> group does?

The purpose of an isolated cpuset is to have a dedicated set of CPUs to
be used by a certain application that makes its own scheduling decisions
by placing tasks explicitly on specific CPUs. It just doesn't make sense
to have a CPU in an isolated cpuset participate in load balancing in
another cpuset. If one wants load balancing in a child cpuset, the parent
cpuset should have load balancing turned on as well.

As I look into the code, it seems like root domain is probably somewhat
associated with cpu_exclusive only. Whether sched_load_balance is set
doesn't really matter.  I will need to look further on the conditions
where a new root domain is created.

Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2

2018-03-22 Thread Waiman Long
On 03/22/2018 04:41 AM, Juri Lelli wrote:
> Hi Waiman,
>
> On 21/03/18 12:21, Waiman Long wrote:
>> The sched_load_balance flag is needed to enable CPU isolation similar
>> to what can be done with the "isolcpus" kernel boot parameter.
>>
>> The sched_load_balance flag implies an implicit !cpu_exclusive as
>> it doesn't make sense to have an isolated CPU being load-balanced in
>> another cpuset.
>>
>> For v2, this flag is hierarchical and is inherited by child cpusets. It
>> is not allowed to have this flag turned off in a parent cpuset but on
>> in a child cpuset.
>>
>> This flag is set by the parent and is not delegatable.
>>
>> Signed-off-by: Waiman Long <long...@redhat.com>
>> ---
>>  Documentation/cgroup-v2.txt | 22 ++
>>  kernel/cgroup/cpuset.c  | 56 +++--
>>  2 files changed, 71 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index ed8ec66..c970bd7 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1514,6 +1514,28 @@ Cpuset Interface Files
>>  it is a subset of "cpuset.mems".  Its value will be affected
>>  by memory nodes hotplug events.
>>  
>> +  cpuset.sched_load_balance
>> +A read-write single value file which exists on non-root cgroups.
>> +The default is "1" (on), and the other possible value is "0"
>> +(off).
>> +
>> +When it is on, tasks within this cpuset will be load-balanced
>> +by the kernel scheduler.  Tasks will be moved from CPUs with
>> +high load to other CPUs within the same cpuset with less load
>> +periodically.
>> +
>> +When it is off, there will be no load balancing among CPUs on
>> +this cgroup.  Tasks will stay in the CPUs they are running on
>> +and will not be moved to other CPUs.
>> +
>> +This flag is hierarchical and is inherited by child cpusets. It
>> +can be turned off only when the CPUs in this cpuset aren't
>> +listed in the cpuset.cpus of other sibling cgroups, and all
>> +the child cpusets, if present, have this flag turned off.
>> +
>> +Once it is off, it cannot be turned back on as long as the
>> +parent cgroup still has this flag in the off state.
>> +
> I'm afraid that this will not work for SCHED_DEADLINE (at least for how
> it is implemented today). As you can see in Documentation [1] the only
> way a user has to perform partitioned/clustered scheduling is to create
> subset of exclusive cpusets and then assign deadline tasks to them. The
> other thing to take into account here is that a root_domain is created
> for each exclusive set and we use such root_domain to keep information
> about admitted bandwidth and speed up load balancing decisions (there is
> a max heap tracking deadlines of active tasks on each root_domain).
> Now, AFAIR distinct root_domain(s) are created when parent group has
> sched_load_balance disabled and cpus_exclusive set (in cgroup v1 that
> is). So, what we normally do is create, say, cpus_exclusive groups for
> the different clusters and then disable sched_load_balance at root level
> (so that each cluster gets its own root_domain). Also,
> sched_load_balance is enabled in children groups (as load balancing
> inside clusters is what we actually needed :).

That looks like an undocumented side effect to me. I would rather see an
explicit control file that enables root_domain and breaks it free from
cpu_exclusive && !sched_load_balance, e.g. sched_root_domain(?).

> IIUC your proposal this will not be permitted with cgroup v2 because
> sched_load_balance won't be present at root level and children groups
> won't be able to set sched_load_balance back to 1 if that was set to 0
> in some parent. Is that true?

Yes, that is the current plan.

> Look, the way things work today is most probably not perfect (just to
> say one thing, we need to disable load balancing for all classes at root
> level just because DEADLINE wants to set restricted affinities to his
> tasks :/) and we could probably think on how to change how this all
> work. So, let's first see if IIUC what you are proposing (and its
> implications). :)
>
Cgroup v2 is supposed to allow us to have a fresh start to rethink what
is a more sane way of partitioning resources without worrying about
backward compatibility. So I think it is time to design a new way for
deadline tasks to work with cpuset v2.

Cheers,
Longman





[PATCH v6 0/2] cpuset: Enable cpuset controller in default hierarchy

2018-03-21 Thread Waiman Long
v6:
 - Hide cpuset control knobs in root cgroup.
 - Rename effective_cpus and effective_mems to cpus.effective and
   mems.effective respectively.
 - Remove cpuset.flags and add cpuset.sched_load_balance instead
   as the behavior of sched_load_balance has changed and so is
   not a simple flag.
 - Update cgroup-v2.txt accordingly.

v5:
 - Add patch 2 to provide the cpuset.flags control knob for the
   sched_load_balance flag which should be the only feature that is
   essential as a replacement of the "isolcpus" kernel boot parameter.

v4:
 - Further minimize the feature set by removing the flags control knob.

v3:
 - Further trim the additional features down to just memory_migrate.
 - Update Documentation/cgroup-v2.txt.

The purpose of this patchset is to provide a minimal set of cpuset
features for cgroup v2. That minimal set includes the cpus, mems,
cpus.effective and mems.effective and sched_load_balance. The last one is
needed to support use cases similar to the "isolcpus" kernel parameter.

This patchset does not exclude the possibility of adding more flags
and features in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their
effective counterparts.

Patch 2 adds sched_load_balance whose behavior changes in v2 to become
hierarchical and includes an implicit !cpu_exclusive.

Waiman Long (2):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Add cpuset.sched_load_balance to v2

 Documentation/cgroup-v2.txt | 112 ++--
 kernel/cgroup/cpuset.c  | 104 
 2 files changed, 201 insertions(+), 15 deletions(-)

-- 
1.8.3.1



[PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2

2018-03-21 Thread Waiman Long
The sched_load_balance flag is needed to enable CPU isolation similar
to what can be done with the "isolcpus" kernel boot parameter.

The sched_load_balance flag implies an implicit !cpu_exclusive as
it doesn't make sense to have an isolated CPU being load-balanced in
another cpuset.

For v2, this flag is hierarchical and is inherited by child cpusets. It
is not allowed to have this flag turned off in a parent cpuset but on
in a child cpuset.

This flag is set by the parent and is not delegatable.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt | 22 ++
 kernel/cgroup/cpuset.c  | 56 +++--
 2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index ed8ec66..c970bd7 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1514,6 +1514,28 @@ Cpuset Interface Files
it is a subset of "cpuset.mems".  Its value will be affected
by memory nodes hotplug events.
 
+  cpuset.sched_load_balance
+   A read-write single value file which exists on non-root cgroups.
+   The default is "1" (on), and the other possible value is "0"
+   (off).
+
+   When it is on, tasks within this cpuset will be load-balanced
+   by the kernel scheduler.  Tasks will be moved from CPUs with
+   high load to other CPUs within the same cpuset with less load
+   periodically.
+
+   When it is off, there will be no load balancing among CPUs on
+   this cgroup.  Tasks will stay in the CPUs they are running on
+   and will not be moved to other CPUs.
+
+   This flag is hierarchical and is inherited by child cpusets. It
+   can be turned off only when the CPUs in this cpuset aren't
+   listed in the cpuset.cpus of other sibling cgroups, and all
+   the child cpusets, if present, have this flag turned off.
+
+   Once it is off, it cannot be turned back on as long as the
+   parent cgroup still has this flag in the off state.
+
 
 Device controller
 -
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 419b758..d675c4f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -407,15 +407,22 @@ static void cpuset_update_task_spread_flag(struct cpuset *cs,
  *
  * One cpuset is a subset of another if all its allowed CPUs and
  * Memory Nodes are a subset of the other, and its exclusive flags
- * are only set if the other's are set.  Call holding cpuset_mutex.
+ * are only set if the other's are set (on legacy hierarchy) or
+ * its sched_load_balance flag is only set if the other is set
+ * (on default hierarchy).  Caller holding cpuset_mutex.
  */
 
 static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
 {
-   return  cpumask_subset(p->cpus_allowed, q->cpus_allowed) &&
-   nodes_subset(p->mems_allowed, q->mems_allowed) &&
-   is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
-   is_mem_exclusive(p) <= is_mem_exclusive(q);
+   if (!cpumask_subset(p->cpus_allowed, q->cpus_allowed) ||
+   !nodes_subset(p->mems_allowed, q->mems_allowed))
+   return false;
+
+   if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys))
+   return is_sched_load_balance(p) <= is_sched_load_balance(q);
+   else
+   return is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
+  is_mem_exclusive(p) <= is_mem_exclusive(q);
 }
 
 /**
@@ -498,7 +505,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 
par = parent_cs(cur);
 
-   /* On legacy hiearchy, we must be a subset of our parent cpuset. */
+   /* On legacy hierarchy, we must be a subset of our parent cpuset. */
ret = -EACCES;
if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
goto out;
@@ -1327,6 +1334,19 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
else
clear_bit(bit, &trialcs->flags);
 
+   /*
+* On default hierarchy, turning off sched_load_balance flag implies
+* an implicit cpu_exclusive. Turning on sched_load_balance will
+* clear the cpu_exclusive flag.
+*/
+   if ((bit == CS_SCHED_LOAD_BALANCE) &&
+   cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+   if (turning_on)
+   clear_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+   else
+   set_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+   }
+
err = validate_change(cs, trialcs);
if (err < 0)
goto out;
@@ -1966,6 +1986,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
.flags = CFTYPE_NOT_ON_ROOT,
},
 
+   {
+   .

[PATCH v6 1/2] cpuset: Enable cpuset controller in default hierarchy

2018-03-21 Thread Waiman Long
Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to the
memory controller or maybe to the cpu controller instead of staying
with cpuset.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt | 90 ++---
 kernel/cgroup/cpuset.c  | 48 ++--
 2 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..ed8ec66 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
5-3-2. Writeback
  5-4. PID
5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
-   5-6-1. RDMA Interface Files
- 5-7. Misc
-   5-7-1. perf_event
+ 5-5. Cpuset
+   5.5-1. Cpuset Interface Files
+ 5-6. Device
+ 5-7. RDMA
+   5-7-1. RDMA Interface Files
+ 5-8. Misc
+   5-8-1. perf_event
  5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+--
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~
+
+  cpuset.cpus
+   A read-write multiple values file which exists on non-root
+   cgroups.
+
+   It lists the CPUs allowed to be used by tasks within this
+   cgroup.  The CPU numbers are comma-separated numbers or
+   ranges.  For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.cpus" or all the available CPUs if none is found.
+
+   The value of "cpuset.cpus" stays constant until the next update
+   and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+   A read-only multiple values file which exists on non-root
+   cgroups.
+
+   It lists the onlined CPUs that are actually allowed to be
+   used by tasks within the current cgroup.  If "cpuset.cpus"
+   is empty, it shows all the CPUs from the parent cgroup that
+   will be available to be used by this cgroup.  Otherwise, it is
+   a subset of "cpuset.cpus".  Its value will be affected by CPU
+   hotplug events.
+
+  cpuset.mems
+   A read-write multiple values file which exists on non-root
+   cgroups.
+
+   It lists the memory nodes allowed to be used by tasks within
+   this cgroup.  The memory node numbers are comma-separated
+   numbers or ranges.  For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.mems" or all the available memory nodes if none
+   is found.
+
+   The value of "cpuset.mems" stays constant until the next update
+   and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+   A read-only multiple values file which exists on non-root
+   cgroups.
+
+   It lists the onlined memory nodes that are actually allowed to
+   be used by tasks within the current cgroup.  If "cpuset.mems"
+   is empty, it shows all the memory nodes from the parent cgroup
+   that will be available to be used by this cgroup.  Otherwise,
+   it is a subset of "cpuset.mems".  Its value will be affected
+   by memory nodes hotplug events.

Re: [PATCH v5 1/2] cpuset: Enable cpuset controller in default hierarchy

2018-03-20 Thread Waiman Long
On 03/20/2018 05:14 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Mar 20, 2018 at 04:53:37PM -0400, Waiman Long wrote:
>> AFAIK for v2, when cpuset.cpus is empty, cpuset.effective_cpus will show
>> all the cpus available from the parent. It is a different behavior from
>> v1. So do we still need a cpuset.cpus_available?
> Heh, you're right.  Let's forget about available and do
> cpuset.cpus.effective.  The primary reason for suggesting that was
> because of the similarity with cgroup.controllers and
> cgroup.subtree_control; however, they're that way because
> subtree_control is delegatable.  For a normal resource knob like
> cpuset.cpus, the knob is owned by the parent and what's interesting to
> the parent is its effective set that it's distributing from.

OK, will change the names as suggested.

-Longman


Re: [PATCH v5 1/2] cpuset: Enable cpuset controller in default hierarchy

2018-03-20 Thread Waiman Long
On 03/20/2018 04:10 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Tue, Mar 20, 2018 at 09:51:20AM -0400, Waiman Long wrote:
>>>> +  It lists the onlined CPUs that are actually allowed to be
>>>> +  used by tasks within the current cgroup. It is a subset of
>>>> +  "cpuset.cpus".  Its value will be affected by CPU hotplug
>>>> +  events.
>>> Can we do cpuset.cpus.available which lists the cpus available to the
>>> cgroup instead of the eventual computed mask for the cgroup?  That'd
>>> be more useful as it doesn't lose the information by and'ing what's
>>> available with the cgroup's mask and it's trivial to determine the
>>> effective from the two masks.
>> I don't get what you want here. cpus is the cpuset's cpus_allowed mask.
>> effective_cpus is the effective_cpus mask. When you say cpus available
>> to the cgroup, do you mean the cpu_online_mask or the list of cpus from
>> the parent? Or do you just want to change the name to cpus.available
>> instead of effective_cpus?
> The available cpus from the parent, where the effective is AND between
> cpuset.available and cpuset.cpus of the cgroup, so that the user can
> see what's available for the cgroup unfiltered by cpuset.cpus.

AFAIK for v2, when cpuset.cpus is empty, cpuset.effective_cpus will show
all the cpus available from the parent. It is a different behavior from
v1. So do we still need a cpuset.cpus_available?
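The v2 behavior described here can be sketched as simple set arithmetic. This is a model of the documented semantics, not kernel code:

```python
# Set-arithmetic model of cpuset.cpus.effective in v2 as described above
# (a sketch of the documented semantics, not kernel code): an empty
# cpuset.cpus falls back to whatever the parent makes available;
# otherwise the effective set is the requested mask filtered by the
# parent's effective set and CPU hotplug.
def effective_cpus(requested, parent_effective, online):
    if not requested:                        # empty cpuset.cpus
        return parent_effective & online
    return requested & parent_effective & online

online = {0, 1, 2, 3}
parent = {0, 1, 2, 3}
assert effective_cpus(set(), parent, online) == {0, 1, 2, 3}
assert effective_cpus({1, 2, 5}, parent, online) == {1, 2}  # CPU 5 filtered out
```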

>> Right, I will set CFTYPE_NOT_ON_ROOT to "cpus" and "mems" as we are not
>> supposed to change them in the root. The effective_cpus and
>> effective_mems will be there in the root to show what are available.
> So, we can do that in the future but let's not do that for now.  It's
> the same problem we have for basically everything else and we've
> stayed away from replicating the information in the root cgroup.  This
> might change in the future but if we do that let's do that
> consistently.
That is fine. I will make them all disappear in the root cgroup.

Cheers,
Longman



Re: [PATCH v5 2/2] cpuset: Add cpuset.flags control knob to v2

2018-03-20 Thread Waiman Long
On 03/20/2018 04:22 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Tue, Mar 20, 2018 at 04:12:25PM -0400, Waiman Long wrote:
>> After some thought, I am planning to impose the following additional
>> constraints on how sched_load_balance works in v2.
>>
>> 1) sched_load_balance will be made hierarchical, the child will inherit
>> the flag from its parent.
>> 2) cpu_exclusive will be implicitly associated with sched_load_balance.
>> IOW, sched_load_balance => !cpu_exclusive, and !sched_load_balance =>
>> cpu_exclusive.
>> 3) sched_load_balance cannot be 1 on a child if it is 0 on the parent.
>>
>> With these changes, sched_load_balance will have to be set by the parent
>> and so will not be delegatable. Please let me know your thought on that.
> So, for configurations, we usually don't let them interact across
> hierarchy because that can lead to configurations surprise-changing
> and delegated children locking the parent into the current config.
>
> This case could be different and as long as we always guarantee that
> an ancestor isn't limited by its descendants in what it can configure,
> it should be okay (e.g. an ancestor should always be able to turn on
> sched_load_balance regardless of how the descendants are configured).

Yes, I will do some testing to make sure that a descendant won't be able
to affect how the ancestors can behave.

> Hmmm... can you explain why sched_load_balance needs to behave this
> way?

It boils down to the fact that it doesn't make sense to have a CPU in an
isolated cpuset participate in load balancing in another cpuset, as
Mike has said before. It is especially true in a parent-child
relationship where a delegatee can escape CPU isolation by re-enabling
sched_load_balance in a child cpuset.

Cheers,
Longman



Re: [PATCH v5 2/2] cpuset: Add cpuset.flags control knob to v2

2018-03-20 Thread Waiman Long
On 03/19/2018 12:33 PM, Waiman Long wrote:
> On 03/19/2018 12:26 PM, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Thu, Mar 15, 2018 at 05:20:42PM -0400, Waiman Long wrote:
>>> +   The currently supported flag is:
>>> +
>>> + sched_load_balance
>>> +   When it is not set, there will be no load balancing
>>> +   among CPUs on this cpuset.  Tasks will stay in the
>>> +   CPUs they are running on and will not be moved to
>>> +   other CPUs.
>>> +
>>> +   When it is set, tasks within this cpuset will be
>>> +   load-balanced by the kernel scheduler.  Tasks will be
>>> +   moved from CPUs with high load to other CPUs within
>>> +   the same cpuset with less load periodically.
>> Hmm... looks like this is something which can be decided by the cgroup
>> itself and should be made delegatable.  Given that different flags
>> might need different delegation settings and the precedence of
>> memory.oom_group, I think it'd be better to make the flags separate
>> bool files - ie. cpuset.sched_load_balance which contains 0/1 and
>> marked delegatable.
>>
>> Thanks.
>>
> Sure. Will do that.

After some thought, I am planning to impose the following additional
constraints on how sched_load_balance works in v2.

1) sched_load_balance will be made hierarchical, the child will inherit
the flag from its parent.
2) cpu_exclusive will be implicitly associated with sched_load_balance.
IOW, sched_load_balance => !cpu_exclusive, and !sched_load_balance =>
cpu_exclusive.
3) sched_load_balance cannot be 1 on a child if it is 0 on the parent.

With these changes, sched_load_balance will have to be set by the parent
and so will not be delegatable. Please let me know your thought on that.
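The three constraints above can be modeled as follows. This is a toy sketch of the proposal, with illustrative names, not actual cpuset code:

```python
# Toy model of the three proposed v2 constraints (sketch of the proposal,
# not kernel code): the flag is inherited, implies !cpu_exclusive, and a
# child cannot re-enable balancing under a non-balancing parent.
class Cpuset:
    def __init__(self, parent=None):
        self.parent = parent
        # (1) child inherits sched_load_balance from its parent
        self.sched_load_balance = parent.sched_load_balance if parent else True

    @property
    def cpu_exclusive(self):
        # (2) cpu_exclusive is implicitly tied to !sched_load_balance
        return not self.sched_load_balance

    def set_load_balance(self, on):
        # (3) cannot be 1 on a child while 0 on the parent
        if on and self.parent and not self.parent.sched_load_balance:
            raise PermissionError("parent has sched_load_balance off")
        self.sched_load_balance = on

root = Cpuset()
root.set_load_balance(False)
child = Cpuset(root)
assert child.sched_load_balance is False and child.cpu_exclusive
try:
    child.set_load_balance(True)      # must be rejected by rule (3)
    raise AssertionError("should have been rejected")
except PermissionError:
    pass
```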

Cheers,
Longman





Re: [PATCH v5 1/2] cpuset: Enable cpuset controller in default hierarchy

2018-03-20 Thread Waiman Long
On 03/19/2018 11:59 AM, Tejun Heo wrote:
> Hello, Waiman.
>
> This looks great.  A couple nitpicks below.
>
>> + 5-3. Cpuset
>> +   5.3-1. Cpuset Interface Files
> Can we put cpuset below pid?  It feels weird to break up cpu, memory
> and io as they represent the three major resources and are in a
> similar fashion.
Sure. I will move it down below pid.

>> +  cpuset.effective_cpus
>> +A read-only multiple values file which exists on non-root
>> +cgroups.
>> +
>> +It lists the onlined CPUs that are actually allowed to be
>> +used by tasks within the current cgroup. It is a subset of
>> +"cpuset.cpus".  Its value will be affected by CPU hotplug
>> +events.
> Can we do cpuset.cpus.available which lists the cpus available to the
> cgroup instead of the eventual computed mask for the cgroup?  That'd
> be more useful as it doesn't lose the information by and'ing what's
> available with the cgroup's mask and it's trivial to determine the
> effective from the two masks.

I don't get what you want here. cpus is the cpuset's cpus_allowed mask.
effective_cpus is the effective_cpus mask. When you say cpus available
to the cgroup, do you mean the cpu_online_mask or the list of cpus from
the parent? Or do you just want to change the name to cpus.available
instead of effective_cpus?

>> +  cpuset.effective_mems
>> +A read-only multiple values file which exists on non-root
>> +cgroups.
>> +
>> +It lists the onlined memory nodes that are actually allowed
>> +to be used by tasks within the current cgroup.  It is a subset
>> +of "cpuset.mems".  Its value will be affected by memory nodes
>> +hotplug events.
> Ditto.
>
>> +static struct cftype dfl_files[] = {
>> +{
>> +.name = "cpus",
>> +.seq_show = cpuset_common_seq_show,
>> +.write = cpuset_write_resmask,
>> +.max_write_len = (100U + 6 * NR_CPUS),
>> +.private = FILE_CPULIST,
>> +},
> Is it missing CFTYPE_NOT_ON_ROOT?  Other files too.

Right, I will set CFTYPE_NOT_ON_ROOT to "cpus" and "mems" as we are not
supposed to change them in the root. The effective_cpus and
effective_mems will be there in the root to show what is available.

Cheers,
Longman




Re: [PATCH v4] cpuset: Enable cpuset controller in default hierarchy

2018-03-19 Thread Waiman Long
On 03/19/2018 04:49 PM, Mike Galbraith wrote:
> On Mon, 2018-03-19 at 08:34 -0700, Tejun Heo wrote:
>> Hello, Mike.
>>
>> On Thu, Mar 15, 2018 at 03:49:01AM +0100, Mike Galbraith wrote:
>>> Under the hood v2 details are entirely up to you.  My input ends at
>>> please don't leave dynamic partitioning standing at the dock when v2
>>> sails.
>> So, this isn't about implementation details but about what the
>> interface achieves - ie, what's the actual function?  The only thing I
>> can see is blocking the entity which is configuring the hierarchy from
>> making certain configs.  While that might be useful in some specific
>> use cases, it seems to miss the bar for becoming its own kernel
>> feature.  After all, nothing prevents the same entity from clearing
>> the exclusive bit and making the said changes.
> Yes, privileged contexts can maliciously or stupidly step all over one
> other no matter what you do (finite resource), but oxymoron creation
> (CPUs simultaneously balanced and isolated) should be handled.  If one
> context can allocate a set overlapping a set another context intends to
> or already has detached from scheduler domains, both are screwed.
>
>   -Mike

The allocations of CPUs to child cgroups should be controlled by the
parent cgroup. It is the parent's fault if some CPUs are in both
balanced and isolated cgroups.

How about we don't allow turning off scheduling if the CPUs aren't
exclusive from the parent's perspective? So you can't create an isolated
cgroup if the CPUs aren't exclusive. Will this be a good enough compromise?

Cheers,
Longman





Re: [PATCH v5 2/2] cpuset: Add cpuset.flags control knob to v2

2018-03-19 Thread Waiman Long
On 03/19/2018 12:26 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Thu, Mar 15, 2018 at 05:20:42PM -0400, Waiman Long wrote:
>> +The currently supported flag is:
>> +
>> +  sched_load_balance
>> +When it is not set, there will be no load balancing
>> +among CPUs on this cpuset.  Tasks will stay in the
>> +CPUs they are running on and will not be moved to
>> +other CPUs.
>> +
>> +When it is set, tasks within this cpuset will be
>> +load-balanced by the kernel scheduler.  Tasks will be
>> +moved from CPUs with high load to other CPUs within
>> +the same cpuset with less load periodically.
> Hmm... looks like this is something which can be decided by the cgroup
> itself and should be made delegatable.  Given that different flags
> might need different delegation settings and the precedence of
> memory.oom_group, I think it'd be better to make the flags separate
> bool files - ie. cpuset.sched_load_balance which contains 0/1 and
> marked delegatable.
>
> Thanks.
>
Sure. Will do that.

-Longman



Re: [PATCH v5 1/9] sysctl: Add flags to support min/max range clamping

2018-03-19 Thread Waiman Long
On 03/16/2018 09:10 PM, Luis R. Rodriguez wrote:
> On Fri, Mar 16, 2018 at 02:13:42PM -0400, Waiman Long wrote:
>> When the CTL_FLAGS_CLAMP_RANGE flag is set in the ctl_table
>> entry, any update from the userspace will be clamped to the given
>> range without error if either the proc_dointvec_minmax() or the
>> proc_douintvec_minmax() handlers is used.
> I don't get it.  Why define a generic range flag when we can be more
> specific and
> you do that in your next patch. What's the point of this flag then?
>
>   Luis

I was thinking about using the signed/unsigned bits as just annotations
for ranges for future extension. For the purpose of this patchset alone,
I can merge the three bits into just two.

Cheers,
Longman




Re: [PATCH v5 2/9] proc/sysctl: Provide additional ctl_table.flags checks

2018-03-19 Thread Waiman Long
On 03/16/2018 08:54 PM, Luis R. Rodriguez wrote:
> On Fri, Mar 16, 2018 at 02:13:43PM -0400, Waiman Long wrote:
>> Checking code is added to provide the following additional
>> ctl_table.flags checks:
>>
>>  1) No unknown flag is allowed.
>>  2) Minimum of a range cannot be larger than the maximum value.
>>  3) The signed and unsigned flags are mutually exclusive.
>>  4) The proc_handler should be consistent with the signed or unsigned
>> flags.
>>
>> Two new flags are added to indicate if the min/max values are signed
>> or unsigned - CTL_FLAGS_SIGNED_RANGE & CTL_FLAGS_UNSIGNED_RANGE.
>> These 2 flags can be optionally enabled for range checking purpose.
>> But either one of them must be set with CTL_FLAGS_CLAMP_RANGE.
>>
>> Signed-off-by: Waiman Long <long...@redhat.com>
>> ---
>> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
>> index e446e1f..088f032 100644
>> --- a/include/linux/sysctl.h
>> +++ b/include/linux/sysctl.h
>> @@ -134,14 +134,26 @@ struct ctl_table
>>   *  the input value. No lower bound or upper bound checking will be
>>   *  done if the corresponding minimum or maximum value isn't provided.
>>   *
>> + * @CTL_FLAGS_SIGNED_RANGE: Set to indicate that the extra1 and extra2
>> + *  fields are pointers to minimum and maximum signed values of
>> + *  an allowable range.
>> + *
>> + * @CTL_FLAGS_UNSIGNED_RANGE: Set to indicate that the extra1 and extra2
>> + *  fields are pointers to minimum and maximum unsigned values of
>> + *  an allowable range.
>> + *
>>   * At most 16 different flags are allowed.
>>   */
>>  enum ctl_table_flags {
>>  CTL_FLAGS_CLAMP_RANGE   = BIT(0),
>> -__CTL_FLAGS_MAX = BIT(1),
>> +CTL_FLAGS_SIGNED_RANGE  = BIT(1),
>> +CTL_FLAGS_UNSIGNED_RANGE= BIT(2),
>> +__CTL_FLAGS_MAX = BIT(3),
>>  };
> You are adding new flags which the user can set, and yet these are used
> internally.
>
> It would be best if internal flags are just that, not flags that a user can 
> set.
>
> This patch should be folded with the first one.
>
> I'm starting to lose hope on these patch sets.
>
>   Luis

In order to do the correct min > max check, I need to know if the
quantity is signed or not. Looking at the proc_handler alone is not a
reliable indicator of whether it is signed or unsigned.

Yes, I can put the signed bit into the previous patch.

-Longman




[PATCH v5 5/9] ipc: Clamp semmni to the real IPCMNI limit

2018-03-16 Thread Waiman Long
For SysV semaphores, the semmni value is the last part of the 4-element
sem number array. To make semmni behave in a similar way to msgmni
and shmmni, we can't directly use the _minmax handler. Instead,
a special sem specific handler is added to check the last argument
to make sure that it is clamped to the [0, IPCMNI] range and prints
a warning message once when an out-of-range value is being written.
This does require duplicating some of the code in the _minmax handlers.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 ipc/ipc_sysctl.c | 12 +++-
 ipc/sem.c| 25 +
 ipc/util.h   |  4 
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 088721e..0ad7088 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -88,12 +88,22 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
	return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
 }
 
+static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret = proc_ipc_dointvec(table, write, buffer, lenp, ppos);
+
+   sem_check_semmni(table, current->nsproxy->ipc_ns);
+   return ret;
+}
+
 #else
 #define proc_ipc_doulongvec_minmax NULL
 #define proc_ipc_dointvec NULL
 #define proc_ipc_dointvec_minmax   NULL
 #define proc_ipc_dointvec_minmax_orphans   NULL
 #define proc_ipc_auto_msgmni  NULL
+#define proc_ipc_sem_dointvec NULL
 #endif
 
 static int zero;
@@ -177,7 +187,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
	.data		= &init_ipc_ns.sem_ctls,
.maxlen = 4*sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_ipc_dointvec,
+   .proc_handler   = proc_ipc_sem_dointvec,
},
 #ifdef CONFIG_CHECKPOINT_RESTORE
{
diff --git a/ipc/sem.c b/ipc/sem.c
index a4af049..faf2caa 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2337,3 +2337,28 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
return 0;
 }
 #endif
+
+#ifdef CONFIG_PROC_SYSCTL
+/*
+ * Check to see if semmni is out of range and clamp it if necessary.
+ */
+void sem_check_semmni(struct ctl_table *table, struct ipc_namespace *ns)
+{
+   bool clamped = false;
+
+   /*
+* Clamp semmni to the range [0, IPCMNI].
+*/
+   if (ns->sc_semmni < 0) {
+   ns->sc_semmni = 0;
+   clamped = true;
+   }
+   if (ns->sc_semmni > IPCMNI) {
+   ns->sc_semmni = IPCMNI;
+   clamped = true;
+   }
+   if (clamped)
+		pr_warn_ratelimited("sysctl: \"sem[3]\" was set out of range [%d, %d], clamped to %d.\n",
+				    0, IPCMNI, ns->sc_semmni);
+}
+#endif
diff --git a/ipc/util.h b/ipc/util.h
index 89b8ec1..af57394 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -206,6 +206,10 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
 
+#ifdef CONFIG_PROC_SYSCTL
+extern void sem_check_semmni(struct ctl_table *table, struct ipc_namespace *ns);
+#endif
+
 #ifdef CONFIG_COMPAT
#include <linux/compat.h>
 struct compat_ipc_perm {
-- 
1.8.3.1



[PATCH v5 4/9] ipc: Clamp msgmni and shmmni to the real IPCMNI limit

2018-03-16 Thread Waiman Long
A user can write arbitrary integer values to the msgmni and shmmni sysctl
parameters without getting an error, but the actual limit is really
IPCMNI (32k). This can mislead users into thinking they can set a value
that is not real.

Enforcing the limit by failing the sysctl parameter write, however,
can break existing user applications. Instead, the range clamping flag
is set to enforce the limit without failing existing user code. Users
can easily figure out if the sysctl parameter value is out of range
by either reading back the parameter value or checking the kernel
ring buffer for a warning.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 ipc/ipc_sysctl.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 8ad93c2..088721e 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -99,6 +99,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
+static int ipc_mni = IPCMNI;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -120,7 +121,10 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
	.data		= &init_ipc_ns.shm_ctlmni,
	.maxlen		= sizeof(init_ipc_ns.shm_ctlmni),
	.mode		= 0644,
-	.proc_handler	= proc_ipc_dointvec,
+	.proc_handler	= proc_ipc_dointvec_minmax,
+	.extra1		= &zero,
+	.extra2		= &ipc_mni,
+   .flags  = CTL_FLAGS_CLAMP_RANGE_SIGNED,
},
{
.procname   = "shm_rmid_forced",
@@ -147,7 +151,8 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
	.mode		= 0644,
	.proc_handler	= proc_ipc_dointvec_minmax,
	.extra1		= &zero,
-	.extra2		= &int_max,
+	.extra2		= &ipc_mni,
+   .flags  = CTL_FLAGS_CLAMP_RANGE_SIGNED,
},
{
.procname   = "auto_msgmni",
-- 
1.8.3.1



[PATCH v5 7/9] test_sysctl: Add ctl_table registration failure test

2018-03-16 Thread Waiman Long
Incorrect sysctl tables are constructed and fed to the
register_sysctl_table() function in the test_sysctl kernel module.
The function is supposed to fail the registration of those tables; an
error message is printed if no failure is returned.

The registration failures will cause other warning and error messages
to be printed into the dmesg log, though.

A new test is also added to the sysctl.sh to look for those failure
messages in the dmesg log to see if anything unexpeced happens.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 lib/test_sysctl.c| 41 
 tools/testing/selftests/sysctl/sysctl.sh | 15 
 2 files changed, 56 insertions(+)

diff --git a/lib/test_sysctl.c b/lib/test_sysctl.c
index 7bb4cf7..14853d5 100644
--- a/lib/test_sysctl.c
+++ b/lib/test_sysctl.c
@@ -154,13 +154,54 @@ struct test_sysctl_data {
{ }
 };
 
+static struct ctl_table fail_sysctl_table0[] = {
+   {
+   .procname   = "failed_sysctl0",
+		.data		= &test_data.range_0001,
+   .maxlen = sizeof(test_data.range_0001),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .flags  = CTL_FLAGS_CLAMP_RANGE_SIGNED,
+		.extra1		= &signed_max,
+		.extra2		= &signed_min,
+   },
+   { }
+};
+
+static struct ctl_table fail_sysctl_root_table[] = {
+   {
+   .procname   = "debug",
+   .maxlen = 0,
+   .mode   = 0555,
+   },
+   { }
+};
+
+static struct ctl_table *fail_tables[] = {
+   fail_sysctl_table0, NULL,
+};
+
 static struct ctl_table_header *test_sysctl_header;
 
 static int __init test_sysctl_init(void)
 {
+   struct ctl_table_header *fail_sysctl_header;
+   int i;
+
test_sysctl_header = register_sysctl_table(test_sysctl_root_table);
if (!test_sysctl_header)
return -ENOMEM;
+
+   for (i = 0; fail_tables[i]; i++) {
+   fail_sysctl_root_table[0].child = fail_tables[i];
+		fail_sysctl_header = register_sysctl_table(fail_sysctl_root_table);
+		if (fail_sysctl_header) {
+			pr_err("fail_tables[%d] registration check failed!\n", i);
+   unregister_sysctl_table(fail_sysctl_header);
+   break;
+   }
+   }
+
return 0;
 }
 late_initcall(test_sysctl_init);
diff --git a/tools/testing/selftests/sysctl/sysctl.sh b/tools/testing/selftests/sysctl/sysctl.sh
index 1aa1bba..23acdee 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -35,6 +35,7 @@ ALL_TESTS="$ALL_TESTS 0003:1:1"
 ALL_TESTS="$ALL_TESTS 0004:1:1"
 ALL_TESTS="$ALL_TESTS 0005:3:1"
 ALL_TESTS="$ALL_TESTS 0006:1:1"
+ALL_TESTS="$ALL_TESTS 0007:1:1"
 
 test_modprobe()
 {
@@ -652,6 +653,20 @@ sysctl_test_0006()
set_orig
 }
 
+sysctl_test_0007()
+{
+   echo "Checking test_sysctl module registration failure test ..."
+   dmesg | grep "sysctl.*fail_tables.*failed"
+   if [[ $? -eq 0 ]]; then
+   echo "FAIL" >&2
+   rc=1
+   else
+   echo "ok"
+   fi
+
+   test_rc
+}
+
 list_tests()
 {
echo "Test ID list:"
-- 
1.8.3.1



[PATCH v5 8/9] ipc: Allow boot time extension of IPCMNI from 32k to 2M

2018-03-16 Thread Waiman Long
The maximum number of unique System V IPC identifiers was limited to
32k.  That limit should be big enough for most use cases.

However, there are some users out there requesting more. To satisfy
the need of those users, a new boot time kernel option "ipcmni_extend"
is added to extend the IPCMNI value to 2M. This is a 64X increase which
hopefully is big enough for them.

This new option does have the side effect of reducing the maximum
number of unique sequence numbers from 64k down to 1k. So it is
a trade-off.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 include/linux/ipc.h | 11 ++-
 ipc/ipc_sysctl.c| 12 +++-
 ipc/util.c  | 12 ++--
 ipc/util.h  | 18 +++---
 5 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1d1d53f..2be35a4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1733,6 +1733,9 @@
ip= [IP_PNP]
See Documentation/filesystems/nfs/nfsroot.txt.
 
+   ipcmni_extend   [KNL] Extend the maximum number of unique System V
+   IPC identifiers from 32768 to 2097152.
+
irqaffinity=[SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
 
diff --git a/include/linux/ipc.h b/include/linux/ipc.h
index 821b2f2..3ecd869 100644
--- a/include/linux/ipc.h
+++ b/include/linux/ipc.h
@@ -8,7 +8,16 @@
 #include 
 #include 
 
-#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl 
changes) */
+/*
+ * By default, the ipc arrays can have up to 32k (15 bits) entries.
+ * When IPCMNI extension mode is turned on, the ipc arrays can have up
+ * to 2M (21 bits) entries. However, the space for sequence number will
+ * be shrunk from 16 bits to 10 bits.
+ */
+#define IPCMNI_SHIFT   15
+#define IPCMNI_EXTEND_SHIFT21
+#define IPCMNI (1 << IPCMNI_SHIFT)
+#define IPCMNI_EXTEND  (1 << IPCMNI_EXTEND_SHIFT)
 
 /* used by in-kernel data structures */
 struct kern_ipc_perm {
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 0ad7088..5f7cfae 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -109,7 +109,8 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
-static int ipc_mni = IPCMNI;
+int ipc_mni __read_mostly = IPCMNI;
+int ipc_mni_shift __read_mostly = IPCMNI_SHIFT;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -237,3 +238,12 @@ static int __init ipc_sysctl_init(void)
 }
 
 device_initcall(ipc_sysctl_init);
+
+static int __init ipc_mni_extend(char *str)
+{
+   ipc_mni = IPCMNI_EXTEND;
+   ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
+   pr_info("IPCMNI extended to %d.\n", ipc_mni);
+   return 0;
+}
+early_param("ipcmni_extend", ipc_mni_extend);
diff --git a/ipc/util.c b/ipc/util.c
index 4ed5a17..daee305 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -112,7 +112,7 @@ static int __init ipc_init(void)
  * @ids: ipc identifier set
  *
  * Set up the sequence range to use for the ipc identifier range (limited
- * below IPCMNI) then initialise the keys hashtable and ids idr.
+ * below ipc_mni) then initialise the keys hashtable and ids idr.
  */
 int ipc_init_ids(struct ipc_ids *ids)
 {
@@ -213,7 +213,7 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
ids->next_id = -1;
}
 
-   return SEQ_MULTIPLIER * new->seq + id;
+   return (new->seq << SEQ_SHIFT) + id;
 }
 
 #else
@@ -227,7 +227,7 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
 
-   return SEQ_MULTIPLIER * new->seq + id;
+   return (new->seq << SEQ_SHIFT) + id;
 }
 
 #endif /* CONFIG_CHECKPOINT_RESTORE */
@@ -251,8 +251,8 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
kgid_t egid;
int id, err;
 
-   if (limit > IPCMNI)
-   limit = IPCMNI;
+   if (limit > ipc_mni)
+   limit = ipc_mni;
 
if (!ids->tables_initialized || ids->in_use >= limit)
return -ENOSPC;
@@ -769,7 +769,7 @@ static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos,
if (total >= ids->in_use)
return NULL;
 
-   for (; pos < IPCMNI; pos++) {
+   for (; pos < ipc_mni; pos++) {
		ipc = idr_find(&ids->ipcs_idr, pos);
if (ipc != NULL) {
			*new_pos = pos + 1;

[PATCH v5 9/9] ipc: Conserve sequence numbers in extended IPCMNI mode

2018-03-16 Thread Waiman Long
Mixing a sequence number into the IPC IDs is probably done to avoid ID
reuse in userspace as much as possible. With the extended IPCMNI mode,
the number of usable sequence numbers is greatly reduced, leading to a
higher chance of ID reuse.

To address this issue, we need to conserve the sequence number space
as much as possible. Right now, the sequence number is incremented
for every new ID created. In reality, we only need to increment the
sequence number when one or more IDs have been removed previously, to
make sure that those IDs will not be reused when a new one is built.
This is done in the extended IPCMNI mode.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 include/linux/ipc_namespace.h |  1 +
 ipc/ipc_sysctl.c  |  2 ++
 ipc/util.c| 29 ++---
 ipc/util.h|  1 +
 4 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8..9c86fd9 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -16,6 +16,7 @@
 struct ipc_ids {
int in_use;
unsigned short seq;
+   unsigned short deleted;
bool tables_initialized;
struct rw_semaphore rwsem;
struct idr ipcs_idr;
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 5f7cfae..61a832d 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -111,6 +111,7 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 static int int_max = INT_MAX;
 int ipc_mni __read_mostly = IPCMNI;
 int ipc_mni_shift __read_mostly = IPCMNI_SHIFT;
+bool ipc_mni_extended __read_mostly;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -243,6 +244,7 @@ static int __init ipc_mni_extend(char *str)
 {
ipc_mni = IPCMNI_EXTEND;
ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
+   ipc_mni_extended = true;
pr_info("IPCMNI extended to %d.\n", ipc_mni);
return 0;
 }
diff --git a/ipc/util.c b/ipc/util.c
index daee305..8b38a6f 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -118,7 +118,8 @@ int ipc_init_ids(struct ipc_ids *ids)
 {
int err;
ids->in_use = 0;
-   ids->seq = 0;
+   ids->deleted = false;
+   ids->seq = ipc_mni_extended ? 0 : -1; /* seq # is pre-incremented */
init_rwsem(>rwsem);
	err = rhashtable_init(&ids->key_ht, &ipc_kht_params);
if (err)
@@ -192,6 +193,11 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
return NULL;
 }
 
+/*
+ * To conserve sequence number space with extended ipc_mni when new ID
+ * is built, the sequence number is incremented only when one or more
+ * IDs have been removed previously.
+ */
 #ifdef CONFIG_CHECKPOINT_RESTORE
 /*
  * Specify desired id for next allocated IPC object.
@@ -205,9 +211,13 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
  struct kern_ipc_perm *new)
 {
if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
+   if (!ipc_mni_extended || ids->deleted) {
+   ids->seq++;
+   if (ids->seq > IPCID_SEQ_MAX)
+   ids->seq = 0;
+   ids->deleted = false;
+   }
+   new->seq = ids->seq;
} else {
new->seq = ipcid_to_seqx(ids->next_id);
ids->next_id = -1;
@@ -223,9 +233,13 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
 static inline int ipc_buildid(int id, struct ipc_ids *ids,
  struct kern_ipc_perm *new)
 {
-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
+   if (!ipc_mni_extended || ids->deleted) {
+   ids->seq++;
+   if (ids->seq > IPCID_SEQ_MAX)
+   ids->seq = 0;
+   ids->deleted = false;
+   }
+   new->seq = ids->seq;
 
return (new->seq << SEQ_SHIFT) + id;
 }
@@ -435,6 +449,7 @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
idr_remove(>ipcs_idr, lid);
ipc_kht_remove(ids, ipcp);
ids->in_use--;
+   ids->deleted = true;
ipcp->deleted = true;
 
if (unlikely(lid == ids->max_id)) {
diff --git a/ipc/util.h b/ipc/util.h
index 6871ca9..e6c2055 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -17,6 +17,7 @@
 
 extern int ipc_mni;
 extern int ipc_mni_shift;
+extern bool ipc_mni_extended;
 
 #define SEQ_SHIFT  ipc_mni_shift
 #define SEQ_MASK   ((1 << ipc_mni_shift) - 1)
-- 
1.8.3.1



[PATCH v5 6/9] test_sysctl: Add range clamping test

2018-03-16 Thread Waiman Long
Add a range clamping test to verify that the input value will be
clamped if it exceeds the builtin maximum or minimum value.

Below is the expected test run result:

Running test: sysctl_test_0006 - run #0
Checking range minimum clamping ... ok
Checking range maximum clamping ... ok
Checking range minimum clamping ... ok
Checking range maximum clamping ... ok

Signed-off-by: Waiman Long <long...@redhat.com>
---
 lib/test_sysctl.c| 29 ++
 tools/testing/selftests/sysctl/sysctl.sh | 52 
 2 files changed, 81 insertions(+)

diff --git a/lib/test_sysctl.c b/lib/test_sysctl.c
index 3dd801c..7bb4cf7 100644
--- a/lib/test_sysctl.c
+++ b/lib/test_sysctl.c
@@ -38,12 +38,18 @@
 
 static int i_zero;
 static int i_one_hundred = 100;
+static int signed_min = -10;
+static int signed_max = 10;
+static unsigned int unsigned_min = 10;
+static unsigned int unsigned_max = 30;
 
 struct test_sysctl_data {
int int_0001;
int int_0002;
int int_0003[4];
+   int range_0001;
 
+   unsigned int urange_0001;
unsigned int uint_0001;
 
char string_0001[65];
@@ -58,6 +64,9 @@ struct test_sysctl_data {
.int_0003[2] = 2,
.int_0003[3] = 3,
 
+   .range_0001 = 0,
+   .urange_0001 = 20,
+
.uint_0001 = 314,
 
.string_0001 = "(none)",
@@ -102,6 +111,26 @@ struct test_sysctl_data {
.mode   = 0644,
.proc_handler   = proc_dostring,
},
+   {
+   .procname   = "range_0001",
+		.data		= &test_data.range_0001,
+   .maxlen = sizeof(test_data.range_0001),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .flags  = CTL_FLAGS_CLAMP_RANGE_SIGNED,
+		.extra1		= &signed_min,
+		.extra2		= &signed_max,
+   },
+   {
+   .procname   = "urange_0001",
+		.data		= &test_data.urange_0001,
+   .maxlen = sizeof(test_data.urange_0001),
+   .mode   = 0644,
+   .proc_handler   = proc_douintvec_minmax,
+   .flags  = CTL_FLAGS_CLAMP_RANGE_UNSIGNED,
+		.extra1		= &unsigned_min,
+		.extra2		= &unsigned_max,
+   },
{ }
 };
 
diff --git a/tools/testing/selftests/sysctl/sysctl.sh b/tools/testing/selftests/sysctl/sysctl.sh
index ec232c3..1aa1bba 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -34,6 +34,7 @@ ALL_TESTS="$ALL_TESTS 0002:1:1"
 ALL_TESTS="$ALL_TESTS 0003:1:1"
 ALL_TESTS="$ALL_TESTS 0004:1:1"
 ALL_TESTS="$ALL_TESTS 0005:3:1"
+ALL_TESTS="$ALL_TESTS 0006:1:1"
 
 test_modprobe()
 {
@@ -543,6 +544,38 @@ run_stringtests()
test_rc
 }
 
+# TARGET, RANGE_MIN & RANGE_MAX need to be defined before running test.
+run_range_clamping_test()
+{
+   rc=0
+
+   echo -n "Checking range minimum clamping ... "
+   VAL=$((RANGE_MIN - 1))
+   echo -n $VAL > "${TARGET}" 2> /dev/null
+   EXITVAL=$?
+   NEWVAL=$(cat "${TARGET}")
+   if [[ $EXITVAL -ne 0 || $NEWVAL -ne $RANGE_MIN ]]; then
+   echo "FAIL" >&2
+   rc=1
+   else
+   echo "ok"
+   fi
+
+   echo -n "Checking range maximum clamping ... "
+   VAL=$((RANGE_MAX + 1))
+   echo -n $VAL > "${TARGET}" 2> /dev/null
+   EXITVAL=$?
+   NEWVAL=$(cat "${TARGET}")
+   if [[ $EXITVAL -ne 0 || $NEWVAL -ne $RANGE_MAX ]]; then
+   echo "FAIL" >&2
+   rc=1
+   else
+   echo "ok"
+   fi
+
+   test_rc
+}
+
 sysctl_test_0001()
 {
TARGET="${SYSCTL}/int_0001"
@@ -600,6 +633,25 @@ sysctl_test_0005()
run_limit_digit_int_array
 }
 
+sysctl_test_0006()
+{
+   TARGET="${SYSCTL}/range_0001"
+   ORIG=$(cat "${TARGET}")
+   RANGE_MIN=-10
+   RANGE_MAX=10
+
+   run_range_clamping_test
+   set_orig
+
+   TARGET="${SYSCTL}/urange_0001"
+   ORIG=$(cat "${TARGET}")
+   RANGE_MIN=10
+   RANGE_MAX=30
+
+   run_range_clamping_test
+   set_orig
+}
+
 list_tests()
 {
echo "Test ID list:"
-- 
1.8.3.1



[PATCH v5 2/9] proc/sysctl: Provide additional ctl_table.flags checks

2018-03-16 Thread Waiman Long
Checking code is added to provide the following additional
ctl_table.flags checks:

 1) No unknown flag is allowed.
 2) Minimum of a range cannot be larger than the maximum value.
 3) The signed and unsigned flags are mutually exclusive.
 4) The proc_handler should be consistent with the signed or unsigned
flags.

Two new flags are added to indicate if the min/max values are signed
or unsigned - CTL_FLAGS_SIGNED_RANGE & CTL_FLAGS_UNSIGNED_RANGE.
These 2 flags can be optionally enabled for range checking purpose.
But either one of them must be set with CTL_FLAGS_CLAMP_RANGE.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 fs/proc/proc_sysctl.c  | 62 ++
 include/linux/sysctl.h | 16 +++--
 2 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 493c975..2863ea1 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -1092,6 +1092,66 @@ static int sysctl_check_table_array(const char *path, struct ctl_table *table)
return err;
 }
 
+static int sysctl_check_flags(const char *path, struct ctl_table *table)
+{
+   int err = 0;
+   uint16_t sign_flags = CTL_FLAGS_SIGNED_RANGE|CTL_FLAGS_UNSIGNED_RANGE;
+
+   if ((table->flags & ~CTL_TABLE_FLAGS_ALL) ||
+  ((table->flags & sign_flags) == sign_flags))
+   err = sysctl_err(path, table, "invalid flags");
+
+   if (table->flags & (CTL_FLAGS_CLAMP_RANGE | sign_flags)) {
+   int range_err = 0;
+   bool is_int = (table->maxlen == sizeof(int));
+
+   if (!is_int && (table->maxlen != sizeof(long))) {
+   range_err++;
+   } else if (!table->extra1 || !table->extra2) {
+   /* No min > max checking needed */
+   } else if (table->flags & CTL_FLAGS_UNSIGNED_RANGE) {
+   unsigned long min, max;
+
+   min = is_int ? *(unsigned int *)table->extra1
+: *(unsigned long *)table->extra1;
+   max = is_int ? *(unsigned int *)table->extra2
+: *(unsigned long *)table->extra2;
+   range_err += (min > max);
+   } else if (table->flags & CTL_FLAGS_SIGNED_RANGE) {
+
+   long min, max;
+
+   min = is_int ? *(int *)table->extra1
+: *(long *)table->extra1;
+   max = is_int ? *(int *)table->extra2
+: *(long *)table->extra2;
+   range_err += (min > max);
+   } else {
+   /*
+* Either CTL_FLAGS_UNSIGNED_RANGE or
+* CTL_FLAGS_SIGNED_RANGE should be set.
+*/
+   range_err++;
+   }
+
+   /*
+* proc_handler and flag consistency check.
+*/
+   if (((table->proc_handler == proc_douintvec_minmax)   ||
+(table->proc_handler == proc_doulongvec_minmax)) &&
+   !(table->flags & CTL_FLAGS_UNSIGNED_RANGE))
+   range_err++;
+
+   if ((table->proc_handler == proc_dointvec_minmax) &&
+  !(table->flags & CTL_FLAGS_SIGNED_RANGE))
+   range_err++;
+
+   if (range_err)
+   err |= sysctl_err(path, table, "Invalid range");
+   }
+   return err;
+}
+
 static int sysctl_check_table(const char *path, struct ctl_table *table)
 {
int err = 0;
@@ -,6 +1171,8 @@ static int sysctl_check_table(const char *path, struct 
ctl_table *table)
(table->proc_handler == proc_doulongvec_ms_jiffies_minmax)) 
{
if (!table->data)
err |= sysctl_err(path, table, "No data");
+   if (table->flags)
+   err |= sysctl_check_flags(path, table);
if (!table->maxlen)
err |= sysctl_err(path, table, "No maxlen");
else
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index e446e1f..088f032 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -134,14 +134,26 @@ struct ctl_table
  * the input value. No lower bound or upper bound checking will be
  * done if the corresponding minimum or maximum value isn't provided.
  *
+ * @CTL_FLAGS_SIGNED_RANGE: Set to indicate that the extra1 and extra2
+ * fields are pointers to minimum and maximum signed values of
+ * an allowable range.
+ *
+ * @CTL_FLAGS_UN

[PATCH v5 3/9] sysctl: Warn when a clamped sysctl parameter is set out of range

2018-03-16 Thread Waiman Long
Even with clamped sysctl parameters, it is still not straightforward
to figure out the exact range of those parameters. One may try to
write extreme parameter values to see if they get clamped. To make
this easier, a warning with the expected range will now be printed
into the kernel ring buffer when a clamped sysctl parameter receives
an out-of-range value.

The pr_warn_ratelimited() macro is used to limit the number of warning
messages that can be printed within a given period of time.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 kernel/sysctl.c | 44 
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index af351ed..a9e3ed4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -17,6 +17,7 @@
  * The list_for_each() macro wasn't appropriate for the sysctl loop.
  *  Removed it and replaced it with older style, 03/23/00, Bill Wendling
  */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include 
 #include 
@@ -2505,6 +2506,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table 
*table, int write,
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
  * @flags: pointer to flags
+ * @name: sysctl parameter name
  *
  * The do_proc_dointvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2514,6 +2516,7 @@ struct do_proc_dointvec_minmax_conv_param {
int *min;
int *max;
uint16_t *flags;
+   const char *name;
 };
 
 static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
@@ -2521,24 +2524,35 @@ static int do_proc_dointvec_minmax_conv(bool *negp, 
unsigned long *lvalp,
int write, void *data)
 {
struct do_proc_dointvec_minmax_conv_param *param = data;
+
if (write) {
int val = *negp ? -*lvalp : *lvalp;
+   bool clamped = false;
bool clamp = param->flags &&
   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
 
if (param->min && *param->min > val) {
-   if (clamp)
+   if (clamp) {
val = *param->min;
-   else
+   clamped = true;
+   } else {
return -EINVAL;
+   }
}
if (param->max && *param->max < val) {
-   if (clamp)
+   if (clamp) {
val = *param->max;
-   else
+   clamped = true;
+   } else {
return -EINVAL;
+   }
}
*valp = val;
+   if (clamped && param->name)
+   pr_warn_ratelimited("\"%s\" was set out of range [%d, 
%d], clamped to %d.\n",
+   param->name,
+   param->min ? *param->min : -INT_MAX,
+   param->max ? *param->max :  INT_MAX, val);
} else {
int val = *valp;
if (val < 0) {
@@ -2576,6 +2590,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int 
write,
.min = (int *) table->extra1,
.max = (int *) table->extra2,
	.flags = &table->flags,
+   .name  = table->procname,
};
return do_proc_dointvec(table, write, buffer, lenp, ppos,
do_proc_dointvec_minmax_conv, );
@@ -2586,6 +2601,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int 
write,
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
  * @flags: pointer to flags
+ * @name: sysctl parameter name
  *
  * The do_proc_douintvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2595,6 +2611,7 @@ struct do_proc_douintvec_minmax_conv_param {
unsigned int *min;
unsigned int *max;
uint16_t *flags;
+   const char *name;
 };
 
 static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
@@ -2605,6 +2622,7 @@ static int do_proc_douintvec_minmax_conv(unsigned long 
*lvalp,
 
if (write) {
unsigned int val = *lvalp;
+   bool clamped = false;
bool clamp = param->flags &&
   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
 
@@ -2612,18 +2630,27 @@ static int do_proc_douintvec_minmax_conv(unsigned long 
*lvalp,
return -EINVAL;
 
if (param->min && *param-&

[PATCH v5 1/9] sysctl: Add flags to support min/max range clamping

2018-03-16 Thread Waiman Long
When minimum/maximum values are specified for a sysctl parameter in
the ctl_table structure with proc_dointvec_minmax() handler, update
to that parameter will fail with error if the given value is outside
of the required range.

There are use cases where it may be better to clamp the value of
the sysctl parameter to the given range without failing the update,
especially if the users are not aware of the actual range limits.
Reading the value back after the update is then a good way to see
if the provided value exceeded the range limits.

To provide this less restrictive form of range checking, a new flags
field is added to the ctl_table structure. The new field is a 16-bit
value that just fits into the hole left by the 16-bit umode_t field
without increasing the size of the structure.

When the CTL_FLAGS_CLAMP_RANGE flag is set in the ctl_table
entry, any update from the userspace will be clamped to the given
range without error if either the proc_dointvec_minmax() or the
proc_douintvec_minmax() handlers is used.

The clamped value is either the maximum or minimum value that is
closest to the input value provided by the user.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 include/linux/sysctl.h | 20 
 kernel/sysctl.c| 48 +++-
 2 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index b769ecf..e446e1f 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -116,6 +116,7 @@ struct ctl_table
void *data;
int maxlen;
umode_t mode;
+   uint16_t flags;
struct ctl_table *child;/* Deprecated */
proc_handler *proc_handler; /* Callback for text formatting */
struct ctl_table_poll *poll;
@@ -123,6 +124,25 @@ struct ctl_table
void *extra2;
 } __randomize_layout;
 
+/**
+ * enum ctl_table_flags - flags for the ctl table (struct ctl_table.flags)
+ *
+ * @CTL_FLAGS_CLAMP_RANGE: Set to indicate that the entry should be
+ * flexibly clamped to the provided min/max value in case the user
+ * provided a value outside of the given range. The clamped value is
+ * either the provided minimum or maximum value that is closest to
+ * the input value. No lower bound or upper bound checking will be
+ * done if the corresponding minimum or maximum value isn't provided.
+ *
+ * At most 16 different flags are allowed.
+ */
+enum ctl_table_flags {
+   CTL_FLAGS_CLAMP_RANGE   = BIT(0),
+   __CTL_FLAGS_MAX = BIT(1),
+};
+
+#define CTL_TABLE_FLAGS_ALL(__CTL_FLAGS_MAX - 1)
+
 struct ctl_node {
struct rb_node node;
struct ctl_table_header *header;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d2aa6b4..af351ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2504,6 +2504,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table 
*table, int write,
  * struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range 
checking structure
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
+ * @flags: pointer to flags
  *
  * The do_proc_dointvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2512,6 +2513,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table 
*table, int write,
 struct do_proc_dointvec_minmax_conv_param {
int *min;
int *max;
+   uint16_t *flags;
 };
 
 static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
@@ -2521,9 +2523,21 @@ static int do_proc_dointvec_minmax_conv(bool *negp, 
unsigned long *lvalp,
struct do_proc_dointvec_minmax_conv_param *param = data;
if (write) {
int val = *negp ? -*lvalp : *lvalp;
-   if ((param->min && *param->min > val) ||
-   (param->max && *param->max < val))
-   return -EINVAL;
+   bool clamp = param->flags &&
+  (*param->flags & CTL_FLAGS_CLAMP_RANGE);
+
+   if (param->min && *param->min > val) {
+   if (clamp)
+   val = *param->min;
+   else
+   return -EINVAL;
+   }
+   if (param->max && *param->max < val) {
+   if (clamp)
+   val = *param->max;
+   else
+   return -EINVAL;
+   }
*valp = val;
} else {
int val = *valp;
@@ -2552,7 +2566,8 @@ static int do_proc_dointvec_minmax_conv(bool *negp, 
unsigned long *lvalp,
  * This routine will ensure the values are within the range specified by
  * table->extra1 (min) and table->

[PATCH v5 0/9] ipc: Clamp *mni to the real IPCMNI limit & increase that limit

2018-03-16 Thread Waiman Long
v4->v5:
 - Revert the flags back to 16-bit so that there will be no change to
   the size of ctl_table.
 - Enhance the sysctl_check_flags() as requested by Luis to perform more
   checks to spot incorrect ctl_table entries.
 - Change the sysctl selftest to use dummy sysctls instead of production
   ones & enhance it to do more checks.
 - Add one more sysctl selftest for registration failure.
 - Add 2 ipc patches to add an extended mode to increase IPCMNI from
   32k to 2M.
 - Miscellaneous change to incorporate feedback comments from
   reviewers.

v3->v4:
 - Remove v3 patches 1 & 2 as they have been merged into the mm tree.
 - Change flags from uint16_t to unsigned int.
 - Remove CTL_FLAGS_OOR_WARNED and use pr_warn_ratelimited() instead.
 - Simplify the warning message code.
 - Add a new patch to fail the ctl_table registration with invalid flag.
 - Add a test case for range clamping in sysctl selftest.

v2->v3:
 - Fix kdoc comment errors.
 - Incorporate comments and suggestions from Luis R. Rodriguez.
 - Add a patch to fix a typo error in fs/proc/proc_sysctl.c.

v1->v2:
 - Add kdoc comments to the do_proc_do{u}intvec_minmax_conv_param
   structures.
 - Add a new flags field to the ctl_table structure for specifying
   whether range clamping should be activated instead of adding new
   sysctl parameter handlers.
 - Clamp the semmni value embedded in the multi-values sem parameter.

v1 patch: https://lkml.org/lkml/2018/2/19/453
v2 patch: https://lkml.org/lkml/2018/2/27/627
v3 patch: https://lkml.org/lkml/2018/3/1/716 
v4 patch: https://lkml.org/lkml/2018/3/12/867

The sysctl parameters msgmni, shmmni and semmni have an inherent limit
of IPC_MNI (32k). However, users may not be aware of that because they
can write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.

Enforcing the limit by failing sysctl parameter write, however, may
cause regressions if existing user setup scripts set those parameters
above 32k as those scripts will now fail in this case.

To address this dilemma, a new flags field is introduced into
the ctl_table. The value CTL_FLAGS_CLAMP_RANGE can be added to any
ctl_table entries to enable a looser range clamping without returning
any error. For example,

  .flags = CTL_FLAGS_CLAMP_RANGE,

This flag is now used for the range checking of shmmni,
msgmni and semmni without breaking existing applications. If any out
of range value is written to those sysctl parameters, the following
warning will be printed instead.

  sysctl: "shmmni" was set out of range [0, 32768], clamped to 32768.

Reading the values back will show 32768 instead of some fake values.

New sysctl selftests are added to exercise new code added by this
patchset.

There are users out there requesting increase in the IPCMNI value.
The last 2 patches attempt to do that by using a boot kernel parameter
"ipcmni_extend" to increase the IPCMNI limit from 32k to 2M.

Eric Biederman had posted an RFC patch to just scrap the IPCMNI limit
and open up the whole positive integer space for IPC IDs. A major
issue that I have with this approach is that SysV IPC has been in use
for over 20 years. We just don't know if there are user applications
that depend on the way that the IDs are built. So a drastic change
like this has the potential of breaking some applications.

I prefer a more conservative approach where users will observe no
change in behavior unless they explicitly opt in to enable the extended
mode. I could open up the whole positive integer space in this case
like what Eric did, but that will make the code more complex.  So I
just extend IPCMNI to 2M in this case and keep similar ID generation
logic.

Waiman Long (9):
  sysctl: Add flags to support min/max range clamping
  proc/sysctl: Provide additional ctl_table.flags checks
  sysctl: Warn when a clamped sysctl parameter is set out of range
  ipc: Clamp msgmni and shmmni to the real IPCMNI limit
  ipc: Clamp semmni to the real IPCMNI limit
  test_sysctl: Add range clamping test
  test_sysctl: Add ctl_table registration failure test
  ipc: Allow boot time extension of IPCMNI from 32k to 2M
  ipc: Conserve sequence numbers in extended IPCMNI mode

 Documentation/admin-guide/kernel-parameters.txt |  3 +
 fs/proc/proc_sysctl.c   | 62 
 include/linux/ipc.h | 11 +++-
 include/linux/ipc_namespace.h   |  1 +
 include/linux/sysctl.h  | 32 +++
 ipc/ipc_sysctl.c| 33 ++-
 ipc/sem.c   | 25 
 ipc/util.c  | 41 -
 ipc/util.h  | 23 +---
 kernel/sysctl.c | 76 ++---
 lib/test_sys

[PATCH v5 1/2] cpuset: Enable cpuset controller in default hierarchy

2018-03-15 Thread Waiman Long
Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to the
memory controller or perhaps to the cpu controller instead of keeping
them with cpuset.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt | 96 -
 kernel/cgroup/cpuset.c  | 44 +++--
 2 files changed, 127 insertions(+), 13 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..b91fd5d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,16 +48,18 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
- 5-3. IO
-   5-3-1. IO Interface Files
-   5-3-2. Writeback
- 5-4. PID
-   5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
-   5-6-1. RDMA Interface Files
- 5-7. Misc
-   5-7-1. perf_event
+ 5-3. Cpuset
+   5.3-1. Cpuset Interface Files
+ 5-4. IO
+   5-4-1. IO Interface Files
+   5-4-2. Writeback
+ 5-5. PID
+   5-5-1. PID Interface Files
+ 5-6. Device
+ 5-7. RDMA
+   5-7-1. RDMA Interface Files
+ 5-8. Misc
+   5-8-1. perf_event
  5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1243,6 +1245,80 @@ POSIX_FADV_DONTNEED to relinquish the ownership of 
memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
+Cpuset
+--
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~
+
+  cpuset.cpus
+   A read-write multiple values file which exists on non-root
+   cgroups.
+
+   It lists the CPUs allowed to be used by tasks within this
+   cgroup.  The CPU numbers are comma-separated numbers or
+   ranges.  For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.cpus" or all the available CPUs if none is found.
+
+   The value of "cpuset.cpus" stays constant until the next update
+   and won't be affected by any CPU hotplug events.
+
+  cpuset.effective_cpus
+   A read-only multiple values file which exists on non-root
+   cgroups.
+
+   It lists the onlined CPUs that are actually allowed to be
+   used by tasks within the current cgroup. It is a subset of
+   "cpuset.cpus".  Its value will be affected by CPU hotplug
+   events.
+
+  cpuset.mems
+   A read-write multiple values file which exists on non-root
+   cgroups.
+
+   It lists the memory nodes allowed to be used by tasks within
+   this cgroup.  The memory node numbers are comma-separated
+   numbers or ranges.  For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.mems" or all the available memory nodes if none
+   is found.
+
+   The value of "cpuset.mems" stays constant until the next update
+   and won't be affected by any memory nodes hotplug events.
+
+  cpuset.effective_mems
+   A read-only multiple values file which exists on non-root
+   cgroups.
+
+   It lists the onlined memory nodes that are actually al

[PATCH v5 0/2] cpuset: Enable cpuset controller in default hierarchy

2018-03-15 Thread Waiman Long
v5:
 - Add patch 2 to provide the cpuset.flags control knob for the
   sched_load_balance flag which should be the only feature that is
   essential as a replacement of the "isolcpus" kernel boot parameter.

v4:
 - Further minimize the feature set by removing the flags control knob.

v3:
 - Further trim the additional features down to just memory_migrate.
 - Update Documentation/cgroup-v2.txt.

The purpose of this patchset is to provide a minimal set of cpuset
features for cgroup v2. That minimal set includes the cpus, mems and
their effective_* counterparts as well as a new flags control knob
that currently supports only the sched_load_balance flag.

This patchset does not exclude the possibility of adding more flags
and features in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their
effective_* counterparts.

Patch 2 adds flags with support for the sched_load_balance only.

Waiman Long (2):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Add cpuset.flags control knob to v2

 Documentation/cgroup-v2.txt | 128 
 kernel/cgroup/cpuset.c  | 140 +++-
 2 files changed, 256 insertions(+), 12 deletions(-)

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] cpuset: Enable cpuset controller in default hierarchy

2018-03-12 Thread Waiman Long
On 03/10/2018 08:16 AM, Peter Zijlstra wrote:
> On Fri, Mar 09, 2018 at 06:06:29PM -0500, Waiman Long wrote:
>> So you are talking about sched_relax_domain_level and
> That one I wouldn't be sad to see the back of.
>
>> sched_load_balance.
> This one, that's critical. And this is the perfect time to try and fix
> the whole isolcpus issue.
>
> The primary issue is that to make equivalent functionality available
> through cpuset, we need to basically start all tasks outside the root
> group.
>
> The equivalent of isolcpus=xxx is a cgroup setup like:
>
> root
>   /  \
>   systemother
>
> Where other has the @xxx cpus and system the remainder and
> root.sched_load_balance = 0.

I saw in the kernel-parameters.txt file that the isolcpus option is
deprecated - use cpusets instead. However, there doesn't seem to be
any documentation on the right way to do it. Of course, we can achieve
similar results with what you have outlined above, but the process is
more complex than just adding another boot command line argument with
isolcpus. So I doubt isolcpus will die anytime soon unless we can make
the alternative as easy to use.

> Back before cgroups (and the new workqueue stuff), we could've started
> everything in the !root group, no worry. But now that doesn't work,
> because a bunch of controllers can't deal with that and everything
> cgroup expects the cgroupfs to be empty on boot.

AFAIK, all the processes belong to the root cgroup on boot. And the root
cgroup is usually special that the controller may not exert any control
for processes in the root cgroup. Many controllers become active for
processes in the child cgroups only. Would you mind elaborating what
doesn't quite work currently?

 
> It's one of my biggest regrets that I didn't 'fix' this before cgroups
> came along.
>
>> I have not removed any bits. I just haven't exposed
>> them yet. It does seem like these 2 control knobs are useful from the
>> scheduling perspective. Do we also need cpu_exclusive or just the two
>> sched control knobs are enough?
> I always forget if we need exclusive for load_balance to work; I'll
> peruse the document/code.

I think the cpu_exclusive feature can be useful to enforce that CPUs
allocated to the "other" isolated cgroup cannot be used by the processes
under the "system" parent.

I know that there is special code to handle the isolcpus option. How
about changing it to create an exclusive cpuset automatically instead?
Applications that need to run on those isolated CPUs can then use the
standard cgroup process to be moved into the isolated cgroup. For example,

isolcpus=,

or

isolcpuset=[,cpu:][,mem:]

We can then retire the old usage and encourage users to use the cgroup
API to manage it.

Cheers,
Longman




Re: [PATCH v4] cpuset: Enable cpuset controller in default hierarchy

2018-03-09 Thread Waiman Long
On 03/09/2018 05:17 PM, Peter Zijlstra wrote:
> On Fri, Mar 09, 2018 at 03:43:34PM -0500, Waiman Long wrote:
>> The isolcpus= parameter just reduce the cpus available to the rests of
>> the system. The cpuset controller does look at that value and make
>> adjustment accordingly, but it has no dependence on exclusive cpu/mem
>> features of cpuset.
> The isolcpus= boot param is donkey shit and needs to die. cpuset _used_
> to be able to fully replace it, but with the advent of cgroup 'feature'
> this got lost.
>
> And instead of fixing it, you're making it _far_ worse. You completely
> removed all the bits that allow repartitioning the scheduler domains.
>
> Mike is completely right, full NAK on any such approach.

So you are talking about sched_relax_domain_level and
sched_load_balance. I have not removed any bits. I just haven't exposed
them yet. It does seem like these 2 control knobs are useful from the
scheduling perspective. Do we also need cpu_exclusive or just the two
sched control knobs are enough?

Cheers,
Longman




Re: [PATCH v4] cpuset: Enable cpuset controller in default hierarchy

2018-03-09 Thread Waiman Long
On 03/09/2018 02:40 PM, Mike Galbraith wrote:
>>>
>>> If v2 is to ever supersede v1, as is the normal way of things, core
>>> functionality really should be on the v2 boat when it sails.  What you
>>> left standing on the dock is critical core cpuset functionality.
>>>
>>> -Mike
>> From your perspective, what are core functionality that should be
>> included in cpuset v2 other than the ability to restrict cpus and memory
>> nodes.
> Exclusive sets are essential, no?  How else can you manage set wide
> properties such as topology (and hopefully soonish nohz).  You clearly
> can't have overlapping sets, one having scheduler topology, the other
> having none.  Whatever the form, something as core as the capability to
> dynamically partition and isolate should IMO be firmly aboard the v2
> boat before it sails.
>
>   -Mike

The isolcpus= parameter just reduces the cpus available to the rest of
the system. The cpuset controller does look at that value and makes
adjustments accordingly, but it has no dependence on the exclusive
cpu/mem features of cpuset.

-Longman




Re: [PATCH v4] cpuset: Enable cpuset controller in default hierarchy

2018-03-09 Thread Waiman Long
On 03/09/2018 01:17 PM, Mike Galbraith wrote:
> On Fri, 2018-03-09 at 12:45 -0500, Waiman Long wrote:
>> On 03/09/2018 11:34 AM, Mike Galbraith wrote:
>>> On Fri, 2018-03-09 at 10:35 -0500, Waiman Long wrote:
>>>> Given the fact that thread mode had been merged into 4.14, it is now
>>>> time to enable cpuset to be used in the default hierarchy (cgroup v2)
>>>> as it is clearly threaded.
>>>>
>>>> The cpuset controller had experienced feature creep since its
>>>> introduction more than a decade ago. Besides the core cpus and mems
>>>> control files to limit cpus and memory nodes, there are a bunch of
>>>> additional features that can be controlled from the userspace. Some of
>>>> the features are of doubtful usefulness and may not be actively used.
>>> One rather important features is the ability to dynamically partition a
>>> box and isolate critical loads.  How does one do that with v2?
>>>
>>> In v1, you create two or more exclusive sets, one for generic
>>> housekeeping, and one or more for critical load(s), RT in my case,
>>> turning off load balancing in the critical set(s) for obvious reasons.
>> This patch just serves as a foundation for cpuset support in v2. I am
>> not excluding the fact that more v1 features will be added in future
>> patches. We want to start with a clean slate and add on it after careful
>> consideration. There are some v1 cpuset features that are not used or
>> rarely used. We certainly want to get rid of them, if possible.
> If v2 is to ever supersede v1, as is the normal way of things, core
> functionality really should be on the v2 boat when it sails.  What you
> left standing on the dock is critical core cpuset functionality.
>
>   -Mike

From your perspective, what is the core functionality that should be
included in cpuset v2 other than the ability to restrict cpus and
memory nodes?

Cheers,
Longman



Re: [PATCH v4] cpuset: Enable cpuset controller in default hierarchy

2018-03-09 Thread Waiman Long
On 03/09/2018 11:34 AM, Mike Galbraith wrote:
> On Fri, 2018-03-09 at 10:35 -0500, Waiman Long wrote:
>> Given the fact that thread mode had been merged into 4.14, it is now
>> time to enable cpuset to be used in the default hierarchy (cgroup v2)
>> as it is clearly threaded.
>>
>> The cpuset controller had experienced feature creep since its
>> introduction more than a decade ago. Besides the core cpus and mems
>> control files to limit cpus and memory nodes, there are a bunch of
>> additional features that can be controlled from the userspace. Some of
>> the features are of doubtful usefulness and may not be actively used.
> One rather important features is the ability to dynamically partition a
> box and isolate critical loads.  How does one do that with v2?
>
> In v1, you create two or more exclusive sets, one for generic
> housekeeping, and one or more for critical load(s), RT in my case,
> turning off load balancing in the critical set(s) for obvious reasons.

This patch just serves as a foundation for cpuset support in v2. I am
not excluding the fact that more v1 features will be added in future
patches. We want to start with a clean slate and add on it after careful
consideration. There are some v1 cpuset features that are not used or
rarely used. We certainly want to get rid of them, if possible.

Now for the exclusive cpuset, it is certainly something that can be done
in userspace. If there is a valid use case that requires exclusive
cpuset support in the kernel, we can certainly consider putting it into
v2 as well.

Cheers,
Longman



[PATCH v4] cpuset: Enable cpuset controller in default hierarchy

2018-03-09 Thread Waiman Long
Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to the
memory controller or perhaps to the cpu controller instead of keeping
them with cpuset.

v4:
 - Further minimize the feature set by removing the flags control knob.

v3:
 - Further trim the additional features down to just memory_migrate.
 - Update Documentation/cgroup-v2.txt.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt | 96 -
 kernel/cgroup/cpuset.c  | 44 +++--
 2 files changed, 127 insertions(+), 13 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..8d7300f 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,16 +48,18 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
- 5-3. IO
-   5-3-1. IO Interface Files
-   5-3-2. Writeback
- 5-4. PID
-   5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
-   5-6-1. RDMA Interface Files
- 5-7. Misc
-   5-7-1. perf_event
+ 5-3. Cpuset
+   5-3-1. Cpuset Interface Files
+ 5-4. IO
+   5-4-1. IO Interface Files
+   5-4-2. Writeback
+ 5-5. PID
+   5-5-1. PID Interface Files
+ 5-6. Device
+ 5-7. RDMA
+   5-7-1. RDMA Interface Files
+ 5-8. Misc
+   5-8-1. perf_event
  5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1243,6 +1245,80 @@ POSIX_FADV_DONTNEED to relinquish the ownership of 
memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+   A read-write multiple values file which exists on non-root
+   cgroups.
+
+   It lists the CPUs allowed to be used by tasks within this
+   cgroup.  The CPU numbers are comma-separated numbers or
+   ranges.  For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.cpus" or all the available CPUs if none is found.
+
+   The value of "cpuset.cpus" stays constant until the next update
+   and won't be affected by any CPU hotplug events.
+
+  cpuset.effective_cpus
+   A read-only multiple values file which exists on non-root
+   cgroups.
+
+   It lists the onlined CPUs that are actually allowed to be
+   used by tasks within the current cgroup. It is a subset of
+   "cpuset.cpus". Its value will be affected by CPU hotplug
+   events.
+
+  cpuset.mems
+   A read-write multiple values file which exists on non-root
+   cgroups.
+
+   It lists the memory nodes allowed to be used by tasks within
+   this cgroup.  The memory node numbers are comma-separated
+   numbers or ranges.  For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.mems" or all the available memory nodes if none
+   is found.
+
+   The value of "cpuset.mems" stays constant until the next update
+   and won't be affected by any memory nodes hotplug events.
+
+ 
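The "cpuset.cpus"/"cpuset.mems" list syntax documented above (e.g. "0-4,6,8-10") and the hierarchical constraint (a child can only use what its parent allows, and only online resources appear in the effective files) can be modeled in a few lines. This is an illustrative sketch only, not the kernel's implementation; the kernel parses such lists with cpulist_parse()/bitmap_parselist() and computes effective masks inside kernel/cgroup/cpuset.c:

```python
def parse_cpulist(s):
    """Parse a cpuset-style list like "0-4,6,8-10" into a set of ints."""
    cpus = set()
    s = s.strip()
    if not s:
        return cpus          # empty value: inherit from nearest ancestor
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def format_cpulist(cpus):
    """Format a set of ints back into the compact range notation."""
    runs, run = [], []
    for cpu in sorted(cpus):
        if run and cpu == run[-1] + 1:
            run.append(cpu)
        else:
            if run:
                runs.append(run)
            run = [cpu]
    if run:
        runs.append(run)
    return ",".join(f"{r[0]}-{r[-1]}" if len(r) > 1 else str(r[0])
                    for r in runs)

def effective_cpus(requested, parent_effective, online):
    """Model of the hierarchy rule: a cgroup may only use CPUs allowed
    by its parent, and only those that are currently online."""
    allowed = requested if requested else parent_effective
    return allowed & parent_effective & online
```

For example, a child with "cpuset.cpus" of "0-1,5" under a parent whose effective set is {0,1,2,3} ends up with an effective set of {0,1}.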

Re: [PATCH v3] cpuset: Enable cpuset controller in default hierarchy

2017-11-27 Thread Waiman Long
On 11/27/2017 04:42 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, Nov 27, 2017 at 04:19:57PM -0500, Waiman Long wrote:
>>> Let's start just with [e]cpus and [e]mems.  The flags interface looks
>>> fine but the implementations of these features are really bad and
>>> cgroup2 doesn't migrate resources for other controllers either anyway.
>> That is added because the mem_migrate feature is used in libvirt, I
>> think. I am thinking of adding an "[EXPERIMENTAL]" tag to the flags to
>> indicate that it is subject to change.
> I see.  Do you happen to know what it's used for and why that's
> necessary just so that we can evaluate it better?  I'm not quite sure
> what adding [EXPERIMENTAL] tag would achieve.  If we expose the
> feature and people use it, we just have to keep it anyway.

The mem_migrate feature will probably enforce better NUMA locality as
the vCPU may move from one physical CPU to another if it is not pinned.

I want to add the experimental tag more in the sense that we are more
likely to add to the list of flags in the future than to remove an
existing one. Well, I guess we can just say so in the text instead of
adding a tag.

Cheers,
Longman


Re: [PATCH v3] cpuset: Enable cpuset controller in default hierarchy

2017-11-27 Thread Waiman Long
On 11/27/2017 04:04 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> Sorry about the long delay.
>
> On Fri, Oct 06, 2017 at 05:10:30PM -0400, Waiman Long wrote:
>> +Cpuset Interface Files
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +  cpuset.cpus
>> +A read-write multiple values file which exists on non-root
>> +cgroups.
>> +
>> +It lists the CPUs allowed to be used by tasks within this
>> +cgroup.  The CPU numbers are comma-separated numbers or
>> +ranges.  For example:
>> +
>> +  # cat cpuset.cpus
>> +  0-4,6,8-10
>> +
>> +An empty value indicates that the cgroup is using the same
>> +setting as the nearest cgroup ancestor with a non-empty
>> +"cpuset.cpus" or all the available CPUs if none is found.
>> +
>> +The value of "cpuset.cpus" stays constant until the next update
>> +and won't be affected by any CPU hotplug events.
>> +
>> +  cpuset.effective_cpus
> Can we do cpuset.ecpus in the fashion of euid, egid..?

Sure.
>> +  cpuset.effective_mems
> Ditto.

Sure.

>> +  cpuset.flags
>> +A read-write multiple values file which exists on non-root
>> +cgroups.
>> +
>> +It lists the flags that are set (with a '+' prefix) and those
>> +that are not set (with a '-' prefix).   The currently supported
>> +flag is:
>> +
>> +  mem_migrate
>> +When it is not set, an allocated memory page will
>> +stay in whatever node it was allocated independent
>> +of changes in "cpuset.mems".
>> +
>> +When it is set, tasks with memory pages not in
>> +"cpuset.mems" will have those pages migrated over to
>> +memory nodes specified in "cpuset.mems".  Any changes
>> +to "cpuset.mems" will cause pages in nodes that are
>> +no longer valid to be migrated over to the newly
>> +valid nodes.
> Let's start just with [e]cpus and [e]mems.  The flags interface looks
> fine but the implementations of these features are really bad and
> cgroup2 doesn't migrate resources for other controllers either anyway.

That is added because the mem_migrate feature is used in libvirt, I
think. I am thinking of adding an "[EXPERIMENTAL]" tag to the flags to
indicate that it is subject to change.

Cheers,
Longman


Re: [PATCH v3] cpuset: Enable cpuset controller in default hierarchy

2017-11-14 Thread Waiman Long
On 10/26/2017 02:12 PM, Waiman Long wrote:
> On 10/26/2017 10:39 AM, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Wed, Oct 25, 2017 at 11:50:34AM -0400, Waiman Long wrote:
>>> Ping! Any comment on this patch?
>> Sorry about the lack of response.  Here are my two thoughts.
>>
>> 1. I'm not really sure about the memory part.  Mostly because of the
>>way it's configured and enforced is completely out of step with how
>>mm behaves in general.  I'd like to get more input from mm folks on
>>this.
> Yes, I also have doubts about which of the additional features are being
> actively used. That is why the current patch exposes only the memory_migrate
> flag in addition to the core *cpus and *mems control files. All the
> other v1 features are not exposed waiting for further investigation and
> feedback. One way to get more feedback is to have something that people
> can play with. Maybe we could somehow tag it as experimental so that we
> can change the interface later on, when necessary, if you have concerns
> about setting the APIs in stone.
>
>> 2. I want to think more about how we expose the effective settings.
>>Not that anything is wrong with what cpuset does, but more that I
>>wanna ensure that it's something we can follow in other cases where
>>we have similar hierarchical property propagation.
> Currently, the effective setting is exposed via the effective_cpus and
> effective_mems control files. Unlike other controllers that control
> resources, cpuset is unique in the sense that it is propagating
> hierarchical constraints on CPUs and memory nodes down the tree. I
> understand your desire to have a unified framework that can be applied
> to most controllers, but I doubt cpuset is a good model in this regard.

What do you think we can do for the 4.16 development cycle? I would
really like to see at least some kind of experimental support for cpuset v2.
That may be the best way to gather feedback and decide what to do next.

Cheers,
Longman


Re: [PATCHv2 1/1] locking/qspinlock/x86: Avoid test-and-set when PV_DEDICATED is set

2017-11-02 Thread Waiman Long
On 11/02/2017 02:08 PM, Eduardo Valentin wrote:
> On Thu, Nov 02, 2017 at 06:56:46PM +0100, Paolo Bonzini wrote:
>> On 02/11/2017 18:45, Eduardo Valentin wrote:
>>> Currently, the existing qspinlock implementation will fallback to
>>> test-and-set if the hypervisor has not set the PV_UNHALT flag.
>>>
>>> This patch gives the opportunity to guest kernels to select
>>> between test-and-set and the regular queue fair lock implementation
>>> based on the PV_DEDICATED KVM feature flag. When the PV_DEDICATED
>>> flag is not set, the code will still fall back to test-and-set,
>>> but when the PV_DEDICATED flag is set, the code will use
>>> the regular queue spinlock implementation.
>> Have you seen Waiman's series that lets you specify this on the guest
>> command line instead?  Would this be acceptable for your use case?
>>
> No, can you please share a link to it? is it already merged to tip/master?

See https://lkml.org/lkml/2017/11/1/655

Cheers,
Longman



Re: [PATCH v3] cpuset: Enable cpuset controller in default hierarchy

2017-10-26 Thread Waiman Long
On 10/26/2017 10:39 AM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Oct 25, 2017 at 11:50:34AM -0400, Waiman Long wrote:
>> Ping! Any comment on this patch?
> Sorry about the lack of response.  Here are my two thoughts.
>
> 1. I'm not really sure about the memory part.  Mostly because of the
>way it's configured and enforced is completely out of step with how
>mm behaves in general.  I'd like to get more input from mm folks on
>this.

Yes, I also have doubts about which of the additional features are being
actively used. That is why the current patch exposes only the memory_migrate
flag in addition to the core *cpus and *mems control files. All the
other v1 features are not exposed waiting for further investigation and
feedback. One way to get more feedback is to have something that people
can play with. Maybe we could somehow tag it as experimental so that we
can change the interface later on, when necessary, if you have concerns
about setting the APIs in stone.

> 2. I want to think more about how we expose the effective settings.
>Not that anything is wrong with what cpuset does, but more that I
>wanna ensure that it's something we can follow in other cases where
>we have similar hierarchical property propagation.

Currently, the effective setting is exposed via the effective_cpus and
effective_mems control files. Unlike other controllers that control
resources, cpuset is unique in the sense that it is propagating
hierarchical constraints on CPUs and memory nodes down the tree. I
understand your desire to have a unified framework that can be applied
to most controllers, but I doubt cpuset is a good model in this regard.

Cheers,
Longman



Re: [PATCH v3] cpuset: Enable cpuset controller in default hierarchy

2017-10-25 Thread Waiman Long
On 10/06/2017 05:10 PM, Waiman Long wrote:
> Given the fact that thread mode had been merged into 4.14, it is now
> time to enable cpuset to be used in the default hierarchy (cgroup v2)
> as it is clearly threaded.
>
> The cpuset controller had experienced feature creep since its
> introduction more than a decade ago. Besides the core cpus and mems
> control files to limit cpus and memory nodes, there are a bunch of
> additional features that can be controlled from the userspace. Some of
> the features are of doubtful usefulness and may not be actively used.
>
> After examining the source code of some sample users like systemd,
> libvirt and lxc for their use of those additional features, only
> memory_migrate is used by libvirt.
>
> This patch enables cpuset controller in the default hierarchy with a
> minimal set of features. Currently, only memory_migrate is supported.
> We can certainly add more features to the default hierarchy if there
> is a real user need for them later on.
>
> For features that are actually flags which are set internally, they are
> being combined into a single "cpuset.flags" control file. That includes
> the memory_migrate feature which is the only flag that is currently
> supported. When the "cpuset.flags" file is read, it contains either
> "+mem_migrate" (enabled) or "-mem_migrate" (disabled).
>
> To enable it, use
>
>   # echo +mem_migrate > cpuset.flags
>
> To disable it, use
>
>   # echo -mem_migrate > cpuset.flags
>
> Note that the flag name is changed to "mem_migrate" for better naming
> consistency.
>
> v3:
>  - Further trim the additional features down to just memory_migrate.
>  - Update Documentation/cgroup-v2.txt.
>
> Signed-off-by: Waiman Long <long...@redhat.com>

Ping! Any comment on this patch?

Cheers,
Longman



Re: [PATCH 1/1] locking/qspinlock/x86: Avoid test-and-set when PV_DEDICATED is set

2017-10-24 Thread Waiman Long
On 10/24/2017 11:37 AM, Eduardo Valentin wrote:
> Hello Peter,
> On Tue, Oct 24, 2017 at 10:13:45AM +0200, Peter Zijlstra wrote:
>> On Mon, Oct 23, 2017 at 05:44:27PM -0700, Eduardo Valentin wrote:
>>> @@ -46,6 +48,8 @@ static inline bool virt_spin_lock(struct qspinlock *lock)
>>> if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>> return false;
>>>  
>>> +   if (kvm_para_has_feature(KVM_FEATURE_PV_DEDICATED))
>>> +   return false;
>>> /*
>>>  * On hypervisors without PARAVIRT_SPINLOCKS support we fall
>>>  * back to a Test-and-Set spinlock, because fair locks have
>> This does not apply. Much has been changed here recently.
>>
>  I checked against Linus master branch before sending. Which tree/branch are 
> you referring to / should I base this on?
>
Please check the tip tree
(https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git) which has
the latest changes in locking code.

Cheers,
Longman



[PATCH v3] cpuset: Enable cpuset controller in default hierarchy

2017-10-06 Thread Waiman Long
Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

After examining the source code of some sample users like systemd,
libvirt and lxc for their use of those additional features, only
memory_migrate is used by libvirt.

This patch enables cpuset controller in the default hierarchy with a
minimal set of features. Currently, only memory_migrate is supported.
We can certainly add more features to the default hierarchy if there
is a real user need for them later on.

For features that are actually flags which are set internally, they are
being combined into a single "cpuset.flags" control file. That includes
the memory_migrate feature which is the only flag that is currently
supported. When the "cpuset.flags" file is read, it contains either
"+mem_migrate" (enabled) or "-mem_migrate" (disabled).

To enable it, use

  # echo +mem_migrate > cpuset.flags

To disable it, use

  # echo -mem_migrate > cpuset.flags

Note that the flag name is changed to "mem_migrate" for better naming
consistency.
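The "cpuset.flags" read format described above ("+mem_migrate" when enabled, "-mem_migrate" when disabled) is easy to consume from userspace. A minimal sketch, assuming the file exposes whitespace-separated "+name"/"-name" tokens as in the examples here (the patch only defines mem_migrate; any other flag names are hypothetical):

```python
def parse_cpuset_flags(text):
    """Parse cpuset.flags content like "+mem_migrate" or "-mem_migrate"
    into a {flag_name: enabled} mapping."""
    flags = {}
    for token in text.split():
        if token[0] not in "+-":
            raise ValueError(f"malformed flag token: {token!r}")
        flags[token[1:]] = (token[0] == "+")
    return flags
```

A tool could then check `parse_cpuset_flags(open("cpuset.flags").read()).get("mem_migrate")` before relying on page migration behaviour.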

v3:
 - Further trim the additional features down to just memory_migrate.
 - Update Documentation/cgroup-v2.txt.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/cgroup-v2.txt | 122 
 kernel/cgroup/cpuset.c  | 112 +++-
 2 files changed, 223 insertions(+), 11 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 0bbdc72..f9fea87 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,15 +48,17 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
- 5-3. IO
-   5-3-1. IO Interface Files
-   5-3-2. Writeback
- 5-4. PID
-   5-4-1. PID Interface Files
- 5-5. RDMA
-   5-5-1. RDMA Interface Files
- 5-6. Misc
-   5-6-1. perf_event
+ 5-3. Cpuset
+   5-3-1. Cpuset Interface Files
+ 5-4. IO
+   5-4-1. IO Interface Files
+   5-4-2. Writeback
+ 5-5. PID
+   5-5-1. PID Interface Files
+ 5-6. RDMA
+   5-6-1. RDMA Interface Files
+ 5-7. Misc
+   5-7-1. perf_event
6. Namespace
  6-1. Basics
  6-2. The Root and Views
@@ -1235,6 +1237,108 @@ POSIX_FADV_DONTNEED to relinquish the ownership of 
memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+   A read-write multiple values file which exists on non-root
+   cgroups.
+
+   It lists the CPUs allowed to be used by tasks within this
+   cgroup.  The CPU numbers are comma-separated numbers or
+   ranges.  For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.cpus" or all the available CPUs if none is found.
+
+   The value of "cpuset.cpus" stays constant until the next update
+   and won't be affected by any CPU hotplug events.
+
+  cpuset.effective_cpus
+   A read-only multiple values file which exists on non-root
+   cgroups.
+
+   It lists the onlined CPUs that are actually allowed to be
+   used by tasks within the current cgroup. It is a subset of
+   "cpuset.cpus". Its value will be affected by CPU hotplug
+   events.
+
+  cpuset.mems
+   A read-write multiple values file which exists on non-root
+   cgroups.
+
+   It lists the memory nodes allowed to be used by tasks within
+   this cgroup.  The memory node numbers are comma-separated
+   numbers or ranges.  For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+   An empty value indicates that the cgroup is using the same
+   setting as the nearest cgroup ancestor with a non-empty
+   "cpuset.mems" or all the available memory nodes if none
+   is found.

Re: [PATCH resend] x86,kvm: Add a kernel parameter to disable PV spinlock

2017-09-04 Thread Waiman Long
On 09/04/2017 10:40 AM, Peter Zijlstra wrote:
> On Mon, Sep 04, 2017 at 04:28:36PM +0200, Oscar Salvador wrote:
>> This is just a resend of Waiman Long's patch.
>> I could not find out why it was not merged upstream, so I thought
>> to give it another chance.
>> What follows is what Waiman Long wrote.
>>
>> Xen has an kernel command line argument "xen_nopvspin" to disable
>> paravirtual spinlocks. This patch adds a similar "kvm_nopvspin"
>> argument to disable paravirtual spinlocks for KVM. This can be useful
>> for testing as well as allowing administrators to choose unfair lock
>> for their KVM guests if they want to.
> For testing it's trivial to hack your kernel, and I don't feel this is
> something an Admin can make reasonable decisions about.

I almost forgot about this patch that I sent quite a while ago. I was
sending this patch out mainly to maintain consistency between KVM and
Xen. This patch is not that important to me, and that is why I didn't
push it further.

Cheers,
Longman
 



Re: [PATCH v3 0/5] fs/dcache: Limit # of negative dentries

2017-08-28 Thread Waiman Long
On 08/28/2017 01:58 PM, Waiman Long wrote:
> On 07/28/2017 02:34 PM, Waiman Long wrote:
>>  v2->v3:
>>   - Add a faster pruning rate when the free pool is close to depletion.
>>   - As suggested by James Bottomley, add an artificial delay waiting
>> loop before killing a negative dentry and properly clear the
>> DCACHE_KILL_NEGATIVE flag if killing doesn't happen.
>>   - Add a new patch to track the number of negative dentries that are
>> forcibly killed.
>>
>>  v1->v2:
>>   - Move the new nr_negative field to the end of dentry_stat_t structure
>> as suggested by Matthew Wilcox.
>>   - With the help of Miklos Szeredi, fix incorrect locking order in
>> dentry_kill() by using lock_parent() instead of locking the parent's
>> d_lock directly.
>>   - Correctly account for positive to negative dentry transitions.
>>   - Automatic pruning of negative dentries will now ignore the reference
>> bit in negative dentries but not the regular shrinking.
>>
>> A rogue application can potentially create a large number of negative
>> dentries in the system consuming most of the memory available. This
>> can impact performance of other applications running on the system.
>>
>> This patchset introduces changes to the dcache subsystem to limit
>> the number of negative dentries allowed to be created thus limiting
>> the amount of memory that can be consumed by negative dentries.
>>
>> Patch 1 tracks the number of negative dentries used and disallows
>> the creation of more when the limit is reached.
>>
>> Patch 2 enables /proc/sys/fs/dentry-state to report the number of
>> negative dentries in the system.
>>
>> Patch 3 enables automatic pruning of negative dentries when it is
>> close to the limit so that we won't end up killing recently used
>> negative dentries.
>>
>> Patch 4 prevents racing between negative dentry pruning and umount
>> operation.
>>
>> Patch 5 shows the number of forced negative dentry killings in
>> /proc/sys/fs/dentry-state. End users can then tune the neg_dentry_pc=
>> kernel boot parameter if they want to reduce forced negative dentry
>> killings.
>>
>> Waiman Long (5):
>>   fs/dcache: Limit numbers of negative dentries
>>   fs/dcache: Report negative dentry number in dentry-state
>>   fs/dcache: Enable automatic pruning of negative dentries
>>   fs/dcache: Protect negative dentry pruning from racing with umount
>>   fs/dcache: Track count of negative dentries forcibly killed
>>
>>  Documentation/admin-guide/kernel-parameters.txt |   7 +
>>  fs/dcache.c | 451 
>> ++--
>>  include/linux/dcache.h  |   8 +-
>>  include/linux/list_lru.h|   1 +
>>  mm/list_lru.c   |   4 +-
>>  5 files changed, 435 insertions(+), 36 deletions(-)
>>
> With a 4.13 based kernel, the positive & negative dentries lookup rates
> (lookups per second) after initial boot on a 32GB memory VM with and
> without the patch were as follows:
>
>   Metric                     w/o patch   with patch
>   ------                     ---------   ----------
>   Positive dentry lookup     844881      842618
>   Negative dentry lookup     1865158     1901875
>   Negative dentry creation   609887      617215
>
> The last row refers to the creation rate of 10 million negative
> dentries with the negative dentry limit set to 50% (> 80M dentries).
> Ignoring some inherent noise in the test results, there wasn't any
> noticeable difference in terms of lookup and negative dentry creation
> performance with or without this patch.
>
> If the limit was set to 5% (the default), the 10M negative dentry
> creation rate dropped to 199565 and the dentry-state was:
>
> 2344754 2326486 45  0   2316533 7494261
>
> This was expected as negative dentry creation throttling with forced
> dentry deletion happened in this case.
>
> IOW, this patch does not cause any regression in terms of lookup and
> negative dentry creation performance as long as the limit hasn't been
> reached.

Another performance data point about running the AIM7 highsystime
workload on a 36-core 32G VM is as follows:

Running the AIM7 high-systime workload on the VM, the baseline
performance was 186770 jobs/min. By running a single-thread rogue
negative dentry creation program in the background until the patched
kernel with 5% limit started throttling, the performance was 183746
jobs/min. On an unpatched kernel with memory almost exhausted and
memory shrinker was kicked in, the performance was 148997 jobs/min.

So the patch does protect the system from suffering significant
performance degradation when a rogue negative dentry creation
program is running in the background.

Cheers,
Longman




Re: [PATCH v3 0/5] fs/dcache: Limit # of negative dentries

2017-08-28 Thread Waiman Long
On 07/28/2017 02:34 PM, Waiman Long wrote:
>  v2->v3:
>   - Add a faster pruning rate when the free pool is close to depletion.
>   - As suggested by James Bottomley, add an artificial delay waiting
> loop before killing a negative dentry and properly clear the
> DCACHE_KILL_NEGATIVE flag if killing doesn't happen.
>   - Add a new patch to track the number of negative dentries that are
> forcibly killed.
>
>  v1->v2:
>   - Move the new nr_negative field to the end of dentry_stat_t structure
> as suggested by Matthew Wilcox.
>   - With the help of Miklos Szeredi, fix incorrect locking order in
> dentry_kill() by using lock_parent() instead of locking the parent's
> d_lock directly.
>   - Correctly account for positive to negative dentry transitions.
>   - Automatic pruning of negative dentries will now ignore the reference
> bit in negative dentries but not the regular shrinking.
>
> A rogue application can potentially create a large number of negative
> dentries in the system consuming most of the memory available. This
> can impact performance of other applications running on the system.
>
> This patchset introduces changes to the dcache subsystem to limit
> the number of negative dentries allowed to be created thus limiting
> the amount of memory that can be consumed by negative dentries.
>
> Patch 1 tracks the number of negative dentries used and disallows
> the creation of more when the limit is reached.
>
> Patch 2 enables /proc/sys/fs/dentry-state to report the number of
> negative dentries in the system.
>
> Patch 3 enables automatic pruning of negative dentries when it is
> close to the limit so that we won't end up killing recently used
> negative dentries.
>
> Patch 4 prevents racing between negative dentry pruning and umount
> operation.
>
> Patch 5 shows the number of forced negative dentry killings in
> /proc/sys/fs/dentry-state. End users can then tune the neg_dentry_pc=
> kernel boot parameter if they want to reduce forced negative dentry
> killings.
>
> Waiman Long (5):
>   fs/dcache: Limit numbers of negative dentries
>   fs/dcache: Report negative dentry number in dentry-state
>   fs/dcache: Enable automatic pruning of negative dentries
>   fs/dcache: Protect negative dentry pruning from racing with umount
>   fs/dcache: Track count of negative dentries forcibly killed
>
>  Documentation/admin-guide/kernel-parameters.txt |   7 +
>  fs/dcache.c | 451 
> ++--
>  include/linux/dcache.h  |   8 +-
>  include/linux/list_lru.h|   1 +
>  mm/list_lru.c   |   4 +-
>  5 files changed, 435 insertions(+), 36 deletions(-)
>
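The rogue-application scenario this patchset defends against is trivially reproduced from userspace: every failed lookup of a distinct name leaves a negative dentry behind in the dcache. A minimal illustration (the dentries themselves live in kernel memory; all this process observes is the ENOENT errors):

```python
import os
import tempfile

def make_negative_dentries(n):
    """Stat n distinct nonexistent names; each miss leaves a negative
    dentry in the kernel's dcache (until pruned or the directory goes)."""
    misses = 0
    with tempfile.TemporaryDirectory() as d:
        for i in range(n):
            try:
                os.stat(os.path.join(d, f"no-such-file-{i}"))
            except FileNotFoundError:
                misses += 1
    return misses
```

Run in a tight loop with millions of distinct names, this is exactly the kind of program that can consume most of the available memory on an unpatched kernel.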
With a 4.13 based kernel, the positive & negative dentries lookup rates
(lookups per second) after initial boot on a 32GB memory VM with and
without the patch were as follows:

  Metric                     w/o patch   with patch
  ------                     ---------   ----------
  Positive dentry lookup     844881      842618
  Negative dentry lookup     1865158     1901875
  Negative dentry creation   609887      617215

The last row refers to the creation rate of 10 million negative
dentries with the negative dentry limit set to 50% (> 80M dentries).
Ignoring some inherent noise in the test results, there wasn't any
noticeable difference in terms of lookup and negative dentry creation
performance with or without this patch.

If the limit was set to 5% (the default), the 10M negative dentry
creation rate dropped to 199565 and the dentry-state was:

2344754 2326486 45  0   2316533 7494261

This was expected as negative dentry creation throttling with forced
dentry deletion happened in this case.

IOW, this patch does not cause any regression in terms of lookup and
negative dentry creation performance as long as the limit hasn't been
reached.
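The dentry-state line quoted above can be split into named fields. The names of the last two fields follow patches 2 and 5 of this series (negative dentry count and forced-kill count); on a stock kernel those positions hold dummy fields, so treat this layout as an assumption of the patched kernel:

```python
def parse_dentry_state(line):
    """Split a /proc/sys/fs/dentry-state line into named fields.
    Field names for the last two positions assume patches 2 and 5
    of this series are applied."""
    names = ("nr_dentry", "nr_unused", "age_limit", "want_pages",
             "nr_negative", "nr_forced_kill")
    return dict(zip(names, map(int, line.split())))
```

Applied to the sample above, it shows ~2.32M negative dentries retained and ~7.49M forcibly killed out of the 10M created under the 5% limit.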

Cheers,
Longman




Re: [PATCH v3 0/5] fs/dcache: Limit # of negative dentries

2017-08-21 Thread Waiman Long
On 08/20/2017 11:23 PM, Wangkai (Kevin,C) wrote:
>
> Yes, I have added some trace info for dentry state changes, with the dentry
> flags and reference count:
>
> File create:
> [   42.636675] dentry [_1234] 0x880230be8180 flag 0x0 ref 1 ev dentry 
> alloc
> File close:
> [   42.637421] dentry [_1234] 0x880230be8180 flag 0x4800c0 ref 0 ev 
> dput called
>
> Unlink lookup:
> [  244.658086] dentry [_1234] 0x880230be8180 flag 0x4800c0 ref 1 ev 
> d_lookup
> Unlink d_delete:
> [  244.658254] dentry [_1234] 0x880230be8180 flag 0x800c0 ref 1 ev 
> d_lockref ref 1
> Unlink dput:
> [  244.658438] dentry [_1234] 0x880230be8180 flag 0x800c0 ref 0 ev 
> dput called
>
> In the end, the dentry's flags stay at 0x800c0, but this dentry is not freed;
> it is kept by the dcache as unused. After tens of thousands of such dentries
> accumulate, they slow down dentry lookups and keep kernel memory usage high.
>
> Regards,
> Kevin

That is expected. The kernel does not get rid of negative dentries until
the shrinker is called because of memory pressure. Negative dentries do
help to improve file lookup performance. However, too many negative
dentries reduce the amount of free memory available for other uses.
That is why I sent out my patch to limit the number of outstanding
negative dentries.

Cheers,
Longman



Re: [PATCH v3 0/5] fs/dcache: Limit # of negative dentries

2017-08-18 Thread Waiman Long
On 08/18/2017 05:59 AM, Wangkai (Kevin,C) wrote:
>
>>> In my patch the DCACHE_FILE_REMOVED flag was to distinguish a
>>> removed file from a closed file; I found there was no difference
>>> between the dentry of a removed file and that of a closed file.
>>> They are all on the LRU list.
>> There is a difference between removed file and closed file. The type field of
>> d_flags will be empty for a removed file which indicate a negative dentry.
>> Anything else is a positive dentry. Look at the inline function 
>> d_is_negative()
>> [d_is_miss()] and you will see how it is done.
> After the file was removed, the dentry flag was not MISS; the flag was:
> DCACHE_REFERENCED | DCACHE_RCUACCESS | DCACHE_LRU_LIST | DCACHE_REGULAR_TYPE
> So the dentry is never freed until the kernel reclaims the slab memory.

The dentry_unlink_inode() function will clear DCACHE_REGULAR_TYPE.

Cheers,
Longman


Re: [PATCH v3 0/5] fs/dcache: Limit # of negative dentries

2017-08-17 Thread Waiman Long
On 08/17/2017 12:00 AM, Wangkai (Kevin,C) wrote:
>
>>>
>>> Hi Longman,
>>> I am new to fsdevel; I joined this mailing list about two weeks ago.
>>> Recently I have met the same problem of negative dentries. In my
>>> opinion, the dentries should be removed together with the files or
>>> directories. I didn't know you had submitted this patch; I have
>>> another patch about this:
>>> http://marc.info/?l=linux-fsdevel=150209902215266=2
>>>
>>> maybe this is a foo idea...
>>>
>>> regards
>>> Kevin
>> If you look at the code, the front dentries of the LRU list are removed
>> when there are too many negative dentries. That includes positive
>> dentries as well, as it is not practical to remove just the negative
>> dentries.
>>
>> I have looked at your patch. The dentry of a removed file becomes a
>> negative dentry. The kernel can keep track of those negative dentries,
>> and there is no need to add an additional flag for that.
>>
>> Cheers,
>> Longman
> One comment about your patch:
> In patch 1/5, the function dentry_kill() first gets dentry->d_flags, and
> after locking the parent it compares d_flags again. Is this needed? The
> d_flags is changed under the lock.

Yes, it is necessary. We are talking about an SMP system with multiple
threads running concurrently. If you look at the lock_parent() code, it
may release the current dentry lock before taking the parent's lock and
then the dentry lock again. As soon as the lock is released, anything can
happen to the dentry, including changes to d_flags.

> In my patch the DCACHE_FILE_REMOVED flag was to distinguish the removed
> file and the closed file. I found there was no difference of a dentry
> between the removed file and the closed file; they are all on the lru
> list.

There is a difference between a removed file and a closed file. The type
field of d_flags will be empty for a removed file, which indicates a
negative dentry. Anything else is a positive dentry. Look at the inline
function d_is_negative() [d_is_miss()] and you will see how it is done.

Cheers,
Longman



Re: [PATCH] cpuset: Allow v2 behavior in v1 cgroup

2017-08-16 Thread Waiman Long
On 08/16/2017 10:36 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 16, 2017 at 10:34:05AM -0400, Waiman Long wrote:
>>> It feels weird to make this a kernel boot param when all other options
>>> are specified on mount time.  Is there a reason why this can't be a
>>> mount option too?
>>>
>> Yes, we can certainly make this a mount option instead of a boot time
>> parameter. BTW, where do we usually document the mount options for cgroup?
> I don't think there's a central place.  Each controller documents
> theirs in their own file.
>
> Thanks.
>
OK, I am going to update the patch to control the cpuset behavior with a
mount option instead.

Thanks,
Longman



Re: [PATCH] cpuset: Allow v2 behavior in v1 cgroup

2017-08-16 Thread Waiman Long
On 08/16/2017 10:29 AM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Tue, Aug 15, 2017 at 01:27:20PM -0400, Waiman Long wrote:
>> +cpuset_v2_mode= [KNL] Enable cpuset v2 behavior in cpuset v1 cgroups.
>> +In v2 mode, the cpus and mems can be restored to
>> +their original values after a removal-addition
>> +event sequence.
>> +0: default value, cpuset v1 keeps legacy behavior.
>> +1: cpuset v1 behaves like cpuset v2.
>> +
> It feels weird to make this a kernel boot param when all other options
> are specified on mount time.  Is there a reason why this can't be a
> mount option too?
>
> Thanks.
>
Yes, we can certainly make this a mount option instead of a boot time
parameter. BTW, where do we usually document the mount options for cgroup?

Cheers,
Longman



Re: [PATCH v3 0/5] fs/dcache: Limit # of negative dentries

2017-08-16 Thread Waiman Long
On 08/16/2017 06:33 AM, Wangkai (Kevin,C) wrote:
>> -Original Message-
>> From: linux-fsdevel-ow...@vger.kernel.org
>> [mailto:linux-fsdevel-ow...@vger.kernel.org] On Behalf Of Waiman Long
>> Sent: Wednesday, August 16, 2017 1:15 AM
>> To: Alexander Viro; Jonathan Corbet
>> Cc: linux-ker...@vger.kernel.org; linux-doc@vger.kernel.org;
>> linux-fsde...@vger.kernel.org; Paul E. McKenney; Andrew Morton; Ingo Molnar;
>> Miklos Szeredi; Matthew Wilcox; Larry Woodman; James Bottomley
>> Subject: Re: [PATCH v3 0/5] fs/dcache: Limit # of negative dentries
>>
>> On 07/28/2017 02:34 PM, Waiman Long wrote:
>>>  v2->v3:
>>>   - Add a faster pruning rate when the free pool is close to depletion.
>>>   - As suggested by James Bottomley, add an artificial delay waiting
>>> loop before killing a negative dentry and properly clear the
>>> DCACHE_KILL_NEGATIVE flag if killing doesn't happen.
>>>   - Add a new patch to track the number of negative dentries that are
>>> forcibly killed.
>>>
>>>  v1->v2:
>>>   - Move the new nr_negative field to the end of dentry_stat_t structure
>>> as suggested by Matthew Wilcox.
>>>   - With the help of Miklos Szeredi, fix incorrect locking order in
>>> dentry_kill() by using lock_parent() instead of locking the parent's
>>> d_lock directly.
>>>   - Correctly account for positive to negative dentry transitions.
>>>   - Automatic pruning of negative dentries will now ignore the reference
>>> bit in negative dentries but not the regular shrinking.
>>>
>>> A rogue application can potentially create a large number of negative
>>> dentries in the system consuming most of the memory available. This
>>> can impact performance of other applications running on the system.
>>>
>>> This patchset introduces changes to the dcache subsystem to limit the
>>> number of negative dentries allowed to be created thus limiting the
>>> amount of memory that can be consumed by negative dentries.
>>>
>>> Patch 1 tracks the number of negative dentries used and disallow the
>>> creation of more when the limit is reached.
>>>
>>> Patch 2 enables /proc/sys/fs/dentry-state to report the number of
>>> negative dentries in the system.
>>>
>>> Patch 3 enables automatic pruning of negative dentries when it is
>>> close to the limit so that we won't end up killing recently used
>>> negative dentries.
>>>
>>> Patch 4 prevents racing between negative dentry pruning and umount
>>> operation.
>>>
>>> Patch 5 shows the number of forced negative dentry killings in
>>> /proc/sys/fs/dentry-state. End users can then tune the neg_dentry_pc=
>>> kernel boot parameter if they want to reduce forced negative dentry
>>> killings.
>>>
>>> Waiman Long (5):
>>>   fs/dcache: Limit numbers of negative dentries
>>>   fs/dcache: Report negative dentry number in dentry-state
>>>   fs/dcache: Enable automatic pruning of negative dentries
>>>   fs/dcache: Protect negative dentry pruning from racing with umount
>>>   fs/dcache: Track count of negative dentries forcibly killed
>>>
>>>  Documentation/admin-guide/kernel-parameters.txt |   7 +
>>>  fs/dcache.c                                     | 451 ++--
>>>  include/linux/dcache.h                          |   8 +-
>>>  include/linux/list_lru.h                        |   1 +
>>>  mm/list_lru.c                                   |   4 +-
>>>  5 files changed, 435 insertions(+), 36 deletions(-)
>>>
>> I haven't received any comment on this v3 patch for over 2 weeks. Is there
>> anything I can do to make it more ready to be merged?
>>
>> Thanks,
>> Longman
> Hi Longman,
> I am new to fsdevel; I joined this mailing list about two weeks ago.
> Recently I have met the same problem of negative dentries. In my opinion,
> the dentries should be removed together with the files or directories.
> I didn't know you had submitted this patch; I have another patch about
> this:
>
> http://marc.info/?l=linux-fsdevel=150209902215266=2
>
> maybe this is a foo idea...
>
> regards
> Kevin

If you look at the code, the front dentries of the LRU list are removed
when there are too many negative dentries. That includes positive
dentries as well, as it is not practical to remove just the negative
dentries.

I have looked at your patch. The dentry of a removed file becomes a
negative dentry. The kernel can keep track of those negative dentries,
and there is no need to add an additional flag for that.

Cheers,
Longman



  1   2   3   4   >