Re: [PATCH REPOST] libata: remove ata_sff_data_xfer_noirq()

2018-07-11 Thread Tejun Heo
On Wed, Jul 11, 2018 at 05:21:05PM +0200, Sebastian Andrzej Siewior wrote:
> ata_sff_data_xfer_noirq() is invoked via the ->sff_data_xfer hook. The
> latter is invoked by ata_pio_sector(), atapi_send_cdb() and
> __atapi_pio_bytes() which in turn is invoked by ata_sff_hsm_move().
> The latter function requires that the "ap->lock" lock is held which
> needs to be taken with disabled interrupts.
> 
> There is no need to have ata_sff_data_xfer_noirq() which invokes
> ata_sff_data_xfer32() with disabled interrupts because at this point the
> interrupts are already disabled.
> Remove the function and all references to it, and replace all callers
> with ata_sff_data_xfer32().
> 
> Signed-off-by: Sebastian Andrzej Siewior 

Applied to libata/for-4.19.

Nice cleanup.  Thanks.

-- 
tejun


Re: [PATCH v2 02/11] docs: admin-guide: add cgroup-v2 documentation

2018-05-10 Thread Tejun Heo
On Wed, May 09, 2018 at 10:18:45AM -0300, Mauro Carvalho Chehab wrote:
> The cgroup-v2.txt is already in ReST format. So, move it to the
> admin-guide, where it belongs.
> 
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Li Zefan <lize...@huawei.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+sams...@kernel.org>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH] libata: remove ata_sff_data_xfer_noirq()

2018-05-07 Thread Tejun Heo
Hello, Sebastian.

On Fri, May 04, 2018 at 05:06:20PM +0200, Sebastian Andrzej Siewior wrote:
> ata_sff_data_xfer_noirq() is invoked via the ->sff_data_xfer hook. The
> latter is invoked by ata_pio_sector(), atapi_send_cdb() and
> __atapi_pio_bytes() which in turn is invoked by ata_sff_hsm_move().
> The latter function requires that the "ap->lock" lock is held which
> needs to be taken with disabled interrupts.
> 
> There is no need to have ata_sff_data_xfer_noirq() which invokes
> ata_sff_data_xfer32() with disabled interrupts because at this point the
> interrupts are already disabled.
> Remove the function and all references to it, and replace all callers
> with ata_sff_data_xfer32().

Can you please add irq disabled assert to ata_sff_data_xfer*()?
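
For reference, a minimal sketch of what such an assert could look like
(assuming the current ata_sff_data_xfer32() prototype; the exact form
and placement are of course up to the patch author):

        unsigned int ata_sff_data_xfer32(struct ata_queued_cmd *qc,
                                         unsigned char *buf,
                                         unsigned int buflen, int rw)
        {
                /* every caller must hold ap->lock with irqs disabled */
                WARN_ON_ONCE(!irqs_disabled());

                /* ... existing transfer logic unchanged ... */
        }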

Thanks.

-- 
tejun


Re: [PATCH v7 4/5] cpuset: Restrict load balancing off cpus to subset of cpus.isolated

2018-05-01 Thread Tejun Heo
Hello,

On Tue, May 01, 2018 at 04:33:45PM -0400, Waiman Long wrote:
> I think that will work too. We currently don't have a flag to make a
> file visible on first-level children only, but it shouldn't be hard to
> make one.

I think it'd be fine to make the flag file exist on all !root cgroups
but only writable on the first level children.
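
Purely as illustration, the writability restriction could be a simple
parent check in the write handler (a sketch with hypothetical names,
not existing cpuset code):

        static ssize_t cpuset_isolated_write(struct kernfs_open_file *of,
                                             char *buf, size_t nbytes,
                                             loff_t off)
        {
                struct cgroup_subsys_state *css = of_css(of);

                /*
                 * The file exists on every !root cgroup but only
                 * first-level children (parent == root) accept writes.
                 */
                if (!css->parent || css->parent->parent)
                        return -EPERM;

                /* ... parse buf and update the flag ... */
                return nbytes;
        }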

> Putting CPUs into an isolated child cpuset means removing them from the
> root's effective CPUs. So I would probably like to expose the read-only
> cpus.effective in the root cgroup so that we can check changes in the
> effective cpu list.

Ah, yeah, that makes sense.

> I will renew the patchset with your suggestion.

Thank you very much.

-- 
tejun


Re: [PATCH v7 4/5] cpuset: Restrict load balancing off cpus to subset of cpus.isolated

2018-05-01 Thread Tejun Heo
Hello, Waiman.

Sorry about the delay.

On Thu, Apr 19, 2018 at 09:47:03AM -0400, Waiman Long wrote:
> With the addition of "cpuset.cpus.isolated", it makes sense to add the
> restriction that load balancing can only be turned off if the CPUs in
> the isolated cpuset are a subset of "cpuset.cpus.isolated".
> 
> Signed-off-by: Waiman Long 
> ---
>  Documentation/cgroup-v2.txt |  7 ---
>  kernel/cgroup/cpuset.c  | 29 ++---
>  2 files changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index 8d89dc2..c4227ee 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1554,9 +1554,10 @@ Cpuset Interface Files
>   and will not be moved to other CPUs.
>  
>   This flag is hierarchical and is inherited by child cpusets. It
> - can be turned off only when the CPUs in this cpuset aren't
> - listed in the cpuset.cpus of other sibling cgroups, and all
> - the child cpusets, if present, have this flag turned off.
> + can be explicitly turned off only when it is a direct child of
> + the root cgroup and the CPUs in this cpuset are a subset of the
> + root's "cpuset.cpus.isolated".  Moreover, the CPUs cannot be
> + listed in the "cpuset.cpus" of other sibling cgroups.

It is a little bit convoluted that the isolation requires coordination
among root's isolated file and the first-level children's cpus file
and the flag.  Maybe I'm missing something but can't we do something
like the following?

* Add isolated flag file, which can only be modified on empty (in
  terms of cpus) first level children.

* Once isolated flag is set, CPUs can only be added to the cpus file
  iff they aren't being used by anyone else and automatically become
  isolated.

The first level cpus file is owned by the root cgroup anyway, so
there's no danger regarding delegation or whatever and the interface
would be a lot simpler.
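
As a rough illustration of that flow (the interface names here are
hypothetical, just following the two points above):

        # create an empty first-level child and mark it isolated
        mkdir /sys/fs/cgroup/isol
        echo 1 > /sys/fs/cgroup/isol/cpuset.isolated

        # adding CPUs now succeeds iff nobody else is using them,
        # and the added CPUs automatically become isolated
        echo 4-7 > /sys/fs/cgroup/isol/cpuset.cpus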

Thanks.

-- 
tejun


Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2

2018-03-27 Thread Tejun Heo
Hello,

On Mon, Mar 26, 2018 at 04:28:49PM -0400, Waiman Long wrote:
> Maybe we can have a different root level flag, say,
> sched_partition_domain that is equivalent to !sched_load_balance.
> However, I am still not sure if we should enforce that no task should be
> in the root cgroup when the flag is set.
> 
> Tejun and Peter, what are your thoughts on this?

I haven't looked into the other issues too much but we for sure cannot
empty the root cgroup.

Thanks.

-- 
tejun


Re: [PATCH v5 2/2] cpuset: Add cpuset.flags control knob to v2

2018-03-20 Thread Tejun Heo
Hello, Waiman.

On Tue, Mar 20, 2018 at 04:12:25PM -0400, Waiman Long wrote:
> After some thought, I am planning to impose the following additional
> constraints on how sched_load_balance works in v2.
> 
> 1) sched_load_balance will be made hierarchical, the child will inherit
> the flag from its parent.
> 2) cpu_exclusive will be implicitly associated with sched_load_balance.
> IOW, sched_load_balance => !cpu_exclusive, and !sched_load_balance =>
> cpu_exclusive.
> 3) sched_load_balance cannot be 1 on a child if it is 0 on the parent.
> 
> With these changes, sched_load_balance will have to be set by the parent
> and so will not be delegatable. Please let me know your thought on that.

So, for configurations, we usually don't let them interact across
hierarchy because that can lead to configurations changing
unexpectedly and to delegated children locking the parent into the
current config.

This case could be different and as long as we always guarantee that
an ancestor isn't limited by its descendants in what it can configure,
it should be okay (e.g. an ancestor should always be able to turn on
sched_load_balance regardless of how the descendants are configured).

Hmmm... can you explain why sched_load_balance needs to behave this
way?

Thanks.

-- 
tejun


Re: [PATCH v5 1/2] cpuset: Enable cpuset controller in default hierarchy

2018-03-20 Thread Tejun Heo
Hello, Waiman.

On Tue, Mar 20, 2018 at 09:51:20AM -0400, Waiman Long wrote:
> >> +  It lists the onlined CPUs that are actually allowed to be
> >> +  used by tasks within the current cgroup. It is a subset of
> >> +  "cpuset.cpus".  Its value will be affected by CPU hotplug
> >> +  events.
> > Can we do cpuset.cpus.available which lists the cpus available to the
> > cgroup instead of the eventual computed mask for the cgroup?  That'd
> > be more useful as it doesn't lose the information by and'ing what's
> > available with the cgroup's mask and it's trivial to determine the
> > effective from the two masks.
> 
> I don't get what you want here. cpus is the cpuset's cpus_allowed mask.
> effective_cpus is the effective_cpus mask. When you say cpus available
> to the cgroup, do you mean the cpu_online_mask or the list of cpus from
> the parent? Or do you just want to change the name to cpus.available
> instead of effective_cpus?

The available cpus from the parent, where the effective mask is the
AND of cpuset.cpus.available and cpuset.cpus of the cgroup, so that
the user can see what's available for the cgroup unfiltered by
cpuset.cpus.
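
In other words, something like the following relationship, with
parent_cpus_available standing in for the proposed
"cpuset.cpus.available" mask (a sketch, not actual cpuset code):

        /* effective = what the parent makes available & what we asked for */
        cpumask_and(cs->effective_cpus, parent_cpus_available,
                    cs->cpus_allowed);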

> Right, I will set CFTYPE_NOT_ON_ROOT to "cpus" and "mems" as we are not
> supposed to change them in the root. The effective_cpus and
> effective_mems will be there in the root to show what are available.

So, we can do that in the future but let's not do that for now.  It's
the same problem we have for basically everything else and we've
stayed away from replicating the information in the root cgroup.  This
might change in the future but if we do that let's do that
consistently.

Thanks.

-- 
tejun


Re: [PATCH v5 1/2] cpuset: Enable cpuset controller in default hierarchy

2018-03-19 Thread Tejun Heo
Hello, Waiman.

This looks great.  A couple nitpicks below.

> + 5-3. Cpuset
> +   5.3-1. Cpuset Interface Files

Can we put cpuset below pid?  It feels weird to break up cpu, memory
and io as they represent the three major resources and are documented
in a similar fashion.

> +  cpuset.effective_cpus
> + A read-only multiple values file which exists on non-root
> + cgroups.
> +
> + It lists the onlined CPUs that are actually allowed to be
> + used by tasks within the current cgroup. It is a subset of
> + "cpuset.cpus".  Its value will be affected by CPU hotplug
> + events.

Can we do cpuset.cpus.available which lists the cpus available to the
cgroup instead of the eventual computed mask for the cgroup?  That'd
be more useful as it doesn't lose the information by and'ing what's
available with the cgroup's mask and it's trivial to determine the
effective from the two masks.

> +  cpuset.effective_mems
> + A read-only multiple values file which exists on non-root
> + cgroups.
> +
> + It lists the onlined memory nodes that are actually allowed
> + to be used by tasks within the current cgroup.  It is a subset
> + of "cpuset.mems".  Its value will be affected by memory nodes
> + hotplug events.

Ditto.

> +static struct cftype dfl_files[] = {
> + {
> + .name = "cpus",
> + .seq_show = cpuset_common_seq_show,
> + .write = cpuset_write_resmask,
> + .max_write_len = (100U + 6 * NR_CPUS),
> + .private = FILE_CPULIST,
> + },

Is it missing CFTYPE_NOT_ON_ROOT?  Other files too.
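
For illustration, the fix being asked about would presumably amount to
tagging each entry like this:

        {
                .name = "cpus",
                .flags = CFTYPE_NOT_ON_ROOT,    /* hide the file on root */
                .seq_show = cpuset_common_seq_show,
                .write = cpuset_write_resmask,
                .max_write_len = (100U + 6 * NR_CPUS),
                .private = FILE_CPULIST,
        },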

Thanks.

-- 
tejun


Re: [PATCH v5 2/2] cpuset: Add cpuset.flags control knob to v2

2018-03-19 Thread Tejun Heo
Hello, Waiman.

On Thu, Mar 15, 2018 at 05:20:42PM -0400, Waiman Long wrote:
> + The currently supported flag is:
> +
> +   sched_load_balance
> + When it is not set, there will be no load balancing
> + among CPUs on this cpuset.  Tasks will stay in the
> + CPUs they are running on and will not be moved to
> + other CPUs.
> +
> + When it is set, tasks within this cpuset will be
> + load-balanced by the kernel scheduler.  Tasks will be
> + moved from CPUs with high load to other CPUs within
> + the same cpuset with less load periodically.

Hmm... looks like this is something which can be decided by the cgroup
itself and should be made delegatable.  Given that different flags
might need different delegation settings and the precedence of
memory.oom_group, I think it'd be better to make the flags separate
bool files - ie. cpuset.sched_load_balance which contains 0/1 and
marked delegatable.
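
A sketch of what that could look like as a cftype entry (the handler
names are hypothetical, and CFTYPE_NS_DELEGATABLE is assumed to be the
appropriate way to mark the file delegatable):

        {
                .name = "sched_load_balance",
                .read_u64 = sched_lb_read_u64,
                .write_u64 = sched_lb_write_u64,
                .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
        },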

Thanks.

-- 
tejun


Re: [PATCH v4] cpuset: Enable cpuset controller in default hierarchy

2018-03-19 Thread Tejun Heo
Hello, Mike.

On Thu, Mar 15, 2018 at 03:49:01AM +0100, Mike Galbraith wrote:
> Under the hood v2 details are entirely up to you.  My input ends at
> please don't leave dynamic partitioning standing at the dock when v2
> sails.

So, this isn't about implementation details but about what the
interface achieves - ie, what's the actual function?  The only thing I
can see is blocking the entity which is configuring the hierarchy from
making certain configs.  While that might be useful in some specific
use cases, it seems to miss the bar for becoming its own kernel
feature.  After all, nothing prevents the same entity from clearing
the exclusive bit and making the said changes.

Thanks.

-- 
tejun


Re: [PATCH v4] cpuset: Enable cpuset controller in default hierarchy

2018-03-14 Thread Tejun Heo
Hello,

On Sat, Mar 10, 2018 at 04:47:28AM +0100, Mike Galbraith wrote:
> Some form of cpu_exclusive (preferably exactly that, but something else
> could replace it) is needed to define sets that must not overlap any
> other set at creation time or any time thereafter.  A set with property
> 'exclusive' is the enabler for fundamentally exclusive (but dynamic!)
> set properties such as 'isolated' (etc etc).

I'm not sure cpu_exclusive makes sense.  A controller knob can either
belong to the parent or the cgroup itself and cpu_exclusive doesn't
make sense in either case.

1. cpu_exclusive is owned by the parent as other usual resource
   control knobs.  IOW, it's not delegatable.

   This is weird because it's asking the kernel to protect against its
   own misconfiguration and there's nothing preventing cpu_exclusive
   itself being cleared by the same entity.

2. cpu_exclusive is owned by the cgroup itself like memory.oom_group.
   IOW, it's delegatable.

   This allows a cgroup to affect what its siblings can or cannot do,
   which is broken.  Semantically, it doesn't make much sense either.

I don't think it's a good idea to add a kernel mechanism to prevent
misconfiguration from a single entity.

Thanks.

-- 
tejun


Re: [PATCH] Documentation: Fix 'file_mapped' -> 'mapped_file'

2018-01-30 Thread Tejun Heo
On Tue, Jan 30, 2018 at 05:42:13PM +0100, Florian Schmidt wrote:
> There is no entry file_mapped in the memory.stat file. This looks like a
> simple word flip that's gone unnoticed since 2010 (dc10e281f5fc,
> memcg: update documentation).
> 
> Signed-off-by: Florian Schmidt 

Applied to cgroup/for-4.16.

Thanks.

-- 
tejun


Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

2018-01-30 Thread Tejun Heo
On Tue, Jan 30, 2018 at 01:20:11PM +0100, Michal Hocko wrote:
> From 361275a05ad7026b8f721f8aa756a4975a2c42b1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mho...@suse.com>
> Date: Tue, 30 Jan 2018 09:54:15 +0100
> Subject: [PATCH] oom, memcg: clarify root memcg oom accounting
> 
> David Rientjes has pointed out that the way the root memcg is
> accounted for by the cgroup aware OOM killer is undocumented. Unlike
> regular cgroups there is no accounting going on in the root memcg
> (mostly for performance reasons). Therefore we are summing up oom_badness
> of its tasks. This might result in over-accounting because of the
> oom_score_adj setting. Document this for now.
> 
> Acked-by: Roman Gushchin <g...@fb.com>
> Signed-off-by: Michal Hocko <mho...@suse.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks, Michal.

-- 
tejun


Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

2018-01-29 Thread Tejun Heo
Hello, Michal.

On Mon, Jan 29, 2018 at 11:46:57AM +0100, Michal Hocko wrote:
> @@ -1292,7 +1292,11 @@ the memory controller considers only cgroups belonging 
> to the sub-tree
>  of the OOM'ing cgroup.
>  
>  The root cgroup is treated as a leaf memory cgroup, so it's compared
> -with other leaf memory cgroups and cgroups with oom_group option set.
> +with other leaf memory cgroups and cgroups with oom_group option
> +set. Due to internal implementation restrictions the size of the root
> +cgroup is a cumulative sum of oom_badness of all its tasks (in other
> +words oom_score_adj of each task is obeyed). This might change in the
> +future.

Thanks, we can definitely use more documentation.  However, it's a bit
difficult to follow.  Maybe expand it to a separate paragraph on the
current behavior with a clear warning that the default OOM heuristic
is subject to change?

Thanks.

-- 
tejun


Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-24 Thread Tejun Heo
Hello, Andrew.

On Wed, Jan 24, 2018 at 02:08:05PM -0800, Andrew Morton wrote:
> Can we please try to narrow the scope of this issue by concentrating on
> the userspace interfaces?  David believes that the mount option and
> memory.oom_group will disappear again in the near future, others
> disagree.

I'm confident that the interface is gonna age fine.

Thanks.

-- 
tejun


Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-20 Thread Tejun Heo
Hello, David.

On Fri, Jan 19, 2018 at 12:53:41PM -0800, David Rientjes wrote:
> Hearing no response, I'll implement this as a separate tunable in a v2 
> series assuming there are no better ideas proposed before next week.  One 
> of the nice things about a separate tunable is that an admin can control 
> the overall policy and they can delegate the mechanism (killall vs one 
> process) to a user subtree.  I agree with your earlier point that killall 
> vs one process is a property of the workload and is better defined 
> separately.

If I understood your arguments correctly, the reasons that you thought
your selection policy changes must go together with Roman's victim
action were two-fold.

1. You didn't want a separate knob for group oom behavior and wanted
   it to be combined with selection policy.  I'm glad that you now
   recognize that this would be the wrong design choice.

2. The current selection policy may be exploited by a delegatee and
   strictly hierarchical selection should be available.  We can debate
   the pros and cons of different heuristics; however, to me, the
   followings are clear.

   * A strictly hierarchical approach can't replace the current policy.
     It doesn't work well for a lot of use cases.

   * OOM victim selection policy has always been subject to changes
     and improvements.

I don't see any blocker here.  The issue you're raising can and should
be handled separately.

In terms of interface, what makes an interface bad is when the
purposes aren't crystallized enough and different interface pieces fail
to cleanly encapsulate what's actually necessary.

Here, whether a workload can survive being killed piece-wise or not is
an inherent property of the workload and a pretty binary one at that.
I'm not necessarily against changing it to take string inputs but
don't see rationales for doing so yet.

Thanks.

-- 
tejun


Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

2018-01-17 Thread Tejun Heo
Hello, David.

On Tue, Jan 16, 2018 at 06:15:08PM -0800, David Rientjes wrote:
> The behavior of killing an entire indivisible memory consumer, enabled
> by memory.oom_group, is an oom policy itself.  It specifies that all

I thought we discussed this before but maybe I'm misremembering.
There are two parts to the OOM policy.  One is victim selection, the
other is the action to take thereafter.

The two are different and conflating the two doesn't work too well.  For
example, please consider what should be given to the delegatee when
delegating a subtree, which often is a good exercise when designing
these APIs.

When a given workload is selected for OOM kill (IOW, selected to free
some memory), whether the workload can handle individual process kills
or not is the property of the workload itself.  Some applications can
safely handle some of its processes picked off and killed.  Most
others can't and want to be handled as a single unit, which makes it a
property of the workload.

That makes sense in the hierarchy too because whether one process or
the whole workload is killed doesn't infringe upon the parent's
authority over resources which in turn implies that there's nothing to
worry about how the parent's groupoom setting should constrain the
descendants.

OOM victim selection policy is a different beast.  As you've mentioned
multiple times, especially if you're worrying about people abusing OOM
policies by creating sub-cgroups and so on, the policy, first of all,
shouldn't be delegatable and secondly should have meaningful
hierarchical restrictions so that a policy that an ancestor chose
can't be nullified by a descendant.

I'm not necessarily against adding hierarchical victim selection
policy tunables; however, I am skeptical whether static tunables on
cgroup hierarchy (including selectable policies) can be made clean and
versatile enough, especially because the resource hierarchy doesn't
necessarily, or rather in most cases, match the OOM victim selection
decision tree, but I'd be happy to be proven wrong.

Without explicit configurations, the only thing the OOM killer needs
to guarantee is that the system can make forward progress.  We've
always been tweaking victim selection with or without cgroup and
absolutely shouldn't be locked into a specific heuristic.  The
heuristic is an implementation detail subject to improvements.

To me, your patchset actually seems to demonstrate that these are
separate issues.  The goal of groupoom is just to kill logical units
as cgroup hierarchy can inform the kernel of how workloads are
composed in the userspace.  If you want to improve victim selection,
sure, please go ahead, but your argument that groupoom can't be merged
because of victim selection policy doesn't make sense to me.

Thanks.

-- 
tejun


Re: [PATCH v2] cgroup, docs: document the root cgroup behavior of cpu and io controllers

2018-01-16 Thread Tejun Heo
On Wed, Jan 10, 2018 at 11:33:19PM +0100, Maciej S. Szmigiero wrote:
> Currently, cgroups v2 documentation contains only a generic remark that
> "How resource consumption in the root cgroup is governed is up to each
> controller", which isn't really telling users much, who need to dig in the
> code and / or commit messages to learn the exact behavior.
> 
> In cgroups v1 at least the blkio controller had its operation with respect
> to competition between child threads and child cgroups documented in
> blkio-controller.txt, with references to cfq-iosched.txt.
> Also, cgroups v2 documentation describes v1 behavior of both cpu and
> blkio controllers in an "Issues with v1" section.
> 
> Let's document this behavior also for cgroups v2 to make life easier for
> users.
> 
> Signed-off-by: Maciej S. Szmigiero 

Applied to cgroup/for-4.16.

Thanks.

-- 
tejun


Re: [PATCH][RESEND] cgroup, docs: document the root cgroup behavior of cpu and io controllers

2018-01-08 Thread Tejun Heo
Hello,

On Thu, Jan 04, 2018 at 10:57:00PM +0100, Maciej S. Szmigiero wrote:
> Currently, cgroups v2 documentation contains only a generic remark that
> "How resource consumption in the root cgroup is governed is up to each
> controller", which isn't really telling users much, who need to dig in the
> code and / or commit messages to learn the exact behavior.

I don't think we want to fully guarantee the current behavior.  On the
scheduler side, I don't think it's likely to change but blkio side
*might* change.  Can you please collect the root behavior in a separate
section and clearly note that the behaviors are subject to change?

Thanks.

-- 
tejun


Re: [PATCH] cgroup-v2.txt: fix typos

2018-01-02 Thread Tejun Heo
On Tue, Jan 02, 2018 at 05:27:41PM +0100, Vladimir Rutsky wrote:
> Signed-off-by: Vladimir Rutsky 

Applied to cgroup/for-4.16.

Thanks.

-- 
tejun


Re: [PATCH 2/2] cgroup: Update documentation reference

2018-01-02 Thread Tejun Heo
On Fri, Dec 29, 2017 at 12:02:00PM -0800, Matt Roper wrote:
> The cgroup_subsys structure references a documentation file that has been
> renamed after the v1/v2 split.  Since the v2 documentation doesn't
> currently contain any information on kernel interfaces for controllers,
> point the user to the v1 docs.
> 
> Cc: Tejun Heo <t...@kernel.org>
> Cc: linux-doc@vger.kernel.org
> Signed-off-by: Matt Roper <matthew.d.ro...@intel.com>

Applied 1-2 to cgroup/for-4.16.

Thanks.

-- 
tejun


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-21 Thread Tejun Heo
Hello, Shakeel.

On Thu, Dec 21, 2017 at 07:22:20AM -0800, Shakeel Butt wrote:
> I am claiming memory allocations under global pressure will be
> affected by the performance of the underlying swap device. However
> memory allocations under memcg memory pressure, with memsw, will not
> be affected by the performance of the underlying swap device. A job
> having 100 MiB limit running on a machine without global memory
> pressure will never see swap on hitting 100 MiB memsw limit.

But, without global memory pressure, the swap wouldn't be making any
difference to begin with.  Also, when multiple cgroups are hitting
memsw limits, they'd behave as if swappiness is zero, increasing load
on the filesystems, which then of course will affect everyone
under memory pressure whether they use memsw or not.

> > On top of that, what's the point?
> >
> > 1. As I wrote earlier, given the current OOM killer implementation,
> >whether OOM kicks in or not is not even that relevant in
> >determining the health of the workload.  There are frequent failure
> >modes where OOM killer fails to kick in while the workload isn't
> >making any meaningful forward progress.
> >
> 
> Deterministic oom-killer is not the point. The point is to
> "consistently limit the anon memory" allocated by the job which only
> memsw can provide. A job owner who has requested 100 MiB for a job
> sees some instances of the job suffer at 100 MiB and other instances
> suffer at 150 MiB, is an inconsistent behavior.

So, the first part, I get.  memsw happens to be able to limit the
amount of anon memory.  I really don't think that was the intention
but more of a byproduct that some people might find useful.

The example you listed tho doesn't make much sense to me.  Given two
systems with differing level of memory pressures, two instances can
see wildly different performance regardless of memsw.

> > 2. On hitting memsw limit, the OOM decision is dependent on the
> >performance of the file backing devices.  Why is that necessarily
> >better than being dependent on swap or both, which would increase
> >the reclaim efficiency anyway?  You can't avoid being affected by
> >the underlying hardware one way or the other.
> 
> This is a separate discussion but still the amount of file backed
> pages is known and controlled by the job owner and they have the
> option to use a storage service, providing a consistent performance
> across different data centers, instead of the physical disks of the
> system where the job is running and thus isolating the job's
> performance from the speed of the local disk. This is not possible
> with swap. The swap (and its performance) is and should be transparent
> to the job owners.

And, for your use case, there is a noticeable difference between file
backed and anonymous memories and that's why you want to limit
anonymous memory independently from file backed memory.

It looks like what you actually want is limiting the amount of
anonymous memory independently from file-backed consumptions because,
in your setup, while swap is always on local disk the file storages
are over network and more configurable / flexible.

Assuming I'm not misunderstanding you, here are my thoughts.

* I'm not sure that distinguishing anon and file backed memories like
  that is the direction we want to head.  In fact, the more uniform we
  can behave across them, the more efficient we'd be as we wouldn't
  have that artificial barrier.  It is true that we don't have the
  same level of control for swap tho.

* Even if we want an independent anon limit, memsw isn't the solution.
  It's too conflated.  If you want to have anon limit, the right thing
  to do would be pushing for an independent anon limit, not memsw.
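
(To make "an independent anon limit" concrete: it would be a dedicated
knob rather than a limit derived from memory + swap as memsw does.
The file name below is purely hypothetical, not an existing interface:

        # hypothetical: cap anonymous memory separately from total memory
        echo 80M > /sys/fs/cgroup/job/memory.anon.max
)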

Thanks.

-- 
tejun


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-21 Thread Tejun Heo
Hello, Shakeel.

On Wed, Dec 20, 2017 at 05:15:41PM -0800, Shakeel Butt wrote:
> Let's say we have a job that allocates 100 MiB memory and suppose 80
> MiB is anon and 20 MiB is non-anon (file & kmem).
> 
> [With memsw] Scheduler sets the memsw limit of the job to 100 MiB and
> memory to max. Now suppose the job tries to allocate memory more than
> 100 MiB, it will hit the memsw limit and will try to reclaim non-anon
> memory. The memcg OOM behavior will only depend on the reclaim of
> non-anon memory and will be independent of the underlying swap device.

Sure, the direct reclaim on memsw limit won't reclaim anon pages, but
think about how the state at that point would have formed.  You're
claiming that memsw makes memory allocation and balancing behavior an
invariant against the performance of the swap device that the machine
has.  It's simply not possible.

On top of that, what's the point?

1. As I wrote earlier, given the current OOM killer implementation,
   whether OOM kicks in or not is not even that relevant in
   determining the health of the workload.  There are frequent failure
   modes where OOM killer fails to kick in while the workload isn't
   making any meaningful forward progress.

2. On hitting memsw limit, the OOM decision is dependent on the
   performance of the file backing devices.  Why is that necessarily
   better than being dependent on swap or both, which would increase
   the reclaim efficiency anyway?  You can't avoid being affected by
   the underlying hardware one way or the other.

3. The only thing memsw does is that memsw direct reclaim will only
   consider file backed pages, which I think is more of an accident
   (in an attempt to avoid lower swap setting meaning higher actual
   memory usage) than the intended outcome.  This is obviously
   suboptimal and an implementation detail.  I don't think it's
   something we want to expose to userland as a feature.

Thanks.

-- 
tejun


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-20 Thread Tejun Heo
Hello, Shakeel.

On Wed, Dec 20, 2017 at 12:15:46PM -0800, Shakeel Butt wrote:
> > I don't understand how this invariant is useful across different
> > backing swap devices and availability.  e.g. Our OOM decisions are
> > currently not great in that the kernel can easily thrash for a very
> > long time without making actual progresses.  If you combine that with
> > widely varying types and availability of swaps,
> 
> The kernel never swaps out on hitting memsw limit. So, the varying
> types and availability of swaps becomes invariant to the memcg OOM
> behavior of the job.

The kernel doesn't swap because of memsw because that wouldn't change
the memsw number; however, that has nothing to do with whether the
underlying swap device affects OOM behavior or not.  That invariant
can't prevent memcg decisions from being affected by the performance
of the underlying swap device.  How could it possibly achieve that?

The only reason memsw was designed the way it was is to avoid a lower
swap limit meaning more memory consumption.  It is true
that swap and memory consumptions are interlinked; however, so are
memory and io, and we can't solve these issues by interlinking
separate resources in a single resource knob and that's why they're
separate in cgroup2.

> > Sure, but what does memswap achieve?
> 
> 1. memswap provides consistent memcg OOM killer and memcg memory
> reclaim behavior independent to swap.
> 2. With memswap, the job owners do not have to think or worry about swaps.

To me, you sound massively confused on what memsw can do.  It could be
that I'm just not understanding what you're saying.  So, let's try
this one more time.  Can you please give one concrete example of memsw
achieving critical capabilities that aren't possible without it?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-20 Thread Tejun Heo
Hello, Shakeel.

On Tue, Dec 19, 2017 at 02:39:19PM -0800, Shakeel Butt wrote:
> Suppose a user wants to run multiple instances of a specific job on
> different datacenters and s/he has budget of 100MiB for each instance.
> The instances are scheduled on the requested datacenters and the
> scheduler has set the memory limit of those instances to 100MiB. Now,
> some datacenters have swap deployed, so, there, let's say, the swap
> limit of those instances are set according to swap medium
> availability. In this setting the user will see inconsistent memcg OOM
> behavior. Some of the instances see OOMs at 100MiB usage (suppose only
> anon memory) while some will see OOMs way above 100MiB due to swap.
> So, the user is required to know the internal knowledge of datacenters
> (like which has swap or not and swap type) and has to set the limits
> accordingly and thus increase the chance of config bugs.

I don't understand how this invariant is useful across different
backing swap devices and availability.  e.g. Our OOM decisions are
currently not great in that the kernel can easily thrash for a very
long time without making actual progresses.  If you combine that with
widely varying types and availability of swaps, whether something is
OOMing or not doesn't really tell you much.  The workload could be
running completely fine or have been thrashing without making any
meaningful forward progress for the past 15 mins.

Given that whether or not swap exists, how much is avialable and how
fast the backing swap device is all highly influential parameters in
how the workload behaves, I don't see what having sum of memory + swap
as an invariant actually buys.  And, even that essentially meaningless
invariant doesn't really exist - the performance of the swap device
absolutely affects when the OOM killer would kick in.

So, I don't see how the sum of memory+swap makes it possible to ignore
the swap type and availability.  Can you please explain that further?

> Also different types and sizes of swap mediums in data center will
> further complicates the configuration. One datacenter might have SSD
> as a swap, another might be doing swap on zram and third might be
> doing swap on nvdimm. Each can have different size and can be assigned
> to jobs differently. So, it is possible that the instances of the same
> job might be assigned different swap limit on different datacenters.

Sure, but what does memswap achieve?

Thanks.

-- 
tejun


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-19 Thread Tejun Heo
Hello,

On Tue, Dec 19, 2017 at 10:25:12AM -0800, Shakeel Butt wrote:
> Making the runtime environment, an invariant is very critical to make
> the management of a job easier whose instances run on different
> clusters across the world. Some clusters might have different type of
> swaps installed while some might not have one at all and the
> availability of the swap can be dynamic (i.e. swap medium outage).
> 
> So, if users want to run multiple instances of a job across multiple
> clusters, they should be able to specify the limits of their jobs
> irrespective of the knowledge of the cluster. The best case would be they
> just submit their jobs without any config and the system figures out
> the right limit and enforce that. And to figure out the right limit
> and enforcing it, the consistent memory usage history and consistent
> memory limit enforcement is very critical.

I'm having a hard time extracting anything concrete from your
explanation on why memsw is required.  Can you please ELI5 with some
examples?

Thanks.

-- 
tejun


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-19 Thread Tejun Heo
Hello,

On Tue, Dec 19, 2017 at 09:23:29AM -0800, Shakeel Butt wrote:
> To provide consistent memory usage history using the current
> cgroup-v2's 'swap' interface, an additional metric expressing the
> intersection of memory and swap has to be exposed. Basically memsw is
> the union of memory and swap. So, if that additional metric can be

Exposing anonymous pages with swap backing sounds pretty trivial.

> used to find the union. However for consistent memory limit
> enforcement, I don't think there is an easy way to use current 'swap'
> interface.

Can you please go into details on why this is important?  I get that
you can't do it as easily w/o memsw but I don't understand why this is
a critical feature.  Why is that?

Thanks.

-- 
tejun


Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

2017-12-19 Thread Tejun Heo
Hello,

On Tue, Dec 19, 2017 at 07:12:19AM -0800, Shakeel Butt wrote:
> Yes, there are pros & cons, therefore we should give users the option
> to select the API that is better suited for their use-cases and

Heh, that's not how API decisions should be made.  The long term
outcome would be really really bad.

> environment. Both approaches are not interchangeable. We use memsw
> internally for use-cases I mentioned in commit message. This is one of
> the main blockers for us to even consider cgroup-v2 for memory
> controller.

Let's concentrate on the use case.  I couldn't quite understand what
was missing from your description.  You said that it'd make things
easier for the centralized monitoring system which isn't really a
description of a use case.  Can you please go into more details
focusing on the eventual goals (rather than what's currently
implemented)?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] cgroup, docs: document cgroup v2 device controller

2017-12-13 Thread Tejun Heo
On Wed, Dec 13, 2017 at 07:49:03PM +, Roman Gushchin wrote:
> Add the corresponding section in cgroup v2 documentation.
> 
> Signed-off-by: Roman Gushchin <g...@fb.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Alexei Starovoitov <a...@kernel.org>
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org

Applied to cgroup/for-4.16.

Thanks.

-- 
tejun


Re: [PATCH v3] cpuset: Enable cpuset controller in default hierarchy

2017-11-27 Thread Tejun Heo
Hello, Waiman.

On Mon, Nov 27, 2017 at 04:19:57PM -0500, Waiman Long wrote:
> > Let's start just with [e]cpus and [e]mems.  The flags interface looks
> > fine but the implementations of these features are really bad and
> > cgroup2 doesn't migrate resources for other controllers either anyway.
> 
> That is added because the mem_migrate feature is used in libvirt, I
> think. I am thinking of adding a "[EXPERIMENTAL]" tag to the flags to
> indicate that it is subject to change.

I see.  Do you happen to know what it's used for and why that's
necessary just so that we can evaluate it better?  I'm not quite sure
what adding [EXPERIMENTAL] tag would achieve.  If we expose the
feature and people use it, we just have to keep it anyway.

Thanks.

-- 
tejun


Re: [PATCH v3] cpuset: Enable cpuset controller in default hierarchy

2017-11-27 Thread Tejun Heo
Hello, Waiman.

Sorry about the long delay.

On Fri, Oct 06, 2017 at 05:10:30PM -0400, Waiman Long wrote:
> +Cpuset Interface Files
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> +  cpuset.cpus
> + A read-write multiple values file which exists on non-root
> + cgroups.
> +
> + It lists the CPUs allowed to be used by tasks within this
> + cgroup.  The CPU numbers are comma-separated numbers or
> + ranges.  For example:
> +
> +   # cat cpuset.cpus
> +   0-4,6,8-10
> +
> + An empty value indicates that the cgroup is using the same
> + setting as the nearest cgroup ancestor with a non-empty
> + "cpuset.cpus" or all the available CPUs if none is found.
> +
> + The value of "cpuset.cpus" stays constant until the next update
> + and won't be affected by any CPU hotplug events.
> +
> +  cpuset.effective_cpus

Can we do cpuset.ecpus in the fashion of euid, egid..?

> +  cpuset.effective_mems

Ditto.

> +  cpuset.flags
> + A read-write multiple values file which exists on non-root
> + cgroups.
> +
> + It lists the flags that are set (with a '+' prefix) and those
> + that are not set (with a '-' prefix).   The currently supported
> + flag is:
> +
> +   mem_migrate
> + When it is not set, an allocated memory page will
> + stay in whatever node it was allocated independent
> + of changes in "cpuset.mems".
> +
> + When it is set, tasks with memory pages not in
> + "cpuset.mems" will have those pages migrated over to
> + memory nodes specified in "cpuset.mems".  Any changes
> + to "cpuset.mems" will cause pages in nodes that are
> + no longer valid to be migrated over to the newly
> + valid nodes.

Let's start just with [e]cpus and [e]mems.  The flags interface looks
fine but the implementations of these features are really bad and
cgroup2 doesn't migrate resources for other controllers either anyway.

Thanks.

-- 
tejun


Re: [PATCH] cfq-iosched: add "leaf_weight" setting for the root cgroup in cgroups v2

2017-10-30 Thread Tejun Heo
On Sun, Oct 29, 2017 at 05:36:53PM +0100, Maciej S. Szmigiero wrote:
> CFQ scheduler has a property that processes (or tasks in cgroups v1) that
> aren't assigned to any particular cgroup - that is, which stay in the root
> cgroup - effectively form an implicit leaf child node attached to the root
> cgroup.
> 
> This behavior is documented in blkio-controller.txt for cgroups v1, however
> as far as I know it isn't documented anywhere for cgroups v2 besides a
> generic remark that "How resource consumption in the root cgroup is
> governed is up to each controller" in cgroup-v2.txt.
> 
> By default, this implicit leaf child node has a (CFQ) weight which is two
> times higher than the default weight of a child cgroup.
> 
> cgroups v1 provide a "leaf_weight" setting which allows changing this value.
> However, this setting is missing from cgroups v2 and so the only way to
> tweak how much IO time processes in the root cgroup get is to adapt
> weight settings of all child cgroups accordingly.
> Let's add a "leaf_weight" setting to the root cgroup in cgroups v2, too.
> 
> Note that new kernel threads appear in the root cgroup and there seems to
> be no way to change this since kthreadd cannot be moved to another cgroup
> (for a good reason).
> 
> Signed-off-by: Maciej S. Szmigiero 

I don't think we wanna do this.  It's inconsistent with what other
controllers do and we want to charge the IOs in the root cgroup to the
right cgroup.

Thanks.

-- 
tejun


Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Tejun Heo
Hello, Michal.

On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

While boot params are fine for development and debugging, as a
user-interface, they aren't great.

* The user can't easily confirm whether the config they input is
  correct and when they get it wrong what's wrong can be pretty
  mysterious.

* While kernel params can be made r/w through /proc, people usually
  don't expect that and using that can become really confusing because
  a lot of people use "dmesg|grep" to confirm the boot params and that
  won't agree with the setting written later.

* It can't be scoped.  What if we want to choose different policies
  per delegated subtree?

* Boot params aren't the easiest to play with (again, if you're a
  developer, they're fine, but most users aren't developers) and are
  prone to cause deployment issues.

* In this case, even worse because it ends up silently ignoring a
  clearly explicit configuration in an interface file.

If the behavior differences we get from the group oom code aren't critical
(and it doesn't seem to be), I'd greatly prefer just enabling it when
cgroup2 is in use.  If it absolutely must be opt-in even on cgroup2,
we can discuss other ways but I'd really like to see stronger
rationales before going that route.

Thanks.

-- 
tejun


Re: [v9 3/5] mm, oom: cgroup-aware OOM killer

2017-10-03 Thread Tejun Heo
Hello, Michal.

On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.

Here's a counter example.

Let's say there's a container which hosts one main application, and
the container shares its host with other containers.

* Let's say the container is a regular containerized OS instance and
  can't really guarantee system integrity if one its processes gets
  randomly killed.

* However, the application that it's running inside an isolated cgroup
  is more intelligent and composed of multiple interchangeable
  processes and can treat killing of a random process as partial
  capacity loss.

When the host is setting up the outer container, it doesn't
necessarily know whether the containerized environment would be able
to handle partial OOM kills or not.  It's akin to panic_on_oom setting
at system level - it's the containerized instance itself which knows
whether it can handle partial OOM kills or not.  This is why this knob
should be delegatable.

Now, the container itself has group OOM set and the isolated main
application is starting up.  It obviously wants partial OOM kills
rather than group killing.  This is the same principle.  The
application which is being contained in the cgroup is the one which
knows how it can handle OOM conditions, not the outer environment, so
it obviously needs to be able to set the configuration it wants.

Thanks.

-- 
tejun


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-22 Thread Tejun Heo
Hello,

On Fri, Sep 22, 2017 at 01:39:55PM -0700, David Rientjes wrote:
> Current heuristic based on processes is coupled with per-process
> /proc/pid/oom_score_adj.  The proposed 
> heuristic has no ability to be influenced by userspace, and it needs one.  
> The proposed heuristic based on memory cgroups coupled with Roman's 
> per-memcg memory.oom_priority is appropriate and needed.  It is not 

So, this is where we disagree.  I don't think it's a good design.

> "sophisticated intelligence," it merely allows userspace to protect vital 
> memory cgroups when opting into the new features (cgroups compared based 
> on size and memory.oom_group) that we very much want.

which can't achieve that goal very well for a wide variety of users.

> > We even change the whole scheduling behaviors and try really hard to
> > not get locked into specific implementation details which exclude
> > future improvements.  Guaranteeing OOM killing selection would be
> > crazy.  Why would we prevent ourselves from doing things better in the
> > future?  We aren't talking about the semantics of read(2) here.  This
> > is a kernel emergency mechanism to avoid deadlock at the last moment.
> 
> We merely want to prefer other memory cgroups are oom killed on system oom 
> conditions before important ones, regardless if the important one is using 
> more memory than the others because of the new heuristic this patchset 
> introduces.  This is exactly the same as /proc/pid/oom_score_adj for the 
> current heuristic.

You were arguing that we should lock into a specific heuristic and
guarantee the same behavior.  We shouldn't.

When we introduce a user visible interface, we're making a lot of
promises.  My point is that we need to be really careful when making
those promises.

> If you have this low priority maintenance job charging memory to the high 
> priority hierarchy, you're already misconfigured unless you adjust 
> /proc/pid/oom_score_adj because it will oom kill any larger process than 
> itself in today's kernels anyway.
> 
> A better configuration would be attach this hypothetical low priority 
> maintenance job to its own sibling cgroup with its own memory limit to 
> avoid exactly that problem: it going berserk and charging too much memory 
> to the high priority container that results in one of its processes 
> getting oom killed.

And how do you guarantee that across delegation boundaries?  The
points you raise on why the priority should be applied level-by-level
are exactly the same points why this doesn't really work.  OOM killing
priority isn't something which can be distributed across cgroup
hierarchy level-by-level.  The resulting decision tree doesn't make
any sense.

I'm not against adding something which works but strict level-by-level
comparison isn't the solution.

Thanks.

-- 
tejun


Re: [v8 0/4] cgroup-aware OOM killer

2017-09-22 Thread Tejun Heo
Hello, David.

On Thu, Sep 21, 2017 at 02:17:25PM -0700, David Rientjes wrote:
> It doesn't have anything to do with my particular usecase, but rather the 
> ability of userspace to influence the decisions of the kernel.  Previous 
> to this patchset, when selection is done based on process size, userspace 
> has full control over selection.  After this patchset, userspace has no 
> control other than setting all processes to be oom disabled if the largest 
> memory consumer is to be protected.  Roman's memory.oom_priority provides 
> a perfect solution for userspace to be able to influence this decision 
> making and causes no change in behavior for users who choose not to tune 
> memory.oom_priority.  The nack originates from the general need for 
> userspace influence over oom victim selection and to avoid userspace 
> needing to take the rather drastic measure of setting all processes to be 
> oom disabled to prevent oom kill in kernels before oom priorities are 
> introduced.

Overall, I think that OOM killing is the wrong place to implement
sophisticated intelligence in.  It's too late to be smart - the
workload already has suffered significantly and there's only a very
limited amount of computing which can be performed.  That said, if
there's a useful and general enough mechanism to configure OOM killer
behavior from userland, that can definitely be useful.

> The patchset compares memory cgroup size relative to sibling cgroups only, 
> the same comparison for memory.oom_priority.  There is a guarantee 
> provided on how cgroup size is compared in select_victim_memcg(), it 
> hierarchically accumulates the "size" from leaf nodes up to the root memcg 
> and then iterates the tree comparing sizes between sibling cgroups to 
> choose a victim memcg.  That algorithm could be more elaborately described 
> in the documentation, but we simply cannot change the implementation of 
> select_victim_memcg() later even without oom priorities since users cannot 
> get inconsistent results after opting into a feature between kernel 
> versions.  I believe the selection criteria should be implemented to be 
> deterministic, as select_victim_memcg() does, and the documentation should 
> fully describe what the selection criteria is, and then allow the user to 
> decide.
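
For reference, the described walk in illustrative pseudocode - the
helper names (has_children, for_each_child, memcg_size) are made up
and this is not the actual mm/oom_kill.c code:

        static struct mem_cgroup *select_victim_memcg(struct mem_cgroup *root)
        {
                struct mem_cgroup *victim = root;

                /* sizes have been accumulated leaf-up beforehand;
                 * descend picking the biggest child at each level */
                while (has_children(victim)) {
                        struct mem_cgroup *iter, *biggest = NULL;

                        for_each_child(iter, victim)
                                if (!biggest ||
                                    memcg_size(iter) > memcg_size(biggest))
                                        biggest = iter;
                        victim = biggest;
                }
                return victim;
        }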

We even change the whole scheduling behaviors and try really hard to
not get locked into specific implementation details which exclude
future improvements.  Guaranteeing OOM killing selection would be
crazy.  Why would we prevent ourselves from doing things better in the
future?  We aren't talking about the semantics of read(2) here.  This
is a kernel emergency mechanism to avoid deadlock at the last moment.

> Roman is planning on introducing memory.oom_priority back into the 
> patchset per https://marc.info/?l=linux-kernel&m=150574701126877 and I 
> agree with the very clear semantic that it introduces: to have the 
> size-based comparison use the same rules as the userspace priority 
> comparison.  It's very powerful and I'm happy to ack the final version 
> that he plans on posting.

To me, the proposed oom_priority mechanism seems too limited and makes
the error of tightly coupling the hierarchical behavior of resource
distribution with OOM victim selection.  They can be related but are
not the same and coupling them together in the kernel interface is
likely a mistake which will lead to long term pains that we can't
easily get out of.

Here's a really simple use case.  Imagine a system which hosts two
containers of services and one is somewhat favored over the other and
wants to set up cgroup hierarchy so that resources are split at the
top level between the two containers.  oom_priority is set accordingly
too.  Let's say a low priority maintenance job in the higher priority
container goes berserk, as they often do, and pushes the system into
OOM.
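
For the sake of the example, the setup would be something like the
following - memory.oom_priority being the proposed knob, not an
existing interface, and assuming here, for the sketch, that a higher
value means "kill from this subtree first":

          mkdir /sys/fs/cgroup/favored /sys/fs/cgroup/other
          echo 1 > /sys/fs/cgroup/favored/memory.oom_priority
          echo 2 > /sys/fs/cgroup/other/memory.oom_priority
          mkdir /sys/fs/cgroup/favored/maintenance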

With the proposed static oom_priority mechanism, the only
configuration which can be expressed is "kill all of the lower top
level subtree before any of the higher one", which is a silly
restriction leading to silly behavior and a direct result of
conflating the resource distribution network with the level-by-level
OOM killing decision.

If we want to allow users to steer OOM killing, I suspect that it
should be aligned at delegation boundaries rather than on cgroup
hierarchy itself.  We can discuss that but it is a separate
discussion.

The mechanism being proposed is fundamentally flawed.  You can't push
that in by nacking other improvements.

Thanks.

-- 
tejun


Re: [PATCH] Documentation: core-api: minor workqueue.rst cleanups

2017-09-18 Thread Tejun Heo
On Mon, Sep 18, 2017 at 01:10:15PM -0700, Randy Dunlap wrote:
> From: Randy Dunlap <rdun...@infradead.org>
> 
> Clean up workqueue.rst:
> - fix minor typos
> - put '@' after `` instead of preceding them (one place)
> - use "CPU" instead of "cpu" in text consistently
> - quote one function name
> 
> Signed-off-by: Randy Dunlap <rdun...@infradead.org>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Florian Mickler <flor...@mickler.org>

Applied to wq/for-4.14-fixes.

Thanks.

-- 
tejun


Re: [PATCH] cpuset: Allow v2 behavior in v1 cgroup

2017-08-16 Thread Tejun Heo
Hello,

On Wed, Aug 16, 2017 at 10:34:05AM -0400, Waiman Long wrote:
> > It feels weird to make this a kernel boot param when all other options
> > are specified at mount time.  Is there a reason why this can't be a
> > mount option too?
> >
> Yes, we can certainly make this a mount option instead of a boot time
> parameter. BTW, where do we usually document the mount options for cgroup?

I don't think there's a central place.  Each controller documents
theirs in their own file.

Thanks.

-- 
tejun


Re: [PATCH] cpuset: Allow v2 behavior in v1 cgroup

2017-08-16 Thread Tejun Heo
Hello, Waiman.

On Tue, Aug 15, 2017 at 01:27:20PM -0400, Waiman Long wrote:
> + cpuset_v2_mode= [KNL] Enable cpuset v2 behavior in cpuset v1 cgroups.
> + In v2 mode, the cpus and mems can be restored to
> + their original values after a removal-addition
> + event sequence.
> + 0: default value, cpuset v1 keeps legacy behavior.
> + 1: cpuset v1 behaves like cpuset v2.
> +

It feels weird to make this a kernel boot param when all other options
are specified at mount time.  Is there a reason why this can't be a
mount option too?
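
Something along the following lines, with a hypothetical option name
mirroring the proposed boot param:

          # v1 hierarchy mounted with v2 behavior for cpuset
          mount -t cgroup -o cpuset,cpuset_v2_mode none /sys/fs/cgroup/cpuset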

Thanks.

-- 
tejun


Re: [RFC 2/4] cgroup: implement hierarchy limits

2017-08-02 Thread Tejun Heo
Hello, Roman.

Generally looks good to me.  One minor nit.

On Wed, Aug 02, 2017 at 05:55:30PM +0100, Roman Gushchin wrote:
> +static ssize_t cgroup_max_descendants_write(struct kernfs_open_file *of,
> +char *buf, size_t nbytes, loff_t off)
> +{
> + struct cgroup *cgrp;
> + int descendants;
> + ssize_t ret;
> +
> + buf = strstrip(buf);
> + if (!strcmp(buf, "max")) {
> + descendants = INT_MAX;
> + } else {
> + ret = kstrtouint(buf, 0, &descendants);
          ^^^
          shouldn't this be kstrtoint?

Thanks.

-- 
tejun


Re: [PATCH] cgroup: add cgroup.stat interface with basic hierarchy stats

2017-07-28 Thread Tejun Heo
Hello,

On Fri, Jul 28, 2017 at 02:01:55PM +0100, Roman Gushchin wrote:
> > > +   nr_dying_descendants
> > > + Total number of dying descendant cgroups.
> > 
> > Can you please go into more detail on what's going on with dying
> > descendants here?
>
> Sure.
> Don't we plan to describe the cgroup/css lifecycle in detail
> in a separate section?

We should but it'd still be nice to have a short description here too.

Thanks.

-- 
tejun


Re: [PATCH] cgroup: add cgroup.stat interface with basic hierarchy stats

2017-07-27 Thread Tejun Heo
Hello,

On Thu, Jul 27, 2017 at 05:14:20PM +0100, Roman Gushchin wrote:
> Add a cgroup.stat interface to the base cgroup control files
> with the following metrics:
> 
> nr_descendants        total number of descendant cgroups
> nr_dying_descendants  total number of dying descendant cgroups
> max_descendant_depth  maximum descent depth below the current cgroup

Yeah, this'd be great to have.  Some comments below.

> +  cgroup.stat
> + A read-only flat-keyed file with the following entries:
> +
> +   nr_descendants
> + Total number of descendant cgroups.
> +
> +   nr_dying_descendants
> + Total number of dying descendant cgroups.

Can you please go into more detail on what's going on with dying
descendants here?

> +static int cgroup_stats_show(struct seq_file *seq, void *v)
> +{
> + struct cgroup_subsys_state *css;
> + unsigned long total = 0;
> + unsigned long offline = 0;
> + int max_level = 0;
> +
> + rcu_read_lock();
> + css_for_each_descendant_pre(css, seq_css(seq)) {
> + if (css == seq_css(seq))
> + continue;
> + ++total;

Let's do post increment for consistency.

> + if (!(css->flags & CSS_ONLINE))
> + ++offline;
> + if (css->cgroup->level > max_level)
> + max_level = css->cgroup->level;
> + }
> + rcu_read_unlock();

I wonder whether we want to keep these counters in sync instead of
trying to gather the number on read.  Walking all descendants can get
expensive pretty quickly and things like nr_descendants will be useful
for other purposes too.
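
Something like the following sketch, where the creation / destruction
paths update the ancestors once - the field and function names are
made up here:

        /* call under cgroup_mutex with +1 on creation, -1 on release */
        static void cgroup_update_descendant_counts(struct cgroup *cgrp,
                                                    int delta)
        {
                struct cgroup *pos;

                lockdep_assert_held(&cgroup_mutex);
                for (pos = cgroup_parent(cgrp); pos; pos = cgroup_parent(pos))
                        pos->nr_descendants += delta;   /* hypothetical field */
        }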

Thanks.

-- 
tejun


Re: [PATCH 06/31] cgroup-v2.txt: standardize document format

2017-06-18 Thread Tejun Heo
Hello, Andrew.

Can you please apply Mauro's doc format update patch for
cgroup-v2.txt?  The mm part causes conflicts, so I think it'd be
easier to route it through -mm.

The patch to apply is in the following message.

 https://marc.info/?l=linux-cgroups&m=149771274105674&q=raw

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-06-03 Thread Tejun Heo
Hello,

On Fri, Jun 02, 2017 at 04:36:22PM -0400, Waiman Long wrote:
> I wouldn't argue further on that if you insist. However, I still want to

Oh, please don't get me wrong.  I'm not trying to shut down the
discussion or anything.  It's just that whole-scope discussions can
get very meandering and time-consuming when these two issues can be
decoupled from each other without compromising on either.  Let's
approach these issues separately.

> relax the constraint somewhat by abandoning the no internal process
> constraint when only threaded controllers (non-resource domains) are
> enabled even when thread mode has not been explicitly enabled. It is a
> modified version of my second alternative. Now the question is which
> controllers are considered to be resource domains. I think memory and
> blkio are in the list. What else do you think should be considered
> resource domains?

And we're now a bit into repeating ourselves but for controlling any
significant resources (mostly cpu, memory, io), there's gotta be a
significant portion of resource consumption which isn't tied to
specific processes or threads that should be accounted for.  Both
memory and io already do this to a certain extent, but not completely.
cpu doesn't do it at all yet but we usually can't / shouldn't declare
a resource category to be domain-free.

There are exceptions - controllers which are only used for membership
identification (perf and the old net controllers), pids which is
explicitly tied to tasks (note that CPU cycles aren't), cpuset which
is an attribute propagating / restricting controller.

Out of those, the identification uses already aren't affected by the
constraint as they're now all either direct membership test against
the hierarchy or implicit controllers which aren't subject to the
constraint.  That leaves pids and cpuset.  We can exempt them from the
constraint but I'm not quite sure what that buys us given that neither
is affected by requiring explicit leaf nodes.  It'd just make the
rules more complicated without actual benefits.

That said, we can exempt those two.  I don't see much point in it but
we can definitely discuss the pros and cons, and it's likely that it's
not gonna make much difference wichever way we choose.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-06-01 Thread Tejun Heo
Hello,

On Thu, Jun 01, 2017 at 05:12:42PM -0400, Waiman Long wrote:
> Are you referring to keeping the no internal process restriction and
> document how to work around that instead? I would like to hear what
> workarounds are currently being used.

What we've been talking about all along - just creating explicit leaf
nodes.
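
IOW, something as simple as the following from userland (names
illustrative):

          # park A's own processes in an explicit leaf before enabling
          # controllers in A so they never compete with A's children
          mkdir /sys/fs/cgroup/A/leaf
          echo $PID > /sys/fs/cgroup/A/leaf/cgroup.procs
          echo "+memory +io" > /sys/fs/cgroup/A/cgroup.subtree_control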

> Anyway, you currently allow internal process in thread mode, but not in
> non-thread mode. I would prefer no such restriction in both thread and
> non-thread mode.

Heh, so, these aren't arbitrary.  The constraint is tied to
implementing resource domains and thread subtree doesn't have resource
domains in them, so they don't need the constraint.  I'm sorry about
the short replies but I'm kinda really tied up right now.  I'm gonna
do the thread mode so that it can be agnostic w.r.t. the internal
process constraint and I think it could be helpful to decouple these
discussions.  We've been having this discussion for a couple years now
and it looks like we're gonna go through it all over, which is fine,
but let's at least keep that separate.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-06-01 Thread Tejun Heo
Hello,

On Thu, Jun 01, 2017 at 04:48:48PM -0400, Waiman Long wrote:
> I think we are in agreement here. I think we should just document how
> userland can work around the internal process competition issue by
> setting up the cgroup hierarchy properly. Then we can remove the no
> internal process constraint.

Heh, we agree on the immediate solution but not the final direction.
This requirement affects how controllers implement resource control in
significant ways.  It is a restriction which can be worked around in
userland relatively easily.  I'd much prefer to keep the invariant
intact.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-06-01 Thread Tejun Heo
Hello,

On Thu, Jun 01, 2017 at 03:27:35PM -0400, Waiman Long wrote:
> As said in an earlier email, I agreed that masking it on the kernel side
> may not be the best solution. I offer 2 other alternatives:
> 1) Document on how to work around the resource domains issue by proper
> setup of the cgroup hierarchy.

We can definitely improve documentation.

> 2) Mark those controllers that require the no internal process
> competition constraint and disallow internal process only when those
> controllers are active.

We *can* do that but wouldn't this be equivalent to enabling thread
mode implicitly when only thread aware controllers are enabled?

> I prefer the first alternative, but I can go with the second if necessary.
> 
> The major rationale behind my enhanced thread mode patch was to allow
> something like
> 
>  R -- A -- B
>  \
>   T1 -- T2
> 
> where you can have resource domain controllers enabled in the thread
> root as well as some child cgroups of the thread root. As the no internal
> process rule is currently not applicable to the thread root, this
> creates the dilemma that we need to deal with internal process competition.
> 
> The container invariant that PeterZ talked about will also be a serious
> issue here as I don't think we are going to set up a container root
> cgroup that will have no process allowed in it because it has some child
> cgroups. IMHO, I don't think cgroup v2 will get wide adoption without
> getting rid of that no internal process constraint.

The only thing which is necessary from inside a container is putting
the management processes into their own cgroups so that they can be
controlled (ie. the same thing you did with your patch but doing that
explicitly from userland) and userland management sw can do the same
thing whether it's inside a container or on a bare system.  BTW,
systemd already does so and works completely fine in terms of
containerization on cgroup2.  It is arguable whether we should make
this more convenient from kernel side but using cgroup2 for resource
control already requires the userspace tools to be adapted to it, so
I'm not sure how much benefit we'd gain from adding that compared to
explicitly documenting it.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-06-01 Thread Tejun Heo
Hello, Waiman.

On Thu, Jun 01, 2017 at 02:44:48PM -0400, Waiman Long wrote:
> > And cgroup-v2 has this 'exception' (aka wart) for the root group which
> > needs to be replicated for each namespace.
> 
> One of the changes that I proposed in my patches was to get rid of the
> no internal process constraint. I think that will solve a big part of
> the container invariant problem that we have with cgroup v2.

I'm not sure.  It just masks it without actually solving it.  I mean,
the constraint is there for a reason.  "Solving" it would defeat one
of the main capabilities for resource domains and masking it from
kernel side doesn't make whole lot of sense to me given that it's
something which can be easily done from userland.  If we take out that
part, for controllers which don't care about resource domains,
wouldn't thread mode be a sufficient solution?

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-06-01 Thread Tejun Heo
Hello, Peter.

On Thu, Jun 01, 2017 at 05:10:45PM +0200, Peter Zijlstra wrote:
> I've not had time to look at any of this. But the question I'm most
> curious about is how cgroup-v2 preserves the container invariant.
> 
> That is, each container (namespace) should look like a 'real' machine.
> So just like userns allows to have a uid-0 (aka root) for each container
> and pidns allows a pid-1 for each container, cgroupns should provide a
> root group for each container.
> 
> And cgroup-v2 has this 'exception' (aka wart) for the root group which
> needs to be replicated for each namespace.

The goal has never been that a container must be indistinguishable
from a real machine.  For certain things, things simply don't have
exact equivalents due to sharing (memory stats or journal writes for
example) and those things are exactly why people prefer containers
over VMs for certain use cases.  If one wants full replication, VM
would be the way to go.

The goal is allowing enough container invariant so that appropriate
workloads can be contained and co-exist in useful ways.  This also
means that the contained workload is usually either a bit illiterate
w.r.t. the system details (doesn't care) or makes some adjustments
for running inside a container (most quasi-full-system ones already
do).

System root is inherently different from all other nested roots.
Making some exceptions for the root isn't about taking away from other
roots but more reflecting the inherent differences - there are things
which are inherently system / bare-metal.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-06-01 Thread Tejun Heo
Hello, Waiman.

A short update.  I tried making root special while keeping the
existing threaded semantics but I didn't really like it because we
have to couple controller enables/disables with threaded
enables/disables.  I'm now trying a simpler, albeit a bit more
tedious, approach which should leave things mostly symmetrical.  I'm
hoping to be able to post mostly working patches this week.

Also, do you mind posting the debug patches as a separate series?
Let's get the bits which make sense indepdently in the tree.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-05-24 Thread Tejun Heo
Hello,

On Wed, May 24, 2017 at 05:17:13PM -0400, Waiman Long wrote:
> An alternative is to have separate enabling for thread root. For example,
> 
> # echo root > cgroup.threads
> # echo enable > child/cgroup.threads
> 
> The first statement makes the current cgroup the thread root. However,
> setting it to a thread root doesn't make its children threaded. This
> has to be explicitly done on each of the children. Once a child cgroup
> is made to be threaded, all its descendants will be threaded. That will
> have the same effect as the current patch.

Yeah, I'm toying with different ideas.  I'll get back to you once
things get more concrete.

> With delegation, do you mean the relationship between a container and
> its host?

It can be but doesn't have to be.  For example, it can be delegations
to users without namespace / container being involved.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-05-24 Thread Tejun Heo
Hello, Waiman.

On Mon, May 22, 2017 at 01:13:16PM -0400, Waiman Long wrote:
> > Maybe I'm misunderstanding the design, but this seems to push the
> > processes which belong to the threaded subtree to the parent which is
> > part of the usual resource domain hierarchy thus breaking the no
> > internal competition constraint.  I'm not sure this is something we'd
> > want.  Given that the limitation of the original threaded mode was the
> > required nesting below root and that we treat root special anyway
> > (exactly in the way necessary), I wonder whether it'd be better to
> > simply allow root to be both domain and thread root.
> 
> Yes, root can be both domain and thread root. I haven't placed any
> restriction on that.

I've been playing with the proposed "make the parent the resource
domain" approach.  Unfortunately, the parent - child relationship
becomes weird.

The parent becomes the thread root, which means that its
cgroup.threads file becomes writable and threads can be put in there.
It's really weird to write to a child's interface and have the
parent's behavior changed.  This becomes weirder with delegation.  If
a cgroup is delegated, its cgroup.threads should be delegated too but
if the child enables threaded mode, that makes the undelegated parent
thread root, which means that either 1. the delegatee can't migrate
threads to the thread root or 2. if the parent's cgroup.threads is
writeable, the delegatee can mess with other descendants under it
which shouldn't be allowed.

I think the operation of making a cgroup a thread root should happen
on the cgroup where that's requested; otherwise, nesting becomes too
twisted.  This should be solvable.  Will think more about it.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2

2017-05-24 Thread Tejun Heo
Hello,

On Wed, May 24, 2017 at 01:49:46PM -0400, Waiman Long wrote:
> What I am saying is as follows:
> / A
> P - B
>\ C
> 
> # echo +memory > P/cgroup.subtree_control
> # echo -memory > P/A/cgroup.controllers
> # echo "#memory" > P/B/cgroup.controllers
> 
> The parent grants the memory controller to its children - A, B and C.
> Child A has the memory controller explicitly disabled. Child B has the
> memory controller in pass-through mode, while child C has the memory
> controller enabled by default. "echo +memory > cgroup.controllers" is
> not allowed. There are 2 possible choices with regard to the '-' or '#'
> prefixes. We can allow them before the grant from the parent or only
> after that. In the former case, the state remains dormant until after
> the grant from the parent.

Ah, I see, you want cgroup.controllers to be able to mask available
controllers by the parent.  Can you expand your example with further
nesting and how #memory on cgroup.controllers would affect the nested
descendant?

Thanks.

-- 
tejun


Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2

2017-05-24 Thread Tejun Heo
Hello, Waiman.

On Fri, May 19, 2017 at 05:20:01PM -0400, Waiman Long wrote:
> > This breaks the invariant that in a cgroup its resource control knobs
> > control distribution of resources from its parent.  IOW, the resource
> > control knobs of a cgroup always belong to the parent.  This is also
> > reflected in how delegation is done.  The delegatee assumes ownership
> > of the cgroup itself and the ability to manage sub-cgroups but doesn't
> > get the ownership of the resource control knobs as otherwise the
> > parent would lose control over how it distributes its resources.
> 
> One twist that I am thinking is to have a controller enabled by the
> parent in subtree_control, but then allow the child to either disable it
> or set it in pass-through mode by writing to controllers file. IOW, a
> child cannot enable a controller without parent's permission. Once a
> child has permission, it can do whatever it wants. A parent cannot force
> a child to have a controller enabled.

Heh, I think I need more details to follow your proposal.  Anyways,
what we need to guarantee is that a descendant is never allowed to
pull in more resources than its ancestors want it to.

> > Another aspect is that most controllers aren't that sensitive to
> > nesting several levels.  Expensive operations can be and already are
> > aggregated and the performance overhead of several levels of nesting
> > barely shows up.  Skipping levels can be an interesting optimization
> > approach and we can definitely support from the core side; however,
> > it'd be a lot nicer if we could do that optimization transparently
> > (e.g. CPU can skip multi level queueing if there usually is only one
> > item at some levels).
> 
> The trend that I am seeing is that the total number of controllers is
> going to grow over time. New controllers may be sensitive to the level
> of nesting like the cpu controller. I am also thinking about how systemd
> is using the cgroup filesystem for task classification purpose without
> any controller attached to it. With this scheme, we can accommodate all
> the different needs without using different cgroup filesystems.

I'm not sure about that.  It's true that cgroup hierarchy is being
used more but there are only so many hard / complex resources that we
deal with - cpu, memory and io.  Beyond those, other uses are usually
about identifying membership (perf, net) or propagating and
restricting attributes (cpuset).  pids can be considered an exception
but we have it only because pids can globally run out a lot sooner
than can be controlled through memory.  Even then, it's trivial.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint

2017-05-24 Thread Tejun Heo
Hello, Mike.

On Sat, May 20, 2017 at 04:10:07AM +0200, Mike Galbraith wrote:
> On Fri, 2017-05-19 at 16:38 -0400, Tejun Heo wrote:
> > Hello, Waiman.
> > 
> > On Mon, May 15, 2017 at 09:34:11AM -0400, Waiman Long wrote:
> > > The rationale behind the cgroup v2 no internal process constraint is
> > > to avoid resource competition between internal processes and child
> > > cgroups. However, not all controllers have a problem with internal
> > > process competition. Enforcing this rule may lead to unnatural process
> > > hierarchy and unneeded levels for those controllers.
> > 
> > This isn't necessarily something we can determine by looking at the
> > current state of controllers.  It's true that some controllers - pid
> > and perf - inherently only care about membership of each task but at
> > the same time neither really suffers from the constraint either.  CPU
> > which is the problematic one here...
> 
> (+ cpuacct + cpuset)

Yeah, cpuacct and cpuset are in the same boat as perf.  cpuset is
completely so and we can move the tree walk to the reader side or
aggregate propagation for cpuacct as necessary.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint

2017-05-24 Thread Tejun Heo
Hello,

On Mon, May 22, 2017 at 12:56:08PM -0400, Waiman Long wrote:
> All controllers can use the special sub-directory if userland chooses to
> do so. The problem that I am trying to address in this patch is to allow
> a more natural hierarchy that reflects a certain purpose, like the task
> classification done by systemd. Restricting tasks only to leaf nodes
> makes the hierarchy unnatural and probably difficult to manage.

I see but how is this different from userland just creating the leaf
cgroup?  I'm not sure what this actually enables in terms of what can
be achieved with cgroup.  I suppose we can argue that this is more
convenient but I'd like to keep the interface orthogonal as much as
reasonably possible.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-05-19 Thread Tejun Heo
Hello,

On Fri, May 19, 2017 at 04:26:24PM -0400, Tejun Heo wrote:
> (exactly in the way necessary), I wonder whether it'd be better to
> simply allow root to be both domain and thread root.

I'll give this approach a shot early next week.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2

2017-05-19 Thread Tejun Heo
Hello, Waiman.

On Mon, May 15, 2017 at 09:34:12AM -0400, Waiman Long wrote:
> For cgroup v1, different controllers can be bound to different cgroup
> hierarchies optimized for their own use cases. That is not currently
> the case for cgroup v2 where combining all these controllers into
> the same hierarchy will probably require more levels than is needed
> by each individual controller.
> 
> By not enabling a controller in a cgroup and its descendants, we can
> effectively trim the hierarchy as seen by a controller from the leaves
> up. However, there is currently no way to compress the hierarchy in
> the intermediate levels.
> 
> This patch implements a fine-grained mechanism to allow a controller to
> skip some intermediate levels in a hierarchy and effectively flatten
> the hierarchy as seen by that controller.
> 
> Controllers can now be directly enabled or disabled in a cgroup
> by writing to the "cgroup.controllers" file.  The special prefix
> '#' with the controller name is used to set that controller in
> pass-through mode.  In that mode, the controller is disabled for that
> cgroup but it allows its children to have that controller enabled or
> in pass-through mode again.
> 
> With this change, each controller can now have a unique view of their
> virtual process hierarchy that can be quite different from other
> controllers.  We now have the freedom and flexibility to create the
> right hierarchy for each controller to suit their own needs without
> performance loss when compared with cgroup v1.

I can see the appeal but this needs at least more refinements.

This breaks the invariant that in a cgroup its resource control knobs
control distribution of resources from its parent.  IOW, the resource
control knobs of a cgroup always belong to the parent.  This is also
reflected in how delegation is done.  The delegatee assumes ownership
of the cgroup itself and the ability to manage sub-cgroups but doesn't
get the ownership of the resource control knobs as otherwise the
parent would lose control over how it distributes its resources.

Another aspect is that most controllers aren't that sensitive to
nesting several levels.  Expensive operations can be and already are
aggregated and the performance overhead of several levels of nesting
barely shows up.  Skipping levels can be an interesting optimization
approach and we can definitely support from the core side; however,
it'd be a lot nicer if we could do that optimization transparently
(e.g. CPU can skip multi level queueing if there usually is only one
item at some levels).

Hmm... that said, if we can fix the delegation issue in a not-too-ugly
way, why not?  I wonder whether we can still keep the resource control
knobs attached to the parent and skip in the middle.  Topology-wise,
that'd make more sense too.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint

2017-05-19 Thread Tejun Heo
Hello, Waiman.

On Mon, May 15, 2017 at 09:34:11AM -0400, Waiman Long wrote:
> The rationale behind the cgroup v2 no internal process constraint is
> to avoid resource competition between internal processes and child
> cgroups. However, not all controllers have a problem with internal
> process competition. Enforcing this rule may lead to unnatural process
> hierarchy and unneeded levels for those controllers.

This isn't necessarily something we can determine by looking at the
current state of controllers.  It's true that some controllers - pid
and perf - inherently only care about membership of each task but at
the same time neither really suffers from the constraint either.  CPU,
which is the problematic one here and currently only cares about tasks,
actually distributes resources which have parts specific to the
domain rather than threads, and we don't want to declare that CPU isn't
a domain-aware resource because it inherently is.

> This patch removes the no internal process constraint by enabling those
> controllers that don't like internal process competition to have a
> separate set of control knobs just for internal processes in a cgroup.
> 
> A new control file "cgroup.resource_control" is added. Enabling a
> controller with a "+" prefix will create a separate set of control
> knobs for that controller in the special "cgroup.resource_domain"
> sub-directory for all the internal processes. The existing control
> knobs in the cgroup will then be used to manage resource distribution
> between internal processes as a group and other child cgroups.

We would need to declare all major resource controllers to be needing
that special sub-directory.  That'd work around the
no-internal-process constraint but I don't think it is solving any
real problems.  It's just the kernel doing something that userland can
do with ease and more context.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file

2017-05-19 Thread Tejun Heo
Hello,

On Fri, May 19, 2017 at 03:33:14PM -0400, Waiman Long wrote:
> On 05/19/2017 03:21 PM, Tejun Heo wrote:
> > Yeah but it also shows up as an integral part of stable interface
> > rather than e.g. /sys/kernel/debug.  This isn't of any interest to
> > people who aren't developing cgroup core code.  There is no reason to
> > risk growing dependencies on it.
> 
> The debug controller is used to show information relevant to the cgroup
> its css'es are attached to. So it will be very hard to use if we
> relocate to /sys/kernel/debug, for example. Currently, nothing in the
> debug controller other than debug_cgrp_subsys are exported. I don't see
> any risk of having dependency on that controller from other parts of the
> kernel.

Oh, sure, I wasn't suggesting moving it under /sys/kernel/debug but
that we'd want to take extra precautions as we can't.

Thanks.

-- 
tejun


Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

2017-05-19 Thread Tejun Heo
Hello, Waiman.

On Mon, May 15, 2017 at 09:34:10AM -0400, Waiman Long wrote:
> Now we could have something like
> 
>   R -- A -- B
>\
> T1 -- T2
> 
> where R is the thread root, A and B are non-threaded cgroups, T1 and
> T2 are threaded cgroups. The cgroups R, T1, T2 form a threaded subtree
> where all the non-threaded resources are accounted for in R.  The no
> internal process constraint does not apply in the threaded subtree.
> Non-threaded controllers need to properly handle the competition
> between internal processes and child cgroups at the thread root.
> 
> This model will be flexible enough to support the need of the threaded
> controllers.

Maybe I'm misunderstanding the design, but this seems to push the
processes which belong to the threaded subtree to the parent which is
part of the usual resource domain hierarchy thus breaking the no
internal competition constraint.  I'm not sure this is something we'd
want.  Given that the limitation of the original threaded mode was the
required nesting below root and that we treat root special anyway
(exactly in the way necessary), I wonder whether it'd be better to
simply allow root to be both domain and thread root.

Specific review points below but we'd probably want to discuss the
overall design first.

> +static inline bool cgroup_is_threaded(const struct cgroup *cgrp)
> +{
> + return cgrp->proc_cgrp && (cgrp->proc_cgrp != cgrp);
> +}
> +
> +static inline bool cgroup_is_thread_root(const struct cgroup *cgrp)
> +{
> + return cgrp->proc_cgrp == cgrp;
> +}

Maybe add a bit of comments explaining what's going on with
->proc_cgrp?

>  /**
> + * threaded_children_count - returns # of threaded children
> + * @cgrp: cgroup to be tested
> + *
> + * cgroup_mutex must be held by the caller.
> + */
> +static int threaded_children_count(struct cgroup *cgrp)
> +{
> + struct cgroup *child;
> + int count = 0;
> +
> + lockdep_assert_held(&cgroup_mutex);
> + cgroup_for_each_live_child(child, cgrp)
> + if (cgroup_is_threaded(child))
> + count++;
> + return count;
> +}

It probably would be a good idea to keep track of the count so that we
don't have to count them each time.  There are cases where people end
up creating a very high number of cgroups and we've already been
bitten a couple times with silly complexity issues.
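
e.g. cache it in a field which is bumped where the mode flips, so the
above becomes - sketch only, the field is made up:

        static int threaded_children_count(struct cgroup *cgrp)
        {
                lockdep_assert_held(&cgroup_mutex);
                /* maintained by the threaded enable/disable paths */
                return cgrp->nr_threaded_children;      /* hypothetical field */
        }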

> @@ -2982,22 +3010,48 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
>   LIST_HEAD(csets);
>   struct cgrp_cset_link *link;
>   struct css_set *cset, *cset_next;
> + struct cgroup *child;
>   int ret;
> + u16 ss_mask;
>  
>   lockdep_assert_held(&cgroup_mutex);
>  
>   /* noop if already threaded */
> - if (cgrp->proc_cgrp)
> + if (cgroup_is_threaded(cgrp))
>   return 0;
>  
> - /* allow only if there are neither children or enabled controllers */
> - if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
> + /*
> +  * Allow only if it is not the root and there are:
> +  * 1) no children,
> +  * 2) no non-threaded controllers are enabled, and
> +  * 3) no attached tasks.
> +  *
> +  * With no attached tasks, it is assumed that no css_sets will be
> +  * linked to the current cgroup. This may not be true if some dead
> +  * css_sets linger around due to task_struct leakage, for example.
> +  */

It doesn't look like the code is actually making this (incorrect)
assumption.  I suppose the comment is from before
cgroup_is_populated() was added?

>   spin_lock_irq(&css_set_lock);
>   list_for_each_entry(link, &cgrp->cset_links, cset_link) {
>   cset = link->cset;
> + if (cset->dead)
> + continue;

Hmm... is this a bug fix which is necessary regardless of whether we
change the threadroot semantics or not?

Thanks.

-- 
tejun


Re: [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file

2017-05-19 Thread Tejun Heo
Hello, Waiman.

On Thu, May 18, 2017 at 11:52:18AM -0400, Waiman Long wrote:
> The controller name is "debug" and so it is obvious what this controller
> is for. However, the config prompt "Example controller" is indeed vague

Yeah but it also shows up as an integral part of stable interface
rather than e.g. /sys/kernel/debug.  This isn't of any interest to
people who aren't developing cgroup core code.  There is no reason to
risk growing dependencies on it.

> in meaning. So we can make the prompt more descriptive here. As for the
> boot param, are you saying something like "cgroup_debug" has to be
> specified in the command line even if CGROUP_DEBUG config is there for
> the debug controller to be enabled? I am fine with that if you think it
> is necessary.

Yeah, I think that'd be a good idea.  cgroup_debug should do.  While
at it, can you also please make CGROUP_DEBUG depend on DEBUG_KERNEL?
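
IOW, something like the following for the Kconfig entry (sketch):

        config CGROUP_DEBUG
                bool "Debug controller"
                default n
                depends on DEBUG_KERNEL
                help
                  Exposes cgroup core internals for debugging.  The
                  interface is unstable and additionally gated behind
                  the cgroup_debug boot parameter.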

Thanks.

-- 
tejun


Re: [PATCH 06/31] cgroup-v2.txt: standardize document format

2017-05-19 Thread Tejun Heo
Hello, Mauro.

On Thu, May 18, 2017 at 10:22:11PM -0300, Mauro Carvalho Chehab wrote:
> Each text file under Documentation follows a different
> format. Some don't even have titles!
> 
> Change its representation to follow the adopted standard,
> using ReST markups for it to be parseable by Sphinx:
> 
> - Comment the internal index;
> - Use :Date: and :Author: for authorship;
> - Mark titles;
> - Mark literal blocks;
> - Adjust whitespace;
> - Mark notes;
> - Use table notation for the existing tables.

If this is the direction we're taking for all docs, looks good to me.
Do you want me to take the patch through the cgroup tree?

Thanks.

-- 
tejun


Re: [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller

2017-04-26 Thread Tejun Heo
Hello, Waiman.

On Wed, Apr 26, 2017 at 12:05:27PM -0400, Waiman Long wrote:
> Does anyone have time to take a look at these patches?
> 
> As the merge window is going to open up next week, I am not going to
> bother you guys when the merge window opens.

Will get to it next week.  Sorry about the delay.  We're deploying
cgroup2 across the fleet and seeing a lot of interesting issues and I
was chasing down CPU controller performance issues for the last month
or so, which is now getting wrapped up.

Thanks.

-- 
tejun


Re: [PATCHv13 0/3] rdmacg: IB/core: rdma controller support

2017-01-09 Thread Tejun Heo
On Mon, Jan 09, 2017 at 02:01:26PM -0600, Parav Pandit wrote:
> I will generate new patch in sometime this week with for-4.11 to
> address comments and rebase.
> I will be renaming cgroup_rdma.c to kernel/cgroup/rdma.c (similar to pids.c)
> 
> I will keep the header file name include/linux/cgroup_rdma.h as it is.

Sounds good.  Thanks and happy new year!

-- 
tejun


Re: [PATCHv13 0/3] rdmacg: IB/core: rdma controller support

2017-01-09 Thread Tejun Heo
On Fri, Dec 02, 2016 at 07:07:14PM +, Parav Pandit wrote:
> Patch is generated and tested against below Doug's linux-rdma
> git tree.
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
> Branch: master
> 
> Patchset is also compiled and tested against below Tejun's cgroup tree
> using cgroup v2 mode.
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
> Branch: master

Except for the minor nits, all look good to me.  Can you please
refresh on top of cgroup/for-4.11 and the rdma devel branch?  Please
note that cgroup/for-4.11 moved the cgroup source files under
kernel/cgroup/.

Thanks!

-- 
tejun


Re: [PATCHv13 2/3] IB/core: added support to use rdma cgroup controller

2017-01-09 Thread Tejun Heo
On Fri, Dec 02, 2016 at 07:07:16PM +, Parav Pandit wrote:
> Added support APIs for IB core to register/unregister every IB/RDMA
> device with rdma cgroup for tracking rdma resources.
> IB core registers with rdma cgroup controller.
> Added support APIs for uverbs layer to make use of rdma controller.
> Added uverbs layer to perform resource charge/uncharge functionality.
> Added support during query_device uverb operation to ensure it
> returns resource limits by honoring rdma cgroup configured limits.
> 
> Signed-off-by: Parav Pandit 

Looks good but the patch causes a number of conflicts.  Can you please
refresh on top of the latest rdma tree?  I can create a branch which
pulls in rdma and cgroup trees.

Thanks.

-- 
tejun


Re: [PATCHv13 1/3] rdmacg: Added rdma cgroup controller

2017-01-09 Thread Tejun Heo
Hello, Parav.

Looks good to me.  Just one minor nit below.

> +static void rdmacg_uncharge_hirerchy(struct rdma_cgroup *cg,
                               ^^^^^^^^
                               hierarchy

Other than that, looks good to me.

Thanks.

-- 
tejun


Re: [PATCHv13 0/3] rdmacg: IB/core: rdma controller support

2016-12-05 Thread Tejun Heo
Parav, it's a bit too late for this cycle.  Let's target v4.11.  I'll
review the patches after the merge window.  Please ping me if I don't.

Thanks.

-- 
tejun


Re: [PATCH v5 2/4] workqueue: kerneldocify workqueue_attrs

2016-10-28 Thread Tejun Heo
On Fri, Oct 28, 2016 at 10:14:09AM +0200, Silvio Fricke wrote:
> Only formatting changes.
> 
> Signed-off-by: Silvio Fricke <silvio.fri...@gmail.com>

Acked-by: Tejun Heo <t...@kernel.org>

Please feel free to route with other doc updates.

Thanks.

-- 
tejun


Re: [PATCH v2 3/3] Documentation/workqueue.txt: convert to ReST markup

2016-10-24 Thread Tejun Heo
Hello,

On Sun, Oct 23, 2016 at 10:27:34AM -0600, Jonathan Corbet wrote:
> That done, if it's OK with you, Tejun, I'd prefer to take it through the
> docs tree with your ack so I don't have to write yet another note to Linus
> explaining the index.rst conflicts...

Please feel free to add my ack and route it through the docs tree.

Thanks.

-- 
tejun


Re: [PATCH v2 3/3] Documentation/workqueue.txt: convert to ReST markup

2016-10-19 Thread Tejun Heo
On Wed, Oct 19, 2016 at 08:38:39PM +0200, Silvio Fricke wrote:
> This patch adds a "misc" documentation section and adds the workqueue
> documentation to this section.
> 
> Signed-off-by: Silvio Fricke 

Looks good to me.  How should these patches be routed?  Should I take
2 and 3 through wq/for-4.10?

Thanks.

-- 
tejun


Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller

2016-08-31 Thread Tejun Heo
Hello,

On Wed, Aug 31, 2016 at 06:07:30PM +0300, Matan Barak wrote:
> Currently, there are some discussions regarding the RDMA ABI. The current
> proposed approach (after a lot of discussions in the OFVWG) is to have
> driver dependent object types rather than the fixed set of IB object types
> we have today.
> AFAIK, some vendors might want to use the RDMA subsystem for a different
> fabric which has a different set of objects.
> You could see RFCs for such concepts both from Mellanox and Intel on the
> linux-rdma mailing list.
> 
> Saying that, maybe we need to make the resource types a bit more flexible
> and dynamic.

That'd be back to square one and Christoph was dead against it too,
so...

Thanks.

-- 
tejun


Re: [PATCH 0/5] Networking cgroup controller

2016-08-25 Thread Tejun Heo
On Thu, Aug 25, 2016 at 12:09:20PM -0400, Tejun Heo wrote:
> ebpf approach does have its shortcomings for sure but mending them
> seems a lot more manageable and future-proof than going with fixed but
> constantly expanding set of operations.  e.g. We can add per-cgroup
> bpf programs which are called only on socket creation or other major
> events, or just let bpf programs which get called on bind(2), and add
^^
please ignore this part. half-assed edit.

-- 
tejun


Re: [PATCH 0/5] Networking cgroup controller

2016-08-25 Thread Tejun Heo
Hello, Mahesh.

On Thu, Aug 25, 2016 at 08:54:19AM -0700, Mahesh Bandewar (महेश बंडेवार) wrote:
> In short most of the associated problems are handled by the
> cgroup-infra / APIs while all of that needs a separate solution in the
> alternatives.  Tejun, feels like I'm advocating the cgroup approach to you
> ;)

My concern here is that the proposed fixed mechanism isn't gonna be
enough.  Port range matching wouldn't scale, so we'd need some hashmap
style thing which may be too expensive for simple matches so either we
do something adaptive or have different interfaces for the two and so
on.  IOW, I think this approach is likely to replicate what iptables
have been doing with its extensions.  I don't doubt that it is one of
the workable approaches but hardly an attractive one especially at
this point.

ebpf approach does have its shortcomings for sure but mending them
seems a lot more manageable and future-proof than going with fixed but
constantly expanding set of operations.  e.g. We can add per-cgroup
bpf programs which are called only on socket creation or other major
events, or just let bpf programs which get called on bind(2), and add
some per-cgroup state variables which are maintained by cgroup code
which can be used from these bpf programs.
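
For example, the socket-creation hook could look roughly like the
following from the program side - a sketch of the direction only; the
section name, context type and return convention are assumptions, not
an API that exists today:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>    /* SEC() */

        #define SOCK_RAW 3              /* from <linux/net.h> */

        /* per-cgroup program run when a task in the cgroup creates a
         * socket; return 1 to allow the socket, 0 to reject it */
        SEC("cgroup/sock_create")
        int sock_create_filter(struct bpf_sock *sk)
        {
                if (sk->type == SOCK_RAW)
                        return 0;
                return 1;
        }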

Thanks.

-- 
tejun


Re: [PATCHv11 3/3] rdmacg: Added documentation for rdmacg

2016-08-24 Thread Tejun Heo
On Mon, Aug 22, 2016 at 06:03:51PM +0530, Parav Pandit wrote:
> +  rdma.max
> + A readwrite file that exists for all the cgroups except root that

Can you please add that it's a nested-keyed file?

...
> +  rdma.current
> + A read-only file that describes current resource usage.

Ditto.
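
e.g. along the lines of (device names and values illustrative):

          # cat rdma.max
          mlx4_0 hca_handle=2 hca_object=2000
          ocrdma1 hca_handle=3 hca_object=max

          # echo "mlx4_0 hca_handle=10" > rdma.max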

Thanks.

-- 
tejun


Re: [PATCHv11 0/3] rdmacg: IB/core: rdma controller support

2016-08-24 Thread Tejun Heo
(cc'ing Christoph)

On Mon, Aug 22, 2016 at 06:03:48PM +0530, Parav Pandit wrote:
> rdmacg: IB/core: rdma controller support
> 
> Patch is generated and tested against below Doug's linux-rdma
> git tree.
> 
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
> Branch: master
> 
> Patchset is also compiled and tested against below Tejun's cgroup tree
> using cgroup v2 mode.
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
> Branch: master
> 
> Overview:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc., due to which other
> applications in other cgroups or kernel space ULPs may not even get a chance
> to allocate any rdma resources. This results in service unavailability.
> 
> RDMA cgroup addresses this issue by allowing resource accounting,
> limit enforcement on per cgroup, per rdma device basis.
> 
> RDMA uverbs layer will enforce limits on well defined RDMA verb
> resources without any HCA vendor device driver involvement.
> 
> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
> specific resources. Instead rdma cgroup provides set of APIs
> through which vendor specific drivers can do resource accounting
> by making use of rdma cgroup.
> 
> Resource limit enforcement is hierarchical.
> 
> When a process is migrated with active RDMA resources, rdma cgroup
> continues to uncharge original cgroup for allocated resource. New resource
> is charged to current process's cgroup, which means if the process is
> migrated with active resources, for new resources it will be charged to
> new cgroup and old resources will be correctly uncharged from old cgroup.
> 
> Changes from v10:
>   * (To address comments from Tejun, Christoph)
>1. Removed unused rpool_list_lock from rdma_cgroup structure.
>2. Moved rdma resource definition to rdma cgroup instead of IB stack
>3. Added prefix rdmacg to static instances
>4. Simplified locking with single mutex for all operations
>5. Following approach of atomically allocating object and
>   charging resource in hierarchy
>6. Code simplification due to single lock
>7. Using for_each_set_bit API for bit operation
>8. Renamed list heads as Objects instead of _head
>9. Renamed list entries as _node instead of _list.
>   10. Made usage_num to 64 bit to avoid overflow and to avoid 
>   additional code to track non zero number of usage counts.
>   * (To address comments from Doug)
>1. Added copyright and GPLv2 license

Looks good to me.  I just have a nit in the documentation.  Christoph,
what do you think?

Thanks.

-- 
tejun


Re: [PATCH 1/1] Documenation: update cgroup's document path

2016-08-03 Thread Tejun Heo
On Wed, Aug 03, 2016 at 03:44:17PM -0600, Jonathan Corbet wrote:
> On Tue,  2 Aug 2016 23:23:57 +0900
> "seokhoon.yoon" <iamyo...@gmail.com> wrote:
> 
> > cgroup's document path has changed to "cgroup-v1". Update it.
> 
> Seems worthy to me, applied to the docs tree.

Acked-late-by: Tejun Heo <t...@kernel.org>

Thanks!

-- 
tejun


Re: [PATCH 02/14] resource limits: aggregate task highwater marks to cgroup level

2016-07-19 Thread Tejun Heo
Hello, Topi.

On Tue, Jul 19, 2016 at 04:57:10PM +, Topi Miettinen wrote:
> Then there would need to be new limit checks at cgroup level. Would you
> see problems with that approach?

I'm worried that you're rushing this feature without thinking through
it.  You were mixing up completely orthogonal planes of accounting and
control without too much thought and are now suggesting something
which is also strange.  What do you mean by "new limit checks at
cgroup level"?  How would this be different from the resource
accounting and control implemented in the existing controllers?

Please take a step back and think through the overall design before
proposing changes to userland visible interface.

Thanks.

-- 
tejun


Re: [PATCH 02/14] resource limits: aggregate task highwater marks to cgroup level

2016-07-15 Thread Tejun Heo
Hello, Topi.

On Fri, Jul 15, 2016 at 01:35:49PM +0300, Topi Miettinen wrote:
> Collect resource usage highwater marks of a task into cgroup
> statistics when the task exits.

I'm not sure how this makes sense.  The limits are enforced and
collected per user or along the process hierarchy which can be very
different from cgroup organization.  What does collecting high
watermarks from orthogonal structure, sometimes even combining
per-user numbers from different users, even mean?  These are numbers
without clear semantics.

Thanks.

-- 
tejun


Re: [PATCH] capabilities: add capability cgroup controller

2016-06-27 Thread Tejun Heo
Hello,

On Mon, Jun 27, 2016 at 3:10 PM, Topi Miettinen  wrote:
> I'll have to study these more. But from what I saw so far, it looks to
> me that a separate tool would be needed to read taskstats and if that
> tool is not taken by distros, the users would not be any wiser, right?
> With cgroup (or /proc), no new tools would be needed.

That is a factor but shouldn't be a deciding factor in designing our
user-facing interfaces. Please also note that kernel source tree
already has tools/ subdirectory which contains userland tools which
are distributed along with the kernel.

Thanks.

-- 
tejun


Re: [PATCH 1/2] cgroup: pids: show number of failed forks since limit reset

2016-06-26 Thread Tejun Heo
Hello,

On Sun, Jun 26, 2016 at 09:34:41PM +1000, Aleksa Sarai wrote:
> If a user has a setup where they wait for notifications on changes to
> pids.event, and then auto-adjust the cgroup limits based on the number of
> failures you have a race condition between reading the pids.event file and
> then setting the new limit. Then, upon getting notified again there may have
> been many failed forks with the old limit set, so you might decide to bump
> up the limit again.
> 
> It's not a huge deal, I just thought it could be useful to alleviate problems
> like the above.

This is something which can easily be avoided from userland.  I don't
think we need to add extra facilities for this.

Thanks.

-- 
tejun


Re: [PATCH] capabilities: add capability cgroup controller

2016-06-26 Thread Tejun Heo
Hello, Topi.

On Sun, Jun 26, 2016 at 3:14 PM, Topi Miettinen  wrote:
> The parent might be able to do it if proc/pid/xyz files are still
> accessible after child exit but before its exit status is collected. But
> if the parent doesn't do it (and you are not able to change it to do it)
> and it collects the exit status without collecting other info, can you
> suggest a different way how another process could collect it 100% reliably?

I'm not saying that there's such a mechanism now. I'm suggesting that
that'd be a more fitting way of implementing a new mechanism to track
capability usages.

Thanks.

-- 
tejun


Re: [PATCH] capabilities: add capability cgroup controller

2016-06-24 Thread Tejun Heo
Hello, Serge.

On Fri, Jun 24, 2016 at 11:59:10AM -0500, Serge E. Hallyn wrote:
> > Just monitoring is less jarring than implementing security enforcement
> > via cgroup, but it is still jarring.  What's wrong with recursive
> > process hierarchy monitoring which is in line with how the whole facility
> > is implemented anyway?
> 
> As I think Topi pointed out, one shortcoming is that if there is a short-lived
> child task, using its /proc/self/status is racy.  You might just miss that it
> ever even existed, let alone that the "application" needed it.

But the parent can collect whatever its children used.  We already do
that with other stats.
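
For instance, a parent already gets the accumulated usage of reaped
children via getrusage().  A minimal userspace sketch (illustrative
only):

#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	struct rusage ru;

	if (fork() == 0)
		_exit(0);		/* short-lived child */

	wait(NULL);			/* reap the child */
	/* usage of waited-for children folds into RUSAGE_CHILDREN */
	getrusage(RUSAGE_CHILDREN, &ru);
	printf("children maxrss: %ld kB\n", ru.ru_maxrss);
	return 0;
}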

Thanks.

-- 
tejun


Re: [PATCH 1/2] cgroup: pids: show number of failed forks since limit reset

2016-06-24 Thread Tejun Heo
Hello,

On Fri, Jun 24, 2016 at 01:00:48PM +1000, Aleksa Sarai wrote:
> This allows users to dynamically adjust their limits based on how many
> failed forks happened since they last reset their limits, otherwise they
> would have to track (in a racy way) how many limit failures there were
> since the last limit change manually. In addition, we log the first
> failure since the limit was reset (which was the original semantics of
> the patchset).

Isn't that trivially achievable by reading the counter and then
calculating the diff?  I don't think it matters all that much whether
the log message is printed once per cgroup or per config-change.  It's
just a hint for the admin to avoid setting her off on a wild goose
chase.
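
Roughly like the following from userland (path hypothetical, error
handling kept minimal):

#include <stdio.h>

/* pids.events contains a single "max <count>" line */
static long read_fork_failures(const char *path)
{
	FILE *f = fopen(path, "r");
	long max = -1;

	if (f) {
		if (fscanf(f, "max %ld", &max) != 1)
			max = -1;
		fclose(f);
	}
	return max;
}

int main(void)
{
	/* hypothetical cgroup path, for illustration only */
	const char *p = "/sys/fs/cgroup/pids/mygroup/pids.events";
	long before = read_fork_failures(p);

	/* ... adjust pids.max, wait for the next notification ... */

	long after = read_fork_failures(p);
	printf("failures since the limit change: %ld\n", after - before);
	return 0;
}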

Thanks.

-- 
tejun


Re: [PATCH] capabilities: add capability cgroup controller

2016-06-23 Thread Tejun Heo
Hello,

On Thu, Jun 23, 2016 at 06:07:10PM +0300, Topi Miettinen wrote:
> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find
> out useful values for the limits, except blind trial and error.
> 
> Currently, there is no way to know which capabilities are actually used.
> Even in the source code it is only implicit; in-depth knowledge of each
> capability must be used when analyzing a program to judge which
> capabilities the program will exercise.
> 
> Add a new cgroup controller for monitoring of capabilities
> in the cgroup.
> 
> Test case demonstrating basic capability monitoring and how the
> capabilities are combined at next level (boot to rdshell):

This doesn't have anything to do with resource control and I don't
think it's a good idea to add arbitrary monitoring mechanisms to
cgroup just because it's easy to add interface there.  Given that
capabilities are inherited and modified through the process hierarchy,
shouldn't this be part of that?

Thanks.

-- 
tejun


Re: [PATCH v2 1/2] scsi: fix race between simultaneous decrements of ->host_failed

2016-05-31 Thread Tejun Heo
On Tue, May 31, 2016 at 04:38:17PM +0800, Wei Fang wrote:
> sas_ata_strategy_handler() adds the work items of the ata error handler
> to system_unbound_wq. This workqueue asynchronously runs work items,

Is there more than one error handling work item per host?

Thanks.

-- 
tejun


Re: [PATCH] Documentation/memcg: update kmem limit doc as codes behavior

2016-05-12 Thread Tejun Heo
On Wed, May 11, 2016 at 08:40:19AM +0200, Michal Hocko wrote:
> On Wed 11-05-16 14:07:31, Qiang Huang wrote:
> > The restriction of kmem setting is not there anymore because the
> > accounting is enabled by default even in the cgroup v1 - see
> > b313aeee2509 ("mm: memcontrol: enable kmem accounting for all
> > cgroups in the legacy hierarchy").
> > 
> > Update docs accordingly.
> 
> I am pretty sure there will be other things out of date in that file but
> this is an improvement already.
> 
> > Signed-off-by: Qiang Huang 
> 
> Acked-by: Michal Hocko 

Andrew, can you please pick this one up?

Thanks!

-- 
tejun


Re: [PATCH] Documentation: devres: Add missing INPUT function

2016-04-22 Thread Tejun Heo
On Thu, Apr 21, 2016 at 07:02:04PM +0200, Alexander Kurz wrote:
> devm_input_allocate_device() got introduced with commit 2be975c6d920
> ("Input: introduce managed input devices (add devres support)").
> Add this function to the list of managed interfaces within the
> devres documentation.
> 
> Signed-off-by: Alexander Kurz <ak...@blala.de>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks!

-- 
tejun


Re: [PATCH] Documentation: cgroup: "unshare -C" to unshare cgroup namespace

2016-04-20 Thread Tejun Heo
On Thu, Apr 14, 2016 at 11:18:31PM +0200, Josef Lusticky wrote:
> Use "unshare -C" to be consistent with the unshare utility from util-linux
> 
> Signed-off-by: Josef Lusticky 

Applied to cgroup/for-4.6-ns.

Thanks.

-- 
tejun


Re: [PATCHv10 1/3] rdmacg: Added rdma cgroup controller

2016-04-05 Thread Tejun Heo
Hello, Parav.

On Tue, Apr 05, 2016 at 07:25:11AM -0700, Parav Pandit wrote:
> > As it is a fairly isolated domain, to a certain extent, it could be okay
> > to let it go.  At the same time, I know that these enterprise things
> > tend to go completely wayward and am worried about individual drivers
> > going crazy with custom attributes in a non-sensical way.
> 
> If it's crazy at the driver level, I am sure it will be equally crazy for
> any end-user too. Therefore no user would use those crazy resources
> either.

So, the above paragraph simply isn't true.  It's not difficult to
twist things a bit so that things work in a hackish and often horrible
way and we know this happens quite a bit, especially in enterprise
settings.

> The intent is certainly not for the individual drivers, as we agreed in
> the past.

You and I agreeing to that doesn't really matter all that much.

> IB stack maintainers would be reviewing new enum additions
> anyway, whether it's a verb or hw resource (unlikely).

Well, if the additions are unlikely...

> > In the original review message, I mentioned creating an interface
> > which creates the hierarchy of objects as necessary and returns the
> > target pool with lock held, can you please give it a shot?  Something
> > like the following.
> 
> o.k. I will attempt it. Looks doable.
> I am traveling, so it will take a few more days for me to turn around
> with the below approach with tested code.

So, if you go with single mutex, you don't really need get_and_lock
semantics, you can just call the get function with mutex held.  Please
do whichever looks best.

Thanks.

-- 
tejun


Re: [PATCHv10 1/3] rdmacg: Added rdma cgroup controller

2016-04-05 Thread Tejun Heo
Just one more thing.

On Tue, Apr 05, 2016 at 10:01:07AM -0400, Tejun Heo wrote:
...
> > pool_info in spin lock context, made me allocate memory to get all
> > values upfront through allocation.
> > Now that the lock is going away, I can do what you have described above.

So, this might be okay depending on the use case but it often becomes
painful to require sleeping context for freeing resources.  If you're
certain that requiring sleeping context is okay for all paths, using a
single mutex is fine but *usually* it isn't a great idea.

Thanks.

-- 
tejun


Re: [PATCHv10 1/3] rdmacg: Added rdma cgroup controller

2016-04-05 Thread Tejun Heo
Hello, Parav.

On Mon, Apr 04, 2016 at 07:22:38PM -0700, Parav Pandit wrote:
> > Is it actually customary to have rdma core module updated more
> > frequently separate from the kernel?  Out-of-tree modules being
> > updated separately happens quite a bit but this is an in-kernel
> > module, which usually is tightly coupled with the rest of the kernel.
> >
> Yes.
> The rdma core module is compiled as a kernel module.
> It's often updated for new features and fixes.
> So the kernel version can stay the same while the RDMA core module(s) get
> updated more frequently than the kernel.

As it is a fairly isolated domain, to a certain extent, it could be okay
to let it go.  At the same time, I know that these enterprise things
tend to go completely wayward and am worried about individual drivers
going crazy with custom attributes in a non-sensical way.  The
interface this patch is proposing easily allows that and that at the
cost of internal engineering flexibility.  I don't really want to be
caught up in a situation where we're forced to include broken usages
because that's what's already out in the wild.  I personally would
much prefer the resources to be defined rigidly.  Let's see how the
discussion with Christoph evolves.

> > I don't remember the details well but the code was vastly more complex
> > back then and the object lifetime management was muddy at best.  If I
> > reviewed in a contradicting way, my apologies, but given the current
> > code, it'd be better to have object creation upfront.
> 
> Do you mean,
> try_charge() should
> lock()
> run loop to allocate in hierarchy, if not allocated.
> run loop again to charge.
> unlock()
> 
> If so, I prefer to run the loop once.

In the original review message, I mentioned creating an interface
which creates the hierarchy of objects as necessary and returns the
target pool with lock held, can you please give it a shot?  Something
like the following.

pool *get_pool(...)
{
	lock;
	if (target pool exists)
		return pool w/ lock held;

	create the pool hierarchically (might involve unlock);
	if (succeeded)
		return pool w/ lock held;
	return NULL w/o lock;
}

> > It isn't necessarily about speed.  It makes it clear that the parent
> > should always exist and makes the code easier to read and update.
> 
> It doesn't have to exist. It can get allocated when charging occurs.
> Otherwise even if rdma resources are not used, it ends up allocating
> rpool in the hierarchy. (We talked about this before)

Sure, create pools only for the used combinations but do that
hierarchically so that a child pool always has a parent.  I can
promise you that the code will read a lot better with that.

> > I don't know why you end up missing basic patterns so badly.  It's
> > making the review and iteration process pretty painful.  Please don't
> > be confrontational and try to read the review comments assuming good
> > will.
> >
> My understanding of seq_printf() being a blocking call and accessing

seq_printf() can be called from any context; otherwise, it would be a
horrible interface to use, wouldn't it?
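
seq_printf() only fills a preallocated buffer, so it is fine under a
spinlock.  A sketch of the usual pattern (struct names from the patch,
details illustrative):

static int rdmacg_resource_read(struct seq_file *sf, void *v)
{
	struct rdma_cgroup *cg = css_rdmacg(seq_css(sf));
	struct rdmacg_resource_pool *rpool;

	spin_lock(&cg->rpool_list_lock);
	list_for_each_entry(rpool, &cg->rpool_head, cg_list)
		seq_printf(sf, "%s %d\n", rpool->device->name,
			   rpool->resources[0].usage);
	spin_unlock(&cg->rpool_list_lock);
	return 0;
}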

> pool_info in spin lock context, made me allocate memory to get all
> values upfront through allocation.
> Now that the lock is going away, I can do what you have described above.

Thanks.

-- 
tejun


Re: [PATCHv10 1/3] rdmacg: Added rdma cgroup controller

2016-04-04 Thread Tejun Heo
Hello,

On Mon, Apr 04, 2016 at 03:50:54PM -0700, Parav Pandit wrote:
> The 2nd advantage is that it allows avoiding a lot of rework in bundling
> kernel modules with different kernel versions. So it is certainly a valuable
> feature with almost nil complexity cost in code or memory that solves
> two problems with one approach.
> If there are two kernel versions with different types of resources and
> kernel modules with different resources, it gets ugly to maintain that
> kind of compatibility using compat.h. This approach avoids all such
> cases.

Is it actually customary to have rdma core module updated more
frequently separate from the kernel?  Out-of-tree modules being
updated separately happens quite a bit but this is an in-kernel
module, which usually is tightly coupled with the rest of the kernel.

> > * It's usually a good idea to have hierarchical objects to be created
> >   all the way to the root when a leaf node is requested and link
> >   parent at each level.  That way, when a descendant object is looked
> >   up, the caller can always deref its ->parent pointer till it reaches
> >   the top.
> 
> This is an interesting comment. I cannot recall, but I think v4 or v5
> was doing exactly the same:
> allocating resource pools first in one loop and charging them in another,
> and in the event of failure, uncharging them in a 3rd.
> And your comment at that time was to move away from that approach for
> simplicity!
> At that time I had a refcnt on each object.

I don't remember the details well but the code was vastly more complex
back then and the object lifetime management was muddy at best.  If I
reviewed in a contradicting way, my apologies, but given the current
code, it'd be better to have object creation upfront.

> > So, with each pool linking to its parent, instead of looking up at
> > each level, it can just follow ->parent recursively.
> 
> No. The lookup is done for the rpool of that device, not the cgroup.
> Doing it recursively using ->parent would require an additional pointer of
> rpool type to be maintained, which is not necessary because we already
> have the cgroup-level parent.

Yes, I meant adding pool->parent field.  Hierarchical operations
become far simpler with the invariant that parent always exists.
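
i.e. something like the following, assuming the suggested rpool->parent
field (sketch only):

static void uncharge_hierarchy(struct rdmacg_resource_pool *rpool,
			       int index, int num)
{
	/* a non-root pool always has a parent, so no per-level lookups */
	for (; rpool; rpool = rpool->parent)
		rpool->resources[index].usage -= num;
}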

> Adding ->parent would make it a little faster compared to traversing a
> 1-to-4-node entry list. However, this is not a hot path, so keeping the
> existing traversal code is more desirable than adding a ->parent of rpool
> type, which also saves memory for a large number of cgroup instances.

It isn't necessarily about speed.  It makes it clear that the parent
should always exist and makes the code easier to read and update.

> >> + /*
> >> +  * A negative count (or overflow) is invalid,
> >> +  * it indicates a bug in the rdma controller.
> >> +  */
> >> + WARN_ON_ONCE(rpool->resources[index].usage < 0);
> >> + rpool->usage_sum--;
> >
> > What guarantees that this count wouldn't overflow?  More on this
> > later.
> 
> Are you suggesting adding a WARN_ON for the usage_sum variable for debugging?
> I believe the 1st warning is sufficient?

I mean that it isn't particularly difficult to overflow an s32 when
adding up multiple usages into one; a dozen per-resource usages in the
hundreds of millions already push the sum past INT_MAX.

> >> +static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
> >> +char *buf, size_t nbytes, loff_t off)
> >> +{
> >> + struct rdma_cgroup *cg = css_rdmacg(of_css(of));
> >> + const char *dev_name;
> >> + struct rdmacg_resource_pool *rpool;
> >> + struct rdmacg_device *device;
> >> + char *options = strstrip(buf);
> >> + struct rdmacg_pool_info *pool_info;
> >> + u64 enables = 0;
> >> + int *new_limits;
> >> + int i = 0, ret = 0;
> >> +
> >> + /* extract the device name first */
> >> + dev_name = strsep(, " ");
> >> + if (!dev_name) {
> >> + ret = -EINVAL;
> >> + goto err;
> >> + }
> >> +
> >> + /* acquire lock to synchronize with hot plug devices */
> >> + mutex_lock(_mutex);
> >> +
> >> + device = rdmacg_get_device_locked(dev_name);
> >> + if (!device) {
> >> + ret = -ENODEV;
> >> + goto parse_err;
> >> + }
> >> +
> >> + pool_info = >pool_info;
> >> +
> >> + new_limits = kcalloc(pool_info->table_len, sizeof(int), GFP_KERNEL);
> >> + if (!new_limits) {
> >> + ret = -ENOMEM;
> >> + goto parse_err;
> >> + }
> >
> > Hmm... except for new_limits allocation and alloc_cg_rpool()
> > invocation, both of which can be done at the head of the function,
> > nothing seems to require a mutex here.  If we move them to the head of
> > the function, can we get rid of dev_mutex entirely?
> 
> No. As described in the comment where the lock is taken,
> it ensures that if the device is being removed while configuration is
> going on, it protects against that race condition.
> So I will keep it as it is.

Yeah, sure, it needs protection.  I was wondering why it needed to be
a separate mutex.  The only sleeping […]

Re: [PATCHv9 0/3] rdmacg: IB/core: rdma controller support

2016-03-02 Thread Tejun Heo
Hello, Parav.

It doesn't look like my reviews are getting through.  For now,

Nacked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-02 Thread Tejun Heo
Hello,

On Wed, Mar 02, 2016 at 12:35:35AM +0530, Parav Pandit wrote:
> diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
> new file mode 100644
> index 000..2da3d6c
> --- /dev/null
> +++ b/include/linux/cgroup_rdma.h
> @@ -0,0 +1,50 @@
> +#ifndef _CGROUP_RDMA_H
> +#define _CGROUP_RDMA_H
> +
> +#include 
> +
> +#ifdef CONFIG_CGROUP_RDMA
> +
> +struct rdma_cgroup {
> + struct cgroup_subsys_state  css;
> +
> + spinlock_t rpool_list_lock; /* protects resource pool list */
> + struct list_head rpool_head;/* head to keep track of all resource
> +  * pools that belongs to this cgroup.
> +  */

I think we usually don't tail wing these comments.

> +};
> +
> +struct rdmacg_pool_info {
> + const char **resource_name_table;
> + int table_len;

Align fields consistently?  I've said this multiple times now but
please make the available resources constant and document them in
Documentation/cgroup-v2.txt.

> +};
> +
> +struct rdmacg_device {
> + struct rdmacg_pool_info pool_info;
> + struct list_headrdmacg_list;
> + struct list_headrpool_head;
> + /* protects resource pool list */
> + spinlock_t  rpool_lock;
> + char*name;
> +};
> +
> +/* APIs for RDMA/IB stack to publish when a device wants to
> + * participate in resource accounting
> + */

Please follow CodingStyle.  Wasn't this pointed out a couple times
already?

> +enum rdmacg_file_type {
> + RDMACG_RESOURCE_MAX,
> + RDMACG_RESOURCE_STAT,
> +};

Heh, usually not a good sign.  If you're using this to call into a
common function and then switch out on the file type, just switch out
at the method level and factor out the common part into helpers.

> +/* resource tracker per resource for rdma cgroup */
> +struct rdmacg_resource {
> + int max;
> + int usage;
> +};

Align fields?

> +/**

The above indicates kerneldoc comments, which this isn't.

> + * resource pool object which represents, per cgroup, per device
> + * resources. There are multiple instance
> + * of this object per cgroup, therefore it cannot be embedded within
> + * rdma_cgroup structure. It is maintained as list.

Consistent paragraph fill?

> + */
> +struct rdmacg_resource_pool {
> + struct list_head cg_list;
> + struct list_head dev_list;
> +
> + struct rdmacg_device *device;
> + struct rdmacg_resource *resources;
> + struct rdma_cgroup *cg; /* owner cg used during device cleanup */
> +
> + int refcnt; /* count active user tasks of this pool */
> + int num_max_cnt;/* total number counts which are set to max */
> +};

Formatting.

> +static inline void set_resource_limit(struct rdmacg_resource_pool *rpool,
> +   int index, int new_max)

Is inline necessary?  Compiler can figure out these.

> +static struct rdmacg_resource_pool*
 ^
 space

> +find_cg_rpool_locked(struct rdma_cgroup *cg,
> +  struct rdmacg_device *device)
...
> +static void uncharge_cg_resource(struct rdma_cgroup *cg,
> +  struct rdmacg_device *device,
> +  int index, int num)
> +{
...
> + rpool->refcnt--;
> + if (rpool->refcnt == 0 && rpool->num_max_cnt == pool_info->table_len) {

If the caller charges 2 and then uncharges 1 two times, the refcnt
underflows?  Why not just track how many usages are zero?
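
For example (sketch; num_zero_usage is a hypothetical counter that
starts at table_len and makes "pool is idle" a single comparison):

static void charge_one(struct rdmacg_resource_pool *rpool, int index, int num)
{
	if (rpool->resources[index].usage == 0)
		rpool->num_zero_usage--;
	rpool->resources[index].usage += num;
}

static void uncharge_one(struct rdmacg_resource_pool *rpool, int index, int num)
{
	rpool->resources[index].usage -= num;
	if (rpool->resources[index].usage == 0)
		rpool->num_zero_usage++;
}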

...
> +void rdmacg_uncharge(struct rdma_cgroup *cg,
> +  struct rdmacg_device *device,
> +  int index, int num)
> +{
> + struct rdma_cgroup *p;
> +
> + for (p = cg; p; p = parent_rdmacg(p))
> + uncharge_cg_resource(p, device, index, num);
> +
> + css_put(>css);
> +}
> +EXPORT_SYMBOL(rdmacg_uncharge);

So, I suppose the code is trying to save lock contention with
fine-grained locking; however, as the code is currently structured,
it's just increasing the number of lock ops and it'd highly likely be
cheaper to simply use a single lock.  If you're working up the tree
grabbing lock at each level, per-node locking doesn't save you
anything.  Also, it introduces conditions where charges are spuriously
denied when there are racing requestors.  If scalability becomes an
issue, the right way to address it is adding a percpu front cache.
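
e.g. a small per-cpu stock in front of the shared state, loosely like
memcg's; a sketch only, all names here are illustrative:

struct rdmacg_stock {
	struct rdmacg_resource_pool *cached;
	int nr_cached;		/* simplified to a single resource type */
};
static DEFINE_PER_CPU(struct rdmacg_stock, rdmacg_stock);

/* fast path: charge against the local stock without the shared lock */
static bool consume_stock(struct rdmacg_resource_pool *rpool, int num)
{
	struct rdmacg_stock *stock = get_cpu_ptr(&rdmacg_stock);
	bool ok = stock->cached == rpool && stock->nr_cached >= num;

	if (ok)
		stock->nr_cached -= num;
	put_cpu_ptr(&rdmacg_stock);
	return ok;	/* on failure, fall back to the locked slow path */
}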

> +void rdmacg_query_limit(struct rdmacg_device *device,
> + int *limits)
> +{
> + struct rdma_cgroup *cg, *p;
> + struct rdmacg_resource_pool *rpool;
> + struct rdmacg_pool_info *pool_info;
> + int i;
> +
> + cg = task_rdmacg(current);
> + pool_info = >pool_info;

Nothing seems to prevent @cg from going away if this races with
@current being migrated to a different cgroup.  Have you run this with
lockdep and rcu debugging enabled?  This should have triggered a
warning.
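
The usual shape is something like this (sketch; task_rdmacg() as in the
patch, css_tryget_online() pins the css against concurrent migration):

static struct rdma_cgroup *get_current_rdmacg(void)
{
	struct rdma_cgroup *cg;

	rcu_read_lock();
	do {
		cg = task_rdmacg(current);
	} while (cg && !css_tryget_online(&cg->css));
	rcu_read_unlock();
	return cg;	/* caller must css_put(&cg->css) when done */
}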

...

Re: [PATCH v6 0/8] Additional kmsg devices

2016-02-25 Thread Tejun Heo
Hello, Kazimierz.

On Wed, Feb 24, 2016 at 12:53:13PM +0100, Kazimierz Krosman wrote:
> 1. A kmsg device does not require maintenance on the reader process side.
> Multiple writers can write to a device and new records overwrite logs saved
> earlier.
> When the system crashes, logs can be restored with the pstore mechanism.

I'm not sure this is the right layer to implement a generic logging
facility.

> 2. Using kmsg can cause lower CPU utilisation in the real-world use case
> than userspace logging mechanisms.
> We created 2 tests: (1) 100 writer processes write to a created kmsg buffer
> and (2) 100 writers write to a socket (stream) - there is one reader to
> protect the socket buffer against overflow. Tests show that CPU utilisation
> in the first case is about 2.3 times lower (39.1%) than in the second case
> (87.7%) (measured with the top program; test code is attached below).
> Tested on an Odroid XU4.

This sounds more like a generic IPC problem than anything else.

Thanks.

-- 
tejun


Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.

2016-02-12 Thread Tejun Heo
Hello, Parav.

On Fri, Feb 12, 2016 at 02:49:38AM +0530, Parav Pandit wrote:
> 1. Removed two types of resource pool, made it a single type (as you
> described in a past comment)
> 2. Removed match tokens and have an array definition like "qp", "mr", "cq" etc.
> 3. Wrote a small parser and avoided the match_token API as that won't work
> due to the different array definition
> 4. Removed the one-off remove API to unconfigure a cgroup; instead all
> resources should be set to max.
> 5. Removed the resource pool type (user/default), instead having max_num_cnt;
> when ref_cnt drops to zero and max_num_cnt == total_resource_cnt, the pool is
> freed.
> 6. Resource definition ownership is now only with the IB stack in a single
> header file, no longer in each low level driver.
> This goes through the IB maintainer's and other reviewers' eyes.
> This continues to give the flexibility to not force a kernel upgrade for a
> few enum additions for a new resource type.
> 7. Wherever possible the pool lock is pushed out, except for hierarchical
> charging/uncharging points, as it is not possible to do so there: the
> iterative process involves blocking allocations of rpool. Coming up more
> levels to release locks doesn't make any sense either.
> This is anyway a slow path where rpool is not allocated. Except for the
> typical first resource allocation, this is a less traveled path.
> 8. Other minor cleanups.
> 9. Avoided %d manipulation due to the removal of match_token and replaced
> it with seq_putc etc. friend functions.

Sounds great.  Can't tell too much without looking at the code tho.

Thanks.

-- 
tejun


Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.

2016-02-01 Thread Tejun Heo
Hello, Parav.

On Sun, Jan 31, 2016 at 11:20:45PM +0530, Parav Pandit wrote:
> > Yes, it can.  It just becomes a part of kernel ABI that the IB stack
> > module depends on.
> >
> o.k. It's doable, but I believe that while it's simple, it's restrictive.

The whole point is to make it somewhat restrictive so that low level
drivers don't go crazy with it and the defined resources are clearly
identified and documented.

> If the rest of the RDMA experts are ok with it, I will change it to do it
> that way.
> We get to hear their feedback before we put this in as ABI.

So, I'm really not gonna go for individual drivers defining resources
on their own.  That's a trainwreck waiting to happen.  There needs to
be a lot more scrutiny than that.

> >> Now coming to the remove command's need.
> >> If the user has previously configured a limit of say mr=15, and now wants
> >> to remove that configuration and not bother with the limit,
> >> the way it's done is by issuing a "remove" command.
> >> Should I name it "reset"?
> >
> > How is this different from setting the limits to max?
>
> We can set it to max. But at freeing time, checking an array of 10+
> elements for whether each is max didn't seem efficient either.

So, in terms of user interface, it doesn't buy anything, right?  It's
generally a bad idea to mix implementation details like "checking 10+
elems on free" with interface decisions.  If the overhead of scanning
the array matters, it can easily be resolved by keeping the count of
non-max values.
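
e.g. maintained on every limit update, a shape the later patch revisions
in this thread ended up using (sketch; S32_MAX standing in for "max"):

static void set_resource_limit(struct rdmacg_resource_pool *rpool,
			       int index, int new_max)
{
	/* keep a running count of limits sitting at max */
	if (new_max == S32_MAX && rpool->resources[index].max != S32_MAX)
		rpool->num_max_cnt++;
	else if (new_max != S32_MAX && rpool->resources[index].max == S32_MAX)
		rpool->num_max_cnt--;
	rpool->resources[index].max = new_max;
}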

It could be that setting all attributes to "max" is inconvenient from
interface POV (I don't think this would be a common use case tho).  If
so, please justify that, update the interface conventions along with
other similar knobs.  Don't introduce a one-off custom thing.

...
> If we don't do this, and wait until device destruction, rpool just
> keeps growing!
>
> I think the above is a more important design aspect to me than just saving
> memory.
> 
> Let me know if you have different design thought here.
> (Like engineering a can_attach() callback in rdma_cgroup, which I don't
> see as necessary either).

Pin the css while things are in flight and free from css_free?
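
i.e. roughly (sketch):

static void rpool_get(struct rdmacg_resource_pool *rpool)
{
	if (rpool->refcnt++ == 0)
		css_get(&rpool->cg->css);	/* pin while charges are in flight */
}

static void rpool_put(struct rdmacg_resource_pool *rpool)
{
	if (--rpool->refcnt == 0)
		css_put(&rpool->cg->css);	/* freeing can wait for ->css_free() */
}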

> > If you wanna insist on freeing them dynamically, free them when both
> > their usages and configs are nil.
> Agreed. That's what the code is doing by marking the object type as default.
> If I remove the object type, as an alternative,

You don't need an object type for that.  Think about it again.  All you
need is a refcnt and cgroup already has a facility for that.

Thanks.

-- 
tejun