Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-03-22 Thread Peter Zijlstra
On Tue, Mar 21, 2017 at 01:39:58PM +0100, Peter Zijlstra wrote:
> 
> And yes, having to consider views is new and a direct consequence of
> this new optional feature. But I don't see how it's a problem.
> 

So aside from having (RO) links in thread groups for system controllers,
we could also have a ${controller}_parent link back to whatever group is
the actual parent for that specific controller's view.

So then your B's memcg_parent would point to A, not T.

But I feel this is all superfluous window dressing; but if you want to
clarify the filesystem interface, this could be something to consider.
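
A minimal sketch of the ${controller}_parent idea for the A - T - B
example, assuming a toy model in which each cgroup only records its
parent and whether it is in thread mode. The Cgroup class and the
controller_parent() helper are hypothetical names for illustration;
they are not kernel interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    """Toy cgroup node (hypothetical, not a kernel structure)."""
    name: str
    parent: Optional["Cgroup"] = None
    threaded: bool = False  # True for cgroups in thread mode


def controller_parent(cg: Cgroup, domain_controller: bool) -> Optional[Cgroup]:
    """Return cg's parent as seen in one controller's view of the tree."""
    p = cg.parent
    if domain_controller:
        # Assumption: system (domain) controllers skip thread-mode
        # cgroups, so walk up to the nearest non-threaded ancestor.
        while p is not None and p.threaded:
            p = p.parent
    return p


# The A - T - B example from the thread: T is a thread group.
A = Cgroup("A")
T = Cgroup("T", parent=A, threaded=True)
B = Cgroup("B", parent=T)

memcg_parent = controller_parent(B, domain_controller=True)
cpu_parent = controller_parent(B, domain_controller=False)
print(memcg_parent.name, cpu_parent.name)
```

In this model B's memcg view has A as its parent, while a task based
controller such as cpu still sees T.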



Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-03-21 Thread Peter Zijlstra
On Mon, Mar 13, 2017 at 04:05:44PM -0400, Tejun Heo wrote:
> Hey, Peter.  Sorry about the long delay.

No worries; we're all (too) busy.


> > > If we go to thread mode and back to domain mode, the control knobs for
> > > domain controllers don't make sense on the thread part of the tree and
> > > they won't have cgroup_subsys_state to correspond to either.  For
> > > example,
> > > 
> > >  A - T - B
> > > 
> > > B's memcg knobs would control memory distribution from A and cgroups
> > > in T can't have memcg knobs.  It'd be weird to indicate that memcg is
> > > enabled in those cgroups too.
> > 
> > But memcg _is_ enabled for T. All the tasks are mapped onto A for
> > purpose of the system controller (memcg) and are subject to its
> > constraints.
> 
> Sure, T is contained in A but think about the interface.  For memcg, T
> belongs to A.  B is the first descendant when viewed from memcg, which
> brings about two problems - memcg doesn't have control knobs to assign
> throughout T which is just weird and there's no way to configure how T
> competes against B.
> 
> > > We can make it work somehow.  It's just weird-ass interface.
> > 
> > You could make these control files (read-only?) symlinks back to A's
> > actual control files. To more explicitly show this.
> 
> But the knobs are supposed to control how much resource a child gets
> from its parent.  Flipping that over while walking down the same tree
> sounds horribly ugly and confusing to me.  Besides, that doesn't solve
> the problem with lacking the ability to configure T's consumptions
> against B.

So I'm not confused; and I suspect you're not either. But you're worried
about 'simple' people getting confused?

The rules really are fairly straightforward; but yes, it will be a
little more involved than without this. But note that this is an
optional thing, people don't have to make thread groups if they don't
want to. And they further don't have to use non-leaf thread groups.

And at some point, there's no helping stupid; and trying to do so will
only make you insane.

So the fundamental thing to realize (and explain) is that there are two
different types of controllers; and that they operate on different views
of the hierarchy.

I think our goal as a kernel API should be presenting the capabilities
in a concise and consistent manner; and I feel that the proposed
interface is that.

So the points you raise above; about system controller knobs in thread
groups and competition between thread and system groups as seen for
system controllers are confusion due to not considering the views.

And yes, having to consider views is new and a direct consequence of
this new optional feature. But I don't see how it's a problem.


> Scheduling hackbench is an extreme case tho and in practice at least
> we're not seeing noticeable issues with a few levels of nesting when
> the workload actually spends cpu cycles doing things other than
> scheduling.

Right; most workloads don't schedule _that_ much; and the overhead isn't
too painful.

> However, we're seeing significant increase in scheduling
> latency coming from how cgroups are handled from the rebalance path.
> I'm still looking into it and will write about that in a separate
> thread.

I have some vague memories of this being a pain point. IIRC it comes
down to the problem that latency is an absolute measure and the weight
is a relative thing.

I think we mucked about with it some many years ago; but haven't done so
recently.

> > Also, there is the one giant wart in v2 wrt no-internal-processes;
> > namely the root group is allowed to have them.
> > 
> > Now I understand why this is so; so don't feel compelled to explain that
> > again, but it does make the model very ugly and has a real problem, see
> > below. OTOH, since it is there, I would very much like to make use of
> > this 'feature' and allow a thread-group on the root group.
> > 
> > And since you then _can_ have nested thread groups, it again becomes
> > very important to be able to find the resource domains, which brings me
> > back to my proposal; albeit with an additional constraint.
> 
> I've thought quite a bit about ways to allow thread granularity from
> the top while still presenting a consistent picture to resource domain
> controllers.  That's what's missing from the CPU controller side given
> Mike's claim that there's unavoidable overhead in nesting CPU
> controller and requiring at least one level of nesting on cgroup v2
> for thread granularity might not be acceptable.
> 
> Going back to why thread support on cgroup v2 was needed in the first
> place, it was to allow thread level control while cooperating with
> other controllers on v2.  IOW, allowing thread level control for CPU
> while cooperating with resource domain type controllers.

Well, not only CPU, I can see the same being used for perf for example.

> Now, going back to allowing thread hierarchies from the root, given
> that their resource domain can only be root, which is exactly 

Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-03-14 Thread Mike Galbraith
On Mon, 2017-03-13 at 15:26 -0400, Tejun Heo wrote:
> Hello, Mike.
> 
> Sorry about the long delay.
> 
> On Mon, Feb 13, 2017 at 06:45:07AM +0100, Mike Galbraith wrote:
> > > > So, as long as the depth stays reasonable (single digit or lower),
> > > > what we try to do is keeping tree traversal operations aggregated or
> > > > located on slow paths.  There still are places that this overhead
> > > > shows up (e.g. the block controllers aren't too optimized) but it
> > > > isn't particularly difficult to make a handful of layers not matter at
> > > > all.
> > > 
> > > A handful of cpu bean counting layers stings considerably.
> 
> Hmm... yeah, I was trying to think about ways to avoid full scheduling
> overhead at each layer (the scheduler does a lot per each layer of
> scheduling) but don't think it's possible to circumvent that without
> introducing a whole lot of scheduling artifacts.

Yup.

> In a lot of workloads, the added overhead from several layers of CPU
> controllers doesn't seem to get in the way too much (most threads do
> something other than scheduling after all).

Sure, don't schedule a lot, it doesn't hurt much, but there are plenty
of loads that routinely do schedule a LOT, and there it matters a LOT..
which is why network benchmarks tend to be severely allergic to
scheduler lard.

>   The only major issue that
> we're seeing in the fleet is the cgroup iteration in idle rebalancing
> code pushing up the scheduling latency too much but that's a different
> issue.

Hm, I would suspect PELT to be the culprit there.  It helps smooth out
load balancing, but will stack "skinny looking" tasks.

-Mike


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-03-13 Thread Tejun Heo
Hey, Peter.  Sorry about the long delay.

On Tue, Feb 14, 2017 at 11:35:41AM +0100, Peter Zijlstra wrote:
> > This is a bit of delta but as I wrote before, at least cpu (and
> > accordingly cpuacct) won't stay purely task-based as we should account
> > for resource consumptions which aren't tied to specific tasks to the
> > matching domain (e.g. CPU consumption during writeback, disk
> > encryption or CPU cycles spent to receive packets).
> 
> We should probably do that in another thread, but I'd probably insist on
> separate controllers that co-operate to get that done.

Let's shelve this for now.

> > cgroups on creation don't enable controllers by default and users can
> > enable and disable controllers dynamically as long as the conditions
> > are met.  So, they can be disabled and re-enabled.
> 
> I was talking in a hierarchical sense, your section 2-4-2. Top-Down
> constraint seems to state similar things to what I meant.
> 
> Once you disable a controller it cannot be re-enabled in a subtree.

Ah, yeah, you can't jump across parts of the tree.

> > If we go to thread mode and back to domain mode, the control knobs for
> > domain controllers don't make sense on the thread part of the tree and
> > they won't have cgroup_subsys_state to correspond to either.  For
> > example,
> > 
> >  A - T - B
> > 
> > B's memcg knobs would control memory distribution from A and cgroups
> > in T can't have memcg knobs.  It'd be weird to indicate that memcg is
> > enabled in those cgroups too.
> 
> But memcg _is_ enabled for T. All the tasks are mapped onto A for
> purpose of the system controller (memcg) and are subject to its
> constraints.

Sure, T is contained in A but think about the interface.  For memcg, T
belongs to A.  B is the first descendant when viewed from memcg, which
brings about two problems - memcg doesn't have control knobs to assign
throughout T which is just weird and there's no way to configure how T
competes against B.

> > We can make it work somehow.  It's just weird-ass interface.
> 
> You could make these control files (read-only?) symlinks back to A's
> actual control files. To more explicitly show this.

But the knobs are supposed to control how much resource a child gets
from its parent.  Flipping that over while walking down the same tree
sounds horribly ugly and confusing to me.  Besides, that doesn't solve
the problem with lacking the ability to configure T's consumptions
against B.

> > So, as long as the depth stays reasonable (single digit or lower),
> > what we try to do is keeping tree traversal operations aggregated or
> > located on slow paths. 
> 
> While at the same time you allowed that BPF cgroup thing to not be
> hierarchical because iterating the tree was too expensive; or did I
> misunderstand?

That was more because that was supposed to be part of bpf (network or
whatever) and just followed the model of table matching "is the target
under this hierarchy?".  That's where it came from after all.
Hierarchical walking can be added but it's more work (defining the
iteration direction and rules) and doesn't bring benefits without
working delegation.

If it were a cgroup controller, it should have been fully hierarchical
no matter what but that involves solving bpf delegation first.

> Also, I think Mike showed you the pain and hurt are quite visible for
> even a few levels.
> 
> Batching is tricky, you need to somehow bound the error function in
> order to not become too big a factor on behaviour. Esp. for cpu, cpuacct
> obviously doesn't care much as it doesn't enforce anything.

Yeah, I thought about this for quite a while but I couldn't think of
any easy way of circumventing overhead without introducing a lot of
scheduling artifacts (e.g. multiplying down the weights to practically
collapse multi levels of the hierarchy), at least for the weight based
control which is what most people use.  It looks like the only way to
lower the overhead there is making generic scheduling cheaper but that
still means that multi-level will always be noticeably more expensive
in terms of raw scheduling performance.
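
To make "multiplying down the weights" concrete, here is a minimal
sketch, assuming a made-up three-level tree, of what collapsing the
hierarchy into flat per-leaf shares would look like. The tree, its
weights and the flat_shares() helper are hypothetical and only
illustrate the arithmetic; this is not how the scheduler is
implemented.

```python
from fractions import Fraction

# Hypothetical hierarchy: name -> (weight, children)
tree = {
    "root": (100, ["A", "B"]),
    "A": (100, ["A1", "A2"]),
    "B": (300, []),
    "A1": (100, []),
    "A2": (100, []),
}


def flat_shares(node="root", share=Fraction(1)):
    """Multiply weight fractions down the tree to get one share per leaf."""
    _, children = tree[node]
    if not children:
        return {node: share}
    total = sum(tree[c][0] for c in children)
    out = {}
    for c in children:
        out.update(flat_shares(c, share * Fraction(tree[c][0], total)))
    return out


shares = flat_shares()
for name, s in sorted(shares.items()):
    print(name, s)
```

In this toy model the flat shares only match the hierarchical result
while every leaf is runnable. If A2 goes idle, A1 should inherit all of
A's quarter, but a static flattening keeps it at one eighth until the
shares are recomputed, which is one example of the kind of artifact
such a shortcut can introduce.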

Scheduling hackbench is an extreme case tho and in practice at least
we're not seeing noticeable issues with a few levels of nesting when
the workload actually spends cpu cycles doing things other than
scheduling.  However, we're seeing significant increase in scheduling
latency coming from how cgroups are handled from the rebalance path.
I'm still looking into it and will write about that in a separate
thread.

> > In general, I think it's important to ensure that this in general is
> > the case so that users can use the logical layouts matching the actual
> > resource hierarchy rather than having to twist the layout for
> > optimization.
> 
> One does what one can.. But it is important to understand the
> constraints, nothing comes for free.

Yeah, for sure.

> Also, there is the one giant wart in v2 wrt no-internal-processes;
> namely the root group is allowed to have them.
> 
> 

Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-03-13 Thread Tejun Heo
Hello, Mike.

Sorry about the long delay.

On Mon, Feb 13, 2017 at 06:45:07AM +0100, Mike Galbraith wrote:
> > > So, as long as the depth stays reasonable (single digit or lower),
> > > what we try to do is keeping tree traversal operations aggregated or
> > > located on slow paths.  There still are places that this overhead
> > > shows up (e.g. the block controllers aren't too optimized) but it
> > > isn't particularly difficult to make a handful of layers not matter at
> > > all.
> > 
> > A handful of cpu bean counting layers stings considerably.

Hmm... yeah, I was trying to think about ways to avoid full scheduling
overhead at each layer (the scheduler does a lot per each layer of
scheduling) but don't think it's possible to circumvent that without
introducing a whole lot of scheduling artifacts.

In a lot of workloads, the added overhead from several layers of CPU
controllers doesn't seem to get in the way too much (most threads do
something other than scheduling after all).  The only major issue that
we're seeing in the fleet is the cgroup iteration in idle rebalancing
code pushing up the scheduling latency too much but that's a different
issue.

Anyways, I understand that there are cases where people would want to
avoid any extra layers.  I'll continue on PeterZ's message.

> BTW, that overhead is also why merging cpu/cpuacct is not really as
> wonderful as it may seem on paper.  If you only want to account, you
> may not have anything to gain from group scheduling (in fact it may
> wreck performance), but you'll pay for it.

There's another reason why we would want accounting separate - because
weight based controllers, cpu and io currently, can't be enabled
without affecting the scheduling behavior.  However, they're different
from CPU controllers in that all the heavy part of operations can be
shifted to the readers (we just need to do per-cpu updates from hot
paths), so we might as well publish those stats by default on the v2
hierarchy.  We couldn't do the same in v1 because the number of
hierarchies was not limited.
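
A minimal sketch of the "per-cpu updates from hot paths, heavy part
shifted to the readers" pattern, assuming a toy model with a fixed
number of CPUs. NR_CPUS, usage_ns, account() and read_usage() are
made-up names for this illustration and are not kernel interfaces.

```python
# Toy model of per-CPU accounting (hypothetical names, not kernel code).
NR_CPUS = 4
usage_ns = [0] * NR_CPUS  # one slot per CPU, written only by that CPU


def account(cpu: int, delta_ns: int) -> None:
    """Hot path: bump the local CPU's slot; no tree walk, no shared lock."""
    usage_ns[cpu] += delta_ns


def read_usage() -> int:
    """Slow path: the reader pays the cost of summing all per-CPU slots."""
    return sum(usage_ns)


account(0, 1500)
account(2, 500)
account(0, 250)
total = read_usage()
print(total)
```

The design choice is that the writer does a constant amount of work
per event, while anything proportional to the number of CPUs happens
only when someone reads the statistic.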

Thanks.

-- 
tejun


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-14 Thread Peter Zijlstra
On Sun, Feb 12, 2017 at 02:05:44PM +0900, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote:
> > Sure, we're past that. This isn't about what memcg can or cannot do.
> > Previous discussions established that controllers come in two shapes:
> > 
> >  - task based controllers; these are built on per task properties and
> >    groups are aggregates over sets of tasks. Since per definition inter
> >    task competition is already defined on individual tasks, it's fairly
> >    trivial to extend the same rules to sets of tasks etc..
> > 
> >    Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)
> >
> >  - system controllers; instead of building from tasks upwards, they
> >    split what previously would be machine wide / global state. For these
> >    there is no natural competition rule vs tasks, and hence your
> >    no-internal-task rule.
> > 
> >    Examples: memcg, io, hugetlb
> 
> This is a bit of delta but as I wrote before, at least cpu (and
> accordingly cpuacct) won't stay purely task-based as we should account
> for resource consumptions which aren't tied to specific tasks to the
> matching domain (e.g. CPU consumption during writeback, disk
> encryption or CPU cycles spent to receive packets).

We should probably do that in another thread, but I'd probably insist on
separate controllers that co-operate to get that done.

> > > And here's another point, currently, all controllers are enabled
> > > consecutively from root.  If we have leaf thread subtrees, this still
> > > works fine.  Resource domain controllers won't be enabled into thread
> > > subtrees.  If we allow switching back and forth, what do we do in the
> > > middle while we're in the thread part?
> > 
> > From what I understand you cannot re-enable a controller once it's been
> > disabled, right? If you disable it, it's dead for the entire subtree.
> 
> cgroups on creation don't enable controllers by default and users can
> enable and disable controllers dynamically as long as the conditions
> are met.  So, they can be disabled and re-enabled.

I was talking in a hierarchical sense, your section 2-4-2. Top-Down
constraint seems to state similar things to what I meant.

Once you disable a controller it cannot be re-enabled in a subtree.

> > > No matter what we do, it's
> > > gonna be more confusing and we lose basic invariants like "parent
> > > always has superset of control knobs that its child has".
> > 
> > No, exactly that. I don't think I ever proposed something different.
> >
> > The "resource domain" flag I proposed violates the no-internal-processes
> > thing, but it doesn't violate that rule afaict.
> 
> If we go to thread mode and back to domain mode, the control knobs for
> domain controllers don't make sense on the thread part of the tree and
> they won't have cgroup_subsys_state to correspond to either.  For
> example,
> 
>  A - T - B
> 
> B's memcg knobs would control memory distribution from A and cgroups
> in T can't have memcg knobs.  It'd be weird to indicate that memcg is
> enabled in those cgroups too.

But memcg _is_ enabled for T. All the tasks are mapped onto A for
purpose of the system controller (memcg) and are subject to its
constraints.
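
As a rough illustration of that mapping, assuming each cgroup in the
"A - T - B" example is tagged either "domain" or "threaded" (the tags and
both helper names are made up for this sketch), the group whose memcg
limits apply is simply the nearest domain ancestor:

```python
# Toy model of "A - T - B"; the kind labels are assumptions for illustration.
parent = {"A": None, "T": "A", "B": "T"}
kind = {"A": "domain", "T": "threaded", "B": "domain"}

def effective_domain(cg):
    """Cgroup whose system-controller (e.g. memcg) limits cover tasks in cg."""
    while kind[cg] != "domain":
        cg = parent[cg]
    return cg

def domain_parent(cg):
    """Parent of cg in the system controller's view, skipping thread groups."""
    p = parent[cg]
    return None if p is None else effective_domain(p)

print(effective_domain("T"))   # tasks in T are charged to A
print(domain_parent("B"))      # B's parent in the memcg view is A, not T
```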

> We can make it work somehow.  It's just weird-ass interface.

You could make these control files (read-only?) symlinks back to A's
actual control files. To more explicitly show this.

> > > As for the runtime overhead, if you get affected by adding a top-level
> > > cgroup in any measurable way, we need to fix that.  That's not a
> > > valid argument for messing up the interface.
> > 
> > I think cgroup tree depth is a more significant issue; because of
> > hierarchy we often do tree walks (up-to-root or down-to-task).
> > 
> > So creating elaborate trees is something I try not to do.
> 
> So, as long as the depth stays reasonable (single digit or lower),
> what we try to do is keeping tree traversal operations aggregated or
> located on slow paths. 

While at the same time you allowed that BPF cgroup thing to not be
hierarchical because iterating the tree was too expensive; or did I
misunderstand?

Also, I think Mike showed you the pain and hurt are quite visible for
even a few levels.

Batching is tricky, you need to somehow bound the error function in
order to not become too big a factor on behaviour. Esp. for cpu, cpuacct
obviously doesn't care much as it doesn't enforce anything.
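
To make the error bound concrete, here is a small sketch of a batched
counter, loosely in the style of a per-CPU counter. The class, its fields
and the batch size are assumptions for illustration, not the memcg or
scheduler code. Each CPU only folds its pending amount into the shared
total once it reaches the batch size, so the shared total can lag the true
value by less than ncpus * batch.

```python
class BatchedCounter:
    """Shared total plus per-CPU pending deltas flushed in batches."""

    def __init__(self, ncpus, batch):
        self.batch = batch
        self.local = [0] * ncpus   # per-CPU charges not yet visible globally
        self.total = 0             # value an enforcing controller would read

    def charge(self, cpu, amount):
        self.local[cpu] += amount
        if self.local[cpu] >= self.batch:
            # Flush: only here does the shared total get touched.
            self.total += self.local[cpu]
            self.local[cpu] = 0

    def error(self):
        """How far the shared total currently lags behind the true value."""
        return sum(self.local)

c = BatchedCounter(ncpus=4, batch=32)
for i in range(1000):
    c.charge(i % 4, 1)
print(c.total, c.error())
```

A bigger batch means fewer shared updates but a looser bound, which is why
this matters for a controller that enforces limits and not for one that
only accounts.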

> In general, I think it's important to ensure that this in general is
> the case so that users can use the logical layouts matching the actual
> resource hierarchy rather than having to twist the layout for
> optimization.

One does what one can.. But it is important to understand the
constraints, nothing comes for free.

> > > Even if we allow switching back and forth, we can't make the same
> > > cgroup both resource domain && thread root.  Not in a sane way at
> > > least.
> > 
> > The back and forth thing yes, but even with a single level, the 

Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-12 Thread Mike Galbraith
On Sun, 2017-02-12 at 07:59 +0100, Mike Galbraith wrote:
> On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:
> 
> > > I think cgroup tree depth is a more significant issue; because of
> > > hierarchy we often do tree walks (up-to-root or down-to-task).
> > > 
> > > So creating elaborate trees is something I try not to do.
> > 
> > So, as long as the depth stays reasonable (single digit or lower),
> > what we try to do is keeping tree traversal operations aggregated or
> > located on slow paths.  There still are places that this overhead
> > shows up (e.g. the block controllers aren't too optimized) but it
> > isn't particularly difficult to make a handful of layers not matter at
> > all.
> 
> A handful of cpu bean counting layers stings considerably.

BTW, that overhead is also why merging cpu/cpuacct is not really as
wonderful as it may seem on paper.  If you only want to account, you
may not have anything to gain from group scheduling (in fact it may
wreck performance), but you'll pay for it.
 
> homer:/abuild # pipe-test 1  
> 2.010057 usecs/loop -- avg 2.010057 995.0 KHz
> 2.006630 usecs/loop -- avg 2.009714 995.2 KHz
> 2.127118 usecs/loop -- avg 2.021455 989.4 KHz
> 2.256244 usecs/loop -- avg 2.044934 978.0 KHz
> 1.993693 usecs/loop -- avg 2.039810 980.5 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt pipe-test 1
> 2.771641 usecs/loop -- avg 2.771641 721.6 KHz
> 2.432333 usecs/loop -- avg 2.737710 730.5 KHz
> 2.750493 usecs/loop -- avg 2.738988 730.2 KHz
> 2.663203 usecs/loop -- avg 2.731410 732.2 KHz
> 2.762564 usecs/loop -- avg 2.734525 731.4 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
> 2.967201 usecs/loop -- avg 2.967201 674.0 KHz
> 3.049012 usecs/loop -- avg 2.975382 672.2 KHz
> 3.031226 usecs/loop -- avg 2.980966 670.9 KHz
> 2.954259 usecs/loop -- avg 2.978296 671.5 KHz
> 2.933432 usecs/loop -- avg 2.973809 672.5 KHz
> ^C
> ...
> homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
> 4.417044 usecs/loop -- avg 4.417044 452.8 KHz
> 4.494913 usecs/loop -- avg 4.424831 452.0 KHz
> 4.253861 usecs/loop -- avg 4.407734 453.7 KHz
> 4.378059 usecs/loop -- avg 4.404766 454.1 KHz
> 4.179895 usecs/loop -- avg 4.382279 456.4 KHz


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-12 Thread Mike Galbraith
On Sun, 2017-02-12 at 13:16 -0800, Paul Turner wrote:
> 
> 
> On Thursday, February 9, 2017, Peter Zijlstra wrote:
> > On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> > > The only case that this does not support vs ".threads" would be some
> > > hybrid where we co-mingle threads from different processes (with the
> > > processes belonging to the same node in the hierarchy).  I'm not aware
> > > of any usage that looks like this.
> > 
> > If I understand you right; this is a fairly common thing with RT where
> > we would stuff all the !rt threads of the various processes in a 'misc'
> > bucket.
> > 
> > Similarly, it happens that we stuff the various rt threads of processes
> > in a specific (shared) 'rt' bucket.
> > 
> > So I would certainly not like to exclude that setup.
> > 
> 
> Unless you're using rt groups I'm not sure this one really changes.  
> Whether the "misc" threads exist at the parent level or one below
> should not matter.

(with exclusive cpusets, a mask can exist at one and only one location)


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-11 Thread Mike Galbraith
On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:

> > I think cgroup tree depth is a more significant issue; because of
> > hierarchy we often do tree walks (up-to-root or down-to-task).
> > 
> > So creating elaborate trees is something I try not to do.
> 
> So, as long as the depth stays reasonable (single digit or lower),
> what we try to do is keeping tree traversal operations aggregated or
> located on slow paths.  There still are places that this overhead
> shows up (e.g. the block controllers aren't too optimized) but it
> isn't particularly difficult to make a handful of layers not matter at
> all.

A handful of cpu bean counting layers stings considerably.

homer:/abuild # pipe-test 1  
2.010057 usecs/loop -- avg 2.010057 995.0 KHz
2.006630 usecs/loop -- avg 2.009714 995.2 KHz
2.127118 usecs/loop -- avg 2.021455 989.4 KHz
2.256244 usecs/loop -- avg 2.044934 978.0 KHz
1.993693 usecs/loop -- avg 2.039810 980.5 KHz
^C
homer:/abuild # cgexec -g cpu:hurt pipe-test 1
2.771641 usecs/loop -- avg 2.771641 721.6 KHz
2.432333 usecs/loop -- avg 2.737710 730.5 KHz
2.750493 usecs/loop -- avg 2.738988 730.2 KHz
2.663203 usecs/loop -- avg 2.731410 732.2 KHz
2.762564 usecs/loop -- avg 2.734525 731.4 KHz
^C
homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
2.967201 usecs/loop -- avg 2.967201 674.0 KHz
3.049012 usecs/loop -- avg 2.975382 672.2 KHz
3.031226 usecs/loop -- avg 2.980966 670.9 KHz
2.954259 usecs/loop -- avg 2.978296 671.5 KHz
2.933432 usecs/loop -- avg 2.973809 672.5 KHz
^C
...
homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
4.417044 usecs/loop -- avg 4.417044 452.8 KHz
4.494913 usecs/loop -- avg 4.424831 452.0 KHz
4.253861 usecs/loop -- avg 4.407734 453.7 KHz
4.378059 usecs/loop -- avg 4.404766 454.1 KHz
4.179895 usecs/loop -- avg 4.382279 456.4 KHz
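
For a quick read of those numbers, the arithmetic can be sketched as below.
The values are the final running averages copied from the runs above; the
depth labels (0 for no group, 1 for hurt, 2 for hurt/pain, 5 for
hurt/pain/ouch/moan/groan) are an interpretation of the cgexec paths.

```python
# Final running averages (usecs/loop) from the pipe-test runs above,
# keyed by how many cpu cgroup levels the test was nested in.
avg_usecs = {0: 2.039810, 1: 2.734525, 2: 2.973809, 5: 4.382279}

base = avg_usecs[0]
for depth, usecs in sorted(avg_usecs.items()):
    slowdown = usecs / base
    print(f"depth {depth}: {usecs:.3f} usecs/loop, {slowdown:.2f}x baseline")

# Average added cost per level between no group and the five-deep group.
per_level = (avg_usecs[5] - base) / 5
print(f"about {per_level:.3f} usecs added per level")
```

That works out to roughly 1.34x the baseline cost at one level and 2.15x at
five levels.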


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-11 Thread Tejun Heo
Hello,

On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote:
> Sure, we're past that. This isn't about what memcg can or cannot do.
> Previous discussions established that controllers come in two shapes:
> 
>  - task based controllers; these are built on per task properties and
>    groups are aggregates over sets of tasks. Since per definition inter
>    task competition is already defined on individual tasks, it's fairly
>    trivial to extend the same rules to sets of tasks etc..
> 
>    Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)
> 
>  - system controllers; instead of building from tasks upwards, they
>    split what previously would be machine wide / global state. For these
>    there is no natural competition rule vs tasks, and hence your
>    no-internal-task rule.
> 
>    Examples: memcg, io, hugetlb

This is a bit of delta but as I wrote before, at least cpu (and
accordingly cpuacct) won't stay purely task-based as we should account
for resource consumptions which aren't tied to specific tasks to the
matching domain (e.g. CPU consumption during writeback, disk
encryption or CPU cycles spent to receive packets).

> > And here's another point, currently, all controllers are enabled
> > consecutively from root.  If we have leaf thread subtrees, this still
> > works fine.  Resource domain controllers won't be enabled into thread
> > subtrees.  If we allow switching back and forth, what do we do in the
> > middle while we're in the thread part?
> 
> From what I understand you cannot re-enable a controller once it's been
> disabled, right? If you disable it, it's dead for the entire subtree.

cgroups on creation don't enable controllers by default and users can
enable and disable controllers dynamically as long as the conditions
are met.  So, they can be disabled and re-enabled.

> > No matter what we do, it's
> > gonna be more confusing and we lose basic invariants like "parent
> > always has superset of control knobs that its child has".
> 
> No, exactly that. I don't think I ever proposed something different.
>
> The "resource domain" flag I proposed violates the no-internal-processes
> thing, but it doesn't violate that rule afaict.

If we go to thread mode and back to domain mode, the control knobs for
domain controllers don't make sense on the thread part of the tree and
they won't have cgroup_subsys_state to correspond to either.  For
example,

 A - T - B

B's memcg knobs would control memory distribution from A and cgroups
in T can't have memcg knobs.  It'd be weird to indicate that memcg is
enabled in those cgroups too.

We can make it work somehow.  It's just weird-ass interface.

> > As for the runtime overhead, if you get affected by adding a top-level
> > cgroup in any measurable way, we need to fix that.  That's not a
> > valid argument for messing up the interface.
> 
> I think cgroup tree depth is a more significant issue; because of
> hierarchy we often do tree walks (up-to-root or down-to-task).
> 
> So creating elaborate trees is something I try not to do.

So, as long as the depth stays reasonable (single digit or lower),
what we try to do is keeping tree traversal operations aggregated or
located on slow paths.  There still are places that this overhead
shows up (e.g. the block controllers aren't too optimized) but it
isn't particularly difficult to make a handful of layers not matter at
all.  memcg batches the charging operations and it's impossible to
measure the overhead of several levels of hierarchy.

In general, I think it's important to ensure that this in general is
the case so that users can use the logical layouts matching the actual
resource hierarchy rather than having to twist the layout for
optimization.

> > Even if we allow switching back and forth, we can't make the same
> > cgroup both resource domain && thread root.  Not in a sane way at
> > least.
> 
> The back and forth thing yes, but even with a single level, the one
> resource domain you tag will be both resource domain and thread root.

Ah, you're right.

Thanks.

-- 
tejun


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-10 Thread Peter Zijlstra
On Fri, Feb 10, 2017 at 10:45:08AM -0500, Tejun Heo wrote:

> > > and making subtrees threaded is a
> > > straight-forward extension of that - threaded controllers just see
> > > further into the hierarchy.  Adding threaded sub-sections in the
> > > middle is more complex and frankly confusing.
> > 
> > I disagree, as I completely fail to see any confusion. The rules are
> > simple and straight forward.
> > 
> > I also don't see why you would want to impose this artificial
> > restriction. It doesn't get you anything. Why are you so keen on designs
> > with these artificial limits on?
> 
> Because I actually understand and use this thing day in and day out?

Just because you don't have the use-cases doesn't mean they're invalid.

Also, the above is effectively: "because I say so", which isn't much of
an argument.

> Let's go back to the no-internal-process constraint.  The main reason
> behind that is avoiding resource competition between child cgroups and
> processes.  The reason why we need this is because for some resources
> the terminal consumer (be that a process or task or anonymous) and the
> resource domain that it belongs to (be that the system itself or a
> cgroup) aren't equivalent.

Sure, we're past that. This isn't about what memcg can or cannot do.
Previous discussions established that controllers come in two shapes:

 - task based controllers; these are built on per task properties and
   groups are aggregates over sets of tasks. Since per definition inter
   task competition is already defined on individual tasks, it's fairly
   trivial to extend the same rules to sets of tasks etc..

   Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)

 - system controllers; instead of building from tasks upwards, they
   split what previously would be machine wide / global state. For these
   there is no natural competition rule vs tasks, and hence your
   no-internal-task rule.

   Examples: memcg, io, hugetlb

(I have no idea where: devices, net_cls, net_prio, debug fall in this
classification, nor is that really relevant)

Now, cgroup-v2 is entirely built around the use-case of
containerization, where you want a single hierarchy describing the
containers and their resources. Now, because of that single hierarchy
and single use-case, you let the system controllers dominate and dictate
the rules.

By doing that you've completely killed off a whole bunch of use-cases
that were possible with pure task controllers. And you seem to have a
very hard time accepting that this is a problem.

Furthermore, the argument that people who need that can continue to use
v1 doesn't work. Because v2 and v1 are mutually exclusive and do not
respect the namespace/container invariant. That is, if a controller is
used in v2, a sub-container is forced to also use v2.

Therefore it is important to fix v2 if possible or do v3 if not, such
that all use-cases can be met in a single setup that respects the
container invariant.

> Now, back to not allowing switching back and forth between resource
> domains and thread subtrees.  Let's say we allow that and compose a
> hierarchy as follows.  Let's say A and B are resource domains and T's
> are subtrees of threads.
> 
>   A - T1 - B - T2
> 
> The resource domain controllers would see the following hierarchy.
> 
>   A - B
> 
> A will contain processes from T1 and B T2.  Both A and B would have
> internal consumptions from the processes and the no-internal-process
> constraint and thus resource domain abstraction are broken.

> If we want to support a hierarchy like that, we'll internally have to
> do something like
> 
>   A - B
>    \
>     A'

Because, and it took me a little while to get here, this:

        A
      /   \
     T1   t1
    /  \
   t2   B
       / \
      t3  T2
         /  \
        t4  t5


Ends up being this from a resource domain pov. (because the task
controllers are hierarchical their effective contribution collapses onto
the resource domain):

        A
      /   \
     B     t1, t2
     |
     t3,t4,t5
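
The collapse can be sketched as a small tree fold. This assumes A and B are
resource domains, T1 and T2 are thread groups and t1..t5 are tasks; the
dictionary encoding and the collapse() helper are invented for this
illustration, and the root is assumed to be a domain.

```python
# Toy encoding of the first tree: name -> (kind, tasks, child cgroups).
tree = {
    "A":  ("domain",   ["t1"],       ["T1"]),
    "T1": ("threaded", ["t2"],       ["B"]),
    "B":  ("domain",   ["t3"],       ["T2"]),
    "T2": ("threaded", ["t4", "t5"], []),
}

def collapse(name, domain, view):
    """Fold each thread group's tasks into its nearest domain ancestor."""
    kind, tasks, children = tree[name]
    if kind == "domain":
        # A domain shows up in the view, parented to the enclosing domain.
        view[name] = {"parent": domain, "tasks": []}
        domain = name
    view[domain]["tasks"].extend(tasks)
    for child in children:
        collapse(child, domain, view)
    return view

view = collapse("A", None, {})
for name, node in view.items():
    print(name, node)
```

In the result A holds t1 and t2 next to its child domain B, which is the
internal-process situation being discussed.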


> Now, this is exactly the same problem as having internal processes

Indeed, bugger.

> And here's another point, currently, all controllers are enabled
> consecutively from root.  If we have leaf thread subtrees, this still
> works fine.  Resource domain controllers won't be enabled into thread
> subtrees.  If we allow switching back and forth, what do we do in the
> middle while we're in the thread part?

From what I understand you cannot re-enable a controller once it's been
disabled, right? If you disable it, it's dead for the entire subtree.

I think it would work naturally if you only allow disabling system
controllers at the resource domain levels (thread controllers can be
disabled at any point).

That means that thread nodes will have the exact same system controllers
enabled as their resource domain, which makes perfect sense, 

Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-10 Thread Tejun Heo
Hello,

On Thu, Feb 09, 2017 at 11:29:09AM +0100, Peter Zijlstra wrote:
> Uhm, no. They would see the exact same hierarchy, seeing how there is
> only one tree. They would have a different view of it maybe, but I don't
> see how that matters, nor do you explain.

Sure, the base hierarchy is the same but different controllers would
need to see different subsets (or views) of the hierarchy.  As I wrote
before, cgroup v2 already does this to a certain extent by controllers
ignoring the hierarchy beyond certain points.  You're proposing to add
a new "view" of the hierarchy.  I'll explain why it matters below.

> > which brings in something completely new to the basic hierarchy.
> 
> I'm failing to see what.
> 
> > Different controllers seeing differing levels of the same hierarchy is
> > part of the basic behaviors
> 
> I have no idea what you mean there.

It's explained in Documentation/cgroup-v2.txt but for example, if the
whole hierarchy is,

A - B - C
  \ D

One controller might only see

A - B
  \ D

while another sees the whole thing.
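
As a rough illustration of such per-controller views, here is a small Python model. It assumes the usual cgroup v2 behaviour that a controller reaches a child only if the parent lists it in cgroup.subtree_control; the dictionaries and the controller_view() helper are made up for this sketch and greatly simplify the real rules.

```python
# parent -> children, matching the A - B - C / A - D example above
children = {"A": ["B", "D"], "B": ["C"], "C": [], "D": []}

# Stand-in for each cgroup's cgroup.subtree_control contents (assumed values).
subtree_control = {
    "A": {"cpu", "memory"},
    "B": {"cpu"},          # memory is not passed on below B
    "C": set(),
    "D": set(),
}


def controller_view(root, controller):
    """Return the cgroups a controller can see, starting from `root`."""
    seen = [root]
    if controller in subtree_control[root]:
        for child in children[root]:
            seen += controller_view(child, controller)
    return seen


memory_view = controller_view("A", "memory")
cpu_view = controller_view("A", "cpu")
```

With these assumed settings the memory view stops at B and D, while the cpu view covers the whole tree.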

> > and making subtrees threaded is a
> > straight-forward extension of that - threaded controllers just see
> > further into the hierarchy.  Adding threaded sub-sections in the
> > middle is more complex and frankly confusing.
> 
> I disagree, as I completely fail to see any confusion. The rules are
> simple and straightforward.
> 
> I also don't see why you would want to impose this artificial
> restriction. It doesn't get you anything. Why are you so keen on designs
> with these artificial limits on?

Because I actually understand and use this thing day in and day out?

Let's go back to the no-internal-process constraint.  The main reason
behind that is avoiding resource competition between child cgroups and
processes.  The reason why we need this is because for some resources
the terminal consumer (be that a process or task or anonymous) and the
resource domain that it belongs to (be that the system itself or a
cgroup) aren't equivalent.

If you make a memcg, put some processes in it and then create some
child cgroups, how resources should be distributed between those
processes and child cgroups is not clearly defined and can't be
controlled from userspace.  The resource control knobs in a child
cgroup govern how the resource is distributed from the parent.  For
child processes, we don't have those knobs.

There are multiple ways to deal with the problem.  We can add a
separate set of control knobs to govern resource consumption from
internal processes.  This effectively adds an implicit leaf node to
each cgroup so that internal processes or tasks are always in their
own leaf resource domain.  This however adds a lot of cruft to the
interface, the implementation gets nasty and the presented resource
hierarchy can be misleading to users.

Another option would be just letting each controller do whatever,
which is pretty much what we did in v1.  This got really bad because
the behaviors were widely inconsistent across controllers and often
implementation dependent without any way for the user to configure or
monitor what's going on.  Who gets how much becomes a matter of
accidents and people optimize for whatever arbitrary behaviors that
the kernel they're using is showing.

The no-internal-process rule establishes that resource domains are
always terminal in the resource graph for a given controller, such that
every competition along the resource hierarchy is always clearly defined
and configurable.  Only the terminal resource domains actually host
resource consumptions and they can behave analogously to a system which
doesn't have any cgroups at all.  Establishing resource domains this
way isn't the only approach to solve the problem; however, it is a
valid, simple and effective one.
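
A minimal sketch of what checking that rule could look like, in Python. The nested-dict layout and the internal_process_violations() helper are invented for illustration; the real kernel check is tied to which controllers are enabled and is not reproduced here. The root is exempted, as it is in cgroup v2.

```python
def internal_process_violations(path, node, is_root=True):
    """Return paths of non-root cgroups that hold both processes and children."""
    bad = []
    if not is_root and node["procs"] and node["children"]:
        bad.append(path)
    for name, child in node["children"].items():
        bad += internal_process_violations(f"{path}/{name}", child, is_root=False)
    return bad


# Hypothetical hierarchy: /A has a process of its own *and* a child cgroup.
hierarchy = {
    "procs": [1],
    "children": {
        "A": {"procs": [100], "children": {"B": {"procs": [200], "children": {}}}},
        "C": {"procs": [], "children": {"D": {"procs": [300], "children": {}}}},
    },
}

bad = internal_process_violations("", hierarchy)
```

Only /A is reported: /C has children but no processes, and the leaves have processes but no children.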

Now, back to not allowing switching back and forth between resource
domains and thread subtrees.  Let's say we allow that and compose a
hierarchy as follows.  Let's say A and B are resource domains and T's
are subtrees of threads.

A - T1 - B - T2

The resource domain controllers would see the following hierarchy.

A - B

A will contain processes from T1, and B those from T2.  Both A and B
would have internal consumptions from the processes and the
no-internal-process constraint and thus resource domain abstraction
are broken.  If we want to support a hierarchy like that, we'll
internally have to do something like

A - B
 \
  A'

Where cgroup A' contains processes from T1, and B those from T2.  Now, this is
exactly the same problem as having internal processes and can be
solved in the same ways.  The only realistic way to handle this in a
generic and consistent manner is creating a leaf cgroup to contain the
processes.  We sure can try to hide this from userspace and convolute
the interface but it can be solved *far* more elegantly by simply
requiring thread subtrees to be leaf subtrees.
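
To make the "implicit leaf" idea concrete, here is a hedged Python sketch of the transformation: any cgroup that has both processes and children gets a synthetic leaf (named with a trailing ' as in the A' diagram above) that takes over its processes. The dict layout and the push_internal_procs_to_leaf() helper are assumptions for this example, not anything the patchset implements.

```python
def push_internal_procs_to_leaf(node):
    """Move processes of non-leaf cgroups into a synthetic leaf child."""
    for child in node["children"]:
        push_internal_procs_to_leaf(child)
    if node["procs"] and node["children"]:
        leaf = {"name": node["name"] + "'", "procs": node["procs"], "children": []}
        node["procs"] = []
        node["children"].append(leaf)


# Domain view of A - T1 - B - T2: A holds T1's processes, B holds T2's.
a = {
    "name": "A",
    "procs": ["p1", "p2"],
    "children": [{"name": "B", "procs": ["p3"], "children": []}],
}

push_internal_procs_to_leaf(a)
child_names = [c["name"] for c in a["children"]]
```

After the call A itself is empty and has the children B and A', so every consumer sits in a leaf again.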

And here's another point, currently, all controllers are 

Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-10 Thread Tejun Heo
On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> What are the motivations that you see for forcing this all onto one
> mount-point via .threads sub-tree tags?

So, you wanted rgroup but with /proc interface?  I'm afraid it's too
late for that.

Thanks.

-- 
tejun


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-09 Thread Mike Galbraith
On Thu, 2017-02-09 at 15:47 +0100, Peter Zijlstra wrote:
> On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> > The only case that this does not support vs ".threads" would be some
> > hybrid where we co-mingle threads from different processes (with the
> > processes belonging to the same node in the hierarchy).  I'm not aware
> > of any usage that looks like this.
> 
> If I understand you right; this is a fairly common thing with RT where
> we would stuff all the !rt threads of the various processes in a 'misc'
> bucket.
> 
> Similarly, it happens that we stuff the various rt threads of processes
> in a specific (shared) 'rt' bucket.
> 
> So I would certainly not like to exclude that setup.

Absolutely, you just described my daily bread performance setup.

-Mike


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-09 Thread Peter Zijlstra
On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> The only case that this does not support vs ".threads" would be some
> hybrid where we co-mingle threads from different processes (with the
> processes belonging to the same node in the hierarchy).  I'm not aware
> of any usage that looks like this.

If I understand you right; this is a fairly common thing with RT where
we would stuff all the !rt threads of the various processes in a 'misc'
bucket.

Similarly, it happens that we stuff the various rt threads of processes
in a specific (shared) 'rt' bucket.

So I would certainly not like to exclude that setup.
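
For illustration, the kind of setup described above amounts to something like the following Python sketch, which only plans which thread goes into which shared bucket. The (pid, tid, is_rt) tuples and the plan_buckets() helper are hypothetical; actually moving the threads would mean writing each TID into the corresponding group, which is left out here.

```python
from collections import defaultdict

# (pid, tid, is_rt) for threads of two made-up processes
threads = [
    (100, 100, False),
    (100, 101, True),
    (200, 200, False),
    (200, 202, True),
]


def plan_buckets(threads):
    """Group TIDs from all processes into shared 'rt' and 'misc' buckets."""
    buckets = defaultdict(list)
    for _pid, tid, is_rt in threads:
        buckets["rt" if is_rt else "misc"].append(tid)
    return dict(buckets)


plan = plan_buckets(threads)
```

Each bucket ends up with threads from both processes, which is exactly the co-mingling case in question.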


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-09 Thread Paul Turner
On Thu, Feb 2, 2017 at 12:06 PM, Tejun Heo wrote:
> Hello,
>
> This patchset implements cgroup v2 thread mode.  It is largely based
> on the discussions that we had at the plumbers last year.  Here's the
> rough outline.
>
> * Thread mode is explicitly enabled on a cgroup by writing "enable"
>   into "cgroup.threads" file.  The cgroup shouldn't have any child
>   cgroups or enabled controllers.
>
> * Once enabled, arbitrary sub-hierarchy can be created and threads can
>   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>   file.  Process granularity and no-internal-process constraint don't
>   apply in a threaded subtree.
>
> * To be used in a threaded subtree, controllers should explicitly
>   declare thread mode support and should be able to handle internal
>   competition in some way.
>
> * The root of a threaded subtree serves as the resource domain for the
>   whole subtree.  This is where all the controllers are guaranteed to
>   have a common ground and resource consumptions in the threaded
>   subtree which aren't tied to a specific thread are charged.
>   Non-threaded controllers never see beyond thread root and can assume
>   that all controllers will follow the same rules up to that point.
>
> This allows threaded controllers to implement thread granular resource
> control without getting in the way of system level resource
> partitioning.
>

I think that this is definitely a step in the right direction versus
previous proposals.
However, as proposed it feels like the API is conflating the
process/thread distinction with the core process hierarchy.  While
this does allow previous use-cases to be re-enabled, it seems to do so
at an unnecessary API cost.

As proposed, the cgroup.threads file means that threads are always
anchored in the tree by their process parent.  They may never move
past it.  I.e., if I have cgroups root/A/B, with B allowing sub-thread
moves and the parent process belonging to A or B, it is clear that the
child cannot be moved beyond the parent.

Now this, in itself, is a natural restriction.  However, with this in
hand, it means that we are effectively co-mingling two hierarchies
onto the same tree: one that applies to processes, and per-process
sub-trees.

This introduces the following costs/restrictions:

1) We lose the ability to reasonably move a process.  This puts us
back to the existing challenge of the V1 API in which a thread is the
unit we can move atomically.  Hierarchies must be externally managed
and synchronized.

2) This retains all of the problems of the existing V1 API for a
process which wants to use these sub-groups to coordinate its threads.
It must coordinate its operations on these groups with the global
hierarchy (which is not consistently mounted) as well as potential
migration -- (1) above.

With the split as proposed, I fundamentally do not see the advantage
of exposing these as the same hierarchy.  By definition these .threads
files are essentially introducing independent, process level,
sub-hierarchies.

It seems greatly preferable to expose the sub-process level
hierarchies via a separate path, e.g.:
  /proc/{pid, self}/thread_cgroups/

Any controllers enabled on the hierarchy that the process belonged to,
which support thread level operations would appear within.  This fully
addresses (1) and (2) while allowing us to keep the unified
process-granular v2-cgroup mounts as is.

The only case that this does not support vs ".threads" would be some
hybrid where we co-mingle threads from different processes (with the
processes belonging to the same node in the hierarchy).  I'm not aware
of any usage that looks like this.

What are the motivations that you see for forcing this all onto one
mount-point via .threads sub-tree tags?


> This patchset contains the following five patches.  For more details
> on the interface and behavior, please refer to the last patch.
>
>  0001-cgroup-reorganize-cgroup.procs-task-write-path.patch
>  0002-cgroup-add-flags-to-css_task_iter_start-and-implemen.patch
>  0003-cgroup-introduce-cgroup-proc_cgrp-and-threaded-css_s.patch
>  0004-cgroup-implement-CSS_TASK_ITER_THREADED.patch
>  0005-cgroup-implement-cgroup-v2-thread-support.patch
>
> This patchset is on top of cgroup/for-4.11 63f1ca59453a ("Merge branch
> 'cgroup/for-4.11-rdmacg' into cgroup/for-4.11") and available in the
> following git branch.
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-threads
>
> diffstat follows.  Thanks.
>
>  Documentation/cgroup-v2.txt     |   75 -
>  include/linux/cgroup-defs.h     |   38 ++
>  include/linux/cgroup.h          |   12
>  kernel/cgroup/cgroup-internal.h |    8
>  kernel/cgroup/cgroup-v1.c       |   64 +++-
>  kernel/cgroup/cgroup.c          |  589
>  kernel/cgroup/cpuset.c          |    6
>  kernel/cgroup/freezer.c         |    6
>  kernel/cgroup/pids.c            |    1
>  kernel/events/core.c            |    1
>  

Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-09 Thread Peter Zijlstra
On Wed, Feb 08, 2017 at 06:08:19PM -0500, Tejun Heo wrote:
> (cc'ing Linus and Andrew for visibility)
> 
> Hello, Peter.
> 
> On Mon, Feb 06, 2017 at 01:49:43PM +0100, Peter Zijlstra wrote:
> > But to me the resource domain is your primary new construct; so it makes
> > more sense to explicitly mark that.
> 
> Whether it's new or not isn't the point.  Resource domains weren't
> added arbitrarily.  We were missing critical resource accounting and
> control capabilities because cgroup v1's abstraction wasn't strong
> enough to cover some of the major resource consumers and how different
> resources can interact with each other.
> 
> Resource domains were added to address that.  Given that cgroup's
> primary goal is providing resource accounting and control, it doesn't
> make sense to make this optional.

I'm not sure what you're saying here. Are you agreeing that 'resource
domains' are the primary new construct or not?

> > My question was, if you have root.threads=1, can you then still create
> > (sub) resource domains? Can I create a child cgroup and clear "threads"
> > again?
> >
> > (I'm assuming "threads" is inherited when creating new groups).
> > 
> > Now, _if_ we can do the above, then "threads" is not sufficient to
> > uniquely identify resource domains, which I think was your point in the
> > other email. Which argues against the interface. Because a group can be
> > a resource domain _and_ have threads sub-trees.
> >
> > OTOH, if you can _not_ do this, then this proposal is
> > insufficient/inadequate.
> 
> No, you can't flip it back and I'm not convinced this matters.  More
> on this below.

Then I shall preliminarily NAK your proposal right here, but I shall
continue to read on.

> > So, just to recap, my proposal is as follows:
> > 
> >  1) each cgroup will have a new flag, indicating if this is a resource
> >     domain.
> > 
> >     a) this flag will be inherited; when creating a new cgroup, the
> >        state of the flag will be set to the value of the parent cgroup.
> > 
> >     b) the root cgroup is a resource domain per definition, will
> >        have it set (cannot be cleared).
> > 
> >  2) all tasks of a process shall be part of the same resource domain
> > 
> >  3) controllers come in two types:
> > 
> >     a) task based controllers; these use the direct cgroup the task
> >        is assigned to.
> > 
> >     b) resource controllers; these use the effective resource group
> >        of the task, which is the first parent group with our new
> >        flag set.
> > 
> > 
> > With an optional:
> > 
> >  1c) this flag can only be changed on empty groups
> > 
> > to ease implementation.
> > 
> > From these rules it follows that:
> > 
> > - 1a and 1b together ensure no change in behaviour per default
> >   for cgroup-v2.
> > 
> > - 2 and 3a together ensure resource groups automagically work for task
> >   based controllers (under the assumption that the controller is
> >   strictly hierarchical).
> > 
> > For example, cpuacct does the accounting strictly hierarchical, it adds
> > the cpu usage to each parent group. Therefore the total cpu usage
> > accounted to the resource group is the same as if all tasks were part of
> > that group.
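
Rule 3b and the cpuacct example quoted above can be modelled roughly as follows. This is a Python sketch under stated assumptions: the parent and is_domain tables, the cgroup names and both helpers are invented here, and the "flag" is simply a boolean per cgroup.

```python
# Hypothetical hierarchy: root and A are resource domains, T and T/x are not.
parent = {"root": None, "A": "root", "T": "A", "T/x": "T"}
is_domain = {"root": True, "A": True, "T": False, "T/x": False}


def effective_domain(cgroup):
    """Rule 3b: walk up from the cgroup to the first one with the flag set."""
    while not is_domain[cgroup]:
        cgroup = parent[cgroup]
    return cgroup


def charge(usage, cgroup, amount):
    """Strictly hierarchical accounting: add usage to the cgroup and all ancestors."""
    while cgroup is not None:
        usage[cgroup] = usage.get(cgroup, 0) + amount
        cgroup = parent[cgroup]


usage = {}
charge(usage, "T", 5)      # a task sitting directly in T
charge(usage, "T/x", 7)    # a task one level further down
domain = effective_domain("T/x")
```

The total accounted to A (12) is the same as if both tasks had been placed in A directly, which is the property the quoted text relies on.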
> 
> So, what you're proposing isn't that different from what the posted
> patches implement except that what's implemented doesn't allow
> flipping a part of a threaded subtree back to domain mode.
> 
> Your proposal is more complicated while seemingly not adding much to
> what can be achieved.  The original proposal is very simple.  It allows
> declaring a subtree to be threaded (or task based) and that's it.  A
> threaded subtree can't have resource domains under it.
> 
> The only addition that your proposal has is the ability to mark
> portions of such subtree to be domains again.  This would require
> making domain controllers and thread controllers see different
> hierarchies,

Uhm, no. They would see the exact same hierarchy, seeing how there is
only one tree. They would have a different view of it maybe, but I don't
see how that matters, nor do you explain.

> which brings in something completely new to the basic hierarchy.

I'm failing to see what.

> Different controllers seeing differing levels of the same hierarchy is
> part of the basic behaviors

I have no idea what you mean there.

> and making subtrees threaded is a
> straight-forward extension of that - threaded controllers just see
> further into the hierarchy.  Adding threaded sub-sections in the
> middle is more complex and frankly confusing.

I disagree, as I completely fail to see any confusion. The rules are
simple and straightforward.

I also don't see why you would want to impose this artificial
restriction. It doesn't get you anything. Why are you so keen on designs
with these artificial limits?

> Let's say we can make that work but what are the use cases which would
> require such setup where we have to alternate between thread and
> domain modes through out the resource 

Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-08 Thread Tejun Heo
(cc'ing Linus and Andrew for visibility)

Hello, Peter.

On Mon, Feb 06, 2017 at 01:49:43PM +0100, Peter Zijlstra wrote:
> But to me the resource domain is your primary new construct; so it makes
> more sense to explicitly mark that.

Whether it's new or not isn't the point.  Resource domains weren't
added arbitrarily.  We were missing critical resource accounting and
control capabilities because cgroup v1's abstraction wasn't strong
enough to cover some of the major resource consumers and how different
resources can interact with each other.

Resource domains were added to address that.  Given that cgroup's
primary goal is providing resource accounting and control, it doesn't
make sense to make this optional.

> (FWIW note that your whole initial cgroup-v2 thing is counter intuitive
> to me, as someone who has only ever dealt with thread capable
> controllers.)

That is completely fine, but you can't direct the overall design while
claiming not to use, care about, or be familiar with the subject at
hand.  I do understand that you have certain use cases that you think
should be covered.  Let's focus on that.

> My question was, if you have root.threads=1, can you then still create
> (sub) resource domains? Can I create a child cgroup and clear "threads"
> again?
>
> (I'm assuming "threads" is inherited when creating new groups).
> 
> Now, _if_ we can do the above, then "threads" is not sufficient to
> uniquely identify resource domains, which I think was your point in the
> other email. Which argues against the interface. Because a group can be
> a resource domain _and_ have threads sub-trees.
>
> OTOH, if you can _not_ do this, then this proposal is
> insufficient/inadequate.

No, you can't flip it back and I'm not convinced this matters.  More
on this below.

> > In practice, how would this work?  To enable memcg, the user has to
> > first create the subtree and then explicitly have to make that a
> > domain and then enable memcg?  If so, that would be a completely
> > unnecessary deviation from the current behavior while not achieving
> > any more functionalities, right?  It's just flipping the default value
> > the other way around and the default wouldn't be supported by many of
> > the controllers.  I can't see how that is a better option.
> 
> OK, so I'm entirely unaware of this enabling of controllers. What's that
> about? I thought the whole point of cgroup-v2 was to have all
> controllers enabled over the entire tree, this is not so?

This is one of the most basic aspects of cgroup v2.  In short, while
the controllers share the hierarchy, each doesn't have to be enabled
all the way down to the leaf.  Different controllers can see up to
different subsets of the hierarchy spreading out from the root.
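
To make the paragraph above concrete, here is a toy model (not kernel
code) of per-controller enablement.  The class `Cgroup` and its method
names are invented for this sketch; the only rule it tries to capture
is that a controller is available in a cgroup only if the parent passes
it down, so each controller covers a subset of the tree growing out
from the root.

```python
class Cgroup:
    """Toy model of a cgroup v2 node; all names here are illustrative."""

    ALL_CONTROLLERS = frozenset({"cpu", "memory", "io"})

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.subtree_control = set()  # controllers passed down to children

    def controllers(self):
        """Controllers available in this cgroup."""
        if self.parent is None:
            return set(self.ALL_CONTROLLERS)
        return set(self.parent.subtree_control)

    def enable(self, controller):
        """Pass `controller` down to children; it must be available here."""
        if controller not in self.controllers():
            raise ValueError(f"{controller} is not available in {self.name}")
        self.subtree_control.add(controller)


root = Cgroup("root")
a = Cgroup("a", root)
b = Cgroup("b", a)

root.enable("cpu")
root.enable("memory")
a.enable("cpu")  # only cpu is passed further down

# memory stops at "a", cpu reaches "b"
print(sorted(a.controllers()))  # ['cpu', 'memory']
print(sorted(b.controllers()))  # ['cpu']
```

In this model "memory" sees the tree down to "a" while "cpu" sees it
down to "b", which is the sense in which controllers see different
depths of one shared hierarchy.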

> In any case, yes, more or less like that, except of course, not at all
> :-) If we make this flag inherited (which I think it should be), you
> don't need to do anything different from today, because the root group
> must be a resource domain, any new sub-group will automagically also be.
> 
> Only once you clear the flag somewhere do you get 'new' behaviour. Note
> that the only extra constraint is that all threads of a process must
> stay within the same resource domain, anything else goes.
> 
> Task based controllers operate on the actual cgroup, resource domain
> controllers always map it back to the resource group. Finding a random
> task's resource domain is trivial; simply walk up the hierarchy until
> you find a group with the flag set. 
> 
> So, just to recap, my proposal is as follows:
> 
>  1) each cgroup will have a new flag, indicating if this is a resource
>     domain.
> 
>     a) this flag will be inherited; when creating a new cgroup, the
>        state of the flag will be set to the value of the parent cgroup.
> 
>     b) the root cgroup is a resource domain per definition, will
>        have it set (cannot be cleared).
> 
>  2) all tasks of a process shall be part of the same resource domain
> 
>  3) controllers come in two types:
> 
>     a) task based controllers; these use the direct cgroup the task
>        is assigned to.
> 
>     b) resource controllers; these use the effective resource group
>        of the task, which is the first parent group with our new
>        flag set.
> 
> 
> With an optional:
> 
>  1c) this flag can only be changed on empty groups
> 
> to ease implementation.
> 
> From these rules it follows that:
> 
> - 1a and 1b together ensure no change in behaviour per default
>   for cgroup-v2.
> 
> - 2 and 3a together ensure resource groups automagically work for task
>   based controllers (under the assumption that the controller is
>   strictly hierarchical).
> 
> For example, cpuacct does the accounting strictly hierarchical, it adds
> the cpu usage to each parent group. Therefore the total cpu usage
> accounted to the resource group is the same as if all tasks were part of
> that group.

So, what you're proposing isn't that different from what the 

Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-06 Thread Peter Zijlstra
On Fri, Feb 03, 2017 at 03:59:55PM -0500, Tejun Heo wrote:
> Hello, Peter.
> 
> On Fri, Feb 03, 2017 at 09:20:48PM +0100, Peter Zijlstra wrote:
> > So my proposal was to do the inverse of what you propose here. Instead
> > of marking special 'thread' subtrees, explicitly mark resource domains
> > in the tree.
> > 
> > So always allow arbitrary hierarchies and allow threads to be assigned
> > to cgroups, as long as they're all in the same resource domain.
> > 
> > Controllers that do not support things fall back to mapping everything
> > to the nearest resource domain.
> 
> That sounds counter-intuitive as all controllers can do resource
> domains and only a subset of them can do thread mode.

But to me the resource domain is your primary new construct; so it makes
more sense to explicitly mark that.

(FWIW note that your whole initial cgroup-v2 thing is counter intuitive
to me, as someone who has only ever dealt with thread capable
controllers.)

> Also, thread subtrees are necessarily a sub-hierarchy of a resource domain.

Sure, don't see how that is relevant though. Or rather, I don't see it
being an argument one way or the other.

> Also,
> expanding resource domains from the root after the trees are populated
> would make the behaviors surprising as the resource domains that these
> subtrees belong to would change dynamically.

Uh what? I cannot parse that.

My question was, if you have root.threads=1, can you then still create
(sub) resource domains? Can I create a child cgroup and clear "threads"
again?

(I'm assuming "threads" is inherited when creating new groups).

Now, _if_ we can do the above, then "threads" is not sufficient to
uniquely identify resource domains, which I think was your point in the
other email. Which argues against the interface. Because a group can be
a resource domain _and_ have threads sub-trees.

OTOH, if you can _not_ do this, then this proposal is
insufficient/inadequate.

> In practice, how would this work?  To enable memcg, the user has to
> first create the subtree and then explicitly have to make that a
> domain and then enable memcg?  If so, that would be a completely
> unnecessary deviation from the current behavior while not achieving
> any more functionalities, right?  It's just flipping the default value
> the other way around and the default wouldn't be supported by many of
> the controllers.  I can't see how that is a better option.

OK, so I'm entirely unaware of this enabling of controllers. What's that
about? I thought the whole point of cgroup-v2 was to have all
controllers enabled over the entire tree, this is not so?

In any case, yes, more or less like that, except of course, not at all
:-) If we make this flag inherited (which I think it should be), you
don't need to do anything different from today, because the root group
must be a resource domain, any new sub-group will automagically also be.

Only once you clear the flag somewhere do you get 'new' behaviour. Note
that the only extra constraint is that all threads of a process must
stay within the same resource domain, anything else goes.

Task based controllers operate on the actual cgroup, resource domain
controllers always map it back to the resource group. Finding a random
task's resource domain is trivial; simply walk up the hierarchy until
you find a group with the flag set. 



So, just to recap, my proposal is as follows:

 1) each cgroup will have a new flag, indicating if this is a resource
    domain.

    a) this flag will be inherited; when creating a new cgroup, the
       state of the flag will be set to the value of the parent cgroup.

    b) the root cgroup is a resource domain per definition, will
       have it set (cannot be cleared).

 2) all tasks of a process shall be part of the same resource domain

 3) controllers come in two types:

    a) task based controllers; these use the direct cgroup the task
       is assigned to.

    b) resource controllers; these use the effective resource group
       of the task, which is the first parent group with our new
       flag set.


With an optional:

 1c) this flag can only be changed on empty groups

to ease implementation.
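
The rules above can be sketched as a small model.  This is only an
illustration, not an implementation: the class `Cgroup`, the
`is_domain` attribute and the helper `same_domain()` are hypothetical
names, "empty" in 1c is assumed to mean no tasks and no children, and
the walk in 3b is assumed to start at the task's own group.

```python
class Cgroup:
    """Toy model of the proposal; all names are hypothetical."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.tasks = []
        # 1b: root is always a domain; 1a: children inherit the parent's flag.
        self.is_domain = True if parent is None else parent.is_domain
        if parent is not None:
            parent.children.append(self)

    def set_domain(self, value):
        if self.parent is None and not value:
            raise ValueError("root must stay a resource domain")  # 1b
        if self.tasks or self.children:
            raise ValueError("flag can only change on empty groups")  # 1c
        self.is_domain = value

    def resource_domain(self):
        """3b: nearest group, starting from this one, with the flag set."""
        cg = self
        while not cg.is_domain:
            cg = cg.parent  # terminates: root always has the flag (1b)
        return cg


def same_domain(placement):
    """Rule 2: every thread of one process maps to the same domain."""
    domains = {cg.resource_domain() for cg in placement.values()}
    return len(domains) <= 1


root = Cgroup("root")
a = Cgroup("A", root)
t = Cgroup("T", a)
t.set_domain(False)   # T is a thread-level group inside A's domain
t1 = Cgroup("T1", t)  # inherits the cleared flag (1a)
b = Cgroup("B", t)
b.set_domain(True)    # a resource domain again, below T

print(t1.resource_domain().name)             # A
print(b.resource_domain().name)              # B
print(same_domain({"tid1": t, "tid2": t1}))  # True
print(same_domain({"tid1": t1, "tid2": b}))  # False
```

A task based controller (3a) would simply use the group a thread sits
in, so it needs no lookup at all in this model.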



From these rules it follows that:

- 1a and 1b together ensure no change in behaviour per default
  for cgroup-v2.

- 2 and 3a together ensure resource groups automagically work for task
  based controllers (under the assumption that the controller is
  strictly hierarchical).


For example, cpuacct does the accounting strictly hierarchical, it adds
the cpu usage to each parent group. Therefore the total cpu usage
accounted to the resource group is the same as if all tasks were part of
that group.
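
A minimal sketch of that accounting argument, assuming a controller
that charges usage to the task's own group and to every ancestor.  The
class and the numbers are made up for illustration; this is not how
cpuacct is implemented.

```python
class Cgroup:
    """Toy strictly hierarchical accounting; names are illustrative."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.usage = 0  # accumulated CPU time, arbitrary units

    def charge(self, amount):
        """Add `amount` to this group and to every ancestor."""
        cg = self
        while cg is not None:
            cg.usage += amount
            cg = cg.parent


root = Cgroup("root")
domain = Cgroup("domain", root)      # the resource domain
workers = Cgroup("workers", domain)  # thread-level groups below it
io = Cgroup("io", domain)

workers.charge(30)  # threads placed in sub-groups
io.charge(12)
domain.charge(5)    # a thread left in the domain itself

print(domain.usage)  # 47
print(root.usage)    # 47
```

The total seen at the resource domain is the same 47 units it would
have seen had all three threads been attached to it directly.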



Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-06 Thread Peter Zijlstra
On Thu, Feb 02, 2017 at 04:52:29PM -0500, Tejun Heo wrote:

> > Why do you need to manually turn it on?  That is, couldn't it be
> > automatic based on what controllers are enabled?
> 
> This came up already but it's not like some controllers are inherently
> thread-only.  Consider CPU, all in-context CPU cycle consumptions are
> tied to the thread; however, we also want to be able to account for
> CPU cycles consumed for, for example, memory reclaim or encryption
> during writeback.
> 
> I played with an interface where thread mode is enabled automatically
> up to the common ancestor of the threads but not only was it
> complicated to implement but also the eventual behavior was very
> confusing as the resource domain can change without any active actions
> from the user.  I think keeping things simple is the right choice
> here.

Note that the explicit marking of the resource domains gets you exactly
that. But let me reply in the other subthread.


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-03 Thread Tejun Heo
Hello,

On Fri, Feb 03, 2017 at 01:10:21PM -0800, Andy Lutomirski wrote:
> Is this flexible enough for the real-world usecases?  For my use case

I can't think of a reason why it won't be.  Capability-wise, nothing
is being lost by the interface.

> (if I actually ported over to this), it would mean that I'd have to
> enable thread mode on the root.  What about letting a given process
> (actually mm, perhaps) live in a cgroup but let the threads be in
> different cgroups without any particular constraints.  Then
> process-wide stuff would be accounted to the cgroup that owns the
> process.

I don't know.  So, then, we basically have completely separate trees
for resource domains and threads.  That is exactly what mounting the cpu
controller separately does.  It doesn't make sense to put them on the
same hierarchy.  Why?

> > If a controller can't possibly define how internal competition should
> > be handled, which is unlikely - the problem is being consistent and
> > sensible, defining something isn't difficult - the controller can
> > simply error out those cases either on configuration or migration.
> > Again, I'm very doubtful we'll need that but if we ever need that
> > denying specific configurations is the best we can do anyway.
> 
> I'm not sure I follow.
> 
> I'm suggesting something quite simple: let controllers that don't need
> the no-internal-process constraints set a flag so that the constraints
> don't apply if all enabled controllers have the flag set.

Firstly, I think it's better to have the rules as simple and
consistent as possible as long as we don't sacrifice critical
capabilities.

Secondly, all the major resource controllers including cpu would
eventually need resource domain, so there is no real practical upside
to doing so.

Thirdly, if we commit to something like "controller X is not subject
to no-internal-process constraint", that commitment would prevent us
from ever adding domain level operations to that controller without
breaking the userland visible interface.  All controllers do and have to
support process level operations.  Some controllers can do thread
level operations.  Keeping the latter opt-in doesn't block us from
adding thread mode later on.  Doing it the other way around blocks us
from adding domain level operations later on.

Thanks.

-- 
tejun


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-03 Thread Andy Lutomirski
On Thu, Feb 2, 2017 at 1:52 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Feb 02, 2017 at 01:32:19PM -0800, Andy Lutomirski wrote:
>> > * Thread mode is explicitly enabled on a cgroup by writing "enable"
>> >   into "cgroup.threads" file.  The cgroup shouldn't have any child
>> >   cgroups or enabled controllers.
>>
>> Why do you need to manually turn it on?  That is, couldn't it be
>> automatic based on what controllers are enabled?
>
> This came up already but it's not like some controllers are inherently
> thread-only.  Consider CPU, all in-context CPU cycle consumptions are
> tied to the thread; however, we also want to be able to account for
> CPU cycles consumed for, for example, memory reclaim or encryption
> during writeback.
>

Is this flexible enough for the real-world use cases?  For my use case
(if I actually ported over to this), it would mean that I'd have to
enable thread mode on the root.  What about letting a given process
(actually mm, perhaps) live in a cgroup but let the threads be in
different cgroups without any particular constraints.  Then
process-wide stuff would be accounted to the cgroup that owns the
process.

>
>> > * Once enabled, arbitrary sub-hierarchy can be created and threads can
>> >   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>> >   file.  Process granularity and no-internal-process constraint don't
>> >   apply in a threaded subtree.
>>
>> I'm a bit worried that this conflates two different things.  There's
>> thread support, i.e. allowing individual threads to be placed into
>> cgroups.  There's also more flexible sub-hierarchy support, i.e.
>> relaxing no-internal-process constraints.  For the "cpuacct"
>> controller, for example, both of these make sense.  But what if
>> someone writes a controller (directio, for example, just to make
>> something up) for which thread granularity makes sense but relaxing
>> no-internal-process constraints does not?
>
> If a controller can't possibly define how internal competition should
> be handled, which is unlikely - the problem is being consistent and
> sensible, defining something isn't difficult - the controller can
> simply error out those cases either on configuration or migration.
> Again, I'm very doubtful we'll need that but if we ever need that
> denying specific configurations is the best we can do anyway.
>

I'm not sure I follow.

I'm suggesting something quite simple: let controllers that don't need
the no-internal-process constraints set a flag so that the constraints
don't apply if all enabled controllers have the flag set.
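
A minimal sketch of the rule suggested above, in Python for brevity. The
`Controller` type, its `allows_internal_processes` flag and the function
name are all hypothetical; nothing like them exists in the patchset, and
this only illustrates the "all enabled controllers have the flag set"
condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controller:
    name: str
    # Hypothetical opt-out flag: True if the controller can cope with
    # processes living in a cgroup that also has child cgroups.
    allows_internal_processes: bool = False


def constraint_applies(enabled):
    """Return True if the no-internal-process constraint must be enforced.

    The constraint is lifted only when every enabled controller has
    opted out. With no controllers enabled, all() is vacuously True,
    so the constraint does not apply.
    """
    return not all(c.allows_internal_processes for c in enabled)
```

Under this rule, enabling a single controller without the flag is enough
to bring the constraint back for that cgroup, whatever the other enabled
controllers declare.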

--Andy


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-03 Thread Tejun Heo
Hello, Peter.

On Fri, Feb 03, 2017 at 09:20:48PM +0100, Peter Zijlstra wrote:
> So my proposal was to do the inverse of what you propose here. Instead
> of marking special 'thread' subtrees, explicitly mark resource domains
> in the tree.
> 
> So always allow arbitrary hierarchies and allow threads to be assigned
> to cgroups, as long as they're all in the same resource domain.
> 
> Controllers that do not support things, fallback to mapping everything
> to the nearest resource domain.

That sounds counter-intuitive as all controllers can do resource
domains and only a subset of them can do thread mode.  Also, thread
subtrees are necessarily a sub-hierarchy of a resource domain.  Also,
expanding resource domains from the root after the trees are populated
would make the behaviors surprising as the resource domains that these
subtrees belong to would change dynamically.

In practice, how would this work?  To enable memcg, the user has to
first create the subtree and then explicitly have to make that a
domain and then enable memcg?  If so, that would be a completely
unnecessary deviation from the current behavior while not achieving
any more functionalities, right?  It's just flipping the default value
the other way around and the default wouldn't be supported by many of
the controllers.  I can't see how that is a better option.

Thanks.

-- 
tejun


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-03 Thread Peter Zijlstra
On Thu, Feb 02, 2017 at 03:06:27PM -0500, Tejun Heo wrote:
> Hello,
> 
> This patchset implements cgroup v2 thread mode.  It is largely based
> on the discussions that we had at the plumbers last year.  Here's the
> rough outline.
> 
> * Thread mode is explicitly enabled on a cgroup by writing "enable"
>   into "cgroup.threads" file.  The cgroup shouldn't have any child
>   cgroups or enabled controllers.
> 
> * Once enabled, arbitrary sub-hierarchy can be created and threads can
>   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>   file.  Process granularity and no-internal-process constraint don't
>   apply in a threaded subtree.
> 
> * To be used in a threaded subtree, controllers should explicitly
>   declare thread mode support and should be able to handle internal
>   competition in some way.
> 
> * The root of a threaded subtree serves as the resource domain for the
>   whole subtree.  This is where all the controllers are guaranteed to
>   have a common ground and resource consumptions in the threaded
>   subtree which aren't tied to a specific thread are charged.
>   Non-threaded controllers never see beyond thread root and can assume
>   that all controllers will follow the same rules up to that point.
> 
> This allows threaded controllers to implement thread granular resource
> control without getting in the way of system level resource
> partitioning.
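
As a concrete reading of the outline above, here is a minimal Python
sketch of the file operations it describes. It assumes the interface
exactly as proposed in this patchset (writing "enable" and then TIDs into
"cgroup.threads"); the helper names are hypothetical, and the interface
of any released kernel may differ.

```python
import os


def enable_thread_mode(cgroup_dir):
    """Turn a cgroup into the root of a threaded subtree.

    Per the outline, the cgroup should have no child cgroups and no
    enabled controllers when this is done.
    """
    with open(os.path.join(cgroup_dir, "cgroup.threads"), "w") as f:
        f.write("enable")


def move_thread(cgroup_dir, tid):
    """Place a single thread into a cgroup inside the threaded subtree
    by writing its TID into that cgroup's "cgroup.threads" file."""
    with open(os.path.join(cgroup_dir, "cgroup.threads"), "w") as f:
        f.write(str(tid))
```

A caller would run `enable_thread_mode()` once on the subtree root,
create child directories with `os.mkdir()`, and then call `move_thread()`
for each thread it wants to place.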


I'm still confused. So suppose I mark my root cgroup as such, because I
run RT tasks there. Does this then mean I cannot later start a VM and
have that containered properly? That is, I think threaded controllers
very much get in the way of system level resource partitioning this way.


So my proposal was to do the inverse of what you propose here. Instead
of marking special 'thread' subtrees, explicitly mark resource domains
in the tree.

So always allow arbitrary hierarchies and allow threads to be assigned
to cgroups, as long as they're all in the same resource domain.

Controllers that do not support things fall back to mapping everything
to the nearest resource domain.
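
The fallback described above can be sketched as a walk up the tree, here
in Python. The `Cgroup` class, the `is_domain` marker and both function
names are hypothetical, and treating the root as an implicit domain is an
assumption made so that the walk always terminates.

```python
class Cgroup:
    def __init__(self, name, parent=None, is_domain=False):
        self.name = name
        self.parent = parent
        # Assumption: the root always counts as a resource domain.
        self.is_domain = is_domain or parent is None


def nearest_domain(cg):
    """Walk towards the root until an explicitly marked domain is found."""
    while not cg.is_domain:
        cg = cg.parent
    return cg


def effective_cgroup(cg, controller_supports_threads):
    """Cgroup a controller would account against for a thread in `cg`.

    Thread-capable controllers use the cgroup itself; the others fall
    back to the nearest enclosing resource domain.
    """
    return cg if controller_supports_threads else nearest_domain(cg)
```

With a tree such as root -> vm (marked as a domain) -> vcpu0, a
thread-capable controller would see a thread in vcpu0 as belonging to
vcpu0, while a domain-only controller would see it as belonging to vm.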



Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-02 Thread Tejun Heo
Hello,

On Thu, Feb 02, 2017 at 01:32:19PM -0800, Andy Lutomirski wrote:
> > * Thread mode is explicitly enabled on a cgroup by writing "enable"
> >   into "cgroup.threads" file.  The cgroup shouldn't have any child
> >   cgroups or enabled controllers.
> 
> Why do you need to manually turn it on?  That is, couldn't it be
> automatic based on what controllers are enabled?

This came up already but it's not like some controllers are inherently
thread-only.  Consider CPU, all in-context CPU cycle consumptions are
tied to the thread; however, we also want to be able to account for
CPU cycles consumed for, for example, memory reclaim or encryption
during writeback.

I played with an interface where thread mode is enabled automatically
up to the common ancestor of the threads but not only was it
complicated to implement but also the eventual behavior was very
confusing as the resource domain can change without any active actions
from the user.  I think keeping things simple is the right choice
here.
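
The surprise described above can be illustrated with a small Python
sketch. It assumes, purely for illustration, that the implicit resource
domain is the deepest common ancestor of the cgroups holding a process's
threads; `implicit_domain` and the paths used with it are hypothetical
and not part of the patchset.

```python
import posixpath


def implicit_domain(thread_cgroups):
    """Deepest cgroup that is an ancestor of (or equal to) every cgroup
    in `thread_cgroups`, given as absolute cgroup paths."""
    return posixpath.commonpath(thread_cgroups)
```

With threads in "/a/b/t1" and "/a/b/t2" the domain is "/a/b". Moving the
second thread to "/a/c/t3" silently shifts the domain to "/a", although
nobody wrote to either "/a" or "/a/b".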

> > * Once enabled, arbitrary sub-hierarchy can be created and threads can
> >   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
> >   file.  Process granularity and no-internal-process constraint don't
> >   apply in a threaded subtree.
> 
> I'm a bit worried that this conflates two different things.  There's
> thread support, i.e. allowing individual threads to be placed into
> cgroups.  There's also more flexible sub-hierarchy support, i.e.
> relaxing no-internal-process constraints.  For the "cpuacct"
> controller, for example, both of these make sense.  But what if
> someone writes a controller (directio, for example, just to make
> something up) for which thread granularity makes sense but relaxing
> no-internal-process constraints does not?

If a controller can't possibly define how internal competition should
be handled, which is unlikely - the problem is being consistent and
sensible, defining something isn't difficult - the controller can
simply error out those cases either on configuration or migration.
Again, I'm very doubtful we'll need that but if we ever need that
denying specific configurations is the best we can do anyway.

Thanks.

-- 
tejun


Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

2017-02-02 Thread Andy Lutomirski
On Thu, Feb 2, 2017 at 12:06 PM, Tejun Heo wrote:
> Hello,
>
> This patchset implements cgroup v2 thread mode.  It is largely based
> on the discussions that we had at the plumbers last year.  Here's the
> rough outline.

I like this, but I have some design questions:

>
> * Thread mode is explicitly enabled on a cgroup by writing "enable"
>   into "cgroup.threads" file.  The cgroup shouldn't have any child
>   cgroups or enabled controllers.

Why do you need to manually turn it on?  That is, couldn't it be
automatic based on what controllers are enabled?

>
> * Once enabled, arbitrary sub-hierarchy can be created and threads can
>   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>   file.  Process granularity and no-internal-process constraint don't
>   apply in a threaded subtree.

I'm a bit worried that this conflates two different things.  There's
thread support, i.e. allowing individual threads to be placed into
cgroups.  There's also more flexible sub-hierarchy support, i.e.
relaxing no-internal-process constraints.  For the "cpuacct"
controller, for example, both of these make sense.  But what if
someone writes a controller (directio, for example, just to make
something up) for which thread granularity makes sense but relaxing
no-internal-process constraints does not?

--Andy

