Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Tue, Mar 21, 2017 at 01:39:58PM +0100, Peter Zijlstra wrote:
>
> And yes, having to consider views is new and a direct consequence of
> this new optional feature. But I don't see how it's a problem.
>

So aside from having (RO) links in thread groups for system controllers,
we could also have a ${controller}_parent link back to whatever group is
the actual parent for that specific controller's view.

So then your B's memcg_parent would point to A, not T.

But I feel this is all superfluous window dressing; but if you want to
clarify the filesystem interface, this could be something to consider.
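The ${controller}_parent idea can be illustrated with a small model of the A - T - B example. This is only a sketch of the two "views" discussed in the thread, not kernel code: the `Cgroup` class, the `controller_parent` and `resource_domain` helpers, and the controller sets are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical classification, following the examples given in the thread.
DOMAIN_CONTROLLERS = {"memory", "io", "hugetlb"}  # "system" controllers


@dataclass
class Cgroup:
    name: str
    parent: Optional["Cgroup"] = None
    threaded: bool = False  # True for a cgroup inside a thread subtree


def controller_parent(cg: Cgroup, controller: str) -> Optional[Cgroup]:
    """Parent of cg in the view of the given controller.

    Domain (system) controllers skip threaded ancestors; every other
    controller is treated as task based and sees the plain tree.
    """
    p = cg.parent
    if controller in DOMAIN_CONTROLLERS:
        while p is not None and p.threaded:
            p = p.parent
    return p


def resource_domain(cg: Cgroup) -> Cgroup:
    """Nearest non-threaded cgroup at or above cg."""
    while cg.threaded and cg.parent is not None:
        cg = cg.parent
    return cg


# The A - T - B layout from the discussion: T is a thread group under A,
# and B is a domain cgroup nested below T.
root = Cgroup("root")
a = Cgroup("A", parent=root)
t = Cgroup("T", parent=a, threaded=True)
b = Cgroup("B", parent=t)

print(controller_parent(b, "memory").name)  # memcg view of B's parent
print(controller_parent(b, "cpu").name)     # cpu view of B's parent
print(resource_domain(t).name)              # where T's tasks are charged
```

In this model B's memory parent resolves to A while its cpu parent stays T, which is the distinction a per-controller parent link would expose in the filesystem.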
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Mon, Mar 13, 2017 at 04:05:44PM -0400, Tejun Heo wrote:
> Hey, Peter. Sorry about the long delay.

No worries; we're all (too) busy.

> > > If we go to thread mode and back to domain mode, the control knobs for
> > > domain controllers don't make sense on the thread part of the tree and
> > > they won't have cgroup_subsys_state to correspond to either. For
> > > example,
> > >
> > > A - T - B
> > >
> > > B's memcg knobs would control memory distribution from A and cgroups
> > > in T can't have memcg knobs. It'd be weird to indicate that memcg is
> > > enabled in those cgroups too.
> >
> > But memcg _is_ enabled for T. All the tasks are mapped onto A for
> > purpose of the system controller (memcg) and are subject to its
> > constraints.
>
> Sure, T is contained in A but think about the interface. For memcg, T
> belongs to A. B is the first descendant when viewed from memcg, which
> brings about two problems - memcg doesn't have control knobs to assign
> throughout T which is just weird and there's no way to configure how T
> competes against B.
>
> > > We can make it work somehow. It's just weird-ass interface.
> >
> > You could make these control files (read-only?) symlinks back to A's
> > actual control files. To more explicitly show this.
>
> But the knobs are supposed to control how much resource a child gets
> from its parent. Flipping that over while walking down the same tree
> sounds horribly ugly and confusing to me. Besides, that doesn't solve
> the problem with lacking the ability to configure T's consumptions
> against B.

So I'm not confused; and I suspect you're not either. But you're worried
about 'simple' people getting confused?

The rules really are fairly straightforward; but yes, it will be a
little more involved than without this. But note that this is an
optional thing, people don't have to make thread groups if they don't
want to. And they further don't have to use non-leaf thread groups.

And at some point, there's no helping stupid; and trying to do so will
only make you insane.

So the fundamental thing to realize (and explain) is that there are two
different types of controllers; and that they operate on different views
of the hierarchy.

I think our goal as a kernel API should be presenting the capabilities
in a concise and consistent manner; and I feel that the proposed
interface is that.

So the points you raise above; about system controller knobs in thread
groups and competition between thread and system groups as seen for
system controllers are confusion due to not considering the views.

And yes, having to consider views is new and a direct consequence of
this new optional feature. But I don't see how it's a problem.

> Scheduling hackbench is an extreme case tho and in practice at least
> we're not seeing noticeable issues with a few levels of nesting when
> the workload actually spends cpu cycles doing things other than
> scheduling.

Right; most workloads don't schedule _that_ much; and the overhead isn't
too painful.

> However, we're seeing significant increase in scheduling
> latency coming from how cgroups are handled from the rebalance path.
> I'm still looking into it and will write about that in a separate
> thread.

I have some vague memories of this being a pain point. IIRC it comes
down to the problem that latency is an absolute measure and the weight
is a relative thing.

I think we mucked about with it some many years ago; but haven't done so
recently.

> > Also, there is the one giant wart in v2 wrt no-internal-processes;
> > namely the root group is allowed to have them.
> >
> > Now I understand why this is so; so don't feel compelled to explain that
> > again, but it does make the model very ugly and has a real problem, see
> > below. OTOH, since it is there, I would very much like to make use of
> > this 'feature' and allow a thread-group on the root group.
> >
> > And since you then _can_ have nested thread groups, it again becomes
> > very important to be able to find the resource domains, which brings me
> > back to my proposal; albeit with an additional constraint.
>
> I've thought quite a bit about ways to allow thread granularity from
> the top while still presenting a consistent picture to resource domain
> controllers. That's what's missing from the CPU controller side given
> Mike's claim that there's unavoidable overhead in nesting CPU
> controller and requiring at least one level of nesting on cgroup v2
> for thread granularity might not be acceptable.
>
> Going back to why thread support on cgroup v2 was needed in the first
> place, it was to allow thread level control while cooperating with
> other controllers on v2. IOW, allowing thread level control for CPU
> while cooperating with resource domain type controllers.

Well, not only CPU, I can see the same being used for perf for example.

> Now, going back to allowing thread hierarchies from the root, given
> that their resource domain can only be root, which is exactly
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Mon, 2017-03-13 at 15:26 -0400, Tejun Heo wrote:
> Hello, Mike.
>
> Sorry about the long delay.
>
> On Mon, Feb 13, 2017 at 06:45:07AM +0100, Mike Galbraith wrote:
> > > > So, as long as the depth stays reasonable (single digit or lower),
> > > > what we try to do is keeping tree traversal operations aggregated or
> > > > located on slow paths. There still are places that this overhead
> > > > shows up (e.g. the block controllers aren't too optimized) but it
> > > > isn't particularly difficult to make a handful of layers not matter at
> > > > all.
> > >
> > > A handful of cpu bean counting layers stings considerably.
>
> Hmm... yeah, I was trying to think about ways to avoid full scheduling
> overhead at each layer (the scheduler does a lot per each layer of
> scheduling) but don't think it's possible to circumvent that without
> introducing a whole lot of scheduling artifacts.

Yup.

> In a lot of workloads, the added overhead from several layers of CPU
> controllers doesn't seem to get in the way too much (most threads do
> something other than scheduling after all).

Sure, don't schedule a lot, it doesn't hurt much, but there are plenty
of loads that routinely do schedule a LOT, and there it matters a LOT..
which is why network benchmarks tend to be severely allergic to
scheduler lard.

> The only major issue that
> we're seeing in the fleet is the cgroup iteration in idle rebalancing
> code pushing up the scheduling latency too much but that's a different
> issue.

Hm, I would suspect PELT to be the culprit there. It helps smooth out
load balancing, but will stack "skinny looking" tasks.

	-Mike
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
Hey, Peter. Sorry about the long delay.

On Tue, Feb 14, 2017 at 11:35:41AM +0100, Peter Zijlstra wrote:
> > This is a bit of delta but as I wrote before, at least cpu (and
> > accordingly cpuacct) won't stay purely task-based as we should account
> > for resource consumptions which aren't tied to specific tasks to the
> > matching domain (e.g. CPU consumption during writeback, disk
> > encryption or CPU cycles spent to receive packets).
>
> We should probably do that in another thread, but I'd probably insist on
> separate controllers that co-operate to get that done.

Let's shelve this for now.

> > cgroups on creation don't enable controllers by default and users can
> > enable and disable controllers dynamically as long as the conditions
> > are met. So, they can be disabled and re-enabled.
>
> I was talking in a hierarchical sense, your section 2-4-2. Top-Down
> constraint seems to state similar things to what I meant.
>
> Once you disable a controller it cannot be re-enabled in a subtree.

Ah, yeah, you can't jump across parts of the tree.

> > If we go to thread mode and back to domain mode, the control knobs for
> > domain controllers don't make sense on the thread part of the tree and
> > they won't have cgroup_subsys_state to correspond to either. For
> > example,
> >
> > A - T - B
> >
> > B's memcg knobs would control memory distribution from A and cgroups
> > in T can't have memcg knobs. It'd be weird to indicate that memcg is
> > enabled in those cgroups too.
>
> But memcg _is_ enabled for T. All the tasks are mapped onto A for
> purpose of the system controller (memcg) and are subject to its
> constraints.

Sure, T is contained in A but think about the interface. For memcg, T
belongs to A. B is the first descendant when viewed from memcg, which
brings about two problems - memcg doesn't have control knobs to assign
throughout T which is just weird and there's no way to configure how T
competes against B.

> > We can make it work somehow. It's just weird-ass interface.
>
> You could make these control files (read-only?) symlinks back to A's
> actual control files. To more explicitly show this.

But the knobs are supposed to control how much resource a child gets
from its parent. Flipping that over while walking down the same tree
sounds horribly ugly and confusing to me. Besides, that doesn't solve
the problem with lacking the ability to configure T's consumptions
against B.

> > So, as long as the depth stays reasonable (single digit or lower),
> > what we try to do is keeping tree traversal operations aggregated or
> > located on slow paths.
>
> While at the same time you allowed that BPF cgroup thing to not be
> hierarchical because iterating the tree was too expensive; or did I
> misunderstand?

That was more because that was supposed to be part of bpf (network or
whatever) and just followed the model of table matching "is the target
under this hierarchy?". That's where it came from after all.
Hierarchical walking can be added but it's more work (defining the
iteration direction and rules) and doesn't bring benefits without
working delegation.

If it were a cgroup controller, it should have been fully hierarchical
no matter what but that involves solving bpf delegation first.

> Also, I think Mike showed you the pain and hurt are quite visible for
> even a few levels.
>
> Batching is tricky, you need to somehow bound the error function in
> order to not become too big a factor on behaviour. Esp. for cpu, cpuacct
> obviously doesn't care much as it doesn't enforce anything.

Yeah, I thought about this for quite a while but I couldn't think of
any easy way of circumventing overhead without introducing a lot of
scheduling artifacts (e.g. multiplying down the weights to practically
collapse multi levels of the hierarchy), at least for the weight based
control which is what most people use. It looks like the only way to
lower the overhead there is making generic scheduling cheaper but that
still means that multi-level will always be noticeably more expensive in
terms of raw scheduling performance.

Scheduling hackbench is an extreme case tho and in practice at least
we're not seeing noticeable issues with a few levels of nesting when
the workload actually spends cpu cycles doing things other than
scheduling. However, we're seeing significant increase in scheduling
latency coming from how cgroups are handled from the rebalance path.
I'm still looking into it and will write about that in a separate
thread.

> > In general, I think it's important to ensure that this in general is
> > the case so that users can use the logical layouts matching the actual
> > resource hierarchy rather than having to twist the layout for
> > optimization.
>
> One does what one can.. But it is important to understand the
> constraints, nothing comes for free.

Yeah, for sure.

> Also, there is the one giant wart in v2 wrt no-internal-processes;
> namely the root group is allowed to have them.
>
>
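The remark above about "multiplying down the weights to practically collapse multi levels of the hierarchy" can be made concrete with a toy calculation. The sketch below is not how the scheduler computes anything; the tree layout, the weights and the helper names (`hierarchical_shares`, `flat_shares`) are assumptions chosen for illustration. It compares per-level weight division against a flattened set of precomputed weights when one task goes idle.

```python
from fractions import Fraction

# Hypothetical tree: name -> (weight, children); children is None for a task.
TREE = {
    "G1": (100, {"a": (100, None), "b": (100, None)}),
    "G2": (100, {"c": (100, None)}),
}


def has_runnable(name, node, runnable):
    """True if the task, or any task below the group, is runnable."""
    _, children = node
    if children is None:
        return name in runnable
    return any(has_runnable(n, c, runnable) for n, c in children.items())


def hierarchical_shares(children, runnable, share=Fraction(1)):
    """Split `share` by weight among active siblings, level by level."""
    active = {n: node for n, node in children.items()
              if has_runnable(n, node, runnable)}
    total = sum(weight for weight, _ in active.values())
    out = {}
    for name, (weight, sub) in active.items():
        portion = share * Fraction(weight, total)
        if sub is None:
            out[name] = portion
        else:
            out.update(hierarchical_shares(sub, runnable, portion))
    return out


def flat_shares(flat_weights, runnable):
    """Single-level split using weights multiplied down ahead of time."""
    total = sum(w for n, w in flat_weights.items() if n in runnable)
    return {n: w / total for n, w in flat_weights.items() if n in runnable}


# Flattened weights are taken from the state where everything is runnable.
flat = hierarchical_shares(TREE, {"a", "b", "c"})

# Task b goes idle; the two schemes now disagree.
print(hierarchical_shares(TREE, {"a", "c"}))
print(flat_shares(flat, {"a", "c"}))
```

With all three tasks runnable both schemes agree. Once `b` idles, the hierarchical split keeps G1 and G2 at one half each, while the stale flat weights give `c` two thirds, which is one example of the kind of artifact such a shortcut would introduce unless the flat weights are recomputed on every change.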
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
Hello, Mike.

Sorry about the long delay.

On Mon, Feb 13, 2017 at 06:45:07AM +0100, Mike Galbraith wrote:
> > > So, as long as the depth stays reasonable (single digit or lower),
> > > what we try to do is keeping tree traversal operations aggregated or
> > > located on slow paths. There still are places that this overhead
> > > shows up (e.g. the block controllers aren't too optimized) but it
> > > isn't particularly difficult to make a handful of layers not matter at
> > > all.
> >
> > A handful of cpu bean counting layers stings considerably.

Hmm... yeah, I was trying to think about ways to avoid full scheduling
overhead at each layer (the scheduler does a lot per each layer of
scheduling) but don't think it's possible to circumvent that without
introducing a whole lot of scheduling artifacts.

In a lot of workloads, the added overhead from several layers of CPU
controllers doesn't seem to get in the way too much (most threads do
something other than scheduling after all). The only major issue that
we're seeing in the fleet is the cgroup iteration in idle rebalancing
code pushing up the scheduling latency too much but that's a different
issue.

Anyways, I understand that there are cases where people would want to
avoid any extra layers. I'll continue on PeterZ's message.

> BTW, that overhead is also why merging cpu/cpuacct is not really as
> wonderful as it may seem on paper. If you only want to account, you
> may not have anything to gain from group scheduling (in fact it may
> wreck performance), but you'll pay for it.

There's another reason why we would want accounting separate - because
weight based controllers, cpu and io currently, can't be enabled without
affecting the scheduling behavior. However, they're different from CPU
controllers in that all the heavy part of operations can be shifted to
the readers (we just need to do per-cpu updates from hot paths), so we
might as well publish those stats by default on the v2 hierarchy. We
couldn't do the same in v1 because the number of hierarchies was not
limited.

Thanks.

--
tejun
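The split described above, cheap per-cpu updates on the hot path with the heavy aggregation shifted to readers, can be sketched roughly as follows. This is a toy model and not the kernel's accounting code; the `StatCgroup` class and its methods are hypothetical names used only for illustration.

```python
class StatCgroup:
    """Toy cgroup holding per-cpu usage counters (illustrative only)."""

    def __init__(self, name, nr_cpus, parent=None):
        self.name = name
        self.percpu_usage = [0] * nr_cpus
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def charge(self, cpu, delta):
        # Hot path: touch only this cgroup's slot for the local CPU.
        # No tree walk and no shared counter is involved.
        self.percpu_usage[cpu] += delta

    def read_usage(self):
        # Reader path: pay for the aggregation here, summing over CPUs
        # and then over the whole subtree.
        total = sum(self.percpu_usage)
        for child in self.children:
            total += child.read_usage()
        return total


root = StatCgroup("root", nr_cpus=4)
a = StatCgroup("A", nr_cpus=4, parent=root)
b = StatCgroup("B", nr_cpus=4, parent=a)

a.charge(cpu=0, delta=5)
b.charge(cpu=1, delta=7)
b.charge(cpu=3, delta=2)

print(root.read_usage(), a.read_usage(), b.read_usage())
```

The cost of `charge` does not depend on nesting depth, while `read_usage` grows with the number of CPUs and the size of the subtree, which is the trade that makes always-on statistics cheap for writers.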
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Sun, Feb 12, 2017 at 02:05:44PM +0900, Tejun Heo wrote:
> Hello,
>
> On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote:
> > Sure, we're past that. This isn't about what memcg can or cannot do.
> > Previous discussions established that controllers come in two shapes:
> >
> >  - task based controllers; these are built on per task properties and
> >    groups are aggregates over sets of tasks. Since per definition inter
> >    task competition is already defined on individual tasks, it's fairly
> >    trivial to extend the same rules to sets of tasks etc..
> >
> >    Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)
> >
> >  - system controllers; instead of building from tasks upwards, they
> >    split what previously would be machine wide / global state. For these
> >    there is no natural competition rule vs tasks, and hence your
> >    no-internal-task rule.
> >
> >    Examples: memcg, io, hugetlb
>
> This is a bit of delta but as I wrote before, at least cpu (and
> accordingly cpuacct) won't stay purely task-based as we should account
> for resource consumptions which aren't tied to specific tasks to the
> matching domain (e.g. CPU consumption during writeback, disk
> encryption or CPU cycles spent to receive packets).

We should probably do that in another thread, but I'd probably insist on
separate controllers that co-operate to get that done.

> > > And here's another point, currently, all controllers are enabled
> > > consecutively from root. If we have leaf thread subtrees, this still
> > > works fine. Resource domain controllers won't be enabled into thread
> > > subtrees. If we allow switching back and forth, what do we do in the
> > > middle while we're in the thread part?
> >
> > From what I understand you cannot re-enable a controller once it's been
> > disabled, right? If you disable it, it's dead for the entire subtree.
>
> cgroups on creation don't enable controllers by default and users can
> enable and disable controllers dynamically as long as the conditions
> are met. So, they can be disabled and re-enabled.

I was talking in a hierarchical sense, your section 2-4-2. Top-Down
constraint seems to state similar things to what I meant.

Once you disable a controller it cannot be re-enabled in a subtree.

> > > No matter what we do, it's
> > > gonna be more confusing and we lose basic invariants like "parent
> > > always has superset of control knobs that its child has".
> >
> > No, exactly that. I don't think I ever proposed something different.
> >
> > The "resource domain" flag I proposed violates the no-internal-processes
> > thing, but it doesn't violate that rule afaict.
>
> If we go to thread mode and back to domain mode, the control knobs for
> domain controllers don't make sense on the thread part of the tree and
> they won't have cgroup_subsys_state to correspond to either. For
> example,
>
> A - T - B
>
> B's memcg knobs would control memory distribution from A and cgroups
> in T can't have memcg knobs. It'd be weird to indicate that memcg is
> enabled in those cgroups too.

But memcg _is_ enabled for T. All the tasks are mapped onto A for
purpose of the system controller (memcg) and are subject to its
constraints.

> We can make it work somehow. It's just weird-ass interface.

You could make these control files (read-only?) symlinks back to A's
actual control files. To more explicitly show this.

> > > As for the runtime overhead, if you get affected by adding a top-level
> > > cgroup in any measurable way, we need to fix that. That's not a
> > > valid argument for messing up the interface.
> >
> > I think cgroup tree depth is a more significant issue; because of
> > hierarchy we often do tree walks (up-to-root or down-to-task).
> >
> > So creating elaborate trees is something I try not to do.
>
> So, as long as the depth stays reasonable (single digit or lower),
> what we try to do is keeping tree traversal operations aggregated or
> located on slow paths.

While at the same time you allowed that BPF cgroup thing to not be
hierarchical because iterating the tree was too expensive; or did I
misunderstand?

Also, I think Mike showed you the pain and hurt are quite visible for
even a few levels.

Batching is tricky, you need to somehow bound the error function in
order to not become too big a factor on behaviour. Esp. for cpu, cpuacct
obviously doesn't care much as it doesn't enforce anything.

> In general, I think it's important to ensure that this in general is
> the case so that users can use the logical layouts matching the actual
> resource hierarchy rather than having to twist the layout for
> optimization.

One does what one can.. But it is important to understand the
constraints, nothing comes for free.

> > > Even if we allow switching back and forth, we can't make the same
> > > cgroup both resource domain && thread root. Not in a sane way at
> > > least.
> >
> > The back and forth thing yes, but even with a single level, the
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Sun, 2017-02-12 at 07:59 +0100, Mike Galbraith wrote:
> On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:
>
> > > I think cgroup tree depth is a more significant issue; because of
> > > hierarchy we often do tree walks (uo-to-root or down-to-task).
> > >
> > > So creating elaborate trees is something I try not to do.
> >
> > So, as long as the depth stays reasonable (single digit or lower),
> > what we try to do is keeping tree traversal operations aggregated or
> > located on slow paths. There still are places that this overhead
> > shows up (e.g. the block controllers aren't too optimized) but it
> > isn't particularly difficult to make a handful of layers not matter at
> > all.
>
> A handful of cpu bean counting layers stings considerably.

BTW, that overhead is also why merging cpu/cpuacct is not really as
wonderful as it may seem on paper. If you only want to account, you may
not have anything to gain from group scheduling (in fact it may wreck
performance), but you'll pay for it.

> homer:/abuild # pipe-test 1
> 2.010057 usecs/loop -- avg 2.010057 995.0 KHz
> 2.006630 usecs/loop -- avg 2.009714 995.2 KHz
> 2.127118 usecs/loop -- avg 2.021455 989.4 KHz
> 2.256244 usecs/loop -- avg 2.044934 978.0 KHz
> 1.993693 usecs/loop -- avg 2.039810 980.5 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt pipe-test 1
> 2.771641 usecs/loop -- avg 2.771641 721.6 KHz
> 2.432333 usecs/loop -- avg 2.737710 730.5 KHz
> 2.750493 usecs/loop -- avg 2.738988 730.2 KHz
> 2.663203 usecs/loop -- avg 2.731410 732.2 KHz
> 2.762564 usecs/loop -- avg 2.734525 731.4 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
> 2.967201 usecs/loop -- avg 2.967201 674.0 KHz
> 3.049012 usecs/loop -- avg 2.975382 672.2 KHz
> 3.031226 usecs/loop -- avg 2.980966 670.9 KHz
> 2.954259 usecs/loop -- avg 2.978296 671.5 KHz
> 2.933432 usecs/loop -- avg 2.973809 672.5 KHz
> ^C
> ...
> homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
> 4.417044 usecs/loop -- avg 4.417044 452.8 KHz
> 4.494913 usecs/loop -- avg 4.424831 452.0 KHz
> 4.253861 usecs/loop -- avg 4.407734 453.7 KHz
> 4.378059 usecs/loop -- avg 4.404766 454.1 KHz
> 4.179895 usecs/loop -- avg 4.382279 456.4 KHz
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Sun, 2017-02-12 at 13:16 -0800, Paul Turner wrote:
>
>
> On Thursday, February 9, 2017, Peter Zijlstra wrote:
> > On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> > > The only case that this does not support vs ".threads" would be some
> > > hybrid where we co-mingle threads from different processes (with the
> > > processes belonging to the same node in the hierarchy). I'm not aware
> > > of any usage that looks like this.
> >
> > If I understand you right; this is a fairly common thing with RT where
> > we would stuff all the !rt threads of the various processes in a 'misc'
> > bucket.
> >
> > Similarly, it happens that we stuff the various rt threads of processes
> > in a specific (shared) 'rt' bucket.
> >
> > So I would certainly not like to exclude that setup.
> >
>
> Unless you're using rt groups I'm not sure this one really changes.
> Whether the "misc" threads exist at the parent level or one below
> should not matter.

(with exclusive cpusets, a mask can exist at one and only one location)
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:

> > I think cgroup tree depth is a more significant issue; because of
> > hierarchy we often do tree walks (uo-to-root or down-to-task).
> >
> > So creating elaborate trees is something I try not to do.
>
> So, as long as the depth stays reasonable (single digit or lower),
> what we try to do is keeping tree traversal operations aggregated or
> located on slow paths. There still are places that this overhead
> shows up (e.g. the block controllers aren't too optimized) but it
> isn't particularly difficult to make a handful of layers not matter at
> all.

A handful of cpu bean counting layers stings considerably.

homer:/abuild # pipe-test 1
2.010057 usecs/loop -- avg 2.010057 995.0 KHz
2.006630 usecs/loop -- avg 2.009714 995.2 KHz
2.127118 usecs/loop -- avg 2.021455 989.4 KHz
2.256244 usecs/loop -- avg 2.044934 978.0 KHz
1.993693 usecs/loop -- avg 2.039810 980.5 KHz
^C
homer:/abuild # cgexec -g cpu:hurt pipe-test 1
2.771641 usecs/loop -- avg 2.771641 721.6 KHz
2.432333 usecs/loop -- avg 2.737710 730.5 KHz
2.750493 usecs/loop -- avg 2.738988 730.2 KHz
2.663203 usecs/loop -- avg 2.731410 732.2 KHz
2.762564 usecs/loop -- avg 2.734525 731.4 KHz
^C
homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
2.967201 usecs/loop -- avg 2.967201 674.0 KHz
3.049012 usecs/loop -- avg 2.975382 672.2 KHz
3.031226 usecs/loop -- avg 2.980966 670.9 KHz
2.954259 usecs/loop -- avg 2.978296 671.5 KHz
2.933432 usecs/loop -- avg 2.973809 672.5 KHz
^C
...
homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
4.417044 usecs/loop -- avg 4.417044 452.8 KHz
4.494913 usecs/loop -- avg 4.424831 452.0 KHz
4.253861 usecs/loop -- avg 4.407734 453.7 KHz
4.378059 usecs/loop -- avg 4.404766 454.1 KHz
4.179895 usecs/loop -- avg 4.382279 456.4 KHz
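To put the pipe-test numbers above side by side, here is a small sketch that takes the last running average of each run and expresses it as a slowdown against the run without cgexec. The dictionary layout and the slowdown_percent helper are illustrative names, not part of pipe-test; the depths elided by "..." in the output are left out rather than guessed.

```python
# Final running averages (usecs/loop) copied from the pipe-test output above,
# keyed by the number of cpu cgroup levels the test ran under.
avg_usecs_by_depth = {
    0: 2.039810,  # pipe-test 1, run without cgexec
    1: 2.734525,  # cpu:hurt
    2: 2.973809,  # cpu:hurt/pain
    5: 4.382279,  # cpu:hurt/pain/ouch/moan/groan
}


def slowdown_percent(depth, table=avg_usecs_by_depth):
    """Extra cost per loop at `depth`, relative to the depth-0 baseline."""
    return (table[depth] / table[0] - 1.0) * 100.0


for depth in sorted(avg_usecs_by_depth):
    print(f"depth {depth}: {avg_usecs_by_depth[depth]:.3f} usecs/loop, "
          f"+{slowdown_percent(depth):.1f}% vs. baseline")
```

On these figures one level costs roughly a third more per loop, and five levels more than double the per-loop time.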
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
Hello,

On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote:
> Sure, we're past that. This isn't about what memcg can or cannot do.
> Previous discussions established that controllers come in two shapes:
>
>  - task based controllers; these are build on per task properties and
>    groups are aggregates over sets of tasks. Since per definition inter
>    task competition is already defined on individual tasks, its fairly
>    trivial to extend the same rules to sets of tasks etc..
>
>    Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)
>
>  - system controllers; instead of building from tasks upwards, they
>    split what previously would be machine wide / global state. For these
>    there is no natural competition rule vs tasks, and hence your
>    no-internal-task rule.
>
>    Examples: memcg, io, hugetlb

This is a bit of delta but as I wrote before, at least cpu (and
accordingly cpuacct) won't stay purely task-based as we should account
for resource consumptions which aren't tied to specific tasks to the
matching domain (e.g. CPU consumption during writeback, disk
encryption or CPU cycles spent to receive packets).

> > And here's another point, currently, all controllers are enabled
> > consecutively from root. If we have leaf thread subtrees, this still
> > works fine. Resource domain controllers won't be enabled into thread
> > subtrees. If we allow switching back and forth, what do we do in the
> > middle while we're in the thread part?
>
> From what I understand you cannot re-enable a controller once its been
> disabled, right? If you disable it, its dead for the entire subtree.

cgroups on creation don't enable controllers by default and users can
enable and disable controllers dynamically as long as the conditions
are met. So, they can be disabled and re-enabled.

> > No matter what we do, it's
> > gonna be more confusing and we lose basic invariants like "parent
> > always has superset of control knobs that its child has".
>
> No, exactly that. I don't think I ever proposed something different.
>
> The "resource domain" flag I proposed violates the no-internal-processes
> thing, but it doesn't violate that rule afaict.

If we go to thread mode and back to domain mode, the control knobs for
domain controllers don't make sense on the thread part of the tree and
they won't have cgroup_subsys_state to correspond to either. For
example,

 A - T - B

B's memcg knobs would control memory distribution from A and cgroups
in T can't have memcg knobs. It'd be weird to indicate that memcg is
enabled in those cgroups too.

We can make it work somehow. It's just weird-ass interface.

> > As for the runtime overhead, if you get affected by adding a top-level
> > cgroup in any measureable way, we need to fix that. That's not a
> > valid argument for messing up the interface.
>
> I think cgroup tree depth is a more significant issue; because of
> hierarchy we often do tree walks (uo-to-root or down-to-task).
>
> So creating elaborate trees is something I try not to do.

So, as long as the depth stays reasonable (single digit or lower),
what we try to do is keeping tree traversal operations aggregated or
located on slow paths. There still are places that this overhead
shows up (e.g. the block controllers aren't too optimized) but it
isn't particularly difficult to make a handful of layers not matter at
all. memcg batches the charging operations and it's impossible to
measure the overhead of several levels of hierarchy.

In general, I think it's important to ensure that this in general is
the case so that users can use the logical layouts matching the actual
resource hierarchy rather than having to twist the layout for
optimization.

> > Even if we allow switching back and forth, we can't make the same
> > cgroup both resource domain && thread root. Not in a sane way at
> > least.
>
> The back and forth thing yes, but even with a single level, the one
> resource domain you tag will be both resource domain and thread root.

Ah, you're right.

Thanks.

--
tejun
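The point about batching keeping tree walks off the fast path can be illustrated with a toy model. This is only a sketch of the general idea, not the kernel's memcg code: Group, charge_hierarchy, BatchedCharger and the batch size of 32 are all made up for the example. A local stock is refilled in batches, so the walk up to the root happens once per refill instead of once per charge, at the price of the hierarchy's usage overshooting the real usage by less than one batch.

```python
class Group:
    """One cgroup in a simple parent-linked hierarchy."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.usage = 0  # pages charged to this group and its descendants


def charge_hierarchy(group, pages):
    """Charge `pages` to `group` and every ancestor; return levels visited."""
    levels = 0
    node = group
    while node is not None:
        node.usage += pages
        levels += 1
        node = node.parent
    return levels


class BatchedCharger:
    """Charges in batches so most requests are served from a local stock."""

    def __init__(self, group, batch):
        self.group = group
        self.batch = batch
        self.stock = 0          # pre-charged pages not yet handed out
        self.levels_walked = 0  # total hierarchy levels touched so far

    def charge(self, pages):
        if self.stock < pages:
            refill = max(self.batch, pages)
            self.levels_walked += charge_hierarchy(self.group, refill)
            self.stock += refill
        self.stock -= pages


# Build a five-level chain and charge 1000 single pages at the leaf.
node = None
for name in ["hurt", "pain", "ouch", "moan", "groan"]:
    node = Group(name, node)
leaf = node

charger = BatchedCharger(leaf, batch=32)
for _ in range(1000):
    charger.charge(1)

print("levels walked with batching:", charger.levels_walked)
print("levels walked without batching:", 1000 * 5)
print("overcharge still in stock:", leaf.usage - 1000)
```

The leftover stock is the error Peter refers to elsewhere in the thread: it stays below one batch per charger in this model, which is harmless for accounting but matters more for a controller that enforces limits.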
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Fri, Feb 10, 2017 at 10:45:08AM -0500, Tejun Heo wrote:
> > > and making subtrees threaded is a
> > > straight-forward extension of that - threaded controllers just see
> > > further into the hierarchy. Adding threaded sub-sections in the
> > > middle is more complex and frankly confusing.
> >
> > I disagree, as I completely fail to see any confusion. The rules are
> > simple and straight forward.
> >
> > I also don't see why you would want to impose this artificial
> > restriction. It doesn't get you anything. Why are you so keen on designs
> > with these artificial limits on?
>
> Because I actually understand and use this thing day in and day out?

Just because you don't have the use-cases doesn't mean they're invalid.

Also, the above is effectively: "because I say so", which isn't much of
an argument.

> Let's go back to the no-internal-process constraint. The main reason
> behind that is avoiding resource competition between child cgroups and
> processes. The reason why we need this is because for some resources
> the terminal consumer (be that a process or task or anonymous) and the
> resource domain that it belongs to (be that the system itself or a
> cgroup) aren't equivalent.

Sure, we're past that. This isn't about what memcg can or cannot do.
Previous discussions established that controllers come in two shapes:

 - task based controllers; these are built on per task properties and
   groups are aggregates over sets of tasks. Since per definition inter
   task competition is already defined on individual tasks, it's fairly
   trivial to extend the same rules to sets of tasks etc..

   Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)

 - system controllers; instead of building from tasks upwards, they
   split what previously would be machine wide / global state. For these
   there is no natural competition rule vs tasks, and hence your
   no-internal-task rule.

   Examples: memcg, io, hugetlb

(I have no idea where: devices, net_cls, net_prio, debug fall in this
classification, nor is that really relevant)

Now, cgroup-v2 is entirely built around the use-case of
containerization, where you want a single hierarchy describing the
containers and their resources. Now, because of that single hierarchy
and single use-case, you let the system controllers dominate and dictate
the rules.

By doing that you've completely killed off a whole bunch of use-cases
that were possible with pure task controllers. And you seem to have a
very hard time accepting that this is a problem.

Furthermore, the argument that people who need that can continue to use
v1 doesn't work. Because v2 and v1 are mutually exclusive and do not
respect the namespace/container invariant. That is, if a controller is
used in v2, a sub-container is forced to also use v2.

Therefore it is important to fix v2 if possible or do v3 if not, such
that all use-cases can be met in a single setup that respects the
container invariant.

> Now, back to not allowing switching back and forth between resource
> domains and thread subtrees. Let's say we allow that and compose a
> hierarchy as follows. Let's say A and B are resource domains and T's
> are subtrees of threads.
>
>  A - T1 - B - T2
>
> The resource domain controllers would see the following hierarchy.
>
>  A - B
>
> A will contain processes from T1 and B T2. Both A and B would have
> internal consumptions from the processes and the no-internal-process
> constraint and thus resource domain abstraction are broken.

> If we want to support a hierarchy like that, we'll internally have to
> something like
>
>  A - B
>   \
>    A'

Because, and it took me a little while to get here, this:

          A
         / \
        T1  t1
       / \
      t2  B
         / \
        t3  T2
           /  \
          t4  t5

Ends up being this from a resource domain pov. (because the task
controllers are hierarchical their effective contribution collapses onto
the resource domain):

          A
         / \
        B   t1, t2
        |
      t3,t4,t5

> Now, this is exactly the same problem as having internal processes

Indeed, bugger.

> And here's another point, currently, all controllers are enabled
> consecutively from root. If we have leaf thread subtrees, this still
> works fine. Resource domain controllers won't be enabled into thread
> subtrees. If we allow switching back and forth, what do we do in the
> middle while we're in the thread part?

From what I understand you cannot re-enable a controller once it's been
disabled, right? If you disable it, it's dead for the entire subtree.

I think it would work naturally if you only allow disabling system
controllers at the resource domain levels (thread controllers can be
disabled at any point).

That means that thread nodes will have the exact same system controllers
enabled as their resource domain, which makes perfect sense,
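The collapse Peter describes, where tasks below thread groups fold onto the nearest enclosing resource domain, can be sketched as a small tree walk. This is an illustration only: the Node class, the "domain"/"thread"/"task" labels and domain_view are hypothetical names for the example, and the tree is the A / T1 / B / T2 layout from the message above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str  # "domain", "thread" (thread group), or "task"
    children: list = field(default_factory=list)


def domain_view(node):
    """Return (tasks, subdomains) as a resource-domain controller sees `node`.

    Tasks found below thread groups are attributed to the nearest enclosing
    domain; nested domains stay children of that domain.
    """
    tasks, subdomains = [], {}
    pending = list(node.children)
    while pending:
        child = pending.pop(0)
        if child.kind == "task":
            tasks.append(child.name)
        elif child.kind == "thread":
            # Look straight through the thread group.
            pending = child.children + pending
        else:
            subdomains[child.name] = domain_view(child)
    return sorted(tasks), subdomains


# The hierarchy from the message: A and B are domains, T1 and T2 thread groups.
T2 = Node("T2", "thread", [Node("t4", "task"), Node("t5", "task")])
B = Node("B", "domain", [Node("t3", "task"), T2])
T1 = Node("T1", "thread", [Node("t2", "task"), B])
A = Node("A", "domain", [T1, Node("t1", "task")])

print(domain_view(A))
```

In this model A ends up with both a child domain (B) and tasks of its own (t1, t2), which is the internal-process situation the no-internal-process rule was meant to exclude.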
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Fri, Feb 10, 2017 at 10:45:08AM -0500, Tejun Heo wrote:

> > > and making subtrees threaded is a
> > > straight-forward extension of that - threaded controllers just see
> > > further into the hierarchy. Adding threaded sub-sections in the
> > > middle is more complex and frankly confusing.
> >
> > I disagree, as I completely fail to see any confusion. The rules are
> > simple and straight forward.
> >
> > I also don't see why you would want to impose this artificial
> > restriction. It doesn't get you anything. Why are you so keen on designs
> > with these artificial limits on?
>
> Because I actually understand and use this thing day in and day out?

Just because you don't have the use-cases doesn't mean they're invalid.

Also, the above is effectively: "because I say so", which isn't much of
an argument.

> Let's go back to the no-internal-process constraint. The main reason
> behind that is avoiding resource competition between child cgroups and
> processes. The reason why we need this is because for some resources
> the terminal consumer (be that a process or task or anonymous) and the
> resource domain that it belongs to (be that the system itself or a
> cgroup) aren't equivalent.

Sure, we're past that. This isn't about what memcg can or cannot do.
Previous discussions established that controllers come in two shapes:

 - task based controllers; these are built on per task properties and
   groups are aggregates over sets of tasks. Since per definition inter
   task competition is already defined on individual tasks, it's fairly
   trivial to extend the same rules to sets of tasks etc..

   Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)

 - system controllers; instead of building from tasks upwards, they
   split what previously would be machine wide / global state. For these
   there is no natural competition rule vs tasks, and hence your
   no-internal-task rule.

   Examples: memcg, io, hugetlb

(I have no idea where: devices, net_cls, net_prio, debug fall in this
classification, nor is that really relevant)

Now, cgroup-v2 is entirely built around the use-case of
containerization, where you want a single hierarchy describing the
containers and their resources. Now, because of that single hierarchy
and single use-case, you let the system controllers dominate and dictate
the rules.

By doing that you've completely killed off a whole bunch of use-cases
that were possible with pure task controllers. And you seem to have a
very hard time accepting that this is a problem.

Furthermore, the argument that people who need that can continue to use
v1 doesn't work. Because v2 and v1 are mutually exclusive and do not
respect the namespace/container invariant. That is, if a controller is
used in v2, a sub-container is forced to also use v2.

Therefore it is important to fix v2 if possible or do v3 if not, such
that all use-cases can be met in a single setup that respects the
container invariant.

> Now, back to not allowing switching back and forth between resource
> domains and thread subtrees. Let's say we allow that and compose a
> hierarchy as follows. Let's say A and B are resource domains and T's
> are subtrees of threads.
>
>   A - T1 - B - T2
>
> The resource domain controllers would see the following hierarchy.
>
>   A - B
>
> A will contain processes from T1 and B T2. Both A and B would have
> internal consumptions from the processes and the no-internal-process
> constraint and thus resource domain abstraction are broken.
> If we want to support a hierarchy like that, we'll internally have to
> do something like
>
>   A - B
>    \
>     A'

Because, and it took me a little while to get here, this:

          A
         / \
        T1  t1
       / \
      t2  B
         / \
        t3  T2
            /\
          t4  t5

Ends up being this from a resource domain pov. (because the task
controllers are hierarchical their effective contribution collapses onto
the resource domain):

          A
         / \
        B   t1, t2
        |
        t3,t4,t5

> Now, this is exactly the same problem as having internal processes

Indeed, bugger.

> And here's another point, currently, all controllers are enabled
> consecutively from root. If we have leaf thread subtrees, this still
> works fine. Resource domain controllers won't be enabled into thread
> subtrees. If we allow switching back and forth, what do we do in the
> middle while we're in the thread part?

From what I understand you cannot re-enable a controller once it's been
disabled, right? If you disable it, it's dead for the entire subtree.

I think it would work naturally if you only allow disabling system
controllers at the resource domain levels (thread controllers can be
disabled at any point).

That means that thread nodes will have the exact same system controllers
enabled as their resource domain, which makes perfect sense,
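
The collapse described in this message can be written out as a small model. This is only a sketch in Python, not kernel code: the `Cgroup` class, the `is_domain` flag and both helper functions are hypothetical names invented for illustration, and the tree is the A / T1 / B / T2 example with tasks t1 to t5. It assumes the root is always a resource domain.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Cgroup:
    """Toy cgroup node; `is_domain` is a hypothetical flag, not a kernel field."""
    name: str
    is_domain: bool
    parent: Optional["Cgroup"] = None
    tasks: List[str] = field(default_factory=list)


def resource_domain(cg: Cgroup) -> Cgroup:
    """Walk up to the nearest cgroup (self included) that is a resource domain."""
    while not cg.is_domain:
        cg = cg.parent  # assumes the root is a domain, so this terminates
    return cg


def domain_view(cgroups: List[Cgroup]) -> Dict[str, dict]:
    """Collapse the full tree into what a system controller would see."""
    view: Dict[str, dict] = {}
    for cg in cgroups:
        dom = resource_domain(cg)
        entry = view.setdefault(dom.name, {"parent": None, "tasks": []})
        entry["tasks"].extend(cg.tasks)
        if cg is dom and cg.parent is not None:
            # A domain's parent in the view is the domain above its real parent.
            entry["parent"] = resource_domain(cg.parent).name
    return view


# The tree from the message: A and B are domains, T1 and T2 are thread groups.
A = Cgroup("A", True, tasks=["t1"])
T1 = Cgroup("T1", False, parent=A, tasks=["t2"])
B = Cgroup("B", True, parent=T1, tasks=["t3"])
T2 = Cgroup("T2", False, parent=B, tasks=["t4", "t5"])

view = domain_view([A, T1, B, T2])
print(view)
```

In this model A ends up holding t1 and t2 directly while also having B as a child domain, which is the internal-process situation conceded above.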
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
Hello,

On Thu, Feb 09, 2017 at 11:29:09AM +0100, Peter Zijlstra wrote:
> Uhm, no. They would see the exact same hierarchy, seeing how there is
> only one tree. They would have different view of it maybe, but I don't
> see how that matters, nor do you explain.

Sure, the base hierarchy is the same but different controllers would
need to see different subsets (or views) of the hierarchy. As I wrote
before, cgroup v2 already does this to a certain extent by controllers
ignoring the hierarchy beyond certain points. You're proposing to add
a new "view" of the hierarchy. I'll explain why it matters below.

> > which brings in something completely new to the basic hierarchy.
>
> I'm failing to see what.
>
> > Different controllers seeing differing levels of the same hierarchy is
> > part of the basic behaviors
>
> I have no idea what you mean there.

It's explained in Documentation/cgroup-v2.txt but for example, if the
whole hierarchy is,

  A - B - C
    \
      D

One controller might only see

  A - B
    \
      D

while another sees the whole thing.

> > and making subtrees threaded is a
> > straight-forward extension of that - threaded controllers just see
> > further into the hierarchy. Adding threaded sub-sections in the
> > middle is more complex and frankly confusing.
>
> I disagree, as I completely fail to see any confusion. The rules are
> simple and straight forward.
>
> I also don't see why you would want to impose this artificial
> restriction. It doesn't get you anything. Why are you so keen on designs
> with these artificial limits on?

Because I actually understand and use this thing day in and day out?

Let's go back to the no-internal-process constraint. The main reason
behind that is avoiding resource competition between child cgroups and
processes. The reason why we need this is because for some resources
the terminal consumer (be that a process or task or anonymous) and the
resource domain that it belongs to (be that the system itself or a
cgroup) aren't equivalent.

If you make a memcg, put some processes in it and then create some
child cgroups, how resource should be distributed between those
processes and child cgroups is not clearly defined and can't be
controlled from userspace. The resource control knobs in a child
cgroup govern how the resource is distributed from the parent. For
child processes, we don't have those knobs.

There are multiple ways to deal with the problem. We can add a
separate set of control knobs to govern resource consumption from
internal processes. This effectively adds an implicit leaf node to
each cgroup so that internal processes or tasks always are in its own
leaf resource domain. This however adds a lot of cruft to the
interface, the implementation gets nasty and the presented resource
hierarchy can be misleading to users.

Another option would be just letting each controller do whatever,
which is pretty much what we did in v1. This got really bad because
the behaviors were widely inconsistent across controllers and often
implementation dependent without any way for the user to configure or
monitor what's going on. Who gets how much becomes a matter of
accidents and people optimize for whatever arbitrary behaviors that
the kernel they're using is showing.

The no-internal-process rule establishes that resource domains are
always terminal in the resource graph for a given controller, such
that every competition along the resource hierarchy always is clearly
defined and configurable. Only the terminal resource domains actually
host resource consumptions and they can behave analogous to a system
which doesn't have any cgroups at all. Establishing resource domains
this way isn't the only approach to solve the problem; however, it is
a valid, simple and effective one.

Now, back to not allowing switching back and forth between resource
domains and thread subtrees. Let's say we allow that and compose a
hierarchy as follows. Let's say A and B are resource domains and T's
are subtrees of threads.

  A - T1 - B - T2

The resource domain controllers would see the following hierarchy.

  A - B

A will contain processes from T1 and B T2. Both A and B would have
internal consumptions from the processes and the no-internal-process
constraint and thus resource domain abstraction are broken.
If we want to support a hierarchy like that, we'll internally have to
do something like

  A - B
   \
    A'

Where cgroup A' contains processes from T1 and B T2. Now, this is
exactly the same problem as having internal processes and can be
solved in the same ways. The only realistic way to handle this in a
generic and consistent manner is creating a leaf cgroup to contain the
processes. We sure can try to hide this from userspace and convolute
the interface but it can be solved *far* more elegantly by simply
requiring thread subtrees to be leaf subtrees.

And here's another point, currently, all controllers are
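
The implicit-leaf workaround described in this message (pushing a domain's own processes into a hidden leaf such as A') can be modelled in a few lines. This is a hedged sketch, not how the kernel does or would do it: the `DomainTree` layout, the `add_implicit_leaves` helper and the process names `p1` and `p2` are all made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Each domain maps to (parent domain name or None, processes it hosts directly).
DomainTree = Dict[str, Tuple[Optional[str], List[str]]]


def add_implicit_leaves(tree: DomainTree) -> DomainTree:
    """Move processes out of any domain that also has child domains."""
    has_children = {parent for parent, _ in tree.values() if parent is not None}
    result: DomainTree = {}
    for name, (parent, procs) in tree.items():
        if name in has_children and procs:
            # Internal processes: keep the domain, park them in a leaf "name'".
            result[name] = (parent, [])
            result[name + "'"] = (name, list(procs))
        else:
            result[name] = (parent, list(procs))
    return result


# Domain view of A - T1 - B - T2: A hosts T1's processes, B hosts T2's.
collapsed: DomainTree = {
    "A": (None, ["p1"]),
    "B": ("A", ["p2"]),
}
fixed = add_implicit_leaves(collapsed)
print(fixed)
```

After the transformation every domain that hosts processes is terminal again, at the cost of a cgroup (A') that userspace never created.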
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> What are the motivations that you see for forcing this all onto one
> mount-point via .threads sub-tree tags?

So, you wanted rgroup but with /proc interface? I'm afraid it's too
late for that.

Thanks.

--
tejun
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Thu, 2017-02-09 at 15:47 +0100, Peter Zijlstra wrote:
> On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> > The only case that this does not support vs ".threads" would be some
> > hybrid where we co-mingle threads from different processes (with the
> > processes belonging to the same node in the hierarchy). I'm not aware
> > of any usage that looks like this.
>
> If I understand you right; this is a fairly common thing with RT where
> we would stuff all the !rt threads of the various processes in a 'misc'
> bucket.
>
> Similarly, it happens that we stuff the various rt threads of processes
> in a specific (shared) 'rt' bucket.
>
> So I would certainly not like to exclude that setup.

Absolutely, you just described my daily bread performance setup.

	-Mike
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> The only case that this does not support vs ".threads" would be some
> hybrid where we co-mingle threads from different processes (with the
> processes belonging to the same node in the hierarchy). I'm not aware
> of any usage that looks like this.

If I understand you right; this is a fairly common thing with RT where
we would stuff all the !rt threads of the various processes in a 'misc'
bucket.

Similarly, it happens that we stuff the various rt threads of processes
in a specific (shared) 'rt' bucket.

So I would certainly not like to exclude that setup.
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Thu, Feb 2, 2017 at 12:06 PM, Tejun Heo wrote:
> Hello,
>
> This patchset implements cgroup v2 thread mode. It is largely based
> on the discussions that we had at the plumbers last year. Here's the
> rough outline.
>
> * Thread mode is explicitly enabled on a cgroup by writing "enable"
>   into "cgroup.threads" file. The cgroup shouldn't have any child
>   cgroups or enabled controllers.
>
> * Once enabled, arbitrary sub-hierarchy can be created and threads can
>   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>   file. Process granularity and no-internal-process constraint don't
>   apply in a threaded subtree.
>
> * To be used in a threaded subtree, controllers should explicitly
>   declare thread mode support and should be able to handle internal
>   competition in some way.
>
> * The root of a threaded subtree serves as the resource domain for the
>   whole subtree. This is where all the controllers are guaranteed to
>   have a common ground and resource consumptions in the threaded
>   subtree which aren't tied to a specific thread are charged.
>   Non-threaded controllers never see beyond thread root and can assume
>   that all controllers will follow the same rules up to that point.
>
> This allows threaded controllers to implement thread granular resource
> control without getting in the way of system level resource
> partitioning.
>

I think that this is definitely a step in the right direction versus
previous proposals.
However, as proposed it feels like the API is conflating the
process/thread distinction with the core process hierarchy. While this
does allow previous use-cases to be re-enabled, it seems to do so at an
unnecessary API cost.

As proposed, the cgroup.threads file means that threads are always
anchored in the tree by their process parent. They may never move past
it. I.e. if I have cgroups

  root/A/B

with B allowing sub-thread moves and the parent belonging to A or B,
it is clear that the child cannot be moved beyond the parent.

Now this, in itself, is a natural restriction. However, with this in
hand, it means that we are effectively co-mingling two hierarchies
onto the same tree: one that applies to processes, and per-process
sub-trees.

This introduces the following costs/restrictions:

1) We lose the ability to reasonably move a process. This puts us
back to the existing challenge of the V1 API in which a thread is the
unit we can move atomically. Hierarchies must be externally managed
and synchronized.

2) This retains all of the problems of the existing V1 API for a
process which wants to use these sub-groups to coordinate its threads.
It must coordinate its operations on these groups with the global
hierarchy (which is not consistently mounted) as well as potential
migration -- (1) above.

With the split as proposed, I fundamentally do not see the advantage
of exposing these as the same hierarchy. By definition these .thread
files are essentially introducing independent, process level,
sub-hierarchies.

It seems greatly preferable to expose the sub-process level
hierarchies via separate path, e.g.:

  /proc/{pid, self}/thread_cgroups/

Any controllers enabled on the hierarchy that the process belonged to,
which support thread level operations would appear within. This fully
addresses (1) and (2) while allowing us to keep the unified
process-granular v2-cgroup mounts as is.

The only case that this does not support vs ".threads" would be some
hybrid where we co-mingle threads from different processes (with the
processes belonging to the same node in the hierarchy). I'm not aware
of any usage that looks like this.

What are the motivations that you see for forcing this all onto one
mount-point via .threads sub-tree tags?

> This patchset contains the following five patches. For more details
> on the interface and behavior, please refer to the last patch.
>
>  0001-cgroup-reorganize-cgroup.procs-task-write-path.patch
>  0002-cgroup-add-flags-to-css_task_iter_start-and-implemen.patch
>  0003-cgroup-introduce-cgroup-proc_cgrp-and-threaded-css_s.patch
>  0004-cgroup-implement-CSS_TASK_ITER_THREADED.patch
>  0005-cgroup-implement-cgroup-v2-thread-support.patch
>
> This patchset is on top of cgroup/for-4.11 63f1ca59453a ("Merge branch
> 'cgroup/for-4.11-rdmacg' into cgroup/for-4.11") and available in the
> following git branch.
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
>  review-cgroup2-threads
>
> diffstat follows. Thanks.
>
>  Documentation/cgroup-v2.txt     |  75 -
>  include/linux/cgroup-defs.h     |  38 ++
>  include/linux/cgroup.h          |  12
>  kernel/cgroup/cgroup-internal.h |   8
>  kernel/cgroup/cgroup-v1.c       |  64 +++-
>  kernel/cgroup/cgroup.c          | 589
>  kernel/cgroup/cpuset.c          |   6
>  kernel/cgroup/freezer.c         |   6
>  kernel/cgroup/pids.c            |   1
>  kernel/events/core.c            |   1
>
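
For readers who want the quoted outline in concrete form, the two writes it describes ("enable" into "cgroup.threads", then TIDs into "cgroup.threads") might look roughly like the sketch below. It only mirrors the interface as proposed in this patchset and makes no claim about what any released kernel accepts. The helper names and the `workload` and `rt` directory names are hypothetical, and the demo writes into a temporary directory of plain files rather than a real cgroup2 mount.

```python
import os
import tempfile
from pathlib import Path


def enable_thread_mode(cgroup_dir: Path) -> None:
    """Write "enable" into cgroup.threads, as the quoted outline describes."""
    (cgroup_dir / "cgroup.threads").write_text("enable\n")


def move_thread(cgroup_dir: Path, tid: int) -> None:
    """Write a TID into cgroup.threads to place that thread in the cgroup."""
    (cgroup_dir / "cgroup.threads").write_text(f"{tid}\n")


# Stand-in directory so the sketch runs without a cgroup2 mount; on a real
# mount these would be kernel-provided files, not regular ones.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "workload"   # hypothetical thread root
    worker = root / "rt"            # hypothetical sub-cgroup inside the subtree
    worker.mkdir(parents=True)

    enable_thread_mode(root)
    enabled = (root / "cgroup.threads").read_text()

    # On Linux the main thread's TID equals the PID, so getpid() serves here.
    move_thread(worker, os.getpid())
    moved = (worker / "cgroup.threads").read_text()
```

The point of the sketch is the shape of the interface under discussion: one file serves both as the mode switch and as the thread membership list.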
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Wed, Feb 08, 2017 at 06:08:19PM -0500, Tejun Heo wrote:
> (cc'ing Linus and Andrew for visibility)
>
> Hello, Peter.
>
> On Mon, Feb 06, 2017 at 01:49:43PM +0100, Peter Zijlstra wrote:
> > But to me the resource domain is your primary new construct; so it makes
> > more sense to explicitly mark that.
>
> Whether it's new or not isn't the point. Resource domains weren't
> added arbitrarily. We were missing critical resource accounting and
> control capabilities because cgroup v1's abstraction wasn't strong
> enough to cover some of the major resource consumers and how different
> resources can interact with each other.
>
> Resource domains were added to address that. Given that cgroup's
> primary goal is providing resource accounting and control, it doesn't
> make sense to make this optional.

I'm not sure what you're saying here. Are you agreeing that 'resource
domains' are the primary new construct or not?

> > My question was, if you have root.threads=1, can you then still create
> > (sub) resource domains? Can I create a child cgroup and clear "threads"
> > again?
> >
> > (I'm assuming "threads" is inherited when creating new groups).
> >
> > Now, _if_ we can do the above, then "threads" is not sufficient to
> > uniquely identify resource domains, which I think was your point in the
> > other email. Which argues against the interface. Because a group can be
> > a resource domain _and_ have threads sub-trees.
> >
> > OTOH, if you can _not_ do this, then this proposal is
> > insufficient/inadequate.
>
> No, you can't flip it back and I'm not convinced this matters. More
> on this below.

Then I shall preliminarily NAK your proposal right here, but I shall
continue to read on.

> > So, just to recap, my proposal is as follows:
> >
> >  1) each cgroup will have a new flag, indicating if this is a resource
> >     domain.
> >
> >     a) this flag will be inherited; when creating a new cgroup, the
> >        state of the flag will be set to the value of the parent cgroup.
> >
> >     b) the root cgroup is a resource domain per definition, will
> >        have it set (cannot be cleared).
> >
> >  2) all tasks of a process shall be part of the same resource domain
> >
> >  3) controllers come in two types:
> >
> >     a) task based controllers; these use the direct cgroup the task
> >        is assigned to.
> >
> >     b) resource controllers; these use the effective resource group
> >        of the task, which is the first parent group with our new
> >        flag set.
> >
> >
> > With an optional:
> >
> >  1c) this flag can only be changed on empty groups
> >
> > to ease implementation.
> >
> > From these rules it follows that:
> >
> >  - 1a and 1b together ensure no change in behaviour per default
> >    for cgroup-v2.
> >
> >  - 2 and 3a together ensure resource groups automagically work for task
> >    based controllers (under the assumption that the controller is
> >    strictly hierarchical).
> >
> > For example, cpuacct does the accounting strictly hierarchical, it adds
> > the cpu usage to each parent group. Therefore the total cpu usage
> > accounted to the resource group is the same as if all tasks were part of
> > that group.
>
> So, what you're proposing isn't that different from what the posted
> patches implement except that what's implemented doesn't allow
> flipping a part of a threaded subtree back to domain mode.
>
> Your proposal is more complicated while seemingly not adding much to
> what can be achieved. The original proposal is very simple. It allows
> declaring a subtree to be threaded (or task based) and that's it. A
> threaded subtree can't have resource domains under it.
>
> The only addition that your proposal has is the ability to mark
> portions of such subtree to be domains again. This would require
> making domain controllers and thread controllers to see different
> hierarchies,

Uhm, no. They would see the exact same hierarchy, seeing how there is
only one tree. They would have a different view of it maybe, but I don't
see how that matters, nor do you explain.

> which brings in something completely new to the basic hierarchy.

I'm failing to see what.

> Different controllers seeing differing levels of the same hierarchy is
> part of the basic behaviors

I have no idea what you mean there.

> and making subtrees threaded is a
> straight-forward extension of that - threaded controllers just see
> further into the hierarchy. Adding threaded sub-sections in the
> middle is more complex and frankly confusing.

I disagree, as I completely fail to see any confusion. The rules are
simple and straight forward.

I also don't see why you would want to impose this artificial
restriction. It doesn't get you anything. Why are you so keen on designs
with these artificial limits on?

> Let's say we can make that work but what are the use cases which would
> require such setup where we have to alternate between thread and
> domain modes through out the resource
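
The recapped rules (1a, 1b, 3b) and the cpuacct argument can be checked with a toy model. This is a sketch under stated assumptions, not an implementation of either proposal: the `Group` class, its `resource_domain` attribute and the `app`, `rt` and `misc` groups are hypothetical, and CPU usage is just an integer counter.

```python
from typing import Optional


class Group:
    """Toy cgroup; `resource_domain` models the proposed per-cgroup flag."""

    def __init__(self, name: str, parent: Optional["Group"] = None) -> None:
        self.name = name
        self.parent = parent
        # Rule 1b: the root is a domain; rule 1a: children inherit the flag.
        self.resource_domain = True if parent is None else parent.resource_domain
        self.cpu_usage = 0

    def effective_domain(self) -> "Group":
        """Rule 3b: first group at or above this one with the flag set."""
        g = self
        while not g.resource_domain:
            g = g.parent  # terminates because the root always has the flag
        return g

    def charge_cpu(self, amount: int) -> None:
        """Strictly hierarchical accounting: charge this group and each ancestor."""
        g: Optional[Group] = self
        while g is not None:
            g.cpu_usage += amount
            g = g.parent


root = Group("root")
app = Group("app", root)          # inherits the flag: a resource domain
rt = Group("rt", app)
rt.resource_domain = False        # cleared while still empty (optional rule 1c)
misc = Group("misc", app)
misc.resource_domain = False

rt.charge_cpu(30)    # usage from a thread placed in app/rt
misc.charge_cpu(12)  # usage from a thread placed in app/misc
```

Because charging walks every ancestor, `app` sees the same total it would have seen had both threads lived directly in `app`, which is the property the cpuacct example relies on.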
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
(cc'ing Linus and Andrew for visibility)

Hello, Peter.

On Mon, Feb 06, 2017 at 01:49:43PM +0100, Peter Zijlstra wrote:
> But to me the resource domain is your primary new construct; so it makes
> more sense to explicitly mark that.

Whether it's new or not isn't the point. Resource domains weren't
added arbitrarily. We were missing critical resource accounting and
control capabilities because cgroup v1's abstraction wasn't strong
enough to cover some of the major resource consumers and how different
resources can interact with each other.

Resource domains were added to address that. Given that cgroup's
primary goal is providing resource accounting and control, it doesn't
make sense to make this optional.

> (FWIW note that your whole initial cgroup-v2 thing is counter intuitive
> to me, as someone who has only ever dealt with thread capable
> controllers.)

That is completely fine but you can't direct the overall design while
claiming to be not using, disinterested and unfamiliar with the
subject at hand.

I do understand that you have certain use cases that you think should
be covered. Let's focus on that.

> My question was, if you have root.threads=1, can you then still create
> (sub) resource domains? Can I create a child cgroup and clear "threads"
> again?
>
> (I'm assuming "threads" is inherited when creating new groups).
>
> Now, _if_ we can do the above, then "threads" is not sufficient to
> uniquely identify resource domains, which I think was your point in the
> other email. Which argues against the interface. Because a group can be
> a resource domain _and_ have threads sub-trees.
>
> OTOH, if you can _not_ do this, then this proposal is
> insufficient/inadequate.

No, you can't flip it back and I'm not convinced this matters. More
on this below.

> > In practice, how would this work? To enable memcg, the user has to
> > first create the subtree and then explicitly have to make that a
> > domain and then enable memcg? If so, that would be a completely
> > unnecessary deviation from the current behavior while not achieving
> > any more functionalities, right? It's just flipping the default value
> > the other way around and the default wouldn't be supported by many of
> > the controllers. I can't see how that is a better option.
>
> OK, so I'm entirely unaware of this enabling of controllers. What's that
> about? I thought the whole point of cgroup-v2 was to have all
> controllers enabled over the entire tree, this is not so?

This is one of the most basic aspects of cgroup v2. In short, while
the controllers share the hierarchy, each doesn't have to be enabled
all the way down to the leaf. Different controllers can see up to
different subsets of the hierarchy spreading out from the root.

> In any case, yes, more or less like that, except of course, not at all
> :-) If we make this flag inherited (which I think it should be), you
> don't need to do anything different from today, because the root group
> must be a resource domain, any new sub-group will automagically also be.
>
> Only once you clear the flag somewhere do you get 'new' behaviour. Note
> that the only extra constraint is that all threads of a process must
> stay within the same resource domain, anything else goes.
>
> Task based controllers operate on the actual cgroup, resource domain
> controllers always map it back to the resource group. Finding a random
> task's resource domain is trivial; simply walk up the hierarchy until
> you find a group with the flag set.
>
> So, just to recap, my proposal is as follows:
>
>  1) each cgroup will have a new flag, indicating if this is a resource
>     domain.
>
>    a) this flag will be inherited; when creating a new cgroup, the
>       state of the flag will be set to the value of the parent cgroup.
>
>    b) the root cgroup is a resource domain per definition, will
>       have it set (cannot be cleared).
>
>  2) all tasks of a process shall be part of the same resource domain
>
>  3) controllers come in two types:
>
>    a) task based controllers; these use the direct cgroup the task
>       is assigned to.
>
>    b) resource controllers; these use the effective resource group
>       of the task, which is the first parent group with our new
>       flag set.
>
>
> With an optional:
>
>   1c) this flag can only be changed on empty groups
>
> to ease implementation.
>
> From these rules it follows that:
>
>  - 1a and 1b together ensure no change in behaviour per default
>    for cgroup-v2.
>
>  - 2 and 3a together ensure resource groups automagically work for task
>    based controllers (under the assumption that the controller is
>    strictly hierarchical).
>
> For example, cpuacct does the accounting strictly hierarchical, it adds
> the cpu usage to each parent group. Therefore the total cpu usage
> accounted to the resource group is the same as if all tasks were part of
> that group.

So, what you're proposing isn't that different from what the
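Tejun's remark that each controller "doesn't have to be enabled all the way down to the leaf" can be illustrated with a toy Python model. This is only a sketch, not kernel code: the class and method names are invented for illustration, and the two attributes merely mimic the cgroup v2 interface files cgroup.controllers and cgroup.subtree_control.

```python
class Cgroup:
    """Toy model of per-controller enablement; not kernel code."""

    def __init__(self, name, parent=None, available=()):
        self.name = name
        self.parent = parent
        self._available = set(available)  # only meaningful for the root
        self.subtree_control = set()      # mimics cgroup.subtree_control

    def controllers(self):
        """Mimics cgroup.controllers: what this cgroup is allowed to use."""
        if self.parent is None:
            return set(self._available)
        return set(self.parent.subtree_control)

    def enable(self, controller):
        """Pass a controller on to this cgroup's children."""
        if controller not in self.controllers():
            raise ValueError(f"{controller} is not available in {self.name}")
        self.subtree_control.add(controller)

    def mkdir(self, name):
        return Cgroup(name, parent=self)


root = Cgroup("root", available={"cpu", "memory"})
root.enable("cpu")
root.enable("memory")
a = root.mkdir("a")   # sees cpu and memory
a.enable("cpu")       # only cpu continues below a
b = a.mkdir("b")      # sees cpu; memory stops at a

print(sorted(a.controllers()))  # ['cpu', 'memory']
print(sorted(b.controllers()))  # ['cpu']
```

In this model the memory controller's view of the tree ends at "a" while the cpu controller's view extends to "b", which is the "different subsets of the hierarchy spreading out from the root" behaviour described above.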
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Fri, Feb 03, 2017 at 03:59:55PM -0500, Tejun Heo wrote:
> Hello, Peter.
>
> On Fri, Feb 03, 2017 at 09:20:48PM +0100, Peter Zijlstra wrote:
> > So my proposal was to do the inverse of what you propose here. Instead
> > of marking special 'thread' subtrees, explicitly mark resource domains
> > in the tree.
> >
> > So always allow arbitrary hierarchies and allow threads to be assigned
> > to cgroups, as long as they're all in the same resource domain.
> >
> > Controllers that do not support things, fallback to mapping everything
> > to the nearest resource domain.
>
> That sounds counter-intuitive as all controllers can do resource
> domains and only a subset of them can do thread mode.

But to me the resource domain is your primary new construct; so it makes
more sense to explicitly mark that.

(FWIW note that your whole initial cgroup-v2 thing is counter intuitive
to me, as someone who has only ever dealt with thread capable
controllers.)

> Also, thread subtrees are necessarily a sub-hierarchy of a resource domain.

Sure, don't see how that is relevant though. Or rather, I don't see it
being an argument one way or the other.

> Also,
> expanding resource domains from the root after the trees are populated
> would make the behaviors surprising as the resource domains that these
> subtrees belong to would change dynamically.

Uh what? I cannot parse that.

My question was, if you have root.threads=1, can you then still create
(sub) resource domains? Can I create a child cgroup and clear "threads"
again?

(I'm assuming "threads" is inherited when creating new groups).

Now, _if_ we can do the above, then "threads" is not sufficient to
uniquely identify resource domains, which I think was your point in the
other email. Which argues against the interface. Because a group can be
a resource domain _and_ have threads sub-trees.

OTOH, if you can _not_ do this, then this proposal is
insufficient/inadequate.

> In practice, how would this work? To enable memcg, the user has to
> first create the subtree and then explicitly have to make that a
> domain and then enable memcg? If so, that would be a completely
> unnecessary deviation from the current behavior while not achieving
> any more functionalities, right? It's just flipping the default value
> the other way around and the default wouldn't be supported by many of
> the controllers. I can't see how that is a better option.

OK, so I'm entirely unaware of this enabling of controllers. What's that
about? I thought the whole point of cgroup-v2 was to have all
controllers enabled over the entire tree, this is not so?

In any case, yes, more or less like that, except of course, not at all
:-) If we make this flag inherited (which I think it should be), you
don't need to do anything different from today, because the root group
must be a resource domain, any new sub-group will automagically also be.

Only once you clear the flag somewhere do you get 'new' behaviour. Note
that the only extra constraint is that all threads of a process must
stay within the same resource domain, anything else goes.

Task based controllers operate on the actual cgroup, resource domain
controllers always map it back to the resource group. Finding a random
task's resource domain is trivial; simply walk up the hierarchy until
you find a group with the flag set.

So, just to recap, my proposal is as follows:

 1) each cgroup will have a new flag, indicating if this is a resource
    domain.

   a) this flag will be inherited; when creating a new cgroup, the
      state of the flag will be set to the value of the parent cgroup.

   b) the root cgroup is a resource domain per definition, will
      have it set (cannot be cleared).

 2) all tasks of a process shall be part of the same resource domain

 3) controllers come in two types:

   a) task based controllers; these use the direct cgroup the task
      is assigned to.

   b) resource controllers; these use the effective resource group
      of the task, which is the first parent group with our new
      flag set.


With an optional:

  1c) this flag can only be changed on empty groups

to ease implementation.

From these rules it follows that:

 - 1a and 1b together ensure no change in behaviour per default
   for cgroup-v2.

 - 2 and 3a together ensure resource groups automagically work for task
   based controllers (under the assumption that the controller is
   strictly hierarchical).

For example, cpuacct does the accounting strictly hierarchical, it adds
the cpu usage to each parent group. Therefore the total cpu usage
accounted to the resource group is the same as if all tasks were part of
that group.
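The rules in the recap above can be sketched as a small userspace model in Python. This is a hedged illustration, not the kernel implementation: the names Group, is_domain, set_domain, resource_domain, charge_cpu and placement_ok are all invented here, and the optional rule 1c is not modelled.

```python
class Group:
    """Toy cgroup; 'is_domain' stands in for the proposed flag."""

    def __init__(self, name, parent=None, is_domain=True):
        self.name = name
        self.parent = parent
        self.is_domain = is_domain
        self.cpu_usage = 0  # cpuacct-style hierarchical counter

    def mkdir(self, name):
        # Rule 1a: a new group inherits the flag from its parent.
        return Group(name, parent=self, is_domain=self.is_domain)

    def set_domain(self, value):
        # Rule 1b: the root is a resource domain and cannot be cleared.
        if self.parent is None and not value:
            raise ValueError("root must stay a resource domain")
        self.is_domain = value

    def resource_domain(self):
        # Rule 3b: walk up until a group with the flag set is found.
        group = self
        while not group.is_domain:
            group = group.parent
        return group

    def charge_cpu(self, amount):
        # Rule 3a plus hierarchical accounting: charge the task's own
        # group and every ancestor.
        group = self
        while group is not None:
            group.cpu_usage += amount
            group = group.parent


def placement_ok(thread_groups):
    # Rule 2: all threads of one process share one resource domain.
    return len({id(g.resource_domain()) for g in thread_groups}) == 1


root = Group("root")
app = root.mkdir("app")            # inherits is_domain=True
workers = app.mkdir("workers")
workers.set_domain(False)          # thread-level subtree inside "app"
io = workers.mkdir("io")           # inherits is_domain=False
compute = workers.mkdir("compute")

io.charge_cpu(30)
compute.charge_cpu(70)
print(io.resource_domain().name)     # app
print(app.cpu_usage)                 # 100
print(placement_ok([io, compute]))   # True
```

The last three lines show the cpuacct argument: because charging is strictly hierarchical, the total seen at the resource domain "app" is the same as if both threads had been placed in "app" directly.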
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Thu, Feb 02, 2017 at 04:52:29PM -0500, Tejun Heo wrote:
> > Why do you need to manually turn it on? That is, couldn't it be
> > automatic based on what controllers are enabled?
>
> This came up already but it's not like some controllers are inherently
> thread-only. Consider CPU, all in-context CPU cycle consumptions are
> tied to the thread; however, we also want to be able to account for
> CPU cycles consumed for, for example, memory reclaim or encryption
> during writeback.
>
> I played with an interface where thread mode is enabled automatically
> up to the common ancestor of the threads but not only was it
> complicated to implement but also the eventual behavior was very
> confusing as the resource domain can change without any active actions
> from the user. I think keeping things simple is the right choice
> here.

Note that the explicit marking of the resource domains gets you exactly
that.

But let me reply in the other subthread.
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
Hello,

On Fri, Feb 03, 2017 at 01:10:21PM -0800, Andy Lutomirski wrote:
> Is this flexible enough for the real-world usecases? For my use case

I can't think of a reason why it won't be. Capability-wise, nothing
is being lost by the interface.

> (if I actually ported over to this), it would mean that I'd have to
> enable thread mode on the root. What about letting a given process
> (actually mm, perhaps) live in a cgroup but let the threads be in
> different cgroups without any particular constraints. Then
> process-wide stuff would be accounted to the cgroup that owns the
> process.

I don't know. So, then, we basically have completely separate trees
for resource domains and threads. That exactly is what mounting cpu
controller separately does. It doesn't make sense to put them on the
same hierarchy. Why?

> > If a controller can't possibly define how internal competition should
> > be handled, which is unlikely - the problem is being consistent and
> > sensible, defining something isn't difficult - the controller can
> > simply error out those cases either on configuration or migration.
> > Again, I'm very doubtful we'll need that but if we ever need that
> > denying specific configurations is the best we can do anyway.
>
> I'm not sure I follow.
>
> I'm suggesting something quite simple: let controllers that don't need
> the no-internal-process constraints set a flag so that the constraints
> don't apply if all enabled controllers have the flag set.

Firstly, I think it's better to have the rules as simple and
consistent as possible as long as we don't sacrifice critical
capabilities.

Secondly, all the major resource controllers including cpu would
eventually need resource domain, so there is no real practical upside
to doing so.

Thirdly, if we commit to something like "controller X is not subject
to no-internal-process constraint", that commitment would prevent us
from ever adding domain level operations to that controller without
breaking userland visible interface.

All controllers do and have to support process level operations. Some
controllers can do thread level operations. Keeping the latter opt-in
doesn't block us from adding thread mode later on. Doing it the other
way around blocks us from adding domain level operations later on.

Thanks.

--
tejun
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Thu, Feb 2, 2017 at 1:52 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Feb 02, 2017 at 01:32:19PM -0800, Andy Lutomirski wrote:
>> > * Thread mode is explicitly enabled on a cgroup by writing "enable"
>> >   into "cgroup.threads" file. The cgroup shouldn't have any child
>> >   cgroups or enabled controllers.
>>
>> Why do you need to manually turn it on? That is, couldn't it be
>> automatic based on what controllers are enabled?
>
> This came up already but it's not like some controllers are inherently
> thread-only. Consider CPU, all in-context CPU cycle consumptions are
> tied to the thread; however, we also want to be able to account for
> CPU cycles consumed for, for example, memory reclaim or encryption
> during writeback.
>

Is this flexible enough for the real-world usecases? For my use case
(if I actually ported over to this), it would mean that I'd have to
enable thread mode on the root. What about letting a given process
(actually mm, perhaps) live in a cgroup but let the threads be in
different cgroups without any particular constraints. Then
process-wide stuff would be accounted to the cgroup that owns the
process.

>
>> > * Once enabled, arbitrary sub-hierarchy can be created and threads can
>> >   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>> >   file. Process granularity and no-internal-process constraint don't
>> >   apply in a threaded subtree.
>>
>> I'm a bit worried that this conflates two different things. There's
>> thread support, i.e. allowing individual threads to be placed into
>> cgroups. There's also more flexible sub-hierarchy support, i.e.
>> relaxing no-internal-process constraints. For the "cpuacct"
>> controller, for example, both of these make sense. But what if
>> someone writes a controller (directio, for example, just to make
>> something up) for which thread granularity makes sense but relaxing
>> no-internal-process constraints does not?
>
> If a controller can't possibly define how internal competition should
> be handled, which is unlikely - the problem is being consistent and
> sensible, defining something isn't difficult - the controller can
> simply error out those cases either on configuration or migration.
> Again, I'm very doubtful we'll need that but if we ever need that
> denying specific configurations is the best we can do anyway.
>

I'm not sure I follow.

I'm suggesting something quite simple: let controllers that don't need
the no-internal-process constraints set a flag so that the constraints
don't apply if all enabled controllers have the flag set.

--Andy
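The flag suggested at the end of this message amounts to a one-line predicate. A minimal Python sketch follows, assuming a hypothetical per-controller attribute named allows_internal_tasks; neither that name nor the function below exists in the kernel or in the posted patches, and the flag values given to the example controllers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Controller:
    name: str
    allows_internal_tasks: bool  # hypothetical opt-out flag


def no_internal_process_applies(enabled):
    """The constraint is lifted only if every enabled controller opted out."""
    return not all(c.allows_internal_tasks for c in enabled)


# Flag values below are illustrative assumptions only.
cpuacct = Controller("cpuacct", allows_internal_tasks=True)
memory = Controller("memory", allows_internal_tasks=False)

print(no_internal_process_applies([cpuacct]))          # False
print(no_internal_process_applies([cpuacct, memory]))  # True
```

With no controllers enabled the predicate also returns False, since all() over an empty sequence is true; whether that is the desired behaviour would be part of the interface discussion.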
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
Hello, Peter.

On Fri, Feb 03, 2017 at 09:20:48PM +0100, Peter Zijlstra wrote:
> So my proposal was to do the inverse of what you propose here. Instead
> of marking special 'thread' subtrees, explicitly mark resource domains
> in the tree.
>
> So always allow arbitrary hierarchies and allow threads to be assigned
> to cgroups, as long as they're all in the same resource domain.
>
> Controllers that do not support things, fallback to mapping everything
> to the nearest resource domain.

That sounds counter-intuitive as all controllers can do resource
domains and only a subset of them can do thread mode. Also, thread
subtrees are necessarily a sub-hierarchy of a resource domain.

Also, expanding resource domains from the root after the trees are
populated would make the behaviors surprising as the resource domains
that these subtrees belong to would change dynamically.

In practice, how would this work? To enable memcg, the user has to
first create the subtree and then explicitly have to make that a
domain and then enable memcg? If so, that would be a completely
unnecessary deviation from the current behavior while not achieving
any more functionalities, right? It's just flipping the default value
the other way around and the default wouldn't be supported by many of
the controllers. I can't see how that is a better option.

Thanks.

--
tejun
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Thu, Feb 02, 2017 at 03:06:27PM -0500, Tejun Heo wrote:
> Hello,
>
> This patchset implements cgroup v2 thread mode. It is largely based
> on the discussions that we had at the plumbers last year. Here's the
> rough outline.
>
> * Thread mode is explicitly enabled on a cgroup by writing "enable"
>   into "cgroup.threads" file. The cgroup shouldn't have any child
>   cgroups or enabled controllers.
>
> * Once enabled, arbitrary sub-hierarchy can be created and threads can
>   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>   file. Process granularity and no-internal-process constraint don't
>   apply in a threaded subtree.
>
> * To be used in a threaded subtree, controllers should explicitly
>   declare thread mode support and should be able to handle internal
>   competition in some way.
>
> * The root of a threaded subtree serves as the resource domain for the
>   whole subtree. This is where all the controllers are guaranteed to
>   have a common ground and resource consumptions in the threaded
>   subtree which aren't tied to a specific thread are charged.
>   Non-threaded controllers never see beyond thread root and can assume
>   that all controllers will follow the same rules upto that point.
>
> This allows threaded controllers to implement thread granular resource
> control without getting in the way of system level resource
> partitioning.

I'm still confused. So suppose I mark my root cgroup as such, because
I run RT tasks there. Does this then mean I cannot later start a VM
and have that containered properly?

That is, I think threaded controllers very much get in the way of
system level resource partitioning this way.

So my proposal was to do the inverse of what you propose here. Instead
of marking special 'thread' subtrees, explicitly mark resource domains
in the tree.

So always allow arbitrary hierarchies and allow threads to be assigned
to cgroups, as long as they're all in the same resource domain.

Controllers that do not support things fall back to mapping everything
to the nearest resource domain.
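[Editor's sketch] The "fall back to the nearest resource domain" rule in Peter's proposal can be pictured with a small tree walk. This is a toy model under stated assumptions, not anything from the patchset: `Cgroup`, `is_domain`, `nearest_domain` and `effective_cgroup` are hypothetical names, and the root is assumed to always be marked as a domain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    """Hypothetical cgroup node; is_domain marks an explicit resource domain."""
    name: str
    parent: Optional["Cgroup"] = None
    is_domain: bool = False


def nearest_domain(cg: Cgroup) -> Cgroup:
    """Walk towards the root until a cgroup marked as a resource domain is found."""
    node: Optional[Cgroup] = cg
    while node is not None:
        if node.is_domain:
            return node
        node = node.parent
    raise ValueError(f"no resource domain at or above {cg.name}")


def effective_cgroup(cg: Cgroup, controller_supports_threads: bool) -> Cgroup:
    """Thread-capable controllers see cg itself; the rest map to the nearest domain."""
    return cg if controller_supports_threads else nearest_domain(cg)


# Example hierarchy: root (domain) -> A (domain) -> T -> B
root = Cgroup("root", is_domain=True)
a = Cgroup("A", parent=root, is_domain=True)
t = Cgroup("T", parent=a)
b = Cgroup("B", parent=t)

print(effective_cgroup(b, controller_supports_threads=True).name)   # B
print(effective_cgroup(b, controller_supports_threads=False).name)  # A
```

In this model a controller without thread support accounts everything under T and B to A, while a thread-capable controller keeps the finer-grained placement.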
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
Hello,

On Thu, Feb 02, 2017 at 01:32:19PM -0800, Andy Lutomirski wrote:
> > * Thread mode is explicitly enabled on a cgroup by writing "enable"
> >   into "cgroup.threads" file. The cgroup shouldn't have any child
> >   cgroups or enabled controllers.
>
> Why do you need to manually turn it on? That is, couldn't it be
> automatic based on what controllers are enabled?

This came up already but it's not like some controllers are inherently
thread-only. Consider CPU, all in-context CPU cycle consumptions are
tied to the thread; however, we also want to be able to account for
CPU cycles consumed for, for example, memory reclaim or encryption
during writeback.

I played with an interface where thread mode is enabled automatically
up to the common ancestor of the threads but not only was it
complicated to implement but also the eventual behavior was very
confusing as the resource domain can change without any active actions
from the user. I think keeping things simple is the right choice here.

> > * Once enabled, arbitrary sub-hierarchy can be created and threads can
> >   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
> >   file. Process granularity and no-internal-process constraint don't
> >   apply in a threaded subtree.
>
> I'm a bit worried that this conflates two different things. There's
> thread support, i.e. allowing individual threads to be placed into
> cgroups. There's also more flexible sub-hierarchy support, i.e.
> relaxing no-internal-process constraints. For the "cpuacct"
> controller, for example, both of these make sense. But what if
> someone writes a controller (directio, for example, just to make
> something up) for which thread granularity makes sense but relaxing
> no-internal-process constraints does not?

If a controller can't possibly define how internal competition should
be handled, which is unlikely - the problem is being consistent and
sensible, defining something isn't difficult - the controller can
simply error out those cases either on configuration or migration.
Again, I'm very doubtful we'll need that but if we ever need that
denying specific configurations is the best we can do anyway.

Thanks.

--
tejun
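[Editor's sketch] The automatic scheme Tejun says he experimented with, where thread mode reaches up to the common ancestor of a process's threads, might be approximated as below. It is a guess at the idea for illustration only: the cgroup paths are invented, `implied_domain` is a hypothetical helper, and the experimental kernel code may have worked quite differently.

```python
import posixpath
from typing import Sequence


def implied_domain(thread_cgroups: Sequence[str]) -> str:
    """Return the deepest cgroup path that contains every thread of a process.

    Under an automatic scheme this common ancestor would act as the
    process's resource domain (an assumption of this sketch).
    """
    return posixpath.commonpath(thread_cgroups)


# Two threads of one process, both placed below /A/B.
before = implied_domain(["/A/B/C", "/A/B/D"])

# Migrating one thread elsewhere silently moves the implied domain upwards.
after = implied_domain(["/A/B/C", "/A/E"])

print(before)  # /A/B
print(after)   # /A
```

The second call shows the behaviour he found confusing: a single thread migration changes the resource domain without the user having asked for that.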
Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
On Thu, Feb 2, 2017 at 12:06 PM, Tejun Heo wrote:
> Hello,
>
> This patchset implements cgroup v2 thread mode. It is largely based
> on the discussions that we had at the plumbers last year. Here's the
> rough outline.

I like this, but I have some design questions:

>
> * Thread mode is explicitly enabled on a cgroup by writing "enable"
>   into "cgroup.threads" file. The cgroup shouldn't have any child
>   cgroups or enabled controllers.

Why do you need to manually turn it on? That is, couldn't it be
automatic based on what controllers are enabled?

>
> * Once enabled, arbitrary sub-hierarchy can be created and threads can
>   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>   file. Process granularity and no-internal-process constraint don't
>   apply in a threaded subtree.

I'm a bit worried that this conflates two different things. There's
thread support, i.e. allowing individual threads to be placed into
cgroups. There's also more flexible sub-hierarchy support, i.e.
relaxing no-internal-process constraints. For the "cpuacct"
controller, for example, both of these make sense. But what if
someone writes a controller (directio, for example, just to make
something up) for which thread granularity makes sense but relaxing
no-internal-process constraints does not?

--Andy
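[Editor's sketch] The first two bullets of the quoted outline describe a file-based interface, which userspace might drive roughly as follows. This follows only the wording of the outline in this thread ("enable" and TIDs written into "cgroup.threads") and has not been checked against any kernel, so the interface in a given release may differ. The helper names and the example path in the comments are hypothetical.

```python
from pathlib import Path
from typing import Union

# File name taken from the outline quoted in this thread.
THREADS_FILE = "cgroup.threads"


def enable_thread_mode(cgroup_dir: Union[str, Path]) -> None:
    """Write "enable" into the cgroup's cgroup.threads file.

    Per the outline, the cgroup should have no child cgroups and no
    enabled controllers at this point; this sketch does not check that.
    """
    (Path(cgroup_dir) / THREADS_FILE).write_text("enable")


def move_thread(cgroup_dir: Union[str, Path], tid: int) -> None:
    """Place a single thread into a cgroup of the threaded subtree by TID."""
    (Path(cgroup_dir) / THREADS_FILE).write_text(str(tid))


# Hypothetical usage; the mount point and cgroup names are assumptions:
#   enable_thread_mode("/sys/fs/cgroup/app")
#   move_thread("/sys/fs/cgroup/app/workers", 1234)
```

Each call performs a single write, mirroring how cgroup control files are normally driven one value at a time.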