Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
Hello,

On Thu, Dec 03, 2020 at 04:51:42PM -0800, Josh Don wrote:
> > To me the color thing reads like an end-run around the cgroup hierarchy.
>
> Restructuring the cgroup resource hierarchy to incorporate the trust
> domains is not necessarily trivial (as is the case for us). I agree
> though that that would be the ideal correct solution from the cgroup
> hierarchy perspective.

Yeah, it sure isn't. We (FB) had to go through a couple iterations and it's
looking like we need another adjustment, so I fully agree that these are
painful but at the same time I don't think it's difficult to see how
piling up workarounds in the lowest layer is not the right thing to do. The
workarounds you guys need would be different from what we or others would
need. The kernel can't be maintained in any sustainable manner if we keep
piling disjoint workarounds on it. Please consider the long term trajectory
when proposing interface changes, which often boils down to identifying the
core features which must be supported by the interface.

Thanks.

--
tejun
Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
On Wed, Dec 2, 2020 at 12:02 AM Peter Zijlstra wrote:
>
> On Tue, Dec 01, 2020 at 10:18:00PM -0800, Josh Don wrote:
> > Hey Peter,
> >
> > On Wed, Nov 25, 2020 at 5:43 AM Peter Zijlstra wrote:
> > >
> > > Why can't the above work by setting 'tag' (that's a terrible name, why
> > > does that still live) in CDE? Have the most specific tag live. Same with
> > > that thread stuff.
> >
> > The motivation is to allow an unprivileged user the ability to
> > configure the trust hierarchy in a way that otherwise wouldn't be
> > possible for a given cgroup hierarchy. For example given a cookie'd
> > hierarchy such as:
> >
> >         A
> >     / | | \
> >    B  C  D  E
> >
> > the user might only want subsets of {B, C, D, E} to share. For
> > instance, the user might only want {B,C} and {D, E} to share. One way
> > to solve this would be to allow the user to write the group cookie
> > directly. However, this interface would need to be restricted to
> > privileged users, since otherwise the cookie could be configured to
> > share with any arbitrary cgroup. The purpose of the 'color' field is
> > to expose a portion of the cookie that can be modified by a
> > non-privileged user in order to achieve this sharing goal.
> >
> > If this doesn't seem like a useful case, I'm happy to drop this patch
> > from the series to unblock it.
>
> Well, the traditional cgroup way of doing that would be to:
>
>           A
>          / \
>        T1   T2
>       /  \
>      B    C
>
> And tag T1 if you want B,C to share.
>
> To me the color thing reads like an end-run around the cgroup hierarchy.

Restructuring the cgroup resource hierarchy to incorporate the trust
domains is not necessarily trivial (as is the case for us). I agree
though that that would be the ideal correct solution from the cgroup
hierarchy perspective.
Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
Hello,

On Wed, Dec 02, 2020 at 09:02:11AM +0100, Peter Zijlstra wrote:
> > the user might only want subsets of {B, C, D, E} to share. For
> > instance, the user might only want {B,C} and {D, E} to share. One way
> > to solve this would be to allow the user to write the group cookie
> > directly. However, this interface would need to be restricted to
> > privileged users, since otherwise the cookie could be configured to
> > share with any arbitrary cgroup. The purpose of the 'color' field is
> > to expose a portion of the cookie that can be modified by a
> > non-privileged user in order to achieve this sharing goal.
> >
> > If this doesn't seem like a useful case, I'm happy to drop this patch
> > from the series to unblock it.
>
> Well, the traditional cgroup way of doing that would be to:
>
>           A
>          / \
>        T1   T2
>       /  \
>      B    C
>
> And tag T1 if you want B,C to share.
>
> To me the color thing reads like an end-run around the cgroup hierarchy.

+1 and please cc me on cgroup related changes.

Thanks.

--
tejun
Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
On Tue, Dec 01, 2020 at 10:18:00PM -0800, Josh Don wrote:
> Hey Peter,
>
> On Wed, Nov 25, 2020 at 5:43 AM Peter Zijlstra wrote:
> >
> > Why can't the above work by setting 'tag' (that's a terrible name, why
> > does that still live) in CDE? Have the most specific tag live. Same with
> > that thread stuff.
>
> The motivation is to allow an unprivileged user the ability to
> configure the trust hierarchy in a way that otherwise wouldn't be
> possible for a given cgroup hierarchy. For example given a cookie'd
> hierarchy such as:
>
>         A
>     / | | \
>    B  C  D  E
>
> the user might only want subsets of {B, C, D, E} to share. For
> instance, the user might only want {B,C} and {D, E} to share. One way
> to solve this would be to allow the user to write the group cookie
> directly. However, this interface would need to be restricted to
> privileged users, since otherwise the cookie could be configured to
> share with any arbitrary cgroup. The purpose of the 'color' field is
> to expose a portion of the cookie that can be modified by a
> non-privileged user in order to achieve this sharing goal.
>
> If this doesn't seem like a useful case, I'm happy to drop this patch
> from the series to unblock it.

Well, the traditional cgroup way of doing that would be to:

          A
         / \
       T1   T2
      /  \
     B    C

And tag T1 if you want B,C to share.

To me the color thing reads like an end-run around the cgroup hierarchy.
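
The hierarchy-based alternative described above (tag an intermediate group, and let the most specific tag win) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct cg_node and effective_tag() are hypothetical names invented for illustration, not kernel structures or functions, and the model assumes a tag applies to a whole subtree unless a descendant sets its own.

```c
#include <stddef.h>

/* Hypothetical model of a cgroup node; not a kernel structure. */
struct cg_node {
	const struct cg_node *parent;	/* NULL for the root of the model */
	unsigned long tag;		/* 0 means "no tag set on this group" */
};

/*
 * Return the tag of the nearest tagged group, starting at cg itself and
 * walking towards the root. Returns 0 if no group on the path is tagged.
 */
static unsigned long effective_tag(const struct cg_node *cg)
{
	for (; cg; cg = cg->parent)
		if (cg->tag)
			return cg->tag;
	return 0;
}
```

In this model, tagging T1 gives B and C the same effective tag without touching either of them, while a group placed under a differently tagged T2 ends up with a different effective tag. Setting a tag directly on B would override the one inherited from T1.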
Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
Hey Peter,

On Wed, Nov 25, 2020 at 5:43 AM Peter Zijlstra wrote:
>
> Why can't the above work by setting 'tag' (that's a terrible name, why
> does that still live) in CDE? Have the most specific tag live. Same with
> that thread stuff.

The motivation is to allow an unprivileged user the ability to
configure the trust hierarchy in a way that otherwise wouldn't be
possible for a given cgroup hierarchy. For example given a cookie'd
hierarchy such as:

        A
    / | | \
   B  C  D  E

the user might only want subsets of {B, C, D, E} to share. For
instance, the user might only want {B,C} and {D, E} to share. One way
to solve this would be to allow the user to write the group cookie
directly. However, this interface would need to be restricted to
privileged users, since otherwise the cookie could be configured to
share with any arbitrary cgroup. The purpose of the 'color' field is
to expose a portion of the cookie that can be modified by a
non-privileged user in order to achieve this sharing goal.

If this doesn't seem like a useful case, I'm happy to drop this patch
from the series to unblock it.

> All this API stuff here is a complete and utter trainwreck. Please just
> delete the patches and start over. Hint: if you use stop_machine(),
> you're doing it wrong.

Yes, agree on stop_machine(); we'll pull that out of the underlying
interface patch.

Thanks,
Josh
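
The sharing rule described in this message, where a privileged part of the cookie is combined with an unprivileged 'color', can be sketched as a field-by-field comparison in the spirit of the patch's sched_core_cookie_cmp(). The following is a userspace model only: struct cookie_model, cookie_model_cmp() and can_share_core() are hypothetical names for illustration, and the task_cookie field of the real structure is left out for brevity.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a compound cookie; not the kernel struct. */
struct cookie_model {
	unsigned long group_cookie;	/* assumed to be set by a privileged daemon */
	unsigned long color;		/* assumed to be set by an unprivileged user */
};

/* Three-way compare, one field at a time; 0 means every field matches. */
static int cookie_model_cmp(const struct cookie_model *a,
			    const struct cookie_model *b)
{
	if (a->group_cookie != b->group_cookie)
		return a->group_cookie < b->group_cookie ? -1 : 1;
	if (a->color != b->color)
		return a->color < b->color ? -1 : 1;
	return 0;
}

/* In this model, two groups may share only if their whole cookies are equal. */
static bool can_share_core(const struct cookie_model *a,
			   const struct cookie_model *b)
{
	return cookie_model_cmp(a, b) == 0;
}
```

With B and C given color 1 and D and E given color 2 under the same group cookie, can_share_core() holds for (B, C) and for (D, E) but not for (B, D). Because the group cookie is compared first and cannot be changed through the color, picking a color cannot make a group share with one outside its parent's trust domain.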
Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
Hi Peter,

On Wed, Nov 25, 2020 at 02:42:37PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:56PM -0500, Joel Fernandes (Google) wrote:
> > From: Josh Don
> >
> > Google has a usecase where the first level tag to tag a CGroup is not
> > sufficient. So, a patch is carried for years where a second tag is added
> > which is writeable by unprivileged users.
> >
> > Google uses DAC controls to make the 'tag' possible to set only by root
> > while the second-level 'color' can be changed by anyone. The actual names
> > that Google uses is different, but the concept is the same.
> >
> > The hierarchy looks like:
> >
> > Root group
> >    / \
> >   A   B    (These are created by the root daemon - borglet).
> >  / \   \
> > C   D   E  (These are created by AppEngine within the container).
> >
> > The reason why Google has two parts is that AppEngine wants to allow a
> > subset of subcgroups within a parent tagged cgroup sharing execution.
> > Think of these subcgroups belong to the same customer or project. Because
> > these subcgroups are created by AppEngine, they are not tracked by borglet
> > (the root daemon), therefore borglet won't have a chance to set a color
> > for them. That's where 'color' file comes from. Color could be set by
> > AppEngine, and once set, the normal tasks within the subcgroup would not
> > be able to overwrite it. This is enforced by promoting the permission of
> > the color file in cgroupfs.
>
> Why can't the above work by setting 'tag' (that's a terrible name, why
> does that still live) in CDE? Have the most specific tag live. Same with
> that thread stuff.

Google's usecase has two parts. The first part is set by a privileged
process, and the second part (color) is set within the container. Maybe we
can just put the "color" feature behind a CONFIG option for Google to
enable?

> All this API stuff here is a complete and utter trainwreck. Please just
> delete the patches and start over. Hint: if you use stop_machine(),
> you're doing it wrong.

Ok, the idea was to use stop_machine() as in your initial patch. It works
quite well in testing. However, I agree that it's horrible and we ought to
do better (or at least try).

Maybe we can do a synchronize_rcu() after changing the cookie, to ensure we
are no longer using the old cookie value in the scheduler.

> At best you now have the requirements sorted.

Yes.

thanks,

 - Joel
Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
On Wed, Nov 25, 2020 at 02:42:37PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:56PM -0500, Joel Fernandes (Google) wrote:
> > From: Josh Don
> >
> > Google has a usecase where the first level tag to tag a CGroup is not
> > sufficient. So, a patch is carried for years where a second tag is added
> > which is writeable by unprivileged users.
> >
> > Google uses DAC controls to make the 'tag' possible to set only by root
> > while the second-level 'color' can be changed by anyone. The actual names
> > that Google uses is different, but the concept is the same.
> >
> > The hierarchy looks like:
> >
> > Root group
> >    / \
> >   A   B    (These are created by the root daemon - borglet).
> >  / \   \
> > C   D   E  (These are created by AppEngine within the container).
> >
> > The reason why Google has two parts is that AppEngine wants to allow a
> > subset of subcgroups within a parent tagged cgroup sharing execution.
> > Think of these subcgroups belong to the same customer or project. Because
> > these subcgroups are created by AppEngine, they are not tracked by borglet
> > (the root daemon), therefore borglet won't have a chance to set a color
> > for them. That's where 'color' file comes from. Color could be set by
> > AppEngine, and once set, the normal tasks within the subcgroup would not
> > be able to overwrite it. This is enforced by promoting the permission of
> > the color file in cgroupfs.
>
> Why can't the above work by setting 'tag' (that's a terrible name, why
> does that still live) in CDE? Have the most specific tag live. Same with
> that thread stuff.
>
> All this API stuff here is a complete and utter trainwreck. Please just
> delete the patches and start over. Hint: if you use stop_machine(),
> you're doing it wrong.
>
> At best you now have the requirements sorted.

+1, just remove this patch from the series so as to unblock the series.

Balbir Singh.
Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
On Tue, Nov 17, 2020 at 06:19:56PM -0500, Joel Fernandes (Google) wrote:
> From: Josh Don
>
> Google has a usecase where the first level tag to tag a CGroup is not
> sufficient. So, a patch is carried for years where a second tag is added
> which is writeable by unprivileged users.
>
> Google uses DAC controls to make the 'tag' possible to set only by root
> while the second-level 'color' can be changed by anyone. The actual names
> that Google uses is different, but the concept is the same.
>
> The hierarchy looks like:
>
> Root group
>    / \
>   A   B    (These are created by the root daemon - borglet).
>  / \   \
> C   D   E  (These are created by AppEngine within the container).
>
> The reason why Google has two parts is that AppEngine wants to allow a
> subset of subcgroups within a parent tagged cgroup sharing execution.
> Think of these subcgroups belong to the same customer or project. Because
> these subcgroups are created by AppEngine, they are not tracked by borglet
> (the root daemon), therefore borglet won't have a chance to set a color
> for them. That's where 'color' file comes from. Color could be set by
> AppEngine, and once set, the normal tasks within the subcgroup would not
> be able to overwrite it. This is enforced by promoting the permission of
> the color file in cgroupfs.

Why can't the above work by setting 'tag' (that's a terrible name, why
does that still live) in CDE? Have the most specific tag live. Same with
that thread stuff.

All this API stuff here is a complete and utter trainwreck. Please just
delete the patches and start over. Hint: if you use stop_machine(),
you're doing it wrong.

At best you now have the requirements sorted.
[PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase
From: Josh Don

Google has a usecase where the first level tag to tag a CGroup is not
sufficient. So, a patch is carried for years where a second tag is added
which is writeable by unprivileged users.

Google uses DAC controls to make the 'tag' possible to set only by root
while the second-level 'color' can be changed by anyone. The actual names
that Google uses is different, but the concept is the same.

The hierarchy looks like:

Root group
   / \
  A   B    (These are created by the root daemon - borglet).
 / \   \
C   D   E  (These are created by AppEngine within the container).

The reason why Google has two parts is that AppEngine wants to allow a
subset of subcgroups within a parent tagged cgroup sharing execution.
Think of these subcgroups belong to the same customer or project. Because
these subcgroups are created by AppEngine, they are not tracked by borglet
(the root daemon), therefore borglet won't have a chance to set a color
for them. That's where 'color' file comes from. Color could be set by
AppEngine, and once set, the normal tasks within the subcgroup would not
be able to overwrite it. This is enforced by promoting the permission of
the color file in cgroupfs.

Reviewed-by: Joel Fernandes (Google)
Signed-off-by: Josh Don
Signed-off-by: Joel Fernandes (Google)
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 120 +++---
 kernel/sched/sched.h  |   2 +
 3 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6fbdb1a204bf..c9efdf8ccdf3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -690,6 +690,7 @@ struct task_struct {
 	unsigned long			core_cookie;
 	unsigned long			core_task_cookie;
 	unsigned long			core_group_cookie;
+	unsigned long			core_color;
 	unsigned int			core_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd75b3d62a97..8f17ec8e993e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9049,9 +9049,6 @@ void sched_offline_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-			       unsigned long *group_cookie_ptr);
-
 static void sched_change_group(struct task_struct *tsk, int type)
 {
 	struct task_group *tg;
@@ -9747,6 +9744,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 struct sched_core_cookie {
 	unsigned long task_cookie;
 	unsigned long group_cookie;
+	unsigned long color;
 
 	struct rb_node node;
 	refcount_t refcnt;
@@ -9782,6 +9780,7 @@ static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
 
 	COOKIE_CMP_RETURN(task_cookie);
 	COOKIE_CMP_RETURN(group_cookie);
+	COOKIE_CMP_RETURN(color);
 
 	/* all cookie fields match */
 	return 0;
@@ -9819,7 +9818,7 @@ static void sched_core_put_cookie(struct sched_core_cookie *cookie)
 
 /*
  * A task's core cookie is a compound structure composed of various cookie
- * fields (task_cookie, group_cookie). The overall core_cookie is
+ * fields (task_cookie, group_cookie, color). The overall core_cookie is
  * a pointer to a struct containing those values. This function either finds
  * an existing core_cookie or creates a new one, and then updates the task's
  * core_cookie to point to it. Additionally, it handles the necessary reference
@@ -9837,6 +9836,7 @@ static void __sched_core_update_cookie(struct task_struct *p)
 	struct sched_core_cookie temp = {
 		.task_cookie	= p->core_task_cookie,
 		.group_cookie	= p->core_group_cookie,
+		.color		= p->core_color
 	};
 	const bool is_zero_cookie =
 		(sched_core_cookie_cmp(&temp, &zero_cookie) == 0);
@@ -9892,6 +9892,7 @@ static void __sched_core_update_cookie(struct task_struct *p)
 
 		match->task_cookie = temp.task_cookie;
 		match->group_cookie = temp.group_cookie;
+		match->color = temp.color;
 		refcount_set(&match->refcnt, 1);
 
 		rb_link_node(&match->node, parent, node);
@@ -9949,6 +9950,9 @@ static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie
 	case sched_core_group_cookie_type:
 		p->core_group_cookie = cookie;
 		break;
+	case sched_core_color_type:
+		p->core_color = cookie;
+		break;
 	default:
 		WARN_ON_ONCE(1);
 	}
@@ -9967,19 +9971,23 @@ static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie
 	sched_core_enqueue(task_rq(p), p);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-			       unsigned long *group_cookie_ptr);
+void cpu_core_