Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-12-04 Thread Tejun Heo
Hello,

On Thu, Dec 03, 2020 at 04:51:42PM -0800, Josh Don wrote:
> > To me the color thing reads like an end-run around the cgroup hierarchy.
> 
> Restructuring the cgroup resource hierarchy to incorporate the trust
> domains is not necessarily trivial (as is the case for us).  I agree
> though that that would be the ideal correct solution from the cgroup
> hierarchy perspective.

Yeah, it sure isn't. We (FB) had to go through a couple iterations and it's
looking like we need another adjustment, so I fully agree that these are
painful, but at the same time I don't think it's difficult to see how
piling up workarounds in the lowest layer is not the right thing to do. The
workarounds you guys need would be different from what we or others would
need. The kernel can't be maintained in any sustainable manner if we keep
piling disjoint workarounds onto it. Please consider the long-term trajectory
when proposing interface changes, which often boils down to identifying the
core features which must be supported by the interface.

Thanks.

-- 
tejun


Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-12-03 Thread Josh Don
On Wed, Dec 2, 2020 at 12:02 AM Peter Zijlstra  wrote:
>
> On Tue, Dec 01, 2020 at 10:18:00PM -0800, Josh Don wrote:
> > Hey Peter,
> >
> > On Wed, Nov 25, 2020 at 5:43 AM Peter Zijlstra  wrote:
> > >
> > > Why can't the above work by setting 'tag' (that's a terrible name, why
> > > does that still live) in CDE? Have the most specific tag live. Same with
> > > that thread stuff.
> >
> > The motivation is to allow an unprivileged user the ability to
> > configure the trust hierarchy in a way that otherwise wouldn't be
> > possible for a given cgroup hierarchy.  For example given a cookie'd
> > hierarchy such as:
> >
> >      A
> >   / |  | \
> >  B  C  D  E
> >
> > the user might only want subsets of {B, C, D, E} to share.  For
> > instance, the user might only want {B,C} and {D, E} to share.  One way
> > to solve this would be to allow the user to write the group cookie
> > directly.  However, this interface would need to be restricted to
> > privileged users, since otherwise the cookie could be configured to
> > share with any arbitrary cgroup.  The purpose of the 'color' field is
> > to expose a portion of the cookie that can be modified by a
> > non-privileged user in order to achieve this sharing goal.
> >
> > If this doesn't seem like a useful case, I'm happy to drop this patch
> > from the series to unblock it.
>
> Well, the traditional cgroup way of doing that would be to:
>
>        A
>       / \
>     T1   T2
>    /  \
>   B    C
>
> And tag T1 if you want B,C to share.
>
> To me the color thing reads like an end-run around the cgroup hierarchy.

Restructuring the cgroup resource hierarchy to incorporate the trust
domains is not necessarily trivial (as is the case for us).  I agree
though that that would be the ideal correct solution from the cgroup
hierarchy perspective.


Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-12-02 Thread Tejun Heo
Hello,

On Wed, Dec 02, 2020 at 09:02:11AM +0100, Peter Zijlstra wrote:
> > the user might only want subsets of {B, C, D, E} to share.  For
> > instance, the user might only want {B,C} and {D, E} to share.  One way
> > to solve this would be to allow the user to write the group cookie
> > directly.  However, this interface would need to be restricted to
> > privileged users, since otherwise the cookie could be configured to
> > share with any arbitrary cgroup.  The purpose of the 'color' field is
> > to expose a portion of the cookie that can be modified by a
> > non-privileged user in order to achieve this sharing goal.
> > 
> > If this doesn't seem like a useful case, I'm happy to drop this patch
> > from the series to unblock it.
> 
> Well, the traditional cgroup way of doing that would be to:
> 
>        A
>       / \
>     T1   T2
>    /  \
>   B    C
> 
> And tag T1 if you want B,C to share.
> 
> To me the color thing reads like an end-run around the cgroup hierarchy.

+1

and please cc me on cgroup related changes.

Thanks.

-- 
tejun


Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-12-02 Thread Peter Zijlstra
On Tue, Dec 01, 2020 at 10:18:00PM -0800, Josh Don wrote:
> Hey Peter,
> 
> On Wed, Nov 25, 2020 at 5:43 AM Peter Zijlstra  wrote:
> >
> > Why can't the above work by setting 'tag' (that's a terrible name, why
> > does that still live) in CDE? Have the most specific tag live. Same with
> > that thread stuff.
> 
> The motivation is to allow an unprivileged user the ability to
> configure the trust hierarchy in a way that otherwise wouldn't be
> possible for a given cgroup hierarchy.  For example given a cookie'd
> hierarchy such as:
> 
>      A
>   / |  | \
>  B  C  D  E
> 
> the user might only want subsets of {B, C, D, E} to share.  For
> instance, the user might only want {B,C} and {D, E} to share.  One way
> to solve this would be to allow the user to write the group cookie
> directly.  However, this interface would need to be restricted to
> privileged users, since otherwise the cookie could be configured to
> share with any arbitrary cgroup.  The purpose of the 'color' field is
> to expose a portion of the cookie that can be modified by a
> non-privileged user in order to achieve this sharing goal.
> 
> If this doesn't seem like a useful case, I'm happy to drop this patch
> from the series to unblock it.

Well, the traditional cgroup way of doing that would be to:

       A
      / \
    T1   T2
   /  \
  B    C

And tag T1 if you want B,C to share.

To me the color thing reads like an end-run around the cgroup hierarchy.
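
The intermediate-group layout above could be sketched in user space roughly as
follows. This is only an illustration, not kernel code: `struct group`,
`effective_tag()` and `may_share()` are hypothetical names, and the rule that a
group takes the tag of its nearest tagged ancestor is an assumption drawn from
the "most specific tag" idea in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup; tag == 0 means "no tag set here". */
struct group {
	const struct group *parent;
	unsigned long tag;
};

/* Walk towards the root and return the nearest non-zero tag (0 if none). */
static unsigned long effective_tag(const struct group *g)
{
	for (; g; g = g->parent)
		if (g->tag)
			return g->tag;
	return 0;
}

/*
 * Assumption: two groups may share a core when they resolve to the same
 * effective tag. Untagged groups resolve to 0 and so match each other.
 */
static bool may_share(const struct group *a, const struct group *b)
{
	assert(a && b);
	return effective_tag(a) == effective_tag(b);
}
```

With T1 tagged and B, C created underneath it, both children resolve to T1's
tag and may share, while anything placed under a differently tagged T2 does
not.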


Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-12-01 Thread Josh Don
Hey Peter,

On Wed, Nov 25, 2020 at 5:43 AM Peter Zijlstra  wrote:
>
> Why can't the above work by setting 'tag' (that's a terrible name, why
> does that still live) in CDE? Have the most specific tag live. Same with
> that thread stuff.

The motivation is to give an unprivileged user the ability to
configure the trust hierarchy in a way that otherwise wouldn't be
possible for a given cgroup hierarchy.  For example, given a cookie'd
hierarchy such as:

     A
  / |  | \
 B  C  D  E

the user might only want subsets of {B, C, D, E} to share.  For
instance, the user might only want {B,C} and {D, E} to share.  One way
to solve this would be to allow the user to write the group cookie
directly.  However, this interface would need to be restricted to
privileged users, since otherwise the cookie could be configured to
share with any arbitrary cgroup.  The purpose of the 'color' field is
to expose a portion of the cookie that can be modified by a
non-privileged user in order to achieve this sharing goal.
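
As a rough user-space illustration of the paragraph above (not the kernel
implementation): `struct cookie`, `cookie_cmp()` and `can_share()` are
hypothetical, simplified names, loosely modelled on the field-by-field
comparison in the patch. The sketch assumes two tasks may share a core only
when every cookie field matches.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified, hypothetical cookie: group_cookie is assumed to be set by a
 * privileged user, color by an unprivileged one.
 */
struct cookie {
	unsigned long group_cookie;
	unsigned long color;
};

/* Field-by-field three-way comparison; returns 0 only if all fields match. */
static int cookie_cmp(const struct cookie *a, const struct cookie *b)
{
	assert(a && b);
	if (a->group_cookie != b->group_cookie)
		return a->group_cookie < b->group_cookie ? -1 : 1;
	if (a->color != b->color)
		return a->color < b->color ? -1 : 1;
	return 0;
}

/* Tasks may share a core only when their compound cookies are identical. */
static bool can_share(const struct cookie *a, const struct cookie *b)
{
	return cookie_cmp(a, b) == 0;
}
```

Under this model, giving {B, C} one color and {D, E} another splits them into
two sharing sets, but no choice of color can make a task match a different
group_cookie, which is why only the color part would be safe to expose to
unprivileged users.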

If this doesn't seem like a useful case, I'm happy to drop this patch
from the series to unblock it.

> All this API stuff here is a complete and utter trainwreck. Please just
> delete the patches and start over. Hint: if you use stop_machine(),
> you're doing it wrong.

Yes, agree on stop_machine(); we'll pull that out of the underlying
interface patch.

Thanks,
Josh


Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-12-01 Thread Joel Fernandes
Hi Peter,

On Wed, Nov 25, 2020 at 02:42:37PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:56PM -0500, Joel Fernandes (Google) wrote:
> > From: Josh Don 
> > 
> > Google has a usecase where the first-level tag used to tag a CGroup is
> > not sufficient. So, a patch has been carried for years in which a second
> > tag is added that is writeable by unprivileged users.
> > 
> > Google uses DAC controls to make the 'tag' settable only by root, while
> > the second-level 'color' can be changed by anyone. The actual names that
> > Google uses are different, but the concept is the same.
> > 
> > The hierarchy looks like:
> > 
> > Root group
> >    / \
> >   A   B    (These are created by the root daemon - borglet).
> >  / \   \
> > C   D   E  (These are created by AppEngine within the container).
> > 
> > The reason why Google has two parts is that AppEngine wants to allow a
> > subset of subcgroups within a parent tagged cgroup to share execution.
> > Think of these subcgroups as belonging to the same customer or project.
> > Because these subcgroups are created by AppEngine, they are not tracked
> > by borglet (the root daemon); therefore borglet won't have a chance to
> > set a color for them. That's where the 'color' file comes from. Color can
> > be set by AppEngine, and once set, the normal tasks within the subcgroup
> > are not able to overwrite it. This is enforced by promoting the
> > permission of the color file in cgroupfs.
> 
> Why can't the above work by setting 'tag' (that's a terrible name, why
> does that still live) in CDE? Have the most specific tag live. Same with
> that thread stuff.

Google's usecase has two parts. The first part is set by a
privileged process, and the second part (color) is set within the container.
Maybe we can just put the "color" feature behind a CONFIG option for Google
to enable?

> All this API stuff here is a complete and utter trainwreck. Please just
> delete the patches and start over. Hint: if you use stop_machine(),
> you're doing it wrong.

Ok, the idea was to use stop_machine() as in your initial patch. It works
quite well in testing. However, I agree that it's horrible and we ought to do
better (or at least try).

Maybe we can do a synchronize_rcu() after changing the cookie, to ensure we are
no longer using the old cookie value in the scheduler.

> At best you now have the requirements sorted.

Yes.

thanks,

 - Joel



Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-11-30 Thread Balbir Singh
On Wed, Nov 25, 2020 at 02:42:37PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 06:19:56PM -0500, Joel Fernandes (Google) wrote:
> > From: Josh Don 
> > 
> > Google has a usecase where the first-level tag used to tag a CGroup is
> > not sufficient. So, a patch has been carried for years in which a second
> > tag is added that is writeable by unprivileged users.
> > 
> > Google uses DAC controls to make the 'tag' settable only by root, while
> > the second-level 'color' can be changed by anyone. The actual names that
> > Google uses are different, but the concept is the same.
> > 
> > The hierarchy looks like:
> > 
> > Root group
> >    / \
> >   A   B    (These are created by the root daemon - borglet).
> >  / \   \
> > C   D   E  (These are created by AppEngine within the container).
> > 
> > The reason why Google has two parts is that AppEngine wants to allow a
> > subset of subcgroups within a parent tagged cgroup to share execution.
> > Think of these subcgroups as belonging to the same customer or project.
> > Because these subcgroups are created by AppEngine, they are not tracked
> > by borglet (the root daemon); therefore borglet won't have a chance to
> > set a color for them. That's where the 'color' file comes from. Color can
> > be set by AppEngine, and once set, the normal tasks within the subcgroup
> > are not able to overwrite it. This is enforced by promoting the
> > permission of the color file in cgroupfs.
> 
> Why can't the above work by setting 'tag' (that's a terrible name, why
> does that still live) in CDE? Have the most specific tag live. Same with
> that thread stuff.
> 
> All this API stuff here is a complete and utter trainwreck. Please just
> delete the patches and start over. Hint: if you use stop_machine(),
> you're doing it wrong.
> 
> At best you now have the requirements sorted.

+1, just remove this patch from the series so as to unblock the series.

Balbir Singh.


Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-11-25 Thread Peter Zijlstra
On Tue, Nov 17, 2020 at 06:19:56PM -0500, Joel Fernandes (Google) wrote:
> From: Josh Don 
> 
> Google has a usecase where the first-level tag used to tag a CGroup is not
> sufficient. So, a patch has been carried for years in which a second tag is
> added that is writeable by unprivileged users.
> 
> Google uses DAC controls to make the 'tag' settable only by root, while
> the second-level 'color' can be changed by anyone. The actual names that
> Google uses are different, but the concept is the same.
> 
> The hierarchy looks like:
> 
> Root group
>    / \
>   A   B    (These are created by the root daemon - borglet).
>  / \   \
> C   D   E  (These are created by AppEngine within the container).
> 
> The reason why Google has two parts is that AppEngine wants to allow a
> subset of subcgroups within a parent tagged cgroup to share execution.
> Think of these subcgroups as belonging to the same customer or project.
> Because these subcgroups are created by AppEngine, they are not tracked by
> borglet (the root daemon); therefore borglet won't have a chance to set a
> color for them. That's where the 'color' file comes from. Color can be set
> by AppEngine, and once set, the normal tasks within the subcgroup are not
> able to overwrite it. This is enforced by promoting the permission of the
> color file in cgroupfs.

Why can't the above work by setting 'tag' (that's a terrible name, why
does that still live) in CDE? Have the most specific tag live. Same with
that thread stuff.

All this API stuff here is a complete and utter trainwreck. Please just
delete the patches and start over. Hint: if you use stop_machine(),
you're doing it wrong.

At best you now have the requirements sorted.


[PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-11-17 Thread Joel Fernandes (Google)
From: Josh Don 

Google has a usecase where the first-level tag used to tag a CGroup is not
sufficient. So, a patch has been carried for years in which a second tag is
added that is writeable by unprivileged users.

Google uses DAC controls to make the 'tag' settable only by root, while
the second-level 'color' can be changed by anyone. The actual names that
Google uses are different, but the concept is the same.

The hierarchy looks like:

Root group
   / \
  A   B    (These are created by the root daemon - borglet).
 / \   \
C   D   E  (These are created by AppEngine within the container).

The reason why Google has two parts is that AppEngine wants to allow a subset
of subcgroups within a parent tagged cgroup to share execution. Think of these
subcgroups as belonging to the same customer or project. Because these
subcgroups are created by AppEngine, they are not tracked by borglet (the root
daemon); therefore borglet won't have a chance to set a color for them. That's
where the 'color' file comes from. Color can be set by AppEngine, and once set,
the normal tasks within the subcgroup are not able to overwrite it. This is
enforced by promoting the permission of the color file in cgroupfs.

Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Josh Don 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 120 +++---
 kernel/sched/sched.h  |   2 +
 3 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6fbdb1a204bf..c9efdf8ccdf3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -690,6 +690,7 @@ struct task_struct {
 	unsigned long			core_cookie;
 	unsigned long			core_task_cookie;
 	unsigned long			core_group_cookie;
+	unsigned long			core_color;
 	unsigned int			core_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd75b3d62a97..8f17ec8e993e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9049,9 +9049,6 @@ void sched_offline_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-			       unsigned long *group_cookie_ptr);
-
 static void sched_change_group(struct task_struct *tsk, int type)
 {
 	struct task_group *tg;
@@ -9747,6 +9744,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 struct sched_core_cookie {
 	unsigned long task_cookie;
 	unsigned long group_cookie;
+	unsigned long color;
 
 	struct rb_node node;
 	refcount_t refcnt;
@@ -9782,6 +9780,7 @@ static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
 
 	COOKIE_CMP_RETURN(task_cookie);
 	COOKIE_CMP_RETURN(group_cookie);
+	COOKIE_CMP_RETURN(color);
 
 	/* all cookie fields match */
 	return 0;
@@ -9819,7 +9818,7 @@ static void sched_core_put_cookie(struct sched_core_cookie *cookie)
 
 /*
  * A task's core cookie is a compound structure composed of various cookie
- * fields (task_cookie, group_cookie). The overall core_cookie is
+ * fields (task_cookie, group_cookie, color). The overall core_cookie is
  * a pointer to a struct containing those values. This function either finds
  * an existing core_cookie or creates a new one, and then updates the task's
  * core_cookie to point to it. Additionally, it handles the necessary reference
@@ -9837,6 +9836,7 @@ static void __sched_core_update_cookie(struct task_struct *p)
 	struct sched_core_cookie temp = {
 		.task_cookie	= p->core_task_cookie,
 		.group_cookie	= p->core_group_cookie,
+		.color		= p->core_color
 	};
 	const bool is_zero_cookie =
 		(sched_core_cookie_cmp(&temp, &zero_cookie) == 0);
@@ -9892,6 +9892,7 @@ static void __sched_core_update_cookie(struct task_struct *p)
 
match->task_cookie = temp.task_cookie;
match->group_cookie = temp.group_cookie;
+   match->color = temp.color;
refcount_set(&match->refcnt, 1);
 
rb_link_node(&match->node, parent, node);
@@ -9949,6 +9950,9 @@ static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie
case sched_core_group_cookie_type:
p->core_group_cookie = cookie;
break;
+   case sched_core_color_type:
+   p->core_color = cookie;
+   break;
default:
WARN_ON_ONCE(1);
}
@@ -9967,19 +9971,23 @@ static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie
sched_core_enqueue(task_rq(p), p);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-  unsigned long *group_cookie_ptr);
+void cpu_core_