Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-28 Thread Phil Auld
On Thu, May 28, 2020 at 02:17:19PM -0400 Phil Auld wrote:
> On Thu, May 28, 2020 at 07:01:28PM +0200 Peter Zijlstra wrote:
> > On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > > [..]
> > > > > > > It doesn't allow tasks to form their own groups (by for example
> > > > > > > setting the key to that of another task).
> > > > > > 
> > > > > > So for this, I was thinking of making the prctl pass in an integer. 
> > > > > > And 0
> > > > > > would mean untagged. Does that sound good to you?
> > > > > 
> > > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > > not-sharing. If you tag yourself with another task's TID, you can do
> > > > > ptrace tests to see if you're allowed to observe their junk.
> > > > 
> > > > But that would require a bunch of tasks agreeing on which TID to tag 
> > > > with.
> > > > For example, if 2 tasks tag with each other's TID, then they would have
> > > > different tags and not share.
> > 
> > Well, don't do that then ;-)
> >
> 
> That was a poorly worded example :)
>

Heh, sorry, I thought that was my statement. I do not mean to belittle Joel's
example...  That's a fine example of a totally different problem than I
was thinking of :)


Cheers,
Phil

> The point I was trying to make was more that one TID of a group (not cgroup!)
> of tasks is just an arbitrary value.
> 
> At a single process (or pair rather) level, sure, you can see it as an
> identifier of whom you want to share with, but even then you have to tag
> both processes with this. And it has less meaning when the 'whom' you want
> to share with is multiple tasks.
> 
> > > > What's wrong with passing in an integer instead? In any case, we would 
> > > > do the
> > > > CAP_SYS_ADMIN check to limit who can do it.
> > 
> > So the actual permission model can be different depending on how broken
> > the hardware is.
> > 
> > > > Also, one thing CGroup interface allows is an external process to set 
> > > > the
> > > > cookie, so I am wondering if we should use sched_setattr(2) instead of, 
> > > > or in
> > > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > > completely. How do you feel about that?
> > > >
> > > 
> > > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > > any potential reuse security issues.
> > > 
> > > I think the cgroup interface could be extended not to be a boolean but 
> > > take
> > > the value. With 0 being untagged as now.
> > 
> > How do you avoid reuse in such a huge space? That just creates yet
> > another problem for the kernel to keep track of who is who.
> >
> 
> The kernel doesn't care or have to track anything.  The admin does.
> At the kernel level it's just matching cookies. 
> 
> Tasks A, B, and C can all share a core, so you give them each A's TID as a cookie.
> Task A then exits. Now B and C are using essentially a random value.
> Task D comes along and wants to share with B and C. You have to tag it
> with A's old TID, which has no meaning at this point.
> 
> And if A's TID ever gets reused, the new A' gets to share too. At some
> level aren't those still 32 bits?
> 
> > With random u64 numbers, it even becomes hard to determine if you're
> > sharing at all or not.
> > 
> > Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> > allows leaking kernel privates. But under a less severe threat scenario,
> > say where only user data would be at risk, the ptrace() tests make
> > sense, but those become really hard with random u64 numbers too.
> > 
> > What would the purpose of random u64 values be for cgroups? That only
> > replicates the problem of determining uniqueness there. Then you can get
> > two cgroups unintentionally sharing because you got lucky.
> >
> 
> Seems that would be more flexible for the admin. 
> 
> What if you had two cgroups you wanted to allow to run together?  Or a
> cgroup and a few processes from a different one (say with different
> quotas or something).
> 
> I don't have such use cases so I don't feel that strongly but it seemed
> more flexible and followed the mechanism-in-kernel/policy-in-userspace
> dictum rather than basing the functionality on the implementation details.
> 
> 
> Cheers,
> Phil
> 
> 
> > Also, fundamentally, we cannot have more threads than TID space, it's a
> > natural identifier.
> > 
> 
> -- 

-- 



Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-28 Thread Joel Fernandes
Hi Peter,

On Thu, May 28, 2020 at 07:01:28PM +0200, Peter Zijlstra wrote:
> On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > [..]
> > > > > > It doesn't allow tasks to form their own groups (by for example
> > > > > > setting the key to that of another task).
> > > > > 
> > > > > So for this, I was thinking of making the prctl pass in an integer. 
> > > > > And 0
> > > > > would mean untagged. Does that sound good to you?
> > > > 
> > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > not-sharing. If you tag yourself with another task's TID, you can do
> > > > ptrace tests to see if you're allowed to observe their junk.
> > > 
> > > But that would require a bunch of tasks agreeing on which TID to tag with.
> > > For example, if 2 tasks tag with each other's TID, then they would have
> > > different tags and not share.
> 
> Well, don't do that then ;-)

We could also guard it with a mutex. First task to set the TID wins, the
other thread just reuses the cookie of the TID that won.

But I think we cannot just use the TID value as the cookie, due to TID
wrap-around and reuse; otherwise we could accidentally group 2 tasks. Instead, I
suggest we keep the TID as the interface per your suggestion and do the
needed ptrace checks, but convert the TID to the task_struct pointer value
and use that as the cookie for the group of tasks sharing a core.
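
Roughly something like this (untested sketch; the helper name and the
core_cookie field are placeholders, not the actual patch code):

static int sched_core_share_with(pid_t tid)
{
	struct task_struct *task;

	rcu_read_lock();
	task = tid ? find_task_by_vpid(tid) : current;
	if (!task) {
		rcu_read_unlock();
		return -ESRCH;
	}
	get_task_struct(task);
	rcu_read_unlock();

	/* Are we allowed to observe this task at all? */
	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
		put_task_struct(task);
		return -EPERM;
	}

	/*
	 * The cookie is the task_struct pointer value, not the TID, so a
	 * recycled TID can never silently join an existing group.
	 */
	WRITE_ONCE(current->core_cookie, (unsigned long)task);

	put_task_struct(task);
	return 0;
}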

Thoughts?

thanks,

 - Joel

> > > What's wrong with passing in an integer instead? In any case, we would do 
> > > the
> > > CAP_SYS_ADMIN check to limit who can do it.
> 
> So the actual permission model can be different depending on how broken
> the hardware is.
> 
> > > Also, one thing CGroup interface allows is an external process to set the
> > > cookie, so I am wondering if we should use sched_setattr(2) instead of, 
> > > or in
> > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > completely. How do you feel about that?
> > >
> > 
> > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > any potential reuse security issues.
> > 
> > I think the cgroup interface could be extended not to be a boolean but take
> > the value. With 0 being untagged as now.
> 
> How do you avoid reuse in such a huge space? That just creates yet
> another problem for the kernel to keep track of who is who.
> 
> With random u64 numbers, it even becomes hard to determine if you're
> sharing at all or not.
> 
> Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> allows leaking kernel privates. But under a less severe threat scenario,
> say where only user data would be at risk, the ptrace() tests make
> sense, but those become really hard with random u64 numbers too.
> 
> What would the purpose of random u64 values be for cgroups? That only
> replicates the problem of determining uniqueness there. Then you can get
> two cgroups unintentionally sharing because you got lucky.
> 
> Also, fundamentally, we cannot have more threads than TID space, it's a
> natural identifier.


Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-28 Thread Phil Auld
On Thu, May 28, 2020 at 07:01:28PM +0200 Peter Zijlstra wrote:
> On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > [..]
> > > > > > It doesn't allow tasks to form their own groups (by for example
> > > > > > setting the key to that of another task).
> > > > > 
> > > > > So for this, I was thinking of making the prctl pass in an integer. 
> > > > > And 0
> > > > > would mean untagged. Does that sound good to you?
> > > > 
> > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > not-sharing. If you tag yourself with another task's TID, you can do
> > > > ptrace tests to see if you're allowed to observe their junk.
> > > 
> > > But that would require a bunch of tasks agreeing on which TID to tag with.
> > > For example, if 2 tasks tag with each other's TID, then they would have
> > > different tags and not share.
> 
> Well, don't do that then ;-)
>

That was a poorly worded example :)

The point I was trying to make was more that one TID of a group (not cgroup!)
of tasks is just an arbitrary value.

At a single process (or pair rather) level, sure, you can see it as an
identifier of whom you want to share with, but even then you have to tag
both processes with this. And it has less meaning when the 'whom' you want
to share with is multiple tasks.

> > > What's wrong with passing in an integer instead? In any case, we would do 
> > > the
> > > CAP_SYS_ADMIN check to limit who can do it.
> 
> So the actual permission model can be different depending on how broken
> the hardware is.
> 
> > > Also, one thing CGroup interface allows is an external process to set the
> > > cookie, so I am wondering if we should use sched_setattr(2) instead of, 
> > > or in
> > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > completely. How do you feel about that?
> > >
> > 
> > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > any potential reuse security issues.
> > 
> > I think the cgroup interface could be extended not to be a boolean but take
> > the value. With 0 being untagged as now.
> 
> How do you avoid reuse in such a huge space? That just creates yet
> another problem for the kernel to keep track of who is who.
>

The kernel doesn't care or have to track anything.  The admin does.
At the kernel level it's just matching cookies. 
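
For reference, at pick time the kernel-side check really is just an equality
test on the cookie, something like this sketch (the field name is illustrative
only):

/* Sketch only: can tasks a and b run on the same core? */
static inline bool cookies_match(struct task_struct *a, struct task_struct *b)
{
	return a->core_cookie == b->core_cookie;
}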

Tasks A, B, and C can all share a core, so you give them each A's TID as a cookie.
Task A then exits. Now B and C are using essentially a random value.
Task D comes along and wants to share with B and C. You have to tag it
with A's old TID, which has no meaning at this point.

And if A's TID ever gets reused, the new A' gets to share too. At some
level aren't those still 32 bits?

> With random u64 numbers, it even becomes hard to determine if you're
> sharing at all or not.
> 
> Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> allows leaking kernel privates. But under a less severe threat scenario,
> say where only user data would be at risk, the ptrace() tests make
> sense, but those become really hard with random u64 numbers too.
> 
> What would the purpose of random u64 values be for cgroups? That only
> replicates the problem of determining uniqueness there. Then you can get
> two cgroups unintentionally sharing because you got lucky.
>

Seems that would be more flexible for the admin. 

What if you had two cgroups you wanted to allow to run together?  Or a
cgroup and a few processes from a different one (say with different
quotas or something).

I don't have such use cases so I don't feel that strongly but it seemed
more flexible and followed the mechanism-in-kernel/policy-in-userspace
dictum rather than basing the functionality on the implementation details.


Cheers,
Phil


> Also, fundamentally, we cannot have more threads than TID space, it's a
> natural identifier.
> 

-- 



Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-28 Thread Peter Zijlstra
On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > [..]
> > > > > It doesn't allow tasks to form their own groups (by for example
> > > > > setting the key to that of another task).
> > > > 
> > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > would mean untagged. Does that sound good to you?
> > > 
> > > A TID, I think. If you pass your own TID, you tag yourself as
> > > not-sharing. If you tag yourself with another task's TID, you can do
> > > ptrace tests to see if you're allowed to observe their junk.
> > 
> > But that would require a bunch of tasks agreeing on which TID to tag with.
> > For example, if 2 tasks tag with each other's TID, then they would have
> > different tags and not share.

Well, don't do that then ;-)

> > What's wrong with passing in an integer instead? In any case, we would do 
> > the
> > CAP_SYS_ADMIN check to limit who can do it.

So the actual permission model can be different depending on how broken
the hardware is.

> > Also, one thing CGroup interface allows is an external process to set the
> > cookie, so I am wondering if we should use sched_setattr(2) instead of, or 
> > in
> > addition to, the prctl(2). That way, we can drop the CGroup interface
> > completely. How do you feel about that?
> >
> 
> I think it should be an arbitrary 64bit value, in both interfaces to avoid
> any potential reuse security issues.
> 
> I think the cgroup interface could be extended not to be a boolean but take
> the value. With 0 being untagged as now.

How do you avoid reuse in such a huge space? That just creates yet
another problem for the kernel to keep track of who is who.

With random u64 numbers, it even becomes hard to determine if you're
sharing at all or not.

Now, with the current SMT+MDS trainwreck, any sharing is bad because it
allows leaking kernel privates. But under a less severe threat scenario,
say where only user data would be at risk, the ptrace() tests make
sense, but those become really hard with random u64 numbers too.

What would the purpose of random u64 values be for cgroups? That only
replicates the problem of determining uniqueness there. Then you can get
two cgroups unintentionally sharing because you got lucky.

Also, fundamentally, we cannot have more threads than TID space, it's a
natural identifier.


Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-28 Thread Joel Fernandes
On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > [..]
> > > > > It doesn't allow tasks to form their own groups (by for example
> > > > > setting the key to that of another task).
> > > > 
> > > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > > would mean untagged. Does that sound good to you?
> > > 
> > > A TID, I think. If you pass your own TID, you tag yourself as
> > > not-sharing. If you tag yourself with another task's TID, you can do
> > > ptrace tests to see if you're allowed to observe their junk.
> > 
> > But that would require a bunch of tasks agreeing on which TID to tag with.
> > For example, if 2 tasks tag with each other's TID, then they would have
> > different tags and not share.
> > 
> > What's wrong with passing in an integer instead? In any case, we would do 
> > the
> > CAP_SYS_ADMIN check to limit who can do it.
> > 
> > Also, one thing CGroup interface allows is an external process to set the
> > cookie, so I am wondering if we should use sched_setattr(2) instead of, or 
> > in
> > addition to, the prctl(2). That way, we can drop the CGroup interface
> > completely. How do you feel about that?
> >
> 
> I think it should be an arbitrary 64bit value, in both interfaces to avoid
> any potential reuse security issues. 
> 
> I think the cgroup interface could be extended not to be a boolean but take
> the value. With 0 being untagged as now.
> 
> And sched_setattr could be used to set it on a per task basis.

Yeah, something like this will be needed.

> > > > More seriously, the reason I did it this way is the prctl-tagging is a 
> > > > bit
> > > > incompatible with CGroup tagging:
> > > > 
> > > > 1. What happens if 2 tasks are in a tagged CGroup and one of them 
> > > > changes
> > > > their cookie through prctl? Do they still remain in the tagged CGroup 
> > > > but are
> > > > now going to not trust each other? Do they get removed from the CGroup? 
> > > > This
> > > > is why I made the prctl fail with -EBUSY in such cases.

In util-clamp's design (which has a task-specific attribute and a task-group
attribute), it seems that the priority is the task-specific value first, then
the group one, then the system-wide one.

Perhaps a similar design can be adopted for this interface. So probably we
should let the per-task interface not fail if the task was already in a CGroup,
and rather prioritize its value first before looking at the group one?

Uclamp's comments:

 * The effective clamp bucket index of a task depends on, by increasing
 * priority:
 * - the task specific clamp value, when explicitly requested from userspace
 * - the task group effective clamp value, for tasks not either in the root
 *   group or in an autogroup
 * - the system default clamp value, defined by the sysadmin
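
In code, the precedence I have in mind would look roughly like this (sketch
only; the field and helper names are placeholders, not uclamp's or the
series' actual code):

/* Effective cookie: the task-specific value wins over the group value. */
static unsigned long effective_core_cookie(struct task_struct *p)
{
	if (p->core_cookie)		/* set via prctl()/sched_setattr() */
		return p->core_cookie;
	if (task_group(p)->tagged)	/* set via the cgroup interface */
		return (unsigned long)task_group(p);
	return 0;			/* untagged */
}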

> > > > 
> > > > 2. What happens if 2 tagged tasks with different cookies are added to a
> > > > tagged CGroup? Do we fail the addition of the tasks to the group, or do 
> > > > we
> > > > override their cookie (like I'm doing)?
> > > 
> > > For #2 I think I prefer failure.
> > > 
> > > But having the rationale spelled out in documentation (man-pages for
> > > example) is important.
> > 
> > If we drop the CGroup interface, this would avoid both #1 and #2.
> >
> 
> I believe both are useful.  Personally, I think the per-task setting should
> win over the cgroup tagging. In that case #1 just falls out.

Cool, this is similar to what I mentioned above.

> And #2 pretty
> much as well. Nothing would happen to the tagged task as they were added
> to the cgroup. They'd keep their explicitly assigned tags and everything
> should "just work". There are other reasons to be in a cpu cgroup together
> than just the core scheduling tag.

Well, OK, so there's no reason to fail the addition to a CGroup of a
prctl-tagged task then; we can let it succeed but prioritize the
task-specific attribute over the group-specific one.

> There are a few other edge cases, like if you are in a cgroup but have
> been tagged explicitly with sched_setattr and then get untagged (presumably
> by setting 0): do you get the cgroup tag or just stay untagged? I think based
> on per-task winning you'd stay untagged. I suppose you could move out of and
> back into the cgroup to get the tag reapplied (or maybe the cgroup interface
> could just be reused with the same value to re-tag everyone who's untagged).

If we maintain a task-specific tag and a group-specific tag, then I think
both tags can coexist and the final tag is decided on the priority basis
mentioned above.

So before getting into CGroups, I think we should first develop the task-specific
tagging mechanism like Peter was suggesting, so let us talk about that. I
will reply to the other thread Vineeth started while CC'ing you. In
particular, I like Peter's idea about user land passing a TID to share a 

Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-24 Thread Phil Auld
On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> [..]
> > > > It doesn't allow tasks to form their own groups (by for example setting
> > > > the key to that of another task).
> > > 
> > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > would mean untagged. Does that sound good to you?
> > 
> > A TID, I think. If you pass your own TID, you tag yourself as
> > not-sharing. If you tag yourself with another task's TID, you can do
> > ptrace tests to see if you're allowed to observe their junk.
> 
> But that would require a bunch of tasks agreeing on which TID to tag with.
> For example, if 2 tasks tag with each other's TID, then they would have
> different tags and not share.
> 
> What's wrong with passing in an integer instead? In any case, we would do the
> CAP_SYS_ADMIN check to limit who can do it.
> 
> Also, one thing CGroup interface allows is an external process to set the
> cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> addition to, the prctl(2). That way, we can drop the CGroup interface
> completely. How do you feel about that?
>

I think it should be an arbitrary 64bit value, in both interfaces to avoid
any potential reuse security issues. 

I think the cgroup interface could be extended not to be a boolean but take
the value. With 0 being untagged as now.

And sched_setattr could be used to set it on a per task basis.


> > > > It is also horribly ill defined what it means to 'enable', with whom
> > > > it is allowed to share a core.
> > > 
> > > I couldn't parse this. Do you mean "enabling coresched does not make 
> > > sense if
> > > we don't specify whom to share the core with?"
> > 
> > As a corollary, yes. I mostly meant that a blanket 'enable' doesn't
> > specify a 'who' you're sharing your bits with.
> 
> Yes, ok. I can reword the commit log a bit to make it more clear that we are
> specifying who we can share a core with.
> 
> > > I was just trying to respect the functionality of the CGroup patch in the
> > > coresched series, after all a gentleman named Peter Zijlstra wrote that
> > > patch ;-) ;-).
> > 
> > Yeah, but I think that same guy said that that was a shit interface and
> > only hacked up because it was easy :-)
> 
> Fair enough :-)
> 
> > > More seriously, the reason I did it this way is the prctl-tagging is a bit
> > > incompatible with CGroup tagging:
> > > 
> > > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> > > their cookie through prctl? Do they still remain in the tagged CGroup but 
> > > are
> > > now going to not trust each other? Do they get removed from the CGroup? 
> > > This
> > > is why I made the prctl fail with -EBUSY in such cases.
> > > 
> > > 2. What happens if 2 tagged tasks with different cookies are added to a
> > > tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> > > override their cookie (like I'm doing)?
> > 
> > For #2 I think I prefer failure.
> > 
> > But having the rationale spelled out in documentation (man-pages for
> > example) is important.
> 
> If we drop the CGroup interface, this would avoid both #1 and #2.
>

I believe both are useful.  Personally, I think the per-task setting should
win over the cgroup tagging. In that case #1 just falls out. And #2 pretty
much as well. Nothing would happen to the tagged task as they were added
to the cgroup. They'd keep their explicitly assigned tags and everything
should "just work". There are other reasons to be in a cpu cgroup together
than just the core scheduling tag.

There are a few other edge cases, like if you are in a cgroup but have
been tagged explicitly with sched_setattr and then get untagged (presumably
by setting 0): do you get the cgroup tag or just stay untagged? I think based
on per-task winning you'd stay untagged. I suppose you could move out of and
back into the cgroup to get the tag reapplied (or maybe the cgroup interface
could just be reused with the same value to re-tag everyone who's untagged).



Cheers,
Phil


> thanks,
> 
>  - Joel
> 

-- 



Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-22 Thread Joel Fernandes
On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
[..]
> > > It doesn't allow tasks to form their own groups (by for example setting
> > > the key to that of another task).
> > 
> > So for this, I was thinking of making the prctl pass in an integer. And 0
> > would mean untagged. Does that sound good to you?
> 
> A TID, I think. If you pass your own TID, you tag yourself as
> not-sharing. If you tag yourself with another task's TID, you can do
> ptrace tests to see if you're allowed to observe their junk.

But that would require a bunch of tasks agreeing on which TID to tag with.
For example, if 2 tasks tag with each other's TID, then they would have
different tags and not share.

What's wrong with passing in an integer instead? In any case, we would do the
CAP_SYS_ADMIN check to limit who can do it.

Also, one thing CGroup interface allows is an external process to set the
cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
addition to, the prctl(2). That way, we can drop the CGroup interface
completely. How do you feel about that?

> > > It is also horribly ill defined what it means to 'enable', with whom
> > > it is allowed to share a core.
> > 
> > I couldn't parse this. Do you mean "enabling coresched does not make sense 
> > if
> > we don't specify whom to share the core with?"
> 
> As a corollary, yes. I mostly meant that a blanket 'enable' doesn't
> specify a 'who' you're sharing your bits with.

Yes, ok. I can reword the commit log a bit to make it more clear that we are
specifying who we can share a core with.

> > I was just trying to respect the functionality of the CGroup patch in the
> > coresched series, after all a gentleman named Peter Zijlstra wrote that
> > patch ;-) ;-).
> 
> Yeah, but I think that same guy said that that was a shit interface and
> only hacked up because it was easy :-)

Fair enough :-)

> > More seriously, the reason I did it this way is the prctl-tagging is a bit
> > incompatible with CGroup tagging:
> > 
> > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> > their cookie through prctl? Do they still remain in the tagged CGroup but 
> > are
> > now going to not trust each other? Do they get removed from the CGroup? This
> > is why I made the prctl fail with -EBUSY in such cases.
> > 
> > 2. What happens if 2 tagged tasks with different cookies are added to a
> > tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> > override their cookie (like I'm doing)?
> 
> For #2 I think I prefer failure.
> 
> But having the rationale spelled out in documentation (man-pages for
> example) is important.

If we drop the CGroup interface, this would avoid both #1 and #2.

thanks,

 - Joel



Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-22 Thread Linus Torvalds
On Thu, May 21, 2020 at 2:58 PM Jesse Barnes  wrote:
>
> Expanding on this a little, we're working on a couple of projects that
> should provide results like these for upstream.  One is continuously
> rebasing our upstream backlog onto new kernels for testing purposes
> (the idea here is to make it easier for us to update kernels on
> Chromebooks),

Lovely. Not just for any performance work that comes out of this, but
hopefully this means that we'll also have quick problem reports if
something happens that affects chrome.

There have certainly been issues on the server side of Google where we
made changes (*cough*cgroup*cough*) which didn't make anybody really
blink until years after the fact, which ends up being very
inconvenient when other parts of the community have been using the new
features for years.

> and the second is to drive more stuff into the
> kernelci.org infrastructure.  Given the test environments we have in
> place now, we can probably get results from our continuous rebase
> project first and provide those against -rc releases if that's
> something you'd be interested in.

I think the more automated (or regular, or close-to-upstream)
real-world testing that we get, the better off we are.  We have a
number of regular distributions that track the upstream kernel fairly
closely, so we get a fair amount of coverage for the normal desktop
loads.

And the bots are doing great, but they tend to test very specific
things (in the case of "syzbot" the "specific" thing is obviously
pretty far-ranging, but it's still very small details). And latency
has always been harder to really test (outside of the truly trivial
microbenchmarks), so the fact that it sounds like you're going to test
not only a different environment than the usual distros but have a few
macro-level latency tests just sounds lovely in general.

Let's see how lovely I think it is once you start sending regression reports..

Linus


Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-22 Thread Peter Zijlstra
On Thu, May 21, 2020 at 09:47:05AM -0400, Joel Fernandes wrote:
> Hi Peter,
> Thanks for the comments.
> 
> On Thu, May 21, 2020 at 10:51:22AM +0200, Peter Zijlstra wrote:
> > On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> > > Add a per-thread core scheduling interface which allows a thread to tag
> > > itself and enable core scheduling. Based on discussion at OSPM with
> > > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> > >  1 - enable core scheduling for the task.
> > >  0 - disable core scheduling for the task.
> > 
> > Yeah, so this is a terrible interface :-)
> 
> I tried to keep it simple. You are right, lets make it better.
> 
> > It doesn't allow tasks to form their own groups (by for example setting
> > the key to that of another task).
> 
> So for this, I was thinking of making the prctl pass in an integer. And 0
> would mean untagged. Does that sound good to you?

A TID, I think. If you pass your own TID, you tag yourself as
not-sharing. If you tag yourself with another task's TID, you can do
ptrace tests to see if you're allowed to observe their junk.
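
From userspace that would look something like the below (PR_SCHED_CORE_SHARE
and its value are made up for illustration here, not an existing prctl
command):

#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define PR_SCHED_CORE_SHARE	60	/* placeholder, not a real prctl */

int main(void)
{
	pid_t self = syscall(SYS_gettid);

	/* Tag yourself with your own TID: share a core with nobody. */
	if (prctl(PR_SCHED_CORE_SHARE, self, 0, 0, 0))
		perror("prctl");

	/*
	 * Passing another task's TID instead would be subject to
	 * ptrace-style permission checks in the kernel.
	 */
	return 0;
}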

> > It is also horribly ill defined what it means to 'enable', with whom
> > it is allowed to share a core.
> 
> I couldn't parse this. Do you mean "enabling coresched does not make sense if
> we don't specify whom to share the core with?"

As a corollary, yes. I mostly meant that a blanket 'enable' doesn't
specify a 'who' you're sharing your bits with.

> > OK, so cgroup always wins; why is that a good thing?
> 
> I was just trying to respect the functionality of the CGroup patch in the
> coresched series, after all a gentleman named Peter Zijlstra wrote that
> patch ;-) ;-).

Yeah, but I think that same guy said that that was a shit interface and
only hacked up because it was easy :-)

> More seriously, the reason I did it this way is the prctl-tagging is a bit
> incompatible with CGroup tagging:
> 
> 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> their cookie through prctl? Do they still remain in the tagged CGroup but are
> now going to not trust each other? Do they get removed from the CGroup? This
> is why I made the prctl fail with -EBUSY in such cases.
> 
> 2. What happens if 2 tagged tasks with different cookies are added to a
> tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> override their cookie (like I'm doing)?

For #2 I think I prefer failure.

But having the rationale spelled out in documentation (man-pages for
example) is important.

> > > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > > while improving the camera streaming frame rate by ~3%.
> > 
> > It doesn't consider permissions.
> > 
> > Basically, with the way you guys use it, it should be a CAP_SYS_ADMIN
> > only to enable core-sched.
> 
> True, we were relying on the seccomp sandboxing in ChromeOS to protect the
> prctl but you're right and I fixed it for next revision.

With the TID idea above you get the ptrace tests.


Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-21 Thread Jesse Barnes
On Thu, May 21, 2020 at 1:45 PM Joel Fernandes  wrote:
>
> Hi Linus,
>
> On Thu, May 21, 2020 at 11:31:38AM -0700, Linus Torvalds wrote:
> > On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
> >  wrote:
> > Generally throughput benchmarks are much easier to do, how do you do
> > this latency benchmark, and is it perhaps something that could be run
> > more widely (ie I'm thinking that if it's generic enough and stable
> > enough to be run by some of the performance regression checking
> > robots, it would be a much more interesting test-case than some of the
> > ones they run right now...)
>
> Glad you like it! The metric is calculated with a timestamp of when the
> driver says the key was pressed, up until when the GPU says we've drawn
> pixels in response.
>
> The test mostly just requires the Chrome browser. It opens some
> pre-existing test URLs (a google doc, a window that opens a camera stream and
> another window that decodes video). This metric is already calculated in
> Chrome, we just scrape it from
> chrome://histograms/Event.Latency.EndToEnd.KeyPress.  If you install Chrome,
> you can go to this link and see the histogram.  We open a Google docs window
> and synthetically input keys into it with a camera stream and video decoding
> running in other windows which gives the CPUs a good beating. Then we collect
> roughly the 90th percentile keypress latency from the above histogram and the
> camera and decoded video's FPS, among other things. There is a test in the
> works that my colleagues are writing to run the full Google hangout video
> chatting stack to stress the system more (versus just the camera stream).  I
> guess if the robots can somehow input keys into the Google docs and open the
> right windows, then it is just a matter of scraping the histogram.

Expanding on this a little, we're working on a couple of projects that
should provide results like these for upstream.  One is continuously
rebasing our upstream backlog onto new kernels for testing purposes
(the idea here is to make it easier for us to update kernels on
Chromebooks), and the second is to drive more stuff into the
kernelci.org infrastructure.  Given the test environments we have in
place now, we can probably get results from our continuous rebase
project first and provide those against -rc releases if that's
something you'd be interested in.  Going forward, I hope we can
extract several of our tests and put them into kernelci as well, so we
get more general coverage without the potential impact of our (still
somewhat large) upstream backlog of patches.

To Joel's point, there are a few changes we'll have to make to get
similar results outside of our environment, but I think that's doable
without a ton of work.  And if anyone is curious, I think most of this
stuff is already public in the tast and autotest repos of the
chromiumos tree.  Just let us know if you want to make changes or port
to another environment so we can try to stay in sync wrt new features,
etc.

Thanks,
Jesse


Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-21 Thread Joel Fernandes
Hi Linus,

On Thu, May 21, 2020 at 11:31:38AM -0700, Linus Torvalds wrote:
> On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
>  wrote:
> >
> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
> 
> I'm assuming this is "compared to SMT disabled"?

Yes this is compared to SMT disabled, I'll improve the commit message.

> What is the cost compared to "SMT enabled but no core scheduling"?

With SMT enabled and no core scheduling, it is around 40ms in the higher
percentiles. Also, one more thing I wanted to mention: this is the 90th
percentile.

> But the real reason I'm piping up is that your  latency benchmark
> sounds very cool.
> 
> Generally throughput benchmarks are much easier to do, how do you do
> this latency benchmark, and is it perhaps something that could be run
> more widely (ie I'm thinking that if it's generic enough and stable
> enough to be run by some of the performance regression checking
> robots, it would be a much more interesting test-case than some of the
> ones they run right now...)

Glad you like it! The metric is calculated with a timestamp of when the
driver says the key was pressed, up until when the GPU says we've drawn
pixels in response.

The test mostly just requires the Chrome browser. It opens some
pre-existing test URLs (a google doc, a window that opens a camera stream and
another window that decodes video). This metric is already calculated in
Chrome, we just scrape it from
chrome://histograms/Event.Latency.EndToEnd.KeyPress.  If you install Chrome,
you can go to this link and see the histogram.  We open a Google docs window
and synthetically input keys into it with a camera stream and video decoding
running in other windows which gives the CPUs a good beating. Then we collect
roughly the 90th percentile keypress latency from the above histogram and the
camera and decoded video's FPS, among other things. There is a test in the
works that my colleagues are writing to run the full Google hangout video
chatting stack to stress the system more (versus just the camera stream).  I
guess if the robots can somehow input keys into the Google docs and open the
right windows, then it is just a matter of scraping the histogram.

> I'm looking at that "threaded phoronix gzip performance regression"
> thread due to a totally unrelated scheduling change ("sched/fair:
> Rework load_balance()"), and then I see this thread and my reaction is
> "the keypress latency thing sounds like a much more interesting
> performance test than threaded gzip from clear linux".
> 
> But the threaded gzip test is presumably trivial to script, while your
> latency test is perhaps very specific to one particular platform and
> setup?

Yes, it is specifically ChromeOS running on a Pixelbook with a 7th Gen
Intel Core i7 with 4 hardware threads.
https://store.google.com/us/product/google_pixelbook

I could try to make it a synthetic test, but it might be difficult for a robot
to run it if it does not have graphics support and a camera connected to it;
it would then need a fake/emulated camera. These robots run
Linux in a non-GUI environment in QEMU instances, right?

thanks,

 - Joel



Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-21 Thread Vineeth Remanan Pillai
On Thu, May 21, 2020 at 9:47 AM Joel Fernandes  wrote:
>
> > It doesn't allow tasks to form their own groups (by for example setting
> > the key to that of another task).
>
> So for this, I was thinking of making the prctl pass in an integer. And 0
> would mean untagged. Does that sound good to you?
>
On a similar note, Joel and I were discussing prctl and it came up
that there is no mechanism to set the cookie from outside a process using
prctl(2). So, another option we could consider is to use sched_setattr(2)
and expand sched_attr to accommodate a u64 cookie. The user could pass in a
cookie to explicitly set it and also use the same cookie for grouping.

Haven't prototyped it yet. Will need to dig deeper and see what it would
really look like.
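
To make the shape of it concrete, something along these lines (purely a
sketch of the idea, not a worked-out ABI; the new field name is made up):

#include <linux/types.h>

struct sched_attr {
	__u32 size;
	__u32 sched_policy;
	__u64 sched_flags;
	/* SCHED_NORMAL, SCHED_BATCH */
	__s32 sched_nice;
	/* SCHED_FIFO, SCHED_RR */
	__u32 sched_priority;
	/* SCHED_DEADLINE */
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
	/* Utilization hints */
	__u32 sched_util_min;
	__u32 sched_util_max;
	/* New: core scheduling cookie, 0 == untagged */
	__u64 sched_core_cookie;
};

An external process could then call sched_setattr(2) on the target TID with a
new SCHED_FLAG_* bit set to apply just the cookie.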

Thanks,
Vineeth


Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-21 Thread Linus Torvalds
On Wed, May 20, 2020 at 3:26 PM Joel Fernandes (Google)
 wrote:
>
> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.

I'm assuming this is "compared to SMT disabled"?

What is the cost compared to "SMT enabled but no core scheduling"?

But the real reason I'm piping up is that your  latency benchmark
sounds very cool.

Generally throughput benchmarks are much easier to do, how do you do
this latency benchmark, and is it perhaps something that could be run
more widely (ie I'm thinking that if it's generic enough and stable
enough to be run by some of the performance regression checking
robots, it would be a much more interesting test-case than some of the
ones they run right now...)

I'm looking at that "threaded phoronix gzip performance regression"
thread due to a totally unrelated scheduling change ("sched/fair:
Rework load_balance()"), and then I see this thread and my reaction is
"the keypress latency thing sounds like a much more interesting
performance test than threaded gzip from clear linux".

But the threaded gzip test is presumably trivial to script, while your
latency test is perhaps very specific to one particular platform and
setup?

   Linus


Re: [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail)

2020-05-21 Thread Joel Fernandes
On Thu, May 21, 2020 at 04:09:50AM +, benbjiang(蒋彪) wrote:
> 
> 
> > On May 21, 2020, at 6:26 AM, Joel Fernandes (Google) 
> >  wrote:
> > 
> > Add a per-thread core scheduling interface which allows a thread to tag
> > itself and enable core scheduling. Based on discussion at OSPM with
> > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> > 1 - enable core scheduling for the task.
> > 0 - disable core scheduling for the task.
> > 
> > Special cases:
> > (1)
> > The core-scheduling patchset contains a CGroup interface as well. In
> > order for us to respect users of that interface, we avoid overriding the
> > tag if a task was CGroup-tagged because the task becomes inconsistent
> > with the CGroup tag. Instead return -EBUSY.
> > 
> > (2)
> > If a task is prctl-tagged, allow the CGroup interface to override
> > the task's tag.
> > 
> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
> Hi,
> Are the performance improvements compared to the hyperthreading disabled 
> scenario or not?
> Could you help to explain how the keypress latency improvement comes with 
> core-scheduling?

Hi Jiang,

The keypress end-to-end latency metric we have is calculated from when the
keypress is registered in hardware to when a character is drawn on the
screen. This involves several parties, including the GPU and browser
processes, which are all running in the same trust domain and benefit from
parallelism through hyperthreading.

thanks,

 - Joel



Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-21 Thread Joel Fernandes
Hi Peter,
Thanks for the comments.

On Thu, May 21, 2020 at 10:51:22AM +0200, Peter Zijlstra wrote:
> On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> > Add a per-thread core scheduling interface which allows a thread to tag
> > itself and enable core scheduling. Based on discussion at OSPM with
> > maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> >  1 - enable core scheduling for the task.
> >  0 - disable core scheduling for the task.
> 
> Yeah, so this is a terrible interface :-)

I tried to keep it simple. You are right, lets make it better.

> It doesn't allow tasks to form their own groups (by for example setting
> the key to that of another task).

So for this, I was thinking of making the prctl pass in an integer. And 0
would mean untagged. Does that sound good to you?

> It is also horribly ill defined what it means to 'enable', with whom
> it is allowed to share a core.

I couldn't parse this. Do you mean "enabling coresched does not make sense if
we don't specify whom to share the core with?"

> > Special cases:
> 
> > (1)
> > The core-scheduling patchset contains a CGroup interface as well. In
> > order for us to respect users of that interface, we avoid overriding the
> > tag if a task was CGroup-tagged because the task becomes inconsistent
> > with the CGroup tag. Instead return -EBUSY.
> > 
> > (2)
> > If a task is prctl-tagged, allow the CGroup interface to override
> > the task's tag.
> 
> OK, so cgroup always wins; why is that a good thing?

I was just trying to respect the functionality of the CGroup patch in the
coresched series, after all a gentleman named Peter Zijlstra wrote that
patch ;-) ;-).

More seriously, the reason I did it this way is that prctl-tagging is a bit
incompatible with CGroup tagging:

1. What happens if 2 tasks are in a tagged CGroup and one of them changes
their cookie through prctl? Do they still remain in the tagged CGroup but are
now going to not trust each other? Do they get removed from the CGroup? This
is why I made the prctl fail with -EBUSY in such cases.

2. What happens if 2 tagged tasks with different cookies are added to a
tagged CGroup? Do we fail the addition of the tasks to the group, or do we
override their cookie (like I'm doing)?
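
To make (1) above concrete, the check in the prctl path amounts to something
like this (sketch; the field and helper names are placeholders, not the
actual patch code):

static int sched_core_prctl_set_cookie(struct task_struct *p, unsigned long cookie)
{
	/* (1): the task is already tagged through its CGroup, so refuse. */
	if (task_group(p)->tagged)
		return -EBUSY;

	p->core_cookie = cookie;
	return 0;
}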

> > ChromeOS will use core-scheduling to securely enable hyperthreading.
> > This cuts down the keypress latency in Google docs from 150ms to 50ms
> > while improving the camera streaming frame rate by ~3%.
> 
> It doesn't consider permissions.
> 
> Basically, with the way you guys use it, it should be a CAP_SYS_ADMIN
> only to enable core-sched.

True, we were relying on the seccomp sandboxing in ChromeOS to protect the
prctl but you're right and I fixed it for next revision.

> That also means we should very much default to disable.

This is how it is already.

thanks,

 - Joel



Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-21 Thread Peter Zijlstra
On Wed, May 20, 2020 at 06:26:42PM -0400, Joel Fernandes (Google) wrote:
> Add a per-thread core scheduling interface which allows a thread to tag
> itself and enable core scheduling. Based on discussion at OSPM with
> maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
>  1 - enable core scheduling for the task.
>  0 - disable core scheduling for the task.

Yeah, so this is a terrible interface :-)

It doesn't allow tasks to form their own groups (by for example setting
the key to that of another task).

It is also horribly ill defined what it means to 'enable', with whom
it is allowed to share a core.

> Special cases:

> (1)
> The core-scheduling patchset contains a CGroup interface as well. In
> order for us to respect users of that interface, we avoid overriding the
> tag if a task was CGroup-tagged because the task becomes inconsistent
> with the CGroup tag. Instead return -EBUSY.
> 
> (2)
> If a task is prctl-tagged, allow the CGroup interface to override
> the task's tag.

OK, so cgroup always wins; why is that a good thing?

> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.

It doesn't consider permissions.

Basically, with the way you guys use it, it should be CAP_SYS_ADMIN-only
to enable core-sched.

That also means we should very much default to disable.


Re: [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail)

2020-05-20 Thread 蒋彪



> On May 21, 2020, at 6:26 AM, Joel Fernandes (Google)  
> wrote:
> 
> Add a per-thread core scheduling interface which allows a thread to tag
> itself and enable core scheduling. Based on discussion at OSPM with
> maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
> 1 - enable core scheduling for the task.
> 0 - disable core scheduling for the task.
> 
> Special cases:
> (1)
> The core-scheduling patchset contains a CGroup interface as well. In
> order for us to respect users of that interface, we avoid overriding the
> tag if a task was CGroup-tagged because the task becomes inconsistent
> with the CGroup tag. Instead return -EBUSY.
> 
> (2)
> If a task is prctl-tagged, allow the CGroup interface to override
> the task's tag.
> 
> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.
Hi,
Are the performance improvements compared to the hyperthreading-disabled
scenario or not?
Could you help explain how the keypress latency improvement comes with
core-scheduling?

Thanks a lot.

Regards,
Jiang