Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-03-01 Thread Josh Don
On Fri, Feb 26, 2021 at 12:08 PM Chris Hyser  wrote:
>
> Update:
>
> The clone syscall stress test I have is causing a deadlock with this patchset
> when compiled with CONFIG_PROVE_RAW_LOCK_NESTING=y. I am able to get
> stacktraces with nmi_watchdog and am looking through those. Josh was not able
> to duplicate the deadlock, instead getting an actual warning about kmalloc
> w/GFP_ATOMIC while under a raw spinlock in the function
> __sched_core_update_cookie(). When I retry the test with a patch Josh sent,
> removing the kmalloc() and thus the trigger of the warning, no more deadlock.
> So some combination of CONFIGs, timing and function calls seems to have found
> something in this lock-proving code.
>
> -chrish

I have a patch to remove the dynamic allocation and overall simplify
the logic here. Will squash that along with the rest of the
modifications suggested by Peter in the next version.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-26 Thread Chris Hyser

On 2/24/21 10:47 AM, chris hyser wrote:

On 2/24/21 8:52 AM, chris hyser wrote:

On 2/24/21 8:02 AM, Chris Hyser wrote:


However, it means that overall throughput of your binary is cut in
~half, since none of the threads can share a core. Note that I never
saw an indefinite deadlock, just ~2x runtime for your binary vs the
control. I've verified that both a) manually hardcoding all threads to
be able to share regardless of cookie, and b) using a machine with 6
cores instead of 2, both allow your binary to complete in the same
amount of time as without the new API.


This was on a 24 core box. When I run the test, it definitely hangs. I'll answer
your other email as well.



I just want to clarify. The test completes in secs normally. When I run this on the 24 core box from the console, 
other ssh connections immediately freeze. The console is frozen. You can't ping the box and it has stayed that way for 
up to 1/2 hour before I reset it. I'm trying to get some kind of stack trace to see what it is doing. To the extent 
that I've been able to trace it or print it, the "next code" always seems to be __sched_core_update_cookie(p);


I cannot duplicate this on a 4 core box even with 1000's of processes and threads. The 24 core box does not even create 
the full 400 processes/threads in that test before it hangs and that test reliably fails almost instantly. The working 
theory is that the 24 core box is doing way more of the clone syscalls in parallel.


Update:

The clone syscall stress test I have is causing a deadlock with this patchset
when compiled with CONFIG_PROVE_RAW_LOCK_NESTING=y. I am able to get stacktraces
with nmi_watchdog and am looking through those. Josh was not able to duplicate the
deadlock, instead getting an actual warning about kmalloc w/GFP_ATOMIC while
under a raw spinlock in the function __sched_core_update_cookie(). When I retry
the test with a patch Josh sent, removing the kmalloc() and thus the trigger of
the warning, no more deadlock. So some combination of CONFIGs, timing and
function calls seems to have found something in this lock-proving code.

-chrish


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-24 Thread chris hyser

On 2/24/21 8:52 AM, chris hyser wrote:

On 2/24/21 8:02 AM, Chris Hyser wrote:


However, it means that overall throughput of your binary is cut in
~half, since none of the threads can share a core. Note that I never
saw an indefinite deadlock, just ~2x runtime for your binary vs the
control. I've verified that both a) manually hardcoding all threads to
be able to share regardless of cookie, and b) using a machine with 6
cores instead of 2, both allow your binary to complete in the same
amount of time as without the new API.


This was on a 24 core box. When I run the test, it definitely hangs. I'll answer
your other email as well.



I just want to clarify. The test completes in secs normally. When I run this on the 24 core box from the console, other 
ssh connections immediately freeze. The console is frozen. You can't ping the box and it has stayed that way for up to 
1/2 hour before I reset it. I'm trying to get some kind of stack trace to see what it is doing. To the extent that I've 
been able to trace it or print it, the "next code" always seems to be __sched_core_update_cookie(p);


I cannot duplicate this on a 4 core box even with 1000's of processes and threads. The 24 core box does not even create 
the full 400 processes/threads in that test before it hangs and that test reliably fails almost instantly. The working 
theory is that the 24 core box is doing way more of the clone syscalls in parallel.


-chrish


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-24 Thread chris hyser

On 2/24/21 8:02 AM, Chris Hyser wrote:


However, it means that overall throughput of your binary is cut in
~half, since none of the threads can share a core. Note that I never
saw an indefinite deadlock, just ~2x runtime for your binary vs the
control. I've verified that both a) manually hardcoding all threads to
be able to share regardless of cookie, and b) using a machine with 6
cores instead of 2, both allow your binary to complete in the same
amount of time as without the new API.


This was on a 24 core box. When I run the test, it definitely hangs. I'll answer
your other email as well.



I just want to clarify. The test completes in secs normally. When I run this on the 24 core box from the console, other 
ssh connections immediately freeze. The console is frozen. You can't ping the box and it has stayed that way for up to 
1/2 hour before I reset it. I'm trying to get some kind of stack trace to see what it is doing. To the extent that I've 
been able to trace it or print it, the "next code" always seems to be __sched_core_update_cookie(p);


-chrish






Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-24 Thread Chris Hyser




On 2/24/21 12:15 AM, Josh Don wrote:

On Tue, Feb 23, 2021 at 11:26 AM Chris Hyser  wrote:


On 2/23/21 4:05 AM, Peter Zijlstra wrote:

On Mon, Feb 22, 2021 at 11:00:37PM -0500, Chris Hyser wrote:

On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:
While trying to test the new prctl() code I'm working on, I ran into a bug I
chased back into this v10 code. Under a fair amount of stress, when the
function __sched_core_update_cookie() is ultimately called from
sched_core_fork(), the system deadlocks or otherwise non-visibly crashes.
I've not had much success figuring out why/what. I'm running with LOCKDEP on
and seeing no complaints. Duplicating it only requires setting a cookie on a
task and forking a bunch of threads ... all of which then want to update
their cookie.


Can you share the code and reproducer?


Attached is a tarball with c code (source) and scripts. Just run ./setup_bug
which will compile the source and start a bash with a cs cookie. Then run
./show_bug which dumps the cookie and then fires off some processes and
threads. Note the cs_clone command is not doing any core sched prctls for this
test (not needed and currently coded for a diff prctl interface). It just
creates processes and threads. I see this hang almost instantly.

Josh, I did verify that this occurs on Joel's coresched tree both with and w/o
the kprot patch and that should exactly correspond to these patches.

-chrish



I think I've gotten to the root of this. In the fork code, our cases
for inheriting task_cookie are inverted for CLONE_THREAD vs
!CLONE_THREAD. As a result, we are creating a new cookie per-thread,
rather than inheriting from the parent. Now this is actually ok; I'm
not observing a scalability problem with creating this many cookies.


This isn't the issue. The test code generates cases for both CLONE_THREAD and not, and both paths call the cookie update 
code. The new code I was testing when I discovered this fixed the problem you noted.




However, it means that overall throughput of your binary is cut in
~half, since none of the threads can share a core. Note that I never
saw an indefinite deadlock, just ~2x runtime for your binary vs the
control. I've verified that both a) manually hardcoding all threads to
be able to share regardless of cookie, and b) using a machine with 6
cores instead of 2, both allow your binary to complete in the same
amount of time as without the new API.


This was on a 24 core box. When I run the test, it definitely hangs. I'll answer
your other email as well.

-chrish


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-23 Thread Josh Don
On Tue, Feb 23, 2021 at 11:26 AM Chris Hyser  wrote:
>
> On 2/23/21 4:05 AM, Peter Zijlstra wrote:
> > On Mon, Feb 22, 2021 at 11:00:37PM -0500, Chris Hyser wrote:
> >> On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:
> >> While trying to test the new prctl() code I'm working on, I ran into a bug
> >> I chased back into this v10 code. Under a fair amount of stress, when the
> >> function __sched_core_update_cookie() is ultimately called from
> >> sched_core_fork(), the system deadlocks or otherwise non-visibly crashes.
> >> I've not had much success figuring out why/what. I'm running with LOCKDEP
> >> on and seeing no complaints. Duplicating it only requires setting a cookie
> >> on a task and forking a bunch of threads ... all of which then want to
> >> update their cookie.
> >
> > Can you share the code and reproducer?
>
> Attached is a tarball with c code (source) and scripts. Just run ./setup_bug
> which will compile the source and start a bash with a cs cookie. Then run
> ./show_bug which dumps the cookie and then fires off some processes and
> threads. Note the cs_clone command is not doing any core sched prctls for
> this test (not needed and currently coded for a diff prctl interface). It
> just creates processes and threads. I see this hang almost instantly.
>
> Josh, I did verify that this occurs on Joel's coresched tree both with and
> w/o the kprot patch and that should exactly correspond to these patches.
>
> -chrish
>

I think I've gotten to the root of this. In the fork code, our cases
for inheriting task_cookie are inverted for CLONE_THREAD vs
!CLONE_THREAD. As a result, we are creating a new cookie per-thread,
rather than inheriting from the parent. Now this is actually ok; I'm
not observing a scalability problem with creating this many cookies.
However, it means that overall throughput of your binary is cut in
~half, since none of the threads can share a core. Note that I never
saw an indefinite deadlock, just ~2x runtime for your binary vs the
control. I've verified that both a) manually hardcoding all threads to
be able to share regardless of cookie, and b) using a machine with 6
cores instead of 2, both allow your binary to complete in the same
amount of time as without the new API.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-23 Thread Chris Hyser

On 2/23/21 4:05 AM, Peter Zijlstra wrote:

On Mon, Feb 22, 2021 at 11:00:37PM -0500, Chris Hyser wrote:

On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:
While trying to test the new prctl() code I'm working on, I ran into a bug I
chased back into this v10 code. Under a fair amount of stress, when the
function __sched_core_update_cookie() is ultimately called from
sched_core_fork(), the system deadlocks or otherwise non-visibly crashes.
I've not had much success figuring out why/what. I'm running with LOCKDEP on
and seeing no complaints. Duplicating it only requires setting a cookie on a
task and forking a bunch of threads ... all of which then want to update
their cookie.


Can you share the code and reproducer?


Attached is a tarball with c code (source) and scripts. Just run ./setup_bug which will compile the source and start a 
bash with a cs cookie. Then run ./show_bug which dumps the cookie and then fires off some processes and threads. Note 
the cs_clone command is not doing any core sched prctls for this test (not needed and currently coded for a diff prctl 
interface). It just creates processes and threads. I see this hang almost instantly.


Josh, I did verify that this occurs on Joel's coresched tree both with and w/o the kprot patch and that should exactly 
correspond to these patches.


-chrish



bug.tar.xz
Description: application/xz


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-23 Thread Peter Zijlstra
On Mon, Feb 22, 2021 at 11:00:37PM -0500, Chris Hyser wrote:
> On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:
> While trying to test the new prctl() code I'm working on, I ran into a bug I
> chased back into this v10 code. Under a fair amount of stress, when the
> function __sched_core_update_cookie() is ultimately called from
> sched_core_fork(), the system deadlocks or otherwise non-visibly crashes.
> I've not had much success figuring out why/what. I'm running with LOCKDEP on
> and seeing no complaints. Duplicating it only requires setting a cookie on a
> task and forking a bunch of threads ... all of which then want to update
> their cookie.

Can you share the code and reproducer?


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-22 Thread Chris Hyser

On 1/22/21 8:17 PM, Joel Fernandes (Google) wrote:

+static void __sched_core_update_cookie(struct task_struct *p)
+{
+	struct rb_node *parent, **node;
+	struct sched_core_cookie *node_core_cookie, *match;
+	static const struct sched_core_cookie zero_cookie;
+	struct sched_core_cookie requested_cookie;
+	bool is_zero_cookie;
+	struct sched_core_cookie *const curr_cookie =
+		(struct sched_core_cookie *)p->core_cookie;
+
+	/*
+	 * Ensures that we do not cause races/corruption by modifying/reading
+	 * task cookie fields. Also ensures task cannot get migrated.
+	 */
+	lockdep_assert_held(rq_lockp(task_rq(p)));
+
+	sched_core_cookie_init_from_task(&requested_cookie, p);
+
+	is_zero_cookie = !sched_core_cookie_cmp(&requested_cookie, &zero_cookie);
+
+	/*
+	 * Already have a cookie matching the requested settings? Nothing to
+	 * do.
+	 */
+	if ((curr_cookie && !sched_core_cookie_cmp(curr_cookie, &requested_cookie)) ||
+	    (!curr_cookie && is_zero_cookie))
+		return;
+
+	raw_spin_lock(&sched_core_cookies_lock);
+
+	if (is_zero_cookie) {
+		match = NULL;
+		goto finish;
+	}
+
+retry:
+	match = NULL;
+
+	node = &sched_core_cookies.rb_node;
+	parent = *node;
+	while (*node) {
+		int cmp;
+
+		node_core_cookie =
+			container_of(*node, struct sched_core_cookie, node);
+		parent = *node;
+
+		cmp = sched_core_cookie_cmp(&requested_cookie, node_core_cookie);
+		if (cmp < 0) {
+			node = &parent->rb_left;
+		} else if (cmp > 0) {
+			node = &parent->rb_right;
+		} else {
+			match = node_core_cookie;
+			break;
+		}
+	}
+
+	if (!match) {
+		/* No existing cookie; create and insert one */
+		match = kmalloc(sizeof(struct sched_core_cookie), GFP_ATOMIC);
+
+		/* Fall back to zero cookie */
+		if (WARN_ON_ONCE(!match))
+			goto finish;
+
+		*match = requested_cookie;
+		refcount_set(&match->refcnt, 1);
+
+		rb_link_node(&match->node, parent, node);
+		rb_insert_color(&match->node, &sched_core_cookies);
+	} else {
+		/*
+		 * Cookie exists, increment refcnt. If refcnt is currently 0,
+		 * we're racing with a put() (refcnt decremented but cookie not
+		 * yet removed from the tree). In this case, we can simply
+		 * perform the removal ourselves and retry.
+		 * sched_core_put_cookie() will still function correctly.
+		 */
+		if (unlikely(!refcount_inc_not_zero(&match->refcnt))) {
+			__sched_core_erase_cookie(match);
+			goto retry;
+		}
+	}
+
+finish:
+	p->core_cookie = (unsigned long)match;
+
+	raw_spin_unlock(&sched_core_cookies_lock);
+
+	sched_core_put_cookie(curr_cookie);
+}
+
+/*
+ * sched_core_update_cookie - Common helper to update a task's core cookie.
+ * This updates the selected cookie field and then updates the overall cookie.
+ * @p: The task whose cookie should be updated.
+ * @cookie: The new cookie.
+ * @cookie_type: The cookie field to which the cookie corresponds.
+ */
+static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
+				     enum sched_core_cookie_type cookie_type)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (!p)
+		return;
+
+	rq = task_rq_lock(p, &rf);
+
+	switch (cookie_type) {
+	case sched_core_task_cookie_type:
+		p->core_task_cookie = cookie;
+		break;
+	case sched_core_group_cookie_type:
+		p->core_group_cookie = cookie;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+	}
+
+	/* Set p->core_cookie, which is the overall cookie */
+	__sched_core_update_cookie(p);


While trying to test the new prctl() code I'm working on, I ran into a bug I chased back into this v10 code. Under a 
fair amount of stress, when the function __sched_core_update_cookie() is ultimately called from sched_core_fork(), the 
system deadlocks or otherwise non-visibly crashes. I've not had much success figuring out why/what. I'm running with 
LOCKDEP on and seeing no complaints. Duplicating it only requires setting a cookie on a task and forking a bunch of 
threads ... all of which then want to update their cookie.


-chrish


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-05 Thread Chris Hyser

On 2/5/21 5:43 AM, Peter Zijlstra wrote:

On Thu, Feb 04, 2021 at 03:52:55PM -0500, Chris Hyser wrote:


A second complication was a decision that new processes (not threads) do not
inherit their parents cookie. Thus forking is also not a means to share a
cookie. Basically with a "from-only" interface, the new task would need to
be modified to call prctl() itself. From-only also does not allow for
setting a cookie on an unmodified already running task. This can be fixed by
providing both a "to" and "from" sharing interface that allows helper
programs to construct arbitrary configurations from unmodified programs.


Do we really want to inhibit on fork() or would exec() be a better
place? What about those applications that use fork() based workers?


I like exec-time as a fan of fork-based workers. I suppose the counter argument
would be security, but the new process is in a state to be trusted to lower its
privileges, change permissions, core sched cookie etc before it makes itself
differently vulnerable and because it is the same code, it "knows" if it did.



Also, how do I set a unique cookie on myself with this interface?


The v10 patch still uses the overloaded v9 mechanism (which as mentioned
above is if two tasks w/o cookies share a cookie they get a new shared
unique cookie). Yes, that is clearly an inconsistency and kludgy. The
mechanism is documented in the docs, but clearly not obvious from the


I've not seen a document so far (also, I'm not one to actually read
documentation, much preferring comments and Changelogs).


Understood. I think we should split this patch into separate prctl and cgroup patches. The rationale decided here would 
then go into the prctl patch commit message. We can also use this split to address any dependencies we've created on 
cgroups that you mentioned in a different email.



So based on the above, how about we add a "create" to pair with "clear" and
call it "create" vs "set" since we are creating a unique opaque cookie
versus setting a particular value. And as mentioned, because one can't
specify a cookie directly but only thru sharing relationships, we need both
"to" and "from" to make it completely usable.

So we end up with something like this:
 PR_SCHED_CORE_CREATE   -- give yourself a unique cookie
 PR_SCHED_CORE_CLEAR-- clear your core sched cookie
 PR_SCHED_CORE_SHARE_FROM -- get their cookie for you
 PR_SCHED_CORE_SHARE_TO  -- push your cookie to them


I'm still wondering why we need _FROM/_TO. What exactly will we miss
with just _SHARE ?

	current		arg_task
	cookie		cookie		-EDAFT
	!cookie		cookie		current gets cookie
	cookie		!cookie		arg_task gets cookie
	!cookie		!cookie		-EDAFTER

(I have a suspicion, but I want to see it spelled out).


The min requirements of the interface I see are:

1. create a my own cookie
2. clear my own cookie
3. create a cookie for a running unmodified task
4. clear a cookie for a running unmodified task
5. copy a cookie from one running unmodified task to another unmodified task

So from your example above:
>  current gets cookie

could also mean clear the cookie of arg_task and satisfy requirement 4 above.

"Share" is a fuzzy term. I should have used COPY as that is more the semantics I was thinking ... specified directional 
transfer. So we could have a single command with two arguments where argument order determines direction. In the v10 
interface proposal, as one argument, current, was always implied, direction had to be specified in the command.


So a single copy command could be something like:

PR_SCHED_CORE_COPY

to replace the two. The very first util you write to do any thing useful w/ all
of this is a "copy_cookie <from_pid> <to_pid>". :-)



Also, do we wants this interface to be able to work on processes? Things
like fcntl(F_SETOWN_EX) allow you to specify a PID type.


Yes and creating/clearing a cookie on a PGID and SID seem like useful shortcuts 
to look into.


An additional question is should the inheritability of a process' cookie be
configurable? The current code gives the child process their own unique
cookie if the parent had a cookie. That is useful in some cases, but many
other configurations could be made much easier with inheritance.


What was the argument for not following the traditional fork() semantics
and inheriting everything?


The code just showed up with no explanation :-), but I think I know what was intended and it touches on the same 
security policy type problem you mentioned in a comment on the CLEAR code. In a secure context, you can't just allow a 
random user to clear their cookie, i.e. make themselves trusted. At the same time, in a non-secure context, and several 
use cases have been put forward, I can't think of anything more annoying than being able to set a cookie on my task and 
then not having permission to clear it.


The fork scenario has a similar 

Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-05 Thread Josh Don
On Fri, Feb 5, 2021 at 3:53 AM Peter Zijlstra  wrote:
>
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
>
> > +/* All active sched_core_cookies */
> > +static struct rb_root sched_core_cookies = RB_ROOT;
> > +static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
>
> > +/*
> > + * Returns the following:
> > + * a < b  => -1
> > + * a == b => 0
> > + * a > b  => 1
> > + */
> > +static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
> > +  const struct sched_core_cookie *b)
> > +{
> > +#define COOKIE_CMP_RETURN(field) do {\
> > + if (a->field < b->field)\
> > + return -1;  \
> > + else if (a->field > b->field)   \
> > + return 1;   \
> > +} while (0)  \
> > +
> > + COOKIE_CMP_RETURN(task_cookie);
> > + COOKIE_CMP_RETURN(group_cookie);
> > +
> > + /* all cookie fields match */
> > + return 0;
> > +
> > +#undef COOKIE_CMP_RETURN
> > +}
>
> AFAICT all this madness exists because cgroup + task interaction, yet
> none of that code is actually dependent on cgroups being on.
>
> So this seems to implement semantics that will make two tasks that share
> a cookie, but are then placed in different cgroups not actually share.
>
> Is that desired? Can we justify these semantics and the resulting code
> complexity.

Yes that is the desired result. IMO it is less optimal from an
interface perspective if we were to instead have group or task cookie
override the other. Joel gave some additional justification here:
https://lkml.org/lkml/2020/12/6/389.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-05 Thread Joel Fernandes
On Thu, Feb 04, 2021 at 03:52:53PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> > +static void sched_core_update_cookie(struct task_struct *p, unsigned long 
> > cookie,
> > +enum sched_core_cookie_type cookie_type)
> > +{
> > +   struct rq_flags rf;
> > +   struct rq *rq;
> > +
> > +   if (!p)
> > +   return;
> > +
> > +   rq = task_rq_lock(p, &rf);
> > +
> > +   switch (cookie_type) {
> > +   case sched_core_task_cookie_type:
> > +   p->core_task_cookie = cookie;
> > +   break;
> > +   case sched_core_group_cookie_type:
> > +   p->core_group_cookie = cookie;
> > +   break;
> > +   default:
> > +   WARN_ON_ONCE(1);
> > +   }
> > +
> > +   /* Set p->core_cookie, which is the overall cookie */
> > +   __sched_core_update_cookie(p);
> > +
> > +   if (sched_core_enqueued(p)) {
> > +   sched_core_dequeue(rq, p);
> > +   if (!p->core_cookie) {
> > +   task_rq_unlock(rq, p, &rf);
> > +   return;
> > +   }
> > +   }
> > +
> > +   if (sched_core_enabled(rq) &&
> > +   p->core_cookie && task_on_rq_queued(p))
> > +   sched_core_enqueue(task_rq(p), p);
> > +
> > +   /*
> > +* If task is currently running or waking, it may not be compatible
> > +* anymore after the cookie change, so enter the scheduler on its CPU
> > +* to schedule it away.
> > +*/
> > +   if (task_running(rq, p) || p->state == TASK_WAKING)
> > +   resched_curr(rq);
> 
> I'm not immediately seeing the need for that WAKING test. Since we're
> holding it's rq->lock, the only place that task can be WAKING is on the
> wake_list. And if it's there, it needs to acquire rq->lock to get
> enqueued, and rq->lock again to get scheduled.
> 
> What am I missing?

Hi Peter,

I did this way following a similar pattern in affine_move_task(). However, I
think you are right. Unlike in the case affine_move_task(), we have
schedule() to do the right thing for us in case of any races with wakeup. So
the TASK_WAKING test is indeed not needed and we can drop that test. Apologies
for adding the extra test out of paranoia.

thanks,

 - Joel



Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-05 Thread Joel Fernandes
Hi Peter,

On Thu, Feb 04, 2021 at 02:59:58PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 03, 2021 at 05:51:15PM +0100, Peter Zijlstra wrote:
> > 
> > I'm slowly starting to go through this...
> > 
> > On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> > > +static bool sched_core_empty(struct rq *rq)
> > > +{
> > > + return RB_EMPTY_ROOT(&rq->core_tree);
> > > +}
> > > +
> > > +static struct task_struct *sched_core_first(struct rq *rq)
> > > +{
> > > + struct task_struct *task;
> > > +
> > > + task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
> > > + return task;
> > > +}
> > 
> > AFAICT you can do with:
> > 
> > static struct task_struct *sched_core_any(struct rq *rq)
> > {
> > return rb_entry(rq->core_tree.rb_node, struct task_struct, core_node);
> > }
> > 
> > > +static void sched_core_flush(int cpu)
> > > +{
> > > + struct rq *rq = cpu_rq(cpu);
> > > + struct task_struct *task;
> > > +
> > > + while (!sched_core_empty(rq)) {
> > > + task = sched_core_first(rq);
> > > + rb_erase(&task->core_node, &rq->core_tree);
> > > + RB_CLEAR_NODE(&task->core_node);
> > > + }
> > > + rq->core->core_task_seq++;
> > > +}
> > 
> > However,
> > 
> > > + for_each_possible_cpu(cpu) {
> > > + struct rq *rq = cpu_rq(cpu);
> > > +
> > > + WARN_ON_ONCE(enabled == rq->core_enabled);
> > > +
> > > + if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
> > > + /*
> > > +  * All active and migrating tasks will have already
> > > +  * been removed from core queue when we clear the
> > > +  * cgroup tags. However, dying tasks could still be
> > > +  * left in core queue. Flush them here.
> > > +  */
> > > + if (!enabled)
> > > + sched_core_flush(cpu);
> > > +
> > > + rq->core_enabled = enabled;
> > > + }
> > > + }
> > 
> > I'm not sure I understand. Is the problem that we're still schedulable
> > during do_exit() after cgroup_exit() ?

Yes, exactly. Tim had written this code in the original patches and it
carried (I was not involved at that time). IIRC, the issue is the exit will
race with core scheduling being disabled. Even after core sched is disabled,
it will still exist in the core rb tree and needs to be removed. Otherwise it
causes crashes.

> > It could be argued that when we
> > leave the cgroup there, we should definitely leave the tag group too.
> 
> That is, did you forget to implement cpu_cgroup_exit()?

Yes, I think it is better to implement it in cpu_cgroup_exit().

thanks,

 - Joel



Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-05 Thread Peter Zijlstra
On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> +int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2)
> +{
> + static DEFINE_MUTEX(sched_core_tasks_mutex);
> + unsigned long cookie;
> + int ret = -ENOMEM;
> +
> + mutex_lock(&sched_core_tasks_mutex);
> +
> + if (!t2) {
> + if (t1->core_task_cookie) {
> + sched_core_put_task_cookie(t1->core_task_cookie);
> + sched_core_update_task_cookie(t1, 0);
> + sched_core_put();
> + }

So this seems to be the bit that implements _CLEAR. ISTR there were
security implications / considerations here.

When the machine is vulnerable to L1TF/MDS and the like, clearing the
cookie would gain privilege and should thus be subject to some checks,
but I can'd find anything.

At the very least that deserves a comment I'm thinking.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-05 Thread Peter Zijlstra
On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:

> +/* All active sched_core_cookies */
> +static struct rb_root sched_core_cookies = RB_ROOT;
> +static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);

> +/*
> + * Returns the following:
> + * a < b  => -1
> + * a == b => 0
> + * a > b  => 1
> + */
> +static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
> +  const struct sched_core_cookie *b)
> +{
> +#define COOKIE_CMP_RETURN(field) do {\
> + if (a->field < b->field)\
> + return -1;  \
> + else if (a->field > b->field)   \
> + return 1;   \
> +} while (0)  \
> +
> + COOKIE_CMP_RETURN(task_cookie);
> + COOKIE_CMP_RETURN(group_cookie);
> +
> + /* all cookie fields match */
> + return 0;
> +
> +#undef COOKIE_CMP_RETURN
> +}

AFAICT all this madness exists because cgroup + task interaction, yet
none of that code is actually dependent on cgroups being on.

So this seems to implement semantics that will make two tasks that share
a cookie, but are then placed in different cgroups not actually share.

Is that desired? Can we justify these semantics and the resulting code
complexity.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-05 Thread Peter Zijlstra
On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
>   exit_creds(tsk);
>   delayacct_tsk_free(tsk);
>   put_signal_struct(tsk->signal);
> + sched_tsk_free(tsk);
>  
>   if (!profile_handoff_task(tsk))
>   free_task(tsk);

> +struct sched_core_task_cookie {
> + refcount_t refcnt;
> + struct work_struct work; /* to free in WQ context. */;
> +};

> +static void sched_core_put_cookie_work(struct work_struct *ws);
> +
> +/* Caller has to call sched_core_get() if non-zero value is returned. */
> +static unsigned long sched_core_alloc_task_cookie(void)
> +{
> + struct sched_core_task_cookie *ck =
> + kmalloc(sizeof(struct sched_core_task_cookie), GFP_KERNEL);
> +
> + if (!ck)
> + return 0;
> + refcount_set(&ck->refcnt, 1);
> + INIT_WORK(&ck->work, sched_core_put_cookie_work);
> +
> + return (unsigned long)ck;
> +}

> +static void sched_core_put_task_cookie(unsigned long cookie)
> +{
> + struct sched_core_task_cookie *ptr =
> + (struct sched_core_task_cookie *)cookie;
> +
> + if (refcount_dec_and_test(&ptr->refcnt))
> + kfree(ptr);
> +}

> +static void sched_core_put_cookie_work(struct work_struct *ws)
> +{
> + struct sched_core_task_cookie *ck =
> + container_of(ws, struct sched_core_task_cookie, work);
> +
> + sched_core_put_task_cookie((unsigned long)ck);
> + sched_core_put();
> +}

> +void sched_tsk_free(struct task_struct *tsk)
> +{
> + struct sched_core_task_cookie *ck;
> +
> + sched_core_put_cookie((struct sched_core_cookie *)tsk->core_cookie);
> +
> + if (!tsk->core_task_cookie)
> + return;
> +
> + ck = (struct sched_core_task_cookie *)tsk->core_task_cookie;
> + queue_work(system_wq, &ck->work);
> +}

*groan*.. so the purpose of that work is to be able to disable
core-scheduling after the last such task dies.

Maybe add sched_core_put_async() instead?


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-05 Thread Peter Zijlstra
On Thu, Feb 04, 2021 at 03:52:55PM -0500, Chris Hyser wrote:

> A second complication was a decision that new processes (not threads) do not
> inherit their parents cookie. Thus forking is also not a means to share a
> cookie. Basically with a "from-only" interface, the new task would need to
> be modified to call prctl() itself. From-only also does not allow for
> setting a cookie on an unmodified already running task. This can be fixed by
> providing both a "to" and "from" sharing interface that allows helper
> programs to construct arbitrary configurations from unmodified programs.

Do we really want to inhibit on fork() or would exec() be a better
place? What about those applications that use fork() based workers?

> > Also, how do I set a unique cookie on myself with this interface?
> 
> The v10 patch still uses the overloaded v9 mechanism (which as mentioned
> above is if two tasks w/o cookies share a cookie they get a new shared
> unique cookie). Yes, that is clearly an inconsistency and kludgy. The
> mechanism is documented in the docs, but clearly not obvious from the

I've not seen a document so far (also, I'm not one to actually read
documentation, much preferring comments and Changelogs).

> So based on the above, how about we add a "create" to pair with "clear" and
> call it "create" vs "set" since we are creating a unique opaque cookie
> versus setting a particular value. And as mentioned, because one can't
> specify a cookie directly but only through sharing relationships, we need both
> "to" and "from" to make it completely usable.
> 
> So we end up with something like this:
> PR_SCHED_CORE_CREATE      -- give yourself a unique cookie
> PR_SCHED_CORE_CLEAR       -- clear your core sched cookie
> PR_SCHED_CORE_SHARE_FROM  -- get their cookie for you
> PR_SCHED_CORE_SHARE_TO    -- push your cookie to them

I'm still wondering why we need _FROM/_TO. What exactly will we miss
with just _SHARE ?

			current		arg_task
	<no cookie>	<no cookie>	-EDAFT
	<no cookie>	<cookie>	current gets cookie
	<cookie>	<no cookie>	arg_task gets cookie
	<cookie>	<cookie>	-EDAFTER

(I have a suspicion, but I want to see it spelled out).

Also, do we want this interface to be able to work on processes? Things
like fcntl(F_SETOWN_EX) allow you to specify a PID type.

> An additional question is should the inheritability of a process' cookie be
> configurable? The current code gives the child process their own unique
> cookie if the parent had a cookie. That is useful in some cases, but many
> other configurations could be made much easier with inheritance.

What was the argument for not following the traditional fork() semantics
and inheriting everything?


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-04 Thread Josh Don
On Thu, Feb 4, 2021 at 6:36 AM Peter Zijlstra  wrote:
>
> refcount_dec_and_lock() avoids that complication.

There isn't currently an interface for raw_spin_locks. Certainly could
add a new interface in a separate patch as part of this series, but
doesn't seem that bad to do the locking inline.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-04 Thread Josh Don
On Thu, Feb 4, 2021 at 6:02 AM Peter Zijlstra  wrote:
>
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
>
> > +#ifdef CONFIG_SCHED_DEBUG
> > + /* Read the group cookie. */
> > + {
> > + .name = "core_group_cookie",
> > + .flags = CFTYPE_NOT_ON_ROOT,
> > + .read_u64 = cpu_core_group_cookie_read_u64,
> > + },
> > +#endif
>
> > +#ifdef CONFIG_SCHED_DEBUG
> > + /* Read the group cookie. */
> > + {
> > + .name = "core_group_cookie",
> > + .flags = CFTYPE_NOT_ON_ROOT,
> > + .read_u64 = cpu_core_group_cookie_read_u64,
> > + },
> > +#endif
>
> AFAICT this leaks kernel pointers. IIRC that was a bad thing.

For that matter, we're also exposing the cookie pointer in
/proc/$pid/sched. Currently these are used by the selftests to
validate that two tasks are/aren't sharing.  If this poses a risk, we
can rework to avoid exposing the actual pointers.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-04 Thread Josh Don
On Thu, Feb 4, 2021 at 5:54 AM Peter Zijlstra  wrote:
>
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> > From: Peter Zijlstra 
>
> I'm thinking this is too much credit, I didn't write much of this.
>
> > Marks all tasks in a cgroup as matching for core-scheduling.
> >
> > A task will need to be moved into the core scheduler queue when the cgroup
> > it belongs to is tagged to run with core scheduling.  Similarly the task
> > will need to be moved out of the core scheduler queue when the cgroup
> > is untagged.
> >
> > Also, after we fork a task, its presence in the core scheduler queue
> > will need to be updated according to its new cgroup's status.
> >
> > Use the stop machine mechanism to update all tasks in a cgroup, to
> > prevent a new task from sneaking into the cgroup and being missed by the
> > update while we iterate through all the tasks in the cgroup.  A more
> > complicated scheme could probably avoid the stop machine.  Such a scheme
> > would also need to resolve inconsistency between a task's cgroup core
> > scheduling tag and residency in the core scheduler queue.
> >
> > We are opting for the simple stop machine mechanism for now, which
> > avoids such complications.
> >
> > The core scheduler has extra overhead.  Enable it only for cores with
> > more than one SMT hardware thread.
>
> Very little actual words on the desired and implemented semantics of the
> interface, while the patch contains much non-obvious complication.

Ack to both, will fix in the next posting.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-04 Thread Chris Hyser

On 2/4/21 8:57 AM, Peter Zijlstra wrote:

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:

+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		59
+# define PR_SCHED_CORE_CLEAR		0  /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_SHARE_FROM	1  /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO		2  /* push core_sched cookie to pid */


Why ?


The simplest interface would be a single 'set' command that specifies and sets a cookie value, with 0 as a special value 
to clear it. However, an early requirement that people seemed to agree with is that cookies should be opaque and 
system-guaranteed unique except when explicitly shared. Since two tasks therefore cannot share a cookie by explicitly 
setting the same cookie value, the prctl() must provide a means of cookie sharing between tasks. The v9 proposal 
incorporated all of this into a single "from-only" command whose actions depended on the state of the two tasks: if 
neither has a cookie and one shares from the other, they both get the same new cookie; if the calling task had one and 
the other didn't, the calling task's cookie was cleared; and if the src task has a cookie, the caller simply gets it. It 
does a lot, is a tad overloaded, and is still insufficient.


A second complication was a decision that new processes (not threads) do not inherit their parent's cookie. Thus forking 
is also not a means to share a cookie. Basically with a "from-only" interface, the new task would need to be modified to 
call prctl() itself. From-only also does not allow for setting a cookie on an unmodified already running task. This can 
be fixed by providing both a "to" and "from" sharing interface that allows helper programs to construct arbitrary 
configurations from unmodified programs.

Also, how do I set a unique cookie on myself with this interface?


The v10 patch still uses the overloaded v9 mechanism (which, as mentioned above, means that if two tasks without cookies 
share a cookie, they get a new shared unique cookie). Yes, that is clearly inconsistent and kludgy. The mechanism is 
documented in the docs, but clearly not obvious from the interface above. I think we got a bit overzealous in patch 
squashing and much of this verbiage should have been in the combined commit message.


So based on the above, how about we add a "create" to pair with "clear" and call it "create" vs "set" since we are 
creating a unique opaque cookie versus setting a particular value. And as mentioned, because one can't specify a cookie 
directly but only through sharing relationships, we need both "to" and "from" to make it completely usable.


So we end up with something like this:
PR_SCHED_CORE_CREATE      -- give yourself a unique cookie
PR_SCHED_CORE_CLEAR       -- clear your core sched cookie
PR_SCHED_CORE_SHARE_FROM  -- get their cookie for you
PR_SCHED_CORE_SHARE_TO    -- push your cookie to them

An additional question is should the inheritability of a process' cookie be configurable? The current code gives the 
child process their own unique cookie if the parent had a cookie. That is useful in some cases, but many other 
configurations could be made much easier with inheritance.


If configurable cookie inheritance would be useful, it might look something 
like this:

PR_SCHED_CORE_CHILD_INHERIT <0/1>  -- 1: child inherits cookie from parent; 0: if the parent has a cookie, the child 
process gets a unique cookie.


-chrish


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-04 Thread Peter Zijlstra
On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> +static void sched_core_update_cookie(struct task_struct *p, unsigned long cookie,
> +  enum sched_core_cookie_type cookie_type)
> +{
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + if (!p)
> + return;
> +
> + rq = task_rq_lock(p, &rf);
> +
> + switch (cookie_type) {
> + case sched_core_task_cookie_type:
> + p->core_task_cookie = cookie;
> + break;
> + case sched_core_group_cookie_type:
> + p->core_group_cookie = cookie;
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + }
> +
> + /* Set p->core_cookie, which is the overall cookie */
> + __sched_core_update_cookie(p);
> +
> + if (sched_core_enqueued(p)) {
> + sched_core_dequeue(rq, p);
> + if (!p->core_cookie) {
> + task_rq_unlock(rq, p, &rf);
> + return;
> + }
> + }
> +
> + if (sched_core_enabled(rq) &&
> + p->core_cookie && task_on_rq_queued(p))
> + sched_core_enqueue(task_rq(p), p);
> +
> + /*
> +  * If task is currently running or waking, it may not be compatible
> +  * anymore after the cookie change, so enter the scheduler on its CPU
> +  * to schedule it away.
> +  */
> + if (task_running(rq, p) || p->state == TASK_WAKING)
> + resched_curr(rq);

I'm not immediately seeing the need for that WAKING test. Since we're
holding it's rq->lock, the only place that task can be WAKING is on the
wake_list. And if it's there, it needs to acquire rq->lock to get
enqueued, and rq->lock again to get scheduled.

What am I missing?

> +
> + task_rq_unlock(rq, p, &rf);
> +}


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-04 Thread Peter Zijlstra
On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> +static inline void __sched_core_erase_cookie(struct sched_core_cookie *cookie)
> +{
> + lockdep_assert_held(&sched_core_cookies_lock);
> +
> + /* Already removed */
> + if (RB_EMPTY_NODE(&cookie->node))
> + return;
> +
> + rb_erase(&cookie->node, &sched_core_cookies);
> + RB_CLEAR_NODE(&cookie->node);
> +}
> +
> +/* Called when a task no longer points to the cookie in question */
> +static void sched_core_put_cookie(struct sched_core_cookie *cookie)
> +{
> + unsigned long flags;
> +
> + if (!cookie)
> + return;
> +
> + if (refcount_dec_and_test(&cookie->refcnt)) {
> + raw_spin_lock_irqsave(&sched_core_cookies_lock, flags);
> + __sched_core_erase_cookie(cookie);
> + raw_spin_unlock_irqrestore(&sched_core_cookies_lock, flags);
> + kfree(cookie);
> + }
> +}

> +static void __sched_core_update_cookie(struct task_struct *p)
> +{

> + raw_spin_lock(&sched_core_cookies_lock);

> + /*
> +  * Cookie exists, increment refcnt. If refcnt is currently 0,
> +  * we're racing with a put() (refcnt decremented but cookie not
> +  * yet removed from the tree). In this case, we can simply
> +  * perform the removal ourselves and retry.
> +  * sched_core_put_cookie() will still function correctly.
> +  */
> + if (unlikely(!refcount_inc_not_zero(&match->refcnt))) {
> + __sched_core_erase_cookie(match);
> + goto retry;
> + }

refcount_dec_and_lock() avoids that complication.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-04 Thread Peter Zijlstra
On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:

> +#ifdef CONFIG_SCHED_DEBUG
> + /* Read the group cookie. */
> + {
> + .name = "core_group_cookie",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_core_group_cookie_read_u64,
> + },
> +#endif

> +#ifdef CONFIG_SCHED_DEBUG
> + /* Read the group cookie. */
> + {
> + .name = "core_group_cookie",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_core_group_cookie_read_u64,
> + },
> +#endif

AFAICT this leaks kernel pointers. IIRC that was a bad thing.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-04 Thread Peter Zijlstra
On Wed, Feb 03, 2021 at 05:51:15PM +0100, Peter Zijlstra wrote:
> 
> I'm slowly starting to go through this...
> 
> On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> > +static bool sched_core_empty(struct rq *rq)
> > +{
> > +   return RB_EMPTY_ROOT(&rq->core_tree);
> > +}
> > +
> > +static struct task_struct *sched_core_first(struct rq *rq)
> > +{
> > +   struct task_struct *task;
> > +
> > +   task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
> > +   return task;
> > +}
> 
> AFAICT you can do with:
> 
> static struct task_struct *sched_core_any(struct rq *rq)
> {
> >   return rb_entry(rq->core_tree.rb_node, struct task_struct, core_node);
> }
> 
> > +static void sched_core_flush(int cpu)
> > +{
> > +   struct rq *rq = cpu_rq(cpu);
> > +   struct task_struct *task;
> > +
> > +   while (!sched_core_empty(rq)) {
> > +   task = sched_core_first(rq);
> > +   rb_erase(&task->core_node, &rq->core_tree);
> > +   RB_CLEAR_NODE(&task->core_node);
> > +   }
> > +   rq->core->core_task_seq++;
> > +}
> 
> However,
> 
> > +   for_each_possible_cpu(cpu) {
> > +   struct rq *rq = cpu_rq(cpu);
> > +
> > +   WARN_ON_ONCE(enabled == rq->core_enabled);
> > +
> > +   if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
> > +   /*
> > +* All active and migrating tasks will have already
> > +* been removed from core queue when we clear the
> > +* cgroup tags. However, dying tasks could still be
> > +* left in core queue. Flush them here.
> > +*/
> > +   if (!enabled)
> > +   sched_core_flush(cpu);
> > +
> > +   rq->core_enabled = enabled;
> > +   }
> > +   }
> 
> I'm not sure I understand. Is the problem that we're still schedulable
> during do_exit() after cgroup_exit() ? It could be argued that when we
> leave the cgroup there, we should definitely leave the tag group too.

That is, did you forget to implement cpu_cgroup_exit()?


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-04 Thread Peter Zijlstra
On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> +/* Request the scheduler to share a core */
> +#define PR_SCHED_CORE_SHARE		59
> +# define PR_SCHED_CORE_CLEAR		0  /* clear core_sched cookie of pid */
> +# define PR_SCHED_CORE_SHARE_FROM	1  /* get core_sched cookie from pid */
> +# define PR_SCHED_CORE_SHARE_TO		2  /* push core_sched cookie to pid */

Why ?

Also, how do I set a unique cookie on myself with this interface?


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-04 Thread Peter Zijlstra
On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> From: Peter Zijlstra 

I'm thinking this is too much credit, I didn't write much of this.

> Marks all tasks in a cgroup as matching for core-scheduling.
> 
> A task will need to be moved into the core scheduler queue when the cgroup
> it belongs to is tagged to run with core scheduling.  Similarly the task
> will need to be moved out of the core scheduler queue when the cgroup
> is untagged.
> 
> Also, after we fork a task, its presence in the core scheduler queue
> will need to be updated according to its new cgroup's status.
> 
> Use the stop machine mechanism to update all tasks in a cgroup, to
> prevent a new task from sneaking into the cgroup and being missed by the
> update while we iterate through all the tasks in the cgroup.  A more
> complicated scheme could probably avoid the stop machine.  Such a scheme
> would also need to resolve inconsistency between a task's cgroup core
> scheduling tag and residency in the core scheduler queue.
> 
> We are opting for the simple stop machine mechanism for now, which
> avoids such complications.
> 
> The core scheduler has extra overhead.  Enable it only for cores with
> more than one SMT hardware thread.

Very little actual words on the desired and implemented semantics of the
interface, while the patch contains much non-obvious complication.


Re: [PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-02-03 Thread Peter Zijlstra


I'm slowly starting to go through this...

On Fri, Jan 22, 2021 at 08:17:01PM -0500, Joel Fernandes (Google) wrote:
> +static bool sched_core_empty(struct rq *rq)
> +{
> + return RB_EMPTY_ROOT(&rq->core_tree);
> +}
> +
> +static struct task_struct *sched_core_first(struct rq *rq)
> +{
> + struct task_struct *task;
> +
> + task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
> + return task;
> +}

AFAICT you can do with:

static struct task_struct *sched_core_any(struct rq *rq)
{
return rb_entry(rq->core_tree.rb_node, struct task_struct, core_node);
}

> +static void sched_core_flush(int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> + struct task_struct *task;
> +
> + while (!sched_core_empty(rq)) {
> + task = sched_core_first(rq);
> + rb_erase(&task->core_node, &rq->core_tree);
> + RB_CLEAR_NODE(&task->core_node);
> + }
> + rq->core->core_task_seq++;
> +}

However,

> + for_each_possible_cpu(cpu) {
> + struct rq *rq = cpu_rq(cpu);
> +
> + WARN_ON_ONCE(enabled == rq->core_enabled);
> +
> + if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
> + /*
> +  * All active and migrating tasks will have already
> +  * been removed from core queue when we clear the
> +  * cgroup tags. However, dying tasks could still be
> +  * left in core queue. Flush them here.
> +  */
> + if (!enabled)
> + sched_core_flush(cpu);
> +
> + rq->core_enabled = enabled;
> + }
> + }

I'm not sure I understand. Is the problem that we're still schedulable
during do_exit() after cgroup_exit() ? It could be argued that when we
leave the cgroup there, we should definitely leave the tag group too.





[PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-01-22 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after we fork a task, its presence in the core scheduler queue
will need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to
prevent a new task from sneaking into the cgroup and being missed by the
update while we iterate through all the tasks in the cgroup.  A more
complicated scheme could probably avoid the stop machine.  Such a scheme
would also need to resolve inconsistency between a task's cgroup core
scheduling tag and residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now, which
avoids such complications.

The core scheduler has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.

Co-developed-by: Josh Don 
Co-developed-by: Chris Hyser 
Co-developed-by: Joel Fernandes (Google) 
Tested-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Josh Don 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h|  10 +
 include/uapi/linux/prctl.h   |   6 +
 kernel/fork.c|   1 +
 kernel/sched/Makefile|   1 +
 kernel/sched/core.c  | 136 ++-
 kernel/sched/coretag.c   | 669 +++
 kernel/sched/debug.c |   4 +
 kernel/sched/sched.h |  58 ++-
 kernel/sys.c |   7 +
 tools/include/uapi/linux/prctl.h |   6 +
 10 files changed, 878 insertions(+), 20 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7efce9c9d9cf..7ca6f2f72cda 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned long   core_task_cookie;
+   unsigned long   core_group_cookie;
unsigned intcore_occupation;
 #endif
 
@@ -2076,4 +2078,12 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+int sched_core_share_pid(unsigned long flags, pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
+#else
+#define sched_core_share_pid(flags, pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
+#endif
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..f8e4e9626121 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER  57
 #define PR_GET_IO_FLUSHER  58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE		59
+# define PR_SCHED_CORE_CLEAR		0  /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_SHARE_FROM	1  /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO		2  /* push core_sched cookie to pid */
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20125431af87..a3844e2e7379 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -136,7 +136,33 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
return false;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+static bool sched_core_empty(struct rq *rq)
+{
+   return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+   struct task_struct *task;
+
+   task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+   return task;
+}
+
+static void