Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

2020-07-10 Thread Vineeth Remanan Pillai
Hi Aubrey,

On Fri, Jul 10, 2020 at 8:19 AM Li, Aubrey  wrote:
>
> Hi Joel/Vineeth,
> [...]
> The problem is gone when we reverted this patch. We are running multiple
> uperf threads (equal to cpu number) in a cgroup with coresched enabled.
> This is 100% reproducible on our side.
>
> Just wonder if anything already known before we dig into it.
>
Thanks for reporting this. We haven't seen any lockups like this
in our testing yet.

Could you please add more information on how to reproduce this?
Was it a simple uperf run without any options or was it running any
specific kind of network test?

We shall also try to reproduce this and investigate.

Thanks,
Vineeth


Re: [RFC PATCH 06/16] sched: Add core wide task selection and scheduling.

2020-07-06 Thread Vineeth Remanan Pillai
On Mon, Jul 6, 2020 at 10:09 AM Joel Fernandes  wrote:
>
> > I am not sure if this can happen. If the other sibling sets core_pick, it
> > will be under the core wide lock and it should set the core_sched_seq also
> > before releasing the lock. So when this cpu tries, it would see the core_pick
> > before resetting it. Is this the same case you were mentioning? Sorry if I
> > misunderstood the case you mentioned..
>
> Say you have a case where 3 siblings are all trying to enter the schedule
> loop. Call them A, B and C.
>
> A picks something for B in core_pick. Now C comes and resets B's core_pick
> before running the mega-loop, hoping to select something for it shortly.
> However, C then does an unconstrained pick and forgets to set B's pick to
> something.
>
> I don't know if this can really happen - but this is why I added the warning
> in the end of the patch. I think we should make the code more robust and
> handle these kind of cases.
>
I don't think this can happen. Each of the siblings takes the core wide
lock before calling into pick_next_task(), so this should not happen.
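
To make that serialization concrete: the core wide lock here is the one
handed out by rq_lockp() once core scheduling is enabled (added earlier in
this series). Roughly, and from memory, so treat it as a sketch rather than
the exact hunk:

    static inline raw_spinlock_t *rq_lockp(struct rq *rq)
    {
            /* All siblings of a core share rq->core, hence one lock. */
            if (sched_core_enabled(rq))
                    return &rq->core->__lock;

            return &rq->__lock;
    }

So by the time a sibling reaches the selection loop it already holds the
same core wide lock that any sibling setting core_pick would have held.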

> Again, it is about making the code more robust. Why should we not set
> rq->core_pick when we pick something? As we discussed in the private
> discussion - we should make the code robust and consistent. Correctness is
> not enough, the code has to be robust and maintainable.
>
> I think in our private discussion, you agreed with me that there is no harm
> in setting core_pick in this case.
>
I agreed there was no harm, because we wanted to use that in the last
check after the 'done' label. But now I see that adding that check after the
done label causes the WARN_ON to fire even in a valid case. Firing the
WARN_ON in a valid case is not good. So, if that WARN_ON check can be
removed, adding this is not necessary IMHO.

> > cpumask_copy(&select_mask, cpu_smt_mask(cpu));
> > if (unlikely(!cpumask_test_cpu(cpu, &select_mask))) {
> >         cpumask_set_cpu(cpu, &select_mask);
> >         need_sync = false;
> > }
>
> Nah, more lines of code for no good reason, plus another branch, right? I'd
> rather leave my one liner alone than add 4 more lines :-)
>
Remember, this is the fast path. Every schedule() except for our sync
IPI reaches here. And we are sure that the smt cpumask can be missing this
cpu only in hotplug cases, which are very rare. Adding a bit more code to
make it clear that this setting is not always needed, and to optimize the
fast path, is what I was looking for.

> > By setting need_sync to false, we will do an unconstrained pick and will
> > not sync with other siblings. I guess we need to reset need_sync after
> > or in the following for_each_cpu loop, because the loop may set it.
>
> I don't know if we want to add more conditions really and make it more
> confusing. If anything, I believe we should simplify the existing code more 
> TBH.
>
I don't think it's more confusing. This hotplug case is really rare
and we should wrap it in an unlikely conditional IMHO. Comments can
make the reasoning clearer. We save two things here: one is the
unconditional setting of the cpu mask, and the second is the unnecessary
syncing of the siblings during hotplug.

> > I think we would not need these here. core_pick needs to be set only
> > for siblings if we are picking a task for them. For unconstrained pick,
> > we pick only for ourselves. Also, core_sched_seq need not be synced here.
> > We might already be synced with the existing core->core_pick_seq. Even
> > if it is not synced, I don't think it will cause an issue in subsequent
> > schedule events.
>
> As discussed both privately and above, there is no harm and it is good to
> keep the code consistent. I'd rather have any task picking set core_pick and
> core_sched_seq to prevent confusion.
>
> And if anything is resetting an existing ->core_pick of a sibling in the
> selection loop, it better set it to something sane.
>
As I mentioned, I was okay with this as you were using it down in the
WARN_ON check. But the WARN_ON check triggers even in valid cases, which
is bad. I don't think setting this here will make the code more robust IMHO.
core_pick is already NULL and I would like it to stay that way unless there
is a compelling reason to set it. The reason is that we can catch any bad
case entering the pick condition above if this is NULL (it will crash).

> > >  done:
> > > +   /*
> > > +* If we reset a sibling's core_pick, make sure that we picked a task
> > > +* for it, this is because we might have reset it though it was set to
> > > +* something by another selector. In this case we cannot leave it as
> > > +* NULL and should have found something for it.
> > > +*/
> > > +   for_each_cpu(i, &select_mask) {
> > > +   WARN_ON_ONCE(!cpu_rq(i)->core_pick);
> > > +   }
> > > +
I think this check will not always hold. For an unconstrained pick, we
do not pick tasks for siblings and hence do not set core_pick for them.
So this WARN_ON will 

Re: [RFC PATCH 06/16] sched: Add core wide task selection and scheduling.

2020-07-03 Thread Vineeth Remanan Pillai
On Wed, Jul 1, 2020 at 7:28 PM Joel Fernandes  wrote:
>
> From: "Joel Fernandes (Google)" 
> Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection logic
>
> The selection logic does not run correctly if the current CPU is not in the
> cpu_smt_mask (which it is not because the CPU is offlined when the stopper
> finishes running and needs to switch to idle).  There are also other issues
> fixed by the patch I think such as: if some other sibling set core_pick to
> something, however the selection logic on current cpu resets it before
> selecting. In this case, we need to run the task selection logic again to
> make sure it picks something if there is something to run. It might end up
> picking the wrong task.
>
I am not sure if this can happen. If the other sibling sets core_pick, it
will be under the core wide lock and it should set the core_sched_seq also
before releasing the lock. So when this cpu tries, it would see the core_pick
before resetting it. Is this the same case you were mentioning? Sorry if I
misunderstood the case you mentioned..

> Yet another issue was, if the stopper thread is an
> unconstrained pick, then rq->core_pick is set. The next time task selection
> logic runs when stopper needs to switch to idle, the current CPU is not in
> the smt_mask. This causes the previous ->core_pick to be picked again which
> happens to be the unconstrained task! so the stopper keeps getting selected
> forever.
>
I did not clearly understand this. During an unconstrained pick, the current
cpu's core_pick is not set and tasks are not picked for siblings either.
If it is observed to be set in the v6 code, I think it is a bug.

> That and there are a few more safe guards and checks around checking/setting
> rq->core_pick. To test it, I ran rcutorture and made it tag all torture
> threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
> issue. Now it runs for an hour or so without issue. (Torture testing debug
> changes: https://bit.ly/38htfqK ).
>
> Various fixes were tried causing varying degrees of crashes.  Finally I found
> that it is easiest to just add current CPU to the smt_mask's copy always.
> This is so that task selection logic always runs on the current CPU which
> called schedule().
>
> [...]
> cpu = cpu_of(rq);
> -   smt_mask = cpu_smt_mask(cpu);
> +   /* Make a copy of cpu_smt_mask as we should not set that. */
> +   cpumask_copy(&select_mask, cpu_smt_mask(cpu));
> +
> +   /*
> +* Always make sure current CPU is added to smt_mask so that below
> +* selection logic runs on it.
> +*/
> +   cpumask_set_cpu(cpu, &select_mask);
>
I like this idea. Probably we can optimize it a bit. We get here with cpu
not in smt_mask only during an offline and online (including the boot time
online) phase. So we could probably wrap it in an "if (unlikely())". Also,
during this time, it would be the idle thread or some hotplug online thread
that is runnable, and no other tasks should be runnable on this cpu. So, I
think it makes sense to do an unconstrained pick rather than a costly sync
of all siblings. Probably something like:

cpumask_copy(&select_mask, cpu_smt_mask(cpu));
if (unlikely(!cpumask_test_cpu(cpu, &select_mask))) {
        cpumask_set_cpu(cpu, &select_mask);
        need_sync = false;
}

By setting need_sync to false, we will do an unconstrained pick and will
not sync with other siblings. I guess we need to reset need_sync after
or in the following for_each_cpu loop, because the loop may set it.

> /*
>  * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
> @@ -4351,7 +4358,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
> struct rq_flags *rf)

> if (i == cpu && !need_sync && !p->core_cookie) {
> next = p;
> +   rq_i->core_pick = next;
> +   rq_i->core_sched_seq = rq_i->core->core_pick_seq;
>
I think we would not need these here. core_pick needs to be set only
for siblings if we are picking a task for them. For unconstrained pick,
we pick only for ourselves. Also, core_sched_seq need not be synced here.
We might already be synced with the existing core->core_pick_seq. Even
if it is not synced, I don't think it will cause an issue in subsequent
schedule events.


>  done:
> +   /*
> +* If we reset a sibling's core_pick, make sure that we picked a task
> +* for it, this is because we might have reset it though it was set to
> +* something by another selector. In this case we cannot leave it as
> +* NULL and should have found something for it.
> +*/
> +   for_each_cpu(i, &select_mask) {
> +   WARN_ON_ONCE(!cpu_rq(i)->core_pick);
> +   }
> +
I think this check will not always hold. For an unconstrained pick, we
do not pick tasks for siblings and hence do not set core_pick for them.
So this WARN_ON will fire for an unconstrained pick. Easily 

[RFC PATCH 11/16] sched: migration changes for core scheduling

2020-06-30 Thread Vineeth Remanan Pillai
From: Aubrey Li 

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move a task from the busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match the destination CPU's
 core cookie, the task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select a cookie-matched idle CPU
 In the fast path of task wakeup, select the first cookie-matched
 idle CPU instead of the first idle CPU.

 - Find a cookie-matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches the task's cookie.

 - Don't migrate a task if the cookie does not match
 For NUMA load balancing, don't migrate a task to a CPU whose
 core cookie does not match the task's cookie.

Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
---
 kernel/sched/fair.c  | 64 
 kernel/sched/sched.h | 29 
 2 files changed, 88 insertions(+), 5 deletions(-)
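
The fair.c hunks below rely on sched_core_cookie_match(); the sched.h side
of the patch adds roughly the following helper (a simplified sketch: an idle
core is treated as matching any cookie, and the check is a no-op when core
scheduling is disabled on that runqueue):

    static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
    {
            bool idle_core = true;
            int cpu;

            /* Ignore cookie match if core scheduling is not enabled on the CPU. */
            if (!sched_core_enabled(rq))
                    return true;

            for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
                    if (!available_idle_cpu(cpu)) {
                            idle_core = false;
                            break;
                    }
            }

            /* An idle core is always a fine destination for a cookied task. */
            return idle_core || rq->core->core_cookie == p->core_cookie;
    }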

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d16939766361..33dc4bf01817 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+#ifdef CONFIG_SCHED_CORE
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+#endif
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+#ifdef CONFIG_SCHED_CORE
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+#endif
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-   break;
+
+   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+#ifdef CONFIG_SCHED_CORE
+   /*
+* If Core Scheduling is enabled, select this cpu
+* only if the process cookie matches core cookie.
+*/
+   if (sched_core_enabled(cpu_rq(cpu)) &&
+   p->core_cookie == cpu_rq(cpu)->core->core_cookie)
+#endif
+   break;
+   }
}
 
time = cpu_clock(this) - time;
@@ -7609,8 +7634,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 * We do not migrate tasks that are:
 * 1) throttled_lb_pair, or
 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-* 3) running (obviously), or
-* 4) are cache-hot on their current CPU.
+* 3) task's cookie does not match with this CPU's core cookie
+* 4) running (obviously), or
+* 5) are cache-hot on their current CPU.
 */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7645,6 +7671,15 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
return 0;
}
 
+#ifdef CONFIG_SCHED_CORE
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 0;
+#endif
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;
 
@@ -8857,6 +8892,25 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
p->cpus_ptr))
continue;
 
+#ifdef CONFIG_SCHED_CORE
+   if (sched_core_enabled(cpu_rq(this_cpu))) {
+   int i = 0;
+   bool cookie_match = false;

[RFC PATCH 16/16] sched: Debug bits...

2020-06-30 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 44 ++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2ec56970d6bb..0362102fa3d2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,6 +105,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b)
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -302,12 +306,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -4477,6 +4485,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
put_prev_task(rq, prev);
set_next_task(rq, next);
}
+
+   trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie);
+
return next;
}
 
@@ -4551,6 +4567,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 */
if (i == cpu && !need_sync && !p->core_cookie) {
next = p;
+   trace_printk("unconstrained pick: %s/%d %lx\n",
+next->comm, next->pid, 
next->core_cookie);
+
goto done;
}
 
@@ -4559,6 +4578,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
rq_i->core_pick = p;
 
+   trace_printk("cpu(%d): selected: %s/%d %lx\n",
+i, p->comm, p->pid, p->core_cookie);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -4575,6 +4597,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %lx\n", max->comm, 
max->pid, max->core_cookie);
+
if (old_max) {
for_each_cpu(j, smt_mask) {
if (j == i)
@@ -4602,6 +4626,7 @@ next_class:;
rq->core->core_pick_seq = rq->core->core_task_seq;
next = rq->core_pick;
rq->core_sched_seq = rq->core->core_pick_seq;
+   trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, 
next->core_cookie);
 
/*
 * Reschedule siblings
@@ -4624,11 +4649,20 @@ next_class:;
if (i == cpu)
continue;
 
-   if (rq_i->curr != rq_i->core_pick)
+   if (rq_i->curr != rq_i->core_pick) {
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
+   }
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%lx/0x%lx\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie,
+rq_i->core->core_cookie);
+   WARN_ON_ONCE(1);
+   }
}
 
 done:
@@ -4667,6 +4701,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation, 
cookie);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
   

[RFC PATCH 10/16] sched: Trivial forced-newidle balancer

2020-06-30 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

When a sibling is forced-idle to match the core-cookie; search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
Acked-by: Paul E. McKenney 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 131 +-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c8dcc5ff039..4f9edf013df3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned intcore_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d6d6a678013..fb9edb09ead7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -201,6 +201,21 @@ static struct task_struct *sched_core_find(struct rq *rq, 
unsigned long cookie)
return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned 
long cookie)
+{
+   struct rb_node *node = &p->core_node;
+
+   node = rb_next(node);
+   if (!node)
+   return NULL;
+
+   p = container_of(node, struct task_struct, core_node);
+   if (p->core_cookie != cookie)
+   return NULL;
+
+   return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -4233,7 +4248,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
-   int i, j, cpu;
+   int i, j, cpu, occ = 0;
bool need_sync;
 
if (!sched_core_enabled(rq))
@@ -4332,6 +4347,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
goto done;
}
 
+   if (!is_idle_task(p))
+   occ++;
+
rq_i->core_pick = p;
 
/*
@@ -4357,6 +4375,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
cpu_rq(j)->core_pick = NULL;
}
+   occ = 1;
goto again;
} else {
/*
@@ -4393,6 +4412,8 @@ next_class:;
if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
rq_i->core_forceidle = true;
 
+   rq_i->core_pick->core_occupation = occ;
+
if (i == cpu)
continue;
 
@@ -4408,6 +4429,114 @@ next_class:;
return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+   struct task_struct *p;
+   unsigned long cookie;
+   bool success = false;
+
+   local_irq_disable();
+   double_rq_lock(dst, src);
+
+   cookie = dst->core->core_cookie;
+   if (!cookie)
+   goto unlock;
+
+   if (dst->curr != dst->idle)
+   goto unlock;
+
+   p = sched_core_find(src, cookie);
+   if (p == src->idle)
+   goto unlock;
+
+   do {
+   if (p == src->core_pick || p == src->curr)
+   goto next;
+
+   if (!cpumask_test_cpu(this, &p->cpus_mask))
+   goto next;
+
+   if (p->core_occupation > dst->idle->core_occupation)
+   goto next;
+
+   p->on_rq = TASK_ON_RQ_MIGRATING;
+   deactivate_task(src, p, 0);
+   set_task_cpu(p, this);
+   activate_task(dst, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
+
+   resched_curr(dst);
+
+   success = true;
+   break;
+
+next:
+   p = sched_core_next(p, cookie);
+   } while (p);
+
+unlock:
+   double_rq_unlock(dst, src);
+   local_irq_enable();
+
+   return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+   int i;
+
+   for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+   if (i == cpu)
+   continue;
+
+   if (need_resched())
+   break;
+
+   if (try_steal_cookie(cpu, i))
+   return true;
+   }
+
+   return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+   struct sched_domain *sd;
+   int cpu = 

[RFC PATCH 12/16] sched: cgroup tagging interface for core scheduling

2020-06-30 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after we fork a task, its core scheduler queue presence will
need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent a
new task from sneaking into the cgroup and missing the update while we
iterate through all the tasks in the cgroup.  A more complicated scheme
could probably avoid the stop machine.  Such a scheme would also need to
resolve the inconsistency between a task's cgroup core scheduling tag and
its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now that avoids
such complications.

The core scheduler has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.
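
Conceptually, what tagging does for each task in the cgroup looks roughly
like the sketch below (the helper name and the use of the task_group pointer
as the cookie value are illustrative assumptions; the real patch performs
this under the stop machine described above and manages enabling/disabling
through sched_core_get()/sched_core_put()):

    static void sched_core_tag_task(struct rq *rq, struct task_struct *p,
                                    struct task_group *tg, bool tagged)
    {
            /* Take the task out of the core tree while the cookie changes. */
            if (sched_core_enqueued(p))
                    sched_core_dequeue(rq, p);

            /* The task_group pointer doubles as the shared cookie value. */
            p->core_cookie = tagged ? (unsigned long)tg : 0UL;

            if (p->core_cookie && task_on_rq_queued(p))
                    sched_core_enqueue(rq, p);
    }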

Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
---
 kernel/sched/core.c  | 183 +--
 kernel/sched/sched.h |   4 +
 2 files changed, 180 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9edb09ead7..c84f209b8591 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -135,6 +135,37 @@ static inline bool __sched_core_less(struct task_struct 
*a, struct task_struct *
return false;
 }
 
+static bool sched_core_empty(struct rq *rq)
+{
+   return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+   return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+   struct task_struct *task;
+
+   task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+   return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+   struct task_struct *task;
+
+   while (!sched_core_empty(rq)) {
+   task = sched_core_first(rq);
+   rb_erase(&task->core_node, &rq->core_tree);
+   RB_CLEAR_NODE(&task->core_node);
+   }
+   rq->core->core_task_seq++;
+}
+
 static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
struct rb_node *parent, **node;
@@ -166,10 +197,11 @@ static void sched_core_dequeue(struct rq *rq, struct 
task_struct *p)
 {
rq->core->core_task_seq++;
 
-   if (!p->core_cookie)
+   if (!sched_core_enqueued(p))
return;
 
rb_erase(&p->core_node, &rq->core_tree);
+   RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
@@ -235,9 +267,23 @@ static int __sched_core_stopper(void *data)
 
for_each_possible_cpu(cpu) {
struct rq *rq = cpu_rq(cpu);
-   rq->core_enabled = enabled;
-   if (cpu_online(cpu) && rq->core != rq)
-   sched_core_adjust_sibling_vruntime(cpu, enabled);
+
+   WARN_ON_ONCE(enabled == rq->core_enabled);
+
+   if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+   /*
+* All active and migrating tasks will have already
+* been removed from core queue when we clear the
+* cgroup tags. However, dying tasks could still be
+* left in core queue. Flush them here.
+*/
+   if (!enabled)
+   sched_core_flush(cpu);
+
+   rq->core_enabled = enabled;
+   if (cpu_online(cpu) && rq->core != rq)
+   sched_core_adjust_sibling_vruntime(cpu, enabled);
+   }
}
 
return 0;
@@ -248,7 +294,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-   // XXX verify there are no cookie tasks (yet)
+   int cpu;
+
+   /* verify there are no cookie tasks (yet) */
+   for_each_online_cpu(cpu)
+   BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -256,8 +306,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-   // XXX verify there are no cookie tasks (left)
-
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
 }
@@ -282,6 +330,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueue

[RFC PATCH 15/16] Documentation: Add documentation on core scheduling

2020-06-30 Thread Vineeth Remanan Pillai
From: "Joel Fernandes (Google)" 

Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Vineeth Remanan Pillai 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 241 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 242 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst 
b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..275568162a74
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,241 @@
+Core Scheduling
+===============
+MDS and L1TF mitigations do not protect from cross-HT attacks (attacker running
+on one HT with victim running on another). For proper mitigation of this,
+core scheduling support is available via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while still trying
+to maintain the scheduler's properties and requirements.
+
+Usage
+-----
+The user-interface to this feature is not yet finalized. The current
+implementation uses the CPU controller cgroup. Core scheduling adds a
+``cpu.tag`` file to the CPU controller CGroup. If the content of this file
+is 1, then all the tasks in this CGroup trust each other and are allowed to
+run concurrently on the siblings of a core.
+
+This interface has drawbacks. Trusted tasks have to be grouped into one CPU
+CGroup and this is not always possible based on the system's existing CGroup
+configuration, where trusted tasks could already be in different CPU CGroups.
+Also, this feature will have a hard dependency on CGroups, and systems with
+CGroups disabled would not be able to use core scheduling. See `Future work`_
+for other API proposals.
+
+Design/Implementation
+---------------------
+Tasks are grouped as mentioned in `Usage`_ and tasks that trust each other
+share the same cookie value (in task_struct).
+
+The basic idea is that every schedule event tries to select tasks for all the
+siblings of a core such that all the selected tasks are trusted (same cookie).
+
+During a schedule event on any sibling of a core, the highest priority task for
+that core is picked and assigned to the sibling which has it enqueued. For the
+rest of the siblings in the core, the highest priority task with the same cookie
+is selected if there is one runnable in the run queue. If a task with the same
+cookie is not available, the idle task is selected. The idle task is globally
+trusted.
+
+Once a task has been selected for all the siblings in the core, an IPI is sent
+to all the siblings that have a new task selected. On receiving the IPI, the
+siblings will switch to the new task immediately.
+
+Force-idling of tasks
+---------------------
+The scheduler tries its best to find tasks that trust each other such that all
+tasks selected to be scheduled are of the highest priority in that runqueue.
+However, this is not always possible. Favoring security over fairness, one or
+more siblings could be forced to select a lower priority task if the highest
+priority task is not trusted with respect to the core wide highest priority
+task. If a sibling does not have a trusted task to run, it will be forced idle
+by the scheduler (the idle thread is scheduled to run).
+
+When the highest priority task is selected to run, a reschedule-IPI is sent to
+the sibling to force it into idle. This results in 4 cases which need to be
+considered depending on whether a VM or a regular usermode process was running
+on either HT:
+
+::
+
+          HT1 (attack)          HT2 (victim)
+
+   A      idle -> user space    user space -> idle
+
+   B      idle -> user space    guest -> idle
+
+   C      idle -> guest         user space -> idle
+
+   D      idle -> guest         guest -> idle
+
+Note that for better performance, we do not wait for the destination CPU
+(victim) to enter idle mode.  This is because the sending of the IPI would
+bring the destination CPU immediately into kernel mode from user space, or
+VMEXIT from guests. At best, this would only leak some scheduler metadata which
+may not be worth protecting.
+
+Protection against interrupts using IRQ pausing
+-----------------------------------------------
+The scheduler on its own cannot protect interrupt data. This is because the
+scheduler is unaware of interrupts at scheduling time. To mitigate this, we
+send an IPI to siblings on IRQ entry. This IPI handler busy-waits until the IRQ
+on the sending HT exits. For good performance, we send an IPI only if it is
+detected that the core is running tasks that have been marked for
+core scheduling. Both interrupts and softirqs are protected.
+
+This protection can be disabled by disabling ``CONFIG_
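
As a usage example for the interface above, a task (or an init script) can
tag an existing CPU controller cgroup by writing 1 to its cpu.tag file; the
cgroup path below is only an example:

    /* Hypothetical helper: mark the cgroup "trusted_group" as core-sched
     * trusted, equivalent to `echo 1 > .../trusted_group/cpu.tag`. */
    #include <fcntl.h>
    #include <unistd.h>

    static int tag_trusted_group(void)
    {
            int fd = open("/sys/fs/cgroup/cpu/trusted_group/cpu.tag", O_WRONLY);

            if (fd < 0)
                    return -1;

            if (write(fd, "1", 1) != 1) {
                    close(fd);
                    return -1;
            }

            close(fd);
            return 0;
    }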

[RFC PATCH 13/16] sched: Fix pick_next_task() race condition in core scheduling

2020-06-30 Thread Vineeth Remanan Pillai
From: Chen Yu 

As Peter mentioned, commit 6e2df0581f56 ("sched: Fix pick_next_task()
vs 'change' pattern race") fixed a race condition due to rq->lock being
improperly released after put_prev_task(); backport this fix to core
scheduling's pick_next_task() as well.

Without this fix, Aubrey, Long and I found a NULL pointer dereference
triggered within one hour when running RDT MBA (Intel Resource Director
Technology Memory Bandwidth Allocation) benchmarks on a 36 core (72 HT)
platform; the scheduler tries to dereference a NULL sched_entity:

[ 3618.429053] BUG: kernel NULL pointer dereference, address: 0160
[ 3618.429039] RIP: 0010:pick_task_fair+0x2e/0xa0
[ 3618.429042] RSP: 0018:c9317da8 EFLAGS: 00010046
[ 3618.429044] RAX:  RBX: 88afdf4ad100 RCX: 0001
[ 3618.429045] RDX:  RSI:  RDI: 88afdf4ad100
[ 3618.429045] RBP: c9317dc0 R08: 0048 R09: 0110
[ 3618.429046] R10: 0001 R11:  R12: 
[ 3618.429047] R13: 0002d080 R14: 88afdf4ad080 R15: 0014
[ 3618.429048]  ? pick_task_fair+0x48/0xa0
[ 3618.429048]  pick_next_task+0x34c/0x7e0
[ 3618.429049]  ? tick_program_event+0x44/0x70
[ 3618.429049]  __schedule+0xee/0x5d0
[ 3618.429050]  schedule_idle+0x2c/0x40
[ 3618.429051]  do_idle+0x175/0x280
[ 3618.429051]  cpu_startup_entry+0x1d/0x30
[ 3618.429052]  start_secondary+0x169/0x1c0
[ 3618.429052]  secondary_startup_64+0xa4/0xb0

With this patch applied, no NULL pointer exception has been found within
14 hours so far. Although there is no direct evidence that this fix resolves
the issue, it does fix a potential race condition.

Signed-off-by: Chen Yu 
Signed-off-by: Vineeth Remanan Pillai 
---
 kernel/sched/core.c  | 44 +---
 kernel/sched/fair.c  |  9 ++---
 kernel/sched/sched.h |  7 ---
 3 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c84f209b8591..ede86fb37b4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4169,6 +4169,29 @@ static inline void schedule_debug(struct task_struct 
*prev, bool preempt)
schedstat_inc(this_rq()->sched_count);
 }
 
+static inline void
+finish_prev_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+#ifdef CONFIG_SMP
+   const struct sched_class *class;
+
+   /*
+* We must do the balancing pass before put_next_task(), such
+* that when we release the rq->lock the task is in the same
+* state as before we took rq->lock.
+*
+* We can terminate the balance pass as soon as we know there is
+* a runnable task of @class priority or higher.
+*/
+   for_class_range(class, prev->sched_class, &idle_sched_class) {
+   if (class->balance(rq, prev, rf))
+   break;
+   }
+#endif
+
+   put_prev_task(rq, prev);
+}
+
 /*
  * Pick up the highest-prio task:
  */
@@ -4202,22 +4225,7 @@ __pick_next_task(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
}
 
 restart:
-#ifdef CONFIG_SMP
-   /*
-* We must do the balancing pass before put_next_task(), such
-* that when we release the rq->lock the task is in the same
-* state as before we took rq->lock.
-*
-* We can terminate the balance pass as soon as we know there is
-* a runnable task of @class priority or higher.
-*/
-   for_class_range(class, prev->sched_class, &idle_sched_class) {
-   if (class->balance(rq, prev, rf))
-   break;
-   }
-#endif
-
-   put_prev_task(rq, prev);
+   finish_prev_task(rq, prev, rf);
 
for_each_class(class) {
p = class->pick_next_task(rq);
@@ -4323,9 +4331,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
return next;
}
 
-   prev->sched_class->put_prev_task(rq, prev);
-   if (!rq->nr_running)
-   newidle_balance(rq, rf);
+   finish_prev_task(rq, prev, rf);
 
cpu = cpu_of(rq);
smt_mask = cpu_smt_mask(cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33dc4bf01817..435b460d3c3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -115,6 +115,7 @@ int __weak arch_asym_cpu_priority(int cpu)
  */
 #define fits_capacity(cap, max)((cap) * 1280 < (max) * 1024)
 
+static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -7116,9 +7117,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
struct task_struct *p;
+#ifdef CONFIG_SMP
int new_tasks;
 
 again:
+#endif
if (!sched_fair_runnable(rq))
 

[RFC PATCH 07/16] sched/fair: Fix forced idle sibling starvation corner case

2020-06-30 Thread Vineeth Remanan Pillai
From: vpillai 

If there is only one long running local task and the sibling is
forced idle, it  might not get a chance to run until a schedule
event happens on any cpu in the core.

So we check for this condition during a tick to see if a sibling
is starved and then give it a chance to schedule.

Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---
 kernel/sched/fair.c | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae17507533a0..49fb93296e35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10613,6 +10613,40 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_SCHED_CORE
+static inline bool
+__entity_slice_used(struct sched_entity *se)
+{
+   return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
+   sched_slice(cfs_rq_of(se), se);
+}
+
+/*
+ * If runqueue has only one task which used up its slice and if the sibling
+ * is forced idle, then trigger schedule to give forced idle task a chance.
+ */
+static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
+{
+   int cpu = cpu_of(rq), sibling_cpu;
+
+   if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
+   return;
+
+   for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
+   struct rq *sibling_rq;
+   if (sibling_cpu == cpu)
+   continue;
+   if (cpu_is_offline(sibling_cpu))
+   continue;
+
+   sibling_rq = cpu_rq(sibling_cpu);
+   if (sibling_rq->core_forceidle) {
+   resched_curr(sibling_rq);
+   }
+   }
+}
+#endif
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10636,6 +10670,11 @@ static void task_tick_fair(struct rq *rq, struct 
task_struct *curr, int queued)
 
update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
+
+#ifdef CONFIG_SCHED_CORE
+   if (sched_core_enabled(rq))
+   resched_forceidle_sibling(rq, &curr->se);
+#endif
 }
 
 /*
-- 
2.17.1



[RFC PATCH 09/16] sched/fair: core wide cfs task priority comparison

2020-06-30 Thread Vineeth Remanan Pillai
From: Aaron Lu 

This patch provides a vruntime based way to compare two cfs task's
priority, be it on the same cpu or different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, each thread
will choose the next task to run the usual way and then the root level
sched entities which the two tasks belong to will be used to decide
which task runs next core wide.

An illustration for the cross CPU case:

              cpu0                  cpu1
            /   |  \              /   |  \
          se1  se2  se3         se4  se5  se6
              /   \                      /   \
           se21   se22               se61   se62
                   (A)                      /
                                         se621
                                          (B)

Assume cpu0 and cpu1 are SMT siblings, cpu0 has decided to run task A next,
and cpu1 has decided to run task B next. To compare the priority of task A
and task B, we compare the priority of se2 and se6: whichever has the
smaller vruntime wins.

To make this work, the root level sched entities' vruntime of the two
threads must be directly comparable. So one of the hyperthread's root
cfs_rq's min_vruntime is chosen as the core wide one and all root level
sched entities' vruntime is normalized against it.

Sub cfs_rqs and their sched entities are not interesting for the cross cpu
priority comparison as they only participate in the usual cpu-local
scheduling decision, so there is no need to normalize their vruntimes.
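
In code, the cross cpu comparison described above boils down to roughly the
following (a trimmed sketch of the fair-class compare added below; the same
cpu case and the CONFIG_FAIR_GROUP_SCHED walk up to the root level entities
are omitted here):

    bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
    {
            /* Root level sched entities; this patch keeps their vruntime
             * normalized against the core wide min_vruntime. */
            struct sched_entity *sea = &a->se;
            struct sched_entity *seb = &b->se;
            s64 delta;

            delta = (s64)(sea->vruntime - seb->vruntime);

            /* A larger vruntime means lower priority, so 'a' is "less"
             * than 'b' exactly when its (normalized) vruntime is bigger. */
            return delta > 0;
    }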

Signed-off-by: Aaron Lu 
---
 kernel/sched/core.c  |  23 +++
 kernel/sched/fair.c  | 142 ++-
 kernel/sched/sched.h |   3 +
 3 files changed, 150 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f51e5c4798c8..4d6d6a678013 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -114,19 +114,8 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b)
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-   u64 vruntime = b->se.vruntime;
-
-   /*
-* Normalize the vruntime if tasks are in different cpus.
-*/
-   if (task_cpu(a) != task_cpu(b)) {
-   vruntime -= task_cfs_rq(b)->min_vruntime;
-   vruntime += task_cfs_rq(a)->min_vruntime;
-   }
-
-   return !((s64)(a->se.vruntime - vruntime) <= 0);
-   }
+   if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+   return cfs_prio_less(a, b);
 
return false;
 }
@@ -229,8 +218,12 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;
 
-   for_each_possible_cpu(cpu)
-   cpu_rq(cpu)->core_enabled = enabled;
+   for_each_possible_cpu(cpu) {
+   struct rq *rq = cpu_rq(cpu);
+   rq->core_enabled = enabled;
+   if (cpu_online(cpu) && rq->core != rq)
+   sched_core_adjust_sibling_vruntime(cpu, enabled);
+   }
 
return 0;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61d19e573443..d16939766361 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -462,11 +462,142 @@ find_matching_se(struct sched_entity **se, struct 
sched_entity **pse)
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_CORE
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+   return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+   return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+   return &rq_of(cfs_rq)->core->cfs;
+}
+
+static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+   if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+   return cfs_rq->min_vruntime;
+
+   return core_cfs_rq(cfs_rq)->min_vruntime;
+}
+
+#ifndef CONFIG_64BIT
+static inline u64 cfs_rq_min_vruntime_copy(struct cfs_rq *cfs_rq)
+{
+   if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+   return cfs_rq->min_vruntime_copy;
+
+   return core_cfs_rq(cfs_rq)->min_vruntime_copy;
+}
+#endif /* CONFIG_64BIT */
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+   struct sched_entity *sea = &a->se;
+   struct sched_entity *seb = &b->se;
+   bool samecpu = task_cpu(a) == task_cpu(b);
+   s64 delta;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+   if (samecpu) {
+   /* vruntime is per cfs_rq */
+   while (!is_same_group(sea, seb)) {
+   int sea_depth = sea->depth;
+   int seb_depth = seb->depth;
+
+   if (sea_depth >= seb_depth)
+   sea = parent_entity(sea);
+ 

[RFC PATCH 06/16] sched: Add core wide task selection and scheduling.

2020-06-30 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective which logical
CPU does the reschedule).

There could be races in the core scheduler where a CPU is trying to pick
a task for its sibling in the core scheduler, when that CPU has just been
offlined.  We should not schedule any tasks on the CPU in this case.
Return an idle task in pick_next_task for this situation.

NOTE: there is still potential for siblings rivalry.
NOTE: this is far too complicated; but thus far I've failed to
  simplify it further.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Aaron Lu 
Signed-off-by: Tim Chen 
---
 kernel/sched/core.c  | 263 ++-
 kernel/sched/sched.h |   6 +-
 2 files changed, 267 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b21bcab20da6..f51e5c4798c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4113,7 +4113,7 @@ static inline void schedule_debug(struct task_struct 
*prev, bool preempt)
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
const struct sched_class *class;
struct task_struct *p;
@@ -4169,6 +4169,262 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+   return is_idle_task(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+   if (is_idle_task(a) || is_idle_task(b))
+   return true;
+
+   return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max)
+{
+   struct task_struct *class_pick, *cookie_pick;
+   unsigned long cookie = rq->core->core_cookie;
+
+   class_pick = class->pick_task(rq);
+   if (!class_pick)
+   return NULL;
+
+   if (!cookie) {
+   /*
+* If class_pick is tagged, return it only if it has
+* higher priority than max.
+*/
+   if (max && class_pick->core_cookie &&
+   prio_less(class_pick, max))
+   return idle_sched_class.pick_task(rq);
+
+   return class_pick;
+   }
+
+   /*
+* If class_pick is idle or matches cookie, return early.
+*/
+   if (cookie_equals(class_pick, cookie))
+   return class_pick;
+
+   cookie_pick = sched_core_find(rq, cookie);
+
+   /*
+* If class > max && class > cookie, it is the highest priority task on
+* the core (so far) and it must be selected, otherwise we must go with
+* the cookie pick in order to satisfy the constraint.
+*/
+   if (prio_less(cookie_pick, class_pick) &&
+   (!max || prio_less(max, class_pick)))
+   return class_pick;
+
+   return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+   struct task_struct *next, *max = NULL;
+   const struct sched_class *class;
+   const struct cpumask *smt_mask;
+   int i, j, cpu;
+   bool need_sync;
+
+   if (!sched_core_enabled(rq))
+   return __pick_next_task(rq, prev, rf);
+
+   /*
+* If there were no {en,de}queues since we picked (IOW, the task
+* pointers are all still valid), and we haven't scheduled the last
+* pick yet, do so now.
+*/
+   if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+   rq->core->core_pick_seq != rq->core_sched_seq) {
+   WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+   next = rq->core_pick;
+   if (next != prev) {
+   put_prev_task(rq, prev);
+   set_next_task(rq, next);
+   }
+   return next;
+   }
+
+   prev->sched_class->put_prev_task(rq, prev);
+   if (!rq->nr_running)
+   newidle_balance(rq, rf);
+
+   cpu = cpu_of(rq);
+   smt_mask = cpu_smt_mask(cpu);
+
+   /*
+* core->core_

[RFC PATCH 05/16] sched: Basic tracking of matching tasks

2020-06-30 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled; core scheduling will only allow matching
tasks to be on the core; where idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled)
these tasks are indexed in a second RB-tree, first on cookie value
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---
 include/linux/sched.h |   8 ++-
 kernel/sched/core.c   | 146 ++
 kernel/sched/fair.c   |  46 -
 kernel/sched/sched.h  |  55 
 4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..3c8dcc5ff039 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -683,10 +683,16 @@ struct task_struct {
const struct sched_class*sched_class;
struct sched_entity se;
struct sched_rt_entity  rt;
+   struct sched_dl_entity  dl;
+
+#ifdef CONFIG_SCHED_CORE
+   struct rb_node  core_node;
+   unsigned long   core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
struct task_group   *sched_task_group;
 #endif
-   struct sched_dl_entity  dl;
 
 #ifdef CONFIG_UCLAMP_TASK
/* Clamp values requested for a scheduling entity */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4b81301e3f21..b21bcab20da6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -77,6 +77,141 @@ int sysctl_sched_rt_runtime = 950000;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+   if (p->sched_class == &stop_sched_class) /* trumps deadline */
+   return -2;
+
+   if (rt_prio(p->prio)) /* includes deadline */
+   return p->prio; /* [-1, 99] */
+
+   if (p->sched_class == &idle_sched_class)
+   return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+   return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+   int pa = __task_prio(a), pb = __task_prio(b);
+
+   if (-pa < -pb)
+   return true;
+
+   if (-pb < -pa)
+   return false;
+
+   if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+   return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+   u64 vruntime = b->se.vruntime;
+
+   /*
+* Normalize the vruntime if tasks are in different cpus.
+*/
+   if (task_cpu(a) != task_cpu(b)) {
+   vruntime -= task_cfs_rq(b)->min_vruntime;
+   vruntime += task_cfs_rq(a)->min_vruntime;
+   }
+
+   return !((s64)(a->se.vruntime - vruntime) <= 0);
+   }
+
+   return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct 
*b)
+{
+   if (a->core_cookie < b->core_cookie)
+   return true;
+
+   if (a->core_cookie > b->core_cookie)
+   return false;
+
+   /* flip prio, so high prio is leftmost */
+   if (prio_less(b, a))
+   return true;
+
+   return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+   struct rb_node *parent, **node;
+   struct task_struct *node_task;
+
+   rq->core->core_task_seq++;
+
+   if (!p->core_cookie)
+   return;
+
+   node = &rq->core_tree.rb_node;
+   parent = *node;
+
+   while (*node) {
+   node_task = container_of(*node, struct task_struct, core_node);
+   parent = *node;
+
+   if (__sched_core_less(p, node_task))
+   node = &parent->rb_left;
+   else
+   node = &parent->rb_right;
+   }
+
+   rb_link_node(&p->core_node, parent, node);
+   rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+   rq->core->core_task_seq++;
+
+   if (!p->core_cookie)
+   return;
+
+   rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-m

[RFC PATCH 08/16] sched/fair: wrapper for cfs_rq->min_vruntime

2020-06-30 Thread Vineeth Remanan Pillai
From: Aaron Lu 

Add a wrapper function cfs_rq_min_vruntime(cfs_rq) to
return cfs_rq->min_vruntime.

It will be used in the following patch, no functionality
change.

Signed-off-by: Aaron Lu 
---
 kernel/sched/fair.c | 27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49fb93296e35..61d19e573443 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -462,6 +462,11 @@ find_matching_se(struct sched_entity **se, struct 
sched_entity **pse)
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+   return cfs_rq->min_vruntime;
+}
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -498,7 +503,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
struct sched_entity *curr = cfs_rq->curr;
struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
 
-   u64 vruntime = cfs_rq->min_vruntime;
+   u64 vruntime = cfs_rq_min_vruntime(cfs_rq);
 
if (curr) {
if (curr->on_rq)
@@ -518,7 +523,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
}
 
/* ensure we never gain time by being placed backwards. */
-   cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+   cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), 
vruntime);
 #ifndef CONFIG_64BIT
smp_wmb();
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
@@ -4026,7 +4031,7 @@ static inline void update_misfit_status(struct 
task_struct *p, struct rq *rq) {}
 static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHED_DEBUG
-   s64 d = se->vruntime - cfs_rq->min_vruntime;
+   s64 d = se->vruntime - cfs_rq_min_vruntime(cfs_rq);
 
if (d < 0)
d = -d;
@@ -4039,7 +4044,7 @@ static void check_spread(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
-   u64 vruntime = cfs_rq->min_vruntime;
+   u64 vruntime = cfs_rq_min_vruntime(cfs_rq);
 
/*
 * The 'current' period is already promised to the current tasks,
@@ -4133,7 +4138,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
 * update_curr().
 */
if (renorm && curr)
-   se->vruntime += cfs_rq->min_vruntime;
+   se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 
update_curr(cfs_rq);
 
@@ -4144,7 +4149,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
 * fairness detriment of existing tasks.
 */
if (renorm && !curr)
-   se->vruntime += cfs_rq->min_vruntime;
+   se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 
/*
 * When enqueuing a sched_entity, we must:
@@ -4263,7 +4268,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
 * can move min_vruntime forward still more.
 */
if (!(flags & DEQUEUE_SLEEP))
-   se->vruntime -= cfs_rq->min_vruntime;
+   se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
 
/* return excess runtime on last dequeue */
return_cfs_rq_runtime(cfs_rq);
@@ -6700,7 +6705,7 @@ static void migrate_task_rq_fair(struct task_struct *p, 
int new_cpu)
min_vruntime = cfs_rq->min_vruntime;
} while (min_vruntime != min_vruntime_copy);
 #else
-   min_vruntime = cfs_rq->min_vruntime;
+   min_vruntime = cfs_rq_min_vruntime(cfs_rq);
 #endif
 
se->vruntime -= min_vruntime;
@@ -10709,7 +10714,7 @@ static void task_fork_fair(struct task_struct *p)
resched_curr(rq);
}
 
-   se->vruntime -= cfs_rq->min_vruntime;
+   se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
rq_unlock(rq, &rf);
 }
 
@@ -10832,7 +10837,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 * cause 'unlimited' sleep bonus.
 */
place_entity(cfs_rq, se, 0);
-   se->vruntime -= cfs_rq->min_vruntime;
+   se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
}
 
detach_entity_cfs_rq(se);
@@ -10846,7 +10851,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
attach_entity_cfs_rq(se);
 
if (!vruntime_normalized(p))
-   se->vruntime += cfs_rq->min_vruntime;
+   se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
-- 
2.17.1



[RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

2020-06-30 Thread Vineeth Remanan Pillai
From: "Joel Fernandes (Google)" 

With the current core scheduling patchset, non-threaded IRQ and softirq
victims can leak data from their hyperthread to a sibling hyperthread
running an attacker.

For MDS, it is possible for the IRQ and softirq handlers to leak data to
either host or guest attackers. For L1TF, it is possible to leak to
guest attackers. There is no possible mitigation involving flushing of
buffers to avoid this since the execution of attacker and victims happen
concurrently on 2 or more HTs.

The solution in this patch is to monitor the outer-most core-wide
irq_enter() and irq_exit() executed by any sibling. In between these
two, we mark the core to be in a special core-wide IRQ state.

In the IRQ entry, if we detect that the sibling is running untrusted
code, we send a reschedule IPI so that the sibling transitions through
the sibling's irq_exit() to do any waiting there, till the IRQ being
protected finishes.

We also monitor the per-CPU outer-most irq_exit(). If during the per-cpu
outer-most irq_exit(), the core is still in the special core-wide IRQ
state, we perform a busy-wait till the core exits this state. This
combination of per-cpu and core-wide IRQ states helps to handle any
combination of irq_entry()s and irq_exit()s happening on all of the
siblings of the core in any order.

Lastly, we also check in the schedule loop if we are about to schedule
an untrusted process while the core is in such a state. This is possible
if a trusted thread enters the scheduler by yielding the CPU. That path
involves no transition through the irq_exit() point to do any waiting,
so we have to do the waiting explicitly there.
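
Purely to illustrate the scheme described above (this is not the code added
by the patch, and the names below are made up), the core-wide accounting
boils down to a nesting counter shared by the siblings of a core:

#include <linux/atomic.h>
#include <linux/processor.h>

/* Hypothetical per-core state, for illustration only. */
struct core_irq_state {
        atomic_t nesting;       /* outer-most irq_enter()s active on any sibling */
};

static void core_irq_enter(struct core_irq_state *cs)
{
        atomic_inc(&cs->nesting);
        /* here: IPI siblings running untrusted tasks so they wait in irq_exit() */
}

static void core_irq_exit(struct core_irq_state *cs)
{
        atomic_dec(&cs->nesting);
}

/* Per-cpu outer-most irq_exit(): busy-wait until the core leaves the state. */
static void core_irq_wait(struct core_irq_state *cs)
{
        while (atomic_read(&cs->nesting))
                cpu_relax();
}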

Every attempt is made to avoid an unnecessary busy-wait, and in
testing on real-world ChromeOS usecases, it has not shown a performance
drop. In ChromeOS, with this and the rest of the core scheduling
patchset, we see around a 300% improvement in key press latencies into
Google docs when camera streaming is running simultaneously (90th
percentile latency of ~150ms drops to ~50ms).

This feature is controlled by the build time config option
CONFIG_SCHED_CORE_IRQ_PAUSE and is enabled by default. There is also a
kernel boot parameter 'sched_core_irq_pause' to enable/disable the
feature at boot time. It is enabled by default at boot time.

Cc: Julien Desfossez 
Cc: Tim Chen 
Cc: Aaron Lu 
Cc: Aubrey Li 
Cc: Tim Chen 
Cc: Paul E. McKenney 
Co-developed-by: Vineeth Pillai 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/kernel-parameters.txt |   9 +
 include/linux/sched.h |   5 +
 kernel/Kconfig.preempt|  13 ++
 kernel/sched/core.c   | 161 ++
 kernel/sched/sched.h  |   7 +
 kernel/softirq.c  |  46 +
 6 files changed, 241 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 5e2ce88d6eda..d44d7a997610 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4445,6 +4445,15 @@
 
sbni=   [NET] Granch SBNI12 leased line adapter
 
+   sched_core_irq_pause=
+   [SCHED_CORE, SCHED_CORE_IRQ_PAUSE] Pause SMT siblings
+   of a core if at least one of the siblings of the core
+   is running nmi/irq/softirq. This is to guarantee that
+   kernel data is not leaked to tasks which are not trusted
+   by the kernel.
+   This feature is valid only when Core scheduling is
+   enabled (SCHED_CORE).
+
sched_debug [KNL] Enables verbose scheduler debug messages.
 
schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f9edf013df3..097746a9f260 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2025,4 +2025,9 @@ int sched_trace_rq_cpu(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
+void sched_core_irq_enter(void);
+void sched_core_irq_exit(void);
+#endif
+
 #endif
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 4488fbf4d3a8..59094a66a987 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -86,3 +86,16 @@ config SCHED_CORE
default y
depends on SCHED_SMT
 
+config SCHED_CORE_IRQ_PAUSE
+   bool "Pause siblings on entering irq/softirq during core-scheduling"
+   default y
+   depends on SCHED_CORE
+   help
+ This option enables pausing all SMT siblings of a core when at least
+ one of the siblings in the core is in nmi/irq/softirq. This is to
+ enforce security such that information from the kernel is not leaked to
+ non-trusted tasks running on the siblings of the core.

[RFC PATCH 02/16] sched: Introduce sched_class::pick_task()

2020-06-30 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance) it is not state invariant. This makes it unsuitable
for remote task selection.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---
 kernel/sched/deadline.c  | 16 ++--
 kernel/sched/fair.c  | 36 +---
 kernel/sched/idle.c  |  8 
 kernel/sched/rt.c| 14 --
 kernel/sched/sched.h |  3 +++
 kernel/sched/stop_task.c | 13 +++--
 6 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 34f95462b838..b56ef74d2d74 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1775,7 +1775,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct 
rq *rq,
return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *pick_next_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
struct sched_dl_entity *dl_se;
struct dl_rq *dl_rq = &rq->dl;
@@ -1787,7 +1787,18 @@ static struct task_struct *pick_next_task_dl(struct rq 
*rq)
dl_se = pick_next_dl_entity(rq, dl_rq);
BUG_ON(!dl_se);
p = dl_task_of(dl_se);
-   set_next_task_dl(rq, p, true);
+
+   return p;
+}
+
+static struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+   struct task_struct *p;
+
+   p = pick_task_dl(rq);
+   if (p)
+   set_next_task_dl(rq, p, true);
+
return p;
 }
 
@@ -2444,6 +2455,7 @@ const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
.balance= balance_dl,
+   .pick_task  = pick_task_dl,
.select_task_rq = select_task_rq_dl,
.migrate_task_rq= migrate_task_rq_dl,
.set_cpus_allowed   = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43ff0e5cf387..5e9f11c8256f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4428,7 +4428,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *curr)
 * Avoid running the skip buddy, if running something else can
 * be done without getting too unfair.
 */
-   if (cfs_rq->skip == se) {
+   if (cfs_rq->skip && cfs_rq->skip == se) {
struct sched_entity *second;
 
if (se == curr) {
@@ -4446,13 +4446,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *curr)
/*
 * Prefer last buddy, try to return the CPU to a preempted task.
 */
-   if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+   if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 
1)
se = cfs_rq->last;
 
/*
 * Someone really wants this to run. If it's not unfair, run it.
 */
-   if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+   if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 
1)
se = cfs_rq->next;
 
clear_buddies(cfs_rq, se);
@@ -6953,6 +6953,35 @@ static void check_preempt_wakeup(struct rq *rq, struct 
task_struct *p, int wake_
set_last_buddy(se);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_fair(struct rq *rq)
+{
+   struct cfs_rq *cfs_rq = &rq->cfs;
+   struct sched_entity *se;
+
+   if (!cfs_rq->nr_running)
+   return NULL;
+
+   do {
+   struct sched_entity *curr = cfs_rq->curr;
+
+   se = pick_next_entity(cfs_rq, NULL);
+
+   if (curr) {
+   if (se && curr->on_rq)
+   update_curr(cfs_rq);
+
+   if (!se || entity_before(curr, se))
+   se = curr;
+   }
+
+   cfs_rq = group_cfs_rq(se);
+   } while (cfs_rq);
+
+   return task_of(se);
+}
+#endif
+
 struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags 
*rf)
 {
@@ -11134,6 +11163,7 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
.balance= balance_fair,
+   .pick_task  = pick_task_fair,
.select_task_rq = select_task_rq_fair,
.migrate_task_rq= migrate_task_rq_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 0106d34f1d8c..a8d40ffab097 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -397,6 +397,13 @@ static void set_next_task_idle(struct rq *rq, struct 
task_struct *next, bool fir
schedstat_inc(rq->sched_goidle);
 }
 
+#ifdef CONFIG_SMP
+static struct task_

[RFC PATCH 01/16] sched: Wrap rq::lock access

2020-06-30 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

In preparation of playing games with rq->lock, abstract the thing
using an accessor.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---
 kernel/sched/core.c |  46 +-
 kernel/sched/cpuacct.c  |  12 ++---
 kernel/sched/deadline.c |  18 +++
 kernel/sched/debug.c|   4 +-
 kernel/sched/fair.c |  38 +++
 kernel/sched/idle.c |   4 +-
 kernel/sched/pelt.h |   2 +-
 kernel/sched/rt.c   |   8 +--
 kernel/sched/sched.h| 105 +---
 kernel/sched/topology.c |   4 +-
 10 files changed, 122 insertions(+), 119 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5eccfb816d23..ef594ace6ffd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -85,12 +85,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
 
for (;;) {
rq = task_rq(p);
-   raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
rq_pin_lock(rq, rf);
return rq;
}
-   raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
 
while (unlikely(task_on_rq_migrating(p)))
cpu_relax();
@@ -109,7 +109,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
for (;;) {
raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
rq = task_rq(p);
-   raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
/*
 *  move_queued_task()  task_rq_lock()
 *
@@ -131,7 +131,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
rq_pin_lock(rq, rf);
return rq;
}
-   raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
while (unlikely(task_on_rq_migrating(p)))
@@ -201,7 +201,7 @@ void update_rq_clock(struct rq *rq)
 {
s64 delta;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
if (rq->clock_update_flags & RQCF_ACT_SKIP)
return;
@@ -505,7 +505,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
if (test_tsk_need_resched(curr))
return;
@@ -529,10 +529,10 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
 
-   raw_spin_lock_irqsave(&rq->lock, flags);
+   raw_spin_lock_irqsave(rq_lockp(rq), flags);
if (cpu_online(cpu) || cpu == smp_processor_id())
resched_curr(rq);
-   raw_spin_unlock_irqrestore(&rq->lock, flags);
+   raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -947,7 +947,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct 
task_struct *p,
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
struct uclamp_bucket *bucket;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
/* Update task effective clamp */
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -987,7 +987,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct 
task_struct *p,
unsigned int bkt_clamp;
unsigned int rq_clamp;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
bucket = &uc_rq->bucket[uc_se->bucket_id];
SCHED_WARN_ON(!bucket->tasks);
@@ -1472,7 +1472,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, 
int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
   struct task_struct *p, int new_cpu)
 {
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
dequeue_task(rq, p, DEQUEUE_NOCLOCK);
@@ -1586,7 +1586,7 @@ void do_set_cpus_allowed(struct task_struct *p, const 
struct cpumask *new_mask)
 * Because __kthread_bind() calls this on blocked tasks without
 * holding rq->lock.
 */
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
}
if (running)
@@ -1723,7 +1723,7 @@ void set_task_cpu(struct task_struct *p, unsigned int 
new_cpu)
 * task_rq_lock().
 */
WARN_ON_ONCE(debug_locks &&

[RFC PATCH 00/16] Core scheduling v6

2020-06-30 Thread Vineeth Remanan Pillai
Sixth iteration of the Core-Scheduling feature.

Core scheduling is a feature that allows only trusted tasks to run
concurrently on cpus sharing compute resources (eg: hyperthreads on a
core). The goal is to mitigate the core-level side-channel attacks
without requiring to disable SMT (which has a significant impact on
performance in some situations). Core scheduling (as of v6) mitigates
user-space to user-space attacks and user-to-kernel attacks when one of
the siblings enters the kernel via interrupts. It is still possible to
have a task attack the sibling thread when it enters the kernel via
syscalls.

By default, the feature doesn't change any of the current scheduler
behavior. The user decides which tasks can run simultaneously on the
same core (for now by having them in the same tagged cgroup). When a
tag is enabled in a cgroup and a task from that cgroup is running on a
hardware thread, the scheduler ensures that only idle or trusted tasks
run on the other sibling(s). Besides security concerns, this feature
can also be beneficial for RT and performance applications where we
want to control how tasks make use of SMT dynamically.

This iteration is mostly a cleanup of v5 except for a major feature of
pausing siblings when a cpu enters the kernel via nmi/irq/softirq. It also
introduces documentation and includes minor crash fixes.

One major cleanup was removing the hotplug support and related code.
The hotplug related crashes were not documented and the fixes piled up
over time, leading to complex code. We were not able to reproduce the
crashes in the limited testing done. But if they are reproducible, we
don't want to hide them. We should document them and design better
fixes if needed.

In terms of performance, the results in this release are similar to
v5. On an x86 system with N hardware threads:
- if only N/2 hardware threads are busy, the performance is similar
  between baseline, corescheduling and nosmt
- if N hardware threads are busy with N different corescheduling
  groups, the impact of corescheduling is similar to nosmt
- if N hardware threads are busy and multiple active threads share the
  same corescheduling cookie, they gain a performance improvement over
  nosmt.
  The specific performance impact depends on the workload, but for a
  really busy database 12-vcpu VM (1 coresched tag) running on a 36
  hardware threads NUMA node with 96 mostly idle neighbor VMs (each in
  their own coresched tag), the performance drops by 54% with
  corescheduling and drops by 90% with nosmt.

v6 is rebased on 5.7.6(a06eb423367e)
https://github.com/digitalocean/linux-coresched/tree/coresched/v6-v5.7.y

Changes in v6
-
- Documentation
  - Joel
- Pause siblings on entering nmi/irq/softirq
  - Joel, Vineeth
- Fix for RCU crash
  - Joel
- Fix for a crash in pick_next_task
  - Yu Chen, Vineeth
- Minor re-write of core-wide vruntime comparison
  - Aaron Lu
- Cleanup: Address Review comments
- Cleanup: Remove hotplug support (for now)
- Build fixes: 32 bit, SMT=n, AUTOGROUP=n etc
  - Joel, Vineeth

Changes in v5
-
- Fixes for cgroup/process tagging during corner cases like cgroup
  destroy, task moving across cgroups etc
  - Tim Chen
- Coresched aware task migrations
  - Aubrey Li
- Other minor stability fixes.

Changes in v4
-
- Implement a core wide min_vruntime for vruntime comparison of tasks
  across cpus in a core.
  - Aaron Lu
- Fixes a typo bug in setting the forced_idle cpu.
  - Aaron Lu

Changes in v3
-
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-
- Fixes for couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for process in different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

ISSUES
--
- Aaron (Intel) found an issue with load balancing when the tasks have
  different weights (nice or cgroup shares). Task weight is not considered
  in coresched aware load balancing, which causes higher weight tasks
  to starve. This issue was in v5 as well and is carried over.
- Joel (ChromeOS) found an issue where an RT task may be preempted by a
  lower class task. This was present in v5 as well and is carried over.
- Coresched RB-tree doesn't get updated when task priority changes
- Potential starvation of untagged tasks (0 cookie) - a side effect of
  0 cookie tasks not being in the coresched RB-Tree

TODO

- MAJOR: Core wide vruntime 

[RFC PATCH 03/16] sched: Core-wide rq->lock

2020-06-30 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Introduce the basic infrastructure to have a core wide rq->lock.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
---
 kernel/Kconfig.preempt |  6 +++
 kernel/sched/core.c| 91 ++
 kernel/sched/sched.h   | 31 ++
 3 files changed, 128 insertions(+)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..4488fbf4d3a8 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,3 +80,9 @@ config PREEMPT_COUNT
 config PREEMPTION
bool
select PREEMPT_COUNT
+
+config SCHED_CORE
+   bool "Core Scheduling for SMT"
+   default y
+   depends on SCHED_SMT
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ef594ace6ffd..4b81301e3f21 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -73,6 +73,70 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 95;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ * spin_lock(rq_lockp(rq));
+ * ...
+ * spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+   bool enabled = !!(unsigned long)data;
+   int cpu;
+
+   for_each_possible_cpu(cpu)
+   cpu_rq(cpu)->core_enabled = enabled;
+
+   return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+   // XXX verify there are no cookie tasks (yet)
+
+   static_branch_enable(&__sched_core_enabled);
+   stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+   // XXX verify there are no cookie tasks (left)
+
+   stop_machine(__sched_core_stopper, (void *)false, NULL);
+   static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+   mutex_lock(&sched_core_mutex);
+   if (!sched_core_count++)
+   __sched_core_enable();
+   mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+   mutex_lock(&sched_core_mutex);
+   if (!--sched_core_count)
+   __sched_core_disable();
+   mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
@@ -6475,6 +6539,28 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+#ifdef CONFIG_SCHED_CORE
+   const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+   struct rq *rq, *core_rq = NULL;
+   int i;
+
+   for_each_cpu(i, smt_mask) {
+   rq = cpu_rq(i);
+   if (rq->core && rq->core == rq)
+   core_rq = rq;
+   }
+
+   if (!core_rq)
+   core_rq = cpu_rq(cpu);
+
+   for_each_cpu(i, smt_mask) {
+   rq = cpu_rq(i);
+
+   WARN_ON_ONCE(rq->core && rq->core != core_rq);
+   rq->core = core_rq;
+   }
+#endif /* CONFIG_SCHED_CORE */
+
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -6696,6 +6782,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+   rq->core = NULL;
+   rq->core_enabled = 0;
+#endif
}
 
set_load_weight(_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a63c3115d212..66e586adee18 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1028,6 +1028,12 @@ struct rq {
/* Must be inspected within a rcu lock section */
struct cpuidle_state*idle_state;
 #endif
+
+#ifdef CONFIG_SCHED_CORE
+   /* per rq */
+   struct rq   *core;
+   unsigned intcore_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1055,11 +1061,36 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+   return static_branch_unlikely(&__sched_core_enabled) && 
rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+   if (sched_core_enabled(rq))
+   return &rq->core->__lock;
+
+   return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+   return false;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
return &rq->__lock;
 }
 
+#endif /* CONFIG_SCHED_CORE */
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 
-- 
2.17.1



[RFC PATCH 04/16] sched/fair: Add a few assertions

2020-06-30 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/fair.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5e9f11c8256f..e44a43b87975 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6203,6 +6203,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
}
 
 symmetric:
+   /*
+* per-cpu select_idle_mask usage
+*/
+   lockdep_assert_irqs_disabled();
+
if (available_idle_cpu(target) || sched_idle_cpu(target))
return target;
 
@@ -6644,8 +6649,6 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int 
wake_flags)
@@ -6656,6 +6659,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+   /*
+* required for stable ->cpus_allowed
+*/
+   lockdep_assert_held(&p->pi_lock);
+
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
 
-- 
2.17.1



Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-29 Thread Vineeth Remanan Pillai
Hi Aubrey,

On Mon, Jun 29, 2020 at 8:34 AM Li, Aubrey  wrote:
>
> >  - Load balancing/migration changes ignores group weights:
> >- 
> > https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain
>
> According to Aaron's response below:
> https://lwn.net/ml/linux-kernel/20200305085231.GA12108@ziqianlu-desktop.localdomain/
>
> The following logic seems to be helpful for Aaron's case.
>
> +   /*
> +* Ignore cookie match if there is a big imbalance between the src rq
> +* and dst rq.
> +*/
> +   if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
> +   return true;
>
> I didn't see any other comments on the patch at here:
> https://lwn.net/ml/linux-kernel/67e46f79-51c2-5b69-71c6-133ec10b6...@linux.intel.com/
>
> Do we have another way to address this issue?
>
We do not have a clear fix for this yet, and did not get much time to
work on this.

I feel that the above change would not be fixing the real issue.
The issue is about not considering the weight of the group when we
try to load balance, but the above change is checking only the
nr_running which might not work always. I feel that we should fix
the real issue in v6 and probably hold on to adding the workaround
fix in the interim.  I have added a TODO specifically for this bug
in v6.

What do you think?

Thanks,
Vineeth


Re: [RFC PATCH 12/13] sched: cgroup tagging interface for core scheduling

2020-06-26 Thread Vineeth Remanan Pillai
On Wed, Mar 4, 2020 at 12:00 PM vpillai  wrote:
>
>
> Marks all tasks in a cgroup as matching for core-scheduling.
>
> A task will need to be moved into the core scheduler queue when the cgroup
> it belongs to is tagged to run with core scheduling.  Similarly the task
> will need to be moved out of the core scheduler queue when the cgroup
> is untagged.
>
> Also after we forked a task, its core scheduler queue's presence will
> need to be updated according to its new cgroup's status.
>
This came up during a private discussion with Joel and thanks to
him for bringing this up! Details below..

> @@ -7910,7 +7986,12 @@ static void cpu_cgroup_fork(struct task_struct *task)
> rq = task_rq_lock(task, &rf);
>
> update_rq_clock(rq);
> +   if (sched_core_enqueued(task))
> +   sched_core_dequeue(rq, task);
A newly created task will not have been enqueued yet, so do we need this
here?

> sched_change_group(task, TASK_SET_GROUP);
> +   if (sched_core_enabled(rq) && task_on_rq_queued(task) &&
> +   task->core_cookie)
> +   sched_core_enqueue(rq, task);
>
Do we need this here? Soon after this, wake_up_new_task() is called,
which will ultimately call enqueue_task() and add the task to the
coresched rbtree. So we would be trying to enqueue twice. Also, this
code will not really enqueue, because task_on_rq_queued() would
return false at this point (activate_task is not yet called for this
new task).

I am not sure if I missed any other code path reaching here that
does not proceed with wake_up_new_task(). Please let me know if I
missed anything here.

Thanks,
Vineeth


Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-26 Thread Vineeth Remanan Pillai
On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes  wrote:
>
> On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
>  wrote:
> [...]
> > TODO lists:
> >
> >  - Interface discussions could not come to a conclusion in v5 and hence 
> > would
> >like to restart the discussion and reach a consensus on it.
> >- 
> > https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
>
> Thanks Vineeth, just want to add: I have a revised implementation of
> prctl(2) where you only pass a TID of a task you'd like to share a
> core with (credit to Peter for the idea [1]) so we can make use of
> ptrace_may_access() checks. I am currently finishing writing of
> kselftests for this and post it all once it is ready.
>
Thinking more about it, using TID/PID for prctl(2) and internally
using a task identifier to identify a coresched group may have
limitations. A coresched group can exist longer than the lifetime
of a task, and then there is a chance for that identifier to be
reused by a newer task which may or may not be a part of the same
coresched group.

A way to overcome this is to have a coresched group with a separate,
internally implemented identifier and a mapping from task to group.
And the cgroup framework provides exactly that.
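
Just to make that concrete (purely a sketch, the names are made up and
nothing like this exists in the current patches), the internal object would
be a refcounted group that outlives any particular TID:

#include <linux/refcount.h>
#include <linux/types.h>

/* Hypothetical internal representation of a coresched group. */
struct coresched_group {
        u64             id;     /* stable id, never reused while the group lives */
        refcount_t      refs;   /* group stays alive as long as any member task */
};

/* task_struct would then point at the group instead of carrying a raw TID. */

With cgroups, the cgroup itself plays the role of this object, so we get the
stable identity and lifetime management for free.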

I feel we could use prctl for isolating individual tasks/processes
and use grouping frameworks like cgroup for core scheduling groups.
Cpu cgroup might not be a good idea as it has its own purpose. Users
might not always want a group of trusted tasks in the same cpu cgroup.
Or all the processes in an existing cpu cgroup might not be mutually
trusted as well.

What do you think about having a separate cgroup for coresched?
Both coresched cgroup and prctl() could co-exist where prctl could
be used to isolate individual process or task and coresched cgroup
to group trusted processes.

> However a question: If using the prctl(2) on a CGroup tagged task, we
> discussed in previous threads [2] to override the CGroup cookie such
> that the task may not share a core with any of the tasks in its CGroup
> anymore and I think Peter and Phil are Ok with.  My question though is
> - would that not be confusing for anyone looking at the CGroup
> filesystem's "tag" and "tasks" files?
>
Having a dedicated cgroup for coresched could solve this problem
as well. "coresched.tasks" inside the cgroup hierarchy would list all
the tasks in the group, and prctl could override this and take a task out
of the group.

> To resolve this, I am proposing to add a new CGroup file
> 'tasks.coresched' to the CGroup, and this will only contain tasks that
> were assigned cookies due to their CGroup residency. As soon as one
> prctl(2)'s the task, it will stop showing up in the CGroup's
> "tasks.coresched" file (unless of course it was requesting to
> prctl-share a core with someone in its CGroup itself). Are folks Ok
> with this solution?
>
As I mentioned above, IMHO cpu cgroups should not be used to account
for core scheduling as well. Cpu cgroups serve a different purpose,
and overloading them with core scheduling would not be flexible or
scalable. But if there is a consensus to move forward with cpu cgroups,
adding this new file seems okay to me.

Thoughts/suggestions/concerns?

Thanks,
Vineeth


Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-25 Thread Vineeth Remanan Pillai
On Wed, Mar 4, 2020 at 12:00 PM vpillai  wrote:
>
>
> Fifth iteration of the Core-Scheduling feature.
>
It's probably time for an iteration and we are planning to post v6 based
on this branch:
 https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y

Just wanted to share the details about v6 here before posting the patch
series. If there is no objection to the following, we shall be posting
the v6 early next week.

The main changes in v6 are the following:
1. Address Peter's comments in v5
   - Code cleanup
   - Remove fixes related to hotplugging.
   - Split the patch out for force idling starvation
3. Fix for RCU deadlock
4. core wide priority comparison minor re-work.
5. IRQ Pause patch
6. Documentation
   - 
https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst

This version is much leaner compared to v5 due to the removal of hotplug
support. As a result, dynamic coresched enable/disable on cpus due to
smt on/off on the core does not function anymore. I tried to reproduce the
crashes during hotplug, but could not reproduce them reliably. The plan is to
try to reproduce the crashes with v6, and document each crash corner case
as we fix it. Previously, we randomly fixed the issues without clear
documentation and the fixes became complex over time.

TODO lists:

 - Interface discussions could not come to a conclusion in v5 and hence would
   like to restart the discussion and reach a consensus on it.
   - 
https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org

 - Core wide vruntime calculation needs rework:
   - 
https://lwn.net/ml/linux-kernel/20200506143506.gh5...@hirez.programming.kicks-ass.net

 - Load balancing/migration changes ignores group weights:
   - 
https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain


Please have a look and let me know comments/suggestions or anything missed.

Thanks,
Vineeth


Re: [RFC PATCH 11/13] sched: migration changes for core scheduling

2020-06-13 Thread Vineeth Remanan Pillai
On Fri, Jun 12, 2020 at 10:25 PM Joel Fernandes  wrote:
>
> Ok, so I take it that you will make it so in v6 then, unless of course
> someone else objects.
>
Yes, just wanted to hear from Aubrey, Tim and others as well to see
if we have not missed anything obvious. Will have this in v6 if
there are no objections.

Thanks for bringing this up!

~Vineeth


Re: [RFC PATCH 11/13] sched: migration changes for core scheduling

2020-06-12 Thread Vineeth Remanan Pillai
On Fri, Jun 12, 2020 at 9:21 AM Joel Fernandes  wrote:
>
> > +#ifdef CONFIG_SCHED_CORE
> > + if (available_idle_cpu(cpu) &&
> > + sched_core_cookie_match(cpu_rq(cpu), p))
> > + break;
> > +#else
>
> select_idle_cpu() is called only if no idle core could be found in the LLC by
> select_idle_core().
>
> So, would it be better here to just do the cookie equality check directly
> instead of calling the sched_core_cookie_match() helper?  More so, because
> select_idle_sibling() is a fastpath.
>
Agree, this makes sense to me.

> AFAIR, that's what v4 did:
>
> if (available_idle_cpu(cpu))
> #ifdef CONFIG_SCHED_CORE
> if (sched_core_enabled(cpu_rq(cpu)) &&
> (p->core_cookie == 
> cpu_rq(cpu)->core->core_cookie))
> break;
> #else
> break;
> #endif
>
This patch was initially not in v4; it is a merge of 4 patches
suggested post-v4. During the initial round, the code was like the above. But
since there appeared to be code duplication in the different migration paths,
it was consolidated into sched_core_cookie_match(), and that added this
extra logic to this specific code path. As you mentioned, I also feel
we do not need to check for core idleness in this path.

Thanks,
Vineeth


Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-21 Thread Vineeth Remanan Pillai
On Thu, May 21, 2020 at 9:47 AM Joel Fernandes  wrote:
>
> > It doens't allow tasks for form their own groups (by for example setting
> > the key to that of another task).
>
> So for this, I was thinking of making the prctl pass in an integer. And 0
> would mean untagged. Does that sound good to you?
>
On a similar note, Joel and I were discussing prctl and it came up
that there is no mechanism to set the cookie from outside a process using
prctl(2). So, another option we could consider is to use sched_setattr(2)
and expand sched_attr to accommodate a u64 cookie. The user could pass in a
cookie to explicitly set it and also use the same cookie for grouping.

Haven't prototyped it yet. Will need to dig deeper and see how it would
really look like.
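
Very roughly, though, the user-facing shape could be something like the
below (the field name and the idea of reusing sched_setattr() are
hypothetical, nothing of this sort exists today):

#include <linux/types.h>

/* Hypothetical extension of struct sched_attr. */
struct sched_attr_cookie {
        /* ... existing sched_attr fields ... */
        __u64 core_cookie;      /* 0 == untagged */
};

/*
 * Usage idea: two cooperating processes agree on a cookie value out of
 * band and both set it via sched_setattr(2), ending up in the same
 * coresched group.
 */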

Thanks,
Vineeth


Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-15 Thread Vineeth Remanan Pillai
On Fri, May 15, 2020 at 6:39 AM Peter Zijlstra  wrote:
>
> It's complicated ;-)
>
> So this sync is basically a relative reset of S to 0.
>
> So with 2 queues, when one goes idle, we drop them both to 0 and one
> then increases due to not being idle, and the idle one builds up lag to
> get re-elected. So far so simple, right?
>
> When there's 3, we can have the situation where 2 run and one is idle,
> we sync to 0 and let the idle one build up lag to get re-election. Now
> suppose another one also drops idle. At this point dropping all to 0
> again would destroy the built-up lag from the queue that was already
> idle, not good.
>
Thanks for the clarification :-).

I was suggesting an idea of corewide force_idle. We sync the core_vruntime
on first force_idle of a sibling in the core and start using core_vruntime
for priority comparison from then on. That way, we don't reset the lag on
every force_idle and the lag builds up from the first sibling that was
forced_idle. I think this would work with infeasible weights as well,
but I need to think more to see if it would break. A sample check to enter
this core wide force_idle state is:
(cpumask_weight(cpu_smt_mask(cpu)) == old_active && new_active < old_active)

And we exit the core wide force_idle state when the last sibling goes out
of force_idle and can start using min_vruntime for priority comparison
from then on.
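
A very rough sketch of what I have in mind (the core_vruntime_synced field
is made up and untested, just to show where the checks would sit):

/* Called from the core-wide pick loop with old_active/new_active counts. */
static void core_update_force_idle(struct rq *rq, int old_active, int new_active)
{
        int weight = cpumask_weight(cpu_smt_mask(cpu_of(rq)));

        /* Enter: the core was fully busy and a sibling just went force idle. */
        if (weight == old_active && new_active < old_active) {
                /* sync core_vruntime here and switch comparisons over to it */
                rq->core->core_vruntime_synced = true;
        }

        /* Exit (roughly): every sibling has an active pick again. */
        if (new_active == weight)
                rq->core->core_vruntime_synced = false;
}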

When there is a cookie match on all siblings, we don't do priority comparison
now. But I think we need to do priority comparison for cookie matches
also, so that we update 'max' in the loop. And for this comparison during
a no forced_idle scenario, I hope it should be fine to use the min_vruntime.
Updating 'max' in the loop when cookie matches is not really needed for SMT2,
but would be needed for SMTn.

This is just a wild idea on top of your patches. It might not be accurate
in all cases and I need to think more about the corner cases. I thought I
would think out loud here :-)

> So instead of syncing everything, we can:
>
>   less := !((s64)(s_a - s_b) <= 0)
>
>   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
> == v_a - (v_b - S_a + S_b)
>
> IOW, we can recast the (lag) comparison to a one-sided difference.
> So if then, instead of syncing the whole queue, sync the idle queue
> against the active queue with S_a + S_b at the point where we sync.
>
> (XXX consider the implication of living in a cyclic group: N / 2^n N)
>
> This gives us means of syncing single queues against the active queue,
> and for already idle queues to preseve their build-up lag.
>
> Of course, then we get the situation where there's 2 active and one
> going idle, who do we pick to sync against? Theory would have us sync
> against the combined S, but as we've already demonstated, there is no
> such thing in infeasible weight scenarios.
>
> One thing I've considered; and this is where that core_active rudiment
> came from, is having active queues sync up between themselves after
> every tick. This limits the observed divergence due to the work
> conservance.
>
> On top of that, we can improve upon things by moving away from our
> horrible (10) hack and moving to (9) and employing (13) here.
>
> Anyway, I got partway through that in the past days, but then my head
> hurt. I'll consider it some more :-)
This sounds much better and a more accurate approach than the one I
mentioned above. Please share the code when you have it in some form :-)

>
> > > +   new_active++;
> > I think we need to reset new_active on restarting the selection.
>
> But this loop is after selection has been done; we don't modify
> new_active during selection.
My bad, sorry about this false alarm!

> > > +
> > > +   vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> > > +   vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> > Should we be using core_vruntime conditionally? should it be min_vruntime 
> > for
> > default comparisons and core_vruntime during force_idle?
>
> At the very least it should be min_vruntime when cfs_rq_a == cfs_rq_b,
> ie. when we're on the same CPU.
>
yes, this makes sense.

The issue that I was thinking about is: when there is no force_idle and
all siblings run compatible tasks for a while, min_vruntime progresses
but core_vruntime lags behind. When a new task gets enqueued, it gets
the min_vruntime, but during comparison it might then be treated unfairly.

Consider a small example of two rqs rq1 and rq2.
rq1->cfs->min_vruntime = 1000
rq2->cfs->min_vruntime = 2000

During a force_idle, core_vruntime gets synced and

rq1->cfs->core_vruntime = 1000
rq2->cfs->core_vruntime = 2000

Now, suppose the core is out of force_idle and runs two compatible tasks
for a while, where the task on rq1 has more weight. min_vruntime progresses
on both, but slowly on rq1. Say the progress looks like:
rq1->cfs->min_vruntime = 1200, se1->vruntime = 1200
rq2->cfs->min_vruntime = 2500, se2->vruntime = 2500

If a new incompatible 

Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

2020-05-14 Thread Vineeth Remanan Pillai
Hi Peter,

On Thu, May 14, 2020 at 9:02 AM Peter Zijlstra  wrote:
>
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thikning about more siblings, I'll have to try again later.
>
Thanks for the quick patch! :-)

For SMT-n, would it work to sync the vruntime if at least one sibling is
forced idle? Since force_idle applies to all the rqs, I think it would
be correct to sync the vruntime if at least one cpu is forced idle.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> -   if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -   rq_i->core_forceidle = true;
> +   if (is_idle_task(rq_i->core_pick)) {
> +   if (rq_i->nr_running)
> +   rq_i->core_forceidle = true;
> +   } else {
> +   new_active++;
I think we need to reset new_active on restarting the selection.

> +   }
>
> if (i == cpu)
> continue;
> @@ -4476,6 +4473,16 @@ next_class:;
> WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
> }
>
> +   /* XXX SMT2 only */
> +   if (new_active == 1 && old_active > 1) {
As I mentioned above, would it be correct to check if at least one sibling is
forced_idle? Something like:
if (cpumask_weight(cpu_smt_mask(cpu)) == old_active && new_active < old_active)

> +   /*
> +* We just dropped into single-rq mode, increment the sequence
> +* count to trigger the vruntime sync.
> +*/
> +   rq->core->core_sync_seq++;
> +   }
> +   rq->core->core_active = new_active;
core_active seems to be unused.

> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +   struct sched_entity *se_a = &a->se, *se_b = &b->se;
> +   struct cfs_rq *cfs_rq_a, *cfs_rq_b;
> +   u64 vruntime_a, vruntime_b;
> +
> +   while (!is_same_tg(se_a, se_b)) {
> +   int se_a_depth = se_a->depth;
> +   int se_b_depth = se_b->depth;
> +
> +   if (se_a_depth <= se_b_depth)
> +   se_b = parent_entity(se_b);
> +   if (se_a_depth >= se_b_depth)
> +   se_a = parent_entity(se_a);
> +   }
> +
> +   cfs_rq_a = cfs_rq_of(se_a);
> +   cfs_rq_b = cfs_rq_of(se_b);
> +
> +   vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> +   vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
Should we be using core_vruntime conditionally? should it be min_vruntime for
default comparisons and core_vruntime during force_idle?

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-10-21 Thread Vineeth Remanan Pillai
On Mon, Oct 14, 2019 at 5:57 AM Aaron Lu  wrote:
>
> I now remembered why I used max().
>
> Assume rq1 and rq2's min_vruntime are both at 2000 and the core wide
> min_vruntime is also 2000. Also assume both runqueues are empty at the
> moment. Then task t1 is queued to rq1 and runs for a long time while rq2
> keeps empty. rq1's min_vruntime will be incremented all the time while
> the core wide min_vruntime stays at 2000 if min() is used. Then when
> another task gets queued to rq2, it will get really large unfair boost
> by using a much smaller min_vruntime as its base.
>
> To fix this, either max() is used as is done in my patch, or adjust
> rq2's min_vruntime to be the same as rq1's on each
> update_core_cfs_min_vruntime() when rq2 is found empty and then use
> min() to get the core wide min_vruntime. Looks not worth the trouble to
> use min().

Understood. I think this is a special case where one runqueue is empty
and hence the min_vruntime of the core should match the progressing vruntime
of the active runqueue. If we use max as the core wide min_vruntime, I think
we may hit starvation elsewhere. One quick example I can think of is during
force idle. When a sibling is forced idle, and a new task gets enqueued
in the force-idled runq, it would inherit the max vruntime and would starve
until the other tasks in the forced idle sibling catch up. While this might
be okay, we are deviating from the concept that all new tasks inherit the
min_vruntime of the cpu (core in our case). I have not tested deeply to see
if there are any assumptions which may fail if we use max.

The modified patch actually takes care of syncing the min_vruntime across
the siblings so that, core wide min_vruntime and per cpu min_vruntime
always stays in sync.

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-10-13 Thread Vineeth Remanan Pillai
On Fri, Oct 11, 2019 at 11:55 PM Aaron Lu  wrote:

>
> I don't think we need do the normalization afterwrads and it appears
> we are on the same page regarding core wide vruntime.
>
> The intent of my patch is to treat all the root level sched entities of
> the two siblings as if they are in a single cfs_rq of the core. With a
> core wide min_vruntime, the core scheduler can decide which sched entity
> to run next. And the individual sched entity's vruntime shouldn't be
> changed based on the change of core wide min_vruntime, or faireness can
> hurt(if we add or reduce vruntime of a sched entity, its credit will
> change).
>
Ok, I think I get it now. I see that your first patch actually wraps all the
places where min_vruntime is accessed. So yes, the tree vruntime update is
needed only one time. From then on, since we use the wrapper
cfs_rq_min_vruntime(), both the runqueues would self adjust based on the core
wide min_vruntime. Also, by virtue of min_vruntime staying minimal from there
on, the tree update logic will not be called more than once. So I think the
changes are safe.
I will do some profiling to make sure that it is actually called only once.

> The weird thing about my patch is, the min_vruntime is often increased,
> it doesn't point to the smallest value as in a traditional cfs_rq. This
> probabaly can be changed to follow the tradition, I don't quite remember
> why I did this, will need to check this some time later.

Yeah, I noticed this. In my patch, I had already accounted for this and changed
to min() instead of max(), which is more logical in that min_vruntime should be
the minimum of both run queues.

> All those sub cfs_rq's sched entities are not interesting. Because once
> we decided which sched entity in the root level cfs_rq should run next,
> we can then pick the final next task from there(using the usual way). In
> other words, to make scheduler choose the correct candidate for the core,
> we only need worry about sched entities on both CPU's root level cfs_rqs.
>
Understood. The only reason I did the normalization is to always keep both
runqueues under one min_vruntime. And as long as we use cfs_rq_min_vruntime()
from then on, we wouldn't be calling the balancing logic any more.

> Does this make sense?

Sure, thanks for the clarification.

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-10-11 Thread Vineeth Remanan Pillai
> Thanks for the clarification.
>
> Yes, this is the initialization issue I mentioned before when core
> scheduling is initially enabled. rq1's vruntime is bumped the first time
> update_core_cfs_rq_min_vruntime() is called and if there are already
> some tasks queued, new tasks queued on rq1 will be starved to some extent.
>
> Agree that this needs fix. But we shouldn't need do this afterwards.
>
> So do I understand correctly that patch1 is meant to solve the
> initialization issue?

I think we need this update logic even after initialization. I mean, the core
runqueue's min_vruntime can get updated every time it changes with respect
to the sibling's min_vruntime. So, whenever this update happens, we would
need to propagate the changes down the tree, right? Please let me know if I
am visualizing it wrong.

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-10-11 Thread Vineeth Remanan Pillai
> > The reason we need to do this is because, new tasks that gets created will
> > have a vruntime based on the new min_vruntime and old tasks will have it
> > based on the old min_vruntime
>
> I think this is expected behaviour.
>
I don't think this is the expected behaviour. If we hadn't changed the root
cfs->min_vruntime for the core rq, then it would have been the expected
behaviour. But now, we are updating the core rq's root cfs min_vruntime
without propagating the change down the tree. To explain, consider
this example based on your patch. Let cpu 1 and 2 be siblings, and let rq(cpu1)
be the core rq. Let rq1->cfs->min_vruntime=1000 and rq2->cfs->min_vruntime=2000.
So in update_core_cfs_rq_min_vruntime, you update rq1->cfs->min_vruntime
to 2000 because that is the max. So new tasks enqueued on rq1 start with a
vruntime of 2000 while the tasks already in that runqueue are still based on
the old min_vruntime (1000). So the new tasks get enqueued somewhere to the
right of the tree and have to wait until the already existing tasks catch up
to a vruntime of 2000. This is what I meant by starvation. This always happens
when we update the core rq's cfs->min_vruntime. Hope this clarifies.

> > and it can cause starvation based on how
> > you set the min_vruntime.
>
> Care to elaborate the starvation problem?

Explained above.

> Again, what's the point of normalizing sched entities' vruntime in
> sub-cfs_rqs? Their vruntime comparisons only happen inside their own
> cfs_rq, we don't do cross CPU vruntime comparison for them.

As I mentioned above, this is to avoid the starvation case. Even though we are
not doing cross cfs_rq comparison, the whole tree's vruntime is based on the
root cfs->min_vruntime and we will have an imbalance if we change the root
cfs->min_vruntime without updating down the tree.

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-10-10 Thread Vineeth Remanan Pillai
> I didn't see why we need do this.
>
> We only need to have the root level sched entities' vruntime become core
> wide since we will compare vruntime for them across hyperthreads. For
> sched entities on sub cfs_rqs, we never(at least, not now) compare their
> vruntime outside their cfs_rqs.
>
The reason we need to do this is that newly created tasks will
have a vruntime based on the new min_vruntime while old tasks will have
theirs based on the old min_vruntime, and this can cause starvation
depending on how you set the min_vruntime. With this new patch, we normalize
the whole tree so that new tasks and old tasks compare against the same
min_vruntime.

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-10-02 Thread Vineeth Remanan Pillai
On Mon, Sep 30, 2019 at 7:53 AM Vineeth Remanan Pillai
 wrote:
>
> >
Sorry, I misunderstood the fix and I did not initially see the core wide
min_vruntime that you tried to maintain in the rq->core. This approach
seems reasonable. I think we can fix the potential starvation that you
mentioned in the comment by adjusting for the difference in all the child
cfs_rqs when we set the min_vruntime in rq->core. Since we take the lock for
both the queues, it should be doable and I am trying to see how we can best
do that.
>
Attaching herewith the 2 patches I was working on in preparation for v4.

Patch 1 is an improvement of Aaron's patch 2, where I am propagating the
vruntime changes to the whole tree.
Patch 2 is an improvement of Aaron's patch 3, where we do resched_curr
only when the sibling is forced idle.

Micro benchmarks seem good. Will be doing a larger set of tests and hopefully
posting v4 by the end of the week. Please let me know what you think of these
patches (patch 1 is on top of Aaron's patch 2, patch 2 replaces Aaron's patch 3).

Thanks,
Vineeth

[PATCH 1/2] sched/fair: propagate the min_vruntime change to the whole rq tree

When we adjust the min_vruntime of rq->core, we need to propagate
that down the tree so as to not cause starvation of existing tasks
based on their previous vruntime.
---
 kernel/sched/fair.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59cb01a1563b..e8dd78a8c54d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -476,6 +476,23 @@ static inline u64 cfs_rq_min_vruntime(struct
cfs_rq *cfs_rq)
return cfs_rq->min_vruntime;
 }

+static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
+{
+   struct sched_entity *se, *next;
+
+   if (!cfs_rq)
+   return;
+
+   cfs_rq->min_vruntime -= delta;
+   rbtree_postorder_for_each_entry_safe(se, next,
+   &cfs_rq->tasks_timeline.rb_root, run_node) {
+   if (se->vruntime > delta)
+   se->vruntime -= delta;
+   if (se->my_q)
+   coresched_adjust_vruntime(se->my_q, delta);
+   }
+}
+
 static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
struct cfs_rq *cfs_rq_core;
@@ -487,8 +504,11 @@ static void
update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
return;

cfs_rq_core = core_cfs_rq(cfs_rq);
-   cfs_rq_core->min_vruntime = max(cfs_rq_core->min_vruntime,
-   cfs_rq->min_vruntime);
+   if (cfs_rq_core != cfs_rq &&
+   cfs_rq->min_vruntime < cfs_rq_core->min_vruntime) {
+   u64 delta = cfs_rq_core->min_vruntime - cfs_rq->min_vruntime;
+   coresched_adjust_vruntime(cfs_rq_core, delta);
+   }
 }

 bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
--
2.17.1

[PATCH 2/2] sched/fair : Wake up forced idle siblings if needed

If a cpu has only one task and if it has used up its timeslice,
then we should try to wake up the sibling to give the forced idle
thread a chance.
We do that by triggering schedule which will IPI the sibling if
the task in the sibling wins the priority check.
---
 kernel/sched/fair.c | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8dd78a8c54d..ba4d929abae6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4165,6 +4165,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct
sched_entity *se, int flags)
update_min_vruntime(cfs_rq);
 }

+static inline bool
+__entity_slice_used(struct sched_entity *se)
+{
+   return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
+   sched_slice(cfs_rq_of(se), se);
+}
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -10052,6 +10059,39 @@ static void rq_offline_fair(struct rq *rq)

 #endif /* CONFIG_SMP */

+#ifdef CONFIG_SCHED_CORE
+/*
+ * If runqueue has only one task which used up its slice and
+ * if the sibling is forced idle, then trigger schedule
+ * to give forced idle task a chance.
+ */
+static void resched_forceidle(struct rq *rq, struct sched_entity *se)
+{
+   int cpu = cpu_of(rq), sibling_cpu;
+   if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
+   return;
+
+   for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
+   struct rq *sibling_rq;
+   if (sibling_cpu == cpu)
+   continue;
+   if (cpu_is_offline(sibling_cpu))
+   continue;
+
+   sibling_rq = cpu_rq(sibling_cpu);
+   if (sibling_rq->core_forceidle) {
+   resched_curr(rq);
+   break;
+

Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-09-30 Thread Vineeth Remanan Pillai
On Wed, Sep 18, 2019 at 6:16 PM Aubrey Li  wrote:
>
> On Thu, Sep 19, 2019 at 4:41 AM Tim Chen  wrote:
> >
> > On 9/17/19 6:33 PM, Aubrey Li wrote:
> > > On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu  
> > > wrote:
> >
> > >>
> > >> And I have pushed Tim's branch to:
> > >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> > >>
> > >> Mine:
> > >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime
> >
> >
> > Aubrey,
> >
> > Thanks for testing with your set up.
> >
> > I think the test that's of interest is to see my load balancing added on top
> > of Aaron's fairness patch, instead of using my previous version of
> > forced idle approach in coresched-v3-v5.1.5-test-tim branch.
> >
>
> I'm trying to figure out a way to solve fairness only(not include task
> placement),
> So @Vineeth - if everyone is okay with Aaron's fairness patch, maybe
> we should have a v4?
>
Yes, I think we can move to v4 with Aaron's fairness fix and potentially
Tim's load balancing fixes. I am working on some improvements to Aaron's
fixes and shall post the changes after some testing. Basically, what I am
trying to do is to propagate the min_vruntime change down to all the cfs_rqs
and individual ses when we update cfs_rq(rq->core)->min_vruntime. So,
we can make sure that the rqs stay in sync and starvation does not happen.

If everything goes well, we shall also post the v4 towards the end of this
week. We would be testing Tim's load balancing patches in an
over-committed VM scenario to observe the effect of the fix.

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-09-30 Thread Vineeth Remanan Pillai
On Thu, Sep 12, 2019 at 8:35 AM Aaron Lu  wrote:
> >
> > I think comparing parent's runtime also will have issues once
> > the task group has a lot more threads with different running
> > patterns. One example is a task group with lot of active threads
> > and a thread with fairly less activity. So when this less active
> > thread is competing with a thread in another group, there is a
> > chance that it loses continuously for a while until the other
> > group catches up on its vruntime.
>
> I actually think this is expected behaviour.
>
> Without core scheduling, when deciding which task to run, we will first
> decide which "se" to run from the CPU's root level cfs runqueue and then
> go downwards. Let's call the chosen se on the root level cfs runqueue
> the winner se. Then with core scheduling, we will also need compare the
> two winner "se"s of each hyperthread and choose the core wide winner "se".
>
Sorry, I misunderstood the fix and I did not initially see the core wide
min_vruntime that you tried to maintain in the rq->core. This approach
seems reasonable. I think we can fix the potential starvation that you
mentioned in the comment by adjusting for the difference in all the child
cfs_rqs when we set the min_vruntime in rq->core. Since we take the lock for
both the queues, it should be doable and I am trying to see how we can best
do that.

> >
> > As discussed during LPC, probably start thinking along the lines
> > of global vruntime or core wide vruntime to fix the vruntime
> > comparison issue?
>
> core wide vruntime makes sense when there are multiple tasks of
> different cgroups queued on the same core. e.g. when there are two
> tasks of cgroupA and one task of cgroupB are queued on the same core,
> assume cgroupA's one task is on one hyperthread and its other task is on
> the other hyperthread with cgroupB's task. With my current
> implementation or Tim's, cgroupA will get more time than cgroupB. If we
> maintain core wide vruntime for cgroupA and cgroupB, we should be able
> to maintain fairness between cgroups on this core. Tim propose to solve
> this problem by doing some kind of load balancing if I'm not mistaken, I
> haven't taken a look at this yet.
I think your fix comes close to maintaining a core wide vruntime, as
you now have a single min_vruntime to compare against across the
siblings in the core. To make the fix complete, we might need to
adjust the whole tree's min_vruntime, and I think that is doable.

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-09-11 Thread Vineeth Remanan Pillai
> > So both of you are working on top of my 2 patches that deal with the
> > fairness issue, but I had the feeling Tim's alternative patches[1] are
> > simpler than mine and achieves the same result(after the force idle tag
>
> I think Julien's result show that my patches did not do as well as
> your patches for fairness. Aubrey did some other testing with the same
> conclusion.  So I think keeping the forced idle time balanced is not
> enough for maintaining fairness.
>
There are two main issues: the vruntime comparison issue and the
forced idle issue. The coresched_idle thread patch addresses the
forced idle issue, as the scheduler no longer overloads the idle
thread to force idle. If I understand correctly, Tim's patch also
tries to fix the forced idle issue. On top of fixing the forced idle
issue, we also need to fix the vruntime comparison issue, and I think
that's where Aaron's patch helps.

I think comparing the parent's runtime will also have issues once
the task group has a lot more threads with different running
patterns. One example is a task group with a lot of active threads
and one thread with very little activity. So when this less active
thread is competing with a thread in another group, there is a
chance that it loses continuously for a while until the other
group catches up on its vruntime.

As discussed during LPC, should we start thinking along the lines
of a global vruntime or a core wide vruntime to fix the vruntime
comparison issue?

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-12 Thread Vineeth Remanan Pillai
> I have two other small changes that I think are worth sending out.
>
> The first simplify logic in pick_task() and the 2nd avoid task pick all
> over again when max is preempted. I also refined the previous hack patch to
> make schedule always happen only for root cfs rq. Please see below for
> details, thanks.
>
I see a potential issue here. With the simplification in pick_task,
you might introduce a livelock where the match logic spins forever.
You avoid that with patch 2, by removing the loop when a pick
preempts max. The potential problem is that you miss a case where the
newly picked task might have a match on the sibling on which max was
selected before. By selecting idle, you ignore that potential match.
As of now, the potential match check does not really work because
sched_core_find will always return the same task and we do not check
the whole core_tree for a next match. It is on my TODO list to have
sched_core_find return the best next match if the match was
preempted, but it is a bit complex and needs more thought. A rough
sketch of the direction is below.
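Roughly, what I have in mind is something like this, reusing
sched_core_next() from the forced-newidle balancer patch (an untested
sketch; the function name is made up):

static struct task_struct *
sched_core_find_next(struct rq *rq, unsigned long cookie,
		     struct task_struct *skip)
{
	/* Start from the usual best match for this cookie... */
	struct task_struct *p = sched_core_find(rq, cookie);

	/* ...and walk the core_tree past the preempted pick, if any. */
	while (p && p == skip)
		p = sched_core_next(p, cookie);

	return p;
}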

Third patch looks good to me.

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-06 Thread Vineeth Remanan Pillai
> I think tenant will have per core weight, similar to sched entity's per
> cpu weight. The tenant's per core weight could derive from its
> corresponding taskgroup's per cpu sched entities' weight(sum them up
> perhaps). Tenant with higher weight will have its core wide vruntime
> advance slower than tenant with lower weight. Does this address the
> issue here?
>
I think that makes sense and should work. We should also consider how
to classify untagged processes so that they are not starved. A rough
sketch of the weighting idea is below.
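Roughly, the per core weighting could mirror what CFS already does per
entity. A made-up sketch of the accounting (the struct and field names
below do not exist anywhere, they are just for illustration):

/*
 * Illustration only: a tenant's core wide vruntime advances slower
 * when its per core weight is higher, mirroring how CFS scales an
 * entity's vruntime by its load weight (ignoring the overflow
 * handling that __calc_delta() does).
 */
static void tenant_account_core_vruntime(struct core_tenant *tenant,
					 u64 delta_exec)
{
	/* core_weight: e.g. the sum of the tenant's entities' weights */
	tenant->core_vruntime += div64_u64(delta_exec * NICE_0_LOAD,
					   tenant->core_weight);
}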

>
> Care to elaborate the idea of coresched idle thread concept?
> How it solved the hyperthread going idle problem and what the accounting
> issues and wakeup issues are, etc.
>
So we have one coresched_idle thread per cpu, and when a sibling
cannot find a match, we schedule this new thread instead of forcing
idle. Ideally this thread behaves like idle, but the scheduler no
longer confuses an idle cpu with a forced idle state. This also
invokes schedule() as vruntime progresses (an alternative to your 3rd
patch), and vruntime accounting gets more consistent. There are
special cases that need to be handled so that coresched_idle never
gets scheduled in the normal scheduling path (without coresched),
etc. Hope this clarifies.

But as Peter suggested, if we can differentiate idle from forced idle in
the idle thread and account for the vruntime progress, that would be a
better approach.

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-06 Thread Vineeth Remanan Pillai
>
> What accounting in particular is upset? Is it things like
> select_idle_sibling() that thinks the thread is idle and tries to place
> tasks there?
>
The major issue that we saw was that certain workloads cause the idle
cpu to never wake up and schedule again, even when there are runnable
threads on it. If I remember correctly, this happened when the
sibling had only one cpu-intensive task and did not enter
pick_next_task for a long time. There were other situations as well
which caused this prolonged idle state on the cpu. One was when
pick_next_task was called on the sibling but the sibling always won
there, because vruntime was not progressing on the idle cpu.

Having a coresched idle thread makes sure that the idle thread is not
overloaded. Also, vruntime moves forward and task vruntime comparison
across cpus works when we normalize, as sketched below.
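For reference, the normalization is what prio_less() in this series
does for the fair class, roughly (the helper name here is made up):

static bool vruntime_less(struct task_struct *a, struct task_struct *b)
{
	u64 vruntime = b->se.vruntime;

	/* Raw vruntimes from different runqueues are not comparable. */
	if (task_cpu(a) != task_cpu(b)) {
		vruntime -= task_cfs_rq(b)->min_vruntime;
		vruntime += task_cfs_rq(a)->min_vruntime;
	}

	return (s64)(a->se.vruntime - vruntime) < 0;
}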

> It should be possible to change idle_cpu() to not report a forced-idle
> CPU as idle.
I agree. If we can identify all the places where the idle thread is
considered special, and also account for the vruntime progress during
forced idle, this should be a better approach than a per-cpu
coresched idle thread.

>
> (also; it should be possible to optimize select_idle_sibling() for the
> core-sched case specifically)
>
We haven't seen this because most of our micro test cases did not
have more threads than cpus. Thanks for pointing this out; we shall
cook up some tests to observe this behavior.

Thanks,
Vineeth


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-06 Thread Vineeth Remanan Pillai
> >
> > I also think a way to make fairness per cookie per core, is this what you
> > want to propose?
>
> Yes, that's what I meant.

I think that would hurt some kinds of workloads badly, especially if
one tenant has far more tasks than the other. The tenant with more
tasks on the same core might have more immediate requirements for
some of its threads than the other, and we would fail to take that
into account. With some hierarchical management we can alleviate
this, but as Aaron said, it would be a bit messy.

Peter's rebalance logic actually takes care of most of the runqueue
imbalance caused by cookie tagging. What we have found from our
testing is that the fairness issue is caused mostly by a hyperthread
going idle and not waking up. Aaron's 3rd patch works around that. As
Julien mentioned, we are working on a per-cpu coresched idle thread
concept. The problem that we found was that the idle thread causes
accounting issues and wakeup issues, as it was not designed to be
used in this context. So if we can have a low priority thread which
looks like any other task to the scheduler, things become easier for
the scheduler and we achieve security as well. Please share your
thoughts on this idea.

The results are encouraging, but we have not yet managed to keep the
coresched idle thread from spinning at 100%. We will post the patch
soon, once it is a bit more stable for running the tests that we all
have done so far.

Thanks,
Vineeth


[RFC PATCH v3 07/16] sched: Allow put_prev_task() to drop rq->lock

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Currently the pick_next_task() loop is convoluted and ugly because of
how it can drop the rq->lock and needs to restart the picking.

For the RT/Deadline classes, it is put_prev_task() where we do
balancing, and we could do this before the picking loop. Make this
possible.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c  |  2 +-
 kernel/sched/deadline.c  | 14 +-
 kernel/sched/fair.c  |  2 +-
 kernel/sched/idle.c  |  2 +-
 kernel/sched/rt.c| 14 +-
 kernel/sched/sched.h |  4 ++--
 kernel/sched/stop_task.c |  2 +-
 7 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 32ea79fb8d29..9dfa0c53deb3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5595,7 +5595,7 @@ static void calc_load_migrate(struct rq *rq)
	atomic_long_add(delta, &calc_load_tasks);
 }
 
-static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fake(struct rq *rq, struct task_struct *prev, struct 
rq_flags *rf)
 {
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c02b3229e2c3..45425f971eec 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1772,13 +1772,25 @@ pick_next_task_dl(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
return p;
 }
 
-static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct 
rq_flags *rf)
 {
update_curr_dl(rq);
 
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
+
+   if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
+   /*
+* This is OK, because current is on_cpu, which avoids it being
+* picked for load-balance and preemption/IRQs are still
+* disabled avoiding further scheduler activity on it and we've
+* not yet started the picking loop.
+*/
+   rq_unpin_lock(rq, rf);
+   pull_dl_task(rq);
+   rq_repin_lock(rq, rf);
+   }
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49707b4797de..8e3eb243fd9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7110,7 +7110,7 @@ done: __maybe_unused;
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct 
rq_flags *rf)
 {
	struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index dd64be34881d..1b65a4c3683e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,7 +373,7 @@ static void check_preempt_curr_idle(struct rq *rq, struct 
task_struct *p, int fl
resched_curr(rq);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct 
rq_flags *rf)
 {
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index adec98a94f2b..51ee87c5a28a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1593,7 +1593,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
return p;
 }
 
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct 
rq_flags *rf)
 {
update_curr_rt(rq);
 
@@ -1605,6 +1605,18 @@ static void put_prev_task_rt(struct rq *rq, struct 
task_struct *p)
 */
	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
enqueue_pushable_task(rq, p);
+
+   if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+   /*
+* This is OK, because current is on_cpu, which avoids it being
+* picked for load-balance and preemption/IRQs are still
+* disabled avoiding further scheduler activity on it and we've
+* not yet started the picking loop.
+*/
+   rq_unpin_lock(rq, rf);
+   pull_rt_task(rq);
+   rq_repin_lock(rq, rf);
+   }
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bfcbcbb25646..4cbe2bef92e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1675,7 +1675,7 @@ struct sched_class {
struct task_struct * (*pick_next_task)(struct rq *rq,
   struct task_struct *prev,
   struct rq_flags *rf);
-   void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+   void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct 
rq_flags *rf);
void (*set_next_task)(struct rq *rq, 

[RFC PATCH v3 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4778c48a7fda..416ea613eda8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6287,7 +6287,7 @@ struct task_struct *curr_task(int cpu)
 
 #ifdef CONFIG_IA64
 /**
- * set_curr_task - set the current task for a given CPU.
+ * ia64_set_curr_task - set the current task for a given CPU.
  * @cpu: the processor in question.
  * @p: the task pointer to set.
  *
-- 
2.17.1



[RFC PATCH v3 01/16] stop_machine: Fix stop_cpus_in_progress ordering

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Make sure the entire for loop has stop_cpus_in_progress set.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/stop_machine.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 067cb83f37ea..583119e0c51c 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -375,6 +375,7 @@ static bool queue_stop_cpus_work(const struct cpumask 
*cpumask,
 */
preempt_disable();
stop_cpus_in_progress = true;
+   barrier();
for_each_cpu(cpu, cpumask) {
	work = &per_cpu(cpu_stopper.stop_work, cpu);
work->fn = fn;
@@ -383,6 +384,7 @@ static bool queue_stop_cpus_work(const struct cpumask 
*cpumask,
if (cpu_stop_queue_work(cpu, work))
queued = true;
}
+   barrier();
stop_cpus_in_progress = false;
preempt_enable();
 
-- 
2.17.1



[RFC PATCH v3 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Because pick_next_task() implies set_curr_task() and some of the
details haven't matter too much, some of what _should_ be in
set_curr_task() ended up in pick_next_task, correct this.

This prepares the way for a pick_next_task() variant that does not
affect the current state; allowing remote picking.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/deadline.c | 23 ---
 kernel/sched/rt.c   | 27 ++-
 2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e683d4c19ab8..0783dfa65150 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1694,12 +1694,21 @@ static void start_hrtick_dl(struct rq *rq, struct 
task_struct *p)
 }
 #endif
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p)
 {
p->se.exec_start = rq_clock_task(rq);
 
/* You can't push away the running task */
dequeue_pushable_dl_task(rq, p);
+
+   if (hrtick_enabled(rq))
+   start_hrtick_dl(rq, p);
+
+   if (rq->curr->sched_class != &dl_sched_class)
+   update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+   if (rq->curr != p)
+   deadline_queue_push_tasks(rq);
 }
 
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
@@ -1758,15 +1767,7 @@ pick_next_task_dl(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
 
p = dl_task_of(dl_se);
 
-   set_next_task(rq, p);
-
-   if (hrtick_enabled(rq))
-   start_hrtick_dl(rq, p);
-
-   deadline_queue_push_tasks(rq);
-
-   if (rq->curr->sched_class != &dl_sched_class)
-   update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+   set_next_task_dl(rq, p);
 
return p;
 }
@@ -1813,7 +1814,7 @@ static void task_fork_dl(struct task_struct *p)
 
 static void set_curr_task_dl(struct rq *rq)
 {
-   set_next_task(rq, rq->curr);
+   set_next_task_dl(rq, rq->curr);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3d9db8c75d53..353ad960691b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1498,12 +1498,23 @@ static void check_preempt_curr_rt(struct rq *rq, struct 
task_struct *p, int flag
 #endif
 }
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
 {
p->se.exec_start = rq_clock_task(rq);
 
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
+
+   /*
+* If prev task was rt, put_prev_task() has already updated the
+* utilization. We only care of the case where we start to schedule a
+* rt task
+*/
+   if (rq->curr->sched_class != &rt_sched_class)
+   update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+   if (rq->curr != p)
+   rt_queue_push_tasks(rq);
 }
 
 static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
@@ -1577,17 +1588,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
 
p = _pick_next_task_rt(rq);
 
-   set_next_task(rq, p);
-
-   rt_queue_push_tasks(rq);
-
-   /*
-* If prev task was rt, put_prev_task() has already updated the
-* utilization. We only care of the case where we start to schedule a
-* rt task
-*/
-   if (rq->curr->sched_class != &rt_sched_class)
-   update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+   set_next_task_rt(rq, p);
 
return p;
 }
@@ -2356,7 +2357,7 @@ static void task_tick_rt(struct rq *rq, struct 
task_struct *p, int queued)
 
 static void set_curr_task_rt(struct rq *rq)
 {
-   set_next_task(rq, rq->curr);
+   set_next_task_rt(rq, rq->curr);
 }
 
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
-- 
2.17.1



[RFC PATCH v3 06/16] sched/fair: Export newidle_balance()

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

For pick_next_task_fair() it is the newidle balance that requires
dropping the rq->lock; provided we do put_prev_task() early, we can
also detect the condition for doing newidle early.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/fair.c  | 18 --
 kernel/sched/sched.h |  4 
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56fc2a1aa261..49707b4797de 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3615,8 +3615,6 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq 
*cfs_rq)
return cfs_rq->avg.load_avg;
 }
 
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
-
 static inline unsigned long task_util(struct task_struct *p)
 {
return READ_ONCE(p->se.avg.util_avg);
@@ -7087,11 +7085,10 @@ done: __maybe_unused;
return p;
 
 idle:
-   update_misfit_status(NULL, rq);
-   new_tasks = idle_balance(rq, rf);
+   new_tasks = newidle_balance(rq, rf);
 
/*
-* Because idle_balance() releases (and re-acquires) rq->lock, it is
+* Because newidle_balance() releases (and re-acquires) rq->lock, it is
 * possible for any higher priority task to appear. In that case we
 * must re-start the pick_next_entity() loop.
 */
@@ -9286,10 +9283,10 @@ static int load_balance(int this_cpu, struct rq 
*this_rq,
ld_moved = 0;
 
/*
-* idle_balance() disregards balance intervals, so we could repeatedly
-* reach this code, which would lead to balance_interval skyrocketting
-* in a short amount of time. Skip the balance_interval increase logic
-* to avoid that.
+* newidle_balance() disregards balance intervals, so we could
+* repeatedly reach this code, which would lead to balance_interval
+* skyrocketting in a short amount of time. Skip the balance_interval
+* increase logic to avoid that.
 */
if (env.idle == CPU_NEWLY_IDLE)
goto out;
@@ -9996,7 +9993,7 @@ static inline void nohz_newidle_balance(struct rq 
*this_rq) { }
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
+int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
unsigned long next_balance = jiffies + HZ;
int this_cpu = this_rq->cpu;
@@ -10004,6 +10001,7 @@ static int idle_balance(struct rq *this_rq, struct 
rq_flags *rf)
int pulled_task = 0;
u64 curr_cost = 0;
 
+   update_misfit_status(NULL, this_rq);
/*
 * We must set idle_stamp _before_ calling idle_balance(), such that we
 * measure the duration of idle_balance() as idle time.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fb01c77c16ff..bfcbcbb25646 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1414,10 +1414,14 @@ static inline void unregister_sched_domain_sysctl(void)
 }
 #endif
 
+extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
+
 #else
 
 static inline void sched_ttwu_pending(void) { }
 
+static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { 
return 0; }
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
-- 
2.17.1



[RFC PATCH v3 11/16] sched: Basic tracking of matching tasks

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled; core scheduling will only allow matching
task to be on the core; where idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled)
these tasks are indexed in a second RB-tree, first on cookie value
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---

Changes in v3
-
- Refactored priority comparison code
- Fixed a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-
- Improves the priority comparison logic between processes in
  different cpus.
  - Peter Zijlstra
  - Aaron Lu

---
 include/linux/sched.h |   8 ++-
 kernel/sched/core.c   | 146 ++
 kernel/sched/fair.c   |  46 -
 kernel/sched/sched.h  |  55 
 4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1538..a4b39a28236f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -636,10 +636,16 @@ struct task_struct {
const struct sched_class*sched_class;
struct sched_entity se;
struct sched_rt_entity  rt;
+   struct sched_dl_entity  dl;
+
+#ifdef CONFIG_SCHED_CORE
+   struct rb_node  core_node;
+   unsigned long   core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
struct task_group   *sched_task_group;
 #endif
-   struct sched_dl_entity  dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b1ce33f9b106..112d70f2b1e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -64,6 +64,141 @@ int sysctl_sched_rt_runtime = 950000;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+   if (p->sched_class == &stop_sched_class) /* trumps deadline */
+   return -2;
+
+   if (rt_prio(p->prio)) /* includes deadline */
+   return p->prio; /* [-1, 99] */
+
+   if (p->sched_class == &idle_sched_class)
+   return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+   return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+   int pa = __task_prio(a), pb = __task_prio(b);
+
+   if (-pa < -pb)
+   return true;
+
+   if (-pb < -pa)
+   return false;
+
+   if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+   return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+   u64 vruntime = b->se.vruntime;
+
+   /*
+* Normalize the vruntime if tasks are in different cpus.
+*/
+   if (task_cpu(a) != task_cpu(b)) {
+   vruntime -= task_cfs_rq(b)->min_vruntime;
+   vruntime += task_cfs_rq(a)->min_vruntime;
+   }
+
+   return !((s64)(a->se.vruntime - vruntime) <= 0);
+   }
+
+   return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct 
*b)
+{
+   if (a->core_cookie < b->core_cookie)
+   return true;
+
+   if (a->core_cookie > b->core_cookie)
+   return false;
+
+   /* flip prio, so high prio is leftmost */
+   if (prio_less(b, a))
+   return true;
+
+   return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+   struct rb_node *parent, **node;
+   struct task_struct *node_task;
+
+   rq->core->core_task_seq++;
+
+   if (!p->core_cookie)
+   return;
+
+   node = &rq->core_tree.rb_node;
+   parent = *node;
+
+   while (*node) {
+   node_task = container_of(*node, struct task_struct, core_node);
+   parent = *node;
+
+   if (__sched_core_less(p, node_task))
+   node = &parent->rb_left;
+   else
+   node = &parent->rb_right;
+   }
+
+   rb_link_node(&p->core_node, parent, node);
+   rb_insert_color(>c

[RFC PATCH v3 03/16] sched: Wrap rq::lock access

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

In preparation of playing games with rq->lock, abstract the thing
using an accessor.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---

Changes in v2
-
- Fixes a deadlock due in double_rq_lock and double_lock_lock
  - Vineeth Pillai
  - Julien Desfossez
- Fixes 32bit build.
  - Aubrey Li

---
 kernel/sched/core.c |  46 -
 kernel/sched/cpuacct.c  |  12 ++---
 kernel/sched/deadline.c |  18 +++
 kernel/sched/debug.c|   4 +-
 kernel/sched/fair.c |  40 +++
 kernel/sched/idle.c |   4 +-
 kernel/sched/pelt.h |   2 +-
 kernel/sched/rt.c   |   8 +--
 kernel/sched/sched.h| 106 
 kernel/sched/topology.c |   4 +-
 10 files changed, 123 insertions(+), 121 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 416ea613eda8..6f4861ae85dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,12 +72,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
 
for (;;) {
rq = task_rq(p);
-   raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
rq_pin_lock(rq, rf);
return rq;
}
-   raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
 
while (unlikely(task_on_rq_migrating(p)))
cpu_relax();
@@ -96,7 +96,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
for (;;) {
	raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
rq = task_rq(p);
-   raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
/*
 *  move_queued_task()  task_rq_lock()
 *
@@ -118,7 +118,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
rq_pin_lock(rq, rf);
return rq;
}
-   raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
while (unlikely(task_on_rq_migrating(p)))
@@ -188,7 +188,7 @@ void update_rq_clock(struct rq *rq)
 {
s64 delta;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
if (rq->clock_update_flags & RQCF_ACT_SKIP)
return;
@@ -497,7 +497,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
if (test_tsk_need_resched(curr))
return;
@@ -521,10 +521,10 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
 
-   raw_spin_lock_irqsave(&rq->lock, flags);
+   raw_spin_lock_irqsave(rq_lockp(rq), flags);
if (cpu_online(cpu) || cpu == smp_processor_id())
resched_curr(rq);
-   raw_spin_unlock_irqrestore(&rq->lock, flags);
+   raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -956,7 +956,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, 
int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
   struct task_struct *p, int new_cpu)
 {
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
dequeue_task(rq, p, DEQUEUE_NOCLOCK);
@@ -1070,7 +1070,7 @@ void do_set_cpus_allowed(struct task_struct *p, const 
struct cpumask *new_mask)
 * Because __kthread_bind() calls this on blocked tasks without
 * holding rq->lock.
 */
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
}
if (running)
@@ -1203,7 +1203,7 @@ void set_task_cpu(struct task_struct *p, unsigned int 
new_cpu)
 * task_rq_lock().
 */
	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
- lockdep_is_held(&task_rq(p)->lock)));
+ lockdep_is_held(rq_lockp(task_rq(p)))));
 #endif
/*
 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -1732,7 +1732,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, 
int wake_flags,
 {
int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
 #ifdef CONFIG_SMP
if (p->sched_contributes_to_lo

[RFC PATCH v3 16/16] sched: Debug bits...

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 44 ++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b8223c9a723..90655c9ad937 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -92,6 +92,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b)
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -246,6 +250,8 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
@@ -254,6 +260,8 @@ static void __sched_core_disable(void)
 
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -3707,6 +3715,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
put_prev_task(rq, prev);
set_next_task(rq, next);
}
+
+   trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie);
+
return next;
}
 
@@ -3786,6 +3802,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 */
if (i == cpu && !need_sync && !p->core_cookie) {
next = p;
+   trace_printk("unconstrained pick: %s/%d %lx\n",
+next->comm, next->pid, 
next->core_cookie);
+
goto done;
}
 
@@ -3794,6 +3813,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
rq_i->core_pick = p;
 
+   trace_printk("cpu(%d): selected: %s/%d %lx\n",
+i, p->comm, p->pid, p->core_cookie);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -3810,6 +3832,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %lx\n", max->comm, 
max->pid, max->core_cookie);
+
if (old_max) {
for_each_cpu(j, smt_mask) {
if (j == i)
@@ -3837,6 +3861,7 @@ next_class:;
rq->core->core_pick_seq = rq->core->core_task_seq;
next = rq->core_pick;
rq->core_sched_seq = rq->core->core_pick_seq;
+   trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, 
next->core_cookie);
 
/*
 * Reschedule siblings
@@ -3862,11 +3887,20 @@ next_class:;
if (i == cpu)
continue;
 
-   if (rq_i->curr != rq_i->core_pick)
+   if (rq_i->curr != rq_i->core_pick) {
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
+   }
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%lx/0x%lx\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie,
+rq_i->core->core_cookie);
+   WARN_ON_ONCE(1);
+   }
}
 
 done:
@@ -3905,6 +3939,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation, 
cookie);
+
  

[RFC PATCH v3 09/16] sched: Introduce sched_class::pick_task()

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance) it is not state invariant. This makes it unsuitable
for remote task selection.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---

Changes in v3
-
- Minor refactor to remove redundant NULL checks

Changes in v2
-
- Fixes a NULL pointer dereference crash
  - Subhra Mazumdar
  - Tim Chen

---
 kernel/sched/deadline.c  | 21 -
 kernel/sched/fair.c  | 36 +---
 kernel/sched/idle.c  | 10 +-
 kernel/sched/rt.c| 21 -
 kernel/sched/sched.h |  2 ++
 kernel/sched/stop_task.c | 21 -
 6 files changed, 92 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d3904168857a..64fc444f44f9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1722,15 +1722,12 @@ static struct sched_dl_entity 
*pick_next_dl_entity(struct rq *rq,
return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *
-pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
struct sched_dl_entity *dl_se;
struct task_struct *p;
struct dl_rq *dl_rq;
 
-   WARN_ON_ONCE(prev || rf);
-
	dl_rq = &rq->dl;
 
if (unlikely(!dl_rq->dl_nr_running))
@@ -1741,7 +1738,19 @@ pick_next_task_dl(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
 
p = dl_task_of(dl_se);
 
-   set_next_task_dl(rq, p);
+   return p;
+}
+
+static struct task_struct *
+pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+   struct task_struct *p;
+
+   WARN_ON_ONCE(prev || rf);
+
+   p = pick_task_dl(rq);
+   if (p)
+   set_next_task_dl(rq, p);
 
return p;
 }
@@ -2388,6 +2397,8 @@ const struct sched_class dl_sched_class = {
.set_next_task  = set_next_task_dl,
 
 #ifdef CONFIG_SMP
+   .pick_task  = pick_task_dl,
+
.select_task_rq = select_task_rq_dl,
.migrate_task_rq= migrate_task_rq_dl,
.set_cpus_allowed   = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e65f2dfda77a..02e5dfb85e7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4136,7 +4136,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *curr)
 * Avoid running the skip buddy, if running something else can
 * be done without getting too unfair.
 */
-   if (cfs_rq->skip == se) {
+   if (cfs_rq->skip && cfs_rq->skip == se) {
struct sched_entity *second;
 
if (se == curr) {
@@ -4154,13 +4154,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *curr)
/*
 * Prefer last buddy, try to return the CPU to a preempted task.
 */
-   if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+   if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 
1)
se = cfs_rq->last;
 
/*
 * Someone really wants this to run. If it's not unfair, run it.
 */
-   if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+   if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 
1)
se = cfs_rq->next;
 
clear_buddies(cfs_rq, se);
@@ -6966,6 +6966,34 @@ static void check_preempt_wakeup(struct rq *rq, struct 
task_struct *p, int wake_
set_last_buddy(se);
 }
 
+static struct task_struct *
+pick_task_fair(struct rq *rq)
+{
+   struct cfs_rq *cfs_rq = &rq->cfs;
+   struct sched_entity *se;
+
+   if (!cfs_rq->nr_running)
+   return NULL;
+
+   do {
+   struct sched_entity *curr = cfs_rq->curr;
+
+   se = pick_next_entity(cfs_rq, NULL);
+
+   if (curr) {
+   if (se && curr->on_rq)
+   update_curr(cfs_rq);
+
+   if (!se || entity_before(curr, se))
+   se = curr;
+   }
+
+   cfs_rq = group_cfs_rq(se);
+   } while (cfs_rq);
+
+   return task_of(se);
+}
+
 static struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags 
*rf)
 {
@@ -10677,6 +10705,8 @@ const struct sched_class fair_sched_class = {
.set_next_task  = set_next_task_fair,
 
 #ifdef CONFIG_SMP
+   .pick_task  = pick_task_fair,
+
.select_task_rq = select_task_rq_fair

[RFC PATCH v3 14/16] sched/fair: Add a few assertions

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/fair.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8a107aea69b..26d29126d6a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6192,6 +6192,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
struct sched_domain *sd;
int i, recent_used_cpu;
 
+   /*
+* per-cpu select_idle_mask usage
+*/
+   lockdep_assert_irqs_disabled();
+
if (available_idle_cpu(target))
return target;
 
@@ -6619,8 +6624,6 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int 
wake_flags)
@@ -6631,6 +6634,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+   /*
+* required for stable ->cpus_allowed
+*/
+   lockdep_assert_held(&p->pi_lock);
+
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
 
-- 
2.17.1



[RFC PATCH v3 15/16] sched: Trivial forced-newidle balancer

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

When a sibling is forced-idle to match the core-cookie; search for
matching tasks to fill the core.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 131 +-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4b39a28236f..1a309e8546cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -641,6 +641,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned intcore_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e25811b81562..5b8223c9a723 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -199,6 +199,21 @@ static struct task_struct *sched_core_find(struct rq *rq, 
unsigned long cookie)
return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned 
long cookie)
+{
+   struct rb_node *node = &p->core_node;
+
+   node = rb_next(node);
+   if (!node)
+   return NULL;
+
+   p = container_of(node, struct task_struct, core_node);
+   if (p->core_cookie != cookie)
+   return NULL;
+
+   return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -3672,7 +3687,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
-   int i, j, cpu;
+   int i, j, cpu, occ = 0;
bool need_sync = false;
 
if (!sched_core_enabled(rq))
@@ -3774,6 +3789,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
goto done;
}
 
+   if (!is_idle_task(p))
+   occ++;
+
rq_i->core_pick = p;
 
/*
@@ -3799,6 +3817,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
cpu_rq(j)->core_pick = NULL;
}
+   occ = 1;
goto again;
} else {
/*
@@ -3838,6 +3857,8 @@ next_class:;
if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
rq->core_forceidle = true;
 
+   rq_i->core_pick->core_occupation = occ;
+
if (i == cpu)
continue;
 
@@ -3853,6 +3874,114 @@ next_class:;
return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+   struct task_struct *p;
+   unsigned long cookie;
+   bool success = false;
+
+   local_irq_disable();
+   double_rq_lock(dst, src);
+
+   cookie = dst->core->core_cookie;
+   if (!cookie)
+   goto unlock;
+
+   if (dst->curr != dst->idle)
+   goto unlock;
+
+   p = sched_core_find(src, cookie);
+   if (p == src->idle)
+   goto unlock;
+
+   do {
+   if (p == src->core_pick || p == src->curr)
+   goto next;
+
+   if (!cpumask_test_cpu(this, &p->cpus_allowed))
+   goto next;
+
+   if (p->core_occupation > dst->idle->core_occupation)
+   goto next;
+
+   p->on_rq = TASK_ON_RQ_MIGRATING;
+   deactivate_task(src, p, 0);
+   set_task_cpu(p, this);
+   activate_task(dst, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
+
+   resched_curr(dst);
+
+   success = true;
+   break;
+
+next:
+   p = sched_core_next(p, cookie);
+   } while (p);
+
+unlock:
+   double_rq_unlock(dst, src);
+   local_irq_enable();
+
+   return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+   int i;
+
+   for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+   if (i == cpu)
+   continue;
+
+   if (need_resched())
+   break;
+
+   if (try_steal_cookie(cpu, i))
+   return true;
+   }
+
+   return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+   struct sched_domain *sd;
+   int cpu = cpu_of(rq);
+
+   rcu_read_lock();
+   raw_spin_unlock_irq(rq_lockp(rq));
+   for_each_domain(cpu, sd) {
+   if (!(sd->flags & SD_LOAD_BALANCE))
+   

[RFC PATCH v3 10/16] sched: Core-wide rq->lock

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Introduce the basic infrastructure to have a core wide rq->lock.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
---

Changes in v3
-
- Fixes a crash during cpu offline/offline with coresched enabled
  - Vineeth Pillai

 kernel/Kconfig.preempt |   7 ++-
 kernel/sched/core.c| 113 -
 kernel/sched/sched.h   |  31 +++
 3 files changed, 148 insertions(+), 3 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 0fee5fe6c899..02fe0bf26676 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -57,4 +57,9 @@ config PREEMPT
 endchoice
 
 config PREEMPT_COUNT
-   bool
+   bool
+
+config SCHED_CORE
+   bool
+   default y
+   depends on SCHED_SMT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b883c70674ba..b1ce33f9b106 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -60,6 +60,70 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ * spin_lock(rq_lockp(rq));
+ * ...
+ * spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+   bool enabled = !!(unsigned long)data;
+   int cpu;
+
+   for_each_online_cpu(cpu)
+   cpu_rq(cpu)->core_enabled = enabled;
+
+   return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+   // XXX verify there are no cookie tasks (yet)
+
+   static_branch_enable(&__sched_core_enabled);
+   stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+   // XXX verify there are no cookie tasks (left)
+
+   stop_machine(__sched_core_stopper, (void *)false, NULL);
+   static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+   mutex_lock(&sched_core_mutex);
+   if (!sched_core_count++)
+   __sched_core_enable();
+   mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+   mutex_lock(&sched_core_mutex);
+   if (!--sched_core_count)
+   __sched_core_disable();
+   mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
@@ -5790,8 +5854,15 @@ int sched_cpu_activate(unsigned int cpu)
/*
 * When going up, increment the number of cores with SMT present.
 */
-   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+   if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
	static_branch_inc_cpuslocked(&sched_smt_present);
+#ifdef CONFIG_SCHED_CORE
+   if (static_branch_unlikely(&__sched_core_enabled)) {
+   rq->core_enabled = true;
+   }
+#endif
+   }
+
 #endif
set_cpu_active(cpu, true);
 
@@ -5839,8 +5910,16 @@ int sched_cpu_deactivate(unsigned int cpu)
/*
 * When going down, decrement the number of cores with SMT present.
 */
-   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+   if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
+#ifdef CONFIG_SCHED_CORE
+   struct rq *rq = cpu_rq(cpu);
+   if (static_branch_unlikely(&__sched_core_enabled)) {
+   rq->core_enabled = false;
+   }
+#endif
	static_branch_dec_cpuslocked(&sched_smt_present);
+
+   }
 #endif
 
if (!sched_smp_initialized)
@@ -5865,6 +5944,28 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+#ifdef CONFIG_SCHED_CORE
+   const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+   struct rq *rq, *core_rq = NULL;
+   int i;
+
+   for_each_cpu(i, smt_mask) {
+   rq = cpu_rq(i);
+   if (rq->core && rq->core == rq)
+   core_rq = rq;
+   }
+
+   if (!core_rq)
+   core_rq = cpu_rq(cpu);
+
+   for_each_cpu(i, smt_mask) {
+   rq = cpu_rq(i);
+
+   WARN_ON_ONCE(rq->core && rq->core != core_rq);
+   rq->core = core_rq;
+   }
+#endif /* CONFIG_SCHED_CORE */
+
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -5893,6 +5994,9 @@ int sched_cpu_dying(unsigned int cpu)
update_max_interval();
nohz_balance_exit_idle(rq);
hrtick_clear(rq);
+#ifdef CONFIG_SCHED_CORE
+   rq->core = NULL;
+#endif
return 0;
 }
 #endif
@@ -6091,6 

[RFC PATCH v3 13/16] sched: Add core wide task selection and scheduling.

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective which logical
CPU does the reschedule).

NOTE: there is still potential for siblings rivalry.
NOTE: this is far too complicated; but thus far I've failed to
  simplify it further.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
---

Changes in v3
-
- Fixes the issue of sibling picking an incompatible task.
  - Aaron Lu
  - Peter Zijlstra
  - Vineeth Pillai
  - Julien Desfossez

---
 kernel/sched/core.c  | 271 ++-
 kernel/sched/sched.h |   6 +-
 2 files changed, 274 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3164c6b33553..e25811b81562 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3556,7 +3556,7 @@ static inline void schedule_debug(struct task_struct 
*prev)
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
const struct sched_class *class;
struct task_struct *p;
@@ -3601,6 +3601,268 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+   return is_idle_task(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+   if (is_idle_task(a) || is_idle_task(b))
+   return true;
+
+   return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max)
+{
+   struct task_struct *class_pick, *cookie_pick;
+   unsigned long cookie = rq->core->core_cookie;
+
+   class_pick = class->pick_task(rq);
+   if (!class_pick)
+   return NULL;
+
+   if (!cookie) {
+   /*
+* If class_pick is tagged, return it only if it has
+* higher priority than max.
+*/
+   if (max && class_pick->core_cookie &&
+   prio_less(class_pick, max))
+   return idle_sched_class.pick_task(rq);
+
+   return class_pick;
+   }
+
+   /*
+* If class_pick is idle or matches cookie, return early.
+*/
+   if (cookie_equals(class_pick, cookie))
+   return class_pick;
+
+   cookie_pick = sched_core_find(rq, cookie);
+
+   /*
+* If class > max && class > cookie, it is the highest priority task on
+* the core (so far) and it must be selected, otherwise we must go with
+* the cookie pick in order to satisfy the constraint.
+*/
+   if (prio_less(cookie_pick, class_pick) &&
+   (!max || prio_less(max, class_pick)))
+   return class_pick;
+
+   return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+   struct task_struct *next, *max = NULL;
+   const struct sched_class *class;
+   const struct cpumask *smt_mask;
+   int i, j, cpu;
+   bool need_sync = false;
+
+   if (!sched_core_enabled(rq))
+   return __pick_next_task(rq, prev, rf);
+
+   /*
+* If there were no {en,de}queues since we picked (IOW, the task
+* pointers are all still valid), and we haven't scheduled the last
+* pick yet, do so now.
+*/
+   if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+   rq->core->core_pick_seq != rq->core_sched_seq) {
+   WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+   next = rq->core_pick;
+   if (next != prev) {
+   put_prev_task(rq, prev);
+   set_next_task(rq, next);
+   }
+   return next;
+   }
+
+   prev->sched_class->put_prev_task(rq, prev, rf);
+   if (!rq->nr_running)
+   newidle_balance(rq, rf);
+
+   cpu = cpu_of(rq);
+   smt_mask = cpu_smt_mask(cpu);
+
+   /*
+* core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+*
+* @task_seq guards the task state ({en,de}queues)
+* @pick_seq is the @task_seq we

[RFC PATCH v3 08/16] sched: Rework pick_next_task() slow-path

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Avoid the RETRY_TASK case in the pick_next_task() slow path.

By doing the put_prev_task() early, we get the rt/deadline pull done,
and by testing rq->nr_running we know if we need newidle_balance().

This then gives a stable state to pick a task from.

Since the fast-path is fair only; it means the other classes will
always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c  | 19 ---
 kernel/sched/deadline.c  | 30 ++
 kernel/sched/fair.c  |  9 ++---
 kernel/sched/idle.c  |  4 +++-
 kernel/sched/rt.c| 29 +
 kernel/sched/sched.h | 13 -
 kernel/sched/stop_task.c |  3 ++-
 7 files changed, 34 insertions(+), 73 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9dfa0c53deb3..b883c70674ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3363,7 +3363,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
p = fair_sched_class.pick_next_task(rq, prev, rf);
if (unlikely(p == RETRY_TASK))
-   goto again;
+   goto restart;
 
/* Assumes fair_sched_class->next == idle_sched_class */
if (unlikely(!p))
@@ -3372,14 +3372,19 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
return p;
}
 
-again:
+restart:
+   /*
+* Ensure that we put DL/RT tasks before the pick loop, such that they
+* can PULL higher prio tasks when we lower the RQ 'priority'.
+*/
+   prev->sched_class->put_prev_task(rq, prev, rf);
+   if (!rq->nr_running)
+   newidle_balance(rq, rf);
+
for_each_class(class) {
-   p = class->pick_next_task(rq, prev, rf);
-   if (p) {
-   if (unlikely(p == RETRY_TASK))
-   goto again;
+   p = class->pick_next_task(rq, NULL, NULL);
+   if (p)
return p;
-   }
}
 
/* The idle class should always have a runnable task: */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 45425f971eec..d3904168857a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1729,39 +1729,13 @@ pick_next_task_dl(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
struct task_struct *p;
struct dl_rq *dl_rq;
 
-   dl_rq = &rq->dl;
-
-   if (need_pull_dl_task(rq, prev)) {
-   /*
-* This is OK, because current is on_cpu, which avoids it being
-* picked for load-balance and preemption/IRQs are still
-* disabled avoiding further scheduler activity on it and we're
-* being very careful to re-start the picking loop.
-*/
-   rq_unpin_lock(rq, rf);
-   pull_dl_task(rq);
-   rq_repin_lock(rq, rf);
-   /*
-* pull_dl_task() can drop (and re-acquire) rq->lock; this
-* means a stop task can slip in, in which case we need to
-* re-start task selection.
-*/
-   if (rq->stop && task_on_rq_queued(rq->stop))
-   return RETRY_TASK;
-   }
+   WARN_ON_ONCE(prev || rf);
 
-   /*
-* When prev is DL, we may throttle it in put_prev_task().
-* So, we update time before we check for dl_nr_running.
-*/
-   if (prev->sched_class == &dl_sched_class)
-   update_curr_dl(rq);
+   dl_rq = &rq->dl;
 
if (unlikely(!dl_rq->dl_nr_running))
return NULL;
 
-   put_prev_task(rq, prev);
-
dl_se = pick_next_dl_entity(rq, dl_rq);
BUG_ON(!dl_se);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e3eb243fd9f..e65f2dfda77a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6979,7 +6979,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf
goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-   if (prev->sched_class != &fair_sched_class)
+   if (!prev || prev->sched_class != &fair_sched_class)
goto simple;
 
/*
@@ -7056,8 +7056,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf
goto done;
 simple:
 #endif
-
-   put_prev_task(rq, prev);
+   if (prev)
+   put_prev_task(rq, prev);
 
do {
se = pick_next_entity(cfs_rq, NULL);
@@ -7085,6 +7085,9 @@ done: __maybe_unused;
return p;
 
 idle:
+   if (!rf)
+   return NULL;
+
new_tasks = newidle_balance(rq, rf);
 
/*
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1b65a4c3683e..7ece8e820b5d 100644
--- 

[RFC PATCH v3 00/16] Core scheduling v3

2019-05-29 Thread Vineeth Remanan Pillai
Third iteration of the Core-Scheduling feature.

This version fixes mostly correctness related issues in v2 and
addresses performance issues. Also, addressed some crashes related
to cgroups and cpu hotplugging.

We have tested and verified that incompatible processes are not
selected during schedule. In terms of performance, the impact
depends on the workload: 
- on CPU intensive applications that use all the logical CPUs with
  SMT enabled, enabling core scheduling performs better than nosmt.
- on mixed workloads with considerable io compared to cpu usage,
  nosmt seems to perform better than core scheduling.

Changes in v3
-
- Fixes the issue of sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-
- Fixes for couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for process in different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Issues
--
- Comparing process priority across cpus is not accurate

TODO

- Decide on the API for exposing the feature to userland

---

Peter Zijlstra (16):
  stop_machine: Fix stop_cpus_in_progress ordering
  sched: Fix kerneldoc comment for ia64_set_curr_task
  sched: Wrap rq::lock access
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched/fair: Export newidle_balance()
  sched: Allow put_prev_task() to drop rq->lock
  sched: Rework pick_next_task() slow-path
  sched: Introduce sched_class::pick_task()
  sched: Core-wide rq->lock
  sched: Basic tracking of matching tasks
  sched: A quick and dirty cgroup tagging interface
  sched: Add core wide task selection and scheduling.
  sched/fair: Add a few assertions
  sched: Trivial forced-newidle balancer
  sched: Debug bits...

 include/linux/sched.h|   9 +-
 kernel/Kconfig.preempt   |   7 +-
 kernel/sched/core.c  | 858 +--
 kernel/sched/cpuacct.c   |  12 +-
 kernel/sched/deadline.c  |  99 +++--
 kernel/sched/debug.c |   4 +-
 kernel/sched/fair.c  | 180 
 kernel/sched/idle.c  |  42 +-
 kernel/sched/pelt.h  |   2 +-
 kernel/sched/rt.c|  96 ++---
 kernel/sched/sched.h | 237 ---
 kernel/sched/stop_task.c |  35 +-
 kernel/sched/topology.c  |   4 +-
 kernel/stop_machine.c|   2 +
 14 files changed, 1250 insertions(+), 337 deletions(-)

-- 
2.17.1



[RFC PATCH v3 12/16] sched: A quick and dirty cgroup tagging interface

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

Marks all tasks in a cgroup as matching for core-scheduling.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
---

Changes in v3
-
- Fixes the refcount management when deleting a tagged cgroup.
  - Julien Desfossez

---
 kernel/sched/core.c  | 78 
 kernel/sched/sched.h |  4 +++
 2 files changed, 82 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 112d70f2b1e5..3164c6b33553 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6658,6 +6658,15 @@ static void sched_change_group(struct task_struct *tsk, 
int type)
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
  struct task_group, css);
tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+   if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
+   tsk->core_cookie = 0UL;
+
+   if (tg->tagged /* && !tsk->core_cookie ? */)
+   tsk->core_cookie = (unsigned long)tg;
+#endif
+
tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -6737,6 +6746,18 @@ static int cpu_cgroup_css_online(struct 
cgroup_subsys_state *css)
return 0;
 }
 
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+   struct task_group *tg = css_tg(css);
+
+   if (tg->tagged) {
+   sched_core_put();
+   tg->tagged = 0;
+   }
+#endif
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
struct task_group *tg = css_tg(css);
@@ -7117,6 +7138,46 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_CORE
+static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct 
cftype *cft)
+{
+   struct task_group *tg = css_tg(css);
+
+   return !!tg->tagged;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct 
cftype *cft, u64 val)
+{
+   struct task_group *tg = css_tg(css);
+   struct css_task_iter it;
+   struct task_struct *p;
+
+   if (val > 1)
+   return -ERANGE;
+
+   if (!static_branch_likely(&sched_smt_present))
+   return -EINVAL;
+
+   if (tg->tagged == !!val)
+   return 0;
+
+   tg->tagged = !!val;
+
+   if (!!val)
+   sched_core_get();
+
+   css_task_iter_start(css, 0, &it);
+   while ((p = css_task_iter_next(&it)))
+   p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+   css_task_iter_end(&it);
+
+   if (!val)
+   sched_core_put();
+
+   return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -7152,6 +7213,14 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_SCHED_CORE
+   {
+   .name = "tag",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_tag_read_u64,
+   .write_u64 = cpu_core_tag_write_u64,
+   },
 #endif
{ } /* Terminate */
 };
@@ -7319,6 +7388,14 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_SCHED_CORE
+   {
+   .name = "tag",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_tag_read_u64,
+   .write_u64 = cpu_core_tag_write_u64,
+   },
 #endif
{ } /* terminate */
 };
@@ -7326,6 +7403,7 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc  = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
+   .css_offline= cpu_cgroup_css_offline,
.css_released   = cpu_cgroup_css_released,
.css_free   = cpu_cgroup_css_free,
.css_extra_stat_show = cpu_extra_stat_show,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0cbcfb6c8ee4..bd9b473ebde2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -363,6 +363,10 @@ struct cfs_bandwidth {
 struct task_group {
struct cgroup_subsys_state css;
 
+#ifdef CONFIG_SCHED_CORE
+   int tagged;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
struct sched_entity **se;
-- 
2.17.1



[RFC PATCH v3 05/16] sched: Add task_struct pointer to sched_class::set_curr_task

2019-05-29 Thread Vineeth Remanan Pillai
From: Peter Zijlstra 

In preparation of further separating pick_next_task() and
set_curr_task() we have to pass the actual task into it, while there,
rename the thing to better pair with put_prev_task().

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c  | 12 ++--
 kernel/sched/deadline.c  |  7 +--
 kernel/sched/fair.c  | 17 ++---
 kernel/sched/idle.c  | 27 +++
 kernel/sched/rt.c|  7 +--
 kernel/sched/sched.h |  8 +---
 kernel/sched/stop_task.c | 17 +++--
 7 files changed, 49 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6f4861ae85dc..32ea79fb8d29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1081,7 +1081,7 @@ void do_set_cpus_allowed(struct task_struct *p, const 
struct cpumask *new_mask)
if (queued)
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
if (running)
-   set_curr_task(rq, p);
+   set_next_task(rq, p);
 }
 
 /*
@@ -3890,7 +3890,7 @@ void rt_mutex_setprio(struct task_struct *p, struct 
task_struct *pi_task)
if (queued)
enqueue_task(rq, p, queue_flag);
if (running)
-   set_curr_task(rq, p);
+   set_next_task(rq, p);
 
check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -3957,7 +3957,7 @@ void set_user_nice(struct task_struct *p, long nice)
resched_curr(rq);
}
if (running)
-   set_curr_task(rq, p);
+   set_next_task(rq, p);
 out_unlock:
task_rq_unlock(rq, p, &rf);
 }
@@ -4382,7 +4382,7 @@ static int __sched_setscheduler(struct task_struct *p,
enqueue_task(rq, p, queue_flags);
}
if (running)
-   set_curr_task(rq, p);
+   set_next_task(rq, p);
 
check_class_changed(rq, p, prev_class, oldprio);
 
@@ -,7 +,7 @@ void sched_setnuma(struct task_struct *p, int nid)
if (queued)
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
if (running)
-   set_curr_task(rq, p);
+   set_next_task(rq, p);
task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -6438,7 +6438,7 @@ void sched_move_task(struct task_struct *tsk)
if (queued)
enqueue_task(rq, tsk, queue_flags);
if (running)
-   set_curr_task(rq, tsk);
+   set_next_task(rq, tsk);
 
task_rq_unlock(rq, tsk, &rf);
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0783dfa65150..c02b3229e2c3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1812,11 +1812,6 @@ static void task_fork_dl(struct task_struct *p)
 */
 }
 
-static void set_curr_task_dl(struct rq *rq)
-{
-   set_next_task_dl(rq, rq->curr);
-}
-
 #ifdef CONFIG_SMP
 
 /* Only try algorithms three times */
@@ -2404,6 +2399,7 @@ const struct sched_class dl_sched_class = {
 
.pick_next_task = pick_next_task_dl,
.put_prev_task  = put_prev_task_dl,
+   .set_next_task  = set_next_task_dl,
 
 #ifdef CONFIG_SMP
.select_task_rq = select_task_rq_dl,
@@ -2414,7 +2410,6 @@ const struct sched_class dl_sched_class = {
.task_woken = task_woken_dl,
 #endif
 
-   .set_curr_task  = set_curr_task_dl,
.task_tick  = task_tick_dl,
.task_fork  = task_fork_dl,
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c17faf672dcf..56fc2a1aa261 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10388,9 +10388,19 @@ static void switched_to_fair(struct rq *rq, struct 
task_struct *p)
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
  */
-static void set_curr_task_fair(struct rq *rq)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p)
 {
-   struct sched_entity *se = &rq->curr->se;
+   struct sched_entity *se = &p->se;
+
+#ifdef CONFIG_SMP
+   if (task_on_rq_queued(p)) {
+   /*
+* Move the next running task to the front of the list, so our
+* cfs_tasks list becomes MRU one.
+*/
+   list_move(&se->group_node, &rq->cfs_tasks);
+   }
+#endif
 
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -10661,7 +10671,9 @@ const struct sched_class fair_sched_class = {
.check_preempt_curr = check_preempt_wakeup,
 
.pick_next_task = pick_next_task_fair,
+
.put_prev_task  = put_prev_task_fair,
+   .set_next_task  = set_next_task_fair,
 
 #ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
@@ -10674,7 +10686,6 @@ const struct sched_class fair_sched_class = {
.set_cpus_allowed   = 

Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-22 Thread Vineeth Remanan Pillai
> > I do not have a strong opinion on both. Probably a better approach
> > would be to replace both cpu_prio_less/core_prio_less with prio_less
> > which takes the third arguement 'bool on_same_rq'?
> >
>
> Fwiw, I find the two names easier to read than a boolean flag. Could still
> be wrapped to a single implementation I suppose.
>
> An enum to control cpu or core would be more readable, but probably 
> overkill...
>
I think we can in fact remove the boolean altogether and still have a single
function to compare the priority: if the tasks are on the same cpu, use the task's
vruntime directly, else do the normalization.

Thanks,
Vineeth

---
-static inline bool __prio_less(struct task_struct *a, struct task_struct *b, 
bool core_cmp)
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 {
-   u64 vruntime;
 
int pa = __task_prio(a), pb = __task_prio(b);
 
@@ -119,25 +105,21 @@ static inline bool __prio_less(struct task_struct *a, 
struct task_struct *b, boo
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-   vruntime = b->se.vruntime;
-   if (core_cmp) {
-   vruntime -= task_cfs_rq(b)->min_vruntime;
-   vruntime += task_cfs_rq(a)->min_vruntime;
-   }
-   if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
-   return !((s64)(a->se.vruntime - vruntime) <= 0);
+   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+   u64 vruntime = b->se.vruntime;
 
-   return false;
-}
+   /*
+* Normalize the vruntime if tasks are in different cpus.
+*/
+   if (task_cpu(a) != task_cpu(b)) {
+   vruntime -= task_cfs_rq(b)->min_vruntime;
+   vruntime += task_cfs_rq(a)->min_vruntime;
+   }
 
-static inline bool cpu_prio_less(struct task_struct *a, struct task_struct *b)
-{
-   return __prio_less(a, b, false);
-}
+   return !((s64)(a->se.vruntime - vruntime) <= 0);
+   }
 
-static inline bool core_prio_less(struct task_struct *a, struct task_struct *b)
-{
-   return __prio_less(a, b, true);
+   return false;
 }
 
 static inline bool __sched_core_less(struct task_struct *a, struct task_struct 
*b)
@@ -149,7 +131,7 @@ static inline bool __sched_core_less(struct task_struct *a, 
struct task_struct *
return false;
 
/* flip prio, so high prio is leftmost */
-   if (cpu_prio_less(b, a))
+   if (prio_less(b, a))
return true;
 
return false;
@@ -3621,7 +3603,7 @@ pick_task(struct rq *rq, const struct sched_class *class, 
struct task_struct *ma
 * higher priority than max.
 */
if (max && class_pick->core_cookie &&
-   core_prio_less(class_pick, max))
+   prio_less(class_pick, max))
return idle_sched_class.pick_task(rq);
 
return class_pick;
@@ -3640,8 +3622,8 @@ pick_task(struct rq *rq, const struct sched_class *class, 
struct task_struct *ma
 * the core (so far) and it must be selected, otherwise we must go with
 * the cookie pick in order to satisfy the constraint.
 */
-   if (cpu_prio_less(cookie_pick, class_pick) &&
-   (!max || core_prio_less(max, class_pick)))
+   if (prio_less(cookie_pick, class_pick) &&
+   (!max || prio_less(max, class_pick)))
return class_pick;
 
return cookie_pick;



Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-05-15 Thread Vineeth Remanan Pillai
> Thanks for pointing this out. I think the ideal fix would be to
> correctly initialize/cleanup the coresched attributes in the cpu
> hotplug code path so that lock could be taken successfully if the
> sibling is offlined/onlined after coresched was enabled. We are
> working on another bug related to hotplugpath and shall introduce
> the fix in v3.
>
A possible fix for handling the runqueues during cpu offline/online
is attached herewith.

Thanks,
Vineeth

---
 kernel/sched/core.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8e5f26db052..1a809849a1e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -253,7 +253,7 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;
 
-   for_each_possible_cpu(cpu)
+   for_each_online_cpu(cpu)
cpu_rq(cpu)->core_enabled = enabled;
 
return 0;
@@ -3764,6 +3764,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
struct rq *rq_i = cpu_rq(i);
struct task_struct *p;
 
+   if (cpu_is_offline(i))
+   continue;
+
if (rq_i->core_pick)
continue;
 
@@ -3866,6 +3869,9 @@ next_class:;
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);
 
+   if (cpu_is_offline(i))
+   continue;
+
WARN_ON_ONCE(!rq_i->core_pick);
 
rq_i->core_pick->core_occupation = occ;
@@ -6410,8 +6416,14 @@ int sched_cpu_activate(unsigned int cpu)
/*
 * When going up, increment the number of cores with SMT present.
 */
-   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+   if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
static_branch_inc_cpuslocked(&sched_smt_present);
+#ifdef CONFIG_SCHED_CORE
+   if (static_branch_unlikely(&__sched_core_enabled)) {
+   rq->core_enabled = true;
+   }
+#endif
+   }
 #endif
set_cpu_active(cpu, true);
 
@@ -6459,8 +6471,15 @@ int sched_cpu_deactivate(unsigned int cpu)
/*
 * When going down, decrement the number of cores with SMT present.
 */
-   if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+   if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
+#ifdef CONFIG_SCHED_CORE
+   struct rq *rq = cpu_rq(cpu);
+   if (static_branch_unlikely(&__sched_core_enabled)) {
+   rq->core_enabled = false;
+   }
+#endif
static_branch_dec_cpuslocked(&sched_smt_present);
+   }
 #endif
 
if (!sched_smp_initialized)
@@ -6537,6 +6556,9 @@ int sched_cpu_dying(unsigned int cpu)
update_max_interval();
nohz_balance_exit_idle(rq);
hrtick_clear(rq);
+#ifdef CONFIG_SCHED_CORE
+   rq->core = NULL;
+#endif
return 0;
 }
 #endif
-- 
2.17.1


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-05-15 Thread Vineeth Remanan Pillai
> It's clear now, thanks.
> I don't immediately see how my isolation fix would make your fix stop
> working, will need to check. But I'm busy with other stuffs so it will
> take a while.
>
We have identified the issue and have a fix for it. The issue is the
same as before: the forced-idle sibling has a runnable process which
is starved due to an unconstrained pick bug.

One sample scenario is like this:
cpu0 and cpu1 are siblings. cpu0 selects an untagged process 'a'
which forces idle on cpu1 even though it had a runnable tagged
process 'b' which is determined by the code to be of lesser priority.
cpu1 can go to deep idle.

During the next schedule in cpu0, the following could happen:
 - cpu0 selects swapper as there is nothing to run and hence
   prev_cookie is 0, it does an unconstrained pick of swapper.
   So both cpu0 and 1 are idling and cpu1 might be deep idle.
 - cpu0 again goes to schedule and selects 'a' which is runnable
   now. Since prev_cookie is 0, 'a' is an unconstrained pick and
   'b' on cpu1 is forgotten again.

This continues with swapper and process 'a' taking turns without
considering sibling until a tagged process becomes runnable in cpu0
and then we don't get into unconstrained pick.

The above is one of a couple of scenarios we have seen, and each
has a slightly different path, but all ultimately lead to an
unconstrained pick, starving the sibling's runnable thread.

The fix is to mark whether a core went forced-idle while it had a
runnable process, and to not do an unconstrained pick if a forced
idle happened in the last pick.

I am attaching herewith the patch that fixes the above issue. The patch
is on top of Peter's fix and your correctness fix that we modified for
v2. We have a public repository with all the changes including this
fix as well:
https://github.com/digitalocean/linux-coresched/tree/coresched

We are working on a v3 where the last 3 commits will be squashed to
their related patches in v2. We hope to come up with a v3 next week
with all the suggestions and fixes posted in v2.

Thanks,
Vineeth

---
 kernel/sched/core.c  | 26 ++
 kernel/sched/sched.h |  1 +
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 413d46bde17d..3aba0f8fe384 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3653,8 +3653,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
-   unsigned long prev_cookie;
int i, j, cpu, occ = 0;
+   bool need_sync = false;
 
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);
@@ -3702,7 +3702,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 * 'Fix' this by also increasing @task_seq for every pick.
 */
rq->core->core_task_seq++;
-   prev_cookie = rq->core->core_cookie;
+   need_sync = !!rq->core->core_cookie;
 
/* reset state */
rq->core->core_cookie = 0UL;
@@ -3711,6 +3711,11 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
rq_i->core_pick = NULL;
 
+   if (rq_i->core_forceidle) {
+   need_sync = true;
+   rq_i->core_forceidle = false;
+   }
+
if (i != cpu)
update_rq_clock(rq_i);
}
@@ -3743,7 +3748,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 * If there weren't no cookies; we don't need
 * to bother with the other siblings.
 */
-   if (i == cpu && !prev_cookie)
+   if (i == cpu && !need_sync)
goto next_class;
 
continue;
@@ -3753,7 +3758,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 * Optimize the 'normal' case where there aren't any
 * cookies and we don't need to sync up.
 */
-   if (i == cpu && !prev_cookie && !p->core_cookie) {
+   if (i == cpu && !need_sync && !p->core_cookie) {
next = p;
rq->core_pick = NULL;
rq->core->core_cookie = 0UL;
@@ -3816,7 +3821,16 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
}
occ = 1;
goto again;
+   } else {
+   /*
+* Once we select 

Re: [RFC PATCH v2 09/17] sched: Introduce sched_class::pick_task()

2019-04-26 Thread Vineeth Remanan Pillai
>
> I didn't get around to reading the original discussion here, but how can
> that possibly be?
>
> I can see !se, in that case curr is still selected.
>
> I can also see !curr, in that case curr is put.
>
> But I cannot see !se && !curr, per the above check we know
> cfs_rq->nr_running != 0, so there must be a cfs task to find. This means
> either curr or se must exist.

This fix was suggested as a quick fix for a crash seen in v1. But
I agree with you that if this happens it would be a bug and should
be investigated. I have tried in v2 and can no longer reproduce the
crash. Will remove the check in v3.
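
For reference, a rough sketch (not the actual v3 change, just the
direction) of what pick_task_fair() could look like once the check is
dropped, relying on the cfs_rq->nr_running test to guarantee that
either curr or a queued entity exists at every level:

static struct task_struct *pick_task_fair(struct rq *rq)
{
        struct cfs_rq *cfs_rq = &rq->cfs;
        struct sched_entity *se;

        if (!cfs_rq->nr_running)
                return NULL;

        do {
                struct sched_entity *curr = cfs_rq->curr;

                /* Keep curr's runtime current when it is still running. */
                if (curr) {
                        if (curr->on_rq)
                                update_curr(cfs_rq);
                        else
                                curr = NULL;
                }

                /*
                 * nr_running != 0 guarantees that pick_next_entity()
                 * finds either curr or a queued entity here.
                 */
                se = pick_next_entity(cfs_rq, curr);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

        return task_of(se);
}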

Thanks




Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-04-24 Thread Vineeth Remanan Pillai
> The sched_core_* functions are used only in the core.c
> they are declared in.  We can convert them to static functions.

Thanks for pointing this out, will accommodate this in v3.

Thanks


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-04-24 Thread Vineeth Remanan Pillai
> A minor nitpick.  I find keeping the vruntime base readjustment in
> core_prio_less probably is more straight forward rather than pass a
> core_cmp bool around.

The reason I moved the vruntime base adjustment to __prio_less is
because, the vruntime seemed alien to __prio_less when looked as
a standalone function.

I do not have a strong opinion on either. Probably a better approach
would be to replace both cpu_prio_less/core_prio_less with a single
prio_less() which takes a third argument, 'bool on_same_rq'?
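
Rough sketch of that variant (untested, derived from the current
__prio_less; 'on_same_rq' is just a placeholder name):

/* real prio, less is less */
static inline bool prio_less(struct task_struct *a, struct task_struct *b,
                             bool on_same_rq)
{
        int pa = __task_prio(a), pb = __task_prio(b);

        if (-pa < -pb)
                return true;

        if (-pb < -pa)
                return false;

        if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
                return !dl_time_before(a->dl.deadline, b->dl.deadline);

        if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
                u64 vruntime = b->se.vruntime;

                /* Normalize vruntime only when comparing across runqueues. */
                if (!on_same_rq) {
                        vruntime -= task_cfs_rq(b)->min_vruntime;
                        vruntime += task_cfs_rq(a)->min_vruntime;
                }

                return !((s64)(a->se.vruntime - vruntime) <= 0);
        }

        return false;
}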

Thanks


Re: [RFC PATCH v2 15/17] sched: Trivial forced-newidle balancer

2019-04-24 Thread Vineeth Remanan Pillai
> try_steal_cookie() is in the loop of for_each_cpu_wrap().
> The root domain could be large and we should avoid
> stealing cookie if source rq has only one task or dst is really busy.
>
> The following patch eliminated a deadlock issue on my side if I didn't
> miss anything in v1. I'll double check with v2, but it at least avoids
> unnecessary irq off/on and double rq lock. Especially, it avoids lock
> contention that the idle cpu which is holding rq lock in the progress
> of load_balance() and tries to lock rq here. I think it might be worth to
> be picked up.
>

The dst->nr_running is actually checked in queue_core_balance with the
lock held. Also, try_steal_cookie checks if dst is running idle, but
under the lock. Checking whether src is empty makes sense, but shouldn't
that check also be done under the rq lock? A couple of safety and
performance checks are already done before calling try_steal_cookie,
and hence I hope the double lock would not cause a major performance
issue.
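
If we do add the source-emptiness check, a rough sketch of where it
could sit once the double rq lock is held (untested; try_steal_cookie()
internals paraphrased from this series, not quoted exactly):

static bool try_steal_cookie(int this, int that)
{
        struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
        struct task_struct *p;
        unsigned long cookie;
        bool success = false;

        local_irq_disable();
        double_rq_lock(dst, src);

        cookie = dst->core->core_cookie;
        if (!cookie)
                goto unlock;

        /* dst must still be idling for the steal to be useful. */
        if (dst->curr != dst->idle)
                goto unlock;

        /*
         * Check the source here, under the lock, so we don't race with
         * a concurrent dequeue on that runqueue.
         */
        if (src->nr_running <= 1)
                goto unlock;

        p = sched_core_find(src, cookie);
        if (p && !is_idle_task(p))
                success = true;     /* actual migration elided */

unlock:
        double_rq_unlock(dst, src);
        local_irq_enable();

        return success;
}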

If the hard lockup is reproducible with v2, could you please share more
details about the lockup?

Thanks


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-24 Thread Vineeth Remanan Pillai
> Is this one missed? Or fixed with a better impl?
>
> The boot up CPUs don't match the possible cpu map, so the not onlined
> CPU rq->core are not initialized, which causes NULL pointer dereference
> panic in online_fair_sched_group():
>
Thanks for pointing this out. I think the ideal fix would be to
correctly initialize/cleanup the coresched attributes in the cpu
hotplug code path so that the lock could be taken successfully if the
sibling is offlined/onlined after coresched was enabled. We are
working on another bug related to the hotplug path and shall introduce
the fix in v3.

Thanks


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-23 Thread Vineeth Remanan Pillai
>> - Processes with different tags can still share the core

> I may have missed something... Could you explain this statement?

> This, to me, is the whole point of the patch series. If it's not
> doing this then ... what?

What I meant was, the patch needs some more work to be accurate.
There are some race conditions where the core violation can still
happen. In our testing, we saw around 1 to 5% of the time being
shared with incompatible processes. One example of this happening
is as follows (let cpu 0 and 1 be siblings):
- cpu 0 selects a process with a cookie
- cpu 1 selects a higher priority process without cookie
- Selection process restarts for cpu 0 and it might select a
  process with cookie but with lesser priority.
- Since it is lesser priority, the logic in pick_next_task
  doesn't compare again for the cookie(trusts pick_task) and
  proceeds.

This is one of the scenarios that we saw from traces, but there
might be other race conditions as well. The fix seems a little
involved and we are working on it.
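
The rough direction we are considering (untested sketch; names as in
this series) is to have pick_task() re-check the class pick against the
current 'max' even when the class pick carries a cookie, and fall back
to the idle task when it loses the comparison, instead of trusting the
earlier selection:

static struct task_struct *
pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
{
        struct task_struct *class_pick, *cookie_pick;
        unsigned long cookie = 0UL;

        if (max && max->core_cookie)
                cookie = max->core_cookie;

        class_pick = class->pick_task(rq);
        if (!cookie)
                return class_pick;

        cookie_pick = sched_core_find(rq, cookie);
        if (!class_pick)
                return cookie_pick;

        /*
         * class_pick may have been chosen after a selection restart and
         * carry a cookie while being of lower priority than max; re-check
         * it here instead of trusting the earlier pick, and force idle
         * when it loses.
         */
        if (max && class_pick->core_cookie &&
            core_prio_less(class_pick, max))
                return idle_sched_class.pick_task(rq);

        if (cpu_prio_less(cookie_pick, class_pick) &&
            (!max || core_prio_less(max, class_pick)))
                return class_pick;

        return cookie_pick;
}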

Thanks


[RFC PATCH v2 01/17] stop_machine: Fix stop_cpus_in_progress ordering

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Make sure the entire for loop has stop_cpus_in_progress set.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/stop_machine.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 067cb83f37ea..583119e0c51c 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -375,6 +375,7 @@ static bool queue_stop_cpus_work(const struct cpumask 
*cpumask,
 */
preempt_disable();
stop_cpus_in_progress = true;
+   barrier();
for_each_cpu(cpu, cpumask) {
work = &per_cpu(cpu_stopper.stop_work, cpu);
work->fn = fn;
@@ -383,6 +384,7 @@ static bool queue_stop_cpus_work(const struct cpumask 
*cpumask,
if (cpu_stop_queue_work(cpu, work))
queued = true;
}
+   barrier();
stop_cpus_in_progress = false;
preempt_enable();
 
-- 
2.17.1



[RFC PATCH v2 04/17] sched/{rt,deadline}: Fix set_next_task vs pick_next_task

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Because pick_next_task() implies set_curr_task() and some of the
details haven't matter too much, some of what _should_ be in
set_curr_task() ended up in pick_next_task, correct this.

This prepares the way for a pick_next_task() variant that does not
affect the current state; allowing remote picking.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/deadline.c | 23 ---
 kernel/sched/rt.c   | 27 ++-
 2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 133fbcc58ea1..b8e15c7aa889 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1695,12 +1695,21 @@ static void start_hrtick_dl(struct rq *rq, struct 
task_struct *p)
 }
 #endif
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p)
 {
p->se.exec_start = rq_clock_task(rq);
 
/* You can't push away the running task */
dequeue_pushable_dl_task(rq, p);
+
+   if (hrtick_enabled(rq))
+   start_hrtick_dl(rq, p);
+
+   if (rq->curr->sched_class != &dl_sched_class)
+   update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+   if (rq->curr != p)
+   deadline_queue_push_tasks(rq);
 }
 
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
@@ -1759,15 +1768,7 @@ pick_next_task_dl(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
 
p = dl_task_of(dl_se);
 
-   set_next_task(rq, p);
-
-   if (hrtick_enabled(rq))
-   start_hrtick_dl(rq, p);
-
-   deadline_queue_push_tasks(rq);
-
-   if (rq->curr->sched_class != &dl_sched_class)
-   update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+   set_next_task_dl(rq, p);
 
return p;
 }
@@ -1814,7 +1815,7 @@ static void task_fork_dl(struct task_struct *p)
 
 static void set_curr_task_dl(struct rq *rq)
 {
-   set_next_task(rq, rq->curr);
+   set_next_task_dl(rq, rq->curr);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3d9db8c75d53..353ad960691b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1498,12 +1498,23 @@ static void check_preempt_curr_rt(struct rq *rq, struct 
task_struct *p, int flag
 #endif
 }
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
 {
p->se.exec_start = rq_clock_task(rq);
 
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
+
+   /*
+* If prev task was rt, put_prev_task() has already updated the
+* utilization. We only care of the case where we start to schedule a
+* rt task
+*/
+   if (rq->curr->sched_class != &rt_sched_class)
+   update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+   if (rq->curr != p)
+   rt_queue_push_tasks(rq);
 }
 
 static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
@@ -1577,17 +1588,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
 
p = _pick_next_task_rt(rq);
 
-   set_next_task(rq, p);
-
-   rt_queue_push_tasks(rq);
-
-   /*
-* If prev task was rt, put_prev_task() has already updated the
-* utilization. We only care of the case where we start to schedule a
-* rt task
-*/
-   if (rq->curr->sched_class != &rt_sched_class)
-   update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+   set_next_task_rt(rq, p);
 
return p;
 }
@@ -2356,7 +2357,7 @@ static void task_tick_rt(struct rq *rq, struct 
task_struct *p, int queued)
 
 static void set_curr_task_rt(struct rq *rq)
 {
-   set_next_task(rq, rq->curr);
+   set_next_task_rt(rq, rq->curr);
 }
 
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
-- 
2.17.1



[RFC PATCH v2 08/17] sched: Rework pick_next_task() slow-path

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Avoid the RETRY_TASK case in the pick_next_task() slow path.

By doing the put_prev_task() early, we get the rt/deadline pull done,
and by testing rq->nr_running we know if we need newidle_balance().

This then gives a stable state to pick a task from.

Since the fast-path is fair only; it means the other classes will
always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c  | 19 ---
 kernel/sched/deadline.c  | 30 ++
 kernel/sched/fair.c  |  9 ++---
 kernel/sched/idle.c  |  4 +++-
 kernel/sched/rt.c| 29 +
 kernel/sched/sched.h | 13 -
 kernel/sched/stop_task.c |  3 ++-
 7 files changed, 34 insertions(+), 73 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9dfa0c53deb3..b883c70674ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3363,7 +3363,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
p = fair_sched_class.pick_next_task(rq, prev, rf);
if (unlikely(p == RETRY_TASK))
-   goto again;
+   goto restart;
 
/* Assumes fair_sched_class->next == idle_sched_class */
if (unlikely(!p))
@@ -3372,14 +3372,19 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
return p;
}
 
-again:
+restart:
+   /*
+* Ensure that we put DL/RT tasks before the pick loop, such that they
+* can PULL higher prio tasks when we lower the RQ 'priority'.
+*/
+   prev->sched_class->put_prev_task(rq, prev, rf);
+   if (!rq->nr_running)
+   newidle_balance(rq, rf);
+
for_each_class(class) {
-   p = class->pick_next_task(rq, prev, rf);
-   if (p) {
-   if (unlikely(p == RETRY_TASK))
-   goto again;
+   p = class->pick_next_task(rq, NULL, NULL);
+   if (p)
return p;
-   }
}
 
/* The idle class should always have a runnable task: */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 56791c0318a2..249310e68592 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1730,39 +1730,13 @@ pick_next_task_dl(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
struct task_struct *p;
struct dl_rq *dl_rq;
 
-   dl_rq = &rq->dl;
-
-   if (need_pull_dl_task(rq, prev)) {
-   /*
-* This is OK, because current is on_cpu, which avoids it being
-* picked for load-balance and preemption/IRQs are still
-* disabled avoiding further scheduler activity on it and we're
-* being very careful to re-start the picking loop.
-*/
-   rq_unpin_lock(rq, rf);
-   pull_dl_task(rq);
-   rq_repin_lock(rq, rf);
-   /*
-* pull_dl_task() can drop (and re-acquire) rq->lock; this
-* means a stop task can slip in, in which case we need to
-* re-start task selection.
-*/
-   if (rq->stop && task_on_rq_queued(rq->stop))
-   return RETRY_TASK;
-   }
+   WARN_ON_ONCE(prev || rf);
 
-   /*
-* When prev is DL, we may throttle it in put_prev_task().
-* So, we update time before we check for dl_nr_running.
-*/
-   if (prev->sched_class == &dl_sched_class)
-   update_curr_dl(rq);
+   dl_rq = &rq->dl;
 
if (unlikely(!dl_rq->dl_nr_running))
return NULL;
 
-   put_prev_task(rq, prev);
-
dl_se = pick_next_dl_entity(rq, dl_rq);
BUG_ON(!dl_se);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41ec5e68e1c5..c055bad249a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6950,7 +6950,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf
goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-   if (prev->sched_class != &fair_sched_class)
+   if (!prev || prev->sched_class != &fair_sched_class)
goto simple;
 
/*
@@ -7027,8 +7027,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf
goto done;
 simple:
 #endif
-
-   put_prev_task(rq, prev);
+   if (prev)
+   put_prev_task(rq, prev);
 
do {
se = pick_next_entity(cfs_rq, NULL);
@@ -7056,6 +7056,9 @@ done: __maybe_unused;
return p;
 
 idle:
+   if (!rf)
+   return NULL;
+
new_tasks = newidle_balance(rq, rf);
 
/*
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1b65a4c3683e..7ece8e820b5d 100644
--- 

[RFC PATCH v2 05/17] sched: Add task_struct pointer to sched_class::set_curr_task

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

In preparation of further separating pick_next_task() and
set_curr_task() we have to pass the actual task into it, while there,
rename the thing to better pair with put_prev_task().

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c  | 12 ++--
 kernel/sched/deadline.c  |  7 +--
 kernel/sched/fair.c  | 17 ++---
 kernel/sched/idle.c  | 27 +++
 kernel/sched/rt.c|  7 +--
 kernel/sched/sched.h |  8 +---
 kernel/sched/stop_task.c | 17 +++--
 7 files changed, 49 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6f4861ae85dc..32ea79fb8d29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1081,7 +1081,7 @@ void do_set_cpus_allowed(struct task_struct *p, const 
struct cpumask *new_mask)
if (queued)
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
if (running)
-   set_curr_task(rq, p);
+   set_next_task(rq, p);
 }
 
 /*
@@ -3890,7 +3890,7 @@ void rt_mutex_setprio(struct task_struct *p, struct 
task_struct *pi_task)
if (queued)
enqueue_task(rq, p, queue_flag);
if (running)
-   set_curr_task(rq, p);
+   set_next_task(rq, p);
 
check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -3957,7 +3957,7 @@ void set_user_nice(struct task_struct *p, long nice)
resched_curr(rq);
}
if (running)
-   set_curr_task(rq, p);
+   set_next_task(rq, p);
 out_unlock:
task_rq_unlock(rq, p, &rf);
 }
@@ -4382,7 +4382,7 @@ static int __sched_setscheduler(struct task_struct *p,
enqueue_task(rq, p, queue_flags);
}
if (running)
-   set_curr_task(rq, p);
+   set_next_task(rq, p);
 
check_class_changed(rq, p, prev_class, oldprio);
 
@@ -,7 +,7 @@ void sched_setnuma(struct task_struct *p, int nid)
if (queued)
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
if (running)
-   set_curr_task(rq, p);
+   set_next_task(rq, p);
task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -6438,7 +6438,7 @@ void sched_move_task(struct task_struct *tsk)
if (queued)
enqueue_task(rq, tsk, queue_flags);
if (running)
-   set_curr_task(rq, tsk);
+   set_next_task(rq, tsk);
 
task_rq_unlock(rq, tsk, &rf);
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b8e15c7aa889..fadfbfe7d573 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1813,11 +1813,6 @@ static void task_fork_dl(struct task_struct *p)
 */
 }
 
-static void set_curr_task_dl(struct rq *rq)
-{
-   set_next_task_dl(rq, rq->curr);
-}
-
 #ifdef CONFIG_SMP
 
 /* Only try algorithms three times */
@@ -2405,6 +2400,7 @@ const struct sched_class dl_sched_class = {
 
.pick_next_task = pick_next_task_dl,
.put_prev_task  = put_prev_task_dl,
+   .set_next_task  = set_next_task_dl,
 
 #ifdef CONFIG_SMP
.select_task_rq = select_task_rq_dl,
@@ -2415,7 +2411,6 @@ const struct sched_class dl_sched_class = {
.task_woken = task_woken_dl,
 #endif
 
-   .set_curr_task  = set_curr_task_dl,
.task_tick  = task_tick_dl,
.task_fork  = task_fork_dl,
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ccab35ccf21..ebad19a033eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10359,9 +10359,19 @@ static void switched_to_fair(struct rq *rq, struct 
task_struct *p)
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
  */
-static void set_curr_task_fair(struct rq *rq)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p)
 {
-   struct sched_entity *se = &rq->curr->se;
+   struct sched_entity *se = &p->se;
+
+#ifdef CONFIG_SMP
+   if (task_on_rq_queued(p)) {
+   /*
+* Move the next running task to the front of the list, so our
+* cfs_tasks list becomes MRU one.
+*/
+   list_move(&se->group_node, &rq->cfs_tasks);
+   }
+#endif
 
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -10632,7 +10642,9 @@ const struct sched_class fair_sched_class = {
.check_preempt_curr = check_preempt_wakeup,
 
.pick_next_task = pick_next_task_fair,
+
.put_prev_task  = put_prev_task_fair,
+   .set_next_task  = set_next_task_fair,
 
 #ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
@@ -10645,7 +10657,6 @@ const struct sched_class fair_sched_class = {
.set_cpus_allowed   = 

[RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective which logical
CPU does the reschedule).

NOTE: there is still potential for siblings rivalry.
NOTE: this is far too complicated; but thus far I've failed to
  simplify it further.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c  | 222 ++-
 kernel/sched/sched.h |   5 +-
 2 files changed, 224 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e5bdc1c4d8d7..9e6e90c6f9b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3574,7 +3574,7 @@ static inline void schedule_debug(struct task_struct 
*prev)
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
const struct sched_class *class;
struct task_struct *p;
@@ -3619,6 +3619,220 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+   if (is_idle_task(a) || is_idle_task(b))
+   return true;
+
+   return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max)
+{
+   struct task_struct *class_pick, *cookie_pick;
+   unsigned long cookie = 0UL;
+
+   /*
+* We must not rely on rq->core->core_cookie here, because we fail to 
reset
+* rq->core->core_cookie on new picks, such that we can detect if we 
need
+* to do single vs multi rq task selection.
+*/
+
+   if (max && max->core_cookie) {
+   WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
+   cookie = max->core_cookie;
+   }
+
+   class_pick = class->pick_task(rq);
+   if (!cookie)
+   return class_pick;
+
+   cookie_pick = sched_core_find(rq, cookie);
+   if (!class_pick)
+   return cookie_pick;
+
+   /*
+* If class > max && class > cookie, it is the highest priority task on
+* the core (so far) and it must be selected, otherwise we must go with
+* the cookie pick in order to satisfy the constraint.
+*/
+   if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, 
class_pick))
+   return class_pick;
+
+   return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+   struct task_struct *next, *max = NULL;
+   const struct sched_class *class;
+   const struct cpumask *smt_mask;
+   int i, j, cpu;
+
+   if (!sched_core_enabled(rq))
+   return __pick_next_task(rq, prev, rf);
+
+   /*
+* If there were no {en,de}queues since we picked (IOW, the task
+* pointers are all still valid), and we haven't scheduled the last
+* pick yet, do so now.
+*/
+   if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+   rq->core->core_pick_seq != rq->core_sched_seq) {
+   WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+   next = rq->core_pick;
+   if (next != prev) {
+   put_prev_task(rq, prev);
+   set_next_task(rq, next);
+   }
+   return next;
+   }
+
+   prev->sched_class->put_prev_task(rq, prev, rf);
+   if (!rq->nr_running)
+   newidle_balance(rq, rf);
+
+   cpu = cpu_of(rq);
+   smt_mask = cpu_smt_mask(cpu);
+
+   /*
+* core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+*
+* @task_seq guards the task state ({en,de}queues)
+* @pick_seq is the @task_seq we did a selection on
+* @sched_seq is the @pick_seq we scheduled
+*
+* However, preemptions can cause multiple picks on the same task set.
+* 'Fix' this by also increasing @task_seq for every pick.
+*/
+   rq->core->core_task_seq++;
+
+   /* reset state */
+   for_each_cpu(i, smt_mask) {
+   struct rq *rq_i = cpu_rq(i);
+
+   rq_i->core_pick = NULL;
+
+   if (i != cpu)
+   update_rq_clock(rq_i);
+   }
+
+   /*
+* Try and select tasks for each sibling in decending sched_class
+* order.
+*/
+   for_each_class(class) {
+again:
+   for_each_cpu_wrap(i, smt_mask, cpu) {
+   struct rq *rq_i = cpu_rq(i);
+   struct task_struct *p;
+
+   if (rq_i->core_pick)
+ 

[RFC PATCH v2 12/17] sched: A quick and dirty cgroup tagging interface

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Marks all tasks in a cgroup as matching for core-scheduling.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c  | 62 
 kernel/sched/sched.h |  4 +++
 2 files changed, 66 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5066a1493acf..e5bdc1c4d8d7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6658,6 +6658,15 @@ static void sched_change_group(struct task_struct *tsk, 
int type)
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
  struct task_group, css);
tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+   if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
+   tsk->core_cookie = 0UL;
+
+   if (tg->tagged /* && !tsk->core_cookie ? */)
+   tsk->core_cookie = (unsigned long)tg;
+#endif
+
tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -7117,6 +7126,43 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_CORE
+static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct 
cftype *cft)
+{
+   struct task_group *tg = css_tg(css);
+
+   return !!tg->tagged;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct 
cftype *cft, u64 val)
+{
+   struct task_group *tg = css_tg(css);
+   struct css_task_iter it;
+   struct task_struct *p;
+
+   if (val > 1)
+   return -ERANGE;
+
+   if (tg->tagged == !!val)
+   return 0;
+
+   tg->tagged = !!val;
+
+   if (!!val)
+   sched_core_get();
+
+   css_task_iter_start(css, 0, &it);
+   while ((p = css_task_iter_next(&it)))
+   p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+   css_task_iter_end(&it);
+
+   if (!val)
+   sched_core_put();
+
+   return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -7152,6 +7198,14 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_SCHED_CORE
+   {
+   .name = "tag",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_tag_read_u64,
+   .write_u64 = cpu_core_tag_write_u64,
+   },
 #endif
{ } /* Terminate */
 };
@@ -7319,6 +7373,14 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_SCHED_CORE
+   {
+   .name = "tag",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_tag_read_u64,
+   .write_u64 = cpu_core_tag_write_u64,
+   },
 #endif
{ } /* terminate */
 };
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 42dd620797d7..16fb236eab7b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -363,6 +363,10 @@ struct cfs_bandwidth {
 struct task_group {
struct cgroup_subsys_state css;
 
+#ifdef CONFIG_SCHED_CORE
+   int tagged;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
struct sched_entity **se;
-- 
2.17.1



[RFC PATCH v2 09/17] sched: Introduce sched_class::pick_task()

2019-04-23 Thread Vineeth Remanan Pillai
From: "Peter Zijlstra (Intel)" 

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance) it is not state invariant. This makes it unsuitable
for remote task selection.

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---

Changes in v2
-
- Fixes a NULL pointer dereference crash
  - Subhra Mazumdar
  - Tim Chen

---
 kernel/sched/deadline.c  | 21 -
 kernel/sched/fair.c  | 39 ---
 kernel/sched/idle.c  | 10 +-
 kernel/sched/rt.c| 21 -
 kernel/sched/sched.h |  2 ++
 kernel/sched/stop_task.c | 21 -
 6 files changed, 95 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 249310e68592..010234908cc0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1723,15 +1723,12 @@ static struct sched_dl_entity 
*pick_next_dl_entity(struct rq *rq,
return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *
-pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
struct sched_dl_entity *dl_se;
struct task_struct *p;
struct dl_rq *dl_rq;
 
-   WARN_ON_ONCE(prev || rf);
-
dl_rq = &rq->dl;
 
if (unlikely(!dl_rq->dl_nr_running))
@@ -1742,7 +1739,19 @@ pick_next_task_dl(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
 
p = dl_task_of(dl_se);
 
-   set_next_task_dl(rq, p);
+   return p;
+}
+
+static struct task_struct *
+pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+   struct task_struct *p;
+
+   WARN_ON_ONCE(prev || rf);
+
+   p = pick_task_dl(rq);
+   if (p)
+   set_next_task_dl(rq, p);
 
return p;
 }
@@ -2389,6 +2398,8 @@ const struct sched_class dl_sched_class = {
.set_next_task  = set_next_task_dl,
 
 #ifdef CONFIG_SMP
+   .pick_task  = pick_task_dl,
+
.select_task_rq = select_task_rq_dl,
.migrate_task_rq= migrate_task_rq_dl,
.set_cpus_allowed   = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c055bad249a9..45d86b862750 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4132,7 +4132,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *curr)
 * Avoid running the skip buddy, if running something else can
 * be done without getting too unfair.
 */
-   if (cfs_rq->skip == se) {
+   if (cfs_rq->skip && cfs_rq->skip == se) {
struct sched_entity *second;
 
if (se == curr) {
@@ -4150,13 +4150,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *curr)
/*
 * Prefer last buddy, try to return the CPU to a preempted task.
 */
-   if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+   if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 
1)
se = cfs_rq->last;
 
/*
 * Someone really wants this to run. If it's not unfair, run it.
 */
-   if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+   if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 
1)
se = cfs_rq->next;
 
clear_buddies(cfs_rq, se);
@@ -6937,6 +6937,37 @@ static void check_preempt_wakeup(struct rq *rq, struct 
task_struct *p, int wake_
set_last_buddy(se);
 }
 
+static struct task_struct *
+pick_task_fair(struct rq *rq)
+{
+   struct cfs_rq *cfs_rq = &rq->cfs;
+   struct sched_entity *se;
+
+   if (!cfs_rq->nr_running)
+   return NULL;
+
+   do {
+   struct sched_entity *curr = cfs_rq->curr;
+
+   se = pick_next_entity(cfs_rq, NULL);
+
+   if (!(se || curr))
+   return NULL;
+
+   if (curr) {
+   if (se && curr->on_rq)
+   update_curr(cfs_rq);
+
+   if (!se || entity_before(curr, se))
+   se = curr;
+   }
+
+   cfs_rq = group_cfs_rq(se);
+   } while (cfs_rq);
+
+   return task_of(se);
+}
+
 static struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags 
*rf)
 {
@@ -10648,6 +10679,8 @@ const struct sched_class fair_sched_class = {
.set_next_task  = set_next_task_fair,
 
 #ifdef CONFIG_SMP
+   .pick_task  = pick_task_fair,
+
.select_task_rq   

[RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled; core scheduling will only allow matching
task to be on the core; where idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled)
these tasks are indexed in a second RB-tree, first on cookie value
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---

Changes in v2
-
- Improves the priority comparison logic between processes in
  different cpus.
  - Peter Zijlstra
  - Aaron Lu

---
 include/linux/sched.h |   8 ++-
 kernel/sched/core.c   | 164 ++
 kernel/sched/sched.h  |   4 ++
 3 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1538..a4b39a28236f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -636,10 +636,16 @@ struct task_struct {
const struct sched_class*sched_class;
struct sched_entity se;
struct sched_rt_entity  rt;
+   struct sched_dl_entity  dl;
+
+#ifdef CONFIG_SCHED_CORE
+   struct rb_node  core_node;
+   unsigned long   core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
struct task_group   *sched_task_group;
 #endif
-   struct sched_dl_entity  dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f559d706b8e..5066a1493acf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -64,6 +64,159 @@ int sysctl_sched_rt_runtime = 950000;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+   if (p->sched_class == &stop_sched_class) /* trumps deadline */
+   return -2;
+
+   if (rt_prio(p->prio)) /* includes deadline */
+   return p->prio; /* [-1, 99] */
+
+   if (p->sched_class == &idle_sched_class)
+   return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+   return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+// FIXME: This is copied from fair.c. Needs only single copy.
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+   return p->se.cfs_rq;
+}
+#else
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+   return &task_rq(p)->cfs;
+}
+#endif
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool __prio_less(struct task_struct *a, struct task_struct *b, 
bool core_cmp)
+{
+   u64 vruntime;
+
+   int pa = __task_prio(a), pb = __task_prio(b);
+
+   if (-pa < -pb)
+   return true;
+
+   if (-pb < -pa)
+   return false;
+
+   if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+   return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+   vruntime = b->se.vruntime;
+   if (core_cmp) {
+   vruntime -= task_cfs_rq(b)->min_vruntime;
+   vruntime += task_cfs_rq(a)->min_vruntime;
+   }
+   if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+   return !((s64)(a->se.vruntime - vruntime) <= 0);
+
+   return false;
+}
+
+static inline bool cpu_prio_less(struct task_struct *a, struct task_struct *b)
+{
+   return __prio_less(a, b, false);
+}
+
+static inline bool core_prio_less(struct task_struct *a, struct task_struct *b)
+{
+   return __prio_less(a, b, true);
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct 
*b)
+{
+   if (a->core_cookie < b->core_cookie)
+   return true;
+
+   if (a->core_cookie > b->core_cookie)
+   return false;
+
+   /* flip prio, so high prio is leftmost */
+   if (cpu_prio_less(b, a))
+   return true;
+
+   return false;
+}
+
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+   struct rb_node *parent, **node;
+   struct task_struct *node_task;
+
+   rq->core->core_task_seq++;
+
+   if (!p->core_cookie)
+   return;
+
+   node = &rq->core_tree.rb_node;
+   parent = *node;
+
+   while (*node) {
+   node_task = container_of(*node, struct task_struct, core_node);
+   parent = *node;
+
+   i

[RFC PATCH v2 10/17] sched: Core-wide rq->lock

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Introduce the basic infrastructure to have a core wide rq->lock.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/Kconfig.preempt |  7 +++-
 kernel/sched/core.c| 91 ++
 kernel/sched/sched.h   | 31 ++
 3 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 0fee5fe6c899..02fe0bf26676 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -57,4 +57,9 @@ config PREEMPT
 endchoice
 
 config PREEMPT_COUNT
-   bool
+   bool
+
+config SCHED_CORE
+   bool
+   default y
+   depends on SCHED_SMT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b883c70674ba..2f559d706b8e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -60,6 +60,70 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ * spin_lock(rq_lockp(rq));
+ * ...
+ * spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+   bool enabled = !!(unsigned long)data;
+   int cpu;
+
+   for_each_possible_cpu(cpu)
+   cpu_rq(cpu)->core_enabled = enabled;
+
+   return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+   // XXX verify there are no cookie tasks (yet)
+
+   static_branch_enable(&__sched_core_enabled);
+   stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+   // XXX verify there are no cookie tasks (left)
+
+   stop_machine(__sched_core_stopper, (void *)false, NULL);
+   static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+   mutex_lock(&sched_core_mutex);
+   if (!sched_core_count++)
+   __sched_core_enable();
+   mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+   mutex_lock(&sched_core_mutex);
+   if (!--sched_core_count)
+   __sched_core_disable();
+   mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
@@ -5865,6 +5929,28 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+#ifdef CONFIG_SCHED_CORE
+   const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+   struct rq *rq, *core_rq = NULL;
+   int i;
+
+   for_each_cpu(i, smt_mask) {
+   rq = cpu_rq(i);
+   if (rq->core && rq->core == rq)
+   core_rq = rq;
+   }
+
+   if (!core_rq)
+   core_rq = cpu_rq(cpu);
+
+   for_each_cpu(i, smt_mask) {
+   rq = cpu_rq(i);
+
+   WARN_ON_ONCE(rq->core && rq->core != core_rq);
+   rq->core = core_rq;
+   }
+#endif /* CONFIG_SCHED_CORE */
+
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -6091,6 +6177,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+   rq->core = NULL;
+   rq->core_enabled = 0;
+#endif
}
 
set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a024dd80eeb3..eb38063221d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -952,6 +952,12 @@ struct rq {
/* Must be inspected within a rcu lock section */
struct cpuidle_state*idle_state;
 #endif
+
+#ifdef CONFIG_SCHED_CORE
+   /* per rq */
+   struct rq   *core;
+   unsigned intcore_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -979,11 +985,36 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+   return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+   if (sched_core_enabled(rq))
+   return &rq->core->__lock;
+
+   return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+   return false;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
return &rq->__lock;
 }
 
+#endif /* CONFIG_SCHED_CORE */
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 
-- 
2.17.1



[RFC PATCH v2 16/17] sched: Wake up sibling if it has something to run

2019-04-23 Thread Vineeth Remanan Pillai
During core scheduling, it can happen that the current rq selects a
non-tagged process while the sibling is idling even though it has
something to run (because the sibling selected idle to match the
tagged process in the previous tag-matching iteration). We need to
wake up the sibling if such a situation arises.
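
Stripped of the diff context, the added check amounts to the following
(a sketch mirroring the hunk below; cpu, j and smt_mask come from the
surrounding pick_next_task() code):

	for_each_cpu(j, smt_mask) {
		struct rq *rq_j = cpu_rq(j);

		rq_j->core_pick = NULL;
		if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running)
			resched_curr(rq_j);
	}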

Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---
 kernel/sched/core.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8f5ec641d0a..0e3c51a1b54a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3775,6 +3775,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 */
if (i == cpu && !rq->core->core_cookie && 
!p->core_cookie) {
next = p;
+   rq->core_pick = NULL;
+ 
+   /*
+* If the sibling is idling, we might want to 
wake it
+* so that it can check for any runnable tasks 
that did
+* not get a chance to run due to previous task 
matching.
+*/
+   for_each_cpu(j, smt_mask) {
+   struct rq *rq_j = cpu_rq(j);
+   rq_j->core_pick = NULL;
+   if (j != cpu &&
+   is_idle_task(rq_j->curr) && 
rq_j->nr_running) {
+   resched_curr(rq_j);
+   }
+   }
goto done;
}
 
-- 
2.17.1



[RFC PATCH v2 06/17] sched/fair: Export newidle_balance()

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

For pick_next_task_fair() it is the newidle balance that requires
dropping the rq->lock; provided we do put_prev_task() early, we can
also detect the condition for doing newidle early.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/fair.c  | 18 --
 kernel/sched/sched.h |  4 
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebad19a033eb..f7e631e692a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3611,8 +3611,6 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq 
*cfs_rq)
return cfs_rq->avg.load_avg;
 }
 
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
-
 static inline unsigned long task_util(struct task_struct *p)
 {
return READ_ONCE(p->se.avg.util_avg);
@@ -7058,11 +7056,10 @@ done: __maybe_unused;
return p;
 
 idle:
-   update_misfit_status(NULL, rq);
-   new_tasks = idle_balance(rq, rf);
+   new_tasks = newidle_balance(rq, rf);
 
/*
-* Because idle_balance() releases (and re-acquires) rq->lock, it is
+* Because newidle_balance() releases (and re-acquires) rq->lock, it is
 * possible for any higher priority task to appear. In that case we
 * must re-start the pick_next_entity() loop.
 */
@@ -9257,10 +9254,10 @@ static int load_balance(int this_cpu, struct rq 
*this_rq,
ld_moved = 0;
 
/*
-* idle_balance() disregards balance intervals, so we could repeatedly
-* reach this code, which would lead to balance_interval skyrocketting
-* in a short amount of time. Skip the balance_interval increase logic
-* to avoid that.
+* newidle_balance() disregards balance intervals, so we could
+* repeatedly reach this code, which would lead to balance_interval
+* skyrocketting in a short amount of time. Skip the balance_interval
+* increase logic to avoid that.
 */
if (env.idle == CPU_NEWLY_IDLE)
goto out;
@@ -9967,7 +9964,7 @@ static inline void nohz_newidle_balance(struct rq 
*this_rq) { }
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
+int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
unsigned long next_balance = jiffies + HZ;
int this_cpu = this_rq->cpu;
@@ -9975,6 +9972,7 @@ static int idle_balance(struct rq *this_rq, struct 
rq_flags *rf)
int pulled_task = 0;
u64 curr_cost = 0;
 
+   update_misfit_status(NULL, this_rq);
/*
 * We must set idle_stamp _before_ calling idle_balance(), such that we
 * measure the duration of idle_balance() as idle time.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fb01c77c16ff..bfcbcbb25646 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1414,10 +1414,14 @@ static inline void unregister_sched_domain_sysctl(void)
 }
 #endif
 
+extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
+
 #else
 
 static inline void sched_ttwu_pending(void) { }
 
+static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { 
return 0; }
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
-- 
2.17.1



[RFC PATCH v2 15/17] sched: Trivial forced-newidle balancer

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

When a sibling is forced-idle to match the core-cookie; search for
matching tasks to fill the core.
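
Condensed, the control flow is roughly the following (a sketch only; the
real code below also handles dropping the rq lock, RCU and the
SD_LOAD_BALANCE check):

	/* invoked when this rq was left forced-idle by core selection */
	for_each_domain(cpu, sd) {
		if (need_resched())
			break;
		if (steal_cookie_task(cpu, sd))	/* pull one matching-cookie task */
			break;
	}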

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 131 +-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4b39a28236f..1a309e8546cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -641,6 +641,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned intcore_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e6e90c6f9b9..e8f5ec641d0a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -217,6 +217,21 @@ struct task_struct *sched_core_find(struct rq *rq, 
unsigned long cookie)
return match;
 }
 
+struct task_struct *sched_core_next(struct task_struct *p, unsigned long 
cookie)
+{
+   struct rb_node *node = &p->core_node;
+
+   node = rb_next(node);
+   if (!node)
+   return NULL;
+
+   p = container_of(node, struct task_struct, core_node);
+   if (p->core_cookie != cookie)
+   return NULL;
+
+   return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -3672,7 +3687,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
-   int i, j, cpu;
+   int i, j, cpu, occ = 0;
 
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);
@@ -3763,6 +3778,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
goto done;
}
 
+   if (!is_idle_task(p))
+   occ++;
+
rq_i->core_pick = p;
 
/*
@@ -3786,6 +3804,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
cpu_rq(j)->core_pick = NULL;
}
+   occ = 1;
goto again;
}
}
@@ -3808,6 +3827,8 @@ next_class:;
 
WARN_ON_ONCE(!rq_i->core_pick);
 
+   rq_i->core_pick->core_occupation = occ;
+
if (i == cpu)
continue;
 
@@ -3823,6 +3844,114 @@ next_class:;
return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+   struct task_struct *p;
+   unsigned long cookie;
+   bool success = false;
+
+   local_irq_disable();
+   double_rq_lock(dst, src);
+
+   cookie = dst->core->core_cookie;
+   if (!cookie)
+   goto unlock;
+
+   if (dst->curr != dst->idle)
+   goto unlock;
+
+   p = sched_core_find(src, cookie);
+   if (p == src->idle)
+   goto unlock;
+
+   do {
+   if (p == src->core_pick || p == src->curr)
+   goto next;
+
+   if (!cpumask_test_cpu(this, &p->cpus_allowed))
+   goto next;
+
+   if (p->core_occupation > dst->idle->core_occupation)
+   goto next;
+
+   p->on_rq = TASK_ON_RQ_MIGRATING;
+   deactivate_task(src, p, 0);
+   set_task_cpu(p, this);
+   activate_task(dst, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
+
+   resched_curr(dst);
+
+   success = true;
+   break;
+
+next:
+   p = sched_core_next(p, cookie);
+   } while (p);
+
+unlock:
+   double_rq_unlock(dst, src);
+   local_irq_enable();
+
+   return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+   int i;
+
+   for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+   if (i == cpu)
+   continue;
+
+   if (need_resched())
+   break;
+
+   if (try_steal_cookie(cpu, i))
+   return true;
+   }
+
+   return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+   struct sched_domain *sd;
+   int cpu = cpu_of(rq);
+
+   rcu_read_lock();
+   raw_spin_unlock_irq(rq_lockp(rq));
+   for_each_domain(cpu, sd) {
+   if (!(sd->flags & SD_LOAD_BALANCE))
+   break;
+
+   if (need_resched())
+   

[RFC PATCH v2 14/17] sched/fair: Add a few assertions

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/fair.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 45d86b862750..08812fe7e1d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6209,6 +6209,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
struct sched_domain *sd;
int i, recent_used_cpu;
 
+   /*
+* per-cpu select_idle_mask usage
+*/
+   lockdep_assert_irqs_disabled();
+
if (available_idle_cpu(target))
return target;
 
@@ -6636,8 +6641,6 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int 
wake_flags)
@@ -6648,6 +6651,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+   /*
+* required for stable ->cpus_allowed
+*/
+   lockdep_assert_held(&p->pi_lock);
+
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
 
-- 
2.17.1



[RFC PATCH v2 17/17] sched: Debug bits...

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 38 +-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0e3c51a1b54a..e8e5f26db052 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -106,6 +106,10 @@ static inline bool __prio_less(struct task_struct *a, 
struct task_struct *b, boo
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -264,6 +268,8 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
@@ -272,6 +278,8 @@ static void __sched_core_disable(void)
 
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -3706,6 +3714,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
put_prev_task(rq, prev);
set_next_task(rq, next);
}
+
+   trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie);
+
return next;
}
 
@@ -3777,6 +3793,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
next = p;
rq->core_pick = NULL;
  
+   trace_printk("unconstrained pick: %s/%d %lx\n",
+next->comm, next->pid, 
next->core_cookie);
+
/*
 * If the sibling is idling, we might want to 
wake it
 * so that it can check for any runnable tasks 
that did
@@ -3787,6 +3806,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq_j->core_pick = NULL;
if (j != cpu &&
is_idle_task(rq_j->curr) && 
rq_j->nr_running) {
+   trace_printk("IPI(%d->%d[%d]) 
idle preempt\n",
+cpu, j, 
rq_j->nr_running);
resched_curr(rq_j);
}
}
@@ -3798,6 +3819,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
rq_i->core_pick = p;
 
+   trace_printk("cpu(%d): selected: %s/%d %lx\n",
+i, p->comm, p->pid, p->core_cookie);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -3812,6 +3836,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %lx\n", max->comm, 
max->pid, max->core_cookie);
+
if (old_max && !cookie_match(old_max, p)) {
for_each_cpu(j, smt_mask) {
if (j == i)
@@ -3847,13 +3873,17 @@ next_class:;
if (i == cpu)
continue;
 
-   if (rq_i->curr != rq_i->core_pick)
+   if (rq_i->curr != rq_i->core_pick) {
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
+   }
}
 
rq->core_sched_seq = rq->core->core_pick_seq;
next = rq->core_pick;
 
+   trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, 
next->core_cookie);
+
 done:
set_next_task(rq, next);
return next;
@@ -3890,6 +3920,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+p->comm, p->pid, that, this,
+  

[RFC PATCH v2 07/17] sched: Allow put_prev_task() to drop rq->lock

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Currently the pick_next_task() loop is convoluted and ugly because of
how it can drop the rq->lock and needs to restart the picking.

For the RT/Deadline classes, it is put_prev_task() where we do
balancing, and we could do this before the picking loop. Make this
possible.
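
The pattern this enables, instantiated twice below (rt and dl), is roughly:

	/* in the class's put_prev_task(), before the pick loop starts */
	if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
		rq_unpin_lock(rq, rf);	/* pull may drop and retake rq->lock */
		pull_rt_task(rq);
		rq_repin_lock(rq, rf);
	}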

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c  |  2 +-
 kernel/sched/deadline.c  | 14 +-
 kernel/sched/fair.c  |  2 +-
 kernel/sched/idle.c  |  2 +-
 kernel/sched/rt.c| 14 +-
 kernel/sched/sched.h |  4 ++--
 kernel/sched/stop_task.c |  2 +-
 7 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 32ea79fb8d29..9dfa0c53deb3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5595,7 +5595,7 @@ static void calc_load_migrate(struct rq *rq)
atomic_long_add(delta, _load_tasks);
 }
 
-static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fake(struct rq *rq, struct task_struct *prev, struct 
rq_flags *rf)
 {
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fadfbfe7d573..56791c0318a2 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1773,13 +1773,25 @@ pick_next_task_dl(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
return p;
 }
 
-static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct 
rq_flags *rf)
 {
update_curr_dl(rq);
 
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
+
+   if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
+   /*
+* This is OK, because current is on_cpu, which avoids it being
+* picked for load-balance and preemption/IRQs are still
+* disabled avoiding further scheduler activity on it and we've
+* not yet started the picking loop.
+*/
+   rq_unpin_lock(rq, rf);
+   pull_dl_task(rq);
+   rq_repin_lock(rq, rf);
+   }
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f7e631e692a3..41ec5e68e1c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7081,7 +7081,7 @@ done: __maybe_unused;
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct 
rq_flags *rf)
 {
struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index dd64be34881d..1b65a4c3683e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,7 +373,7 @@ static void check_preempt_curr_idle(struct rq *rq, struct 
task_struct *p, int fl
resched_curr(rq);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct 
rq_flags *rf)
 {
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index adec98a94f2b..51ee87c5a28a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1593,7 +1593,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
return p;
 }
 
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct 
rq_flags *rf)
 {
update_curr_rt(rq);
 
@@ -1605,6 +1605,18 @@ static void put_prev_task_rt(struct rq *rq, struct 
task_struct *p)
 */
if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
enqueue_pushable_task(rq, p);
+
+   if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+   /*
+* This is OK, because current is on_cpu, which avoids it being
+* picked for load-balance and preemption/IRQs are still
+* disabled avoiding further scheduler activity on it and we've
+* not yet started the picking loop.
+*/
+   rq_unpin_lock(rq, rf);
+   pull_rt_task(rq);
+   rq_repin_lock(rq, rf);
+   }
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bfcbcbb25646..4cbe2bef92e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1675,7 +1675,7 @@ struct sched_class {
struct task_struct * (*pick_next_task)(struct rq *rq,
   struct task_struct *prev,
   struct rq_flags *rf);
-   void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+   void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct 
rq_flags *rf);
void (*set_next_task)(struct 

[RFC PATCH v2 00/17] Core scheduling v2

2019-04-23 Thread Vineeth Remanan Pillai
Second iteration of the core-scheduling feature.

This version fixes apparent bugs and performance issues in v1. This
doesn't fully address the issue of core sharing between processes
with different tags. Core sharing still happens 1% to 5% of the time
based on the nature of workload and timing of the runnable processes.

Changes in v2
-
- rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
- Fixes for couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for process in different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Issues
--
- Processes with different tags can still share the core
- A crash when disabling cpus with core-scheduling on
   - https://paste.debian.net/plainh/fa6bcfa8

---

Peter Zijlstra (16):
  stop_machine: Fix stop_cpus_in_progress ordering
  sched: Fix kerneldoc comment for ia64_set_curr_task
  sched: Wrap rq::lock access
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched/fair: Export newidle_balance()
  sched: Allow put_prev_task() to drop rq->lock
  sched: Rework pick_next_task() slow-path
  sched: Introduce sched_class::pick_task()
  sched: Core-wide rq->lock
  sched: Basic tracking of matching tasks
  sched: A quick and dirty cgroup tagging interface
  sched: Add core wide task selection and scheduling.
  sched/fair: Add a few assertions
  sched: Trivial forced-newidle balancer
  sched: Debug bits...

Vineeth Remanan Pillai (1):
  sched: Wake up sibling if it has something to run

 include/linux/sched.h|   9 +-
 kernel/Kconfig.preempt   |   7 +-
 kernel/sched/core.c  | 800 +--
 kernel/sched/cpuacct.c   |  12 +-
 kernel/sched/deadline.c  |  99 +++--
 kernel/sched/debug.c |   4 +-
 kernel/sched/fair.c  | 137 +--
 kernel/sched/idle.c  |  42 +-
 kernel/sched/pelt.h  |   2 +-
 kernel/sched/rt.c|  96 +++--
 kernel/sched/sched.h | 185 ++---
 kernel/sched/stop_task.c |  35 +-
 kernel/sched/topology.c  |   4 +-
 kernel/stop_machine.c|   2 +
 14 files changed, 1145 insertions(+), 289 deletions(-)

-- 
2.17.1



[RFC PATCH v2 03/17] sched: Wrap rq::lock access

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

In preparation of playing games with rq->lock, abstract the thing
using an accessor.
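
The conversion itself is mechanical; a representative before/after
(sketch of the hunks that follow):

	/* before: callers name the per-rq lock directly */
	raw_spin_lock(&rq->lock);
	lockdep_assert_held(&rq->lock);

	/* after: callers go through the accessor, which can be redirected */
	raw_spin_lock(rq_lockp(rq));
	lockdep_assert_held(rq_lockp(rq));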

Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
---

Changes in v2
-
- Fixes a deadlock due in double_rq_lock and double_lock_lock
  - Vineeth Pillai
  - Julien Desfossez
- Fixes 32bit build.
  - Aubrey Li

---
 kernel/sched/core.c |  46 -
 kernel/sched/cpuacct.c  |  12 ++---
 kernel/sched/deadline.c |  18 +++
 kernel/sched/debug.c|   4 +-
 kernel/sched/fair.c |  40 +++
 kernel/sched/idle.c |   4 +-
 kernel/sched/pelt.h |   2 +-
 kernel/sched/rt.c   |   8 +--
 kernel/sched/sched.h| 106 
 kernel/sched/topology.c |   4 +-
 10 files changed, 123 insertions(+), 121 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 416ea613eda8..6f4861ae85dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,12 +72,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
 
for (;;) {
rq = task_rq(p);
-   raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
rq_pin_lock(rq, rf);
return rq;
}
-   raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
 
while (unlikely(task_on_rq_migrating(p)))
cpu_relax();
@@ -96,7 +96,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
for (;;) {
raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
rq = task_rq(p);
-   raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
/*
 *  move_queued_task()  task_rq_lock()
 *
@@ -118,7 +118,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
rq_pin_lock(rq, rf);
return rq;
}
-   raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
while (unlikely(task_on_rq_migrating(p)))
@@ -188,7 +188,7 @@ void update_rq_clock(struct rq *rq)
 {
s64 delta;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
if (rq->clock_update_flags & RQCF_ACT_SKIP)
return;
@@ -497,7 +497,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
if (test_tsk_need_resched(curr))
return;
@@ -521,10 +521,10 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
 
-   raw_spin_lock_irqsave(&rq->lock, flags);
+   raw_spin_lock_irqsave(rq_lockp(rq), flags);
if (cpu_online(cpu) || cpu == smp_processor_id())
resched_curr(rq);
-   raw_spin_unlock_irqrestore(&rq->lock, flags);
+   raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -956,7 +956,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, 
int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
   struct task_struct *p, int new_cpu)
 {
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
dequeue_task(rq, p, DEQUEUE_NOCLOCK);
@@ -1070,7 +1070,7 @@ void do_set_cpus_allowed(struct task_struct *p, const 
struct cpumask *new_mask)
 * Because __kthread_bind() calls this on blocked tasks without
 * holding rq->lock.
 */
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
}
if (running)
@@ -1203,7 +1203,7 @@ void set_task_cpu(struct task_struct *p, unsigned int 
new_cpu)
 * task_rq_lock().
 */
WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
- lockdep_is_held(&task_rq(p)->lock)));
+ lockdep_is_held(rq_lockp(task_rq(p)))));
 #endif
/*
 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -1732,7 +1732,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, 
int wake_flags,
 {
int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
 #ifdef CONFIG_SMP
if (p->sched_contribu

[RFC PATCH v2 02/17] sched: Fix kerneldoc comment for ia64_set_curr_task

2019-04-23 Thread Vineeth Remanan Pillai
From: Peter Zijlstra (Intel) 

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4778c48a7fda..416ea613eda8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6287,7 +6287,7 @@ struct task_struct *curr_task(int cpu)
 
 #ifdef CONFIG_IA64
 /**
- * set_curr_task - set the current task for a given CPU.
+ * ia64_set_curr_task - set the current task for a given CPU.
  * @cpu: the processor in question.
  * @p: the task pointer to set.
  *
-- 
2.17.1



Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.

2019-04-10 Thread Vineeth Remanan Pillai
From: Vineeth Pillai 

> Well, I was promised someone else was going to carry all this, also

We are interested in this feature and have been actively testing, benchmarking
and working on fixes. If there is no v2 effort currently in progress, we are
willing to help consolidate all the changes discussed here and prepare a v2.
If there are any pending changes in pipeline, please post your ideas so that
we could include it in v2.

We hope to post the v2 with all the changes here in a week’s time rebased on
the latest tip.



[PATCH v4 2/2] mm: rid swapoff of quadratic complexity

2019-01-14 Thread Vineeth Remanan Pillai
This patch was initially posted by Kelley(kelley...@gmail.com).
Reposting the patch with all review comments addressed and with minor
modifications and optimizations. Also, folding in the fixes offered by
Hugh Dickins and Huang Ying. Tests were rerun and commit message
updated with new results.

The function try_to_unuse() is of quadratic complexity, with a lot of
wasted effort. It unuses swap entries one by one, potentially iterating
over all the page tables for all the processes in the system for each
one.

This new proposed implementation of try_to_unuse simplifies its
complexity to linear. It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables. It also
makes similar changes to shmem_unuse.

Improvement

swapoff was called on a swap partition containing about 6G of data, in a
VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted.

Present implementation....about 1200M calls(8min, avg 80% cpu util).
Prototype.................about  9.0K calls(3min, avg 5% cpu util).

Details

In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of each
swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.

In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass. Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse(). After the walk, check the type for orphaned swap entries
with find_next_to_unuse(), and remove them from the swap cache. If
find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.

Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().

Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of pages
has been swapped back in.
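
For orientation, the resulting control flow of try_to_unuse() is roughly
the following. This is a heavily condensed sketch, not the patch itself:
locking, mm refcounting, the frontswap page limit and error handling are
omitted, and the helper signatures are only approximate.

	static int try_to_unuse_sketch(unsigned int type, bool frontswap,
				       unsigned long pages_to_unuse)
	{
		struct swap_info_struct *si = swap_info[type];
		struct mm_struct *mm;
		unsigned int i = 0;
		int retries = 0;

		do {
			/* 1) shared memory: one pass over shmem_swaplist */
			shmem_unuse(type, frontswap, &pages_to_unuse);

			/* 2) one pass over every mm, unusing entries as the
			 *    page tables are walked
			 */
			list_for_each_entry(mm, &init_mm.mmlist, mmlist)
				unuse_mm(mm, type, frontswap, &pages_to_unuse);

			/* 3) drop orphaned entries still in the swap cache */
			while ((i = find_next_to_unuse(si, i, frontswap)) != 0)
				;	/* look up in swap cache and delete */
		} while (READ_ONCE(si->inuse_pages) && ++retries < 3);

		return 0;
	}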

Change in v4:
 - Folded fixes from Hugh and Huang
 - Fixed the case of full frontswap unuse.
 - Handle frontswap case for the shmem.
Change in v3
 - Addressed review comments
 - Refactored out swap-in logic from shmem_getpage_fp
Changes in v2
 - Updated patch to use Xarray instead of radix tree

Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Kelley Nielsen 
Signed-off-by: Huang Ying 
Signed-off-by: Hugh Dickins 
CC: Rik van Riel 
---
 include/linux/frontswap.h |   7 +
 include/linux/shmem_fs.h  |   3 +-
 mm/shmem.c| 267 ---
 mm/swapfile.c | 433 ++
 4 files changed, 319 insertions(+), 391 deletions(-)

diff --git a/include/linux/frontswap.h b/include/linux/frontswap.h
index 011965c08b93..6d775984905b 100644
--- a/include/linux/frontswap.h
+++ b/include/linux/frontswap.h
@@ -7,6 +7,13 @@
 #include 
 #include 
 
+/*
+ * Return code to denote that requested number of
+ * frontswap pages are unused (moved to page cache).
+ * Used in shmem_unuse and try_to_unuse.
+ */
+#define FRONTSWAP_PAGES_UNUSED 2
+
 struct frontswap_ops {
void (*init)(unsigned); /* this swap type was just swapon'ed */
int (*store)(unsigned, pgoff_t, struct page *); /* store a page */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index f155dc607112..f3fb1edb3526 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -72,7 +72,8 @@ extern void shmem_unlock_mapping(struct address_space 
*mapping);
 extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
 extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t 
end);
-extern int shmem_unuse(swp_entry_t entry, struct page *page);
+extern int shmem_unuse(unsigned int type, bool frontswap,
+  unsigned long *fs_pages_to_unuse);
 
 extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
diff --git a/mm/shmem.c b/mm/shmem.c
index 7cd7ee69f670..cdeedba44cbe 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include  /* for arch/microblaze update_mmu_cache() */
 
@@ -1093,159 +1094,184 @@ static void

[PATCH v4 1/2] mm: Refactor swap-in logic out of shmem_getpage_gfp

2019-01-14 Thread Vineeth Remanan Pillai
swap-in logic could be reused independently without rest of the logic
in shmem_getpage_gfp. So lets refactor it out as an independent
function.
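
With that, the swapped-page case in shmem_getpage_gfp() collapses to a
single call. A sketch of the intended call-site shape (the call site is
not shown in the truncated diff below):

	if (xa_is_value(page)) {
		error = shmem_swapin_page(inode, index, &page,
					  sgp, gfp, vma, fault_type);
		if (error == -EEXIST)
			goto repeat;

		*pagep = page;
		return error;
	}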

Signed-off-by: Vineeth Remanan Pillai 
---
 mm/shmem.c | 449 +
 1 file changed, 244 insertions(+), 205 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 6ece1e2fe76e..7cd7ee69f670 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -123,6 +123,10 @@ static unsigned long shmem_default_max_inodes(void)
 static bool shmem_should_replace_page(struct page *page, gfp_t gfp);
 static int shmem_replace_page(struct page **pagep, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index);
+static int shmem_swapin_page(struct inode *inode, pgoff_t index,
+struct page **pagep, enum sgp_type sgp,
+gfp_t gfp, struct vm_area_struct *vma,
+vm_fault_t *fault_type);
 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp,
gfp_t gfp, struct vm_area_struct *vma,
@@ -1575,6 +1579,116 @@ static int shmem_replace_page(struct page **pagep, 
gfp_t gfp,
return error;
 }
 
+/*
+ * Swap in the page pointed to by *pagep.
+ * Caller has to make sure that *pagep contains a valid swapped page.
+ * Returns 0 and the page in pagep if success. On failure, returns the
+ * error code and NULL in *pagep.
+ */
+static int shmem_swapin_page(struct inode *inode, pgoff_t index,
+struct page **pagep, enum sgp_type sgp,
+gfp_t gfp, struct vm_area_struct *vma,
+vm_fault_t *fault_type)
+{
+   struct address_space *mapping = inode->i_mapping;
+   struct shmem_inode_info *info = SHMEM_I(inode);
+   struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
+   struct mem_cgroup *memcg;
+   struct page *page;
+   swp_entry_t swap;
+   int error;
+
+   VM_BUG_ON(!*pagep || !xa_is_value(*pagep));
+   swap = radix_to_swp_entry(*pagep);
+   *pagep = NULL;
+
+   /* Look it up and read it in.. */
+   page = lookup_swap_cache(swap, NULL, 0);
+   if (!page) {
+   /* Or update major stats only when swapin succeeds?? */
+   if (fault_type) {
+   *fault_type |= VM_FAULT_MAJOR;
+   count_vm_event(PGMAJFAULT);
+   count_memcg_event_mm(charge_mm, PGMAJFAULT);
+   }
+   /* Here we actually start the io */
+   page = shmem_swapin(swap, gfp, info, index);
+   if (!page) {
+   error = -ENOMEM;
+   goto failed;
+   }
+   }
+
+   /* We have to do this with page locked to prevent races */
+   lock_page(page);
+   if (!PageSwapCache(page) || page_private(page) != swap.val ||
+   !shmem_confirm_swap(mapping, index, swap)) {
+   error = -EEXIST;
+   goto unlock;
+   }
+   if (!PageUptodate(page)) {
+   error = -EIO;
+   goto failed;
+   }
+   wait_on_page_writeback(page);
+
+   if (shmem_should_replace_page(page, gfp)) {
+   error = shmem_replace_page(&page, gfp, info, index);
+   if (error)
+   goto failed;
+   }
+
+   error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg,
+   false);
+   if (!error) {
+   error = shmem_add_to_page_cache(page, mapping, index,
+   swp_to_radix_entry(swap), gfp);
+   /*
+* We already confirmed swap under page lock, and make
+* no memory allocation here, so usually no possibility
+* of error; but free_swap_and_cache() only trylocks a
+* page, so it is just possible that the entry has been
+* truncated or holepunched since swap was confirmed.
+* shmem_undo_range() will have done some of the
+* unaccounting, now delete_from_swap_cache() will do
+* the rest.
+*/
+   if (error) {
+   mem_cgroup_cancel_charge(page, memcg, false);
+   delete_from_swap_cache(page);
+   }
+   }
+   if (error)
+   goto failed;
+
+   mem_cgroup_commit_charge(page, memcg, true, false);
+
+   spin_lock_irq(&info->lock);
+   info->swapped--;
+   shmem_recalc_inode(inode);
+   spin_unlock_irq(&info->lock);
+
+   if (sgp == SGP_WRITE)
+   mark_page_accessed(page);
+
+   delete_from_swap_cache(page);
+   set_page_dirty(page);
+   swap_free(swap);
+
+   *pagep = page;
+   return 0;
+failed:
+   if (!shmem_confirm_swap(ma

[PATCH v3 2/2] mm: rid swapoff of quadratic complexity

2018-12-03 Thread Vineeth Remanan Pillai
This patch was initially posted by Kelley(kelley...@gmail.com).
Reposting the patch with all review comments addressed and with minor
modifications and optimizations. Tests were rerun and commit message
updated with new results.

The function try_to_unuse() is of quadratic complexity, with a lot of
wasted effort. It unuses swap entries one by one, potentially iterating
over all the page tables for all the processes in the system for each
one.

This new proposed implementation of try_to_unuse simplifies its
complexity to linear. It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables. It also
makes similar changes to shmem_unuse.

Improvement

swapoff was called on a swap partition containing about 6G of data, in a
VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted.

Present implementation....about 1200M calls(8min, avg 80% cpu util).
Prototype.................about  9.0K calls(3min, avg 5% cpu util).

Details

In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of each
swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.

In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass. Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse(). After the walk, check the type for orphaned swap entries
with find_next_to_unuse(), and remove them from the swap cache. If
find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.

Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().

Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of pages
has been swapped back in.

Change in v3
 - Addressed review comments
 - Refactored out swap-in logic from shmem_getpage_fp
Changes in v2
 - Updated patch to use Xarray instead of radix tree

Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Kelley Nielsen 
CC: Rik van Riel 
---
 include/linux/shmem_fs.h |   2 +-
 mm/shmem.c   | 218 ++---
 mm/swapfile.c| 413 +++
 3 files changed, 260 insertions(+), 373 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index f155dc607112..1dd02592bb53 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -72,7 +72,7 @@ extern void shmem_unlock_mapping(struct address_space 
*mapping);
 extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
 extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t 
end);
-extern int shmem_unuse(swp_entry_t entry, struct page *page);
+extern int shmem_unuse(unsigned int type);
 
 extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
diff --git a/mm/shmem.c b/mm/shmem.c
index 035ea2c10f54..404f7b785fce 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1093,159 +1093,143 @@ static void shmem_evict_inode(struct inode *inode)
clear_inode(inode);
 }
 
-static unsigned long find_swap_entry(struct xarray *xa, void *item)
+static int shmem_find_swap_entries(struct address_space *mapping,
+  pgoff_t start, unsigned int nr_entries,
+  struct page **entries, pgoff_t *indices)
 {
-   XA_STATE(xas, xa, 0);
-   unsigned int checked = 0;
-   void *entry;
+   XA_STATE(xas, &mapping->i_pages, start);
+   struct page *page;
+   unsigned int ret = 0;
+
+   if (!nr_entries)
+   return 0;
 
rcu_read_lock();
-   xas_for_each(&xas, entry, ULONG_MAX) {
-   if (xas_retry(&xas, entry))
+   xas_for_each(&xas, page, ULONG_MAX) {
+   if (xas_retry(&xas, page))
continue;
-   if (entry == item)
-   break;
-   checked++;
-   if ((checked % XA_CHECK_SCHED) != 0)
+
+   if (!xa_is_value(page))
continue;
-   xas_pa

[PATCH v3 1/2] mm: Refactor swap-in logic out of shmem_getpage_gfp

2018-12-03 Thread Vineeth Remanan Pillai
swap-in logic could be reused independently without rest of the logic
in shmem_getpage_gfp. So lets refactor it out as an independent
function.

Signed-off-by: Vineeth Remanan Pillai 
---
 mm/shmem.c | 449 +
 1 file changed, 244 insertions(+), 205 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index cddc72ac44d8..035ea2c10f54 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -121,6 +121,10 @@ static unsigned long shmem_default_max_inodes(void)
 static bool shmem_should_replace_page(struct page *page, gfp_t gfp);
 static int shmem_replace_page(struct page **pagep, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index);
+static int shmem_swapin_page(struct inode *inode, pgoff_t index,
+struct page **pagep, enum sgp_type sgp,
+gfp_t gfp, struct vm_area_struct *vma,
+vm_fault_t *fault_type);
 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp,
gfp_t gfp, struct vm_area_struct *vma,
@@ -1575,6 +1579,116 @@ static int shmem_replace_page(struct page **pagep, 
gfp_t gfp,
return error;
 }
 
+/*
+ * Swap in the page pointed to by *pagep.
+ * Caller has to make sure that *pagep contains a valid swapped page.
+ * Returns 0 and the page in pagep if success. On failure, returns the
+ * error code and NULL in *pagep.
+ */
+static int shmem_swapin_page(struct inode *inode, pgoff_t index,
+struct page **pagep, enum sgp_type sgp,
+gfp_t gfp, struct vm_area_struct *vma,
+vm_fault_t *fault_type)
+{
+   struct address_space *mapping = inode->i_mapping;
+   struct shmem_inode_info *info = SHMEM_I(inode);
+   struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
+   struct mem_cgroup *memcg;
+   struct page *page;
+   swp_entry_t swap;
+   int error;
+
+   VM_BUG_ON(!*pagep || !xa_is_value(*pagep));
+   swap = radix_to_swp_entry(*pagep);
+   *pagep = NULL;
+
+   /* Look it up and read it in.. */
+   page = lookup_swap_cache(swap, NULL, 0);
+   if (!page) {
+   /* Or update major stats only when swapin succeeds?? */
+   if (fault_type) {
+   *fault_type |= VM_FAULT_MAJOR;
+   count_vm_event(PGMAJFAULT);
+   count_memcg_event_mm(charge_mm, PGMAJFAULT);
+   }
+   /* Here we actually start the io */
+   page = shmem_swapin(swap, gfp, info, index);
+   if (!page) {
+   error = -ENOMEM;
+   goto failed;
+   }
+   }
+
+   /* We have to do this with page locked to prevent races */
+   lock_page(page);
+   if (!PageSwapCache(page) || page_private(page) != swap.val ||
+   !shmem_confirm_swap(mapping, index, swap)) {
+   error = -EEXIST;
+   goto unlock;
+   }
+   if (!PageUptodate(page)) {
+   error = -EIO;
+   goto failed;
+   }
+   wait_on_page_writeback(page);
+
+   if (shmem_should_replace_page(page, gfp)) {
+   error = shmem_replace_page(&page, gfp, info, index);
+   if (error)
+   goto failed;
+   }
+
+   error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg,
+   false);
+   if (!error) {
+   error = shmem_add_to_page_cache(page, mapping, index,
+   swp_to_radix_entry(swap), gfp);
+   /*
+* We already confirmed swap under page lock, and make
+* no memory allocation here, so usually no possibility
+* of error; but free_swap_and_cache() only trylocks a
+* page, so it is just possible that the entry has been
+* truncated or holepunched since swap was confirmed.
+* shmem_undo_range() will have done some of the
+* unaccounting, now delete_from_swap_cache() will do
+* the rest.
+*/
+   if (error) {
+   mem_cgroup_cancel_charge(page, memcg, false);
+   delete_from_swap_cache(page);
+   }
+   }
+   if (error)
+   goto failed;
+
+   mem_cgroup_commit_charge(page, memcg, true, false);
+
+   spin_lock_irq(&info->lock);
+   info->swapped--;
+   shmem_recalc_inode(inode);
+   spin_unlock_irq(&info->lock);
+
+   if (sgp == SGP_WRITE)
+   mark_page_accessed(page);
+
+   delete_from_swap_cache(page);
+   set_page_dirty(page);
+   swap_free(swap);
+
+   *pagep = page;
+   return 0;
+failed:
+   if (!shmem_confirm_swap(ma

Re: [PATCH v2] mm: prototype: rid swapoff of quadratic complexity

2018-12-03 Thread Vineeth Remanan Pillai

Hi Matthew,



This seems terribly complicated.  You run through i_pages, record the
indices of the swap entries, then go back and look them up again by
calling shmem_getpage() which calls the incredibly complex 300 line
shmem_getpage_gfp().

Can we refactor shmem_getpage_gfp() to skip some of the checks which
aren't necessary when called from this path, and turn this into a nice
simple xas_for_each() loop which works one entry at a time?


I shall investigate this and make this simpler as you suggested.


I have looked into this deeper. I think it would be very difficult to
consolidate the whole logic into a single xas_for_each() loop because
we do disk I/O and might sleep. I have refactored the code so that it
is much more readable now, and I am using the same format used by
find_get_entries.
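
Concretely, "the same format used by find_get_entries" means: gather a
small batch of swap entries under RCU without sleeping, then do the
blocking swap-in work on that batch outside the RCU section. A rough
sketch of that shape (hypothetical loop; mapping, inode and gfp are
placeholders from the surrounding function):

	pgoff_t indices[PAGEVEC_SIZE];
	struct page *entries[PAGEVEC_SIZE];
	pgoff_t start = 0;
	unsigned int i, nr;

	do {
		/* phase 1: collect swap entries under RCU */
		nr = shmem_find_swap_entries(mapping, start, PAGEVEC_SIZE,
					     entries, indices);
		if (!nr)
			break;
		/* phase 2: swap them in one by one; may block on I/O */
		for (i = 0; i < nr; i++)
			shmem_swapin_page(inode, indices[i], &entries[i],
					  SGP_CACHE, gfp, NULL, NULL);
		start = indices[nr - 1] + 1;
	} while (nr == PAGEVEC_SIZE);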


Will send out the next revision later today.


Thanks,

Vineeth



Re: [PATCH v2] mm: prototype: rid swapoff of quadratic complexity

2018-11-26 Thread Vineeth Remanan Pillai

Hi Matthew,


Thanks for your response!

On 11/26/18 12:22 PM, Matthew Wilcox wrote:

On Mon, Nov 26, 2018 at 04:55:21PM +, Vineeth Remanan Pillai wrote:

+   do {
+   XA_STATE(xas, &mapping->i_pages, start);
+   int i;
+   int entries = 0;
+   struct page *page;
+   pgoff_t indices[PAGEVEC_SIZE];
+   unsigned long end = start + PAGEVEC_SIZE;
  
+		rcu_read_lock();

+   xas_for_each(&xas, page, end) {

I think this is a mistake.  You should probably specify ULONG_MAX for the
end.  Otherwise if there are no swap entries in the first 60kB of the file,
you'll just exit.  That does mean you'll need to check 'entries' for
hitting PAGEVEC_SIZE.


Thanks for pointing this out. I shall fix this in the next version.
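
For reference, the corrected gather loop would then look roughly like this
(illustrative only, not the resend itself):

	rcu_read_lock();
	xas_for_each(&xas, page, ULONG_MAX) {	/* not start + PAGEVEC_SIZE */
		if (xas_retry(&xas, page))
			continue;
		if (!xa_is_value(page))
			continue;
		indices[entries++] = xas.xa_index;
		if (entries == PAGEVEC_SIZE)	/* bound the batch explicitly */
			break;
	}
	rcu_read_unlock();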


This seems terribly complicated.  You run through i_pages, record the
indices of the swap entries, then go back and look them up again by
calling shmem_getpage() which calls the incredibly complex 300 line
shmem_getpage_gfp().

Can we refactor shmem_getpage_gfp() to skip some of the checks which
aren't necessary when called from this path, and turn this into a nice
simple xas_for_each() loop which works one entry at a time?


I shall investigate this and make this simpler as you suggested.


+   list_for_each_safe(p, next, &shmem_swaplist) {
+   info = list_entry(p, struct shmem_inode_info, swaplist);

This could use list_for_each_entry_safe(), right?


Yes, you are right. Will fix.
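
A minimal sketch of that conversion (the loop body below is only a
placeholder, and the locking just mirrors the existing shmem_swaplist
handling):

	struct shmem_inode_info *info, *next;

	mutex_lock(&shmem_swaplist_mutex);
	list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
		/* 'info' is already the shmem_inode_info; no list_entry() needed */
		if (!info->swapped)
			list_del_init(&info->swaplist);
		/* ...otherwise hand 'info' off to shmem_unuse_inode()... */
	}
	mutex_unlock(&shmem_swaplist_mutex);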


Thanks,

Vineeth





[PATCH v2] mm: prototype: rid swapoff of quadratic complexity

2018-11-26 Thread Vineeth Remanan Pillai
This patch was initially posted by Kelley (kelley...@gmail.com).
Reposting the patch with all review comments addressed and with minor
modifications and optimizations. Tests were rerun and commit message
updated with new results.

The function try_to_unuse() is of quadratic complexity, with a lot of
wasted effort. It unuses swap entries one by one, potentially iterating
over all the page tables for all the processes in the system for each
one.

This new proposed implementation of try_to_unuse simplifies its
complexity to linear. It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables. It also
makes similar changes to shmem_unuse.

Improvement

swapoff was called on a swap partition containing about 6G of data, in a
VM (8 CPU, 16G RAM), and calls to unuse_pte_range() were counted.

Present implementation: about 1200M calls (8 min, avg 80% CPU util).
Prototype:              about  8.0K calls (3 min, avg  5% CPU util).

Details

In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated radix tree, and store the index of each
exceptional entry in an array for passing to shmem_getpage_gfp() outside
of the RCU critical section.

In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass. Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse(). After the walk, check the type for orphaned swap entries
with find_next_to_unuse(), and remove them from the swap cache. If
find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.

Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().

Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of pages
has been swapped back in.
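
In outline (simplified; the helper names and signatures below are only
approximations, and mmlist locking, mm refcounting and the swap-cache
sweep are elided), the new flow is:

	/*
	 * Rough shape of the new try_to_unuse(), not the literal patch code.
	 * swap_type_still_in_use() is a stand-in for checking whether the
	 * swap type still has entries in use.
	 */
	static int try_to_unuse_outline(unsigned int type)
	{
		unsigned int retries = 0;
		struct mm_struct *mm;
		int err;

		do {
			err = shmem_unuse(type);		/* tmpfs pages first */
			if (err)
				return err;

			list_for_each_entry(mm, &init_mm.mmlist, mmlist) {
				err = unuse_mm(mm, type);	/* one page-table walk per mm */
				if (err)
					return err;
			}

			/* sweep orphaned entries out of the swap cache here */
		} while (swap_type_still_in_use(type) && ++retries < 3);

		return swap_type_still_in_use(type) ? -EBUSY : 0;
	}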

Changes in v2
 - Updated patch to use Xarray instead of radix tree

Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Kelley Nielsen 
CC: Rik van Riel 
---
 include/linux/shmem_fs.h |   2 +-
 mm/shmem.c   | 208 
 mm/swapfile.c| 413 +++
 3 files changed, 242 insertions(+), 381 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index f155dc607112..1dd02592bb53 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -72,7 +72,7 @@ extern void shmem_unlock_mapping(struct address_space *mapping);
 extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
 extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
-extern int shmem_unuse(swp_entry_t entry, struct page *page);
+extern int shmem_unuse(unsigned int type);
 
 extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
diff --git a/mm/shmem.c b/mm/shmem.c
index d44991ea5ed4..21d87cdcba3c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1085,159 +1085,117 @@ static void shmem_evict_inode(struct inode *inode)
clear_inode(inode);
 }
 
-static unsigned long find_swap_entry(struct xarray *xa, void *item)
-{
-   XA_STATE(xas, xa, 0);
-   unsigned int checked = 0;
-   void *entry;
-
-   rcu_read_lock();
-   xas_for_each(&xas, entry, ULONG_MAX) {
-   if (xas_retry(&xas, entry))
-   continue;
-   if (entry == item)
-   break;
-   checked++;
-   if ((checked % XA_CHECK_SCHED) != 0)
-   continue;
-   xas_pause(&xas);
-   cond_resched_rcu();
-   }
-   rcu_read_unlock();
-
-   return entry ? xas.xa_index : -1;
-}
-
 /*
  * If swap found in inode, free it and move page from swapcache to filecache.
  */
-static int shmem_unuse_inode(struct shmem_inode_info *info,
-swp_entry_t swap, struct page **pagep)
+static int shmem_unuse_inode(struct inode *inode, unsigned int type)
 {
-   struct address_space *mapping = info->vfs_inode.i_mapping;
-   void *radswap;
-   pgoff_t index;
-   gfp_t gfp;
+   struct address_space *mapp
