[PATCH resend 7/8] Documentation: Add core scheduling documentation

2021-03-24 Thread Joel Fernandes (Google)
Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Chris Hyser 
Co-developed-by: Vineeth Pillai 
Co-developed-by: Josh Don 
Signed-off-by: Josh Don 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Chris Hyser 
Tested-by: Julien Desfossez 
Reviewed-by: Randy Dunlap 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 460 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 461 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst 
b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..0ef00edd50e6
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,460 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks does not trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the
+same hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance; however, it is not guaranteed that performance will
+always improve, though that has been seen to be the case with a number of
+real-world workloads. In theory, core scheduling aims to perform at least as
+well as when Hyper Threading is disabled. In practice, this is mostly the
+case, though not always: synchronizing scheduling decisions across 2 or more
+CPUs in a core involves additional overhead - especially when the system is
+lightly loaded (``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This allows all of the CGroup's tasks to run concurrently on a core's
+hyperthreads (also called siblings).
+
+A value of ``0`` in this file means that the tag state of the CGroup is
+inherited from its parent hierarchy. If any ancestor of the CGroup is tagged,
+then the group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.core_tag, it is not possible to set
+          this for any descendant of the tagged group.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+          a core with kernel threads and untagged system threads. For this
+          reason, if a group has ``cpu.core_tag`` of 0, it is considered to be
+          trusted.
+
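+For illustration only (this sketch is not part of the patch set), a CGroup
+could be tagged from C roughly as follows, assuming the CPU controller is
+mounted at ``/sys/fs/cgroup/cpu`` and that a group named ``trusted_group``
+already exists::
+
+  #include <fcntl.h>
+  #include <unistd.h>
+
+  /* Tag "trusted_group"; its tasks may then only share a core with each other. */
+  int tag_trusted_group(void)
+  {
+          int fd = open("/sys/fs/cgroup/cpu/trusted_group/cpu.core_tag",
+                        O_WRONLY);
+
+          if (fd < 0)
+                  return -1;
+          if (write(fd, "1", 1) != 1) {
+                  close(fd);
+                  return -1;
+          }
+          return close(fd);
+  }
+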
+prctl(2) interface
+##################
+
+A ``prctl(2)`` command, ``PR_SCHED_CORE_SHARE``, provides an interface for
+creating core scheduling groups and for adding tasks to and removing tasks
+from them. Permission to change the ``cookie``, and hence the core scheduling
+group it represents, is based on ``ptrace access``.
+
+::
+
+#include <sys/prctl.h>
+
+int prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5);
+
+int prctl(PR_SCHED_CORE_SHARE, sub_command, pid, pid_type, 0);
+
+option:
+``PR_SCHED_CORE_SHARE``
+
+arg2:
+sub-command:
+
+- ``PR_SCHED_CORE_CLEAR        0  -- clear core_sched cookie of pid``
+- ``PR_SCHED_CORE_CREATE   1  -- create a new cookie for pid``
+- ``PR_SCHED_CORE_SHARE_FROM   2  -- copy core_sched cookie from pid``
+- ``PR_SCHED_CORE_SHARE_TO 3  -- copy core_sched cookie to pid``
+
+arg3:
+``pid`` of the task to which the operation applies, where ``pid == 0``
+implies the current process.
+
+arg4:
+``pid_type`` for PR_SCHED_CORE_CLEAR/CREATE/SHARE_TO is an enum
+{PIDTYPE_PID=0, PIDTYPE_TGID, PIDTYPE_PGID} and determines how the target
+``pid`` should be interpreted. ``PIDTYPE_PID`` indicates that the target
+``pid`` should be treated as an individual task, ``PIDTYPE_TGID`` as a process
+or thread group, and ``PIDTYPE_PGID`` as a process group.
+
+arg5:
+MUST be equal to 0.
+
+Return Value:
+::
+
+EINVAL - bad parame

[PATCH resend 3/8] sched: prctl() cookie manipulation for core scheduling

2021-03-24 Thread Joel Fernandes (Google)
From: chris hyser 

This patch provides support for setting, clearing and copying core
scheduling 'task cookies' between threads (PID), processes (TGID), and
process groups (PGID).

The value of core scheduling isn't that tasks don't share a core, 'nosmt'
can do that. The value lies in exploiting all the sharing opportunities
that exist to recover possible lost performance, and that requires a degree
of flexibility in the API. From a security perspective (and there are
others), the thread, process and process group distinction is an existing
hierarchical categorization of tasks that reflects many of the security
concerns about 'data sharing'. For example, protecting against
cache-snooping by a thread that can just read the memory directly isn't all
that useful. With this in mind, subcommands to CLEAR/CREATE/SHARE (TO/FROM)
provide a mechanism to create, clear and share cookies.
CLEAR/CREATE/SHARE_TO specify a target pid with enum pidtype used to
specify the scope of the targeted tasks. For example, PIDTYPE_TGID will
share the cookie with the process and all of its threads, as is typically
desired in a security scenario.

API:

prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, tgtpid, pidtype, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, srcpid, 0, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, 0)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.
PIDTYPE_SID, sharing a cookie with an entire session, was considered less
useful given the choice to create a new cookie on task exec().

For return values, EINVAL, ENOMEM are what they say. ESRCH means the
tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission
access to tgtpid/srcpid. EACCES indicates that a task in the target pidtype
group was not updated due to permission.

In terms of interaction with the cgroup interface, task cookies are set
independently of cgroup core scheduling cookies and thus would allow use
for tasks within a container using cgroup cookies.

Current hard-coded policies are:
- a user can clear the cookie of any process they can set a cookie for.
Lack of a cookie *might* be a security issue if cookies are being used
for that.
- on fork of a parent with a cookie, both process and thread child tasks
get a copy.
- on exec, a task with a cookie is given a new cookie.
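
As a rough illustration (not part of this patch), a process could create a
cookie for its own thread group and then pull a cooperating process into the
same group. The PR_SCHED_CORE_* values below are the ones added to the uapi
header in this posting, and PIDTYPE_TGID mirrors the kernel's enum pid_type:

    #include <sys/prctl.h>
    #include <sys/types.h>

    #ifndef PR_SCHED_CORE_SHARE
    #define PR_SCHED_CORE_SHARE    60
    #define PR_SCHED_CORE_CREATE    1
    #define PR_SCHED_CORE_SHARE_TO  3
    #endif

    #define PIDTYPE_TGID 1    /* mirrors kernel enum pid_type */

    /* Give the caller's thread group a fresh cookie, then share it with 'peer'. */
    static int trust_peer(pid_t peer)
    {
            if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, 0,
                      PIDTYPE_TGID, 0))
                    return -1;

            return prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, peer,
                         PIDTYPE_TGID, 0);
    }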

Signed-off-by: Chris Hyser 
Signed-off-by: Josh Don 
---
 fs/exec.c|   4 +-
 include/linux/sched.h|  11 ++
 include/linux/sched/task.h   |   4 +-
 include/uapi/linux/prctl.h   |   7 ++
 kernel/sched/core.c  |  11 +-
 kernel/sched/coretag.c   | 196 ++-
 kernel/sched/sched.h |   2 +
 kernel/sys.c |   7 ++
 tools/include/uapi/linux/prctl.h |   7 ++
 9 files changed, 241 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..ab0945508b50 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1807,7 +1807,9 @@ static int bprm_execve(struct linux_binprm *bprm,
if (IS_ERR(file))
goto out_unmark;
 
-   sched_exec();
+   retval = sched_exec();
+   if (retval)
+   goto out;
 
bprm->file = file;
/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 833f8d682212..075b15392a4a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2184,8 +2184,19 @@ const struct cpumask *sched_trace_rd_span(struct 
root_domain *rd);
 
 #ifdef CONFIG_SCHED_CORE
 void sched_tsk_free(struct task_struct *tsk);
+int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type);
+int sched_core_exec(void);
 #else
 #define sched_tsk_free(tsk) do { } while (0)
+static inline int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type)
+{
+   return 0;
+}
+
+static inline int sched_core_exec(void)
+{
+   return 0;
+}
 #endif
 
 #endif
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..d0f5b233f092 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -94,9 +94,9 @@ extern void free_task(struct task_struct *tsk);
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
-extern void sched_exec(void);
+int sched_exec(void);
 #else
-#define sched_exec()   {}
+static inline int sched_exec(void) { return 0; }
 #endif
 
 static inline struct task_struct *get_task_struct(struct task_struct *t)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 667f1aed091c..e658dca88f4f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -255,4 +255,11 @@ struct prctl_mm_map {
 # define SYSCALL_DISPATCH_FILTER_ALLOW 0
 # define SYSCALL_DISPATCH_FILTER_BLOCK 1
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE    60
+# 

[PATCH resend 6/8] kselftest: Add tests for core-sched interface

2021-03-24 Thread Joel Fernandes (Google)
Add a kselftest test to ensure that the core-sched interface is working
correctly.

Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Co-developed-by: Josh Don 
Signed-off-by: Josh Don 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: chris hyser 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|   4 +-
 .../testing/selftests/sched/test_coresched.c  | 812 ++
 3 files changed, 815 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore 
b/tools/testing/selftests/sched/.gitignore
index 6996d4654d92..a4b4a1cdcd93 100644
--- a/tools/testing/selftests/sched/.gitignore
+++ b/tools/testing/selftests/sched/.gitignore
@@ -1 +1,2 @@
 cs_prctl_test
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile 
b/tools/testing/selftests/sched/Makefile
index 10c72f14fea9..830766e12bed 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  
-Wl,-rpath=./ \
  $(CLANG_FLAGS)
 LDLIBS += -lpthread
 
-TEST_GEN_FILES := cs_prctl_test
-TEST_PROGS := cs_prctl_test
+TEST_GEN_FILES := test_coresched cs_prctl_test
+TEST_PROGS := test_coresched cs_prctl_test
 
 include ../lib.mk
diff --git a/tools/testing/selftests/sched/test_coresched.c 
b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index ..9d47845e6f8a
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,812 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+# define PR_SCHED_CORE_CLEAR0
+# define PR_SCHED_CORE_CREATE   1
+# define PR_SCHED_CORE_SHARE_FROM   2
+# define PR_SCHED_CORE_SHARE_TO 3
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+   printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+   printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+   if (!cond) {
+   printf("Error: %s\n", str);
+   abort();
+   }
+}
+
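+/*
+ * Create a temporary directory and mount the cgroup filesystem (cpu
+ * controller) on it. The returned path is used as the root under which the
+ * tests create and tag their cgroups.
+ */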
+char *make_group_root(void)
+{
+   char *mntpath, *mnt;
+   int ret;
+
+   mntpath = malloc(50);
+   if (!mntpath) {
+   perror("Failed to allocate mntpath\n");
+   abort();
+   }
+
+   sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+   mnt = mkdtemp(mntpath);
+   if (!mnt) {
+   perror("Failed to create mount: ");
+   exit(-1);
+   }
+
+   ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+   if (ret == -1) {
+   perror("Failed to mount cgroup: ");
+   exit(-1);
+   }
+
+   return mnt;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+   char tag_path[50] = {}, rdbuf[8] = {};
+   int tfd;
+
+   sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+   tfd = open(tag_path, O_RDONLY, 0666);
+   if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+   }
+
+   if (read(tfd, rdbuf, 1) != 1) {
+   perror("Failed to enable coresched on cgroup: ");
+   abort();
+   }
+
+   if (strcmp(rdbuf, tag)) {
+   printf("Group tag does not match (exp: %s, act: %s)\n", tag,
+  rdbuf);
+   abort();
+   }
+
+   if (close(tfd) == -1) {
+   perror("Failed to close tag fd: ");
+   abort();
+   }
+}
+
+void tag_group(char *cgroup_path)
+{
+   char tag_path[50];
+   int tfd;
+
+   sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+   tfd = open(tag_path, O_WRONLY, 0666);
+   if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+   }
+
+   if (write(tfd, "1", 1) != 1) {
+   perror("Failed to enable coresched on cgroup: ");
+   abort();
+   }
+
+   if (close(tfd) == -1) {
+   perror("Failed to close tag fd: ");
+   abort();
+   }
+
+   assert_group_tag(cgroup_path, "1");
+}
+
+void untag_group(char *cgroup_path)
+{
+   char tag_path[50];
+   int tfd;
+
+   sprintf(tag_path, 

[PATCH resend 8/8] sched: Debug bits...

2021-03-24 Thread Joel Fernandes (Google)
Tested-by: Julien Desfossez 
Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 40 +++-
 kernel/sched/fair.c | 12 
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a733891dfe7d..2649efeac19f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -106,6 +106,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b, bool
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%llu,%llu) ?< (%s/%d;%d,%llu,%llu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -292,12 +296,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
__sched_core_flip(true);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
__sched_core_flip(false);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -5361,6 +5369,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
set_next_task(rq, next);
}
 
+   trace_printk("pick pre selected (%u %u %u): %s/%d %llu\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie.userspace_id);
+
rq->core_pick = NULL;
return next;
}
@@ -5455,6 +5470,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_forceidle_seq++;
}
 
+   trace_printk("cpu(%d): selected: %s/%d %llu\n",
+i, p->comm, p->pid,
+p->core_cookie.userspace_id);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -5471,6 +5490,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %llu\n",
+max->comm, max->pid,
+max->core_cookie.userspace_id);
+
if (old_max) {
rq->core->core_forceidle = false;
for_each_cpu(j, smt_mask) {
@@ -5492,6 +5515,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+   trace_printk("picked: %s/%d %llu\n", next->comm, next->pid,
+next->core_cookie.userspace_id);
 
/*
 * Reschedule siblings
@@ -5533,13 +5558,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
}
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%llu/0x%llu\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie.userspace_id,
+rq_i->core->core_cookie.userspace_id);
+   WARN_ON_ONCE(1);
+   }
 
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
 
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
 
@@ -5579,6 +5612,11 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %llu\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation,
+cookie->userspace_id);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);
set_task_cpu(p, this);
diff --git a/kernel/sched/fair.c 

[PATCH resend 5/8] sched: cgroup cookie API for core scheduling

2021-03-24 Thread Joel Fernandes (Google)
From: Josh Don 

This adds the API to set/get the cookie for a given cgroup. This
interface lives at cgroup/cpu.core_tag.

The cgroup interface can be used to toggle a unique cookie value for all
descendant tasks, preventing these tasks from sharing with any others.
See Documentation/admin-guide/hw-vuln/core-scheduling.rst for a full
rundown of both this and the per-task API.

Signed-off-by: Josh Don 
---
 kernel/sched/core.c|  61 ++--
 kernel/sched/coretag.c | 156 -
 kernel/sched/sched.h   |  25 +++
 3 files changed, 235 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3093cb3414c3..a733891dfe7d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9328,6 +9328,8 @@ struct task_group *sched_create_group(struct task_group 
*parent)
 
alloc_uclamp_sched_group(tg, parent);
 
+   alloc_sched_core_sched_group(tg);
+
return tg;
 
 err:
@@ -9391,6 +9393,11 @@ static void sched_change_group(struct task_struct *tsk, 
int type)
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
  struct task_group, css);
tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+   sched_core_change_group(tsk, tg);
+#endif
+
tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -9443,11 +9450,6 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, &rf);
 }
 
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-   return css ? container_of(css, struct task_group, css) : NULL;
-}
-
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -9483,6 +9485,18 @@ static int cpu_cgroup_css_online(struct 
cgroup_subsys_state *css)
return 0;
 }
 
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+   struct task_group *tg = css_tg(css);
+
+   if (tg->core_tagged) {
+   sched_core_put();
+   tg->core_tagged = 0;
+   }
+#endif
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
struct task_group *tg = css_tg(css);
@@ -9517,6 +9531,25 @@ static void cpu_cgroup_fork(struct task_struct *task)
task_rq_unlock(rq, task, &rf);
 }
 
+static void cpu_cgroup_exit(struct task_struct *task)
+{
+#ifdef CONFIG_SCHED_CORE
+   /*
+* This is possible if task exit races with core sched being
+* disabled due to the task's cgroup no longer being tagged, since
+* cpu_core_tag_write_u64() will miss dying tasks.
+*/
+   if (unlikely(sched_core_enqueued(task))) {
+   struct rq *rq;
+   struct rq_flags rf;
+
+   rq = task_rq_lock(task, &rf);
+   sched_core_dequeue(rq, task);
+   task_rq_unlock(rq, task, &rf);
+   }
+#endif
+}
+
 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 {
struct task_struct *task;
@@ -10084,6 +10117,14 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
 #endif
+#ifdef CONFIG_SCHED_CORE
+   {
+   .name = "core_tag",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_tag_read_u64,
+   .write_u64 = cpu_core_tag_write_u64,
+   },
+#endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
{
.name = "uclamp.min",
@@ -10257,6 +10298,14 @@ static struct cftype cpu_files[] = {
.write_s64 = cpu_weight_nice_write_s64,
},
 #endif
+#ifdef CONFIG_SCHED_CORE
+   {
+   .name = "core_tag",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_tag_read_u64,
+   .write_u64 = cpu_core_tag_write_u64,
+   },
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
{
.name = "max",
@@ -10285,10 +10334,12 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc  = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
+   .css_offline= cpu_cgroup_css_offline,
.css_released   = cpu_cgroup_css_released,
.css_free   = cpu_cgroup_css_free,
.css_extra_stat_show = cpu_extra_stat_show,
.fork   = cpu_cgroup_fork,
+   .exit   = cpu_cgroup_exit,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.legacy_cftypes = cpu_legacy_files,
diff --git a/kernel/sched/coretag.c b/kernel/sched/coretag.c
index 550f4975eea2..1498790bc76c 100644
--- a/kernel/sched/coretag.c
+++ b/kernel/sched/coretag.c
@@ -96,9 +96,19 @@ static void __sched_core_set_task_cookie(struct 
sched_core_cookie *cookie,
 static void __sched_core_set_group_cookie(struct sched_core_cookie *cookie,
  unsigned long val)
 {
+   struct 

[PATCH resend 1/8] sched: migration changes for core scheduling

2021-03-24 Thread Joel Fernandes (Google)
From: Aubrey Li 

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move task from busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match with the destination CPU's
 core cookie, this task may be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
 In the fast path of task wakeup, select the first cookie matched
 idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches with task's cookie
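
The checks above use small cookie-match helpers added to sched.h. As a rough
sketch of the idea only (not the literal sched_cpu_cookie_match() /
sched_core_cookie_match() helpers, which also cover the SMT sibling iteration
done by sched_group_cookie_match()):

    /* Sketch: a task may be placed on a CPU only if the core's cookie agrees. */
    static inline bool cpu_cookie_matches(struct rq *rq, struct task_struct *p)
    {
            /* With core scheduling disabled, every CPU is a match. */
            if (!sched_core_enabled(rq))
                    return true;

            return rq->core->core_cookie == p->core_cookie;
    }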

Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 29 ++
 kernel/sched/sched.h | 73 
 2 files changed, 96 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a03564398605..12030b73a032 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5877,11 +5877,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -5967,9 +5971,10 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
return new_cpu;
 }
 
-static inline int __select_idle_cpu(int cpu)
+static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+   if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+   sched_cpu_cookie_match(cpu_rq(cpu), p))
return cpu;
 
return -1;
@@ -6041,7 +6046,7 @@ static int select_idle_core(struct task_struct *p, int 
core, struct cpumask *cpu
int cpu;
 
if (!static_branch_likely(&sched_smt_present))
-   return __select_idle_cpu(core);
+   return __select_idle_cpu(core, p);
 
for_each_cpu(cpu, cpu_smt_mask(core)) {
if (!available_idle_cpu(cpu)) {
@@ -6079,7 +6084,7 @@ static inline bool test_idle_cores(int cpu, bool def)
 
 static inline int select_idle_core(struct task_struct *p, int core, struct 
cpumask *cpus, int *idle_cpu)
 {
-   return __select_idle_cpu(core);
+   return __select_idle_cpu(core, p);
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6132,7 +6137,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
} else {
if (!--nr)
return -1;
-   idle_cpu = __select_idle_cpu(cpu);
+   idle_cpu = __select_idle_cpu(cpu, p);
if ((unsigned int)idle_cpu < nr_cpumask_bits)
break;
}
@@ -7473,6 +7478,14 @@ static int task_hot(struct task_struct *p, struct lb_env 
*env)
 
if (sysctl_sched_migration_cost == -1)
return 1;
+
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 1;
+
if (sysctl_sched_migration_cost == 0)
return 0;
 
@@ -8834,6 +8847,10 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+   /* Skip over this group if no cookie matched */
+   if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+   continue;
+
local_group = cpumask_test_cpu(this_cpu,
   sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 80abbc0af680..12edfb8f6994 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,8 +1128,10 @@ static inline bool is_migration_disabled(struct 
task_struct *p)
 #endif
 }
 
+struct sched_group;
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
 static inline bool sched_core_enabled(struct rq *rq)
 {
@@ -1163,6 +1165,61 @@ static inline raw_spinlock_t *__rq_lockp(struct rq

[PATCH resend 2/8] sched: core scheduling tagging infrastructure

2021-03-24 Thread Joel Fernandes (Google)
From: Josh Don 

A single unsigned long is insufficient as a cookie value for core
scheduling. We will minimally have cookie values for a per-task and a
per-group interface, which must be combined into an overall cookie.

This patch adds the infrastructure necessary for setting task and group
cookie. Namely, it reworks the core_cookie into a struct, and provides
interfaces for setting task and group cookie, as well as other
operations (i.e. compare()). Subsequent patches will use these hooks to
provide an API for setting these cookies.

One important property of this interface is that neither the per-task
nor the per-cgroup setting overrides the other. For example, if two
tasks are in different cgroups, and one or both of the cgroups is tagged
using the per-cgroup interface, then these tasks cannot share, even if
they use the per-task interface to attempt to share with one another.

Core scheduler has extra overhead.  Enable it only for machines with
more than one SMT hardware thread.
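
To make the "neither setting overrides the other" property concrete, the
comparison amounts to something like the following sketch (illustrative only;
the patch's actual sched_core_cookie_cmp() returns a three-way result so the
cookie can also be used to order the core's rbtree):

    /*
     * Two tasks may share a core only if every cookie field matches
     * (group_cookie exists only when CONFIG_CGROUP_SCHED is enabled).
     */
    static inline bool cookies_match(const struct sched_core_cookie *a,
                                     const struct sched_core_cookie *b)
    {
            return a->task_cookie == b->task_cookie &&
                   a->group_cookie == b->group_cookie;
    }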

Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Co-developed-by: Joel Fernandes (Google) 
Signed-off-by: Joel Fernandes (Google) 
Tested-by: Julien Desfossez 
Signed-off-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Josh Don 
---
 include/linux/sched.h  |  24 +++-
 kernel/fork.c  |   1 +
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c| 100 ++---
 kernel/sched/coretag.c | 245 +
 kernel/sched/debug.c   |   4 +
 kernel/sched/sched.h   |  57 --
 7 files changed, 384 insertions(+), 48 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5d91ff1d3a30..833f8d682212 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -645,6 +645,22 @@ struct kmap_ctrl {
 #endif
 };
 
+#ifdef CONFIG_SCHED_CORE
+struct sched_core_cookie {
+   unsigned long task_cookie;
+#ifdef CONFIG_CGROUP_SCHED
+   unsigned long group_cookie;
+#endif
+
+   /*
+* A u64 representation of the cookie used only for display to
+* userspace. We avoid exposing the actual cookie contents, which
+* are kernel pointers.
+*/
+   u64 userspace_id;
+};
+#endif
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -703,7 +719,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
-   unsigned long   core_cookie;
+   struct sched_core_cookie   core_cookie;
unsigned intcore_occupation;
 #endif
 
@@ -2166,4 +2182,10 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_tsk_free(struct task_struct *tsk);
+#else
+#define sched_tsk_free(tsk) do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 54cc905e5fe0..cbe461105b10 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -737,6 +737,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 53d742ed6432..1b07687c53d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -123,11 +123,13 @@ static inline bool prio_less(struct task_struct *a, 
struct task_struct *b, bool
 
 static inline bool __sched_core_less(struct task_struct *a, struct task_struct 
*b)
 {
-   if (a->core_cookie < b->core_cookie)
-   return true;
+   int cmp = sched_core_cookie_cmp(&a->core_cookie, &b->core_cookie);
 
-   if (a->core_cookie > b->core_cookie)
-   return false;
+   if (cmp < 0)
+   return true; /* a < b */
+
+   if (cmp > 0)
+   return false; /* a > b */
 
/* flip prio, so high prio is leftmost */
if (prio_less(b, a, task_rq(a)->core->core_forceidle))
@@ -146,41 +148,49 @@ static inline bool rb_sched_core_less(struct rb_node *a, 
const struct rb_node *b
 static inline int rb_sched_core_cmp(const void *key, const struct rb_node 
*node)
 {
const struct task_struct *p = __node_2_sc(node);
-   unsigned long cookie = (unsigned long)key;
+   const struct sched_core_cookie *cookie = key;
+   int cmp = sched_core_cookie_cm

[PATCH resend 4/8] kselftest: Add test for core sched prctl interface

2021-03-24 Thread Joel Fernandes (Google)
From: chris hyser 

Provides a selftest and examples of using the interface.

Signed-off-by: Chris Hyser 
Signed-off-by: Josh Don 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|  14 +
 tools/testing/selftests/sched/config  |   1 +
 tools/testing/selftests/sched/cs_prctl_test.c | 370 ++
 4 files changed, 386 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c

diff --git a/tools/testing/selftests/sched/.gitignore 
b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index ..6996d4654d92
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+cs_prctl_test
diff --git a/tools/testing/selftests/sched/Makefile 
b/tools/testing/selftests/sched/Makefile
new file mode 100644
index ..10c72f14fea9
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+ $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := cs_prctl_test
+TEST_PROGS := cs_prctl_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config 
b/tools/testing/selftests/sched/config
new file mode 100644
index ..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/cs_prctl_test.c 
b/tools/testing/selftests/sched/cs_prctl_test.c
new file mode 100644
index ..03581e180e31
--- /dev/null
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Use the core scheduling prctl() to test core scheduling cookies control.
+ *
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ * Author: Chris Hyser 
+ *
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, see <http://www.gnu.org/licenses>.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include 
+static pid_t gettid(void)
+{
+   return syscall(SYS_gettid);
+}
+#endif
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE    60
+# define PR_SCHED_CORE_CLEAR   0
+# define PR_SCHED_CORE_CREATE  1
+# define PR_SCHED_CORE_SHARE_FROM  2
+# define PR_SCHED_CORE_SHARE_TO3
+#endif
+
+#define MAX_PROCESSES 128
+#define MAX_THREADS   128
+
+static const char USAGE[] = "cs_prctl_test [options]\n"
+"options:\n"
+"  -P  : number of processes to create.\n"
+"  -T  : number of threads per process to create.\n"
+"  -d  : delay time to keep tasks alive.\n"
+"  -k  : keep tasks alive until keypress.\n";
+
+enum pid_type {PIDTYPE_PID = 0, PIDTYPE_TGID, PIDTYPE_PGID};
+
+const int THREAD_CLONE_FLAGS = CLONE_THREAD | CLONE_SIGHAND | CLONE_FS | CLONE_VM | CLONE_FILES;
+
+static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4,
+  unsigned long arg5)
+{
+   int res;
+
+   res = prctl(option, arg2, arg3, arg4, arg5);
+   printf("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, 
(long)arg3,
+  (long)arg4, arg5);
+   return res;
+}
+
+#define STACK_SIZE (1024 * 1024)
+
+#define handle_error(msg) __handle_error(__FILE__, __LINE__, msg)
+static void __handle_error(char *fn, int ln, char *msg)
+{
+   printf("(%s:%d) - ", fn, ln);
+   perror(msg);
+   exit(EXIT_FAILURE);
+}
+
+static void handle_usage(int rc, char *msg)
+{
+   puts(USAGE);
+   puts(msg);
+   putchar('\n');
+   exit(rc);
+}
+
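+/*
+ * Read the core scheduling cookie id that the kernel reports for @pid by
+ * scanning /proc/<pid>/sched for a "core_cookie" line. Returns -2UL if the
+ * file cannot be opened.
+ */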
+static unsigned long get_cs_cookie(int pid)
+{
+   char buf[4096];
+   char fn[512];
+   FILE *inf;
+   char *c;
+   int i;
+
+   if (pid == 0)
+   pid = getpid();
+   snprintf(fn, 512, "/proc/%d/sched", pid);
+
+   inf = fopen(fn, "r");
+   if (!inf)
+   return -2UL;
+
+   while (fgets(buf, 4096, inf)) {
+   if (!strncmp(buf, "core_cookie", 11)) 

[PATCH resend 0/8] Core sched remaining patches rebased

2021-03-24 Thread Joel Fernandes (Google)
 Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
=
- Fixes for couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for process in different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Future work
===
- Load balancing/Migration fixes for core scheduling.
  With v6, Load balancing is partially coresched aware, but has some
  issues w.r.t process/taskgroup weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (3):
kselftest: Add tests for core-sched interface
Documentation: Add core scheduling documentation
sched: Debug bits...

Josh Don (2):
sched: core scheduling tagging infrastructure
sched: cgroup cookie API for core scheduling

chris hyser (2):
sched: prctl() cookie manipulation for core scheduling
kselftest: Add test for core sched prctl interface

.../admin-guide/hw-vuln/core-scheduling.rst   | 460 ++
Documentation/admin-guide/hw-vuln/index.rst   |   1 +
fs/exec.c |   4 +-
include/linux/sched.h |  35 +-
include/linux/sched/task.h|   4 +-
include/uapi/linux/prctl.h|   7 +
kernel/fork.c |   1 +
kernel/sched/Makefile |   1 +
kernel/sched/core.c   | 212 -
kernel/sched/coretag.c| 587 +
kernel/sched/debug.c  |   4 +
kernel/sched/fair.c   |  41 +-
kernel/sched/sched.h  | 151 +++-
kernel/sys.c  |   7 +
tools/include/uapi/linux/prctl.h  |   7 +
tools/testing/selftests/sched/.gitignore  |   2 +
tools/testing/selftests/sched/Makefile|  14 +
tools/testing/selftests/sched/config  |   1 +
tools/testing/selftests/sched/cs_prctl_test.c | 370 
.../testing/selftests/sched/test_coresched.c  | 812 ++
20 files changed, 2659 insertions(+), 62 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
create mode 100644 kernel/sched/coretag.c
create mode 100644 tools/testing/selftests/sched/.gitignore
create mode 100644 tools/testing/selftests/sched/Makefile
create mode 100644 tools/testing/selftests/sched/config
create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c
create mode 100644 tools/testing/selftests/sched/test_coresched.c

--
2.31.0.291.g576ba9dcdaf-goog



[PATCH 1/6] sched: migration changes for core scheduling

2021-03-19 Thread Joel Fernandes (Google)
From: Aubrey Li 

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move task from busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match with the destination CPU's
 core cookie, this task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
 In the fast path of task wakeup, select the first cookie matched
 idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches with task's cookie

 - Don't migrate task if cookie not match
 For the NUMA load balance, don't migrate task to the CPU whose
 core cookie does not match with task's cookie

Tested-by: Julien Desfossez 
Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 33 +---
 kernel/sched/sched.h | 72 
 2 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7f90765f7fd..fddd7c44bbf3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6129,7 +6140,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+
+   if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+   sched_cpu_cookie_match(cpu_rq(cpu), p))
break;
}
 
@@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 * We do not migrate tasks that are:
 * 1) throttled_lb_pair, or
 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-* 3) running (obviously), or
-* 4) are cache-hot on their current CPU.
+* 3) task's cookie does not match with this CPU's core cookie
+* 4) running (obviously), or
+* 5) are cache-hot on their current CPU.
 */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
return 0;
}
 
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 0;
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8813,10 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+   /* Skip over this group if no cookie matched */
+   if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+   continue;
+
local_group = cpumask_test_cpu(this_cpu,
   sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d563b3f97789..877f77044b39 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1125,6 +1125,7 @@ static inline bool is_migration_disabled(struct 
task_struct *p)
 
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled)

[PATCH 2/6] sched: tagging interface for core scheduling

2021-03-19 Thread Joel Fernandes (Google)
From: Josh Don 

Adds per-task and per-cgroup interfaces for specifying which tasks can
co-execute on adjacent SMT hyperthreads via core scheduling.

The per-task interface hooks are implemented here, but are not currently
used. The following patch adds a prctl interface which then takes
advantage of these.

The cgroup interface can be used to toggle a unique cookie value for all
descendant tasks, preventing these tasks from sharing with any others.
See Documentation/admin-guide/hw-vuln/core-scheduling.rst for a full
rundown.

One important property of this interface is that neither the per-task
nor the per-cgroup setting overrides the other. For example, if two
tasks are in different cgroups, and one or both of the cgroups is tagged
using the per-cgroup interface, then these tasks cannot share, even if
they use the per-task interface to attempt to share with one another.

The above is implemented by making the overall core scheduling cookie a
compound structure, containing both a task-level cookie and a
group-level cookie. Two tasks will only be allowed to share if all
fields of their respective cookies match.

Core scheduler has extra overhead.  Enable it only for machines with
more than one SMT hardware thread.

Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Co-developed-by: Joel Fernandes (Google) 
Signed-off-by: Joel Fernandes (Google) 
Tested-by: Julien Desfossez 
Signed-off-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Josh Don 
---
 include/linux/sched.h  |  20 ++-
 kernel/fork.c  |   1 +
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c| 172 +-
 kernel/sched/coretag.c | 397 +
 kernel/sched/debug.c   |   4 +
 kernel/sched/sched.h   |  85 +++--
 7 files changed, 619 insertions(+), 61 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 344432130b8f..9031aa8fee5b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -629,6 +629,20 @@ struct wake_q_node {
struct wake_q_node *next;
 };
 
+#ifdef CONFIG_SCHED_CORE
+struct sched_core_cookie {
+   unsigned long task_cookie;
+   unsigned long group_cookie;
+
+   /* A u64 representation of the cookie used only for display to
+* userspace. We avoid exposing the actual cookie contents, which
+* are kernel pointers.
+*/
+   u64 userspace_id;
+};
+#endif
+
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -687,7 +701,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
-   unsigned long   core_cookie;
+   struct sched_core_cookie   core_cookie;
unsigned intcore_occupation;
 #endif
 
@@ -2076,7 +2090,6 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
-#ifdef CONFIG_SCHED_CORE
 enum ht_protect_ctx {
HT_PROTECT_SYSCALL,
HT_PROTECT_IRQ,
@@ -2084,15 +2097,18 @@ enum ht_protect_ctx {
HT_PROTECT_FROM_IDLE
 };
 
+#ifdef CONFIG_SCHED_CORE
 void sched_core_unsafe_enter(enum ht_protect_ctx ctx);
 void sched_core_unsafe_exit(enum ht_protect_ctx ctx);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(enum ht_protect_ctx ctx);
+void sched_tsk_free(struct task_struct *tsk);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
 #endif
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 073047b13126..2e3024a6f6e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -164,11 +164,13 @@ static inline bool prio_less(struct task_struct *a, 
struct task_struct *b, bool
 
 static inline bool __sched_core_less(struct task_struct *a, struct task_struct 
*b)
 {
-   if (a->core_cook

[PATCH 3/6] sched: prctl() cookie manipulation for core scheduling.

2021-03-19 Thread Joel Fernandes (Google)
From: chris hyser 

This patch provides support for setting, clearing and copying core
scheduling 'task cookies' between threads (PID), processes (TGID), and
process groups (PGID).

The value of core scheduling isn't that tasks don't share a core, 'nosmt'
can do that. The value lies in exploiting all the sharing opportunities
that exist to recover possible lost performance, and that requires a degree
of flexibility in the API. From a security perspective (and there are
others), the thread, process and process group distinction is an existing
hierarchical categorization of tasks that reflects many of the security
concerns about 'data sharing'. For example, protecting against
cache-snooping by a thread that can just read the memory directly isn't all
that useful. With this in mind, subcommands to CLEAR/CREATE/SHARE (TO/FROM)
provide a mechanism to create, clear and share cookies.
CLEAR/CREATE/SHARE_TO specify a target pid with enum pidtype used to
specify the scope of the targeted tasks. For example, PIDTYPE_TGID will
share the cookie with the process and all of its threads, as is typically
desired in a security scenario.

API:

prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, tgtpid, pidtype, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, srcpid, 0, 0)
prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, 0)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.
PIDTYPE_SID, sharing a cookie with an entire session, was considered less
useful given the choice to create a new cookie on task exec().

For return values, EINVAL, ENOMEM are what they say. ESRCH means the
tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission access
to tgtpid/srcpid. EACCES indicates that a task in the target pidtype group was
not updated due to permission.

In terms of interaction with the cgroup interface, task cookies are set
independently of cgroup core scheduling cookies and thus would allow use
for tasks within a container using cgroup cookies.

Current hard-coded policies are:
- a user can clear the cookie of any process they can set a cookie for.
Lack of a cookie *might* be a security issue if cookies are being used
for that.
- on fork of a parent with a cookie, both process and thread child tasks
get a copy.
- on exec a task with a cookie is given a new cookie

Signed-off-by: Chris Hyser 
Signed-off-by: Josh Don 
---
 include/linux/sched.h|   7 ++
 include/linux/sched/task.h   |   4 +-
 include/uapi/linux/prctl.h   |   7 ++
 kernel/sched/core.c  |  11 +-
 kernel/sched/coretag.c   | 197 +--
 kernel/sched/sched.h |   2 +
 kernel/sys.c |   7 ++
 tools/include/uapi/linux/prctl.h |   7 ++
 8 files changed, 230 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9031aa8fee5b..6ccbdbf7048b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2102,13 +2102,20 @@ void sched_core_unsafe_enter(enum ht_protect_ctx ctx);
 void sched_core_unsafe_exit(enum ht_protect_ctx ctx);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(enum ht_protect_ctx ctx);
+int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type);
 void sched_tsk_free(struct task_struct *tsk);
+int sched_core_exec(void);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+static inline int sched_core_share_pid(unsigned long flags, pid_t pid, enum pid_type type)
+{
+   return 0;
+}
 #define sched_tsk_free(tsk) do { } while (0)
+static inline int sched_core_exec(void) { return 0; }
 #endif
 
 #endif
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 85fb2f34c59b..033033ed641e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -94,9 +94,9 @@ extern void free_task(struct task_struct *tsk);
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
-extern void sched_exec(void);
+int sched_exec(void);
 #else
-#define sched_exec()   {}
+static inline int sched_exec(void) { return 0; }
 #endif
 
 static inline struct task_struct *get_task_struct(struct task_struct *t)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..40c7241f5fcb 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,11 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER  57
 #define PR_GET_IO_FLUSHER  58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE    59
+# define PR_SCHED_CORE_CLEAR   0 /* clear core_sched 

[PATCH 4/6] kselftest: Add tests for core-sched interface

2021-03-19 Thread Joel Fernandes (Google)
Add a kselftest test to ensure that the core-sched interface is working
correctly.

Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Co-developed-by: Josh Don 
Signed-off-by: Josh Don 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|  14 +
 tools/testing/selftests/sched/config  |   1 +
 tools/testing/selftests/sched/cs_prctl_test.c | 372 
 .../testing/selftests/sched/test_coresched.c  | 812 ++
 5 files changed, 1200 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore 
b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index ..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile 
b/tools/testing/selftests/sched/Makefile
new file mode 100644
index ..830766e12bed
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+ $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched cs_prctl_test
+TEST_PROGS := test_coresched cs_prctl_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config 
b/tools/testing/selftests/sched/config
new file mode 100644
index ..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/cs_prctl_test.c 
b/tools/testing/selftests/sched/cs_prctl_test.c
new file mode 100644
index ..9e51874533c8
--- /dev/null
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -0,0 +1,372 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Use the core scheduling prctl() to test core scheduling cookies control.
+ *
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ * Author: Chris Hyser 
+ *
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, see <http://www.gnu.org/licenses>.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include 
+static pid_t gettid(void)
+{
+   return syscall(SYS_gettid);
+}
+#endif
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE    59
+# define PR_SCHED_CORE_CLEAR   0
+# define PR_SCHED_CORE_CREATE  1
+# define PR_SCHED_CORE_SHARE_FROM  2
+# define PR_SCHED_CORE_SHARE_TO3
+#endif
+
+#define MAX_PROCESSES 128
+#define MAX_THREADS   128
+
+static const char USAGE[] = "cs_prctl_test [options]\n"
+"options:\n"
+"  -P  : number of processes to create.\n"
+"  -T  : number of threads per process to create.\n"
+"  -d  : delay time to keep tasks alive.\n"
+"  -k  : keep tasks alive until keypress.\n";
+
+enum pid_type {PIDTYPE_PID = 0, PIDTYPE_TGID, PIDTYPE_PGID};
+
+const int THREAD_CLONE_FLAGS = CLONE_THREAD | CLONE_SIGHAND | CLONE_FS | CLONE_VM | CLONE_FILES;
+
+static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4,
+  unsigned long arg5)
+{
+   int res;
+
+   res = prctl(option, arg2, arg3, arg4, arg5);
+   printf("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, 
(long)arg3,
+  (long)arg4, arg5);
+   return res;
+}
+
+#define STACK_SIZE (1024 * 1024)
+
+#define handle_error(msg) __handle_error(__FILE__, __LINE__, msg)
+static void __handle_error(char *fn, int ln, char *msg)
+{
+   printf("(%s:%d) - ", fn, ln);
+   perror(msg);
+   exit(EXIT_FAILURE);
+}
+
+static void handle_usage(int rc, char *msg)
+{
+   puts(USAGE);
+   puts(msg);
+   putchar('\n');
+   exit(rc);
+}
+
+static unsigned long get_cs_cookie
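
The listing is truncated here. The cut-off get_cs_cookie() helper is
presumably how the test reads a task's cookie back for verification. A rough,
purely hypothetical sketch of such a helper - it assumes, as a guess rather
than something shown above, that the kernel side exposes a "core_cookie"
field in /proc/<pid>/sched - could look like the following (with the usual
<fcntl.h>, <stdio.h>, <stdlib.h>, <string.h> and <unistd.h> headers):

    /* Hypothetical stand-in for the truncated helper above. */
    static unsigned long read_task_cookie(pid_t pid)
    {
            char path[64], buf[4096], *line;
            unsigned long cookie = 0;
            int fd, n;

            snprintf(path, sizeof(path), "/proc/%d/sched", pid);
            fd = open(path, O_RDONLY);
            if (fd == -1)
                    handle_error("open /proc/<pid>/sched");

            n = read(fd, buf, sizeof(buf) - 1);
            close(fd);
            if (n <= 0)
                    return 0;
            buf[n] = '\0';

            /* Look for a "core_cookie : <value>" line, if one is exposed. */
            line = strstr(buf, "core_cookie");
            if (line) {
                    line = strchr(line, ':');
                    if (line)
                            cookie = strtoul(line + 1, NULL, 10);
            }
            return cookie;
    }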

[PATCH 5/6] Documentation: Add core scheduling documentation

2021-03-19 Thread Joel Fernandes (Google)
Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Chris Hyser 
Co-developed-by: Vineeth Pillai 
Co-developed-by: Josh Don 
Signed-off-by: Josh Don 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Chris Hyser 
Tested-by: Julien Desfossez 
Reviewed-by: Randy Dunlap 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 460 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 461 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst 
b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..50042e79709d
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,460 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance; however, performance is not guaranteed to improve in
+every case, though that has been seen with a number of real world
+workloads. In theory, core scheduling aims to perform at least as well as when
+Hyper Threading is disabled. In practice, this is mostly the case, though not
+always: synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This results in all the CGroup's tasks being allowed to run concurrently on a
+core's hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via ``cpu.core_tag``, it is not possible to
+          set this for any descendant of the tagged group.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+          a core with kernel threads and untagged system threads. For this
+          reason, if a group has ``cpu.core_tag`` of 0, it is considered to be
+          trusted.
+
+prctl(2) interface
+##################
+
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` provides an interface for the
+creation of core scheduling groups and for the admission and removal of tasks
+from those groups. Permission to change the ``cookie``, and hence the core
+scheduling group it represents, is based on ``ptrace`` access.
+
+::
+
+#include 
+
+int prctl(int option, unsigned long arg2, unsigned long arg3,
+          unsigned long arg4, unsigned long arg5);
+
+int prctl(PR_SCHED_CORE_SHARE, sub_command, pid, pid_type, 0);
+
+option:
+``PR_SCHED_CORE_SHARE``
+
+arg2:
+sub-command:
+
+- ``PR_SCHED_CORE_CLEAR       0  -- clear core_sched cookie of pid``
+- ``PR_SCHED_CORE_CREATE      1  -- create a new cookie for pid``
+- ``PR_SCHED_CORE_SHARE_FROM  2  -- copy core_sched cookie from pid``
+- ``PR_SCHED_CORE_SHARE_TO    3  -- copy core_sched cookie to pid``
+
+arg3:
+``pid`` of the task for which the operation applies where ``pid == 0``
+implies current process.
+
+arg4:
+``pid_type`` for PR_SCHED_CORE_CLEAR/CREATE/SHARE_TO is an enum
+{PIDTYPE_PID=0, PIDTYPE_TGID, PIDTYPE_PGID} and determines how the target
+``pid`` should be interpreted. ``PIDTYPE_PID`` indicates that the target
+``pid`` should be treated as an individual task, ``PIDTYPE_TGID`` as a process
+or thread group, and ``PIDTYPE_PGID`` as a process group.
+
+arg5:
+MUST be equal to 0.
+
+Return Value:
+::
+
+EINVAL - bad parameters
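
Purely as an illustration of the interface described above - this is not code
from the series itself - a minimal userspace program built on the constants
and semantics shown in this excerpt (and a kernel with CONFIG_SCHED_CORE)
might look like:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/prctl.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #ifndef PR_SCHED_CORE_SHARE
    #define PR_SCHED_CORE_SHARE        59
    # define PR_SCHED_CORE_CLEAR        0
    # define PR_SCHED_CORE_CREATE       1
    # define PR_SCHED_CORE_SHARE_FROM   2
    # define PR_SCHED_CORE_SHARE_TO     3
    #endif

    #define PIDTYPE_PID 0   /* interpret the pid argument as a single task */

    int main(void)
    {
            pid_t child;

            /* Give the current task a fresh core-scheduling cookie. */
            if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CREATE, 0,
                      PIDTYPE_PID, 0) < 0) {
                    perror("PR_SCHED_CORE_CREATE");
                    return EXIT_FAILURE;
            }

            child = fork();
            if (child < 0) {
                    perror("fork");
                    return EXIT_FAILURE;
            }
            if (child == 0) {       /* child just waits around */
                    pause();
                    _exit(0);
            }

            /* Push our cookie to the child so the two may share a core. */
            if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, child,
                      PIDTYPE_PID, 0) < 0)
                    perror("PR_SCHED_CORE_SHARE_TO");

            kill(child, SIGTERM);
            waitpid(child, NULL, 0);
            return 0;
    }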

[PATCH 6/6] sched: Debug bits...

2021-03-19 Thread Joel Fernandes (Google)
Tested-by: Julien Desfossez 
Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 40 +++-
 kernel/sched/fair.c |  9 +
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a62e8ad5ce58..58cca96ba93d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -147,6 +147,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b, bool
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -312,12 +316,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -5503,6 +5511,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
set_next_task(rq, next);
}
 
+   trace_printk("pick pre selected (%u %u %u): %s/%d %llu\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie.userspace_id);
+
rq->core_pick = NULL;
return next;
}
@@ -5597,6 +5612,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_forceidle_seq++;
}
 
+   trace_printk("cpu(%d): selected: %s/%d %llu\n",
+i, p->comm, p->pid,
+p->core_cookie.userspace_id);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -5613,6 +5632,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %llu\n",
+max->comm, max->pid,
+max->core_cookie.userspace_id);
+
if (old_max) {
rq->core->core_forceidle = false;
for_each_cpu(j, smt_mask) {
@@ -5634,6 +5657,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+   trace_printk("picked: %s/%d %llu\n", next->comm, next->pid,
+next->core_cookie.userspace_id);
 
/*
 * Reschedule siblings
@@ -5675,13 +5700,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
}
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%llu/0x%llu\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie.userspace_id,
+rq_i->core->core_cookie.userspace_id);
+   WARN_ON_ONCE(1);
+   }
 
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
 
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
 
@@ -5721,6 +5754,11 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %llu\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation,
+cookie->userspace_id);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);

[PATCH 0/6] Core scheduling remaining patches

2021-03-19 Thread Joel Fernandes (Google)
 threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
=
- Fixes for couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for process in different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Future work
===
- Load balancing/Migration fixes for core scheduling.
  With v6, Load balancing is partially coresched aware, but has some
  issues w.r.t process/taskgroup weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (3):
kselftest: Add tests for core-sched interface
Documentation: Add core scheduling documentation
sched: Debug bits...

Josh Don (1):
sched: tagging interface for core scheduling

chris hyser (1):
sched: prctl() cookie manipulation for core scheduling.

.../admin-guide/hw-vuln/core-scheduling.rst   | 460 ++
Documentation/admin-guide/hw-vuln/index.rst   |   1 +
include/linux/sched.h |  27 +-
include/linux/sched/task.h|   4 +-
include/uapi/linux/prctl.h|   7 +
kernel/fork.c |   1 +
kernel/sched/Makefile |   1 +
kernel/sched/core.c   | 223 +++--
kernel/sched/coretag.c| 578 +
kernel/sched/debug.c  |   4 +
kernel/sched/fair.c   |  42 +-
kernel/sched/sched.h  | 155 +++-
kernel/sys.c  |   7 +
tools/include/uapi/linux/prctl.h  |   7 +
tools/testing/selftests/sched/.gitignore  |   1 +
tools/testing/selftests/sched/Makefile|  14 +
tools/testing/selftests/sched/config  |   1 +
tools/testing/selftests/sched/cs_prctl_test.c | 372 
.../testing/selftests/sched/test_coresched.c  | 812 ++
19 files changed, 2649 insertions(+), 68 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
create mode 100644 kernel/sched/coretag.c
create mode 100644 tools/testing/selftests/sched/.gitignore
create mode 100644 tools/testing/selftests/sched/Makefile
create mode 100644 tools/testing/selftests/sched/config
create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c
create mode 100644 tools/testing/selftests/sched/test_coresched.c

--
2.31.0.rc2.261.g7f71774620-goog



[tip: core/rcu] rcu/tree: Make rcu_do_batch count how many callbacks were executed

2021-02-15 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 6bc335828056f3b301a3deadda782de4e8f0db08
Gitweb:
https://git.kernel.org/tip/6bc335828056f3b301a3deadda782de4e8f0db08
Author:Joel Fernandes (Google) 
AuthorDate:Tue, 03 Nov 2020 09:25:57 -05:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 04 Jan 2021 13:22:12 -08:00

rcu/tree: Make rcu_do_batch count how many callbacks were executed

The rcu_do_batch() function extracts the ready-to-invoke callbacks
from the rcu_segcblist located in the ->cblist field of the current
CPU's rcu_data structure.  These callbacks are first moved to a local
(unsegmented) rcu_cblist.  The rcu_do_batch() function then uses this
rcu_cblist's ->len field to count how many CBs it has invoked, but it
does so by counting that field down from zero.  Finally, this function
negates the value in this ->len field (resulting in a positive number)
and subtracts the result from the ->len field of the current CPU's
->cblist field.

Except that it is sometimes necessary for rcu_do_batch() to stop invoking
callbacks mid-stream, despite there being more ready to invoke, for
example, if a high-priority task wakes up.  In this case the remaining
not-yet-invoked callbacks are requeued back onto the CPU's ->cblist,
but remain in the ready-to-invoke segment of that list.  As above, the
negative of the local rcu_cblist's ->len field is still subtracted from
the ->len field of the current CPU's ->cblist field.

The design of counting down from 0 is confusing and error-prone, plus
use of a positive count will make it easier to provide a uniform and
consistent API to deal with the per-segment counts that are added
later in this series.  For example, rcu_segcblist_extract_done_cbs()
can unconditionally populate the resulting unsegmented list's ->len
field during extraction.

This commit therefore explicitly counts how many callbacks were executed
in rcu_do_batch() itself, counting up from zero, and then uses that
to update the per-CPU segcb list's ->len field, without relying on the
downcounting of rcl->len from zero.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.c |  2 +-
 kernel/rcu/rcu_segcblist.h |  1 +
 kernel/rcu/tree.c  | 11 +--
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 2d2a6b6..bb246d8 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -95,7 +95,7 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
  * This increase is fully ordered with respect to the callers accesses
  * both before and after.
  */
-static void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
+void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
 {
 #ifdef CONFIG_RCU_NOCB_CPU
smp_mb__before_atomic(); /* Up to the caller! */
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 492262b..1d2d614 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -76,6 +76,7 @@ static inline bool rcu_segcblist_restempty(struct 
rcu_segcblist *rsclp, int seg)
 }
 
 void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp);
+void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v);
 void rcu_segcblist_init(struct rcu_segcblist *rsclp);
 void rcu_segcblist_disable(struct rcu_segcblist *rsclp);
 void rcu_segcblist_offload(struct rcu_segcblist *rsclp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 40e5e3d..cc6f379 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2434,7 +2434,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
	const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
struct rcu_head *rhp;
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
-   long bl, count;
+   long bl, count = 0;
long pending, tlimit = 0;
 
/* If no callbacks are ready, just return. */
@@ -2479,6 +2479,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
	for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
rcu_callback_t f;
 
+   count++;
debug_rcu_head_unqueue(rhp);
 
		rcu_lock_acquire(&rcu_callback_map);
@@ -2492,15 +2493,14 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
/*
 * Stop only if limit reached and CPU has something to do.
-* Note: The rcl structure counts down from zero.
 */
-   if (-rcl.len >= bl && !offloaded &&
+   if (count >= bl && !offloaded &&
(need_resched() ||
	     (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
break;
if (unlikely(tlimit)) {
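
In short, the new accounting, pieced together from the hunks above (this is a
recap, not a hunk from the patch): callbacks are counted upward as they are
invoked and then charged against the per-CPU list in a single adjustment:

    long count = 0;

    for (rhp = rcu_cblist_dequeue(&rcl); rhp; rhp = rcu_cblist_dequeue(&rcl)) {
            count++;
            /* ... invoke the callback ... */
    }

    /* One downward adjustment, by however many callbacks actually ran. */
    rcu_segcblist_add_len(&rdp->cblist, -count);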
  

[tip: core/rcu] rcu/segcblist: Add additional comments to explain smp_mb()

2021-02-12 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: c2e13112e830c06825339cbadf0b3bc2bdb9a716
Gitweb:
https://git.kernel.org/tip/c2e13112e830c06825339cbadf0b3bc2bdb9a716
Author:Joel Fernandes (Google) 
AuthorDate:Tue, 03 Nov 2020 09:26:03 -05:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:23:23 -08:00

rcu/segcblist: Add additional comments to explain smp_mb()

One counter-intuitive property of RCU is the fact that full memory
barriers are needed both before and after updates to the full
(non-segmented) length.  This patch therefore helps to assist the
reader's intuition by adding appropriate comments.

[ paulmck:  Wordsmithing. ]
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.c | 68 ++---
 1 file changed, 64 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index bb246d8..3cff800 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -94,17 +94,77 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
  * field to disagree with the actual number of callbacks on the structure.
  * This increase is fully ordered with respect to the callers accesses
  * both before and after.
+ *
+ * So why on earth is a memory barrier required both before and after
+ * the update to the ->len field???
+ *
+ * The reason is that rcu_barrier() locklessly samples each CPU's ->len
+ * field, and if a given CPU's field is zero, avoids IPIing that CPU.
+ * This can of course race with both queuing and invoking of callbacks.
+ * Failing to correctly handle either of these races could result in
+ * rcu_barrier() failing to IPI a CPU that actually had callbacks queued
+ * which rcu_barrier() was obligated to wait on.  And if rcu_barrier()
+ * failed to wait on such a callback, unloading certain kernel modules
+ * would result in calls to functions whose code was no longer present in
+ * the kernel, for but one example.
+ *
+ * Therefore, ->len transitions from 1->0 and 0->1 have to be carefully
+ * ordered with respect with both list modifications and the rcu_barrier().
+ *
+ * The queuing case is CASE 1 and the invoking case is CASE 2.
+ *
+ * CASE 1: Suppose that CPU 0 has no callbacks queued, but invokes
+ * call_rcu() just as CPU 1 invokes rcu_barrier().  CPU 0's ->len field
+ * will transition from 0->1, which is one of the transitions that must
+ * be handled carefully.  Without the full memory barriers after the ->len
+ * update and at the beginning of rcu_barrier(), the following could happen:
+ *
+ * CPU 0   CPU 1
+ *
+ * call_rcu().
+ * rcu_barrier() sees ->len as 0.
+ * set ->len = 1.
+ * rcu_barrier() does nothing.
+ * module is unloaded.
+ * callback invokes unloaded function!
+ *
+ * With the full barriers, any case where rcu_barrier() sees ->len as 0 will
+ * have unambiguously preceded the return from the racing call_rcu(), which
+ * means that this call_rcu() invocation is OK to not wait on.  After all,
+ * you are supposed to make sure that any problematic call_rcu() invocations
+ * happen before the rcu_barrier().
+ *
+ *
+ * CASE 2: Suppose that CPU 0 is invoking its last callback just as
+ * CPU 1 invokes rcu_barrier().  CPU 0's ->len field will transition from
+ * 1->0, which is one of the transitions that must be handled carefully.
+ * Without the full memory barriers before the ->len update and at the
+ * end of rcu_barrier(), the following could happen:
+ *
+ * CPU 0   CPU 1
+ *
+ * start invoking last callback
+ * set ->len = 0 (reordered)
+ * rcu_barrier() sees ->len as 0
+ * rcu_barrier() does nothing.
+ * module is unloaded
+ * callback executing after unloaded!
+ *
+ * With the full barriers, any case where rcu_barrier() sees ->len as 0
+ * will be fully ordered after the completion of the callback function,
+ * so that the module unloading operation is completely safe.
+ *
  */
 void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
 {
 #ifdef CONFIG_RCU_NOCB_CPU
-   smp_mb__before_atomic(); /* Up to the caller! */
+   smp_mb__before_atomic(); // Read header comment above.
	atomic_long_add(v, &rsclp->len);
-   smp_mb__after_atomic(); /* Up to the caller! */
+   smp_mb__after_atomic();  // Read header comment above.
 #else
-   smp_mb(); /* Up to the caller! */
+   smp_mb(); // Read header comment above.
WRITE_ONCE(rsclp->len, rsclp->len + v);
-   smp_mb(); /* Up to the caller! */
+   smp_mb(); // Read header comment above.
 #endif
 }
 


[tip: core/rcu] rcu/segcblist: Add counters to segcblist datastructure

2021-02-12 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: ae5c2341ed3987bd434ed495bd4f3d8b2bc3e623
Gitweb:
https://git.kernel.org/tip/ae5c2341ed3987bd434ed495bd4f3d8b2bc3e623
Author:Joel Fernandes (Google) 
AuthorDate:Wed, 23 Sep 2020 11:22:09 -04:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/segcblist: Add counters to segcblist datastructure

Add counting of segment lengths of segmented callback list.

This will be useful for a number of things, such as knowing how big the
ready-to-execute segment has gotten. The immediate benefit is the ability
to trace how the callbacks in the segmented callback list change.

Also, this patch removes hacks related to using the donecbs's ->len field as a
temporary variable to save the segmented callback list's length. This cannot be
done anymore and is not needed.

Also fix SRCU:
The negative counting of the unsegmented list cannot be used to adjust
the segmented one. To fix this, sample the unsegmented length in
advance, and use it after CB execution to adjust the segmented list's
length.

Reviewed-by: Frederic Weisbecker 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcu_segcblist.h |   1 +-
 kernel/rcu/rcu_segcblist.c| 120 +
 kernel/rcu/rcu_segcblist.h|   2 +-
 kernel/rcu/srcutree.c |   5 +-
 4 files changed, 82 insertions(+), 46 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index b36afe7..6c01f09 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -72,6 +72,7 @@ struct rcu_segcblist {
 #else
long len;
 #endif
+   long seglen[RCU_CBLIST_NSEGS];
u8 enabled;
u8 offloaded;
 };
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 3cff800..804 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -7,10 +7,10 @@
  * Authors: Paul E. McKenney 
  */
 
-#include 
-#include 
+#include 
 #include 
-#include 
+#include 
+#include 
 
 #include "rcu_segcblist.h"
 
@@ -88,6 +88,46 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
 #endif
 }
 
+/* Get the length of a segment of the rcu_segcblist structure. */
+static long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg)
+{
+   return READ_ONCE(rsclp->seglen[seg]);
+}
+
+/* Set the length of a segment of the rcu_segcblist structure. */
+static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
+{
+   WRITE_ONCE(rsclp->seglen[seg], v);
+}
+
+/* Increase the numeric length of a segment by a specified amount. */
+static void rcu_segcblist_add_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
+{
+   WRITE_ONCE(rsclp->seglen[seg], rsclp->seglen[seg] + v);
+}
+
+/* Move from's segment length to to's segment. */
+static void rcu_segcblist_move_seglen(struct rcu_segcblist *rsclp, int from, 
int to)
+{
+   long len;
+
+   if (from == to)
+   return;
+
+   len = rcu_segcblist_get_seglen(rsclp, from);
+   if (!len)
+   return;
+
+   rcu_segcblist_add_seglen(rsclp, to, len);
+   rcu_segcblist_set_seglen(rsclp, from, 0);
+}
+
+/* Increment segment's length. */
+static void rcu_segcblist_inc_seglen(struct rcu_segcblist *rsclp, int seg)
+{
+   rcu_segcblist_add_seglen(rsclp, seg, 1);
+}
+
 /*
  * Increase the numeric length of an rcu_segcblist structure by the
  * specified amount, which can be negative.  This can cause the ->len
@@ -180,26 +220,6 @@ void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp)
 }
 
 /*
- * Exchange the numeric length of the specified rcu_segcblist structure
- * with the specified value.  This can cause the ->len field to disagree
- * with the actual number of callbacks on the structure.  This exchange is
- * fully ordered with respect to the callers accesses both before and after.
- */
-static long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
-{
-#ifdef CONFIG_RCU_NOCB_CPU
-	return atomic_long_xchg(&rsclp->len, v);
-#else
-   long ret = rsclp->len;
-
-   smp_mb(); /* Up to the caller! */
-   WRITE_ONCE(rsclp->len, v);
-   smp_mb(); /* Up to the caller! */
-   return ret;
-#endif
-}
-
-/*
  * Initialize an rcu_segcblist structure.
  */
 void rcu_segcblist_init(struct rcu_segcblist *rsclp)
@@ -209,8 +229,10 @@ void rcu_segcblist_init(struct rcu_segcblist *rsclp)
BUILD_BUG_ON(RCU_NEXT_TAIL + 1 != ARRAY_SIZE(rsclp->gp_seq));
BUILD_BUG_ON(ARRAY_SIZE(rsclp->tails) != ARRAY_SIZE(rsclp->gp_seq));
rsclp->head = NULL;
-   for (i = 0; i < RCU_CBLIST_NSEGS; i++)
+   for (i = 0; i < RCU_CBLIST_NSEGS; i++) {
		rsclp->tails[i] = &rsclp->head;
+   rcu_segcblist_set_seglen(rsclp, i, 0);
+   }
rcu_segcblist_set_len(rs
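
The diff is cut off above; conceptually, the new helpers are there so that a
segment's count travels with its callbacks. A sketch of how they compose (an
illustration operating on an rcu_segcblist *rsclp, not a hunk from the patch):

    /* Callbacks whose grace period has ended move to the DONE segment,
     * and their count moves with them. */
    rcu_segcblist_move_seglen(rsclp, RCU_WAIT_TAIL, RCU_DONE_TAIL);

    /* A newly queued callback bumps the NEXT segment's count. */
    rcu_segcblist_inc_seglen(rsclp, RCU_NEXT_TAIL);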

[tip: core/rcu] rcu/segcblist: Add debug checks for segment lengths

2021-02-12 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: b4e6039e8af8c20dfbbdfcaebfcbd7c9d9ffe713
Gitweb:
https://git.kernel.org/tip/b4e6039e8af8c20dfbbdfcaebfcbd7c9d9ffe713
Author:Joel Fernandes (Google) 
AuthorDate:Wed, 18 Nov 2020 11:15:41 -05:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/segcblist: Add debug checks for segment lengths

This commit adds debug checks near the end of rcu_do_batch() that emit
warnings if an empty rcu_segcblist structure has non-zero segment counts,
or, conversely, if a non-empty structure has all-zero segment counts.

Signed-off-by: Joel Fernandes (Google) 
[ paulmck: Fix queue/segment-length checks. ]
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.c | 12 
 kernel/rcu/rcu_segcblist.h |  3 +++
 kernel/rcu/tree.c  |  8 ++--
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 1e80a0a..89e0dff 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -94,6 +94,18 @@ static long rcu_segcblist_get_seglen(struct rcu_segcblist 
*rsclp, int seg)
return READ_ONCE(rsclp->seglen[seg]);
 }
 
+/* Return number of callbacks in segmented callback list by summing seglen. */
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp)
+{
+   long len = 0;
+   int i;
+
+   for (i = RCU_DONE_TAIL; i < RCU_CBLIST_NSEGS; i++)
+   len += rcu_segcblist_get_seglen(rsclp, i);
+
+   return len;
+}
+
 /* Set the length of a segment of the rcu_segcblist structure. */
 static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
 {
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index cd35c9f..18e101d 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -15,6 +15,9 @@ static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp)
return READ_ONCE(rclp->len);
 }
 
+/* Return number of callbacks in segmented callback list by summing seglen. */
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp);
+
 void rcu_cblist_init(struct rcu_cblist *rclp);
 void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp);
 void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 6bf269c..8086c04 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2434,6 +2434,7 @@ int rcutree_dead_cpu(unsigned int cpu)
 static void rcu_do_batch(struct rcu_data *rdp)
 {
int div;
+   bool __maybe_unused empty;
unsigned long flags;
	const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
struct rcu_head *rhp;
@@ -2548,9 +2549,12 @@ static void rcu_do_batch(struct rcu_data *rdp)
 * The following usually indicates a double call_rcu().  To track
 * this down, try building with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y.
 */
-	WARN_ON_ONCE(count == 0 && !rcu_segcblist_empty(&rdp->cblist));
+	empty = rcu_segcblist_empty(&rdp->cblist);
+	WARN_ON_ONCE(count == 0 && !empty);
	WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
-		     count != 0 && rcu_segcblist_empty(&rdp->cblist));
+		     count != 0 && empty);
+	WARN_ON_ONCE(count == 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) != 0);
+	WARN_ON_ONCE(!empty && rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
 
rcu_nocb_unlock_irqrestore(rdp, flags);
 


[tip: core/rcu] rcu/tree: segcblist: Remove redundant smp_mb()s

2021-02-12 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 68804cf1c905ce227e4e1d0bc252c216811c59fd
Gitweb:
https://git.kernel.org/tip/68804cf1c905ce227e4e1d0bc252c216811c59fd
Author:Joel Fernandes (Google) 
AuthorDate:Wed, 14 Oct 2020 18:21:53 -04:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/tree: segcblist: Remove redundant smp_mb()s

The full memory barriers in rcu_segcblist_enqueue() and in rcu_do_batch()
are not needed because rcu_segcblist_add_len(), and thus also
rcu_segcblist_inc_len(), already includes a memory barrier *before*
and *after* the length of the list is updated.

This commit therefore removes these redundant smp_mb() invocations.

Reviewed-by: Frederic Weisbecker 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.c | 1 -
 kernel/rcu/tree.c  | 1 -
 2 files changed, 2 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 804..1e80a0a 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -327,7 +327,6 @@ void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
   struct rcu_head *rhp)
 {
rcu_segcblist_inc_len(rsclp);
-   smp_mb(); /* Ensure counts are updated before callback is enqueued. */
rcu_segcblist_inc_seglen(rsclp, RCU_NEXT_TAIL);
rhp->next = NULL;
WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cc6f379..b0fb654 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2523,7 +2523,6 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
/* Update counts and requeue any remaining callbacks. */
	rcu_segcblist_insert_done_cbs(&rdp->cblist, &rcl);
-	smp_mb(); /* List handling before counting for rcu_barrier(). */
	rcu_segcblist_add_len(&rdp->cblist, -count);
 
/* Reinstate batch limit if we have worked down the excess. */


[tip: core/rcu] rcu/trace: Add tracing for how segcb list changes

2021-02-12 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 3afe7fa535491ecd0382c3968dc2349602bff8a2
Gitweb:
https://git.kernel.org/tip/3afe7fa535491ecd0382c3968dc2349602bff8a2
Author:Joel Fernandes (Google) 
AuthorDate:Sat, 14 Nov 2020 14:31:32 -05:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/trace: Add tracing for how segcb list changes

This commit adds tracing to track how the segcb list changes before/after
acceleration, during queuing and during dequeuing.

This tracing helped discover an optimization that avoided needless GP
requests when no callbacks were accelerated. The tracing overhead is
minimal as each segment's length is now stored in the respective segment.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 include/trace/events/rcu.h | 26 ++
 kernel/rcu/tree.c  |  9 +
 2 files changed, 35 insertions(+)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 155b5cb..5fc2940 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -505,6 +505,32 @@ TRACE_EVENT_RCU(rcu_callback,
  __entry->qlen)
 );
 
+TRACE_EVENT_RCU(rcu_segcb_stats,
+
+   TP_PROTO(struct rcu_segcblist *rs, const char *ctx),
+
+   TP_ARGS(rs, ctx),
+
+   TP_STRUCT__entry(
+   __field(const char *, ctx)
+   __array(unsigned long, gp_seq, RCU_CBLIST_NSEGS)
+   __array(long, seglen, RCU_CBLIST_NSEGS)
+   ),
+
+   TP_fast_assign(
+   __entry->ctx = ctx;
+   memcpy(__entry->seglen, rs->seglen, RCU_CBLIST_NSEGS * 
sizeof(long));
+   memcpy(__entry->gp_seq, rs->gp_seq, RCU_CBLIST_NSEGS * 
sizeof(unsigned long));
+
+   ),
+
+   TP_printk("%s seglen: (DONE=%ld, WAIT=%ld, NEXT_READY=%ld, 
NEXT=%ld) "
+ "gp_seq: (DONE=%lu, WAIT=%lu, NEXT_READY=%lu, 
NEXT=%lu)", __entry->ctx,
+ __entry->seglen[0], __entry->seglen[1], 
__entry->seglen[2], __entry->seglen[3],
+ __entry->gp_seq[0], __entry->gp_seq[1], 
__entry->gp_seq[2], __entry->gp_seq[3])
+
+);
+
 /*
  * Tracepoint for the registration of a single RCU callback of the special
  * kvfree() form.  The first argument is the RCU type, the second argument
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b0fb654..6bf269c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1495,6 +1495,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, 
struct rcu_data *rdp)
	if (!rcu_segcblist_pend_cbs(&rdp->cblist))
		return false;

+	trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCbPreAcc"));
+
/*
 * Callbacks are often registered with incomplete grace-period
 * information.  Something about the fact that getting exact
@@ -1515,6 +1517,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, 
struct rcu_data *rdp)
else
trace_rcu_grace_period(rcu_state.name, gp_seq_req, 
TPS("AccReadyCB"));
 
+   trace_rcu_segcb_stats(>cblist, TPS("SegCbPostAcc"));
+
return ret;
 }
 
@@ -2471,11 +2475,14 @@ static void rcu_do_batch(struct rcu_data *rdp)
	rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl);
	if (offloaded)
		rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist);
+
+	trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCbDequeued"));
rcu_nocb_unlock_irqrestore(rdp, flags);
 
/* Invoke callbacks. */
tick_dep_set_task(current, TICK_DEP_BIT_RCU);
	rhp = rcu_cblist_dequeue(&rcl);
+
	for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
rcu_callback_t f;
 
@@ -2987,6 +2994,8 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
trace_rcu_callback(rcu_state.name, head,
				   rcu_segcblist_n_cbs(&rdp->cblist));
 
+   trace_rcu_segcb_stats(>cblist, TPS("SegCBQueued"));
+
/* Go handle any RCU core processing required. */
	if (unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
__call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */


[PATCH v10 2/5] sched: CGroup tagging interface for core scheduling

2021-01-22 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after we fork a task, its presence in the core scheduler queue will
need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent a
new task from sneaking into the cgroup and missing out on the update while
we iterate through all the tasks in the cgroup.  A more complicated
scheme could probably avoid the stop machine.  Such a scheme would also
need to resolve any inconsistency between a task's cgroup core scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now, as it avoids
such complications.

The core scheduler has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.

Co-developed-by: Josh Don 
Co-developed-by: Chris Hyser 
Co-developed-by: Joel Fernandes (Google) 
Tested-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Josh Don 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h|  10 +
 include/uapi/linux/prctl.h   |   6 +
 kernel/fork.c|   1 +
 kernel/sched/Makefile|   1 +
 kernel/sched/core.c  | 136 ++-
 kernel/sched/coretag.c   | 669 +++
 kernel/sched/debug.c |   4 +
 kernel/sched/sched.h |  58 ++-
 kernel/sys.c |   7 +
 tools/include/uapi/linux/prctl.h |   6 +
 10 files changed, 878 insertions(+), 20 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7efce9c9d9cf..7ca6f2f72cda 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned long   core_task_cookie;
+   unsigned long   core_group_cookie;
unsigned intcore_occupation;
 #endif
 
@@ -2076,4 +2078,12 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+int sched_core_share_pid(unsigned long flags, pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
+#else
+#define sched_core_share_pid(flags, pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
+#endif
+
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..f8e4e9626121 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER  57
 #define PR_GET_IO_FLUSHER  58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE        59
+# define PR_SCHED_CORE_CLEAR   0  /* clear core_sched cookie of pid */
+# define PR_SCHED_CORE_SHARE_FROM  1  /* get core_sched cookie from pid */
+# define PR_SCHED_CORE_SHARE_TO    2  /* push core_sched cookie to pid */
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20125431af87..a3844e2e7379 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -136,7 +136,33 @@ static inline bool __sched_core_less(struct task_struct 
*a, struct task_struct *
return false;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+static bool sched_core_empty(struct rq *rq)
+{
+	return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+   struct task_struct *task;
+
+	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+   return task;
+}
+
+st
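
The message is truncated here. For illustration only (this is not part of the
patch), tagging a CGroup from userspace with the interface this series adds
amounts to writing "1" to the group's cpu.core_tag file, e.g.:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Tag every task in the CPU-controller cgroup at 'cgroup_path' (a
     * directory the caller created under a mounted cpu cgroup hierarchy). */
    static int tag_cgroup(const char *cgroup_path)
    {
            char path[256];
            int fd, ret = 0;

            snprintf(path, sizeof(path), "%s/cpu.core_tag", cgroup_path);
            fd = open(path, O_WRONLY);
            if (fd < 0) {
                    perror("open cpu.core_tag");
                    return -1;
            }
            if (write(fd, "1", 1) != 1) {
                    perror("write cpu.core_tag");
                    ret = -1;
            }
            close(fd);
            return ret;
    }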

[PATCH v10 3/5] kselftest: Add tests for core-sched interface

2021-01-22 Thread Joel Fernandes (Google)
Add a kselftest test to ensure that the core-sched interface is working
correctly.

Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Tested-by: Julien Desfossez 
Reviewed-by: Josh Don 
Signed-off-by: Josh Don 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|  14 +
 tools/testing/selftests/sched/config  |   1 +
 .../testing/selftests/sched/test_coresched.c  | 716 ++
 4 files changed, 732 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore 
b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index ..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile 
b/tools/testing/selftests/sched/Makefile
new file mode 100644
index ..e43b74fc5d7e
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+ $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched
+TEST_PROGS := test_coresched
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config 
b/tools/testing/selftests/sched/config
new file mode 100644
index ..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/test_coresched.c 
b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index ..4d18a0a727c8
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,716 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+# define PR_SCHED_CORE_CLEAR0
+# define PR_SCHED_CORE_SHARE_FROM   1
+# define PR_SCHED_CORE_SHARE_TO 2
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+   printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+   printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+   if (!cond) {
+   printf("Error: %s\n", str);
+   abort();
+   }
+}
+
+char *make_group_root(void)
+{
+   char *mntpath, *mnt;
+   int ret;
+
+   mntpath = malloc(50);
+   if (!mntpath) {
+   perror("Failed to allocate mntpath\n");
+   abort();
+   }
+
+	sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+   mnt = mkdtemp(mntpath);
+   if (!mnt) {
+   perror("Failed to create mount: ");
+   exit(-1);
+   }
+
+   ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+   if (ret == -1) {
+   perror("Failed to mount cgroup: ");
+   exit(-1);
+   }
+
+   return mnt;
+}
+
+char *read_group_cookie(char *cgroup_path)
+{
+   char path[50] = {}, *val;
+   int fd;
+
+   sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
+   fd = open(path, O_RDONLY, 0666);
+   if (fd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+   }
+
+   val = calloc(1, 50);
+   if (read(fd, val, 50) == -1) {
+   perror("Failed to read group cookie: ");
+   abort();
+   }
+
+   val[strcspn(val, "\r\n")] = 0;
+
+   close(fd);
+   return val;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+   char tag_path[50] = {}, rdbuf[8] = {};
+   int tfd;
+
+   sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+   tfd = open(tag_path, O_RDONLY, 0666);
+   if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+   }
+
+   if (read(tfd, rdbuf, 1) != 1) {
+   perror("Failed to enable coresched on cgroup: ");
+   abort();
+   }
+
+   if (strcmp(rdbuf, tag)) {
+   printf("Group tag does not match

[PATCH v10 1/5] sched: migration changes for core scheduling

2021-01-22 Thread Joel Fernandes (Google)
From: Aubrey Li 

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move task from busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match with the destination CPU's
 core cookie, this task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
 In the fast path of task wakeup, select the first cookie matched
 idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches with task's cookie

 - Don't migrate task if cookie does not match
 For the NUMA load balance, don't migrate a task to a CPU whose
 core cookie does not match the task's cookie

Tested-by: Julien Desfossez 
Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 33 +---
 kernel/sched/sched.h | 72 
 2 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7f90765f7fd..fddd7c44bbf3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6129,7 +6140,9 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+
+   if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+   sched_cpu_cookie_match(cpu_rq(cpu), p))
break;
}
 
@@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 * We do not migrate tasks that are:
 * 1) throttled_lb_pair, or
 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-* 3) running (obviously), or
-* 4) are cache-hot on their current CPU.
+* 3) task's cookie does not match with this CPU's core cookie
+* 4) running (obviously), or
+* 5) are cache-hot on their current CPU.
 */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
return 0;
}
 
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 0;
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8813,10 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+   /* Skip over this group if no cookie matched */
+   if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+   continue;
+
local_group = cpumask_test_cpu(this_cpu,
   sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3efcbc779a75..d6efb1ffc08c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1122,6 +1122,7 @@ static inline bool is_migration_disabled(struct 
task_struct *p)
 
 #ifdef CONFIG_SCHED_CORE
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled)
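
The diff is cut off here. The sched_core_cookie_match()/sched_cpu_cookie_match()
helpers used in the fair.c hunks live in the truncated sched.h portion;
conceptually they reduce to "everything matches unless core scheduling is
enabled, otherwise compare the task's cookie with the core's cookie", roughly
along these lines (an approximation, not the patch text):

    static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
    {
            /* Ignore cookie match if core scheduler is not enabled on the CPU. */
            if (!sched_core_enabled(rq))
                    return true;

            return rq->core->core_cookie == p->core_cookie;
    }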

[PATCH v10 5/5] sched: Debug bits...

2021-01-22 Thread Joel Fernandes (Google)
Tested-by: Julien Desfossez 
Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 35 ++-
 kernel/sched/fair.c |  9 +
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a3844e2e7379..56ba2ca4f922 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -106,6 +106,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b, bool
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -296,12 +300,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -5237,6 +5245,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
set_next_task(rq, next);
}
 
+   trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie);
+
rq->core_pick = NULL;
return next;
}
@@ -5331,6 +5346,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_forceidle_seq++;
}
 
+   trace_printk("cpu(%d): selected: %s/%d %lx\n",
+i, p->comm, p->pid, p->core_cookie);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -5347,6 +5365,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %lx\n", max->comm, 
max->pid, max->core_cookie);
+
if (old_max) {
rq->core->core_forceidle = false;
for_each_cpu(j, smt_mask) {
@@ -5368,6 +5388,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+   trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, 
next->core_cookie);
 
/*
 * Reschedule siblings
@@ -5409,13 +5430,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
}
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%lx/0x%lx\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie,
+rq_i->core->core_cookie);
+   WARN_ON_ONCE(1);
+   }
 
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
 
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
 
@@ -5455,6 +5484,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation, 
cookie);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);
set_task_cpu(p, this);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fddd7c44bbf3..ebeeebc4223a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10769,6 +10769,15 @@ static void se_fi_update(struct sched_entity *se, 
unsigned int fi_seq, bool forc
 

[PATCH v10 4/5] Documentation: Add core scheduling documentation

2021-01-22 Thread Joel Fernandes (Google)
Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Chris Hyser 
Co-developed-by: Vineeth Pillai 
Co-developed-by: Josh Don 
Signed-off-by: Josh Don 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Chris Hyser 
Tested-by: Julien Desfossez 
Reviewed-by: Randy Dunlap 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 263 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 264 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst 
b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..a795747c706a
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,263 @@
+Core Scheduling
+***************
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks doesn't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core).
+
+Security usecase
+----------------
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance; however, performance is not guaranteed to improve in
+every case, though that has been seen with a number of real world
+workloads. In theory, core scheduling aims to perform at least as well as when
+Hyper Threading is disabled. In practice, this is mostly the case, though not
+always: synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that can be co-scheduled
+on the same core.
+The core scheduler uses this information to make sure that tasks that are not
+in the same group never run simultaneously on a core, while doing its best to
+satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This results in all the CGroup's tasks being allowed to run concurrently on a
+core's hyperthreads (also called siblings).
+
+A value of ``0`` in this file means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via ``cpu.core_tag``, it is not possible to
+          set this for any descendant of the tagged group.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+          a core with kernel threads and untagged system threads. For this
+          reason, if a group has ``cpu.core_tag`` of 0, it is considered to be
+          trusted.
+
+prctl interface
+###############
+A ``prctl(2)`` command ``PR_SCHED_CORE_SHARE`` provides an interface for the
+creation of and admission and removal of tasks from core scheduling groups.
+
+::
+
+#include 
+
+int prctl(int option, unsigned long arg2, unsigned long arg3,
+unsigned long arg4, unsigned long arg5);
+
+option:
+``PR_SCHED_CORE_SHARE``
+
+arg2:
+- ``PR_SCHED_CORE_CLEAR       0  -- clear core_sched cookie of pid``
+- ``PR_SCHED_CORE_SHARE_FROM  1  -- get core_sched cookie from pid``
+- ``PR_SCHED_CORE_SHARE_TO    2  -- push core_sched cookie to pid``
+
+arg3:
+``tid`` of the task for which the operation applies
+
+arg4 and arg5:
+MUST be equal to 0.
+
+Creation
+~~~~~~~~
+Creation is accomplished by sharing a ``cookie`` from a process not currently
+in a core scheduling group.
+
+::
+
+    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, src_tid, 0, 0) < 0)
+        handle_error("src_tid sched_core failed");
+
+Removal
+~~~~~~~
+Removing a task from a core scheduling group is done by:
+
+::
+
+    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_CLEAR, clr_tid, 0, 0) < 0)
+        handle_error("clr_tid sched_core failed");
+
+Cookie Transferal
+~~~~~~~~~~~~~~~~~
+Transferring a cookie between the current and other tasks is possible using
+PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a
+specified task or
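
The document is truncated above. Based only on the interface shown earlier in
this patch (an illustration, not text from the patch), a task could join an
existing group and then extend it to another task like so, where helper_tid
is whatever task should be added:

    /* Adopt the cookie of src_tid, joining its core scheduling group... */
    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_FROM, src_tid, 0, 0) < 0)
            handle_error("src_tid sched_core failed");

    /* ...then push the same cookie to helper_tid as well. */
    if (prctl(PR_SCHED_CORE_SHARE, PR_SCHED_CORE_SHARE_TO, helper_tid, 0, 0) < 0)
            handle_error("helper_tid sched_core failed");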

[PATCH v10 0/5] Core scheduling remaining patches

2021-01-22 Thread Joel Fernandes (Google)
 weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (3):
kselftest: Add tests for core-sched interface
Documentation: Add core scheduling documentation
sched: Debug bits...

Peter Zijlstra (1):
sched: CGroup tagging interface for core scheduling

.../admin-guide/hw-vuln/core-scheduling.rst   | 263 +++
Documentation/admin-guide/hw-vuln/index.rst   |   1 +
include/linux/sched.h |  10 +
include/uapi/linux/prctl.h|   6 +
kernel/fork.c |   1 +
kernel/sched/Makefile |   1 +
kernel/sched/core.c   | 171 -
kernel/sched/coretag.c| 669 
kernel/sched/debug.c  |   4 +
kernel/sched/fair.c   |  42 +-
kernel/sched/sched.h  | 130 +++-
kernel/sys.c  |   7 +
tools/include/uapi/linux/prctl.h  |   6 +
tools/testing/selftests/sched/.gitignore  |   1 +
tools/testing/selftests/sched/Makefile|  14 +
tools/testing/selftests/sched/config  |   1 +
.../testing/selftests/sched/test_coresched.c  | 716 ++
17 files changed, 2018 insertions(+), 25 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst
create mode 100644 kernel/sched/coretag.c
create mode 100644 tools/testing/selftests/sched/.gitignore
create mode 100644 tools/testing/selftests/sched/Makefile
create mode 100644 tools/testing/selftests/sched/config
create mode 100644 tools/testing/selftests/sched/test_coresched.c

--
2.30.0.280.ga3ce27912f-goog



[PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-22 Thread Joel Fernandes (Google)
On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
that there are a lot of calls to update_blocked_averages(). This causes
the schedule loop to slow down, taking up to 500 microseconds at
times (due to newidle load balance). I have also seen this manifest in
the periodic balancer.

A closer look shows that the problem is caused by the following
ingredients:
1. If the system has a lot of inactive CGroups (thanks Dietmar for
suggesting to inspect /proc/sched_debug for this), this can make
__update_blocked_fair() take a long time.

2. The device has a lot of CPUs in a cluster which causes schedutil in a
shared frequency domain configuration to be slower than usual. (the load
average updates also try to update the frequency in schedutil).

3. The CPU is running at a low frequency causing the scheduler/schedutil
code paths to take longer than when running at a high CPU frequency.

The fix is to simply rate limit the calls to update_blocked_averages()
to 20 times per second. It appears that updating the blocked average
less often is sufficient. Currently I sometimes see about 200 calls per
second, which seems like overkill.
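
For reference, a userspace analogue of the throttle (illustrative only; the
actual change is the one-line diff below, which uses jiffies and HZ/20, i.e.
roughly a 50ms window):

  #include <stdbool.h>
  #include <time.h>

  /* Allow the expensive update at most 20 times per second. */
  static bool update_allowed(struct timespec *last)
  {
          struct timespec now;
          long elapsed_ms;

          clock_gettime(CLOCK_MONOTONIC, &now);
          elapsed_ms = (now.tv_sec - last->tv_sec) * 1000 +
                       (now.tv_nsec - last->tv_nsec) / 1000000;

          if (elapsed_ms < 1000 / 20)     /* mirrors HZ/20 == ~50ms */
                  return false;

          *last = now;
          return true;
  }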

schbench shows a clear improvement with the change:

Without patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (212 total samples)
50.0th: 210 (106 samples)
75.0th: 619 (53 samples)
90.0th: 665 (32 samples)
95.0th: 703 (11 samples)
*99.0th: 12656 (8 samples)
99.5th: 12784 (1 samples)
99.9th: 13424 (1 samples)
min=15, max=13424

With patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (214 total samples)
50.0th: 188 (108 samples)
75.0th: 238 (53 samples)
90.0th: 623 (32 samples)
95.0th: 657 (12 samples)
*99.0th: 717 (7 samples)
99.5th: 725 (2 samples)
99.9th: 725 (0 samples)

Cc: Paul McKenney 
Cc: Frederic Weisbecker 
Suggested-by: Dietmar Eggeman 
Co-developed-by: Qais Yousef 
Signed-off-by: Qais Yousef 
Signed-off-by: Joel Fernandes (Google) 

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..fe2dc0024db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
return false;
 
-   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
+   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
return true;
 
update_blocked_averages(cpu);
-- 
2.30.0.280.ga3ce27912f-goog



[tip: core/rcu] rcu/tree: Add a warning if CPU being onlined did not report QS already

2020-12-13 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 9f866dac94292f93d3b6bf8dbe860a44b954e555
Gitweb:
https://git.kernel.org/tip/9f866dac94292f93d3b6bf8dbe860a44b954e555
Author:Joel Fernandes (Google) 
AuthorDate:Tue, 29 Sep 2020 15:29:27 -04:00
Committer: Paul E. McKenney 
CommitterDate: Thu, 19 Nov 2020 19:37:16 -08:00

rcu/tree: Add a warning if CPU being onlined did not report QS already

Currently, rcu_cpu_starting() checks to see if the RCU core expects a
quiescent state from the incoming CPU.  However, the current interaction
between RCU quiescent-state reporting and CPU-hotplug operations should
mean that the incoming CPU never needs to report a quiescent state.
First, the outgoing CPU reports a quiescent state if needed.  Second,
the race where the CPU is leaving just as RCU is initializing a new
grace period is handled by an explicit check for this condition.  Third,
the CPU's leaf rcu_node structure's ->lock serializes these checks.

This means that if rcu_cpu_starting() ever feels the need to report
a quiescent state, then there is a bug somewhere in the CPU hotplug
code or the RCU grace-period handling code.  This commit therefore
adds a WARN_ON_ONCE() to bring that bug to everyone's attention.

Cc: Neeraj Upadhyay 
Suggested-by: Paul E. McKenney 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 39e14cf..e4d6d0b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4075,7 +4075,9 @@ void rcu_cpu_starting(unsigned int cpu)
rcu_gpnum_ovf(rnp, rdp); /* Offline-induced counter wrap? */
rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
-   if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
+
+   /* An incoming CPU should never be blocking a grace period. */
+   if (WARN_ON_ONCE(rnp->qsmask & mask)) { /* RCU waiting on incoming CPU? */
rcu_disable_urgency_upon_qs(rdp);
/* Report QS -after- changing ->qsmaskinitnext! */
rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);


[tip: core/rcu] rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs

2020-12-13 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: bd56e0a4a291bc9db2cbaddef20ec61a1aad4208
Gitweb:
https://git.kernel.org/tip/bd56e0a4a291bc9db2cbaddef20ec61a1aad4208
Author:Joel Fernandes (Google) 
AuthorDate:Wed, 07 Oct 2020 13:50:36 -07:00
Committer: Paul E. McKenney 
CommitterDate: Thu, 19 Nov 2020 19:37:17 -08:00

rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs

Testing showed that rcu_pending() can return 1 when offloaded callbacks
are ready to execute.  This invokes RCU core processing, for example,
by raising RCU_SOFTIRQ, eventually resulting in a call to rcu_core().
However, rcu_core() explicitly avoids in any way manipulating offloaded
callbacks, which are instead handled by the rcuog and rcuoc kthreads,
which work independently of rcu_core().

One exception to this independence is that rcu_core() invokes
do_nocb_deferred_wakeup(), however, rcu_pending() also checks
rcu_nocb_need_deferred_wakeup() in order to correctly handle this case,
invoking rcu_core() when needed.

This commit therefore avoids needlessly invoking RCU core processing
by checking rcu_segcblist_ready_cbs() only on non-offloaded CPUs.
This reduces overhead, for example, by reducing softirq activity.

This change passed 30 minute tests of TREE01 through TREE09 each.

On TREE08, there is at most 150us from the time that rcu_pending() chose
not to invoke RCU core processing to the time when the ready callbacks
were invoked by the rcuoc kthread.  This provides further evidence that
there is no need to invoke rcu_core() for offloaded callbacks that are
ready to invoke.

Cc: Neeraj Upadhyay 
Reviewed-by: Frederic Weisbecker 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d6a015e..50d90ee 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3718,7 +3718,8 @@ static int rcu_pending(int user)
return 1;
 
/* Does this CPU have callbacks ready to invoke? */
-   if (rcu_segcblist_ready_cbs(&rdp->cblist))
+   if (!rcu_segcblist_is_offloaded(&rdp->cblist) &&
+   rcu_segcblist_ready_cbs(&rdp->cblist))
return 1;
 
/* Has RCU gone idle with this CPU needing another grace period? */


[tip: core/rcu] docs: Update RCU's hotplug requirements with a bit about design

2020-12-13 Thread tip-bot2 for Joel Fernandes (Google)
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: a043260740d5d6ec5be59c3fb595c719890a0b0b
Gitweb:
https://git.kernel.org/tip/a043260740d5d6ec5be59c3fb595c719890a0b0b
Author:Joel Fernandes (Google) 
AuthorDate:Tue, 29 Sep 2020 15:29:28 -04:00
Committer: Paul E. McKenney 
CommitterDate: Fri, 06 Nov 2020 17:02:43 -08:00

docs: Update RCU's hotplug requirements with a bit about design

The rcu_barrier() section of the "Hotplug CPU" section discusses
deadlocks; however, the description of deadlocks other than those
involving rcu_barrier() is rather incomplete.

This commit therefore continues the section by describing how RCU's
design handles CPU hotplug in a deadlock-free way.

Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Paul E. McKenney 
---
 Documentation/RCU/Design/Requirements/Requirements.rst | 49 +++--
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst 
b/Documentation/RCU/Design/Requirements/Requirements.rst
index 1ae79a1..8807985 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -1929,16 +1929,45 @@ The Linux-kernel CPU-hotplug implementation has 
notifiers that are used
 to allow the various kernel subsystems (including RCU) to respond
 appropriately to a given CPU-hotplug operation. Most RCU operations may
 be invoked from CPU-hotplug notifiers, including even synchronous
-grace-period operations such as ``synchronize_rcu()`` and
-``synchronize_rcu_expedited()``.
-
-However, all-callback-wait operations such as ``rcu_barrier()`` are also
-not supported, due to the fact that there are phases of CPU-hotplug
-operations where the outgoing CPU's callbacks will not be invoked until
-after the CPU-hotplug operation ends, which could also result in
-deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations
-during its execution, which results in another type of deadlock when
-invoked from a CPU-hotplug notifier.
+grace-period operations such as (``synchronize_rcu()`` and
+``synchronize_rcu_expedited()``).  However, these synchronous operations
+do block and therefore cannot be invoked from notifiers that execute via
+``stop_machine()``, specifically those between the ``CPUHP_AP_OFFLINE``
+and ``CPUHP_AP_ONLINE`` states.
+
+In addition, all-callback-wait operations such as ``rcu_barrier()`` may
+not be invoked from any CPU-hotplug notifier.  This restriction is due
+to the fact that there are phases of CPU-hotplug operations where the
+outgoing CPU's callbacks will not be invoked until after the CPU-hotplug
+operation ends, which could also result in deadlock. Furthermore,
+``rcu_barrier()`` blocks CPU-hotplug operations during its execution,
+which results in another type of deadlock when invoked from a CPU-hotplug
+notifier.
+
+Finally, RCU must avoid deadlocks due to interaction between hotplug,
+timers and grace period processing. It does so by maintaining its own set
+of books that duplicate the centrally maintained ``cpu_online_mask``,
+and also by reporting quiescent states explicitly when a CPU goes
+offline.  This explicit reporting of quiescent states avoids any need
+for the force-quiescent-state loop (FQS) to report quiescent states for
+offline CPUs.  However, as a debugging measure, the FQS loop does splat
+if offline CPUs block an RCU grace period for too long.
+
+An offline CPU's quiescent state will be reported either:
+
+1.  As the CPU goes offline using RCU's hotplug notifier
+    (``rcu_report_dead()``).
+2.  When grace period initialization (``rcu_gp_init()``) detects a
+    race either with CPU offlining or with a task unblocking on a leaf
+    ``rcu_node`` structure whose CPUs are all offline.
+
+The CPU-online path (``rcu_cpu_starting()``) should never need to report
+a quiescent state for an offline CPU.  However, as a debugging measure,
+it does emit a warning if a quiescent state was not already reported
+for that CPU.
+
+During the checking/modification of RCU's hotplug bookkeeping, the
+corresponding CPU's leaf node lock is held. This avoids race conditions
+between RCU's hotplug notifier hooks, the grace period initialization
+code, and the FQS loop, all of which refer to or modify this bookkeeping.
 
 Scheduler and RCU
 ~


[PATCH v2] rcu/segcblist: Add debug checks for segment lengths

2020-11-18 Thread Joel Fernandes (Google)
After rcu_do_batch(), add a check for whether the seglen counts went to
zero if the list was indeed empty.

Signed-off-by: Joel Fernandes (Google) 
---
v1->v2: Added more debug checks.

 kernel/rcu/rcu_segcblist.c | 12 
 kernel/rcu/rcu_segcblist.h |  3 +++
 kernel/rcu/tree.c  |  2 ++
 3 files changed, 17 insertions(+)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 5059b6102afe..6e98bb3804f0 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -94,6 +94,18 @@ static long rcu_segcblist_get_seglen(struct rcu_segcblist 
*rsclp, int seg)
return READ_ONCE(rsclp->seglen[seg]);
 }
 
+/* Return number of callbacks in segmented callback list by totalling seglen. 
*/
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp)
+{
+   long len = 0;
+   int i;
+
+   for (i = RCU_DONE_TAIL; i < RCU_CBLIST_NSEGS; i++)
+   len += rcu_segcblist_get_seglen(rsclp, i);
+
+   return len;
+}
+
 /* Set the length of a segment of the rcu_segcblist structure. */
 static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
 {
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index cd35c9faaf51..46a42d77f7e1 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -15,6 +15,9 @@ static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp)
return READ_ONCE(rclp->len);
 }
 
+/* Return number of callbacks in segmented callback list by totalling seglen. 
*/
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp);
+
 void rcu_cblist_init(struct rcu_cblist *rclp);
 void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp);
 void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f5b61e10f1de..91e35b521e51 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2553,6 +2553,8 @@ static void rcu_do_batch(struct rcu_data *rdp)
WARN_ON_ONCE(count == 0 && !rcu_segcblist_empty(&rdp->cblist));
WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
 count != 0 && rcu_segcblist_empty(&rdp->cblist));
+   WARN_ON_ONCE(count == 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) != 0);
+   WARN_ON_ONCE(count != 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
 
rcu_nocb_unlock_irqrestore(rdp, flags);
 
-- 
2.29.2.299.gdc1121823c-goog


[PATCH] rcu/segcblist: Add debug check for whether seglen is 0 for empty list

2020-11-18 Thread Joel Fernandes (Google)
After rcu_do_batch(), add a check for whether the seglen counts went to
zero if the list was indeed empty.

Signed-off-by: Joel Fernandes (Google) 

---
 kernel/rcu/rcu_segcblist.c | 12 
 kernel/rcu/rcu_segcblist.h |  3 +++
 kernel/rcu/tree.c  |  1 +
 3 files changed, 16 insertions(+)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 5059b6102afe..6e98bb3804f0 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -94,6 +94,18 @@ static long rcu_segcblist_get_seglen(struct rcu_segcblist 
*rsclp, int seg)
return READ_ONCE(rsclp->seglen[seg]);
 }
 
+/* Return number of callbacks in segmented callback list by totalling seglen. 
*/
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp)
+{
+   long len = 0;
+   int i;
+
+   for (i = RCU_DONE_TAIL; i < RCU_CBLIST_NSEGS; i++)
+   len += rcu_segcblist_get_seglen(rsclp, i);
+
+   return len;
+}
+
 /* Set the length of a segment of the rcu_segcblist structure. */
 static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
 {
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index cd35c9faaf51..46a42d77f7e1 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -15,6 +15,9 @@ static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp)
return READ_ONCE(rclp->len);
 }
 
+/* Return number of callbacks in segmented callback list by totalling seglen. 
*/
+long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp);
+
 void rcu_cblist_init(struct rcu_cblist *rclp);
 void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp);
 void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f5b61e10f1de..928bd10c9c3b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2553,6 +2553,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
WARN_ON_ONCE(count == 0 && !rcu_segcblist_empty(&rdp->cblist));
WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
 count != 0 && rcu_segcblist_empty(&rdp->cblist));
+   WARN_ON_ONCE(count == 0 && !rcu_segcblist_n_segment_cbs(&rdp->cblist));
 
rcu_nocb_unlock_irqrestore(rdp, flags);
 
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 09/32] sched/fair: Snapshot the min_vruntime of CPUs on force idle

2020-11-17 Thread Joel Fernandes (Google)
During force-idle, we end up doing cross-cpu comparison of vruntimes
during pick_next_task. If we simply compare (vruntime-min_vruntime)
across CPUs, and if the CPUs only have 1 task each, we will always
end up comparing 0 with 0 and pick just one of the tasks all the time.
This starves the task that was not picked. To fix this, take a snapshot
of the min_vruntime when entering force idle and use it for comparison.
This min_vruntime snapshot will only be used for cross-CPU vruntime
comparison, and nothing else.
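
A rough sketch of the comparison idea (illustrative only; the names are not
the kernel's, u64/s64 are the usual kernel fixed-width types, and "snapshot"
stands for the min_vruntime captured when force idle began):

  /* Compare two tasks on different CPUs by normalizing each vruntime
   * against its CPU's force-idle snapshot rather than the live
   * min_vruntime, so single-task CPUs no longer compare as 0 vs 0. */
  static inline bool runs_before(u64 vruntime_a, u64 snapshot_a,
                                 u64 vruntime_b, u64 snapshot_b)
  {
          s64 delta = (s64)((vruntime_a - snapshot_a) -
                            (vruntime_b - snapshot_b));

          return delta < 0;       /* smaller normalized vruntime runs first */
  }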

This resolves several performance issues that were seen in the ChromeOS
audio usecase.

NOTE: This patch will be improved in a later patch. It is just
  kept here as the basis for the later patch and to make rebasing
  easier. Further, it may make reverting the improvement easier in
  case the improvement causes any regression.

Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 33 -
 kernel/sched/fair.c  | 40 
 kernel/sched/sched.h |  5 +
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 52d0e83072a4..4ee4902c2cf5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -115,19 +115,8 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b)
if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-   u64 vruntime = b->se.vruntime;
-
-   /*
-* Normalize the vruntime if tasks are in different cpus.
-*/
-   if (task_cpu(a) != task_cpu(b)) {
-   vruntime -= task_cfs_rq(b)->min_vruntime;
-   vruntime += task_cfs_rq(a)->min_vruntime;
-   }
-
-   return !((s64)(a->se.vruntime - vruntime) <= 0);
-   }
+   if (pa == MAX_RT_PRIO + MAX_NICE)   /* fair */
+   return cfs_prio_less(a, b);
 
return false;
 }
@@ -5144,6 +5133,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
struct task_struct *next, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
+   bool fi_before = false;
bool need_sync;
int i, j, cpu;
 
@@ -5208,6 +5198,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = 0UL;
if (rq->core->core_forceidle) {
need_sync = true;
+   fi_before = true;
rq->core->core_forceidle = false;
}
for_each_cpu(i, smt_mask) {
@@ -5219,6 +5210,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
update_rq_clock(rq_i);
}
 
+   /* Reset the snapshot if core is no longer in force-idle. */
+   if (!fi_before) {
+   for_each_cpu(i, smt_mask) {
+   struct rq *rq_i = cpu_rq(i);
+   rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+   }
+   }
+
/*
 * Try and select tasks for each sibling in decending sched_class
 * order.
@@ -5355,6 +5354,14 @@ next_class:;
resched_curr(rq_i);
}
 
+   /* Snapshot if core is in force-idle. */
+   if (!fi_before && rq->core->core_forceidle) {
+   for_each_cpu(i, smt_mask) {
+   struct rq *rq_i = cpu_rq(i);
+   rq_i->cfs.min_vruntime_fi = rq_i->cfs.min_vruntime;
+   }
+   }
+
 done:
set_next_task(rq, next);
return next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42965c4fd71f..de82f88ba98c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10726,6 +10726,46 @@ static inline void task_tick_core(struct rq *rq, 
struct task_struct *curr)
__entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
resched_curr(rq);
 }
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+   bool samecpu = task_cpu(a) == task_cpu(b);
+   struct sched_entity *sea = &a->se;
+   struct sched_entity *seb = &b->se;
+   struct cfs_rq *cfs_rqa;
+   struct cfs_rq *cfs_rqb;
+   s64 delta;
+
+   if (samecpu) {
+   /* vruntime is per cfs_rq */
+   while (!is_same_group(sea, seb)) {
+   int sea_depth = sea->depth;
+   int seb_depth = seb->depth;
+   if (sea_depth >= seb_depth)
+   sea = parent_entity(sea);
+   if (sea_depth <= seb_depth)
+   seb = parent_entity

[PATCH -tip 08/32] sched/fair: Fix forced idle sibling starvation corner case

2020-11-17 Thread Joel Fernandes (Google)
From: Vineeth Pillai 

If there is only one long-running local task and the sibling is
forced idle, it might not get a chance to run until a schedule
event happens on any cpu in the core.

So we check for this condition during a tick to see if a sibling
is starved and then give it a chance to schedule.

Tested-by: Julien Desfossez 
Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 15 ---
 kernel/sched/fair.c  | 40 
 kernel/sched/sched.h |  2 +-
 3 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bd0b0bbb040..52d0e83072a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5206,16 +5206,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* reset state */
rq->core->core_cookie = 0UL;
+   if (rq->core->core_forceidle) {
+   need_sync = true;
+   rq->core->core_forceidle = false;
+   }
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);
 
rq_i->core_pick = NULL;
 
-   if (rq_i->core_forceidle) {
-   need_sync = true;
-   rq_i->core_forceidle = false;
-   }
-
if (i != cpu)
update_rq_clock(rq_i);
}
@@ -5335,8 +5334,10 @@ next_class:;
if (!rq_i->core_pick)
continue;
 
-   if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
-   rq_i->core_forceidle = true;
+   if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
+   !rq_i->core->core_forceidle) {
+   rq_i->core->core_forceidle = true;
+   }
 
if (i == cpu) {
rq_i->core_pick = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f53681cd263e..42965c4fd71f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10692,6 +10692,44 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_SCHED_CORE
+static inline bool
+__entity_slice_used(struct sched_entity *se, int min_nr_tasks)
+{
+   u64 slice = sched_slice(cfs_rq_of(se), se);
+   u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+   return (rtime * min_nr_tasks > slice);
+}
+
+#define MIN_NR_TASKS_DURING_FORCEIDLE  2
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
+{
+   if (!sched_core_enabled(rq))
+   return;
+
+   /*
+* If runqueue has only one task which used up its slice and
+* if the sibling is forced idle, then trigger schedule to
+* give forced idle task a chance.
+*
+* sched_slice() considers only this active rq and it gets the
+* whole slice. But during force idle, we have siblings acting
+* like a single runqueue and hence we need to consider runnable
+* tasks on this cpu and the forced idle cpu. Ideally, we should
+* go through the forced idle rq, but that would be a perf hit.
+* We can assume that the forced idle cpu has at least
+* MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
+* if we need to give up the cpu.
+*/
+   if (rq->core->core_forceidle && rq->cfs.nr_running == 1 &&
+   __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
+   resched_curr(rq);
+}
+#else
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
+#endif
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10715,6 +10753,8 @@ static void task_tick_fair(struct rq *rq, struct 
task_struct *curr, int queued)
 
update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
+
+   task_tick_core(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 63b28e1843ee..be656ca8693d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1069,12 +1069,12 @@ struct rq {
unsigned intcore_enabled;
unsigned intcore_sched_seq;
struct rb_root  core_tree;
-   unsigned char   core_forceidle;
 
/* shared state */
unsigned intcore_task_seq;
unsigned intcore_pick_seq;
unsigned long   core_cookie;
+   unsigned char   core_forceidle;
 #endif
 };
 
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 21/32] sched: CGroup tagging interface for core scheduling

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also after we forked a task, its core scheduler queue's presence will
need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent
a new task from sneaking into the cgroup and being missed by the update
while we iterate through all the tasks in the cgroup.  A more complicated
scheme could probably avoid the stop machine.  Such a scheme would also
need to resolve the inconsistency between a task's cgroup core scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now, as it avoids
such complications.

The core scheduler has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.

Tested-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 183 +--
 kernel/sched/sched.h |   4 +
 2 files changed, 181 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f807a84cc30..b99a7493d590 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,6 +157,37 @@ static inline bool __sched_core_less(struct task_struct 
*a, struct task_struct *
return false;
 }
 
+static bool sched_core_empty(struct rq *rq)
+{
+   return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+   return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+   struct task_struct *task;
+
+   task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+   return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+   struct task_struct *task;
+
+   while (!sched_core_empty(rq)) {
+   task = sched_core_first(rq);
+   rb_erase(&task->core_node, &rq->core_tree);
+   RB_CLEAR_NODE(&task->core_node);
+   }
+   rq->core->core_task_seq++;
+}
+
 static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
struct rb_node *parent, **node;
@@ -188,10 +219,11 @@ static void sched_core_dequeue(struct rq *rq, struct 
task_struct *p)
 {
rq->core->core_task_seq++;
 
-   if (!p->core_cookie)
+   if (!sched_core_enqueued(p))
return;
 
rb_erase(&p->core_node, &rq->core_tree);
+   RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
@@ -255,8 +287,24 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;
 
-   for_each_possible_cpu(cpu)
-   cpu_rq(cpu)->core_enabled = enabled;
+   for_each_possible_cpu(cpu) {
+   struct rq *rq = cpu_rq(cpu);
+
+   WARN_ON_ONCE(enabled == rq->core_enabled);
+
+   if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+   /*
+* All active and migrating tasks will have already
+* been removed from core queue when we clear the
+* cgroup tags. However, dying tasks could still be
+* left in core queue. Flush them here.
+*/
+   if (!enabled)
+   sched_core_flush(cpu);
+
+   rq->core_enabled = enabled;
+   }
+   }
 
return 0;
 }
@@ -266,7 +314,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-   // XXX verify there are no cookie tasks (yet)
+   int cpu;
+
+   /* verify there are no cookie tasks (yet) */
+   for_each_online_cpu(cpu)
+   BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -274,8 +326,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-   // XXX verify there are no cookie tasks (left)
-
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
 }
@@ -300,6 +350,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -3978,6 +4029,9 @@ int sched_fork

[PATCH -tip 15/32] sched: Improve snapshotting of min_vruntime for CGroups

2020-11-17 Thread Joel Fernandes (Google)
A previous patch improved cross-cpu vruntime comparison operations in
pick_next_task(). Improve it further for tasks in CGroups.

In particular, for cross-CPU comparisons, we were previously going to
the root-level se(s) for both tasks being compared. That was strange.
This patch instead finds the se(s) for both tasks that have the same
parent (which may be different from root).

A note about the min_vruntime snapshot and force idling:
Abbreviations: fi: force-idled now?; fib: force-idled before?
During selection:
When we're not fi, we need to update the snapshot.
When we're fi and we were not fi, we must update the snapshot.
When we're fi and we were already fi, we must not update the snapshot.

Which gives:
fib fi  update?
0   0   1
0   1   1
1   0   1
1   1   0
So the min_vruntime snapshot needs to be updated when: !(fib && fi).
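
In code form (a sketch only; fi_before/fi mirror the fib/fi columns above):

  /* Update the min_vruntime snapshot unless the core both was and
   * still is in force idle. */
  static inline bool should_update_snapshot(bool fi_before, bool fi)
  {
          return !(fi_before && fi);
  }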

Also, the cfs_prio_less() function needs to be aware of whether the core
is in force idle or not, since it will use this information to know
whether to advance a cfs_rq's min_vruntime_fi in the hierarchy. So pass
this information along via pick_task() -> prio_less().

Reviewed-by: Vineeth Pillai 
Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 61 +
 kernel/sched/fair.c  | 80 
 kernel/sched/sched.h |  7 +++-
 3 files changed, 97 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b373b592680..20125431af87 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -101,7 +101,7 @@ static inline int __task_prio(struct task_struct *p)
  */
 
 /* real prio, less is less */
-static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+static inline bool prio_less(struct task_struct *a, struct task_struct *b, 
bool in_fi)
 {
 
int pa = __task_prio(a), pb = __task_prio(b);
@@ -116,7 +116,7 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b)
return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
if (pa == MAX_RT_PRIO + MAX_NICE)   /* fair */
-   return cfs_prio_less(a, b);
+   return cfs_prio_less(a, b, in_fi);
 
return false;
 }
@@ -130,7 +130,7 @@ static inline bool __sched_core_less(struct task_struct *a, 
struct task_struct *
return false;
 
/* flip prio, so high prio is leftmost */
-   if (prio_less(b, a))
+   if (prio_less(b, a, task_rq(a)->core->core_forceidle))
return true;
 
return false;
@@ -5101,7 +5101,7 @@ static inline bool cookie_match(struct task_struct *a, 
struct task_struct *b)
  * - Else returns idle_task.
  */
 static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max)
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max, bool in_fi)
 {
struct task_struct *class_pick, *cookie_pick;
unsigned long cookie = rq->core->core_cookie;
@@ -5116,7 +5116,7 @@ pick_task(struct rq *rq, const struct sched_class *class, 
struct task_struct *ma
 * higher priority than max.
 */
if (max && class_pick->core_cookie &&
-   prio_less(class_pick, max))
+   prio_less(class_pick, max, in_fi))
return idle_sched_class.pick_task(rq);
 
return class_pick;
@@ -5135,13 +5135,15 @@ pick_task(struct rq *rq, const struct sched_class 
*class, struct task_struct *ma
 * the core (so far) and it must be selected, otherwise we must go with
 * the cookie pick in order to satisfy the constraint.
 */
-   if (prio_less(cookie_pick, class_pick) &&
-   (!max || prio_less(max, class_pick)))
+   if (prio_less(cookie_pick, class_pick, in_fi) &&
+   (!max || prio_less(max, class_pick, in_fi)))
return class_pick;
 
return cookie_pick;
 }
 
+extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool 
in_fi);
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -5230,9 +5232,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
if (!next->core_cookie) {
rq->core_pick = NULL;
+   /*
+* For robustness, update the min_vruntime_fi for
+* unconstrained picks as well.
+*/
+   WARN_ON_ONCE(fi_before);
+   task_vruntime_update(rq, next, false);
goto done;
}
-   need_sync = true;
}
 
for_each_cpu(i, s

[PATCH -tip 19/32] entry/idle: Enter and exit kernel protection during idle entry and exit

2020-11-17 Thread Joel Fernandes (Google)
Add a generic_idle_{enter,exit} helper function to enter and exit kernel
protection when entering and exiting idle, respectively.

While at it, remove a stale RCU comment.

Reviewed-by: Alexandre Chartre 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/entry-common.h | 18 ++
 kernel/sched/idle.c  | 11 ++-
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 022e1f114157..8f34ae625f83 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -454,4 +454,22 @@ static inline bool entry_kernel_protected(void)
return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
&& _TIF_UNSAFE_RET != 0;
 }
+
+/**
+ * generic_idle_enter - General tasks to perform during idle entry.
+ */
+static inline void generic_idle_enter(void)
+{
+   /* Entering idle ends the protected kernel region. */
+   sched_core_unsafe_exit();
+}
+
+/**
+ * generic_idle_exit  - General tasks to perform during idle exit.
+ */
+static inline void generic_idle_exit(void)
+{
+   /* Exiting idle (re)starts the protected kernel region. */
+   sched_core_unsafe_enter();
+}
 #endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8bdb214eb78f..ee4f91396c31 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -8,6 +8,7 @@
  */
 #include "sched.h"
 
+#include 
 #include 
 
 /* Linker adds these: start and end of __cpuidle functions */
@@ -54,6 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);
 
 static noinline int __cpuidle cpu_idle_poll(void)
 {
+   generic_idle_enter();
trace_cpu_idle(0, smp_processor_id());
stop_critical_timings();
rcu_idle_enter();
@@ -66,6 +68,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
rcu_idle_exit();
start_critical_timings();
trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+   generic_idle_exit();
 
return 1;
 }
@@ -156,11 +159,7 @@ static void cpuidle_idle_call(void)
return;
}
 
-   /*
-* The RCU framework needs to be told that we are entering an idle
-* section, so no more rcu read side critical sections and one more
-* step to the grace period
-*/
+   generic_idle_enter();
 
if (cpuidle_not_available(drv, dev)) {
tick_nohz_idle_stop_tick();
@@ -225,6 +224,8 @@ static void cpuidle_idle_call(void)
 */
if (WARN_ON_ONCE(irqs_disabled()))
local_irq_enable();
+
+   generic_idle_exit();
 }
 
 /*
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 32/32] sched: Debug bits...

2020-11-17 Thread Joel Fernandes (Google)
Tested-by: Julien Desfossez 
Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 35 ++-
 kernel/sched/fair.c |  9 +
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01938a2154fd..bbeeb18d460e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -127,6 +127,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b, bool
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -317,12 +321,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 DEFINE_STATIC_KEY_TRUE(sched_coresched_supported);
@@ -5486,6 +5494,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
set_next_task(rq, next);
}
 
+   trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie);
+
rq->core_pick = NULL;
return next;
}
@@ -5580,6 +5595,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_forceidle_seq++;
}
 
+   trace_printk("cpu(%d): selected: %s/%d %lx\n",
+i, p->comm, p->pid, p->core_cookie);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -5596,6 +5614,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %lx\n", max->comm, 
max->pid, max->core_cookie);
+
if (old_max) {
rq->core->core_forceidle = false;
for_each_cpu(j, smt_mask) {
@@ -5617,6 +5637,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+   trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, 
next->core_cookie);
 
/*
 * Reschedule siblings
@@ -5658,13 +5679,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
}
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%lx/0x%lx\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie,
+rq_i->core->core_cookie);
+   WARN_ON_ONCE(1);
+   }
 
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
 
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
 
@@ -5704,6 +5733,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation, 
cookie);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);
set_task_cpu(p, this);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a89c7c917cc6..81c8a50ab4c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10798,6 +10798,15 @@ static void se_fi_update(struct sched_entity *se, 
unsigned int fi_seq, 

[PATCH -tip 31/32] sched: Add a coresched command line option

2020-11-17 Thread Joel Fernandes (Google)
Some hardware such as certain AMD variants don't have cross-HT MDS/L1TF
issues. Detect this and don't enable core scheduling as it can
needlessly slow those devices down.

However, some users may want core scheduling even if the hardware is
secure. To support them, add a coresched= option which defaults to
'secure' and can be overridden to 'on' if the user wants to enable
coresched even if the HW is not vulnerable. 'off' would disable
core scheduling in any case.

Also add a sched_debug entry to indicate if core scheduling is turned on
or not.

Reviewed-by: Alexander Graf 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/kernel-parameters.txt | 14 ++
 arch/x86/kernel/cpu/bugs.c| 19 
 include/linux/cpu.h   |  1 +
 include/linux/sched/smt.h |  4 ++
 kernel/cpu.c  | 43 +++
 kernel/sched/core.c   |  6 +++
 kernel/sched/debug.c  |  4 ++
 7 files changed, 91 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index b185c6ed4aba..9cd2cf7c18d4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -698,6 +698,20 @@
/proc//coredump_filter.
See also Documentation/filesystems/proc.rst.
 
+   coresched=  [SCHED_CORE] This feature allows the Linux scheduler
+   to force hyperthread siblings of a CPU to only execute 
tasks
+   concurrently on all hyperthreads that are running 
within the
+   same core scheduling group.
+   Possible values are:
+   'on' - Enable scheduler capability to core schedule.
+   By default, no tasks will be core scheduled, but the 
coresched
+   interface can be used to form groups of tasks that are 
forced
+   to share a core.
+   'off' - Disable scheduler capability to core schedule.
+   'secure' - Like 'on' but only enable on systems 
affected by
+   MDS or L1TF vulnerabilities. 'off' otherwise.
+   Default: 'secure'.
+
coresight_cpu_debug.enable
[ARM,ARM64]
Format: 
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index dece79e4d1e9..f3163f4a805c 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -43,6 +43,7 @@ static void __init mds_select_mitigation(void);
 static void __init mds_print_mitigation(void);
 static void __init taa_select_mitigation(void);
 static void __init srbds_select_mitigation(void);
+static void __init coresched_select(void);
 
 /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
 u64 x86_spec_ctrl_base;
@@ -103,6 +104,9 @@ void __init check_bugs(void)
if (boot_cpu_has(X86_FEATURE_STIBP))
x86_spec_ctrl_mask |= SPEC_CTRL_STIBP;
 
+   /* Update whether core-scheduling is needed. */
+   coresched_select();
+
/* Select the proper CPU mitigations before patching alternatives: */
spectre_v1_select_mitigation();
spectre_v2_select_mitigation();
@@ -1808,4 +1812,19 @@ ssize_t cpu_show_srbds(struct device *dev, struct 
device_attribute *attr, char *
 {
return cpu_show_common(dev, attr, buf, X86_BUG_SRBDS);
 }
+
+/*
+ * When coresched=secure command line option is passed (default), disable core
+ * scheduling if CPU does not have MDS/L1TF vulnerability.
+ */
+static void __init coresched_select(void)
+{
+#ifdef CONFIG_SCHED_CORE
+   if (coresched_cmd_secure() &&
+   !boot_cpu_has_bug(X86_BUG_MDS) &&
+   !boot_cpu_has_bug(X86_BUG_L1TF))
+   static_branch_disable(&sched_coresched_supported);
+#endif
+}
+
 #endif
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index d6428aaf67e7..d1f1e64316d6 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -228,4 +228,5 @@ static inline int cpuhp_smt_disable(enum cpuhp_smt_control 
ctrlval) { return 0;
 extern bool cpu_mitigations_off(void);
 extern bool cpu_mitigations_auto_nosmt(void);
 
+extern bool coresched_cmd_secure(void);
 #endif /* _LINUX_CPU_H_ */
diff --git a/include/linux/sched/smt.h b/include/linux/sched/smt.h
index 59d3736c454c..561064eb3268 100644
--- a/include/linux/sched/smt.h
+++ b/include/linux/sched/smt.h
@@ -17,4 +17,8 @@ static inline bool sched_smt_active(void) { return false; }
 
 void arch_smt_update(void);
 
+#ifdef CONFIG_SCHED_CORE
+extern struct static_key_true sched_coresched_supported;
+#endif
+
 #endif
diff --git a/kernel/cpu.c b/kernel/cpu.c
index fa535eaa4826..f22330c3ab4c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2559,3 +2559,46 @@ bool cpu_mitiga

[PATCH -tip 30/32] Documentation: Add core scheduling documentation

2020-11-17 Thread Joel Fernandes (Google)
Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Vineeth Pillai 
Signed-off-by: Vineeth Pillai 
Tested-by: Julien Desfossez 
Reviewed-by: Randy Dunlap 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 330 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 331 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst 
b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..01be28d0687a
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,330 @@
+Core Scheduling
+***
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks don't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core).
+
+Security usecase
+
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance; however, it is not guaranteed that performance will
+always improve, though that is seen to be the case with a number of real world
+workloads. In theory, core scheduling aims to perform at least as well as when
+Hyper Threading is disabled. In practice, this is mostly the case though not
+always: as synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``, where N is the total number of CPUs).
+
+Usage
+-----
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+######
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.core_tag``
+
+Writing ``1`` into this file results in all tasks in the group getting tagged.
+This allows all the CGroup's tasks to run concurrently on a core's
+hyperthreads (also called siblings).
+
+A value of ``0`` in this file means that the tag state of the CGroup is
+inherited from its parent hierarchy. If any ancestor of the CGroup is tagged,
+then the group is tagged.
+
+.. note:: Once a CGroup is tagged via ``cpu.core_tag``, it is not possible to
+  set this for any descendant of the tagged group. For finer grained
+  control, the ``cpu.core_tag_color`` file described next may be used.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+  a core with kernel threads and untagged system threads. For this
+  reason, if a group has ``cpu.core_tag`` of 0, it is considered to be
+  trusted.
+
+* ``cpu.core_tag_color``
+
+For finer grained control over core sharing, a color can also be set in
+addition to the tag. This allows further control of core sharing between child
+CGroups within an already tagged CGroup. The color and the tag are both used to
+generate a `cookie` which is used by the scheduler to identify the group.
+
+Up to 256 different colors can be set (0-255) by writing into this file.
+
+A sample real-world usage of this file follows:
+
+Google uses DAC controls to make ``cpu.core_tag`` writable only by root, while
+``cpu.core_tag_color`` can be changed by anyone.
+
+The hierarchy looks like this:
+::
+  Root group
+ / \
+A   B(These are created by the root daemon - borglet).
+   / \   \
+  C   D   E  (These are created by AppEngine within the container).
+
+A and B are containers for 2 different jobs or apps that are created by a root
+daemon called borglet. borglet then tags each of these groups with the
+``cpu.core_tag`` file. The job itself can create additional child CGroups which
+are colored by the container's AppEngine with the ``cpu.core_tag_color`` file.
+
+The reason why Google uses this 2-level tagging system is that AppEngine wants
+to allow a subset of child CGroups within a tagged parent CGroup to be
+co-scheduled on a core while not being co-scheduled with other child CGroups.
+Think of these child CGroups as belonging to the same customer or project.
+Because these child CGroups are created by AppEngine, they are not trac

[PATCH -tip 13/32] sched: Trivial forced-newidle balancer

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

When a sibling is forced-idle to match the core-cookie; search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.

Acked-by: Paul E. McKenney 
Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 130 +-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 344499ab29f2..7efce9c9d9cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned intcore_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 12e8e6627ab3..3b373b592680 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -202,6 +202,21 @@ static struct task_struct *sched_core_find(struct rq *rq, 
unsigned long cookie)
return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned 
long cookie)
+{
+   struct rb_node *node = &p->core_node;
+
+   node = rb_next(node);
+   if (!node)
+   return NULL;
+
+   p = container_of(node, struct task_struct, core_node);
+   if (p->core_cookie != cookie)
+   return NULL;
+
+   return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -5134,8 +5149,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
const struct sched_class *class;
const struct cpumask *smt_mask;
bool fi_before = false;
+   int i, j, cpu, occ = 0;
bool need_sync;
-   int i, j, cpu;
 
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);
@@ -5260,6 +5275,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
if (!p)
continue;
 
+   if (!is_task_rq_idle(p))
+   occ++;
+
rq_i->core_pick = p;
 
/*
@@ -5285,6 +5303,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
cpu_rq(j)->core_pick = NULL;
}
+   occ = 1;
goto again;
}
}
@@ -5324,6 +5343,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq_i->core->core_forceidle = true;
}
 
+   rq_i->core_pick->core_occupation = occ;
+
if (i == cpu) {
rq_i->core_pick = NULL;
continue;
@@ -5353,6 +5374,113 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+   struct task_struct *p;
+   unsigned long cookie;
+   bool success = false;
+
+   local_irq_disable();
+   double_rq_lock(dst, src);
+
+   cookie = dst->core->core_cookie;
+   if (!cookie)
+   goto unlock;
+
+   if (dst->curr != dst->idle)
+   goto unlock;
+
+   p = sched_core_find(src, cookie);
+   if (p == src->idle)
+   goto unlock;
+
+   do {
+   if (p == src->core_pick || p == src->curr)
+   goto next;
+
+   if (!cpumask_test_cpu(this, &p->cpus_mask))
+   goto next;
+
+   if (p->core_occupation > dst->idle->core_occupation)
+   goto next;
+
+   p->on_rq = TASK_ON_RQ_MIGRATING;
+   deactivate_task(src, p, 0);
+   set_task_cpu(p, this);
+   activate_task(dst, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
+
+   resched_curr(dst);
+
+   success = true;
+   break;
+
+next:
+   p = sched_core_next(p, cookie);
+   } while (p);
+
+unlock:
+   double_rq_unlock(dst, src);
+   local_irq_enable();
+
+   return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+   int i;
+
+   for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+   if (i == cpu)
+   continue;
+
+   if (need_resched())
+  

[PATCH -tip 29/32] sched: Move core-scheduler interfacing code to a new file

2020-11-17 Thread Joel Fernandes (Google)
core.c is already huge. The core-tagging interface code is largely
independent of it. Move it to its own file to make both files easier to
maintain.

Also make the following changes:
- Fix SWA bugs found by Chris Hyser.
- Fix refcount underrun caused by not zero'ing new task's cookie.

Tested-by: Julien Desfossez 
Reviewed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c| 809 +---
 kernel/sched/coretag.c | 819 +
 kernel/sched/sched.h   |  51 ++-
 4 files changed, 872 insertions(+), 808 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1d9762b571a..5ef04bdc849f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -162,11 +162,6 @@ static bool sched_core_empty(struct rq *rq)
return RB_EMPTY_ROOT(&rq->core_tree);
 }
 
-static bool sched_core_enqueued(struct task_struct *task)
-{
-   return !RB_EMPTY_NODE(&task->core_node);
-}
-
 static struct task_struct *sched_core_first(struct rq *rq)
 {
struct task_struct *task;
@@ -188,7 +183,7 @@ static void sched_core_flush(int cpu)
rq->core->core_task_seq++;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
struct rb_node *parent, **node;
struct task_struct *node_task;
@@ -215,7 +210,7 @@ static void sched_core_enqueue(struct rq *rq, struct 
task_struct *p)
rb_insert_color(&p->core_node, &rq->core_tree);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
rq->core->core_task_seq++;
 
@@ -310,7 +305,6 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
-static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -346,16 +340,6 @@ void sched_core_put(void)
__sched_core_disable();
mutex_unlock(_core_mutex);
 }
-
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2);
-
-#else /* !CONFIG_SCHED_CORE */
-
-static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
-static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
-static bool sched_core_enqueued(struct task_struct *task) { return false; }
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2) { }
-
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -3834,6 +3818,9 @@ static void __sched_fork(unsigned long clone_flags, 
struct task_struct *p)
p->capture_control = NULL;
 #endif
init_numa_balancing(clone_flags, p);
+#ifdef CONFIG_SCHED_CORE
+   p->core_task_cookie = 0;
+#endif
 #ifdef CONFIG_SMP
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
@@ -9118,11 +9105,6 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, );
 }
 
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-   return css ? container_of(css, struct task_group, css) : NULL;
-}
-
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -9735,787 +9717,6 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-#ifdef CONFIG_SCHED_CORE
-/*
- * Wrapper representing a complete cookie. The address of the cookie is used as
- * a unique identifier. Each cookie has a unique permutation of the internal
- * cookie fields.
- */
-struct sched_core_cookie {
-   unsigned long task_cookie;
-   unsigned long group_cookie;
-   unsigned long color;
-
-   struct rb_node node;
-   refcount_t refcnt;
-};
-
-/*
- * A simple wrapper around refcount. An allocated sched_core_task_cookie's
- * address is used to compute the cookie of the task.
- */
-struct sched_core_task_cookie {
-   refcount_t refcnt;
-};
-
-/* All active sched_core_cookies */
-static struct rb_root sched_core_cookies = RB_ROOT;
-static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
-
-/*
- * Returns the following:
- * a < b  => -1
- * a == b => 0
- * a > b  => 1
- */
-static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
-const struct sched_core_cookie *b)
-{
-#define COOKIE_CMP_RETURN(fie

[PATCH -tip 03/32] sched/fair: Fix pick_task_fair crashes due to empty rbtree

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

pick_next_entity() is passed curr == NULL during core-scheduling. Due to
this, if the rbtree is empty, the 'left' variable is set to NULL within
the function, which can then cause crashes.

This is not an issue if put_prev_task() is invoked on the currently
running task before calling pick_next_entity(). However, in core
scheduling, it is possible that a sibling CPU picks for another RQ in
the core, via pick_task_fair(). This remote sibling would not get any
opportunities to do a put_prev_task().

Fix it by refactoring pick_task_fair() such that pick_next_entity() is
called with the cfs_rq->curr. This will prevent pick_next_entity() from
crashing if its rbtree is empty.

Also, this fixes another possible bug where update_curr() would not be
called on the cfs_rq hierarchy if the rbtree is empty. This could affect
cross-CPU comparison of vruntime.

Suggested-by: Vineeth Remanan Pillai 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12cf068eeec8..51483a00a755 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7029,15 +7029,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
do {
struct sched_entity *curr = cfs_rq->curr;
 
-   se = pick_next_entity(cfs_rq, NULL);
-
-   if (curr) {
-   if (se && curr->on_rq)
-   update_curr(cfs_rq);
+   if (curr && curr->on_rq)
+   update_curr(cfs_rq);
 
-   if (!se || entity_before(curr, se))
-   se = curr;
-   }
+   se = pick_next_entity(cfs_rq, curr);
 
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-11-17 Thread Joel Fernandes (Google)
From: Josh Don 

Google has a usecase where the first-level tag on a CGroup is not
sufficient. So, a patch has been carried for years that adds a second tag
which is writable by unprivileged users.

Google uses DAC controls to make the 'tag' settable only by root, while
the second-level 'color' can be changed by anyone. The actual names that
Google uses are different, but the concept is the same.

The hierarchy looks like:

Root group
    /  \
   A    B    (These are created by the root daemon - borglet.)
  / \    \
 C   D    E  (These are created by AppEngine within the container.)

The reason why Google has two parts is that AppEngine wants to allow a subset of
subcgroups within a parent tagged cgroup to share execution. Think of these
subcgroups as belonging to the same customer or project. Because these subcgroups
are created by AppEngine, they are not tracked by borglet (the root daemon), so
borglet won't have a chance to set a color for them. That's where the 'color'
file comes in. The color can be set by AppEngine, and once set, the normal tasks
within the subcgroup are not able to overwrite it. This is enforced through the
permissions of the color file in cgroupfs.
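
For illustration, a minimal userspace sketch of the intended flow. The mount
point and group names are assumptions; the file names match the cgroup files
used by the selftests later in this series. The privileged daemon tags the
parent group, and the unprivileged manager then colors a child group it created:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void cg_write(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);

    /* Report failures of either the open or the write. */
    if (fd < 0 || write(fd, val, strlen(val)) < 0)
        perror(path);
    if (fd >= 0)
        close(fd);
}

int main(void)
{
    /* Privileged root daemon tags the container-level group "A". */
    cg_write("/sys/fs/cgroup/cpu/A/cpu.core_tag", "1");

    /* Unprivileged manager colors the subgroup "A/C" it created. */
    cg_write("/sys/fs/cgroup/cpu/A/C/cpu.core_tag_color", "2");

    return 0;
}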

Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Josh Don 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 120 +++---
 kernel/sched/sched.h  |   2 +
 3 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6fbdb1a204bf..c9efdf8ccdf3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -690,6 +690,7 @@ struct task_struct {
unsigned long   core_cookie;
unsigned long   core_task_cookie;
unsigned long   core_group_cookie;
+   unsigned long   core_color;
unsigned intcore_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd75b3d62a97..8f17ec8e993e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9049,9 +9049,6 @@ void sched_offline_group(struct task_group *tg)
spin_unlock_irqrestore(_group_lock, flags);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-  unsigned long *group_cookie_ptr);
-
 static void sched_change_group(struct task_struct *tsk, int type)
 {
struct task_group *tg;
@@ -9747,6 +9744,7 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 struct sched_core_cookie {
unsigned long task_cookie;
unsigned long group_cookie;
+   unsigned long color;
 
struct rb_node node;
refcount_t refcnt;
@@ -9782,6 +9780,7 @@ static int sched_core_cookie_cmp(const struct 
sched_core_cookie *a,
 
COOKIE_CMP_RETURN(task_cookie);
COOKIE_CMP_RETURN(group_cookie);
+   COOKIE_CMP_RETURN(color);
 
/* all cookie fields match */
return 0;
@@ -9819,7 +9818,7 @@ static void sched_core_put_cookie(struct 
sched_core_cookie *cookie)
 
 /*
  * A task's core cookie is a compound structure composed of various cookie
- * fields (task_cookie, group_cookie). The overall core_cookie is
+ * fields (task_cookie, group_cookie, color). The overall core_cookie is
  * a pointer to a struct containing those values. This function either finds
  * an existing core_cookie or creates a new one, and then updates the task's
  * core_cookie to point to it. Additionally, it handles the necessary reference
@@ -9837,6 +9836,7 @@ static void __sched_core_update_cookie(struct task_struct 
*p)
struct sched_core_cookie temp = {
.task_cookie= p->core_task_cookie,
.group_cookie   = p->core_group_cookie,
+   .color  = p->core_color
};
const bool is_zero_cookie =
(sched_core_cookie_cmp(, _cookie) == 0);
@@ -9892,6 +9892,7 @@ static void __sched_core_update_cookie(struct task_struct 
*p)
 
match->task_cookie = temp.task_cookie;
match->group_cookie = temp.group_cookie;
+   match->color = temp.color;
refcount_set(>refcnt, 1);
 
rb_link_node(>node, parent, node);
@@ -9949,6 +9950,9 @@ static void sched_core_update_cookie(struct task_struct 
*p, unsigned long cookie
case sched_core_group_cookie_type:
p->core_group_cookie = cookie;
break;
+   case sched_core_color_type:
+   p->core_color = cookie;
+   break;
default:
WARN_ON_ONCE(1);
}
@@ -9967,19 +9971,23 @@ static void sched_core_update_cookie(struct task_struct 
*p, unsigned long cookie
sched_core_enqueue(task_rq(p), p);
 }
 
-void cpu_core_get_group_cookie(struct task_group *tg,
-  unsigned long *group_cookie_ptr);
+void cpu_c

[PATCH -tip 28/32] kselftest: Add tests for core-sched interface

2020-11-17 Thread Joel Fernandes (Google)
Add a kselftest to ensure that the core-sched interface is working
correctly.

Tested-by: Julien Desfossez 
Reviewed-by: Josh Don 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|  14 +
 tools/testing/selftests/sched/config  |   1 +
 .../testing/selftests/sched/test_coresched.c  | 818 ++
 4 files changed, 834 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore 
b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index ..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile 
b/tools/testing/selftests/sched/Makefile
new file mode 100644
index ..e43b74fc5d7e
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+ $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched
+TEST_PROGS := test_coresched
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config 
b/tools/testing/selftests/sched/config
new file mode 100644
index ..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/test_coresched.c 
b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index ..70ed2758fe23
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,818 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+if (!cond) {
+   printf("Error: %s\n", str);
+   abort();
+}
+}
+
+char *make_group_root(void)
+{
+   char *mntpath, *mnt;
+   int ret;
+
+   mntpath = malloc(50);
+   if (!mntpath) {
+   perror("Failed to allocate mntpath\n");
+   abort();
+   }
+
+   sprintf(mntpath, "/tmp/coresched-test-XX");
+   mnt = mkdtemp(mntpath);
+   if (!mnt) {
+   perror("Failed to create mount: ");
+   exit(-1);
+   }
+
+   ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+   if (ret == -1) {
+   perror("Failed to mount cgroup: ");
+   exit(-1);
+   }
+
+   return mnt;
+}
+
+char *read_group_cookie(char *cgroup_path)
+{
+char path[50] = {}, *val;
+int fd;
+
+sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
+fd = open(path, O_RDONLY, 0666);
+if (fd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+}
+
+val = calloc(1, 50);
+if (read(fd, val, 50) == -1) {
+   perror("Failed to read group cookie: ");
+   abort();
+}
+
+val[strcspn(val, "\r\n")] = 0;
+
+close(fd);
+return val;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+char tag_path[50] = {}, rdbuf[8] = {};
+int tfd;
+
+sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+tfd = open(tag_path, O_RDONLY, 0666);
+if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+}
+
+if (read(tfd, rdbuf, 1) != 1) {
+   perror("Failed to enable coresched on cgroup: ");
+   abort();
+}
+
+if (strcmp(rdbuf, tag)) {
+   printf("Group tag does not match (exp: %s, act: %s)\n", tag, rdbuf);
+   abort();
+}
+
+if (close(tfd) == -1) {
+   perror("Failed to close tag fd: ");
+   abort();
+}
+}
+
+void assert_group_color(char *cgroup_path, const char *color)
+{
+char tag_path[50] = {}, rdbuf[8] = {};
+int tfd;
+
+sprintf(tag_path, "%s/cpu.core_tag_color", cgroup_path);
+tfd = open

[PATCH -tip 27/32] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG

2020-11-17 Thread Joel Fernandes (Google)
This will be used by kselftest to verify the CGroup cookie value that is
set by the CGroup interface.

Reviewed-by: Josh Don 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f17ec8e993e..f1d9762b571a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10277,6 +10277,21 @@ static u64 cpu_core_tag_color_read_u64(struct 
cgroup_subsys_state *css, struct c
return tg->core_tag_color;
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+static u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css, 
struct cftype *cft)
+{
+   unsigned long group_cookie, color;
+
+   cpu_core_get_group_cookie_and_color(css_tg(css), _cookie, );
+
+   /*
+* Combine group_cookie and color into a single 64 bit value, for
+* display purposes only.
+*/
+   return (group_cookie << 32) | (color & 0x);
+}
+#endif
+
 struct write_core_tag {
struct cgroup_subsys_state *css;
unsigned long cookie;
@@ -10550,6 +10565,14 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_core_tag_color_read_u64,
.write_u64 = cpu_core_tag_color_write_u64,
},
+#ifdef CONFIG_SCHED_DEBUG
+   /* Read the effective cookie (color+tag) of the group. */
+   {
+   .name = "core_group_cookie",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_group_cookie_read_u64,
+   },
+#endif
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
{
@@ -10737,6 +10760,14 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_core_tag_color_read_u64,
.write_u64 = cpu_core_tag_color_write_u64,
},
+#ifdef CONFIG_SCHED_DEBUG
+   /* Read the effective cookie (color+tag) of the group. */
+   {
+   .name = "core_group_cookie",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_group_cookie_read_u64,
+   },
+#endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
{
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 25/32] sched: Refactor core cookie into struct

2020-11-17 Thread Joel Fernandes (Google)
From: Josh Don 

The overall core cookie is currently a single unsigned long value. This
poses issues as we seek to add additional sub-fields to the cookie. This
patch refactors the core_cookie to be a pointer to a struct containing
an arbitrary set of cookie fields.

We maintain a sorted RB tree of existing core cookies so that multiple
tasks may share the same core_cookie.

This will be especially useful in the next patch, where the concept of
cookie color is introduced.
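
The find-or-create step follows the usual kernel rbtree walk driven by the
comparison helper; a condensed sketch is below (the function name is invented,
and locking and refcounting of the existing-match path are omitted; the struct,
rb_root and cmp helper are the ones declared in the diff below):

#include <linux/rbtree.h>

/* Condensed sketch of the find-or-create walk (locking omitted). */
static struct sched_core_cookie *lookup_or_insert(struct sched_core_cookie *new)
{
    struct rb_node **node = &sched_core_cookies.rb_node, *parent = NULL;

    while (*node) {
        struct sched_core_cookie *cur =
            container_of(*node, struct sched_core_cookie, node);
        int cmp = sched_core_cookie_cmp(new, cur);

        parent = *node;
        if (cmp < 0)
            node = &(*node)->rb_left;
        else if (cmp > 0)
            node = &(*node)->rb_right;
        else
            return cur;    /* existing cookie: caller takes a reference */
    }

    rb_link_node(&new->node, parent, node);
    rb_insert_color(&new->node, &sched_core_cookies);
    return new;
}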

Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Josh Don 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 481 +--
 kernel/sched/sched.h |  11 +-
 2 files changed, 429 insertions(+), 63 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cc36c384364e..bd75b3d62a97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3958,6 +3958,7 @@ static inline void init_schedstats(void) {}
 int sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
unsigned long flags;
+   int __maybe_unused ret;
 
__sched_fork(clone_flags, p);
/*
@@ -4037,20 +4038,9 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
 #ifdef CONFIG_SCHED_CORE
RB_CLEAR_NODE(>core_node);
 
-   /*
-* If parent is tagged via per-task cookie, tag the child (either with
-* the parent's cookie, or a new one). The final cookie is calculated
-* by concatenating the per-task cookie with that of the CGroup's.
-*/
-   if (current->core_task_cookie) {
-
-   /* If it is not CLONE_THREAD fork, assign a unique per-task 
tag. */
-   if (!(clone_flags & CLONE_THREAD)) {
-   return sched_core_share_tasks(p, p);
-   }
-   /* Otherwise share the parent's per-task tag. */
-   return sched_core_share_tasks(p, current);
-   }
+   ret = sched_core_fork(p, clone_flags);
+   if (ret)
+   return ret;
 #endif
return 0;
 }
@@ -9059,6 +9049,9 @@ void sched_offline_group(struct task_group *tg)
spin_unlock_irqrestore(_group_lock, flags);
 }
 
+void cpu_core_get_group_cookie(struct task_group *tg,
+  unsigned long *group_cookie_ptr);
+
 static void sched_change_group(struct task_struct *tsk, int type)
 {
struct task_group *tg;
@@ -9073,11 +9066,7 @@ static void sched_change_group(struct task_struct *tsk, 
int type)
tg = autogroup_task_group(tsk, tg);
 
 #ifdef CONFIG_SCHED_CORE
-   if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
-   tsk->core_cookie = 0UL;
-
-   if (tg->tagged /* && !tsk->core_cookie ? */)
-   tsk->core_cookie = (unsigned long)tg;
+   sched_core_change_group(tsk, tg);
 #endif
 
tsk->sched_task_group = tg;
@@ -9177,9 +9166,9 @@ static void cpu_cgroup_css_offline(struct 
cgroup_subsys_state *css)
 #ifdef CONFIG_SCHED_CORE
struct task_group *tg = css_tg(css);
 
-   if (tg->tagged) {
+   if (tg->core_tagged) {
sched_core_put();
-   tg->tagged = 0;
+   tg->core_tagged = 0;
}
 #endif
 }
@@ -9751,38 +9740,225 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 
 #ifdef CONFIG_SCHED_CORE
 /*
- * A simple wrapper around refcount. An allocated sched_core_cookie's
- * address is used to compute the cookie of the task.
+ * Wrapper representing a complete cookie. The address of the cookie is used as
+ * a unique identifier. Each cookie has a unique permutation of the internal
+ * cookie fields.
  */
 struct sched_core_cookie {
+   unsigned long task_cookie;
+   unsigned long group_cookie;
+
+   struct rb_node node;
refcount_t refcnt;
 };
 
 /*
- * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
- * @p: The task to assign a cookie to.
- * @cookie: The cookie to assign.
- * @group: is it a group interface or a per-task interface.
+ * A simple wrapper around refcount. An allocated sched_core_task_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_task_cookie {
+   refcount_t refcnt;
+};
+
+/* All active sched_core_cookies */
+static struct rb_root sched_core_cookies = RB_ROOT;
+static DEFINE_RAW_SPINLOCK(sched_core_cookies_lock);
+
+/*
+ * Returns the following:
+ * a < b  => -1
+ * a == b => 0
+ * a > b  => 1
+ */
+static int sched_core_cookie_cmp(const struct sched_core_cookie *a,
+const struct sched_core_cookie *b)
+{
+#define COOKIE_CMP_RETURN(field) do {  \
+   if (a->field < b->field)\
+   return -1;  \
+   else if (a->field > b->field)   \
+   return 1;   \
+} while (0)  

[PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

2020-11-17 Thread Joel Fernandes (Google)
Add a per-thread core scheduling interface which allows a thread to share a
core with another thread, or have a core exclusively for itself.

ChromeOS uses core-scheduling to securely enable hyperthreading.  This cuts
down the keypress latency in Google docs from 150ms to 50ms while improving
the camera streaming frame rate by ~3%.
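
For illustration, a minimal userspace sketch of the new interface (error
handling trimmed). Per sched_core_share_pid() below, a non-zero pid shares a
cookie between the caller and that task, while pid 0 clears the caller's
per-task cookie and may require CAP_SYS_ADMIN:

#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef PR_SCHED_CORE_SHARE
#define PR_SCHED_CORE_SHARE 59    /* value added by this patch */
#endif

int main(int argc, char **argv)
{
    pid_t pid = (argc > 1) ? atoi(argv[1]) : 0;

    /* pid > 0: share a core with <pid>; pid == 0: reset the caller's cookie. */
    if (prctl(PR_SCHED_CORE_SHARE, pid) == -1)
        perror("PR_SCHED_CORE_SHARE");

    return 0;
}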

Tested-by: Julien Desfossez 
Reviewed-by: Aubrey Li 
Co-developed-by: Chris Hyser 
Signed-off-by: Chris Hyser 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h|  1 +
 include/uapi/linux/prctl.h   |  3 ++
 kernel/sched/core.c  | 51 +---
 kernel/sys.c |  3 ++
 tools/include/uapi/linux/prctl.h |  3 ++
 5 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c6a3b0fa952b..79d76c78cc8e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2083,6 +2083,7 @@ void sched_core_unsafe_enter(void);
 void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
+int sched_core_share_pid(pid_t pid);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..217b0482aea1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER  57
 #define PR_GET_IO_FLUSHER  58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE59
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7ccca355623a..a95898c75bdf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -310,6 +310,7 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
+static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -4037,8 +4038,9 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
RB_CLEAR_NODE(>core_node);
 
/*
-* Tag child via per-task cookie only if parent is tagged via per-task
-* cookie. This is independent of, but can be additive to the CGroup 
tagging.
+* If parent is tagged via per-task cookie, tag the child (either with
+* the parent's cookie, or a new one). The final cookie is calculated
+* by concatenating the per-task cookie with that of the CGroup's.
 */
if (current->core_task_cookie) {
 
@@ -9855,7 +9857,7 @@ static int sched_core_share_tasks(struct task_struct *t1, 
struct task_struct *t2
unsigned long cookie;
int ret = -ENOMEM;
 
-   mutex_lock(_core_mutex);
+   mutex_lock(_core_tasks_mutex);
 
/*
 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
@@ -9954,10 +9956,51 @@ static int sched_core_share_tasks(struct task_struct 
*t1, struct task_struct *t2
 
ret = 0;
 out_unlock:
-   mutex_unlock(_core_mutex);
+   mutex_unlock(_core_tasks_mutex);
return ret;
 }
 
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(pid_t pid)
+{
+   struct task_struct *task;
+   int err;
+
+   if (pid == 0) { /* Recent current task's cookie. */
+   /* Resetting a cookie requires privileges. */
+   if (current->core_task_cookie)
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+   task = NULL;
+   } else {
+   rcu_read_lock();
+   task = pid ? find_task_by_vpid(pid) : current;
+   if (!task) {
+   rcu_read_unlock();
+   return -ESRCH;
+   }
+
+   get_task_struct(task);
+
+   /*
+* Check if this process has the right to modify the specified
+* process. Use the regular "ptrace_may_access()" checks.
+*/
+   if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+   rcu_read_unlock();
+   err = -EPERM;
+   goto out_put;
+   }
+   rcu_read_unlock();
+   }
+
+   err = sched_core_share_tasks(current, task);
+out_put:
+   if (task)
+   put_task_struct(task);
+   return err;
+}
+
 /* CGroup interface */
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct 
cftype *cft)
 {
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..61a3c98e36de 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,6 +2530,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, 
unsigned long, arg3,
 
error = (current->flags & PR_IO

[PATCH -tip 24/32] sched: Release references to the per-task cookie on exit

2020-11-17 Thread Joel Fernandes (Google)
During exit, we have to free the references to a cookie that might be shared by
many tasks. This commit therefore ensures that when the task_struct is released,
any references to cookies that it holds are also released.

Reviewed-by: Chris Hyser 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h | 3 +++
 kernel/fork.c | 1 +
 kernel/sched/core.c   | 8 
 3 files changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 79d76c78cc8e..6fbdb1a204bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2084,11 +2084,14 @@ void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
 int sched_core_share_pid(pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+#define sched_core_share_pid(pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
 #endif
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 7199d359690c..5468c93829c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a95898c75bdf..cc36c384364e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10066,6 +10066,14 @@ static int cpu_core_tag_write_u64(struct 
cgroup_subsys_state *css, struct cftype
 
return 0;
 }
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+   if (!tsk->core_task_cookie)
+   return;
+   sched_core_put_task_cookie(tsk->core_task_cookie);
+   sched_core_put();
+}
 #endif
 
 static struct cftype cpu_legacy_files[] = {
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

2020-11-17 Thread Joel Fernandes (Google)
In order to prevent interference and clearly support both per-task and CGroup
APIs, split the cookie into two and allow it to be set from either the per-task
or the CGroup API. The final cookie is the combined value of both and is
computed when the stop-machine executes during a change of cookie.

Also, for the per-task cookie, it would get weird if we used pointers to any
ephemeral objects. For this reason, introduce a refcounted object whose sole
purpose is to provide a unique cookie value by way of the object's address.

While at it, refactor the CGroup code a bit. Future patches will introduce more
APIs and support.
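
Purely as an illustration of how the two halves combine (the helper name is
made up; the shift mirrors sched_core_tag_requeue() in the diff below):

/*
 * Illustration only: the upper half of the word holds the per-task cookie,
 * the lower half holds the CGroup cookie.
 */
static unsigned long combine_core_cookie(unsigned long task_cookie,
                                         unsigned long group_cookie)
{
    return (task_cookie << (sizeof(unsigned long) * 4)) + group_cookie;
}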

Reviewed-by: Josh Don 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   2 +
 kernel/sched/core.c   | 241 --
 kernel/sched/debug.c  |   4 +
 3 files changed, 236 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a60868165590..c6a3b0fa952b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned long   core_task_cookie;
+   unsigned long   core_group_cookie;
unsigned intcore_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b99a7493d590..7ccca355623a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -346,11 +346,14 @@ void sched_core_put(void)
mutex_unlock(_core_mutex);
 }
 
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
 static bool sched_core_enqueued(struct task_struct *task) { return false; }
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2) { }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -4032,6 +4035,20 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
 #endif
 #ifdef CONFIG_SCHED_CORE
RB_CLEAR_NODE(>core_node);
+
+   /*
+* Tag child via per-task cookie only if parent is tagged via per-task
+* cookie. This is independent of, but can be additive to the CGroup 
tagging.
+*/
+   if (current->core_task_cookie) {
+
+   /* If it is not CLONE_THREAD fork, assign a unique per-task 
tag. */
+   if (!(clone_flags & CLONE_THREAD)) {
+   return sched_core_share_tasks(p, p);
+   }
+   /* Otherwise share the parent's per-task tag. */
+   return sched_core_share_tasks(p, current);
+   }
 #endif
return 0;
 }
@@ -9731,6 +9748,217 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_SCHED_CORE
+/*
+ * A simple wrapper around refcount. An allocated sched_core_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_cookie {
+   refcount_t refcnt;
+};
+
+/*
+ * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
+ * @p: The task to assign a cookie to.
+ * @cookie: The cookie to assign.
+ * @group: is it a group interface or a per-task interface.
+ *
+ * This function is typically called from a stop-machine handler.
+ */
+void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool 
group)
+{
+   if (!p)
+   return;
+
+   if (group)
+   p->core_group_cookie = cookie;
+   else
+   p->core_task_cookie = cookie;
+
+   /* Use up half of the cookie's bits for task cookie and remaining for 
group cookie. */
+   p->core_cookie = (p->core_task_cookie <<
+   (sizeof(unsigned long) * 4)) + 
p->core_group_cookie;
+
+   if (sched_core_enqueued(p)) {
+   sched_core_dequeue(task_rq(p), p);
+   if (!p->core_task_cookie)
+   return;
+   }
+
+   if (sched_core_enabled(task_rq(p)) &&
+   p->core_cookie && task_on_rq_queued(p))
+   sched_core_enqueue(task_rq(p), p);
+}
+
+/* Per-task interface */
+static unsigned long sched_core_alloc_task_cookie(void)
+{
+   struct sched_core_cookie *ptr =
+   kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
+
+   if (!ptr)
+   return 0;
+   refcount_set(>refcnt, 1);
+
+   /*
+* NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+* is done after the stopper runs.
+*/
+   sched_core_get();
+   return (unsigned long)ptr;
+}
+
+static bool sched_core_get_task_cookie(unsigned long co

[PATCH -tip 20/32] entry/kvm: Protect the kernel when entering from guest

2020-11-17 Thread Joel Fernandes (Google)
From: Vineeth Pillai 

Similar to how user-to-kernel mode transitions are protected in earlier
patches, protect entry into the kernel from guest mode as well.

Tested-by: Julien Desfossez 
Reviewed-by: Joel Fernandes (Google) 
Reviewed-by: Alexandre Chartre 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 arch/x86/kvm/x86.c|  2 ++
 include/linux/entry-kvm.h | 12 
 kernel/entry/kvm.c| 33 +
 3 files changed, 47 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 447edc0d1d5a..a50be74f70f1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8910,6 +8910,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 */
smp_mb__after_srcu_read_unlock();
 
+   kvm_exit_to_guest_mode();
/*
 * This handles the case where a posted interrupt was
 * notified with kvm_vcpu_kick.
@@ -9003,6 +9004,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
}
 
+   kvm_enter_from_guest_mode();
local_irq_enable();
preempt_enable();
 
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 9b93f8584ff7..67da6dcf442b 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -77,4 +77,16 @@ static inline bool xfer_to_guest_mode_work_pending(void)
 }
 #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
 
+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from 
guest.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_enter_from_guest_mode(void);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_exit_to_guest_mode(void);
+
 #endif
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 49972ee99aff..3b603e8bd5da 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -50,3 +50,36 @@ int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu)
return xfer_to_guest_mode_work(vcpu, ti_work);
 }
 EXPORT_SYMBOL_GPL(xfer_to_guest_mode_handle_work);
+
+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from 
guest.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_enter_from_guest_mode(void)
+{
+   if (!entry_kernel_protected())
+   return;
+   sched_core_unsafe_enter();
+}
+EXPORT_SYMBOL_GPL(kvm_enter_from_guest_mode);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * Caller should ensure interrupts are off.
+ */
+void kvm_exit_to_guest_mode(void)
+{
+   if (!entry_kernel_protected())
+   return;
+   sched_core_unsafe_exit();
+
+   /*
+* Wait here instead of in xfer_to_guest_mode_handle_work(). The reason
+* is because in vcpu_run(), xfer_to_guest_mode_handle_work() is called
+* when a vCPU was either runnable or blocked. However, we only care
+* about the runnable case (VM entry/exit) which is handled by
+* vcpu_enter_guest().
+*/
+   sched_core_wait_till_safe(XFER_TO_GUEST_MODE_WORK);
+}
+EXPORT_SYMBOL_GPL(kvm_exit_to_guest_mode);
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 18/32] kernel/entry: Add support for core-wide protection of kernel-mode

2020-11-17 Thread Joel Fernandes (Google)
Core-scheduling prevents hyperthreads in usermode from attacking each
other, but it does not do anything about one of the hyperthreads
entering the kernel for any reason. This leaves the door open for MDS
and L1TF attacks with concurrent execution sequences between
hyperthreads.

This patch therefore adds support for protecting all syscall and IRQ
kernel mode entries. Care is taken to track the outermost usermode exit
and entry using per-cpu counters. In cases where one of the hyperthreads
enters the kernel, no additional IPIs are sent. Further, IPIs are avoided
when not needed - for example, idle and non-cookie HTs do not need to be
forced into kernel mode.

More information about attacks:
For MDS, it is possible for syscalls, IRQ and softirq handlers to leak
data to either host or guest attackers. For L1TF, it is possible to leak
to guest attackers. There is no possible mitigation involving flushing
of buffers to avoid this, since the execution of attacker and victim
happens concurrently on 2 or more HTs.
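
Conceptually, the per-cpu tracking mentioned above behaves like a nesting
counter: only the outermost entry into the kernel marks the CPU as unsafe, and
only the matching outermost exit clears it again. A rough sketch of that idea
follows (the names are invented, and the IPIs, locking and wait loop of the
real implementation are omitted):

#include <linux/percpu.h>

/* Conceptual sketch only - not the code added by this patch. */
static DEFINE_PER_CPU(unsigned int, kernel_entry_nesting);

static void unsafe_enter_sketch(void)
{
    if (this_cpu_inc_return(kernel_entry_nesting) == 1) {
        /* Outermost entry from user mode: this HT now runs kernel code. */
    }
}

static void unsafe_exit_sketch(void)
{
    if (this_cpu_dec_return(kernel_entry_nesting) == 0) {
        /* Outermost return to user mode: siblings no longer need to wait. */
    }
}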

Reviewed-by: Alexandre Chartre 
Tested-by: Julien Desfossez 
Cc: Julien Desfossez 
Cc: Tim Chen 
Cc: Aaron Lu 
Cc: Aubrey Li 
Cc: Tim Chen 
Cc: Paul E. McKenney 
Co-developed-by: Vineeth Pillai 
Signed-off-by: Vineeth Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/kernel-parameters.txt |  11 +
 include/linux/entry-common.h  |  12 +-
 include/linux/sched.h |  12 +
 kernel/entry/common.c |  28 +-
 kernel/sched/core.c   | 241 ++
 kernel/sched/sched.h  |   3 +
 6 files changed, 304 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index bd1a5b87a5e2..b185c6ed4aba 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4678,6 +4678,17 @@
 
sbni=   [NET] Granch SBNI12 leased line adapter
 
+   sched_core_protect_kernel=
+   [SCHED_CORE] Pause SMT siblings of a core running in
+   user mode, if at least one of the siblings of the core
+   is running in kernel mode. This is to guarantee that
+   kernel data is not leaked to tasks which are not trusted
+   by the kernel. A value of 0 disables protection, 1
+   enables protection. The default is 1. Note that 
protection
+   depends on the arch defining the _TIF_UNSAFE_RET flag.
+   Further, for protecting VMEXIT, arch needs to call
+   KVM entry/exit hooks.
+
sched_debug [KNL] Enables verbose scheduler debug messages.
 
schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1a128baf3628..022e1f114157 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -33,6 +33,10 @@
 # define _TIF_PATCH_PENDING(0)
 #endif
 
+#ifndef _TIF_UNSAFE_RET
+# define _TIF_UNSAFE_RET   (0)
+#endif
+
 #ifndef _TIF_UPROBE
 # define _TIF_UPROBE   (0)
 #endif
@@ -74,7 +78,7 @@
 #define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |   \
 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |  \
-ARCH_EXIT_TO_USER_MODE_WORK)
+_TIF_UNSAFE_RET | ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
  * arch_check_user_regs - Architecture specific sanity check for user mode regs
@@ -444,4 +448,10 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs 
*regs);
  */
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t 
irq_state);
 
+/* entry_kernel_protected - Is kernel protection on entry/exit into kernel 
supported? */
+static inline bool entry_kernel_protected(void)
+{
+   return IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected()
+   && _TIF_UNSAFE_RET != 0;
+}
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7efce9c9d9cf..a60868165590 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2076,4 +2076,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+void sched_core_unsafe_enter(void);
+void sched_core_unsafe_exit(void);
+bool sched_core_wait_till_safe(unsigned long ti_check);
+bool sched_core_kernel_protected(void);
+#else
+#define sched_core_unsafe_enter(ignore) do { } while (0)
+#define sched_core_unsafe_exit(ignore) do { } while (0)
+#define sched_core_wait_till_safe(ignore) do { } while (0)
+#define sched_core_kernel_protected(ignore) do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/entry/common.c b/kerne

[PATCH -tip 17/32] arch/x86: Add a new TIF flag for untrusted tasks

2020-11-17 Thread Joel Fernandes (Google)
Add a new TIF flag to indicate whether the kernel needs to be careful
and take additional steps to mitigate micro-architectural issues during
entry into user or guest mode.

This new flag will be used by the series to determine whether waiting is
needed during exit to user or guest mode.

Tested-by: Julien Desfossez 
Reviewed-by: Aubrey Li 
Signed-off-by: Joel Fernandes (Google) 
---
 arch/x86/include/asm/thread_info.h | 2 ++
 kernel/sched/sched.h   | 6 ++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index 93277a8d2ef0..ae4f6196e38c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -99,6 +99,7 @@ struct thread_info {
 #define TIF_SPEC_FORCE_UPDATE  23  /* Force speculation MSR update in 
context switch */
 #define TIF_FORCED_TF  24  /* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP  25  /* set when we want DEBUGCTLMSR_BTF */
+#define TIF_UNSAFE_RET 26  /* On return to process/guest, perform 
safety checks. */
 #define TIF_LAZY_MMU_UPDATES   27  /* task is updating the mmu lazily */
 #define TIF_SYSCALL_TRACEPOINT 28  /* syscall tracepoint instrumentation */
 #define TIF_ADDR32 29  /* 32-bit address space on 64 bits */
@@ -127,6 +128,7 @@ struct thread_info {
 #define _TIF_SPEC_FORCE_UPDATE (1 << TIF_SPEC_FORCE_UPDATE)
 #define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
+#define _TIF_UNSAFE_RET(1 << TIF_UNSAFE_RET)
 #define _TIF_LAZY_MMU_UPDATES  (1 << TIF_LAZY_MMU_UPDATES)
 #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32(1 << TIF_ADDR32)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5c258ab64052..615092cb693c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2851,3 +2851,9 @@ static inline bool is_per_cpu_kthread(struct task_struct 
*p)
 
 void swake_up_all_locked(struct swait_queue_head *q);
 void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
+
+#ifdef CONFIG_SCHED_CORE
+#ifndef TIF_UNSAFE_RET
+#define TIF_UNSAFE_RET (0)
+#endif
+#endif
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 16/32] irq_work: Cleanup

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Get rid of the __call_single_node union and clean up the API a little
to avoid external code relying on the structure layout as much.

(Needed for irq_work_is_busy() API in core-scheduling series).
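
For reference, a short usage sketch of the reworked initializers and the new
helper (the work function and the teardown site are invented for illustration):

#include <linux/irq_work.h>
#include <asm/processor.h>    /* cpu_relax() */

static void example_fn(struct irq_work *work)
{
    /* Runs in hard interrupt context once the queued work fires. */
}

static struct irq_work example_work = IRQ_WORK_INIT(example_fn);

static void example_teardown(void)
{
    irq_work_queue(&example_work);

    /* New helper: wait until the work is neither pending nor running. */
    while (irq_work_is_busy(&example_work))
        cpu_relax();
}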

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 drivers/gpu/drm/i915/i915_request.c |  4 ++--
 include/linux/irq_work.h| 33 ++---
 include/linux/irqflags.h|  4 ++--
 kernel/bpf/stackmap.c   |  2 +-
 kernel/irq_work.c   | 18 
 kernel/printk/printk.c  |  6 ++
 kernel/rcu/tree.c   |  3 +--
 kernel/time/tick-sched.c|  6 ++
 kernel/trace/bpf_trace.c|  2 +-
 9 files changed, 41 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index 0e813819b041..5385b081a376 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -197,7 +197,7 @@ __notify_execute_cb(struct i915_request *rq, bool 
(*fn)(struct irq_work *wrk))
 
llist_for_each_entry_safe(cb, cn,
  llist_del_all(>execute_cb),
- work.llnode)
+ work.node.llist)
fn(>work);
 }
 
@@ -460,7 +460,7 @@ __await_execution(struct i915_request *rq,
 * callback first, then checking the ACTIVE bit, we serialise with
 * the completed/retired request.
 */
-   if (llist_add(>work.llnode, >execute_cb)) {
+   if (llist_add(>work.node.llist, >execute_cb)) {
if (i915_request_is_active(signal) ||
__request_in_flight(signal))
__notify_execute_cb_imm(signal);
diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 30823780c192..ec2a47a81e42 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -14,28 +14,37 @@
  */
 
 struct irq_work {
-   union {
-   struct __call_single_node node;
-   struct {
-   struct llist_node llnode;
-   atomic_t flags;
-   };
-   };
+   struct __call_single_node node;
void (*func)(struct irq_work *);
 };
 
+#define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){  \
+   .node = { .u_flags = (_flags), },   \
+   .func = (_func),\
+}
+
+#define IRQ_WORK_INIT(_func) __IRQ_WORK_INIT(_func, 0)
+#define IRQ_WORK_INIT_LAZY(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_LAZY)
+#define IRQ_WORK_INIT_HARD(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_HARD_IRQ)
+
+#define DEFINE_IRQ_WORK(name, _f)  \
+   struct irq_work name = IRQ_WORK_INIT(_f)
+
 static inline
 void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
 {
-   atomic_set(>flags, 0);
-   work->func = func;
+   *work = IRQ_WORK_INIT(func);
 }
 
-#define DEFINE_IRQ_WORK(name, _f) struct irq_work name = { \
-   .flags = ATOMIC_INIT(0),\
-   .func  = (_f)   \
+static inline bool irq_work_is_pending(struct irq_work *work)
+{
+   return atomic_read(>node.a_flags) & IRQ_WORK_PENDING;
 }
 
+static inline bool irq_work_is_busy(struct irq_work *work)
+{
+   return atomic_read(>node.a_flags) & IRQ_WORK_BUSY;
+}
 
 bool irq_work_queue(struct irq_work *work);
 bool irq_work_queue_on(struct irq_work *work, int cpu);
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 3ed4e8771b64..fef2d43a7a1d 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -109,12 +109,12 @@ do {  \
 
 # define lockdep_irq_work_enter(__work)
\
  do {  \
- if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+ if (!(atomic_read(&__work->node.a_flags) & 
IRQ_WORK_HARD_IRQ))\
current->irq_config = 1;\
  } while (0)
 # define lockdep_irq_work_exit(__work) \
  do {  \
- if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+ if (!(atomic_read(&__work->node.a_flags) & 
IRQ_WORK_HARD_IRQ))\
current->irq_config = 0;\
  } while (0)
 
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 06065fa27124..599041cd0c8a 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -298,7 +298,7 @@ static void stac

[PATCH -tip 14/32] sched: migration changes for core scheduling

2020-11-17 Thread Joel Fernandes (Google)
From: Aubrey Li 

 - Don't migrate if there is a cookie mismatch
 Load balance tries to move a task from the busiest CPU to the
 destination CPU. When core scheduling is enabled, if the
 task's cookie does not match the destination CPU's
 core cookie, the task will be skipped by this CPU. This
 mitigates the forced idle time on the destination CPU.

 - Select a cookie-matched idle CPU
 In the fast path of task wakeup, select the first cookie-matched
 idle CPU instead of the first idle CPU.

 - Find the cookie-matched idlest CPU
 In the slow path of task wakeup, find the idlest CPU whose core
 cookie matches the task's cookie.

 - Don't migrate a task if the cookie does not match
 For the NUMA load balance, don't migrate a task to a CPU whose
 core cookie does not match the task's cookie (see the sketch below).
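
All four cases boil down to comparing the task's cookie with the core-wide
cookie of the candidate runqueue. A minimal sketch of such a helper, assuming
the field names used elsewhere in this series (the actual kernel/sched/sched.h
hunk is not reproduced here):

/*
 * Sketch only; mirrors the inline check in select_idle_cpu() below and
 * assumes the rq/task_struct definitions from kernel/sched/sched.h.
 */
static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
{
    /* With core scheduling disabled, any CPU is an acceptable target. */
    if (!sched_core_enabled(rq))
        return true;

    /* Idle-core special casing is omitted from this sketch. */
    return rq->core->core_cookie == p->core_cookie;
}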

Tested-by: Julien Desfossez 
Signed-off-by: Aubrey Li 
Signed-off-by: Tim Chen 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c  | 64 
 kernel/sched/sched.h | 29 
 2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..ceb3906c9a8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,15 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
 
+#ifdef CONFIG_SCHED_CORE
+   /*
+* Skip this cpu if source task's cookie does not match
+* with CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+   continue;
+#endif
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -5867,11 +5876,17 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+   struct rq *rq = cpu_rq(i);
+
+#ifdef CONFIG_SCHED_CORE
+   if (!sched_core_cookie_match(rq, p))
+   continue;
+#endif
+
if (sched_idle_cpu(i))
return i;
 
if (available_idle_cpu(i)) {
-   struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
/*
@@ -6129,8 +6144,18 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
for_each_cpu_wrap(cpu, cpus, target) {
if (!--nr)
return -1;
-   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-   break;
+
+   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+#ifdef CONFIG_SCHED_CORE
+   /*
+* If Core Scheduling is enabled, select this cpu
+* only if the process cookie matches core cookie.
+*/
+   if (sched_core_enabled(cpu_rq(cpu)) &&
+   p->core_cookie == cpu_rq(cpu)->core->core_cookie)
+#endif
+   break;
+   }
}
 
time = cpu_clock(this) - time;
@@ -7530,8 +7555,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 * We do not migrate tasks that are:
 * 1) throttled_lb_pair, or
 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-* 3) running (obviously), or
-* 4) are cache-hot on their current CPU.
+* 3) task's cookie does not match with this CPU's core cookie
+* 4) running (obviously), or
+* 5) are cache-hot on their current CPU.
 */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
@@ -7566,6 +7592,15 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
return 0;
}
 
+#ifdef CONFIG_SCHED_CORE
+   /*
+* Don't migrate task if the task's cookie does not match
+* with the destination CPU's core cookie.
+*/
+   if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+   return 0;
+#endif
+
/* Record that we found atleast one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;
 
@@ -8792,6 +8827,25 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p, int this_cpu)
p->cpus_ptr))
continue;
 
+#ifdef CONFIG_SCHED_CORE
+   if (sched_core_enabled(cpu_rq(this_cpu))) {
+  

[PATCH -tip 12/32] sched: Simplify the core pick loop for optimized case

2020-11-17 Thread Joel Fernandes (Google)
The core pick loop grew a lot of warts over time to support
optimizations. Turns out that directly doing a class pick before
entering the core-wide pick is better for readability. Make the changes.

Since this is a relatively new change, keep it as a separate patch so that
it is easier to revert in case anyone reports an issue with it. Testing
shows it to be working for me.

Reviewed-by: Vineeth Pillai 
Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 73 -
 1 file changed, 26 insertions(+), 47 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6aa76de55ef2..12e8e6627ab3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5180,6 +5180,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
put_prev_task_balance(rq, prev, rf);
 
smt_mask = cpu_smt_mask(cpu);
+   need_sync = !!rq->core->core_cookie;
+
+   /* reset state */
+   rq->core->core_cookie = 0UL;
+   if (rq->core->core_forceidle) {
+   need_sync = true;
+   fi_before = true;
+   rq->core->core_forceidle = false;
+   }
 
/*
 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
@@ -5192,16 +5201,25 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 * 'Fix' this by also increasing @task_seq for every pick.
 */
rq->core->core_task_seq++;
-   need_sync = !!rq->core->core_cookie;
 
-   /* reset state */
-reset:
-   rq->core->core_cookie = 0UL;
-   if (rq->core->core_forceidle) {
+   /*
+* Optimize for common case where this CPU has no cookies
+* and there are no cookied tasks running on siblings.
+*/
+   if (!need_sync) {
+   for_each_class(class) {
+   next = class->pick_task(rq);
+   if (next)
+   break;
+   }
+
+   if (!next->core_cookie) {
+   rq->core_pick = NULL;
+   goto done;
+   }
need_sync = true;
-   fi_before = true;
-   rq->core->core_forceidle = false;
}
+
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);
 
@@ -5239,38 +5257,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 * core.
 */
p = pick_task(rq_i, class, max);
-   if (!p) {
-   /*
-* If there weren't no cookies; we don't need to
-* bother with the other siblings.
-*/
-   if (i == cpu && !need_sync)
-   goto next_class;
-
+   if (!p)
continue;
-   }
-
-   /*
-* Optimize the 'normal' case where there aren't any
-* cookies and we don't need to sync up.
-*/
-   if (i == cpu && !need_sync) {
-   if (p->core_cookie) {
-   /*
-* This optimization is only valid as
-* long as there are no cookies
-* involved. We may have skipped
-* non-empty higher priority classes on
-* siblings, which are empty on this
-* CPU, so start over.
-*/
-   need_sync = true;
-   goto reset;
-   }
-
-   next = p;
-   goto done;
-   }
 
rq_i->core_pick = p;
 
@@ -5298,18 +5286,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
cpu_rq(j)->core_pick = NULL;
}
goto again;
-   } else {
-   /*
-* Once we select a task for a cpu, we
-* should not be doing an unconstrained
-* pick because it might starve a task
-* on a forced idle cpu.
-  

[PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

The rationale is as follows. In the core-wide pick logic, even if
need_sync == false, we need to go look at other CPUs (non-local CPUs) to
see if they could be running RT.

Say the RQs in a particular core look like this:
Let CFS1 and CFS2 be two tagged CFS tasks. Let RT1 be an untagged RT task.

rq0            rq1
CFS1 (tagged)  RT1 (not tagged)
CFS2 (tagged)

Say schedule() runs on rq0. Now, it will enter the above loop and
pick_task(RT) will return NULL for 'p'. It will enter the above if() block
and see that need_sync == false and will skip RT entirely.

The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
rq0     rq1
CFS1    IDLE

When it should have selected:
rq0     rq1
IDLE    RT

Joel saw this issue on real-world usecases in ChromeOS where an RT task
gets constantly force-idled and breaks RT. Let's cure it.

NOTE: This problem will be fixed differently in a later patch. It is just
  kept here for reference purposes about this issue, and to make
  applying later patches easier.

Reported-by: Joel Fernandes (Google) 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ee4902c2cf5..53af817740c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5195,6 +5195,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
need_sync = !!rq->core->core_cookie;
 
/* reset state */
+reset:
rq->core->core_cookie = 0UL;
if (rq->core->core_forceidle) {
need_sync = true;
@@ -5242,14 +5243,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
/*
 * If there weren't no cookies; we don't need to
 * bother with the other siblings.
-* If the rest of the core is not running a 
tagged
-* task, i.e.  need_sync == 0, and the current 
CPU
-* which called into the schedule() loop does 
not
-* have any tasks for this class, skip 
selecting for
-* other siblings since there's no point. We 
don't skip
-* for RT/DL because that could make CFS 
force-idle RT.
 */
-   if (i == cpu && !need_sync && class == 
_sched_class)
+   if (i == cpu && !need_sync)
goto next_class;
 
continue;
@@ -5259,7 +5254,20 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 * Optimize the 'normal' case where there aren't any
 * cookies and we don't need to sync up.
 */
-   if (i == cpu && !need_sync && !p->core_cookie) {
+   if (i == cpu && !need_sync) {
+   if (p->core_cookie) {
+   /*
+* This optimization is only valid as
+* long as there are no cookies
+* involved. We may have skipped
+* non-empty higher priority classes on
+* siblings, which are empty on this
+* CPU, so start over.
+*/
+   need_sync = true;
+   goto reset;
+   }
+
next = p;
goto done;
}
@@ -5299,7 +5307,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 */
need_sync = true;
}
-
}
}
 next_class:;
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 11/32] sched: Enqueue task into core queue only after vruntime is updated

2020-11-17 Thread Joel Fernandes (Google)
A waking task may have its vruntime adjusted. However, the code right
now puts it into the core queue without the adjustment. This means the
core queue may have a task with incorrect vruntime, potentially a very
long one. This may cause a task to get artificially boosted during
picking.

Fix it by enqueuing into the core queue only after the class-specific
enqueue callback has been called. This ensures that for CFS tasks, the
updated vruntime value is used when enqueuing the task into the core
rbtree.

Reviewed-by: Vineeth Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 53af817740c0..6aa76de55ef2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1753,9 +1753,6 @@ static inline void init_uclamp(void) { }
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int 
flags)
 {
-   if (sched_core_enabled(rq))
-   sched_core_enqueue(rq, p);
-
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
 
@@ -1766,6 +1763,9 @@ static inline void enqueue_task(struct rq *rq, struct 
task_struct *p, int flags)
 
uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
+
+   if (sched_core_enabled(rq))
+   sched_core_enqueue(rq, p);
 }
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int 
flags)
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 07/32] sched: Add core wide task selection and scheduling.

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which logical
CPU does the reschedule).

Tested-by: Julien Desfossez 
Reviewed-by: Joel Fernandes (Google) 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Aaron Lu 
Signed-off-by: Tim Chen 
Signed-off-by: Chen Yu 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 301 ++-
 kernel/sched/sched.h |   6 +-
 2 files changed, 305 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d521033777f..1bd0b0bbb040 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5029,7 +5029,7 @@ static void put_prev_task_balance(struct rq *rq, struct 
task_struct *prev,
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
const struct sched_class *class;
struct task_struct *p;
@@ -5070,6 +5070,294 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 }
 
 #ifdef CONFIG_SCHED_CORE
+static inline bool is_task_rq_idle(struct task_struct *t)
+{
+   return (task_rq(t)->idle == t);
+}
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+   return is_task_rq_idle(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+   if (is_task_rq_idle(a) || is_task_rq_idle(b))
+   return true;
+
+   return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct 
*max)
+{
+   struct task_struct *class_pick, *cookie_pick;
+   unsigned long cookie = rq->core->core_cookie;
+
+   class_pick = class->pick_task(rq);
+   if (!class_pick)
+   return NULL;
+
+   if (!cookie) {
+   /*
+* If class_pick is tagged, return it only if it has
+* higher priority than max.
+*/
+   if (max && class_pick->core_cookie &&
+   prio_less(class_pick, max))
+   return idle_sched_class.pick_task(rq);
+
+   return class_pick;
+   }
+
+   /*
+* If class_pick is idle or matches cookie, return early.
+*/
+   if (cookie_equals(class_pick, cookie))
+   return class_pick;
+
+   cookie_pick = sched_core_find(rq, cookie);
+
+   /*
+* If class > max && class > cookie, it is the highest priority task on
+* the core (so far) and it must be selected, otherwise we must go with
+* the cookie pick in order to satisfy the constraint.
+*/
+   if (prio_less(cookie_pick, class_pick) &&
+   (!max || prio_less(max, class_pick)))
+   return class_pick;
+
+   return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+   struct task_struct *next, *max = NULL;
+   const struct sched_class *class;
+   const struct cpumask *smt_mask;
+   bool need_sync;
+   int i, j, cpu;
+
+   if (!sched_core_enabled(rq))
+   return __pick_next_task(rq, prev, rf);
+
+   cpu = cpu_of(rq);
+
+   /* Stopper task is switching into idle, no need core-wide selection. */
+   if (cpu_is_offline(cpu)) {
+   /*
+* Reset core_pick so that we don't enter the fastpath when
+* coming online. core_pick would already be migrated to
+* another cpu during offline.
+*/
+   rq->core_pick = NULL;
+   return __pick_next_task(rq, prev, rf);
+   }
+
+   /*
+* If there were no {en,de}queues since we picked (IOW, the task
+* pointers are all still valid), and we haven't scheduled the last
+* pick yet, do so now.
+*
+* rq->core_pick can be NULL if no selection was made for a CPU because
+* it was either offline or went offline during a sibling's core-wide
+* selection. In this case, do a core-wide selection.
+*/
+   if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+   rq->core->core_pick_seq != rq->core_sched_seq &&
+   rq->
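
The pick_task() helper above is the heart of the constraint: with a core-wide cookie in place, a sibling may only run its class pick if the cookie matches (or the pick outranks the current max); otherwise it falls back to the best cookie-matching task, possibly idle. A toy model of that decision in plain C (invented types; prio here is a single integer where higher is better, unlike the kernel's cross-class prio_less()):

#include <stdio.h>
#include <stddef.h>

struct task {
    const char *name;
    unsigned long cookie;
    int prio;       /* higher is better in this toy model */
};

static const struct task idle_task = { "idle", 0, -1 };

static const struct task *pick_task(const struct task *class_pick,
                                    const struct task *cookie_pick,
                                    const struct task *max,
                                    unsigned long core_cookie)
{
    if (!class_pick)
        return NULL;

    if (!core_cookie) {
        /* No constraint yet: a tagged pick must still beat max. */
        if (max && class_pick->cookie && class_pick->prio < max->prio)
            return &idle_task;
        return class_pick;
    }

    if (class_pick->cookie == core_cookie)
        return class_pick;

    /* class_pick violates the cookie: take it only if it outranks both. */
    if (cookie_pick->prio < class_pick->prio &&
        (!max || max->prio < class_pick->prio))
        return class_pick;

    return cookie_pick;
}

int main(void)
{
    struct task max      = { "db-worker", 3, 20 }; /* already picked, set cookie 3 */
    struct task attacker = { "attacker",  7, 10 }; /* higher prio than helper, wrong cookie */
    struct task helper   = { "db-helper", 3,  2 }; /* best cookie-matching task */

    /* Prints "db-helper": the mismatching task loses because it does not
     * outrank the current max, so the cookie constraint is honoured. */
    printf("%s\n", pick_task(&attacker, &helper, &max, 3)->name);
    return 0;
}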

[PATCH -tip 02/32] sched: Introduce sched_class::pick_task()

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance) it is not state invariant. This makes it unsuitable
for remote task selection.

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/deadline.c  | 16 ++--
 kernel/sched/fair.c  | 32 +++-
 kernel/sched/idle.c  |  8 
 kernel/sched/rt.c| 15 +--
 kernel/sched/sched.h |  3 +++
 kernel/sched/stop_task.c | 14 --
 6 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0f2ea0a3664c..abfc8b505d0d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1867,7 +1867,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct 
rq *rq,
return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *pick_next_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
struct sched_dl_entity *dl_se;
    struct dl_rq *dl_rq = &rq->dl;
@@ -1879,7 +1879,18 @@ static struct task_struct *pick_next_task_dl(struct rq 
*rq)
dl_se = pick_next_dl_entity(rq, dl_rq);
BUG_ON(!dl_se);
p = dl_task_of(dl_se);
-   set_next_task_dl(rq, p, true);
+
+   return p;
+}
+
+static struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+   struct task_struct *p;
+
+   p = pick_task_dl(rq);
+   if (p)
+   set_next_task_dl(rq, p, true);
+
return p;
 }
 
@@ -2551,6 +2562,7 @@ DEFINE_SCHED_CLASS(dl) = {
 
 #ifdef CONFIG_SMP
.balance= balance_dl,
+   .pick_task  = pick_task_dl,
.select_task_rq = select_task_rq_dl,
.migrate_task_rq= migrate_task_rq_dl,
.set_cpus_allowed   = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52ddfec7cea6..12cf068eeec8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4459,7 +4459,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *curr)
 * Avoid running the skip buddy, if running something else can
 * be done without getting too unfair.
 */
-   if (cfs_rq->skip == se) {
+   if (cfs_rq->skip && cfs_rq->skip == se) {
struct sched_entity *second;
 
if (se == curr) {
@@ -7017,6 +7017,35 @@ static void check_preempt_wakeup(struct rq *rq, struct 
task_struct *p, int wake_
set_last_buddy(se);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_fair(struct rq *rq)
+{
+   struct cfs_rq *cfs_rq = &rq->cfs;
+   struct sched_entity *se;
+
+   if (!cfs_rq->nr_running)
+   return NULL;
+
+   do {
+   struct sched_entity *curr = cfs_rq->curr;
+
+   se = pick_next_entity(cfs_rq, NULL);
+
+   if (curr) {
+   if (se && curr->on_rq)
+   update_curr(cfs_rq);
+
+   if (!se || entity_before(curr, se))
+   se = curr;
+   }
+
+   cfs_rq = group_cfs_rq(se);
+   } while (cfs_rq);
+
+   return task_of(se);
+}
+#endif
+
 struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags 
*rf)
 {
@@ -11219,6 +11248,7 @@ DEFINE_SCHED_CLASS(fair) = {
 
 #ifdef CONFIG_SMP
.balance= balance_fair,
+   .pick_task  = pick_task_fair,
.select_task_rq = select_task_rq_fair,
.migrate_task_rq= migrate_task_rq_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 50e128b899c4..33864193a2f9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -406,6 +406,13 @@ static void set_next_task_idle(struct rq *rq, struct 
task_struct *next, bool fir
schedstat_inc(rq->sched_goidle);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_idle(struct rq *rq)
+{
+   return rq->idle;
+}
+#endif
+
 struct task_struct *pick_next_task_idle(struct rq *rq)
 {
struct task_struct *next = rq->idle;
@@ -473,6 +480,7 @@ DEFINE_SCHED_CLASS(idle) = {
 
 #ifdef CONFIG_SMP
.balance= balance_idle,
+   .pick_task  = pick_task_idle,
.select_task_rq = select_task_rq_idle,
.set_cpus_allowed   = set_cpus_allowed_common,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a6f9d132c24f..a0e245b0c4bd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1626,7 +1626,7 @@ static struct task_struct *_pick_next_task_rt(struct rq 
*rq)
return rt_task_of(rt_se);
 }
 
-static struct task_struct *pick_next_
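
The split this patch makes per class can be summarized outside the kernel: pick_task() is a pure query, safe to run against a remote runqueue, while pick_next_task() is pick_task() plus the state changes that commit the choice. A sketch under that reading (plain C, invented rq/task types):

#include <stdio.h>
#include <stddef.h>

struct task { const char *name; };

struct rq {
    struct task *queued;    /* toy: at most one runnable task */
    struct task *curr;      /* set_next_task() side effect lives here */
};

/* State-invariant: may be called for any runqueue, changes nothing. */
static struct task *pick_task(struct rq *rq)
{
    return rq->queued;
}

static void set_next_task(struct rq *rq, struct task *p)
{
    rq->curr = p;           /* toy stand-in for the real bookkeeping */
}

/* Only the local path commits the pick. */
static struct task *pick_next_task(struct rq *rq)
{
    struct task *p = pick_task(rq);

    if (p)
        set_next_task(rq, p);
    return p;
}

int main(void)
{
    struct task t = { "worker" };
    struct rq local = { .queued = &t }, sibling = { .queued = &t };

    pick_task(&sibling);                /* peek remotely: no side effects */
    printf("%s\n", pick_next_task(&local)->name);
    printf("sibling curr still unset: %s\n",
           sibling.curr ? sibling.curr->name : "(null)");
    return 0;
}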

[PATCH -tip 06/32] sched: Basic tracking of matching tasks

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling will only allow matching
tasks to be on the core; idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled)
these tasks are indexed in a second RB-tree, first on cookie value
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Reviewed-by: Joel Fernandes (Google) 
Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   8 ++-
 kernel/sched/core.c   | 146 ++
 kernel/sched/fair.c   |  46 -
 kernel/sched/sched.h  |  55 
 4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7abbdd7f3884..344499ab29f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -683,10 +683,16 @@ struct task_struct {
const struct sched_class*sched_class;
struct sched_entity se;
struct sched_rt_entity  rt;
+   struct sched_dl_entity  dl;
+
+#ifdef CONFIG_SCHED_CORE
+   struct rb_node  core_node;
+   unsigned long   core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
struct task_group   *sched_task_group;
 #endif
-   struct sched_dl_entity  dl;
 
 #ifdef CONFIG_UCLAMP_TASK
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d88bc9a6818..9d521033777f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -78,6 +78,141 @@ __read_mostly int scheduler_running;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+   if (p->sched_class == &stop_sched_class) /* trumps deadline */
+   return -2;
+
+   if (rt_prio(p->prio)) /* includes deadline */
+   return p->prio; /* [-1, 99] */
+
+   if (p->sched_class == &idle_sched_class)
+   return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+   return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+   int pa = __task_prio(a), pb = __task_prio(b);
+
+   if (-pa < -pb)
+   return true;
+
+   if (-pb < -pa)
+   return false;
+
+   if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+   return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+   if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+   u64 vruntime = b->se.vruntime;
+
+   /*
+* Normalize the vruntime if tasks are in different cpus.
+*/
+   if (task_cpu(a) != task_cpu(b)) {
+   vruntime -= task_cfs_rq(b)->min_vruntime;
+   vruntime += task_cfs_rq(a)->min_vruntime;
+   }
+
+   return !((s64)(a->se.vruntime - vruntime) <= 0);
+   }
+
+   return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct 
*b)
+{
+   if (a->core_cookie < b->core_cookie)
+   return true;
+
+   if (a->core_cookie > b->core_cookie)
+   return false;
+
+   /* flip prio, so high prio is leftmost */
+   if (prio_less(b, a))
+   return true;
+
+   return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+   struct rb_node *parent, **node;
+   struct task_struct *node_task;
+
+   rq->core->core_task_seq++;
+
+   if (!p->core_cookie)
+   return;
+
+   node = &rq->core_tree.rb_node;
+   parent = *node;
+
+   while (*node) {
+       node_task = container_of(*node, struct task_struct, core_node);
+       parent = *node;
+
+       if (__sched_core_less(p, node_task))
+           node = &parent->rb_left;
+       else
+           node = &parent->rb_right;
+   }
+
+   rb_link_node(&p->core_node, parent, node);
+   rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+   rq->core->core_task_seq++;
+
+   if (!p->core_cookie)
+   return;
+
+   rb_era
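
The two-level ordering above (cookie first, inverted priority second) is what lets the scheduler find the most eligible matching task with a single rbtree walk. A standalone comparator in the same shape (plain C; the toy struct and qsort harness are not kernel code, and priority here is one integer rather than the kernel's cross-class prio_less()):

#include <stdio.h>
#include <stdlib.h>

struct toy_task {
    unsigned long cookie;
    int prio;       /* higher is better */
};

/* Cookie ascending, then priority descending, so that within one cookie
 * group the leftmost (first) element is the most eligible task. */
static int core_less(const void *pa, const void *pb)
{
    const struct toy_task *a = pa, *b = pb;

    if (a->cookie != b->cookie)
        return a->cookie < b->cookie ? -1 : 1;
    return b->prio - a->prio;
}

int main(void)
{
    struct toy_task t[] = {
        { .cookie = 7, .prio = 1 },
        { .cookie = 3, .prio = 9 },
        { .cookie = 3, .prio = 4 },
        { .cookie = 7, .prio = 8 },
    };

    qsort(t, 4, sizeof(t[0]), core_less);
    for (int i = 0; i < 4; i++)
        printf("cookie=%lu prio=%d\n", t[i].cookie, t[i].prio);
    /* cookie 3 group first, each group led by its highest-priority task */
    return 0;
}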

[PATCH -tip 04/32] sched: Core-wide rq->lock

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Introduce the basic infrastructure to have a core wide rq->lock.

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/Kconfig.preempt |   5 ++
 kernel/sched/core.c| 108 +
 kernel/sched/sched.h   |  31 
 3 files changed, 144 insertions(+)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index bf82259cff96..6d8be4630bd6 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,3 +80,8 @@ config PREEMPT_COUNT
 config PREEMPTION
bool
select PREEMPT_COUNT
+
+config SCHED_CORE
+   bool "Core Scheduling for SMT"
+   default y
+   depends on SCHED_SMT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db5cc05a68bc..6d88bc9a6818 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,70 @@ unsigned int sysctl_sched_rt_period = 100;
 
 __read_mostly int scheduler_running;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ * spin_lock(rq_lockp(rq));
+ * ...
+ * spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+   bool enabled = !!(unsigned long)data;
+   int cpu;
+
+   for_each_possible_cpu(cpu)
+   cpu_rq(cpu)->core_enabled = enabled;
+
+   return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+   // XXX verify there are no cookie tasks (yet)
+
+   static_branch_enable(&__sched_core_enabled);
+   stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+   // XXX verify there are no cookie tasks (left)
+
+   stop_machine(__sched_core_stopper, (void *)false, NULL);
+   static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+   mutex_lock(&sched_core_mutex);
+   if (!sched_core_count++)
+       __sched_core_enable();
+   mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+   mutex_lock(&sched_core_mutex);
+   if (!--sched_core_count)
+       __sched_core_disable();
+   mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * part of the period that we allow rt tasks to run in us.
  * default: 0.95s
@@ -4859,6 +4923,42 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline void sched_core_cpu_starting(unsigned int cpu)
+{
+   const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+   struct rq *rq, *core_rq = NULL;
+   int i;
+
+   core_rq = cpu_rq(cpu)->core;
+
+   if (!core_rq) {
+   for_each_cpu(i, smt_mask) {
+   rq = cpu_rq(i);
+   if (rq->core && rq->core == rq)
+   core_rq = rq;
+   }
+
+   if (!core_rq)
+   core_rq = cpu_rq(cpu);
+
+   for_each_cpu(i, smt_mask) {
+   rq = cpu_rq(i);
+
+   WARN_ON_ONCE(rq->core && rq->core != core_rq);
+   rq->core = core_rq;
+   }
+   }
+
+   printk("core: %d -> %d\n", cpu, cpu_of(core_rq));
+}
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_cpu_starting(unsigned int cpu) {}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -7484,6 +7584,9 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+
+   sched_core_cpu_starting(cpu);
+
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -7747,6 +7850,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
hrtick_rq_init(rq);
    atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+   rq->core = NULL;
+   rq->core_enabled = 0;
+#endif
}
 
    set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5a0dd2b312aa..0dfccf988998 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1061,6 +1061,12 @@ struct rq {
 #endif
unsigned intpush_busy;
struct cpu_stop_workpush_work;
+
+#ifdef CONFIG_SCHED_CORE
+   /* per rq */
+   struct rq   *core;
+   unsigned intcore_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1099,11 +110
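
sched_core_get()/sched_core_put() above are an instance of a common pattern: a mutex-protected reference count where the 0->1 transition turns a feature on and the 1->0 transition turns it off. A userspace sketch of that pattern only (pthread mutex instead of a kernel mutex; feature_enable/disable stand in for the static-key and stop_machine work):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t core_mutex = PTHREAD_MUTEX_INITIALIZER;
static int core_count;

static void feature_enable(void)  { puts("feature enabled");  }
static void feature_disable(void) { puts("feature disabled"); }

void feature_get(void)
{
    pthread_mutex_lock(&core_mutex);
    if (!core_count++)          /* first user: 0 -> 1 */
        feature_enable();
    pthread_mutex_unlock(&core_mutex);
}

void feature_put(void)
{
    pthread_mutex_lock(&core_mutex);
    if (!--core_count)          /* last user: 1 -> 0 */
        feature_disable();
    pthread_mutex_unlock(&core_mutex);
}

int main(void)
{
    feature_get();  /* enables */
    feature_get();  /* already enabled, just counts */
    feature_put();
    feature_put();  /* disables */
    return 0;
}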

[PATCH -tip 05/32] sched/fair: Add a few assertions

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/fair.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 51483a00a755..ca35bfc0a368 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6245,6 +6245,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
task_util = uclamp_task_util(p);
}
 
+   /*
+* per-cpu select_idle_mask usage
+*/
+   lockdep_assert_irqs_disabled();
+
if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
asym_fits_capacity(task_util, target))
return target;
@@ -6710,8 +6715,6 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
@@ -6724,6 +6727,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int wake_flags)
/* SD_flags and WF_flags share the first nibble */
int sd_flag = wake_flags & 0xF;
 
+   /*
+* required for stable ->cpus_allowed
+*/
+   lockdep_assert_held(&p->pi_lock);
if (wake_flags & WF_TTWU) {
record_wakee(p);
 
-- 
2.29.2.299.gdc1121823c-goog



[PATCH -tip 01/32] sched: Wrap rq::lock access

2020-11-17 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

In preparation of playing games with rq->lock, abstract the thing
using an accessor.

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Vineeth Remanan Pillai 
Signed-off-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c |  68 -
 kernel/sched/cpuacct.c  |  12 ++---
 kernel/sched/deadline.c |  22 
 kernel/sched/debug.c|   4 +-
 kernel/sched/fair.c |  38 +++---
 kernel/sched/idle.c |   4 +-
 kernel/sched/pelt.h |   2 +-
 kernel/sched/rt.c   |  16 +++---
 kernel/sched/sched.h| 108 +---
 kernel/sched/topology.c |   4 +-
 10 files changed, 141 insertions(+), 137 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a6aaf9fb3400..db5cc05a68bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,12 +186,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
 
for (;;) {
rq = task_rq(p);
-       raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
rq_pin_lock(rq, rf);
return rq;
}
-       raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
 
while (unlikely(task_on_rq_migrating(p)))
cpu_relax();
@@ -210,7 +210,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
for (;;) {
        raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
        rq = task_rq(p);
-       raw_spin_lock(&rq->lock);
+       raw_spin_lock(rq_lockp(rq));
/*
 *  move_queued_task()  task_rq_lock()
 *
@@ -232,7 +232,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct 
rq_flags *rf)
rq_pin_lock(rq, rf);
return rq;
}
-       raw_spin_unlock(&rq->lock);
+       raw_spin_unlock(rq_lockp(rq));
        raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
while (unlikely(task_on_rq_migrating(p)))
@@ -302,7 +302,7 @@ void update_rq_clock(struct rq *rq)
 {
s64 delta;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
if (rq->clock_update_flags & RQCF_ACT_SKIP)
return;
@@ -611,7 +611,7 @@ void resched_curr(struct rq *rq)
struct task_struct *curr = rq->curr;
int cpu;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
if (test_tsk_need_resched(curr))
return;
@@ -635,10 +635,10 @@ void resched_cpu(int cpu)
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
 
-   raw_spin_lock_irqsave(&rq->lock, flags);
+   raw_spin_lock_irqsave(rq_lockp(rq), flags);
    if (cpu_online(cpu) || cpu == smp_processor_id())
        resched_curr(rq);
-   raw_spin_unlock_irqrestore(&rq->lock, flags);
+   raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -1137,7 +1137,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct 
task_struct *p,
struct uclamp_se *uc_se = >uclamp[clamp_id];
struct uclamp_bucket *bucket;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
/* Update task effective clamp */
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -1177,7 +1177,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct 
task_struct *p,
unsigned int bkt_clamp;
unsigned int rq_clamp;
 
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
/*
 * If sched_uclamp_used was enabled after task @p was enqueued,
@@ -1807,7 +1807,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, 
int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
   struct task_struct *p, int new_cpu)
 {
-   lockdep_assert_held(&rq->lock);
+   lockdep_assert_held(rq_lockp(rq));
 
deactivate_task(rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, new_cpu);
@@ -1973,7 +1973,7 @@ int push_cpu_stop(void *arg)
struct task_struct *p = arg;
 
    raw_spin_lock_irq(&p->pi_lock);
-   raw_spin_lock(&rq->lock);
+   raw_spin_lock(rq_lockp(rq));
 
if (task_rq(p) != rq)
goto out_unlock;
@@ -2003,7 +2003,7 @@ int push_cpu_stop(void *arg)
 
 out_unlock:
rq->push_busy = false;
-   raw_spin_unlock(&rq->lock);
+   raw_spin_unlock(rq_lockp(rq));
    raw_spin_unlock_irq(&p->pi_lock);
 
put_task_struct(p);
@@ -2056,7 +2056,7 @@ __do_set_cpus_allowed(struct task_struct *p,
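
The point of routing every rq->lock access through rq_lockp() is that the accessor can later hand back a different lock without touching any call site; with core scheduling on, all SMT siblings of a core can then share one lock. A userspace sketch of that indirection (pthread mutexes, invented names; the real rq_lockp() checks the sched_core static key rather than a global bool):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_rq {
    pthread_mutex_t lock;       /* per-CPU lock */
    struct toy_rq *core;        /* leader rq of this core */
};

static bool core_sched_enabled;

/* All call sites take rq_lockp(rq); only this helper knows which lock. */
static pthread_mutex_t *rq_lockp(struct toy_rq *rq)
{
    if (core_sched_enabled && rq->core)
        return &rq->core->lock;
    return &rq->lock;
}

int main(void)
{
    struct toy_rq cpu0 = { .lock = PTHREAD_MUTEX_INITIALIZER };
    struct toy_rq cpu1 = { .lock = PTHREAD_MUTEX_INITIALIZER, .core = &cpu0 };

    cpu0.core = &cpu0;

    core_sched_enabled = true;
    pthread_mutex_lock(rq_lockp(&cpu1));    /* actually takes cpu0's lock */
    printf("sibling serialized on the core-wide lock: %d\n",
           rq_lockp(&cpu1) == &cpu0.lock);
    pthread_mutex_unlock(rq_lockp(&cpu1));
    return 0;
}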

[PATCH -tip 00/32] Core scheduling (v9)

2020-11-17 Thread Joel Fernandes (Google)
during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
=
- Fixes for couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for process in different cpus
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

Future work
===
- Load balancing/Migration fixes for core scheduling.
  With v6, Load balancing is partially coresched aware, but has some
  issues w.r.t process/taskgroup weights:
  https://lwn.net/ml/linux-kernel/20200225034438.GA617271@z...

Aubrey Li (1):
sched: migration changes for core scheduling

Joel Fernandes (Google) (16):
sched/fair: Snapshot the min_vruntime of CPUs on force idle
sched: Enqueue task into core queue only after vruntime is updated
sched: Simplify the core pick loop for optimized case
sched: Improve snapshotting of min_vruntime for CGroups
arch/x86: Add a new TIF flag for untrusted tasks
kernel/entry: Add support for core-wide protection of kernel-mode
entry/idle: Enter and exit kernel protection during idle entry and
exit
sched: Split the cookie and setup per-task cookie on fork
sched: Add a per-thread core scheduling interface
sched: Release references to the per-task cookie on exit
sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG
kselftest: Add tests for core-sched interface
sched: Move core-scheduler interfacing code to a new file
Documentation: Add core scheduling documentation
sched: Add a coresched command line option
sched: Debug bits...

Josh Don (2):
sched: Refactor core cookie into struct
sched: Add a second-level tag for nested CGroup usecase

Peter Zijlstra (11):
sched: Wrap rq::lock access
sched: Introduce sched_class::pick_task()
sched/fair: Fix pick_task_fair crashes due to empty rbtree
sched: Core-wide rq->lock
sched/fair: Add a few assertions
sched: Basic tracking of matching tasks
sched: Add core wide task selection and scheduling.
sched: Fix priority inversion of cookied task with sibling
sched: Trivial forced-newidle balancer
irq_work: Cleanup
sched: CGroup tagging interface for core scheduling

Vineeth Pillai (2):
sched/fair: Fix forced idle sibling starvation corner case
entry/kvm: Protect the kernel when entering from guest

.../admin-guide/hw-vuln/core-scheduling.rst   |  330 +
Documentation/admin-guide/hw-vuln/index.rst   |1 +
.../admin-guide/kernel-parameters.txt |   25 +
arch/x86/include/asm/thread_info.h|2 +
arch/x86/kernel/cpu/bugs.c|   19 +
arch/x86/kvm/x86.c|2 +
drivers/gpu/drm/i915/i915_request.c   |4 +-
include/linux/cpu.h   |1 +
include/linux/entry-common.h  |   30 +-
include/linux/entry-kvm.h |   12 +
include/linux/irq_work.h  |   33 +-
include/linux/irqflags.h  |4 +-
include/linux/sched.h |   28 +-
include/linux/sched/smt.h |4 +
include/uapi/linux/prctl.h|3 +
kernel/Kconfig.preempt|5 +
kernel/bpf/stackmap.c |2 +-
kernel/cpu.c  |   43 +
kernel/entry/common.c |   28 +-
kernel/entry/kvm.c|   33 +
kernel/fork.c |1 +
kernel/irq_work.c |   18 +-
kernel/printk/printk.c|6 +-
kernel/rcu/tree.c |3 +-
kernel/sched/Makefile |1 +
kernel/sched/core.c   | 1278 +++--
kernel/sched/coretag.c|  819 +++
kernel/sched/cpuacct.c|   12 +-
kernel/sched/deadline.c   |   38 +-
kernel/sched/debug.c  |   12 +-
kernel/sched/fair.c   |  313 +++-
kernel/sched/idle.c   |   24 +-
kernel/sched/pelt.h   |2 +-
kernel/sched/rt.c |   31 +-
kernel/sched/sched.h  |  315 +++-
kernel/sched/stop_task.c  |   14 +-
kernel/sched/topology.c   |4 +-
kernel/sys.c  |3 +
kernel/time/tick-sched.c  |6 +-
kernel/trace/bpf_trace.c  |2 +-
tools/include/uapi/linux/prctl.h  |3 +
tools/testing/selftests/sched/.gitignore  |1 +
tools/testing/selftests/sched/Makefile|   14 +
tools/testing/selftests/sched/config  |1 +
.../testing/se

[PATCH rcu-dev] rcu/trace: Add tracing for how segcb list changes

2020-11-14 Thread Joel Fernandes (Google)
Track how the segcb list changes before/after acceleration, during
queuing and during dequeuing.

This has proved useful to discover an optimization to avoid unwanted GP
requests when there are no callbacks accelerated. The overhead is minimal as
each segment's length is now stored in the respective segment.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Joel Fernandes (Google) 

---
 include/trace/events/rcu.h | 26 ++
 kernel/rcu/tree.c  |  9 +
 2 files changed, 35 insertions(+)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 155b5cb43cfd..5fc29400e1a2 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -505,6 +505,32 @@ TRACE_EVENT_RCU(rcu_callback,
  __entry->qlen)
 );
 
+TRACE_EVENT_RCU(rcu_segcb_stats,
+
+   TP_PROTO(struct rcu_segcblist *rs, const char *ctx),
+
+   TP_ARGS(rs, ctx),
+
+   TP_STRUCT__entry(
+   __field(const char *, ctx)
+   __array(unsigned long, gp_seq, RCU_CBLIST_NSEGS)
+   __array(long, seglen, RCU_CBLIST_NSEGS)
+   ),
+
+   TP_fast_assign(
+   __entry->ctx = ctx;
+   memcpy(__entry->seglen, rs->seglen, RCU_CBLIST_NSEGS * 
sizeof(long));
+   memcpy(__entry->gp_seq, rs->gp_seq, RCU_CBLIST_NSEGS * 
sizeof(unsigned long));
+
+   ),
+
+   TP_printk("%s seglen: (DONE=%ld, WAIT=%ld, NEXT_READY=%ld, 
NEXT=%ld) "
+ "gp_seq: (DONE=%lu, WAIT=%lu, NEXT_READY=%lu, 
NEXT=%lu)", __entry->ctx,
+ __entry->seglen[0], __entry->seglen[1], 
__entry->seglen[2], __entry->seglen[3],
+ __entry->gp_seq[0], __entry->gp_seq[1], 
__entry->gp_seq[2], __entry->gp_seq[3])
+
+);
+
 /*
  * Tracepoint for the registration of a single RCU callback of the special
  * kvfree() form.  The first argument is the RCU type, the second argument
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 413831b48648..b96d26d0d44a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1497,6 +1497,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, 
struct rcu_data *rdp)
if (!rcu_segcblist_pend_cbs(>cblist))
return false;
 
+   trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCbPreAcc"));
+
/*
 * Callbacks are often registered with incomplete grace-period
 * information.  Something about the fact that getting exact
@@ -1517,6 +1519,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, 
struct rcu_data *rdp)
else
trace_rcu_grace_period(rcu_state.name, gp_seq_req, 
TPS("AccReadyCB"));
 
+   trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCbPostAcc"));
+
return ret;
 }
 
@@ -2473,11 +2477,14 @@ static void rcu_do_batch(struct rcu_data *rdp)
    rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl);
    if (offloaded)
        rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist);
+
+   trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCbDequeued"));
rcu_nocb_unlock_irqrestore(rdp, flags);
 
/* Invoke callbacks. */
tick_dep_set_task(current, TICK_DEP_BIT_RCU);
    rhp = rcu_cblist_dequeue(&rcl);
+
    for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
rcu_callback_t f;
 
@@ -2989,6 +2996,8 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
        trace_rcu_callback(rcu_state.name, head,
                   rcu_segcblist_n_cbs(&rdp->cblist));
 
+   trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCBQueued"));
+
    /* Go handle any RCU core processing required. */
    if (unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
__call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */
-- 
2.29.2.299.gdc1121823c-goog



[RFC 1/2] x86/bugs: Disable coresched on hardware that does not need it

2020-11-11 Thread Joel Fernandes (Google)
Some hardware such as certain AMD variants don't have cross-HT MDS/L1TF
issues. Detect this and don't enable core scheduling, as it can
needlessly slow the device down.

Signed-off-by: Joel Fernandes (Google) 
---
 arch/x86/kernel/cpu/bugs.c | 8 
 kernel/sched/core.c| 7 +++
 kernel/sched/sched.h   | 5 +
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index dece79e4d1e9..0e6e61e49b23 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -152,6 +152,14 @@ void __init check_bugs(void)
 #endif
 }
 
+/*
+ * Do not need core scheduling if CPU does not have MDS/L1TF vulnerability.
+ */
+int arch_allow_core_sched(void)
+{
+   return boot_cpu_has_bug(X86_BUG_MDS) || boot_cpu_has_bug(X86_BUG_L1TF);
+}
+
 void
 x86_virt_spec_ctrl(u64 guest_spec_ctrl, u64 guest_virt_spec_ctrl, bool 
setguest)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 64c559192634..c6158b4959fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -319,6 +319,13 @@ static void __sched_core_enable(void)
for_each_online_cpu(cpu)
BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
+   /*
+* Some architectures may not want coresched. (ex, AMD does not have
+* MDS/L1TF issues so it wants SMT completely on).
+*/
+   if (!arch_allow_core_sched())
+   return;
+
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3cf08c77b678..a1b39764a6ed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1203,6 +1203,11 @@ int cpu_core_tag_color_write_u64(struct 
cgroup_subsys_state *css,
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
+int __weak arch_allow_core_sched(void)
+{
+   return true;
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enqueued(struct task_struct *task) { return 
false; }
-- 
2.29.2.222.g5d2a92d10f8-goog
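
arch_allow_core_sched() above relies on weak symbol overriding: the scheduler supplies a default that says "yes", and an architecture that knows better provides a strong definition. A minimal userspace illustration of just that mechanism (GCC/Clang __attribute__((weak)); the function name below is invented):

#include <stdio.h>

/* Default: every platform is assumed to need the mitigation. */
__attribute__((weak)) int platform_needs_mitigation(void)
{
    return 1;
}

/* An "arch file" that knows the hardware is unaffected would provide a
 * strong definition like this one, normally in a separate object file:
 *
 *   int platform_needs_mitigation(void) { return 0; }
 */

int main(void)
{
    printf("mitigation %s\n",
           platform_needs_mitigation() ? "enabled" : "not needed");
    return 0;
}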



[RFC 2/2] sched/debug: Add debug information about whether coresched is enabled

2020-11-11 Thread Joel Fernandes (Google)
It is useful to see whether coresched is enabled or not, especially on
devices that don't need it. Add this information to
/proc/sched_debug.

Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/debug.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 88bf45267672..935b68be18cd 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -782,6 +782,10 @@ static void sched_debug_header(struct seq_file *m)
"sysctl_sched_tunable_scaling",
sysctl_sched_tunable_scaling,
sched_tunable_scaling_names[sysctl_sched_tunable_scaling]);
+#ifdef CONFIG_SCHED_CORE
+   SEQ_printf(m, "  .%-40s: %d\n", "core_sched_enabled",
+  !!static_branch_likely(&__sched_core_enabled));
+#endif
SEQ_printf(m, "\n");
 }
 
-- 
2.29.2.222.g5d2a92d10f8-goog



[RFC 0/2] Do not slow down some AMD devices with coresched

2020-11-11 Thread Joel Fernandes (Google)
Hi,
Here are 2 patches I am thinking of sending with the next coresched series. Let
me know what you think. These avoid slowing down certain AMD devices that don't
have the vulnerabilities coresched mitigates.

1/2 - keep coresched disabled if device doesn't have vulnerability.
2/2 - add debug information.

In the future, if needed we could add more options to make it possible to
force-enable coresched. But right now I don't see a need for that, till a
usecase arises.

Joel Fernandes (Google) (2):
x86/bugs: Disable coresched on hardware that does not need it
sched/debug: Add debug information about whether coresched is enabled

arch/x86/kernel/cpu/bugs.c | 8 
kernel/sched/core.c| 7 +++
kernel/sched/debug.c   | 4 
kernel/sched/sched.h   | 5 +
4 files changed, 24 insertions(+)

--
2.29.2.222.g5d2a92d10f8-goog



[PATCH v9 2/7] rcu/segcblist: Add counters to segcblist datastructure

2020-11-03 Thread Joel Fernandes (Google)
Add counting of segment lengths of segmented callback list.

This will be useful for a number of things such as knowing how big the
ready-to-execute segment has gotten. The immediate benefit is the ability
to trace how the callbacks in the segmented callback list change.

This patch also removes hacks that used the donecbs ->len field as a
temporary variable to save the segmented callback list's length. That can
no longer be done and is not needed.

Reviewed-by: Frederic Weisbecker 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/rcu_segcblist.h |   1 +
 kernel/rcu/rcu_segcblist.c| 120 ++
 kernel/rcu/rcu_segcblist.h|   2 -
 3 files changed, 79 insertions(+), 44 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index b36afe7b22c9..6c01f09a6456 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -72,6 +72,7 @@ struct rcu_segcblist {
 #else
long len;
 #endif
+   long seglen[RCU_CBLIST_NSEGS];
u8 enabled;
u8 offloaded;
 };
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index bb246d8c6ef1..357c19bbcb00 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -7,10 +7,11 @@
  * Authors: Paul E. McKenney 
  */
 
-#include 
-#include 
+#include 
 #include 
+#include 
 #include 
+#include 
 
 #include "rcu_segcblist.h"
 
@@ -88,6 +89,46 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
 #endif
 }
 
+/* Get the length of a segment of the rcu_segcblist structure. */
+static long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg)
+{
+   return READ_ONCE(rsclp->seglen[seg]);
+}
+
+/* Set the length of a segment of the rcu_segcblist structure. */
+static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
+{
+   WRITE_ONCE(rsclp->seglen[seg], v);
+}
+
+/* Add to the length of a segment of the rcu_segcblist structure. */
+static void rcu_segcblist_add_seglen(struct rcu_segcblist *rsclp, int seg, long v)
+{
+   WRITE_ONCE(rsclp->seglen[seg], rsclp->seglen[seg] + v);
+}
+
+/* Move from's segment length to to's segment. */
+static void rcu_segcblist_move_seglen(struct rcu_segcblist *rsclp, int from, 
int to)
+{
+   long len;
+
+   if (from == to)
+   return;
+
+   len = rcu_segcblist_get_seglen(rsclp, from);
+   if (!len)
+   return;
+
+   rcu_segcblist_add_seglen(rsclp, to, len);
+   rcu_segcblist_set_seglen(rsclp, from, 0);
+}
+
+/* Increment segment's length. */
+static void rcu_segcblist_inc_seglen(struct rcu_segcblist *rsclp, int seg)
+{
+   rcu_segcblist_add_seglen(rsclp, seg, 1);
+}
+
 /*
  * Increase the numeric length of an rcu_segcblist structure by the
  * specified amount, which can be negative.  This can cause the ->len
@@ -119,26 +160,6 @@ void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp)
rcu_segcblist_add_len(rsclp, 1);
 }
 
-/*
- * Exchange the numeric length of the specified rcu_segcblist structure
- * with the specified value.  This can cause the ->len field to disagree
- * with the actual number of callbacks on the structure.  This exchange is
- * fully ordered with respect to the callers accesses both before and after.
- */
-static long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
-{
-#ifdef CONFIG_RCU_NOCB_CPU
-   return atomic_long_xchg(&rsclp->len, v);
-#else
-   long ret = rsclp->len;
-
-   smp_mb(); /* Up to the caller! */
-   WRITE_ONCE(rsclp->len, v);
-   smp_mb(); /* Up to the caller! */
-   return ret;
-#endif
-}
-
 /*
  * Initialize an rcu_segcblist structure.
  */
@@ -149,8 +170,10 @@ void rcu_segcblist_init(struct rcu_segcblist *rsclp)
BUILD_BUG_ON(RCU_NEXT_TAIL + 1 != ARRAY_SIZE(rsclp->gp_seq));
BUILD_BUG_ON(ARRAY_SIZE(rsclp->tails) != ARRAY_SIZE(rsclp->gp_seq));
rsclp->head = NULL;
-   for (i = 0; i < RCU_CBLIST_NSEGS; i++)
+   for (i = 0; i < RCU_CBLIST_NSEGS; i++) {
        rsclp->tails[i] = &rsclp->head;
+   rcu_segcblist_set_seglen(rsclp, i, 0);
+   }
rcu_segcblist_set_len(rsclp, 0);
rsclp->enabled = 1;
 }
@@ -246,6 +269,7 @@ void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
 {
rcu_segcblist_inc_len(rsclp);
smp_mb(); /* Ensure counts are updated before callback is enqueued. */
+   rcu_segcblist_inc_seglen(rsclp, RCU_NEXT_TAIL);
rhp->next = NULL;
WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp);
    WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], &rhp->next);
@@ -274,27 +298,13 @@ bool rcu_segcblist_entrain(struct rcu_segcblist *rsclp,
for (i = RCU_NEXT_TAIL; i > RCU_DONE_TAIL; i--)
if (rsclp->tails[i] != rsclp->tails[i - 1])
break;
+   rcu_segcblist_inc_seglen(rs
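
The bookkeeping added here is simple: every enqueue, dequeue or advance that moves callbacks between segments must also move the per-segment counts, so the sum of seglen[] always matches the total ->len. A toy version of the helpers (plain C, no RCU; names mirror the patch but this is not kernel code):

#include <stdio.h>

#define NSEGS 4     /* DONE, WAIT, NEXT_READY, NEXT */

struct toy_segcblist {
    long len;
    long seglen[NSEGS];
};

static void add_seglen(struct toy_segcblist *l, int seg, long v)
{
    l->seglen[seg] += v;
    l->len += v;
}

/* Moving callbacks between segments changes no total, only the split. */
static void move_seglen(struct toy_segcblist *l, int from, int to)
{
    long n = l->seglen[from];

    l->seglen[to] += n;
    l->seglen[from] = 0;
}

int main(void)
{
    struct toy_segcblist l = { 0 };

    add_seglen(&l, 3, 5);   /* five callbacks queued on NEXT */
    move_seglen(&l, 3, 1);  /* a grace period starts: NEXT -> WAIT */
    move_seglen(&l, 1, 0);  /* it ends: WAIT -> DONE */

    printf("total=%ld done=%ld\n", l.len, l.seglen[0]);   /* 5 and 5 */
    return 0;
}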

[PATCH v9 6/7] rcu/tree: segcblist: Remove redundant smp_mb()s

2020-11-03 Thread Joel Fernandes (Google)
This memory barrier is not needed, as rcu_segcblist_add_len() already
includes a memory barrier *before* and *after* the length of the list is
updated.

The same reasoning applies to rcu_segcblist_enqueue().

Reviewed-by: Frederic Weisbecker 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/rcu_segcblist.c | 1 -
 kernel/rcu/tree.c  | 1 -
 2 files changed, 2 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index e9e72d72f7a6..d96272e8d604 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -268,7 +268,6 @@ void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
   struct rcu_head *rhp)
 {
rcu_segcblist_inc_len(rsclp);
-   smp_mb(); /* Ensure counts are updated before callback is enqueued. */
rcu_segcblist_inc_seglen(rsclp, RCU_NEXT_TAIL);
rhp->next = NULL;
WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f6c6653b3ec2..fb2a5ac4a59c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2525,7 +2525,6 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
/* Update counts and requeue any remaining callbacks. */
    rcu_segcblist_insert_done_cbs(&rdp->cblist, &rcl);
-   smp_mb(); /* List handling before counting for rcu_barrier(). */
    rcu_segcblist_add_len(&rdp->cblist, -count);
 
/* Reinstate batch limit if we have worked down the excess. */
-- 
2.29.1.341.ge80a0c044ae-goog



[PATCH v9 0/7] Add support for length of each segment in the segcblist

2020-11-03 Thread Joel Fernandes (Google)
This is required for several identified use cases, one of them being tracing how
the segmented callback list changes. Tracing this has identified issues in RCU
code in the past.

From Paul:
Another use case is of course more accurately determining whether a given CPU's
large pile of callbacks can be best served by making grace periods go faster,
invoking callbacks more vigorously, or both.  It should also be possible to
simplify some of the callback handling a bit, given that some of the unnatural
acts are due to there having been no per-batch counts.

Revision history:
v9: Fix SRCU issues, other minor style changes (Frederic). Added Frederic's
Reviewed-by to all but the last patch..

v8: Small style changes, making the seglen as non-atomic since it is always
under lock (Frederic).

v7: Cleaned up memory barriers (thanks fweisbec@ for reviewing), made minor
corrections per Neeraj (thanks).

v6: Fixed TREE04, and restored older logic to ensure rcu_barrier works.

v5: Various changes, bug fixes. Discovery of rcu_barrier issue.

v4: Restructured rcu_do_batch() and segcblist merging to avoid issues.
Fixed minor nit from Davidlohr.
v1->v3: minor nits.
(https://lore.kernel.org/lkml/20200719034210.2382053-1-joel@xxxxx/)

Joel Fernandes (Google) (7):
rcu/tree: Make rcu_do_batch count how many callbacks were executed
rcu/segcblist: Add counters to segcblist datastructure
srcu: Fix invoke_rcu_callbacks() segcb length adjustment
rcu/trace: Add tracing for how segcb list changes
rcu/segcblist: Remove useless rcupdate.h include
rcu/tree: segcblist: Remove redundant smp_mb()s
rcu/segcblist: Add additional comments to explain smp_mb()

include/linux/rcu_segcblist.h |   1 +
include/trace/events/rcu.h|  25 +
kernel/rcu/rcu_segcblist.c| 198 +-
kernel/rcu/rcu_segcblist.h|   8 +-
kernel/rcu/srcutree.c |   5 +-
kernel/rcu/tree.c |  21 ++--
6 files changed, 199 insertions(+), 59 deletions(-)

--
2.29.1.341.ge80a0c044ae-goog



[PATCH v9 5/7] rcu/segcblist: Remove useless rcupdate.h include

2020-11-03 Thread Joel Fernandes (Google)
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/rcu_segcblist.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 2a03949d0b82..e9e72d72f7a6 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include "rcu_segcblist.h"
-- 
2.29.1.341.ge80a0c044ae-goog



[PATCH v9 3/7] srcu: Fix invoke_rcu_callbacks() segcb length adjustment

2020-11-03 Thread Joel Fernandes (Google)
With earlier patches, the negative counting of the unsegmented list
cannot be used to adjust the segmented one. To fix this, sample the
unsegmented length in advance, and use it after CB execution to adjust
the segmented list's length.

Reviewed-by: Frederic Weisbecker 
Suggested-by: Frederic Weisbecker 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/srcutree.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 0f23d20d485a..79b7081143a7 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1160,6 +1160,7 @@ static void srcu_advance_state(struct srcu_struct *ssp)
  */
 static void srcu_invoke_callbacks(struct work_struct *work)
 {
+   long len;
bool more;
struct rcu_cblist ready_cbs;
struct rcu_head *rhp;
@@ -1182,6 +1183,7 @@ static void srcu_invoke_callbacks(struct work_struct 
*work)
/* We are on the job!  Extract and invoke ready callbacks. */
sdp->srcu_cblist_invoking = true;
    rcu_segcblist_extract_done_cbs(&sdp->srcu_cblist, &ready_cbs);
+   len = ready_cbs.len;
spin_unlock_irq_rcu_node(sdp);
    rhp = rcu_cblist_dequeue(&ready_cbs);
    for (; rhp != NULL; rhp = rcu_cblist_dequeue(&ready_cbs)) {
@@ -1190,13 +1192,14 @@ static void srcu_invoke_callbacks(struct work_struct 
*work)
rhp->func(rhp);
local_bh_enable();
}
+   WARN_ON_ONCE(ready_cbs.len);
 
/*
 * Update counts, accelerate new callbacks, and if needed,
 * schedule another round of callback invocation.
 */
    spin_lock_irq_rcu_node(sdp);
-   rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
+   rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
    (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
                   rcu_seq_snap(&ssp->srcu_gp_seq));
sdp->srcu_cblist_invoking = false;
-- 
2.29.1.341.ge80a0c044ae-goog



[PATCH v9 1/7] rcu/tree: Make rcu_do_batch count how many callbacks were executed

2020-11-03 Thread Joel Fernandes (Google)
Currently, rcu_do_batch() depends on the unsegmented callback list's ->len field
to know how many CBs were executed. This field counts down from 0 as CBs are
dequeued. It is possible that not all CBs are run because a limit is reached,
in which case the remaining unexecuted callbacks are requeued onto the CPU's
segcblist.

The number of callbacks that were run is then the negative count left in
rcl->len, which has been counting down on every dequeue. This negative count is
added to the per-CPU segmented callback list's ->len to correct it.

Such a design works against future efforts to track the length of each segment
of the segmented callback list, because rcu_segcblist_extract_done_cbs() will be
populating the unsegmented callback list's length field (rcl->len) during
extraction.

Also, the design of counting down from 0 is confusing and error-prone IMHO.

This commit therefore explicitly counts how many callbacks were executed in
rcu_do_batch() itself, and uses that to update the per-CPU segcblist's ->len
field, without relying on the negativity of rcl->len.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/rcu_segcblist.c |  2 +-
 kernel/rcu/rcu_segcblist.h |  1 +
 kernel/rcu/tree.c  | 11 +--
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 2d2a6b6b9dfb..bb246d8c6ef1 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -95,7 +95,7 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
  * This increase is fully ordered with respect to the callers accesses
  * both before and after.
  */
-static void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
+void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
 {
 #ifdef CONFIG_RCU_NOCB_CPU
smp_mb__before_atomic(); /* Up to the caller! */
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 492262bcb591..1d2d61406463 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -76,6 +76,7 @@ static inline bool rcu_segcblist_restempty(struct 
rcu_segcblist *rsclp, int seg)
 }
 
 void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp);
+void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v);
 void rcu_segcblist_init(struct rcu_segcblist *rsclp);
 void rcu_segcblist_disable(struct rcu_segcblist *rsclp);
 void rcu_segcblist_offload(struct rcu_segcblist *rsclp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 286dc0a1b184..24c00020ab83 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2429,7 +2429,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
    const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
struct rcu_head *rhp;
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
-   long bl, count;
+   long bl, count = 0;
long pending, tlimit = 0;
 
/* If no callbacks are ready, just return. */
@@ -2474,6 +2474,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
    for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
rcu_callback_t f;
 
+   count++;
debug_rcu_head_unqueue(rhp);
 
rcu_lock_acquire(_callback_map);
@@ -2487,15 +2488,14 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
/*
 * Stop only if limit reached and CPU has something to do.
-* Note: The rcl structure counts down from zero.
 */
-   if (-rcl.len >= bl && !offloaded &&
+   if (count >= bl && !offloaded &&
(need_resched() ||
        (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
break;
if (unlikely(tlimit)) {
/* only call local_clock() every 32 callbacks */
-   if (likely((-rcl.len & 31) || local_clock() < tlimit))
+   if (likely((count & 31) || local_clock() < tlimit))
continue;
/* Exceeded the time limit, so leave. */
break;
@@ -2512,7 +2512,6 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
local_irq_save(flags);
rcu_nocb_lock(rdp);
-   count = -rcl.len;
rdp->n_cbs_invoked += count;
trace_rcu_batch_end(rcu_state.name, count, !!rcl.head, need_resched(),
is_idle_task(current), rcu_is_callbacks_kthread());
@@ -2520,7 +2519,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
/* Update counts and requeue any remaining callbacks. */
    rcu_segcblist_insert_done_cbs(&rdp->cblist, &rcl);
smp_mb(); /* List handling before counting for rcu_barrier(). */
-   rcu
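
The change is purely about which direction the bookkeeping runs. A small standalone comparison (plain C; batch() stands in for rcu_do_batch() and the numbers are made up) shows that both conventions yield the same result, but counting up reads more naturally than negating a scratch counter afterwards:

#include <stdio.h>

/* Old scheme: a scratch counter starts at 0 and is decremented per callback,
 * so "-scratch" is the number executed.  New scheme: count explicitly. */
static long batch(int ncbs, int limit, long *scratch_len)
{
    long count = 0;

    for (int i = 0; i < ncbs; i++) {
        (*scratch_len)--;   /* old: len goes 0, -1, -2, ... */
        count++;            /* new: just count what ran */
        if (count >= limit)
            break;
    }
    return count;
}

int main(void)
{
    long scratch = 0;
    long executed = batch(10, 4, &scratch);

    printf("executed=%ld, old-style negative count=%ld\n",
           executed, -scratch);    /* both print 4 */
    return 0;
}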

[PATCH v9 4/7] rcu/trace: Add tracing for how segcb list changes

2020-11-03 Thread Joel Fernandes (Google)
Track how the segcb list changes before/after acceleration, during
queuing and during dequeuing.

This has proved useful to discover an optimization to avoid unwanted GP
requests when there are no callbacks accelerated. The overhead is minimal as
each segment's length is now stored in the respective segment.

Reviewed-by: Frederic Weisbecker 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Joel Fernandes (Google) 
---
 include/trace/events/rcu.h | 25 +
 kernel/rcu/rcu_segcblist.c | 34 ++
 kernel/rcu/rcu_segcblist.h |  5 +
 kernel/rcu/tree.c  |  9 +
 4 files changed, 73 insertions(+)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 155b5cb43cfd..5f8f2ee1a936 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -505,6 +505,31 @@ TRACE_EVENT_RCU(rcu_callback,
  __entry->qlen)
 );
 
+TRACE_EVENT_RCU(rcu_segcb_stats,
+
+   TP_PROTO(const char *ctx, int *cb_count, unsigned long *gp_seq),
+
+   TP_ARGS(ctx, cb_count, gp_seq),
+
+   TP_STRUCT__entry(
+   __field(const char *, ctx)
+   __array(int, cb_count, RCU_CBLIST_NSEGS)
+   __array(unsigned long, gp_seq, RCU_CBLIST_NSEGS)
+   ),
+
+   TP_fast_assign(
+   __entry->ctx = ctx;
+   memcpy(__entry->cb_count, cb_count, RCU_CBLIST_NSEGS * 
sizeof(int));
+   memcpy(__entry->gp_seq, gp_seq, RCU_CBLIST_NSEGS * 
sizeof(unsigned long));
+   ),
+
+   TP_printk("%s cb_count: (DONE=%d, WAIT=%d, NEXT_READY=%d, 
NEXT=%d) "
+ "gp_seq: (DONE=%lu, WAIT=%lu, NEXT_READY=%lu, 
NEXT=%lu)", __entry->ctx,
+ __entry->cb_count[0], __entry->cb_count[1], 
__entry->cb_count[2], __entry->cb_count[3],
+ __entry->gp_seq[0], __entry->gp_seq[1], 
__entry->gp_seq[2], __entry->gp_seq[3])
+
+);
+
 /*
  * Tracepoint for the registration of a single RCU callback of the special
  * kvfree() form.  The first argument is the RCU type, the second argument
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 357c19bbcb00..2a03949d0b82 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -14,6 +14,7 @@
 #include 
 
 #include "rcu_segcblist.h"
+#include "rcu.h"
 
 /* Initialize simple callback list. */
 void rcu_cblist_init(struct rcu_cblist *rclp)
@@ -328,6 +329,39 @@ void rcu_segcblist_extract_done_cbs(struct rcu_segcblist 
*rsclp,
rcu_segcblist_set_seglen(rsclp, RCU_DONE_TAIL, 0);
 }
 
+/*
+ * Return how many CBs each segment has, along with their gp_seq values.
+ *
+ * This function is O(N) where N is the number of segments. Only used from
+ * tracing code which is usually disabled in production.
+ */
+#ifdef CONFIG_RCU_TRACE
+static void rcu_segcblist_countseq(struct rcu_segcblist *rsclp,
+int cbcount[RCU_CBLIST_NSEGS],
+unsigned long gpseq[RCU_CBLIST_NSEGS])
+{
+   int i;
+
+   for (i = 0; i < RCU_CBLIST_NSEGS; i++) {
+   cbcount[i] = rcu_segcblist_get_seglen(rsclp, i);
+   gpseq[i] = rsclp->gp_seq[i];
+   }
+}
+
+void __trace_rcu_segcb_stats(struct rcu_segcblist *rsclp, const char *context)
+{
+   int cbs[RCU_CBLIST_NSEGS];
+   unsigned long gps[RCU_CBLIST_NSEGS];
+
+   if (!trace_rcu_segcb_stats_enabled())
+   return;
+
+   rcu_segcblist_countseq(rsclp, cbs, gps);
+
+   trace_rcu_segcb_stats(context, cbs, gps);
+}
+#endif
+
 /*
  * Extract only those callbacks still pending (not yet ready to be
  * invoked) from the specified rcu_segcblist structure and place them in
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index cd35c9faaf51..7750734fa116 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -103,3 +103,8 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, 
unsigned long seq);
 bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq);
 void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 struct rcu_segcblist *src_rsclp);
+#ifdef CONFIG_RCU_TRACE
+void __trace_rcu_segcb_stats(struct rcu_segcblist *rsclp, const char *context);
+#else
+#define __trace_rcu_segcb_stats(...)
+#endif
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 24c00020ab83..f6c6653b3ec2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1497,6 +1497,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, 
struct rcu_data *rdp)
if (!rcu_segcblist_pend_cbs(>cblist))
return false;
 
+   __trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCbPreAcc"));
+
/*
 * Callbacks are often registered with incomplete
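
One detail worth calling out in __trace_rcu_segcb_stats() above: the O(N) snapshot of the counters happens only after checking trace_rcu_segcb_stats_enabled(), so the cost vanishes when the tracepoint is off. The same guard-then-snapshot shape in plain C (illustrative only; a real tracepoint check is a static branch, not a global bool):

#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define NSEGS 4

static bool tracing_enabled;            /* stand-in for the tracepoint's static key */
static long seglen[NSEGS] = { 1, 2, 3, 4 };

static void trace_segcb_stats(const char *ctx)
{
    long snap[NSEGS];

    if (!tracing_enabled)               /* cheap early exit when tracing is off */
        return;

    memcpy(snap, seglen, sizeof(snap)); /* O(N) work only when needed */
    printf("%s: DONE=%ld WAIT=%ld NEXT_READY=%ld NEXT=%ld\n",
           ctx, snap[0], snap[1], snap[2], snap[3]);
}

int main(void)
{
    trace_segcb_stats("SegCbPreAcc");   /* no output, no copy */
    tracing_enabled = true;
    trace_segcb_stats("SegCbPostAcc");  /* prints the snapshot */
    return 0;
}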

[PATCH v9 7/7] rcu/segcblist: Add additional comments to explain smp_mb()

2020-11-03 Thread Joel Fernandes (Google)
Memory barriers are needed when updating the full length of the
segcblist, however it is not entirely clear why one is needed both before
and after. This patch therefore adds additional comments to the function
header to explain it.

Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/rcu_segcblist.c | 40 ++
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index d96272e8d604..9b43d686b1f3 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -135,17 +135,49 @@ static void rcu_segcblist_inc_seglen(struct rcu_segcblist 
*rsclp, int seg)
  * field to disagree with the actual number of callbacks on the structure.
  * This increase is fully ordered with respect to the callers accesses
  * both before and after.
+ *
+ * About memory barriers:
+ * There is a situation where rcu_barrier() locklessly samples the full
+ * length of the segmented cblist before deciding what to do. That can
+ * race with another path that calls this function such as enqueue or dequeue.
+ * rcu_barrier() should not wrongly assume there are no callbacks, so any
+ * transitions from 1->0 and 0->1 have to be carefully ordered with respect to
+ * list modifications and with whatever follows the rcu_barrier().
+ *
+ * There are at least 2 cases:
+ * CASE 1: Memory barrier is needed before adding to length, for the case where
+ * v is negative (happens during dequeue). When length transitions from 1 -> 0,
+ * the write to 0 has to be ordered to appear to be *after* the memory accesses
+ * of the CBs that were dequeued and the segcblist modifications:
+ * To illustrate the problematic scenario to avoid:
+ * P0 (what P1 sees)   P1
+ * set len = 0
+ *  rcu_barrier sees len as 0
+ * dequeue from list
+ *  rcu_barrier does nothing.
+ *
+ * CASE 2: Memory barrier is needed after adding to length for the case
+ * where length transitions from 0 -> 1. This is because rcu_barrier()
+ * should never miss an update to the length. So the update to length
+ * has to be seen *before* any modifications to the segmented list. Otherwise a
+ * race can happen.
+ * To illustrate the problematic scenario to avoid:
+ * P0 (what P1 sees)   P1
+ * queue to list
+ *  rcu_barrier sees len as 0
+ * set len = 1.
+ *  rcu_barrier does nothing.
  */
 void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
 {
 #ifdef CONFIG_RCU_NOCB_CPU
-   smp_mb__before_atomic(); /* Up to the caller! */
+   smp_mb__before_atomic(); /* Read function's comments */
atomic_long_add(v, >len);
-   smp_mb__after_atomic(); /* Up to the caller! */
+   smp_mb__after_atomic();  /* Read function's comments */
 #else
-   smp_mb(); /* Up to the caller! */
+   smp_mb(); /* Read function's comments */
WRITE_ONCE(rsclp->len, rsclp->len + v);
-   smp_mb(); /* Up to the caller! */
+   smp_mb(); /* Read function's comments */
 #endif
 }
 
-- 
2.29.1.341.ge80a0c044ae-goog
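
The two cases in the new comment boil down to: the length update must be ordered after the list manipulation when counting down (CASE 1), and before any later list manipulation when counting up (CASE 2), so a lockless rcu_barrier() never sees a list that looks emptier than it is. A toy model of the helper with C11 fences in the same positions (illustrative only, not a litmus proof; in the kernel the fences are smp_mb()/smp_mb__before_atomic()/smp_mb__after_atomic()):

#include <stdatomic.h>
#include <stdio.h>

struct toy_list {
    _Atomic long len;
    void *head;         /* the "list modifications" the comment refers to */
};

static void toy_add_len(struct toy_list *l, long v)
{
    /* CASE 1: order the prior dequeue (head update) before a 1 -> 0
     * transition becomes visible. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_fetch_add_explicit(&l->len, v, memory_order_relaxed);
    /* CASE 2: make a 0 -> 1 transition visible before any later
     * list modification can be observed. */
    atomic_thread_fence(memory_order_seq_cst);
}

int main(void)
{
    struct toy_list l = { 0 };
    int cb = 1;

    /* enqueue: publish the length first (CASE 2), then modify the list */
    toy_add_len(&l, 1);
    l.head = &cb;

    /* dequeue: modify the list first, then drop the length (CASE 1) */
    l.head = NULL;
    toy_add_len(&l, -1);

    printf("len=%ld\n", atomic_load(&l.len));
    return 0;
}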



[PATCH v8 3/6] rcu/trace: Add tracing for how segcb list changes

2020-10-21 Thread Joel Fernandes (Google)
Track how the segcb list changes before/after acceleration, during
queuing and during dequeuing.

This has proved useful to discover an optimization to avoid unwanted GP
requests when there are no callbacks accelerated. The overhead is minimal as
each segment's length is now stored in the respective segment.

Signed-off-by: Joel Fernandes (Google) 
---
 include/trace/events/rcu.h | 25 +
 kernel/rcu/rcu_segcblist.c | 31 +++
 kernel/rcu/rcu_segcblist.h |  5 +
 kernel/rcu/tree.c  |  9 +
 4 files changed, 70 insertions(+)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 155b5cb43cfd..9f2237d9b0c8 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -505,6 +505,31 @@ TRACE_EVENT_RCU(rcu_callback,
  __entry->qlen)
 );
 
+TRACE_EVENT_RCU(rcu_segcb,
+
+   TP_PROTO(const char *ctx, int *cb_count, unsigned long *gp_seq),
+
+   TP_ARGS(ctx, cb_count, gp_seq),
+
+   TP_STRUCT__entry(
+   __field(const char *, ctx)
+   __array(int, cb_count, RCU_CBLIST_NSEGS)
+   __array(unsigned long, gp_seq, RCU_CBLIST_NSEGS)
+   ),
+
+   TP_fast_assign(
+   __entry->ctx = ctx;
+   memcpy(__entry->cb_count, cb_count, RCU_CBLIST_NSEGS * 
sizeof(int));
+   memcpy(__entry->gp_seq, gp_seq, RCU_CBLIST_NSEGS * 
sizeof(unsigned long));
+   ),
+
+   TP_printk("%s cb_count: (DONE=%d, WAIT=%d, NEXT_READY=%d, 
NEXT=%d) "
+ "gp_seq: (DONE=%lu, WAIT=%lu, NEXT_READY=%lu, 
NEXT=%lu)", __entry->ctx,
+ __entry->cb_count[0], __entry->cb_count[1], 
__entry->cb_count[2], __entry->cb_count[3],
+ __entry->gp_seq[0], __entry->gp_seq[1], 
__entry->gp_seq[2], __entry->gp_seq[3])
+
+);
+
 /*
  * Tracepoint for the registration of a single RCU callback of the special
  * kvfree() form.  The first argument is the RCU type, the second argument
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 357c19bbcb00..b0aaa51e0ee6 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -14,6 +14,7 @@
 #include 
 
 #include "rcu_segcblist.h"
+#include "rcu.h"
 
 /* Initialize simple callback list. */
 void rcu_cblist_init(struct rcu_cblist *rclp)
@@ -328,6 +329,36 @@ void rcu_segcblist_extract_done_cbs(struct rcu_segcblist 
*rsclp,
rcu_segcblist_set_seglen(rsclp, RCU_DONE_TAIL, 0);
 }
 
+/*
+ * Return how many CBs are in each segment, along with their gp_seq values.
+ *
+ * This function is O(N) where N is the number of segments. Only used from
+ * tracing code which is usually disabled in production.
+ */
+#ifdef CONFIG_RCU_TRACE
+static void rcu_segcblist_countseq(struct rcu_segcblist *rsclp,
+int cbcount[RCU_CBLIST_NSEGS],
+unsigned long gpseq[RCU_CBLIST_NSEGS])
+{
+   int i;
+
+   for (i = 0; i < RCU_CBLIST_NSEGS; i++) {
+   cbcount[i] = rcu_segcblist_get_seglen(rsclp, i);
+   gpseq[i] = rsclp->gp_seq[i];
+   }
+}
+
+void trace_rcu_segcb_list(struct rcu_segcblist *rsclp, const char *context)
+{
+   int cbs[RCU_CBLIST_NSEGS];
+   unsigned long gps[RCU_CBLIST_NSEGS];
+
+   rcu_segcblist_countseq(rsclp, cbs, gps);
+
+   trace_rcu_segcb(context, cbs, gps);
+}
+#endif
+
 /*
  * Extract only those callbacks still pending (not yet ready to be
  * invoked) from the specified rcu_segcblist structure and place them in
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index cd35c9faaf51..c2e274ae0912 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -103,3 +103,8 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, 
unsigned long seq);
 bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq);
 void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 struct rcu_segcblist *src_rsclp);
+#ifdef CONFIG_RCU_TRACE
+void trace_rcu_segcb_list(struct rcu_segcblist *rsclp, const char *context);
+#else
+#define trace_rcu_segcb_list(...)
+#endif
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 24c00020ab83..346a05506935 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1497,6 +1497,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, 
struct rcu_data *rdp)
if (!rcu_segcblist_pend_cbs(&rdp->cblist))
return false;
 
+   trace_rcu_segcb_list(&rdp->cblist, TPS("SegCbPreAcc"));
+
/*
 * Callbacks are often registered with incomplete grace-period
 * information.  Something about the fact that getting exact
@@ -1517,6 +1519,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, 
struc

[PATCH v8 1/6] rcu/tree: Make rcu_do_batch count how many callbacks were executed

2020-10-21 Thread Joel Fernandes (Google)
Currently, rcu_do_batch() depends on the unsegmented callback list's len field
to know how many CBs were executed. This field counts down from 0 as CBs are
dequeued.  It is possible that not all CBs could be run because of reaching
limits, in which case the remaining unexecuted callbacks are requeued in the
CPU's segcblist.

The number of callbacks that were run (and hence not requeued) is then the
negative count stored in rcl->len, which has been counting down on every
dequeue. This negative count is then added to the per-CPU segmented callback
list's length to correct its count.

Such a design works against future efforts to track the length of each segment
of the segmented callback list. The reason is because
rcu_segcblist_extract_done_cbs() will be populating the unsegmented callback
list's length field (rcl->len) during extraction.

Also, the design of counting down from 0 is confusing and error-prone IMHO.

This commit therefore explicitly counts how many callbacks were executed in
rcu_do_batch() itself, and uses that to update the per-CPU segcb list's ->len
field, without relying on the negativity of rcl->len.
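
To make the accounting concrete, here is a small userspace sketch (not kernel
code; the variable names are invented) showing that counting up agrees with
the old count-down-from-zero scheme when a batch stops early at the limit:

  #include <stdio.h>

  int main(void)
  {
      long segcb_len = 8;   /* callbacks sitting on the per-CPU segcblist  */
      long rcl_len   = 0;   /* the extracted list's counter (old scheme)   */
      long count     = 0;   /* explicit executed-CB counter (this patch)   */
      long bl        = 3;   /* batch limit                                 */

      for (long i = 0; i < segcb_len; i++) {
          rcl_len--;        /* old: counts down from 0 on every dequeue    */
          count++;          /* new: counts up on every dequeue             */
          if (count >= bl)
              break;        /* limit reached; the rest gets requeued       */
      }

      /* Both schemes agree on how many callbacks were executed... */
      printf("executed: old=%ld new=%ld\n", -rcl_len, count);

      /* ...and on the correction applied to the per-CPU list's length,
       * i.e. rcu_segcblist_add_len(&rdp->cblist, -count) in the patch.
       */
      printf("remaining on segcblist: %ld\n", segcb_len - count);
      return 0;
  }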

Signed-off-by: Joel Fernandes (Google) 
Reviewed-by: Frederic Weisbecker 
---
 kernel/rcu/rcu_segcblist.c |  2 +-
 kernel/rcu/rcu_segcblist.h |  1 +
 kernel/rcu/tree.c  | 11 +--
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 2d2a6b6b9dfb..bb246d8c6ef1 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -95,7 +95,7 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
  * This increase is fully ordered with respect to the callers accesses
  * both before and after.
  */
-static void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
+void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
 {
 #ifdef CONFIG_RCU_NOCB_CPU
smp_mb__before_atomic(); /* Up to the caller! */
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 492262bcb591..1d2d61406463 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -76,6 +76,7 @@ static inline bool rcu_segcblist_restempty(struct 
rcu_segcblist *rsclp, int seg)
 }
 
 void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp);
+void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v);
 void rcu_segcblist_init(struct rcu_segcblist *rsclp);
 void rcu_segcblist_disable(struct rcu_segcblist *rsclp);
 void rcu_segcblist_offload(struct rcu_segcblist *rsclp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 286dc0a1b184..24c00020ab83 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2429,7 +2429,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
struct rcu_head *rhp;
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
-   long bl, count;
+   long bl, count = 0;
long pending, tlimit = 0;
 
/* If no callbacks are ready, just return. */
@@ -2474,6 +2474,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
rcu_callback_t f;
 
+   count++;
debug_rcu_head_unqueue(rhp);
 
rcu_lock_acquire(&rcu_callback_map);
@@ -2487,15 +2488,14 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
/*
 * Stop only if limit reached and CPU has something to do.
-* Note: The rcl structure counts down from zero.
 */
-   if (-rcl.len >= bl && !offloaded &&
+   if (count >= bl && !offloaded &&
(need_resched() ||
 (!is_idle_task(current) && !rcu_is_callbacks_kthread(
break;
if (unlikely(tlimit)) {
/* only call local_clock() every 32 callbacks */
-   if (likely((-rcl.len & 31) || local_clock() < tlimit))
+   if (likely((count & 31) || local_clock() < tlimit))
continue;
/* Exceeded the time limit, so leave. */
break;
@@ -2512,7 +2512,6 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
local_irq_save(flags);
rcu_nocb_lock(rdp);
-   count = -rcl.len;
rdp->n_cbs_invoked += count;
trace_rcu_batch_end(rcu_state.name, count, !!rcl.head, need_resched(),
is_idle_task(current), rcu_is_callbacks_kthread());
@@ -2520,7 +2519,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
/* Update counts and requeue any remaining callbacks. */
rcu_segcblist_insert_done_cbs(&rdp->cblist, &rcl);
smp_mb(); /* List handling before counting for rcu_barrier(). */
-   rcu_segcblist_insert_count(&rdp->cblist, &rcl);
+   r

[PATCH v8 2/6] rcu/segcblist: Add counters to segcblist datastructure

2020-10-21 Thread Joel Fernandes (Google)
Add counting of segment lengths of segmented callback list.

This will be useful for a number of things, such as knowing how big the
ready-to-execute segment has gotten. The immediate benefit is the ability
to trace how the callbacks in the segmented callback list change.

Also, this patch removes hacks related to using the donecbs list's ->len field
as a temporary variable to save the segmented callback list's length. This
cannot be done anymore and is not needed.
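
As a toy model of the bookkeeping this adds (a userspace sketch with invented
names, not the kernel code): each of the four segments carries its own count,
and moving callbacks between segments moves the count with them, so the
per-segment counts always sum to the list's total length:

  #include <stdio.h>

  enum { RCU_DONE_TAIL, RCU_WAIT_TAIL, RCU_NEXT_READY_TAIL, RCU_NEXT_TAIL,
         RCU_CBLIST_NSEGS };

  static long seglen[RCU_CBLIST_NSEGS];
  static long len;

  static void enqueue_cb(void)
  {
      seglen[RCU_NEXT_TAIL]++;  /* new callbacks land in the NEXT segment */
      len++;
  }

  /* Mirrors rcu_segcblist_move_seglen(): carry the count along. */
  static void move_seglen(int from, int to)
  {
      seglen[to] += seglen[from];
      seglen[from] = 0;
  }

  int main(void)
  {
      long sum = 0;

      for (int i = 0; i < 5; i++)
          enqueue_cb();

      /* A grace period elapsing would advance NEXT -> WAIT -> DONE. */
      move_seglen(RCU_NEXT_TAIL, RCU_WAIT_TAIL);
      move_seglen(RCU_WAIT_TAIL, RCU_DONE_TAIL);

      for (int i = 0; i < RCU_CBLIST_NSEGS; i++)
          sum += seglen[i];

      printf("DONE=%ld WAIT=%ld NEXT_READY=%ld NEXT=%ld\n",
             seglen[0], seglen[1], seglen[2], seglen[3]);
      printf("sum=%ld len=%ld (invariant %s)\n", sum, len,
             sum == len ? "holds" : "broken");
      return 0;
  }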

Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/rcu_segcblist.h |   1 +
 kernel/rcu/rcu_segcblist.c| 120 ++
 kernel/rcu/rcu_segcblist.h|   2 -
 3 files changed, 79 insertions(+), 44 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index b36afe7b22c9..6c01f09a6456 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -72,6 +72,7 @@ struct rcu_segcblist {
 #else
long len;
 #endif
+   long seglen[RCU_CBLIST_NSEGS];
u8 enabled;
u8 offloaded;
 };
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index bb246d8c6ef1..357c19bbcb00 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -7,10 +7,11 @@
  * Authors: Paul E. McKenney 
  */
 
-#include 
-#include 
+#include 
 #include 
+#include 
 #include 
+#include 
 
 #include "rcu_segcblist.h"
 
@@ -88,6 +89,46 @@ static void rcu_segcblist_set_len(struct rcu_segcblist 
*rsclp, long v)
 #endif
 }
 
+/* Get the length of a segment of the rcu_segcblist structure. */
+static long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg)
+{
+   return READ_ONCE(rsclp->seglen[seg]);
+}
+
+/* Set the length of a segment of the rcu_segcblist structure. */
+static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
+{
+   WRITE_ONCE(rsclp->seglen[seg], v);
+}
+
+/* Add to the length of a segment of the segmented callback list. */
+static void rcu_segcblist_add_seglen(struct rcu_segcblist *rsclp, int seg, 
long v)
+{
+   WRITE_ONCE(rsclp->seglen[seg], rsclp->seglen[seg] + v);
+}
+
+/* Move from's segment length to to's segment. */
+static void rcu_segcblist_move_seglen(struct rcu_segcblist *rsclp, int from, 
int to)
+{
+   long len;
+
+   if (from == to)
+   return;
+
+   len = rcu_segcblist_get_seglen(rsclp, from);
+   if (!len)
+   return;
+
+   rcu_segcblist_add_seglen(rsclp, to, len);
+   rcu_segcblist_set_seglen(rsclp, from, 0);
+}
+
+/* Increment segment's length. */
+static void rcu_segcblist_inc_seglen(struct rcu_segcblist *rsclp, int seg)
+{
+   rcu_segcblist_add_seglen(rsclp, seg, 1);
+}
+
 /*
  * Increase the numeric length of an rcu_segcblist structure by the
  * specified amount, which can be negative.  This can cause the ->len
@@ -119,26 +160,6 @@ void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp)
rcu_segcblist_add_len(rsclp, 1);
 }
 
-/*
- * Exchange the numeric length of the specified rcu_segcblist structure
- * with the specified value.  This can cause the ->len field to disagree
- * with the actual number of callbacks on the structure.  This exchange is
- * fully ordered with respect to the callers accesses both before and after.
- */
-static long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
-{
-#ifdef CONFIG_RCU_NOCB_CPU
-   return atomic_long_xchg(&rsclp->len, v);
-#else
-   long ret = rsclp->len;
-
-   smp_mb(); /* Up to the caller! */
-   WRITE_ONCE(rsclp->len, v);
-   smp_mb(); /* Up to the caller! */
-   return ret;
-#endif
-}
-
 /*
  * Initialize an rcu_segcblist structure.
  */
@@ -149,8 +170,10 @@ void rcu_segcblist_init(struct rcu_segcblist *rsclp)
BUILD_BUG_ON(RCU_NEXT_TAIL + 1 != ARRAY_SIZE(rsclp->gp_seq));
BUILD_BUG_ON(ARRAY_SIZE(rsclp->tails) != ARRAY_SIZE(rsclp->gp_seq));
rsclp->head = NULL;
-   for (i = 0; i < RCU_CBLIST_NSEGS; i++)
+   for (i = 0; i < RCU_CBLIST_NSEGS; i++) {
rsclp->tails[i] = &rsclp->head;
+   rcu_segcblist_set_seglen(rsclp, i, 0);
+   }
rcu_segcblist_set_len(rsclp, 0);
rsclp->enabled = 1;
 }
@@ -246,6 +269,7 @@ void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
 {
rcu_segcblist_inc_len(rsclp);
smp_mb(); /* Ensure counts are updated before callback is enqueued. */
+   rcu_segcblist_inc_seglen(rsclp, RCU_NEXT_TAIL);
rhp->next = NULL;
WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp);
WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], &rhp->next);
@@ -274,27 +298,13 @@ bool rcu_segcblist_entrain(struct rcu_segcblist *rsclp,
for (i = RCU_NEXT_TAIL; i > RCU_DONE_TAIL; i--)
if (rsclp->tails[i] != rsclp->tails[i - 1])
break;
+   rcu_segcblist_inc_seglen(rsclp, i);
WRITE_ONCE(*rsclp->tails[i], rhp)

[PATCH v8 0/6] Add support for length of each segment in the segcblist

2020-10-21 Thread Joel Fernandes (Google)
From: Joel Fernandes 

This is required for several usecases identified. One of them being tracing how
the segmented callback list changes. Tracing this has identified issues in RCU
code in the past.

From Paul:
Another use case is of course more accurately determining whether a given CPU's
large pile of callbacks can be best served by making grace periods go faster,
invoking callbacks more vigorously, or both.  It should also be possible to
simplify some of the callback handling a bit, given that some of the unnatural
acts are due to there having been no per-batch counts.

Revision history:
v8: Small style changes, making the seglen as non-atomic since it is always
under lock (Frederic).

v7: Cleaned up memory barriers (thanks fweisbec@ for reviewing), made minor
corrections per Neeraj (thanks).

v6: Fixed TREE04, and restored older logic to ensure rcu_barrier works.

v5: Various changes, bug fixes. Discovery of rcu_barrier issue.

v4: Restructured rcu_do_batch() and segcblist merging to avoid issues.
Fixed minor nit from Davidlohr.
v1->v3: minor nits.
(https://lore.kernel.org/lkml/20200719034210.2382053-1-joel@xxxxx/)

Joel Fernandes (Google) (6):
rcu/tree: Make rcu_do_batch count how many callbacks were executed
rcu/segcblist: Add counters to segcblist datastructure
rcu/trace: Add tracing for how segcb list changes
rcu/segcblist: Remove useless rcupdate.h include
rcu/tree: segcblist: Remove redundant smp_mb()s
rcu/segcblist: Add additional comments to explain smp_mb()

include/linux/rcu_segcblist.h |   1 +
include/trace/events/rcu.h|  25 +
kernel/rcu/rcu_segcblist.c| 195 +-
kernel/rcu/rcu_segcblist.h|   8 +-
kernel/rcu/tree.c |  21 ++--
5 files changed, 192 insertions(+), 58 deletions(-)

--
2.29.0.rc1.297.gfa9743e501-goog



[PATCH v8 4/6] rcu/segcblist: Remove useless rcupdate.h include

2020-10-21 Thread Joel Fernandes (Google)
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/rcu_segcblist.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index b0aaa51e0ee6..19ff82b805fb 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include "rcu_segcblist.h"
-- 
2.29.0.rc1.297.gfa9743e501-goog



[PATCH v8 6/6] rcu/segcblist: Add additional comments to explain smp_mb()

2020-10-21 Thread Joel Fernandes (Google)
Memory barriers are needed when updating the full length of the
segcblist, however it is not fully clear why one is needed before and
after. This patch therefore adds additional comments to the function
header to explain it.

Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/rcu_segcblist.c | 40 ++
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index f0fcdf9d0f7f..1652b874855e 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -135,17 +135,49 @@ static void rcu_segcblist_inc_seglen(struct rcu_segcblist 
*rsclp, int seg)
  * field to disagree with the actual number of callbacks on the structure.
  * This increase is fully ordered with respect to the callers accesses
  * both before and after.
+ *
+ * About memory barriers:
+ * There is a situation where rcu_barrier() locklessly samples the full
+ * length of the segmented cblist before deciding what to do. That can
+ * race with another path that calls this function such as enqueue or dequeue.
+ * rcu_barrier() should not wrongly assume there are no callbacks, so any
+ * transitions from 1->0 and 0->1 have to be carefully ordered with respect to
+ * list modifications and with whatever follows the rcu_barrier().
+ *
+ * There are at least 2 cases:
+ * CASE 1: Memory barrier is needed before adding to length, for the case where
+ * v is negative which does not happen in current code, but used
+ * to happen. Keep the memory barrier for robustness reasons. When/If the
+ * length transitions from 1 -> 0, the write to 0 has to be ordered *after*
+ * the memory accesses of the CBs that were dequeued and the segcblist
+ * modifications:
+ * P0 (what P1 sees)   P1
+ * set len = 0
+ *  rcu_barrier sees len as 0
+ * dequeue from list
+ *  rcu_barrier does nothing.
+ *
+ * CASE 2: Memory barrier is needed after adding to length for the case
+ * where length transitions from 0 -> 1. This is because rcu_barrier()
+ * should never miss an update to the length. So the update to length
+ * has to be seen *before* any modifications to the segmented list. Otherwise a
+ * race can happen.
+ * P0 (what P1 sees)   P1
+ * queue to list
+ *  rcu_barrier sees len as 0
+ * set len = 1.
+ *  rcu_barrier does nothing.
  */
 void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
 {
 #ifdef CONFIG_RCU_NOCB_CPU
-   smp_mb__before_atomic(); /* Up to the caller! */
+   smp_mb__before_atomic(); /* Read function's comments */
atomic_long_add(v, &rsclp->len);
-   smp_mb__after_atomic(); /* Up to the caller! */
+   smp_mb__after_atomic();  /* Read function's comments */
 #else
-   smp_mb(); /* Up to the caller! */
+   smp_mb(); /* Read function's comments */
WRITE_ONCE(rsclp->len, rsclp->len + v);
-   smp_mb(); /* Up to the caller! */
+   smp_mb(); /* Read function's comments */
 #endif
 }
 
-- 
2.29.0.rc1.297.gfa9743e501-goog



[PATCH v8 5/6] rcu/tree: segcblist: Remove redundant smp_mb()s

2020-10-21 Thread Joel Fernandes (Google)
This memory barrier is not needed as rcu_segcblist_add_len() already
includes a memory barrier *before* the length of the list is updated.

The same reasoning applies to the barrier removed from rcu_segcblist_enqueue().

Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/rcu_segcblist.c | 1 -
 kernel/rcu/tree.c  | 1 -
 2 files changed, 2 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 19ff82b805fb..f0fcdf9d0f7f 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -268,7 +268,6 @@ void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
   struct rcu_head *rhp)
 {
rcu_segcblist_inc_len(rsclp);
-   smp_mb(); /* Ensure counts are updated before callback is enqueued. */
rcu_segcblist_inc_seglen(rsclp, RCU_NEXT_TAIL);
rhp->next = NULL;
WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 346a05506935..6c6d3c7036e6 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2525,7 +2525,6 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
/* Update counts and requeue any remaining callbacks. */
rcu_segcblist_insert_done_cbs(&rdp->cblist, &rcl);
-   smp_mb(); /* List handling before counting for rcu_barrier(). */
rcu_segcblist_add_len(&rdp->cblist, -count);
 
/* Reinstate batch limit if we have worked down the excess. */
-- 
2.29.0.rc1.297.gfa9743e501-goog



[PATCH v8 -tip 17/26] sched: Split the cookie and setup per-task cookie on fork

2020-10-19 Thread Joel Fernandes (Google)
In order to prevent interference and clearly support both per-task and CGroup
APIs, split the cookie into 2 and allow it to be set from either the per-task
or the CGroup API. The final cookie is the combined value of both and is
computed when the stop-machine executes during a change of cookie.

Also, for the per-task cookie, it would get weird if we used pointers to
ephemeral objects. For this reason, introduce a refcounted object whose sole
purpose is to assign a unique cookie value by way of the object's pointer.

While at it, refactor the CGroup code a bit. Future patches will introduce more
APIs and support.
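
To make "the combined value of both" concrete, here is a userspace sketch of
the packing done further down in this patch (illustrative only; it assumes a
64-bit unsigned long and the helper name is invented). The per-task cookie
occupies the upper half of the final cookie and the group cookie the lower
half, so in this model two tasks end up with the same core_cookie only if
both halves match:

  #include <stdio.h>

  static unsigned long combine_cookies(unsigned long task_cookie,
                                       unsigned long group_cookie)
  {
      /* Upper half: task cookie; lower half: group cookie. */
      return (task_cookie << (sizeof(unsigned long) * 4)) + group_cookie;
  }

  int main(void)
  {
      unsigned long task_cookie  = 0x1234;  /* e.g. a sched_core_cookie address */
      unsigned long group_cookie = 0xabcd;  /* e.g. derived from the task_group */
      unsigned long core_cookie  = combine_cookies(task_cookie, group_cookie);

      printf("core_cookie     = 0x%lx\n", core_cookie);
      printf("task half back  = 0x%lx\n",
             core_cookie >> (sizeof(unsigned long) * 4));
      printf("group half back = 0x%lx\n",
             core_cookie & ((1UL << (sizeof(unsigned long) * 4)) - 1));
      return 0;
  }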

Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h |   2 +
 kernel/sched/core.c   | 241 --
 kernel/sched/debug.c  |   4 +
 3 files changed, 236 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fe6f225bfbf9..c6034c00846a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,8 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned long   core_task_cookie;
+   unsigned long   core_group_cookie;
unsigned intcore_occupation;
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bab4ea2f5cd8..30a9e4cb5ce1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -346,11 +346,14 @@ void sched_core_put(void)
mutex_unlock(&sched_core_mutex);
 }
 
+static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
 static bool sched_core_enqueued(struct task_struct *task) { return false; }
static int sched_core_share_tasks(struct task_struct *t1, struct task_struct 
*t2) { return 0; }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -3583,6 +3586,20 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
 #endif
 #ifdef CONFIG_SCHED_CORE
RB_CLEAR_NODE(&p->core_node);
+
+   /*
+* Tag child via per-task cookie only if parent is tagged via per-task
+* cookie. This is independent of, but can be additive to the CGroup 
tagging.
+*/
+   if (current->core_task_cookie) {
+
+   /* If it is not CLONE_THREAD fork, assign a unique per-task 
tag. */
+   if (!(clone_flags & CLONE_THREAD)) {
+   return sched_core_share_tasks(p, p);
+   }
+   /* Otherwise share the parent's per-task tag. */
+   return sched_core_share_tasks(p, current);
+   }
 #endif
return 0;
 }
@@ -9177,6 +9194,217 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_SCHED_CORE
+/*
+ * A simple wrapper around refcount. An allocated sched_core_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_cookie {
+   refcount_t refcnt;
+};
+
+/*
+ * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
+ * @p: The task to assign a cookie to.
+ * @cookie: The cookie to assign.
+ * @group: is it a group interface or a per-task interface.
+ *
+ * This function is typically called from a stop-machine handler.
+ */
+void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool 
group)
+{
+   if (!p)
+   return;
+
+   if (group)
+   p->core_group_cookie = cookie;
+   else
+   p->core_task_cookie = cookie;
+
+   /* Use up half of the cookie's bits for task cookie and remaining for 
group cookie. */
+   p->core_cookie = (p->core_task_cookie <<
+   (sizeof(unsigned long) * 4)) + 
p->core_group_cookie;
+
+   if (sched_core_enqueued(p)) {
+   sched_core_dequeue(task_rq(p), p);
+   if (!p->core_task_cookie)
+   return;
+   }
+
+   if (sched_core_enabled(task_rq(p)) &&
+   p->core_cookie && task_on_rq_queued(p))
+   sched_core_enqueue(task_rq(p), p);
+}
+
+/* Per-task interface */
+static unsigned long sched_core_alloc_task_cookie(void)
+{
+   struct sched_core_cookie *ptr =
+   kmalloc(sizeof(struct sched_core_cookie), GFP_KERNEL);
+
+   if (!ptr)
+   return 0;
+   refcount_set(&ptr->refcnt, 1);
+
+   /*
+* NOTE: sched_core_put() is not done by put_task_cookie(). Instead, it
+* is done after the stopper runs.
+*/
+   sched_core_get();
+   return (unsigned long)ptr;
+}
+
+static bool sched_core_get_task_cookie(unsigned long cookie)
+{
+   struct sch

[PATCH v8 -tip 16/26] sched: cgroup tagging interface for core scheduling

2020-10-19 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Marks all tasks in a cgroup as matching for core-scheduling.

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also after we forked a task, its core scheduler queue's presence will
need to be updated according to its new cgroup's status.

Use the stop machine mechanism to update all tasks in a cgroup to prevent a
new task from sneaking into the cgroup and being missed out from the update
while we iterate through all the tasks in the cgroup.  A more complicated
scheme could probably avoid the stop machine.  Such a scheme would also
need to resolve inconsistencies between a task's cgroup core scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now as it avoids
such complications.

The core scheduler has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.
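
As a quick usage sketch (userspace, illustrative only): tagging a CGroup so
that its tasks may share a core is just a write of "1" to the tag file. The
path below assumes the legacy "cpu" hierarchy is mounted at /sys/fs/cgroup/cpu
and that a group named "trusted" already exists; the tag file is spelled
cpu.core_tag here to match the selftest later in this series (the
documentation patch in this series calls it cpu.tag):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      const char *path = "/sys/fs/cgroup/cpu/trusted/cpu.core_tag";
      int fd = open(path, O_WRONLY);

      if (fd < 0) {
          perror("open");
          return 1;
      }
      if (write(fd, "1", 1) != 1) {       /* tag the group */
          perror("write");
          close(fd);
          return 1;
      }
      close(fd);
      printf("tagged %s\n", path);
      return 0;
  }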

Tested-by: Julien Desfossez 
Signed-off-by: Tim Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Julien Desfossez 
Signed-off-by: Vineeth Remanan Pillai 
---
 kernel/sched/core.c  | 183 +--
 kernel/sched/sched.h |   4 +
 2 files changed, 181 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5a7aeaa914e3..bab4ea2f5cd8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,6 +157,37 @@ static inline bool __sched_core_less(struct task_struct 
*a, struct task_struct *
return false;
 }
 
+static bool sched_core_empty(struct rq *rq)
+{
+   return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+   return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+   struct task_struct *task;
+
+   task = container_of(rb_first(&rq->core_tree), struct task_struct, 
core_node);
+   return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+   struct task_struct *task;
+
+   while (!sched_core_empty(rq)) {
+   task = sched_core_first(rq);
+   rb_erase(&task->core_node, &rq->core_tree);
+   RB_CLEAR_NODE(&task->core_node);
+   }
+   rq->core->core_task_seq++;
+}
+
 static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
struct rb_node *parent, **node;
@@ -188,10 +219,11 @@ static void sched_core_dequeue(struct rq *rq, struct 
task_struct *p)
 {
rq->core->core_task_seq++;
 
-   if (!p->core_cookie)
+   if (!sched_core_enqueued(p))
return;
 
rb_erase(&p->core_node, &rq->core_tree);
+   RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
@@ -255,8 +287,24 @@ static int __sched_core_stopper(void *data)
bool enabled = !!(unsigned long)data;
int cpu;
 
-   for_each_possible_cpu(cpu)
-   cpu_rq(cpu)->core_enabled = enabled;
+   for_each_possible_cpu(cpu) {
+   struct rq *rq = cpu_rq(cpu);
+
+   WARN_ON_ONCE(enabled == rq->core_enabled);
+
+   if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) 
>= 2)) {
+   /*
+* All active and migrating tasks will have already
+* been removed from core queue when we clear the
+* cgroup tags. However, dying tasks could still be
+* left in core queue. Flush them here.
+*/
+   if (!enabled)
+   sched_core_flush(cpu);
+
+   rq->core_enabled = enabled;
+   }
+   }
 
return 0;
 }
@@ -266,7 +314,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-   // XXX verify there are no cookie tasks (yet)
+   int cpu;
+
+   /* verify there are no cookie tasks (yet) */
+   for_each_online_cpu(cpu)
+   BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -274,8 +326,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-   // XXX verify there are no cookie tasks (left)
-
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
 }
@@ -300,6 +350,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -3529,6 +3580,9 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
 #ifdef CONFIG_SMP
plist_node_init(&p->pushable_tasks, 

[PATCH v8 -tip 15/26] entry/kvm: Protect the kernel when entering from guest

2020-10-19 Thread Joel Fernandes (Google)
From: Vineeth Pillai 

Similar to how user to kernel mode transitions are protected in earlier
patches, protect the entry into kernel from guest mode as well.

Tested-by: Julien Desfossez 
Signed-off-by: Vineeth Pillai 
---
 arch/x86/kvm/x86.c|  3 +++
 include/linux/entry-kvm.h | 12 
 kernel/entry/kvm.c| 13 +
 3 files changed, 28 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ce856e0ece84..05a281f3ef28 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8540,6 +8540,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 */
smp_mb__after_srcu_read_unlock();
 
+   kvm_exit_to_guest_mode(vcpu);
+
/*
 * This handles the case where a posted interrupt was
 * notified with kvm_vcpu_kick.
@@ -8633,6 +8635,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
}
 
+   kvm_enter_from_guest_mode(vcpu);
local_irq_enable();
preempt_enable();
 
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 0cef17afb41a..32aabb7f3e6d 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -77,4 +77,16 @@ static inline bool xfer_to_guest_mode_work_pending(void)
 }
 #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
 
+/**
+ * kvm_enter_from_guest_mode - Hook called just after entering kernel from 
guest.
+ * @vcpu:   Pointer to the current VCPU data
+ */
+void kvm_enter_from_guest_mode(struct kvm_vcpu *vcpu);
+
+/**
+ * kvm_exit_to_guest_mode - Hook called just before entering guest from kernel.
+ * @vcpu:   Pointer to the current VCPU data
+ */
+void kvm_exit_to_guest_mode(struct kvm_vcpu *vcpu);
+
 #endif
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index eb1a8a4c867c..b0b7facf4374 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -49,3 +49,16 @@ int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu)
return xfer_to_guest_mode_work(vcpu, ti_work);
 }
 EXPORT_SYMBOL_GPL(xfer_to_guest_mode_handle_work);
+
+void kvm_enter_from_guest_mode(struct kvm_vcpu *vcpu)
+{
+   sched_core_unsafe_enter();
+}
+EXPORT_SYMBOL_GPL(kvm_enter_from_guest_mode);
+
+void kvm_exit_to_guest_mode(struct kvm_vcpu *vcpu)
+{
+   sched_core_unsafe_exit();
+   sched_core_wait_till_safe(XFER_TO_GUEST_MODE_WORK);
+}
+EXPORT_SYMBOL_GPL(kvm_exit_to_guest_mode);
-- 
2.29.0.rc1.297.gfa9743e501-goog



[PATCH v8 -tip 20/26] sched: Release references to the per-task cookie on exit

2020-10-19 Thread Joel Fernandes (Google)
During exit, we have to free the references to a cookie that might be shared by
many tasks. This commit therefore ensures that when the task_struct is
released, any references to cookies that it holds are also released.

Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h | 2 ++
 kernel/fork.c | 1 +
 kernel/sched/core.c   | 8 
 3 files changed, 11 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cb76575afa8..eabd96beff92 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2079,12 +2079,14 @@ void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
 int sched_core_share_pid(pid_t pid);
+void sched_tsk_free(struct task_struct *tsk);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
 #define sched_core_share_pid(pid_t pid) do { } while (0)
+#define sched_tsk_free(tsk) do { } while (0)
 #endif
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index b9c289d0f4ef..a39248a02fdd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -736,6 +736,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   sched_tsk_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42aa811eab14..61e1dcf11000 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9631,6 +9631,14 @@ static int cpu_core_tag_color_write_u64(struct 
cgroup_subsys_state *css,
 
return 0;
 }
+
+void sched_tsk_free(struct task_struct *tsk)
+{
+   if (!tsk->core_task_cookie)
+   return;
+   sched_core_put_task_cookie(tsk->core_task_cookie);
+   sched_core_put();
+}
 #endif
 
 static struct cftype cpu_legacy_files[] = {
-- 
2.29.0.rc1.297.gfa9743e501-goog



[PATCH v8 -tip 26/26] sched: Debug bits...

2020-10-19 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Tested-by: Julien Desfossez 
Not-Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/core.c | 37 -
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 211e0784675f..61758b5478d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -127,6 +127,10 @@ static inline bool prio_less(struct task_struct *a, struct 
task_struct *b)
 
int pa = __task_prio(a), pb = __task_prio(b);
 
+   trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
if (-pa < -pb)
return true;
 
@@ -317,12 +321,16 @@ static void __sched_core_enable(void)
 
static_branch_enable(&__sched_core_enabled);
stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+   printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
stop_machine(__sched_core_stopper, (void *)false, NULL);
static_branch_disable(&__sched_core_enabled);
+
+   printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -4978,6 +4986,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
set_next_task(rq, next);
}
 
+   trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+rq->core->core_task_seq,
+rq->core->core_pick_seq,
+rq->core_sched_seq,
+next->comm, next->pid,
+next->core_cookie);
+
rq->core_pick = NULL;
return next;
}
@@ -5066,6 +5081,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 */
if (i == cpu && !need_sync && !p->core_cookie) {
next = p;
+   trace_printk("unconstrained pick: %s/%d %lx\n",
+next->comm, next->pid, 
next->core_cookie);
goto done;
}
 
@@ -5074,6 +5091,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
rq_i->core_pick = p;
 
+   trace_printk("cpu(%d): selected: %s/%d %lx\n",
+i, p->comm, p->pid, p->core_cookie);
+
/*
 * If this new candidate is of higher priority than the
 * previous; and they're incompatible; we need to wipe
@@ -5090,6 +5110,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
rq->core->core_cookie = p->core_cookie;
max = p;
 
+   trace_printk("max: %s/%d %lx\n", max->comm, 
max->pid, max->core_cookie);
+
if (old_max) {
for_each_cpu(j, smt_mask) {
if (j == i)
@@ -5120,6 +5142,7 @@ next_class:;
 
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
+   trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, 
next->core_cookie);
 
/*
 * Reschedule siblings
@@ -5155,13 +5178,21 @@ next_class:;
}
 
/* Did we break L1TF mitigation requirements? */
-   WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+   if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+   trace_printk("[%d]: cookie mismatch. 
%s/%d/0x%lx/0x%lx\n",
+rq_i->cpu, rq_i->core_pick->comm,
+rq_i->core_pick->pid,
+rq_i->core_pick->core_cookie,
+rq_i->core->core_cookie);
+   WARN_ON_ONCE(1);
+   }
 
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
 
+   trace_printk("IPI(%d)\n", i);
resched_curr(rq_i);
}
 
@@ -5209,6 +5240,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
 
+   trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+p->comm, p->pid, that, this,
+p->core_occupation, dst->idle->core_occupation, 
cookie);
+
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src, p, 0);
set_task_cpu(p, this);

[PATCH v8 -tip 23/26] kselftest: Add tests for core-sched interface

2020-10-19 Thread Joel Fernandes (Google)
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/sched/.gitignore  |   1 +
 tools/testing/selftests/sched/Makefile|  14 +
 tools/testing/selftests/sched/config  |   1 +
 .../testing/selftests/sched/test_coresched.c  | 840 ++
 4 files changed, 856 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/test_coresched.c

diff --git a/tools/testing/selftests/sched/.gitignore 
b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index ..4660929b0b9a
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+test_coresched
diff --git a/tools/testing/selftests/sched/Makefile 
b/tools/testing/selftests/sched/Makefile
new file mode 100644
index ..e43b74fc5d7e
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+ $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := test_coresched
+TEST_PROGS := test_coresched
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config 
b/tools/testing/selftests/sched/config
new file mode 100644
index ..e8b09aa7c0c4
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/test_coresched.c 
b/tools/testing/selftests/sched/test_coresched.c
new file mode 100644
index ..2fdefb843115
--- /dev/null
+++ b/tools/testing/selftests/sched/test_coresched.c
@@ -0,0 +1,840 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Core-scheduling selftests.
+ *
+ * Copyright (C) 2020, Joel Fernandes.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef PR_SCHED_CORE_SHARE
+#define PR_SCHED_CORE_SHARE 59
+#endif
+
+#ifndef DEBUG_PRINT
+#define dprint(...)
+#else
+#define dprint(str, args...) printf("DEBUG: %s: " str "\n", __func__, ##args)
+#endif
+
+void print_banner(char *s)
+{
+printf("coresched: %s:  ", s);
+}
+
+void print_pass(void)
+{
+printf("PASS\n");
+}
+
+void assert_cond(int cond, char *str)
+{
+if (!cond) {
+   printf("Error: %s\n", str);
+   abort();
+}
+}
+
+char *make_group_root(void)
+{
+   char *mntpath, *mnt;
+   int ret;
+
+   mntpath = malloc(50);
+   if (!mntpath) {
+   perror("Failed to allocate mntpath\n");
+   abort();
+   }
+
sprintf(mntpath, "/tmp/coresched-test-XXXXXX");
+   mnt = mkdtemp(mntpath);
+   if (!mnt) {
+   perror("Failed to create mount: ");
+   exit(-1);
+   }
+
+   ret = mount("nodev", mnt, "cgroup", 0, "cpu");
+   if (ret == -1) {
+   perror("Failed to mount cgroup: ");
+   exit(-1);
+   }
+
+   return mnt;
+}
+
+char *read_group_cookie(char *cgroup_path)
+{
+char path[50] = {}, *val;
+int fd;
+
+sprintf(path, "%s/cpu.core_group_cookie", cgroup_path);
+fd = open(path, O_RDONLY, 0666);
+if (fd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+}
+
+val = calloc(1, 50);
+if (read(fd, val, 50) == -1) {
+   perror("Failed to read group cookie: ");
+   abort();
+}
+
+val[strcspn(val, "\r\n")] = 0;
+
+close(fd);
+return val;
+}
+
+void assert_group_tag(char *cgroup_path, char *tag)
+{
+char tag_path[50] = {}, rdbuf[8] = {};
+int tfd;
+
+sprintf(tag_path, "%s/cpu.core_tag", cgroup_path);
+tfd = open(tag_path, O_RDONLY, 0666);
+if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort();
+}
+
+if (read(tfd, rdbuf, 1) != 1) {
+   perror("Failed to enable coresched on cgroup: ");
+   abort();
+}
+
+if (strcmp(rdbuf, tag)) {
+   printf("Group tag does not match (exp: %s, act: %s)\n", tag, rdbuf);
+   abort();
+}
+
+if (close(tfd) == -1) {
+   perror("Failed to close tag fd: ");
+   abort();
+}
+}
+
+void assert_group_color(char *cgroup_path, const char *color)
+{
+char tag_path[50] = {}, rdbuf[8] = {};
+int tfd;
+
+sprintf(tag_path, "%s/cpu.core_tag_color", cgroup_path);
+tfd = open(tag_path, O_RDONLY, 0666);
+if (tfd == -1) {
+   perror("Open of cgroup tag path failed: ");
+   abort

[PATCH v8 -tip 22/26] sched/debug: Add CGroup node for printing group cookie if SCHED_DEBUG

2020-10-19 Thread Joel Fernandes (Google)
This will be used by kselftest to verify the CGroup cookie value that is
set by the CGroup interface.

Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1321c26a8385..b3afbba5abe1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9520,6 +9520,13 @@ static u64 cpu_core_tag_color_read_u64(struct 
cgroup_subsys_state *css, struct c
return tg->core_tag_color;
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+static u64 cpu_core_group_cookie_read_u64(struct cgroup_subsys_state *css, 
struct cftype *cft)
+{
+   return cpu_core_get_group_cookie(css_tg(css));
+}
+#endif
+
 struct write_core_tag {
struct cgroup_subsys_state *css;
unsigned long cookie;
@@ -9695,6 +9702,14 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_core_tag_color_read_u64,
.write_u64 = cpu_core_tag_color_write_u64,
},
+#ifdef CONFIG_SCHED_DEBUG
+   /* Read the effective cookie (color+tag) of the group. */
+   {
+   .name = "core_group_cookie",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_group_cookie_read_u64,
+   },
+#endif
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
{
@@ -9882,6 +9897,14 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_core_tag_color_read_u64,
.write_u64 = cpu_core_tag_color_write_u64,
},
+#ifdef CONFIG_SCHED_DEBUG
+   /* Read the effective cookie (color+tag) of the group. */
+   {
+   .name = "core_group_cookie",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .read_u64 = cpu_core_group_cookie_read_u64,
+   },
+#endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
{
-- 
2.29.0.rc1.297.gfa9743e501-goog



[PATCH v8 -tip 18/26] sched: Add a per-thread core scheduling interface

2020-10-19 Thread Joel Fernandes (Google)
Add a per-thread core scheduling interface which allows a thread to share a
core with another thread, or have a core exclusively for itself.

ChromeOS uses core-scheduling to securely enable hyperthreading.  This cuts
down the keypress latency in Google docs from 150ms to 50ms while improving
the camera streaming frame rate by ~3%.
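
A minimal usage sketch of this interface (userspace, illustrative only; it
assumes the target PID is passed as the prctl's second argument, which is how
the sched_core_share_pid() plumbing below is wired up):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/prctl.h>

  #ifndef PR_SCHED_CORE_SHARE
  #define PR_SCHED_CORE_SHARE 59
  #endif

  int main(int argc, char **argv)
  {
      /* Per the patch, pid 0 resets the caller's per-task cookie (which
       * requires CAP_SYS_ADMIN if a cookie is already set); any other pid
       * shares a cookie with that task, subject to ptrace access checks.
       */
      long pid = argc > 1 ? atol(argv[1]) : 0;

      if (prctl(PR_SCHED_CORE_SHARE, pid, 0, 0, 0) == -1) {
          perror("prctl(PR_SCHED_CORE_SHARE)");
          return 1;
      }

      printf("sharing a core-scheduling cookie with pid %ld\n", pid);
      return 0;
  }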

Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/sched.h|  2 ++
 include/uapi/linux/prctl.h   |  3 ++
 kernel/sched/core.c  | 51 +---
 kernel/sys.c |  3 ++
 tools/include/uapi/linux/prctl.h |  3 ++
 5 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c6034c00846a..4cb76575afa8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2078,11 +2078,13 @@ void sched_core_unsafe_enter(void);
 void sched_core_unsafe_exit(void);
 bool sched_core_wait_till_safe(unsigned long ti_check);
 bool sched_core_kernel_protected(void);
+int sched_core_share_pid(pid_t pid);
 #else
 #define sched_core_unsafe_enter(ignore) do { } while (0)
 #define sched_core_unsafe_exit(ignore) do { } while (0)
 #define sched_core_wait_till_safe(ignore) do { } while (0)
 #define sched_core_kernel_protected(ignore) do { } while (0)
+#define sched_core_share_pid(pid_t pid) do { } while (0)
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c334e6a02e5f..217b0482aea1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -248,4 +248,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER  57
 #define PR_GET_IO_FLUSHER  58
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE_SHARE59
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 30a9e4cb5ce1..a0678614a056 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -310,6 +310,7 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
+static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -3588,8 +3589,9 @@ int sched_fork(unsigned long clone_flags, struct 
task_struct *p)
RB_CLEAR_NODE(>core_node);
 
/*
-* Tag child via per-task cookie only if parent is tagged via per-task
-* cookie. This is independent of, but can be additive to the CGroup 
tagging.
+* If parent is tagged via per-task cookie, tag the child (either with
+* the parent's cookie, or a new one). The final cookie is calculated
+* by concatenating the per-task cookie with that of the CGroup's.
 */
if (current->core_task_cookie) {
 
@@ -9301,7 +9303,7 @@ static int sched_core_share_tasks(struct task_struct *t1, 
struct task_struct *t2
unsigned long cookie;
int ret = -ENOMEM;
 
-   mutex_lock(&sched_core_mutex);
+   mutex_lock(&sched_core_tasks_mutex);
 
/*
 * NOTE: sched_core_get() is done by sched_core_alloc_task_cookie() or
@@ -9400,10 +9402,51 @@ static int sched_core_share_tasks(struct task_struct 
*t1, struct task_struct *t2
 
ret = 0;
 out_unlock:
-   mutex_unlock(&sched_core_mutex);
+   mutex_unlock(&sched_core_tasks_mutex);
return ret;
 }
 
+/* Called from prctl interface: PR_SCHED_CORE_SHARE */
+int sched_core_share_pid(pid_t pid)
+{
+   struct task_struct *task;
+   int err;
+
+   if (pid == 0) { /* Reset current task's cookie. */
+   /* Resetting a cookie requires privileges. */
+   if (current->core_task_cookie)
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+   task = NULL;
+   } else {
+   rcu_read_lock();
+   task = pid ? find_task_by_vpid(pid) : current;
+   if (!task) {
+   rcu_read_unlock();
+   return -ESRCH;
+   }
+
+   get_task_struct(task);
+
+   /*
+* Check if this process has the right to modify the specified
+* process. Use the regular "ptrace_may_access()" checks.
+*/
+   if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+   rcu_read_unlock();
+   err = -EPERM;
+   goto out_put;
+   }
+   rcu_read_unlock();
+   }
+
+   err = sched_core_share_tasks(current, task);
+out_put:
+   if (task)
+   put_task_struct(task);
+   return err;
+}
+
 /* CGroup interface */
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct 
cftype *cft)
 {
diff --git a/kernel/sys.c b/kernel/sys.c
index 6401880dff74..17911b8680b1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2530,6 +2530,9 @@ SYSCALL_DEFINE5(prctl, int, op

[PATCH v8 -tip 09/26] sched: Trivial forced-newidle balancer

2020-10-19 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

When a sibling is forced-idle to match the core-cookie; search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
Acked-by: Paul E. McKenney 
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 130 +-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c3563d7cab7f..d38e904dd603 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,6 +688,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
struct rb_node  core_node;
unsigned long   core_cookie;
+   unsigned intcore_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a5404ec9e89a..02db5b024768 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -202,6 +202,21 @@ static struct task_struct *sched_core_find(struct rq *rq, 
unsigned long cookie)
return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned 
long cookie)
+{
+   struct rb_node *node = &p->core_node;
+
+   node = rb_next(node);
+   if (!node)
+   return NULL;
+
+   p = container_of(node, struct task_struct, core_node);
+   if (p->core_cookie != cookie)
+   return NULL;
+
+   return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -4638,8 +4653,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
const struct sched_class *class;
const struct cpumask *smt_mask;
bool fi_before = false;
+   int i, j, cpu, occ = 0;
bool need_sync;
-   int i, j, cpu;
 
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);
@@ -4768,6 +4783,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
goto done;
}
 
+   if (!is_task_rq_idle(p))
+   occ++;
+
rq_i->core_pick = p;
 
/*
@@ -4793,6 +4811,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
 
cpu_rq(j)->core_pick = NULL;
}
+   occ = 1;
goto again;
} else {
/*
@@ -4842,6 +4861,8 @@ next_class:;
rq_i->core->core_forceidle = true;
}
 
+   rq_i->core_pick->core_occupation = occ;
+
if (i == cpu) {
rq_i->core_pick = NULL;
continue;
@@ -4871,6 +4892,113 @@ next_class:;
return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+   struct task_struct *p;
+   unsigned long cookie;
+   bool success = false;
+
+   local_irq_disable();
+   double_rq_lock(dst, src);
+
+   cookie = dst->core->core_cookie;
+   if (!cookie)
+   goto unlock;
+
+   if (dst->curr != dst->idle)
+   goto unlock;
+
+   p = sched_core_find(src, cookie);
+   if (p == src->idle)
+   goto unlock;
+
+   do {
+   if (p == src->core_pick || p == src->curr)
+   goto next;
+
+   if (!cpumask_test_cpu(this, &p->cpus_mask))
+   goto next;
+
+   if (p->core_occupation > dst->idle->core_occupation)
+   goto next;
+
+   p->on_rq = TASK_ON_RQ_MIGRATING;
+   deactivate_task(src, p, 0);
+   set_task_cpu(p, this);
+   activate_task(dst, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
+
+   resched_curr(dst);
+
+   success = true;
+   break;
+
+next:
+   p = sched_core_next(p, cookie);
+   } while (p);
+
+unlock:
+   double_rq_unlock(dst, src);
+   local_irq_enable();
+
+   return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+   int i;
+
+   for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+   if (i == cpu)
+   continue;
+
+   if (need_resched())
+   break;
+
+   if (try_steal_cookie(cpu, i))
+   return true;
+

[PATCH v8 -tip 25/26] Documentation: Add core scheduling documentation

2020-10-19 Thread Joel Fernandes (Google)
Document the usecases, design and interfaces for core scheduling.

Co-developed-by: Vineeth Pillai 
Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 .../admin-guide/hw-vuln/core-scheduling.rst   | 312 ++
 Documentation/admin-guide/hw-vuln/index.rst   |   1 +
 2 files changed, 313 insertions(+)
 create mode 100644 Documentation/admin-guide/hw-vuln/core-scheduling.rst

diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst 
b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
new file mode 100644
index ..eacafbb8fa3f
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -0,0 +1,312 @@
+Core Scheduling
+***
+Core scheduling support allows userspace to define groups of tasks that can
+share a core. These groups can be specified either for security usecases (one
+group of tasks don't trust another), or for performance usecases (some
+workloads may benefit from running on the same core as they don't need the same
+hardware resources of the shared core).
+
+Security usecase
+
+A cross-HT attack involves the attacker and victim running on different
+Hyper Threads of the same core. MDS and L1TF are examples of such attacks.
+Without core scheduling, the only full mitigation of cross-HT attacks is to
+disable Hyper Threading (HT). Core scheduling allows HT to be turned on safely
+by ensuring that trusted tasks can share a core. This increase in core sharing
+can improve performance, however it is not guaranteed that performance will
+always improve, though that is seen to be the case with a number of real world
+workloads. In theory, core scheduling aims to perform at least as good as when
+Hyper Threading is disabled. In practice, this is mostly the case though not
+always: as synchronizing scheduling decisions across 2 or more CPUs in a core
+involves additional overhead - especially when the system is lightly loaded
+(``total_threads <= N/2``).
+
+Usage
+-
+Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
+Using this feature, userspace defines groups of tasks that trust each other.
+The core scheduler uses this information to make sure that tasks that do not
+trust each other will never run simultaneously on a core, while doing its best
+to satisfy the system's scheduling requirements.
+
+There are 2 ways to use core-scheduling:
+
+CGroup
+##
+Core scheduling adds additional files to the CPU controller CGroup:
+
+* ``cpu.tag``
+Writing ``1`` into this file results in all tasks in the group getting tagged. This
+results in all the CGroup's tasks allowed to run concurrently on a core's
+hyperthreads (also called siblings).
+
+The file being a value of ``0`` means the tag state of the CGroup is inherited
+from its parent hierarchy. If any ancestor of the CGroup is tagged, then the
+group is tagged.
+
+.. note:: Once a CGroup is tagged via cpu.tag, it is not possible to set this
+  for any descendant of the tagged group. For finer grained control, 
the
+  ``cpu.tag_color`` file described next may be used.
+
+.. note:: When a CGroup is not tagged, all the tasks within the group can share
+  a core with kernel threads and untagged system threads. For this 
reason,
+  if a group has ``cpu.tag`` of 0, it is considered to be trusted.
+
+* ``cpu.tag_color``
+For finer grained control over core sharing, a color can also be set in
+addition to the tag. This allows to further control core sharing between child
+CGroups within an already tagged CGroup. The color and the tag are both used to
+generate a `cookie` which is used by the scheduler to identify the group.
+
+Up to 256 different colors can be set (0-255) by writing into this file.
+
+A sample real-world usage of this file follows:
+
+Google uses DAC controls to make ``cpu.tag`` writeable only by root and the
+``cpu.tag_color`` can be changed by anyone.
+
+The hierarchy looks like this:
+::
+  Root group
+ / \
+A   B(These are created by the root daemon - borglet).
+   / \   \
+  C   D   E  (These are created by AppEngine within the container).
+
+A and B are containers for 2 different jobs or apps that are created by a root
daemon called borglet. borglet then tags each of these groups with the 
``cpu.tag``
+file. The job itself can create additional child CGroups which are colored by
+the container's AppEngine with the ``cpu.tag_color`` file.
+
+The reason why Google uses this 2-level tagging system is that AppEngine wants 
to
+allow a subset of child CGroups within a tagged parent CGroup to be 
co-scheduled on a
+core while not being co-scheduled with other child CGroups. Think of these
+child CGroups as belonging to the same customer or project.  Because these
+child CGroups are created by AppEngine, they are not tracked by borglet (the
+root daemon), therefore borglet won't have a chance to set a color for them.
+That's where cpu.tag_color file comes in. A co

[PATCH v8 -tip 19/26] sched: Add a second-level tag for nested CGroup usecase

2020-10-19 Thread Joel Fernandes (Google)
Google has a usecase where the first-level tag on a CGroup is not
sufficient. So, a patch has been carried for years where a second tag is added
which is writeable by unprivileged users.

Google uses DAC controls to make the 'tag' settable only by root while
the second-level 'color' can be changed by anyone. The actual names that
Google uses are different, but the concept is the same.

The hierarchy looks like:

Root group
   / \
  A   B    (These are created by the root daemon - borglet).
 / \   \
C   D   E  (These are created by AppEngine within the container).

The reason why Google has two parts is that AppEngine wants to allow a subset of
subcgroups within a parent tagged cgroup to share execution. Think of these
subcgroups as belonging to the same customer or project. Because these subcgroups
are created by AppEngine, they are not tracked by borglet (the root daemon),
therefore borglet won't have a chance to set a color for them. That's where the
'color' file comes in. Color could be set by AppEngine, and once set, the
normal tasks within the subcgroup would not be able to overwrite it. This is
enforced by promoting the permission of the color file in cgroupfs.

The 'color' is an 8-bit value allowing for up to 256 unique colors. IMHO, having
more CGroups than that sounds like a scalability issue, so this suffices.
We steal the lower 8 bits of the cookie to set the color.

Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c  | 181 +--
 kernel/sched/sched.h |   3 +-
 2 files changed, 158 insertions(+), 26 deletions(-)
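
For illustration only, the tag/color packing performed by
cpu_core_get_group_cookie() in the diff below can be walked through in plain C.
The task_group address and the color are made-up values and a 64-bit unsigned
long is assumed; only the bit manipulation mirrors the patch.

#include <stdio.h>

int main(void)
{
	unsigned long tg_addr = 0xffff888012345600UL; /* hypothetical task_group pointer */
	unsigned long color   = 42;                   /* 8-bit second-level tag */
	unsigned long mask    = (1UL << (sizeof(unsigned long) * 4)) - 1;

	/*
	 * Shift the tag left by 8 bits, fold the color into the low byte,
	 * then keep only the group half of the cookie (lower 32 bits here).
	 */
	unsigned long cookie = ((tg_addr << 8) | color) & mask;

	printf("group cookie: %#lx\n", cookie); /* prints 0x3456002a for these values */
	return 0;
}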

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a0678614a056..42aa811eab14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8522,7 +8522,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
tsk->core_cookie = 0UL;
 
-   if (tg->tagged /* && !tsk->core_cookie ? */)
+   if (tg->core_tagged /* && !tsk->core_cookie ? */)
tsk->core_cookie = (unsigned long)tg;
 #endif
 
@@ -8623,9 +8623,9 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
 #ifdef CONFIG_SCHED_CORE
struct task_group *tg = css_tg(css);
 
-   if (tg->tagged) {
+   if (tg->core_tagged) {
sched_core_put();
-   tg->tagged = 0;
+   tg->core_tagged = 0;
}
 #endif
 }
@@ -9228,7 +9228,7 @@ void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool gr
 
if (sched_core_enqueued(p)) {
sched_core_dequeue(task_rq(p), p);
-   if (!p->core_task_cookie)
+   if (!p->core_cookie)
return;
}
 
@@ -9448,41 +9448,100 @@ int sched_core_share_pid(pid_t pid)
 }
 
 /* CGroup interface */
+
+/*
+ * Helper to get the cookie in a hierarchy.
+ * The cookie is a combination of a tag and color. Any ancestor
+ * can have a tag/color. tag is the first-level cookie setting
+ * with color being the second. At most one color and one tag is
+ * allowed.
+ */
+static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
+{
+   unsigned long color = 0;
+
+   if (!tg)
+   return 0;
+
+   for (; tg; tg = tg->parent) {
+   if (tg->core_tag_color) {
+   WARN_ON_ONCE(color);
+   color = tg->core_tag_color;
+   }
+
+   if (tg->core_tagged) {
+   unsigned long cookie = ((unsigned long)tg << 8) | color;
+   cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
+   return cookie;
+   }
+   }
+
+   return 0;
+}
+
+/* Determine if any group in @tg's children are tagged or colored. */
+static bool cpu_core_check_descendants(struct task_group *tg, bool check_tag,
+   bool check_color)
+{
+   struct task_group *child;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(child, &tg->children, siblings) {
+   if ((child->core_tagged && check_tag) ||
+   (child->core_tag_color && check_color)) {
+   rcu_read_unlock();
+   return true;
+   }
+
+   rcu_read_unlock();
+   return cpu_core_check_descendants(child, check_tag, check_color);
+   }
+
+   rcu_read_unlock();
+   return false;
+}
+
 static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 {
struct task_group *tg = css_tg(css);
 
-   return !!tg->tagged;
+   return !!tg->core_tagged;
+}
+
+static u64 cpu_core_tag_color_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+   struct task_group *tg = cs

[PATCH v8 -tip 11/26] irq_work: Cleanup

2020-10-19 Thread Joel Fernandes (Google)
From: Peter Zijlstra 

Get rid of the __call_single_node union and clean up the API a little
to avoid external code relying on the structure layout as much.

(Needed for irq_work_is_busy() API in core-scheduling series).

Tested-by: Julien Desfossez 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes (Google) 
---
 drivers/gpu/drm/i915/i915_request.c |  4 ++--
 include/linux/irq_work.h| 33 ++---
 include/linux/irqflags.h|  4 ++--
 kernel/bpf/stackmap.c   |  2 +-
 kernel/irq_work.c   | 18 
 kernel/printk/printk.c  |  6 ++
 kernel/rcu/tree.c   |  3 +--
 kernel/time/tick-sched.c|  6 ++
 kernel/trace/bpf_trace.c|  2 +-
 9 files changed, 41 insertions(+), 37 deletions(-)
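
For illustration only, this is roughly what a caller looks like after the
rework. The names demo_fn, demo_work and kick_demo are invented;
IRQ_WORK_INIT(), irq_work_queue() and irq_work_is_busy() are the interfaces
defined or kept by the diff below.

#include <linux/irq_work.h>
#include <linux/printk.h>

static void demo_fn(struct irq_work *work)
{
	pr_info("irq_work callback ran\n");
}

/* Static initialization now goes through the new initializer macro. */
static struct irq_work demo_work = IRQ_WORK_INIT(demo_fn);

static void kick_demo(void)
{
	/* Queue from (e.g.) hard IRQ context; demo_fn runs shortly after. */
	irq_work_queue(&demo_work);

	/* The new helper replaces open-coded peeks at the flags word. */
	if (irq_work_is_busy(&demo_work))
		pr_info("demo_work callback still running\n");
}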

diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index 0e813819b041..5385b081a376 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -197,7 +197,7 @@ __notify_execute_cb(struct i915_request *rq, bool (*fn)(struct irq_work *wrk))
 
llist_for_each_entry_safe(cb, cn,
  llist_del_all(&rq->execute_cb),
- work.llnode)
+ work.node.llist)
 fn(&cb->work);
 }
 
@@ -460,7 +460,7 @@ __await_execution(struct i915_request *rq,
 * callback first, then checking the ACTIVE bit, we serialise with
 * the completed/retired request.
 */
-   if (llist_add(&cb->work.llnode, &signal->execute_cb)) {
+   if (llist_add(&cb->work.node.llist, &signal->execute_cb)) {
if (i915_request_is_active(signal) ||
__request_in_flight(signal))
__notify_execute_cb_imm(signal);
diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 30823780c192..ec2a47a81e42 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -14,28 +14,37 @@
  */
 
 struct irq_work {
-   union {
-   struct __call_single_node node;
-   struct {
-   struct llist_node llnode;
-   atomic_t flags;
-   };
-   };
+   struct __call_single_node node;
void (*func)(struct irq_work *);
 };
 
+#define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){  \
+   .node = { .u_flags = (_flags), },   \
+   .func = (_func),\
+}
+
+#define IRQ_WORK_INIT(_func) __IRQ_WORK_INIT(_func, 0)
+#define IRQ_WORK_INIT_LAZY(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_LAZY)
+#define IRQ_WORK_INIT_HARD(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_HARD_IRQ)
+
+#define DEFINE_IRQ_WORK(name, _f)  \
+   struct irq_work name = IRQ_WORK_INIT(_f)
+
 static inline
 void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
 {
-   atomic_set(&work->flags, 0);
-   work->func = func;
+   *work = IRQ_WORK_INIT(func);
 }
 
-#define DEFINE_IRQ_WORK(name, _f) struct irq_work name = { \
-   .flags = ATOMIC_INIT(0),\
-   .func  = (_f)   \
+static inline bool irq_work_is_pending(struct irq_work *work)
+{
+   return atomic_read(&work->node.a_flags) & IRQ_WORK_PENDING;
 }
 
+static inline bool irq_work_is_busy(struct irq_work *work)
+{
+   return atomic_read(&work->node.a_flags) & IRQ_WORK_BUSY;
+}
 
 bool irq_work_queue(struct irq_work *work);
 bool irq_work_queue_on(struct irq_work *work, int cpu);
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 3ed4e8771b64..fef2d43a7a1d 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -109,12 +109,12 @@ do {  \
 
 # define lockdep_irq_work_enter(__work) \
  do {  \
- if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+ if (!(atomic_read(&__work->node.a_flags) & IRQ_WORK_HARD_IRQ))\
current->irq_config = 1;\
  } while (0)
 # define lockdep_irq_work_exit(__work) \
  do {  \
- if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+ if (!(atomic_read(&__work->node.a_flags) & IRQ_WORK_HARD_IRQ))\
current->irq_config = 0;\
  } while (0)
 
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 06065fa27124..599041cd0c8a 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -298,7 +298,7 @@ static void stac

[PATCH v8 -tip 21/26] sched: Handle task addition to CGroup

2020-10-19 Thread Joel Fernandes (Google)
Due to earlier patches, the old way of computing a task's cookie when it
is added to a CGroup is outdated. Update it by fetching the group's
cookie using the new helpers.

Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/core.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)
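
For illustration only, the cookie update now done in sched_change_group() can
be walked through in plain C. The cookie values are invented and a 64-bit
unsigned long is assumed; SCHED_CORE_GROUP_COOKIE_MASK is the macro added by
this patch.

#include <stdio.h>

#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 1)

int main(void)
{
	unsigned long core_cookie = 0xdeadbeef12345678UL; /* task cookie << 32 | old group cookie */
	unsigned long new_group_cookie = 0x3456002aUL;    /* as from cpu_core_get_group_cookie(tg) */

	core_cookie &= ~SCHED_CORE_GROUP_COOKIE_MASK; /* drop the old group bits */
	core_cookie |= new_group_cookie;              /* install the new group cookie */

	printf("core_cookie: %#lx\n", core_cookie);   /* prints 0xdeadbeef3456002a */
	return 0;
}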

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 61e1dcf11000..1321c26a8385 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8505,6 +8505,9 @@ void sched_offline_group(struct task_group *tg)
 spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
+#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 1)
+static unsigned long cpu_core_get_group_cookie(struct task_group *tg);
+
 static void sched_change_group(struct task_struct *tsk, int type)
 {
struct task_group *tg;
@@ -8519,11 +8522,13 @@ static void sched_change_group(struct task_struct *tsk, int type)
tg = autogroup_task_group(tsk, tg);
 
 #ifdef CONFIG_SCHED_CORE
-   if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
-   tsk->core_cookie = 0UL;
+   if (tsk->core_group_cookie) {
+   tsk->core_group_cookie = 0UL;
+   tsk->core_cookie &= ~SCHED_CORE_GROUP_COOKIE_MASK;
+   }
 
-   if (tg->core_tagged /* && !tsk->core_cookie ? */)
-   tsk->core_cookie = (unsigned long)tg;
+   tsk->core_group_cookie = cpu_core_get_group_cookie(tg);
+   tsk->core_cookie |= tsk->core_group_cookie;
 #endif
 
tsk->sched_task_group = tg;
@@ -9471,7 +9476,7 @@ static unsigned long cpu_core_get_group_cookie(struct task_group *tg)
 
if (tg->core_tagged) {
unsigned long cookie = ((unsigned long)tg << 8) | color;
-   cookie &= (1UL << (sizeof(unsigned long) * 4)) - 1;
+   cookie &= SCHED_CORE_GROUP_COOKIE_MASK;
return cookie;
}
}
-- 
2.29.0.rc1.297.gfa9743e501-goog



[PATCH v8 -tip 24/26] sched: Move core-scheduler interfacing code to a new file

2020-10-19 Thread Joel Fernandes (Google)
core.c is already huge. The core-tagging interface code is largely
independent of it. Move it to its own file to make both files easier to
maintain.

Tested-by: Julien Desfossez 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/sched/Makefile  |   1 +
 kernel/sched/core.c| 481 +
 kernel/sched/coretag.c | 468 +++
 kernel/sched/sched.h   |  56 -
 4 files changed, 523 insertions(+), 483 deletions(-)
 create mode 100644 kernel/sched/coretag.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b70862..c526c20adf9d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += coretag.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b3afbba5abe1..211e0784675f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -162,11 +162,6 @@ static bool sched_core_empty(struct rq *rq)
 return RB_EMPTY_ROOT(&rq->core_tree);
 }
 
-static bool sched_core_enqueued(struct task_struct *task)
-{
-   return !RB_EMPTY_NODE(&task->core_node);
-}
-
 static struct task_struct *sched_core_first(struct rq *rq)
 {
struct task_struct *task;
@@ -188,7 +183,7 @@ static void sched_core_flush(int cpu)
rq->core->core_task_seq++;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
struct rb_node *parent, **node;
struct task_struct *node_task;
@@ -215,7 +210,7 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 rb_insert_color(&p->core_node, &rq->core_tree);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
rq->core->core_task_seq++;
 
@@ -310,7 +305,6 @@ static int __sched_core_stopper(void *data)
 }
 
 static DEFINE_MUTEX(sched_core_mutex);
-static DEFINE_MUTEX(sched_core_tasks_mutex);
 static int sched_core_count;
 
 static void __sched_core_enable(void)
@@ -346,16 +340,6 @@ void sched_core_put(void)
__sched_core_disable();
 mutex_unlock(&sched_core_mutex);
 }
-
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2);
-
-#else /* !CONFIG_SCHED_CORE */
-
-static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
-static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
-static bool sched_core_enqueued(struct task_struct *task) { return false; }
-static int sched_core_share_tasks(struct task_struct *t1, struct task_struct *t2) { }
-
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -8505,9 +8489,6 @@ void sched_offline_group(struct task_group *tg)
 spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
-#define SCHED_CORE_GROUP_COOKIE_MASK ((1UL << (sizeof(unsigned long) * 4)) - 1)
-static unsigned long cpu_core_get_group_cookie(struct task_group *tg);
-
 static void sched_change_group(struct task_struct *tsk, int type)
 {
struct task_group *tg;
@@ -8583,11 +8564,6 @@ void sched_move_task(struct task_struct *tsk)
 task_rq_unlock(rq, tsk, &rf);
 }
 
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-   return css ? container_of(css, struct task_group, css) : NULL;
-}
-
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -9200,459 +9176,6 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-#ifdef CONFIG_SCHED_CORE
-/*
- * A simple wrapper around refcount. An allocated sched_core_cookie's
- * address is used to compute the cookie of the task.
- */
-struct sched_core_cookie {
-   refcount_t refcnt;
-};
-
-/*
- * sched_core_tag_requeue - Common helper for all interfaces to set a cookie.
- * @p: The task to assign a cookie to.
- * @cookie: The cookie to assign.
- * @group: is it a group interface or a per-task interface.
- *
- * This function is typically called from a stop-machine handler.
- */
-void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
-{
-   if (!p)
-   return;
-
-   if (group)
-   p->core_group_cookie = cookie;
-   else
-   p->core_task_cookie = cookie;
-
-   /* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
-   p->core_cookie = (p->core_task_cookie <<
-   (sizeof(unsigned long) * 4)) + p->core_group_cookie;
-
-   if (sched_core_enqueued(p)) {
-   sched_core_dequeue(task_rq(p), p);
-   if (!p->core_cookie)
-   return;
-   }
-
-   if (sched_core_enabled
