Re: [RFC PATCH 3/9] sched: add sched feature to disable idle core search

2019-09-05 Thread Subhra Mazumdar



On 9/5/19 3:17 AM, Patrick Bellasi wrote:

On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...


Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar 
---
  kernel/sched/features.h | 1 +
  1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
   */
  SCHED_FEAT(SIS_AVG_CPU, false)
  SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)

Why do we need a sched_feature? If you think there are systems in
which the usage of latency-nice does not make sense in "Select Idle
Sibling", then we should probably better add a new Kconfig option.

This is not for latency-nice but to be able to disable a different aspect
of the scheduler, i.e. searching for idle cores. This can be made part of
latency-nice (i.e. not doing the idle core search if latency-nice is below a
certain value), but even then having a feature to disable it doesn't hurt.


If that's the case, you can probably use the init/Kconfig's
"Scheduler features" section, recently added by:

   commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")



Best,
Patrick



Re: [RFC PATCH 7/9] sched: search SMT before LLC domain

2019-09-05 Thread Subhra Mazumdar



On 9/5/19 2:31 AM, Peter Zijlstra wrote:

On Fri, Aug 30, 2019 at 10:49:42AM -0700, subhra mazumdar wrote:

Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
in L1 cache locality.
---
  kernel/sched/fair.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8856503..94dd4a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
		return i;
	}
 
-	i = select_idle_cpu(p, sd, target);
+	i = select_idle_smt(p, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;
 
-	i = select_idle_smt(p, target);
+	i = select_idle_cpu(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;
  

But it is absolutely conceptually wrong. An idle core is a much better
target than an idle sibling.

This is select_idle_smt not select_idle_core.


[RFC PATCH 3/9] sched: add sched feature to disable idle core search

2019-08-30 Thread subhra mazumdar
Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar 
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3



[RFC PATCH 6/9] x86/smpboot: Optimize cpumask_weight_sibling macro for x86

2019-08-30 Thread subhra mazumdar
Use a per-CPU variable for the cpumask_weight_sibling macro on x86, for fast
lookup in select_idle_cpu. This avoids reading multiple cache lines on systems
with large numbers of CPUs, where the bitmask can span multiple cache lines.
Even if the bitmask spans only one cache line, this avoids looping through it
to count the bits and gets the value in O(1).
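
As a rough, self-contained illustration of the cost being avoided (a
userspace sketch, not kernel code; the names are invented for the example):

#include <stdio.h>

#define NR_CPUS	1024
#define WORDS	(NR_CPUS / 64)

static unsigned long sibling_mask[WORDS];	/* stand-in for cpu_sibling_map */
static unsigned int  sibling_weight;		/* stand-in for the per-CPU cache */

/* what cpumask_weight() conceptually does: one pass over every word */
static unsigned int mask_weight(void)
{
	unsigned int i, w = 0;

	for (i = 0; i < WORDS; i++)
		w += __builtin_popcountl(sibling_mask[i]);
	return w;
}

int main(void)
{
	sibling_mask[0] = 0x3;		/* pretend CPUs 0 and 1 are siblings */
	sibling_weight = mask_weight();	/* computed once, at topology update */

	/* fast path: a single load instead of a loop over the bitmap */
	printf("siblings = %u\n", sibling_weight);
	return 0;
}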

Signed-off-by: subhra mazumdar 
---
 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 ++++++++++++++++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)   (per_cpu(cpumask_weight_sibling, cpu))
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()	(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..57ad88d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,9 @@
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +523,8 @@ void set_cpu_sibling_map(int cpu)
 
if (!has_mp) {
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) =
+   cpumask_weight(topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
c->booted_cores = 1;
@@ -529,8 +534,13 @@ void set_cpu_sibling_map(int cpu)
for_each_cpu(i, cpu_sibling_setup_mask) {
	o = &cpu_data(i);
 
-   if ((i == cpu) || (has_smt && match_smt(c, o)))
+   if ((i == cpu) || (has_smt && match_smt(c, o))) {
link_mask(topology_sibling_cpumask, cpu, i);
+   per_cpu(cpumask_weight_sibling, cpu) =
+   cpumask_weight(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, i) =
+   cpumask_weight(topology_sibling_cpumask(i));
+   }
 
if ((i == cpu) || (has_mp && match_llc(c, o)))
link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
else
	physid_set_mask_of_physid(0, &phys_cpu_present_map);
cpumask_set_cpu(0, topology_sibling_cpumask(0));
+   per_cpu(cpumask_weight_sibling, 0) =
+   cpumask_weight(topology_sibling_cpumask(0));
cpumask_set_cpu(0, topology_core_cpumask(0));
 }
 
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
 
for_each_cpu(sibling, topology_core_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+   per_cpu(cpumask_weight_sibling, sibling) =
+   cpumask_weight(topology_sibling_cpumask(sibling));
/*/
 * last thread sibling in this cpu core going down
 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = 0;
cpumask_clear(topology_core_cpumask(cpu));
c->cpu_core_id = 0;
c->booted_cores = 0;
-- 
2.9.3



[RFC PATCH 5/9] sched: Define macro for number of CPUs in core

2019-08-30 Thread subhra mazumdar
Introduce the macro topology_sibling_weight for the number of sibling CPUs in
a core and use it in select_idle_cpu.

Signed-off-by: subhra mazumdar 
---
 include/linux/topology.h | 4 
 kernel/sched/fair.c  | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)  cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_weight
+#define topology_sibling_weight(cpu)   \
+   cpumask_weight(topology_sibling_cpumask(cpu))
+#endif
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu) cpumask_of(cpu)
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23ec9c6..8856503 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6205,7 +6205,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
return -1;
 
if (sched_feat(SIS_PROP)) {
-   floor = cpumask_weight(topology_sibling_cpumask(target));
+   floor = topology_sibling_weight(target);
if (floor < 2)
floor = 2;
nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
-- 
2.9.3



[RFC PATCH 0/9] Task latency-nice

2019-08-30 Thread subhra mazumdar
Introduce a new per-task property, latency-nice, for controlling scalability
in the scheduler's idle CPU search path. Valid latency-nice values are from 1
to 100, indicating 1% to 100% search of the LLC domain in select_idle_cpu. A
new CPU cgroup file, cpu.latency-nice, is added as an interface to set and
get it. All tasks in the same cgroup share the same latency-nice value. Using
a lower latency-nice value can help latency-intolerant tasks, e.g. very
short-running OLTP threads, where the full LLC search cost can be significant
compared to the run time of the threads. The default latency-nice value is 5.
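
As a rough illustration of how the value scales the search (a sketch of the
formula from patch 2/9; the helper name and the 44-CPU LLC are just for the
example):

#define LATENCY_NICE_MAX	100

/* nr = (latency_nice * span_weight) / LATENCY_NICE_MAX, clamped to at
 * least one core's worth of CPUs */
static int search_limit(int latency_nice, int llc_cpus, int smt_siblings)
{
	int floor = smt_siblings < 2 ? 2 : smt_siblings;
	int nr = (latency_nice * llc_cpus) / LATENCY_NICE_MAX;

	return nr < floor ? floor : nr;
}

/* e.g. on a 44-CPU SMT2 LLC (one socket of the test box below):
 *   search_limit(5, 44, 2)   ->  2 CPUs scanned (the default)
 *   search_limit(50, 44, 2)  -> 22 CPUs scanned
 *   search_limit(100, 44, 2) -> 44 CPUs scanned (full LLC)
 */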

In addition to latency-nice, it also adds a new sched feature, SIS_CORE, to
be able to disable the idle core search altogether, which is costly and hurts
more than it helps for short-running workloads.

Finally, it also introduces a new per-cpu variable, next_cpu, to track the
limit of the search so that each search starts from where the previous one
ended. This rotating search window over the cpus in the LLC domain ensures
that idle cpus are eventually found in case of high load.
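
A minimal sketch of the rotating window (illustrative, userspace-style code;
the real patches use per_cpu(next_cpu, ...) and for_each_cpu_wrap(), see
patches 8/9 and 9/9):

#define NR_CPUS 88

/* per-target "where the previous search ended"; -1 means not set yet */
static int next_start[NR_CPUS] = { [0 ... NR_CPUS - 1] = -1 };

static int scan_llc(int target, int llc_cpus, int nr, int (*is_idle)(int))
{
	int start = (next_start[target] != -1) ? next_start[target] : target;
	int i;

	for (i = 0; i < llc_cpus; i++) {
		int cpu = (start + i) % llc_cpus;	/* wrap over the LLC */

		next_start[target] = cpu;		/* remember the boundary */
		if (nr-- <= 0)
			return -1;			/* search budget exhausted */
		if (is_idle(cpu))
			return cpu;
	}
	return -1;
}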

Uperf pingpong on a 2-socket, 44-core, 88-thread Intel x86 machine with
message size = 8k (higher is better):
threads baseline   latency-nice=5,SIS_CORE latency-nice=5,NO_SIS_CORE 
8   64.66  64.38 (-0.43%)  64.79 (0.2%)
16  123.34 122.88 (-0.37%) 125.87 (2.05%)
32  215.18 215.55 (0.17%)  247.77 (15.15%)
48  278.56 321.6 (15.45%)  321.2 (15.3%)
64  259.99 319.45 (22.87%) 333.95 (28.44%)
128 431.1  437.69 (1.53%)  431.09 (0%)

subhra mazumdar (9):
  sched,cgroup: Add interface for latency-nice
  sched: add search limit as per latency-nice
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  sched: Define macro for number of CPUs in core
  x86/smpboot: Optimize cpumask_weight_sibling macro for x86
  sched: search SMT before LLC domain
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread

 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 include/linux/sched.h   |  1 +
 include/linux/topology.h|  4 
 kernel/sched/core.c | 42 +
 kernel/sched/fair.c | 34 +
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h|  9 +
 9 files changed, 97 insertions(+), 13 deletions(-)

-- 
2.9.3



[RFC PATCH 7/9] sched: search SMT before LLC domain

2019-08-30 Thread subhra mazumdar
Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
in L1 cache locality.
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8856503..94dd4a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return i;
}
 
-   i = select_idle_cpu(p, sd, target);
+   i = select_idle_smt(p, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-   i = select_idle_smt(p, target);
+   i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-- 
2.9.3



[RFC PATCH 8/9] sched: introduce per-cpu var next_cpu to track search limit

2019-08-30 Thread subhra mazumdar
Introduce a per-cpu variable to track the limit up to which the idle cpu
search was done in select_idle_cpu(). This lets the next search start from
there. It is necessary for rotating the search window over the entire LLC
domain.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 47969bc..5862d54 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include 
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
struct rq *rq;
 
+   per_cpu(next_cpu, i) = -1;
rq = cpu_rq(i);
	raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 365c928..cca2b09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1002,6 +1002,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		this_cpu_ptr(&runqueues)
-- 
2.9.3



[RFC PATCH 4/9] sched: SIS_CORE to disable idle core search

2019-08-30 Thread subhra mazumdar
Use SIS_CORE to disable the idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, and removing it improves
throughput. There are also workloads where disabling it can hurt latency, so
an option is needed.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c31082d..23ec9c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;
 
-   i = select_idle_core(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   if (sched_feat(SIS_CORE)) {
+   i = select_idle_core(p, sd, target);
+   if ((unsigned)i < nr_cpumask_bits)
+   return i;
+   }
 
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3



[RFC PATCH 2/9] sched: add search limit as per latency-nice

2019-08-30 Thread subhra mazumdar
Put upper and lower limits on the CPU search in select_idle_cpu. The lower
limit is set to the number of CPUs in a core, while the upper limit is derived
from the latency-nice of the thread. This ensures that for any architecture we
will usually search beyond a core. Changing the latency-nice value lets the
user adjust the search cost to suit a given workload.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b08d00c..c31082d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, floor, nr = INT_MAX;
 
	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6205,11 +6205,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
return -1;
 
if (sched_feat(SIS_PROP)) {
-   u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
-   nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
+   if (nr < floor)
+   nr = floor;
}
 
time = local_clock();
-- 
2.9.3



[RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-08-30 Thread subhra mazumdar
Add a cgroup interface for latency-nice. Each CPU cgroup gets a new file,
"latency-nice", which is shared by all the threads in that cgroup.

Signed-off-by: subhra mazumdar 
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 40 
 kernel/sched/fair.c   |  1 +
 kernel/sched/sched.h  |  8 
 4 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b4a79c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -631,6 +631,7 @@ struct task_struct {
int static_prio;
int normal_prio;
	unsigned int			rt_priority;
+   u64 latency_nice;
 
	const struct sched_class	*sched_class;
struct sched_entity se;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..47969bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5976,6 +5976,7 @@ void __init sched_init(void)
	init_dl_rq(&rq->dl);
 #ifdef CONFIG_FAIR_GROUP_SCHED
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
+   root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
	INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
	rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
/*
@@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
 */
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
  struct task_group, css);
+   tsk->latency_nice = tg->latency_nice;
tg = autogroup_task_group(tsk, tg);
tsk->sched_task_group = tg;
 
@@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
+struct cftype *cft)
+{
+   struct task_group *tg = css_tg(css);
+
+   return tg->latency_nice;
+}
+
+static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 latency_nice)
+{
+   struct task_group *tg = css_tg(css);
+   struct css_task_iter it;
+   struct task_struct *p;
+
+   if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
+   return -ERANGE;
+
+   tg->latency_nice = latency_nice;
+
+   css_task_iter_start(css, 0, &it);
+   while ((p = css_task_iter_next(&it)))
+   p->latency_nice = latency_nice;
+   css_task_iter_end(&it);
+
+   return 0;
+}
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
 #endif
+   {
+   .name = "latency-nice",
+   .read_u64 = cpu_latency_nice_read_u64,
+   .write_u64 = cpu_latency_nice_write_u64,
+   },
{ } /* Terminate */
 };
 
@@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
.write = cpu_max_write,
},
 #endif
+   {
+   .name = "latency-nice",
+   .read_u64 = cpu_latency_nice_read_u64,
+   .write_u64 = cpu_latency_nice_write_u64,
+   },
{ } /* terminate */
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b08d00c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
goto err;
 
tg->shares = NICE_0_LOAD;
+   tg->latency_nice = LATENCY_NICE_DEFAULT;
 
init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..365c928 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
 /*
+ * Latency-nice default value
+ */
+#define LATENCY_NICE_DEFAULT	5
+#define LATENCY_NICE_MIN	1
+#define LATENCY_NICE_MAX	100
+
+/*
  * Single value that decides SCHED_DEADLINE internal math precision.
  * 10 -> just above 1us
  * 9  -> just above 0.5us
@@ -362,6 +369,7 @@ struct cfs_bandwidth {
 /* Task group related information */
 struct task_group {
struct cgroup_subsys_state css;
+   u64 latency_nice;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
-- 
2.9.3



[RFC PATCH 9/9] sched: rotate the cpu search window for better spread

2019-08-30 Thread subhra mazumdar
Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94dd4a32..7419b47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, floor, nr = INT_MAX;
+   int cpu, floor, target_tmp, nr = INT_MAX;
 
	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6213,9 +6213,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
nr = floor;
}
 
+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();
 
-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
	if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3



Panic on v5.3-rc4

2019-08-15 Thread Subhra Mazumdar

I am getting the following panic during boot of tag v5.3-rc4 of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git. I don't
see the panic on tag v5.2 on the same rig. Is it a bug, or did something
legitimately change?

Thanks,
Subhra

[  147.184948] dracut Warning: No root device 
"block:/dev/mapper/vg_paex623-lv_root" found



[  147.282665] dracut Warning: LVM vg_paex623/lv_swap not found


[  147.354854] dracut Warning: LVM vg_paex623/lv_root not found

[  147.432099] dracut Warning: Boot has failed. To debug this issue add 
"rdshell" to the kernel command line.

[  147.549737] dracut Warning: Signal caught!


[  147.600145] dracut Warning: LVM vg_paex623/lv_swap not found
[  147.670692] dracut Warning: LVM vg_paex623/lv_root not found

[  147.738593] dracut Warning: Boot has failed. To debug this issue add 
"rdshell" to the kernel command line.
[  147.856206] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0100
[  147.947859] CPU: 3 PID: 1 Comm: init Not tainted 
5.3.0-rc4latency_nice_BL #3
[  148.032225] Hardware name: Oracle Corporation ORACLE SERVER 
X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016

[  148.149879] Call Trace:
[  148.179117] dump_stack+0x5c/0x7b
[  148.218760] panic+0xfe/0x2e2
[  148.254242] do_exit+0xbd8/0xbe0
[  148.292842] do_group_exit+0x3a/0xa0
[  148.335597] __x64_sys_exit_group+0x14/0x20
[  148.385644] do_syscall_64+0x5b/0x1d0
[  148.429448] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  148.489900] RIP: 0033:0x3a5ccad018
[  148.530582] Code: Bad RIP value.
[  148.569178] RSP: 002b:7ffc147a5b48 EFLAGS: 0246 ORIG_RAX: 
00e7
[  148.659791] RAX: ffda RBX: 0004 RCX: 
003a5ccad018
[  148.745197] RDX: 0001 RSI: 003c RDI: 
0001
[  148.830605] RBP:  R08: 00e7 R09: 
ffa8
[  148.916011] R10: 0001 R11: 0246 R12: 
00401fb0
[  149.001414] R13: 7ffc147a5e20 R14:  R15: 


[  149.086929] Kernel Offset: disabled
[  149.132815] ---[ end Kernel panic - not syncing: Attempted to kill 
init! exitcode=0x0100 ]---




Re: [RFC PATCH 3/3] sched: introduce tunables to control soft affinity

2019-07-19 Thread Subhra Mazumdar



On 7/18/19 3:38 PM, Srikar Dronamraju wrote:

* subhra mazumdar  [2019-06-26 15:47:18]:


For different workloads the optimal "softness" of soft affinity can be
different. Introduce tunables sched_allowed and sched_preferred that can
be tuned via /proc. This allows to chose at what utilization difference
the scheduler will chose cpus_allowed over cpus_preferred in the first
level of search. Depending on the extent of data sharing, cache coherency
overhead of the system etc. the optimal point may vary.

Signed-off-by: subhra mazumdar 
---

Correct me if I'm wrong, but this patchset only seems to concentrate on the
wakeup path; I don't see any changes in the regular load balancer or the
NUMA balancer. If the system is loaded or the tasks are CPU intensive, then
wouldn't these tasks be moved to cpus_allowed instead of cpus_preferred, and
hence break this soft affinity?


The new-idle path is purposefully unchanged; if threads get stolen to the
allowed set from the preferred set, that's intended. Together with the
enqueue side it achieves the softness of affinity.


Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs

2019-07-18 Thread Subhra Mazumdar



On 7/18/19 5:07 PM, Peter Zijlstra wrote:

On Wed, Jul 17, 2019 at 08:31:25AM +0530, Subhra Mazumdar wrote:

On 7/2/19 10:58 PM, Peter Zijlstra wrote:

On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:

The soft affinity CPUs present in the cpumask cpus_preferred is used by the
scheduler in two levels of search. First is in determining wake affine
which chooses the LLC domain and secondly while searching for idle CPUs in
LLC domain. In the first level it uses cpus_preferred to prune out the
search space. In the second level it first searches the cpus_preferred and
then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
any overhead in the scheduler fast path when soft affinity is not used.
This only changes the wake up path of the scheduler, the idle balancing
is unchanged; together they achieve the "softness" of scheduling.

I really dislike this implementation.

I thought the idea was to remain work conserving (in so far as that
we're that anyway), so changing select_idle_sibling() doesn't make sense
to me. If there is idle, we use it.

Same for newidle; which you already retained.

The scheduler is already not work conserving in many ways. Soft affinity is
only for those who want to use it and has no side effects when not used.
Also, the way the scheduler is implemented, it may not be possible to do the
first level of search in a work-conserving way; I am open to ideas.

I really don't understand the premise of this soft affinity stuff then.

I understood it was to allow spreading if under-utilized, but group when
over-utilized, but you're arguing for the exact opposite, which doesn't
make sense.

You are right on the premise. The whole knob thing came into existence
because I couldn't make the first level of search work conserving. I am
concerned that trying to make that work conserving can introduce
significant latency in the code path when SA is used. I have made the
second level of search work conserving when we search the LLC domain.

Having said that, SA need not necessarily be binary, i.e., only spilling over
to the allowed set if the preferred set is 100% utilized (work conserving).
The spill over can happen before that and SA can have a degree of softness.

The above two points made me go down the knob path for the first level of
search.


Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs

2019-07-16 Thread Subhra Mazumdar



On 7/2/19 10:58 PM, Peter Zijlstra wrote:

On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:

The soft affinity CPUs present in the cpumask cpus_preferred is used by the
scheduler in two levels of search. First is in determining wake affine
which chooses the LLC domain and secondly while searching for idle CPUs in
LLC domain. In the first level it uses cpus_preferred to prune out the
search space. In the second level it first searches the cpus_preferred and
then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
any overhead in the scheduler fast path when soft affinity is not used.
This only changes the wake up path of the scheduler, the idle balancing
is unchanged; together they achieve the "softness" of scheduling.

I really dislike this implementation.

I thought the idea was to remain work conserving (in so far as that
we're that anyway), so changing select_idle_sibling() doesn't make sense
to me. If there is idle, we use it.

Same for newidle; which you already retained.

The scheduler is already not work conserving in many ways. Soft affinity is
only for those who want to use it and has no side effects when not used.
Also, the way the scheduler is implemented, it may not be possible to do the
first level of search in a work-conserving way; I am open to ideas.


This then leaves regular balancing, and for that we can fudge with
can_migrate_task() and nr_balance_failed or something.

Possibly but I don't know if similar performance behavior can be achieved
by the periodic load balancer. Do you want a performance comparison of the
two approaches?


And I also really don't want a second utilization tipping point; we
already have the overloaded thing.

The numbers in the cover letter show that a static tipping point will not
work for all workloads. What soft affinity is doing is essentially trading
off cache coherence for more CPU. The optimum tradeoff point will vary
from workload to workload and with system metrics such as coherence overhead.
If we just use the domain overload, that becomes a static definition of the
tipping point; we need something tunable that captures this tradeoff. The
ratio of CPU utilization seemed to work well and capture that.


I also still dislike how you never looked into the numa balancer, which
already has peferred_nid stuff.

Not sure if you mean using the existing NUMA balancer or enhancing it. If
the former, I have numbers in the cover letter that show NUMA balancer is
not making any difference. I allocated memory of each DB instance to one
NUMA node using numactl, but NUMA balancer still migrated pages, so numactl
only seems to control the initial allocation. Secondly even though NUMA
balancer migrated pages it had no performance benefit as compared to
disabling it.


Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search

2019-07-13 Thread Subhra Mazumdar



On 7/4/19 6:04 PM, Parth Shah wrote:

The same experiment with hackbench, with perf analysis, shows an increase in
the L1 cache miss rate with these patches (lower is better):

                          Baseline(%)   Patch(%)
  ---------------------   -----------   -----------
  Total cache miss rate      17.01      19   (-11%)
  L1 icache miss rate         5.45       6.7 (-22%)



So is it possible for the idle_cpu search to try checking target_cpu first
and then go to the sliding window if an idle CPU is not found?
The diff below works as expected on an IBM POWER9 system and resolves the
problem of far wakeups to a large extent.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff2e9b5c3ac5..fae035ce1162 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6161,6 +6161,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 u64 time, cost;
 s64 delta;
 int cpu, limit, floor, target_tmp, nr = INT_MAX;
+   struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
  
	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
	if (!this_sd)
@@ -6198,16 +6199,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
  
 time = local_clock();
  
-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
+   for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
+   __cpumask_clear_cpu(cpu, cpus);
+   if (available_idle_cpu(cpu))
+   goto idle_cpu_exit;
+   }
+
+   for_each_cpu_wrap(cpu, cpus, target_tmp) {
 per_cpu(next_cpu, target) = cpu;
 if (!--nr)
 return -1;
-   if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-   continue;
 if (available_idle_cpu(cpu))
 break;
 }
  
+idle_cpu_exit:

 time = local_clock() - time;
 cost = this_sd->avg_scan_cost;
 delta = (s64)(time - cost) / 8;



Best,
Parth

How about calling select_idle_smt before select_idle_cpu from
select_idle_sibling? That should have the same effect.


Re: [RFC 0/2] Optimize the idle CPU search

2019-07-09 Thread Subhra Mazumdar



On 7/9/19 11:08 AM, Parth Shah wrote:


On 7/9/19 5:38 AM, Subhra Mazumdar wrote:

On 7/8/19 10:24 AM, Parth Shah wrote:

When searching for an idle_sibling, scheduler first iterates to search for
an idle core and then for an idle CPU. By maintaining the idle CPU mask
while iterating through idle cores, we can mark non-idle CPUs for which
idle CPU search would not have to iterate through again. This is especially
true in a moderately loaded system

Optimize idle CPUs search by marking already found non idle CPUs during
idle core search. This reduces iteration count when searching for idle
CPUs, resulting in lower iteration count.


I believe this can co-exist with latency-nice? We can derive the 'nr' in
select_idle_cpu from latency-nice and use the new mask to iterate.


I agree, can be done with latency-nice.

Maybe something like below?
smt = nr_cpus / nr_cores
nr = smt + (p->latency_nice * (total_cpus-smt) / max_latency_nice)

This limits the lower bound to 1 core and goes through all the cores if
latency_nice is maximum for a task.

Yes I had similar in mind.
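
For concreteness, the mapping Parth suggests works out roughly like this on
the 2-socket/88-CPU box from my cover letter (illustrative numbers only):

/* nr = smt + latency_nice * (total_cpus - smt) / max_latency_nice
 * with total_cpus = 88, smt = 2, max_latency_nice = 100:
 *   latency_nice = 0   -> nr = 2   (one SMT2 core)
 *   latency_nice = 50  -> nr = 45  (roughly half the CPUs)
 *   latency_nice = 100 -> nr = 88  (all CPUs searched)
 */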


Re: [RFC 0/2] Optimize the idle CPU search

2019-07-09 Thread Subhra Mazumdar



On 7/8/19 1:38 PM, Peter Zijlstra wrote:

On Mon, Jul 08, 2019 at 10:24:30AM +0530, Parth Shah wrote:

When searching for an idle_sibling, scheduler first iterates to search for
an idle core and then for an idle CPU. By maintaining the idle CPU mask
while iterating through idle cores, we can mark non-idle CPUs for which
idle CPU search would not have to iterate through again. This is especially
true in a moderately loaded system

Optimize idle CPUs search by marking already found non idle CPUs during
idle core search. This reduces iteration count when searching for idle
CPUs, resulting in lower iteration count.

Have you seen these patches:

   https://lkml.kernel.org/r/20180530142236.667774...@infradead.org

I've meant to get back to that, but never quite had the time :/

The most relevant bit of this was folding select_idle_core and
select_idle_cpu. But it may be good to keep them separate as workloads
which just want any idle cpu can find one and break early by disabling
the idle core search. And this can still work with latency-nice which can
moderate both idle core and idle cpu search.



Re: [RFC 0/2] Optimize the idle CPU search

2019-07-08 Thread Subhra Mazumdar



On 7/8/19 10:24 AM, Parth Shah wrote:

When searching for an idle_sibling, scheduler first iterates to search for
an idle core and then for an idle CPU. By maintaining the idle CPU mask
while iterating through idle cores, we can mark non-idle CPUs for which
idle CPU search would not have to iterate through again. This is especially
true in a moderately loaded system

Optimize idle CPUs search by marking already found non idle CPUs during
idle core search. This reduces iteration count when searching for idle
CPUs, resulting in lower iteration count.


I believe this can co-exist with latency-nice? We can derive the 'nr' in
select_idle_cpu from latency-nice and use the new mask to iterate.


Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

2019-07-02 Thread Subhra Mazumdar



On 7/2/19 1:54 AM, Patrick Bellasi wrote:

Wondering if searching and preempting needs will ever be conflicting?
I guess the winning point is that we don't commit behaviors to
userspace, but just abstract concepts which are turned into biases.

I don't see conflicts right now: if you are latency tolerant that
means you can spend more time to try finding a better CPU (e.g. we can
use the energy model to compare multiple CPUs) _and/or_ give the
current task a better chance to complete by delaying its preemption.

OK



Otherwise sounds like a good direction to me. For the searching aspect, can
we map latency nice values to the % of cores we search in select_idle_cpu?
Thus the search cost can be controlled by latency nice value.

I guess that's worth a try, only caveat I see is that it's turning the
bias into something very platform specific. Meaning, the same
latency-nice value on different machines can have very different
results.

Would not be better to try finding a more platform independent mapping?

Maybe something time bounded, e.g. the higher the latency-nice the more
time we can spend looking for CPUs?

The issue I see is: suppose we have a range of latency-nice values; it should
then cover the entire range of search (one core to all cores). As Peter said,
some workloads will want to search the LLC fully. If we use absolute time, the
mapping from the range of latency-nice values to those times will be
arbitrary. If you have something in mind let me know, maybe I am thinking
differently.



But the issue is that if more latency-tolerant workloads are set to
search less, we still need some mechanism to achieve a good spread of
threads.

I don't get this example: why more latency tolerant workloads should
require less search?

I guess I got the definition of "latency tolerant" backwards.



Can we keep the sliding window mechanism in that case?

Which one? Sorry did not went through the patches, can you briefly
resume the idea?

If a workload is set to low latency tolerance, then the search will be
shorter. That can lead to localization of threads on a few CPUs, as we are not
searching the entire LLC even if there are idle CPUs available. For this I had
introduced a per-CPU variable (for the target CPU) to track the boundary of
the search so that each search starts from that boundary, thus sliding the
window. So even if we are searching very little, the search window keeps
shifting and gives us a good spread. This is orthogonal to the latency-nice
thing.



Also will latency nice do anything for select_idle_core and
select_idle_smt?

I guess principle the same bias can be used at different levels, maybe
with different mappings.

Doing it for select_idle_core will have the issue that the dynamic flag
(whether an idle core is present or not) can only be updated by threads
which are doing the full search.

Thanks,
Subhra


In the mobile world use-case we will likely use it only to switch from
select_idle_sibling to the energy aware slow path. And perhaps to see
if we can bias the wakeup preemption granularity.

Best,
Patrick



Re: [PATCH V3 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

2019-07-02 Thread Subhra Mazumdar



On 7/2/19 1:35 AM, Peter Zijlstra wrote:

On Mon, Jul 01, 2019 at 03:08:41PM -0700, Subhra Mazumdar wrote:

On 7/1/19 1:03 AM, Viresh Kumar wrote:

On 28-06-19, 18:16, Subhra Mazumdar wrote:

On 6/25/19 10:06 PM, Viresh Kumar wrote:

@@ -5376,6 +5376,15 @@ static struct {
#endif /* CONFIG_NO_HZ_COMMON */
+/* CPU only has SCHED_IDLE tasks enqueued */
+static int sched_idle_cpu(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+
+   return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
+   rq->nr_running);
+}
+

Shouldn't this check if rq->curr is also sched idle?

Why wouldn't the current set of checks be enough to guarantee that ?

I thought nr_running does not include the on-cpu thread.

It very much does.


And why not drop the rq->nr_running non zero check?

Because CPU isn't sched-idle if nr_running and idle_h_nr_running are both 0,
i.e. it is an IDLE cpu in that case. And so I thought it is important to have
this check as well.


idle_cpu() not only checks nr_running is 0 but also rq->curr == rq->idle

idle_cpu() will try very hard to declare a CPU !idle. But I don't see
how that is relevant. sched_idle_cpu() will only return true if there
are only SCHED_IDLE tasks on the CPU. Viresh's test is simple and
straight forward.


OK makes sense.

Thanks,
Subhra



Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

2019-07-01 Thread Subhra Mazumdar



On 7/1/19 6:55 AM, Patrick Bellasi wrote:

On 01-Jul 11:02, Peter Zijlstra wrote:

On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:

Hi,

Resending this patchset, will be good to get some feedback. Any suggestions
that will make it more acceptable are welcome. We have been shipping this
with Unbreakable Enterprise Kernel in Oracle Linux.

Current select_idle_sibling first tries to find a fully idle core using
select_idle_core which can potentially search all cores and if it fails it
finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially
search all cpus in the llc domain. This doesn't scale for large llc domains
and will only get worse with more cores in future.

This patch solves the scalability problem by:
  - Setting an upper and lower limit of idle cpu search in select_idle_cpu
to keep search time low and constant
  - Adding a new sched feature SIS_CORE to disable select_idle_core

Additionally it also introduces a new per-cpu variable next_cpu to track
the limit of search so that every time search starts from where it ended.
This rotating search window over cpus in LLC domain ensures that idle
cpus are eventually found in case of high load.

Right, so we had a wee conversation about this patch series at OSPM, and
I don't see any of that reflected here :-(

Specifically, given that some people _really_ want the whole L3 mask
scanned to reduce tail latency over raw throughput, while you guys
prefer the other way around, it was proposed to extend the task model.

Specifically something like a latency-nice was mentioned (IIRC) where a

Right, AFAIR PaulT suggested to add support for the concept of a task
being "latency tolerant": meaning we can spend more time to search for
a CPU and/or avoid preempting the current task.


Wondering if searching and preempting needs will ever be conflicting?
Otherwise sounds like a good direction to me. For the searching aspect, can
we map latency nice values to the % of cores we search in select_idle_cpu?
Thus the search cost can be controlled by the latency-nice value. But the
issue is that if more latency-tolerant workloads are set to search less, we
still need some mechanism to achieve a good spread of threads. Can we keep the
sliding
window mechanism in that case? Also will latency nice do anything for
select_idle_core and select_idle_smt?


Re: [PATCH V3 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

2019-07-01 Thread Subhra Mazumdar



On 7/1/19 1:03 AM, Viresh Kumar wrote:

On 28-06-19, 18:16, Subhra Mazumdar wrote:

On 6/25/19 10:06 PM, Viresh Kumar wrote:

We try to find an idle CPU to run the next task, but in case we don't
find an idle CPU it is better to pick a CPU which will run the task the
soonest, for performance reason.

A CPU which isn't idle but has only SCHED_IDLE activity queued on it
should be a good target based on this criteria as any normal fair task
will most likely preempt the currently running SCHED_IDLE task
immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
shall give better results as it should be able to run the task sooner
than an idle CPU (which requires to be woken up from an idle state).

This patch updates both fast and slow paths with this optimization.

Signed-off-by: Viresh Kumar 
---
   kernel/sched/fair.c | 43 +--
   1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1277adc3e7ed..2e0527fd468c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5376,6 +5376,15 @@ static struct {
   #endif /* CONFIG_NO_HZ_COMMON */
+/* CPU only has SCHED_IDLE tasks enqueued */
+static int sched_idle_cpu(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+
+   return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
+   rq->nr_running);
+}
+

Shouldn't this check if rq->curr is also sched idle?

Why wouldn't the current set of checks be enough to guarantee that ?

I thought nr_running does not include the on-cpu thread.



And why not drop the rq->nr_running non zero check?

Because CPU isn't sched-idle if nr_running and idle_h_nr_running are both 0,
i.e. it is an IDLE cpu in that case. And so I thought it is important to have
this check as well.


idle_cpu() not only checks nr_running is 0 but also rq->curr == rq->idle


Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search

2019-07-01 Thread Subhra Mazumdar




Also, systems like POWER9 have sd_llc as a pair of cores only. So they won't
benefit from the limits, and hence hiding your code in select_idle_cpu behind
static keys would be much preferred.

If it doesn't hurt then I don't see the point.


So these are the results from a POWER9 system with your patches:
System configuration: 2 Socket, 44 cores, 176 CPUs

Experiment setup:
===
=> Setup 1:
- 44 tasks doing just while(1), this is to make select_idle_core return -1 most 
times
- perf bench sched messaging -g 1 -l 100
+-----------+--------+---------------+--------+
| Baseline  | stddev |     Patch     | stddev |
+-----------+--------+---------------+--------+
|    135    |  3.21  | 158 (-17.03%) |  4.69  |
+-----------+--------+---------------+--------+

=> Setup 2:
- schbench -m44 -t 1
+-------+----------+---------+---------+----------+
| %ile  | Baseline | stddev  |  patch  |  stddev  |
+-------+----------+---------+---------+----------+
|    50 |       10 |    3.49 |      10 |     2.29 |
|    95 |      467 |    4.47 |     469 |     0.81 |
|    99 |      571 |   21.32 |     584 |    18.69 |
|  99.5 |      629 |   30.05 |     641 |    20.95 |
|  99.9 |      780 |   40.38 |     773 |     44.2 |
+-------+----------+---------+---------+----------+

I guess it doesn't make much difference in schbench results but hackbench (perf 
bench)
seems to have an observable regression.


Best,
Parth


If POWER9 sd_llc has only 2 cores, the behavior shouldn't change much with
the select_idle_cpu changes, as the limits are 1 and 2 cores. Previously the
lower bound was 4 cpus and the upper bound was calculated proportionally; now
it is 1 core (4 cpus on SMT4) and the upper bound is 2 cores. Could it be the
extra computation of cpumask_weight causing the regression rather than the
sliding window itself (one way to check this would be to hardcode 4 in place
of topology_sibling_weight)? Or is it the L1 cache coherency? I am a bit
surprised because SPARC SMT8, which has more cores in sd_llc and an L1 cache
per core, showed improvement with hackbench.

Thanks,
Subhra


Re: [PATCH V3 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

2019-06-28 Thread Subhra Mazumdar



On 6/25/19 10:06 PM, Viresh Kumar wrote:

We try to find an idle CPU to run the next task, but in case we don't
find an idle CPU it is better to pick a CPU which will run the task the
soonest, for performance reason.

A CPU which isn't idle but has only SCHED_IDLE activity queued on it
should be a good target based on this criteria as any normal fair task
will most likely preempt the currently running SCHED_IDLE task
immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
shall give better results as it should be able to run the task sooner
than an idle CPU (which requires to be woken up from an idle state).

This patch updates both fast and slow paths with this optimization.

Signed-off-by: Viresh Kumar 
---
  kernel/sched/fair.c | 43 +--
  1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1277adc3e7ed..2e0527fd468c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5376,6 +5376,15 @@ static struct {
  
  #endif /* CONFIG_NO_HZ_COMMON */
  
+/* CPU only has SCHED_IDLE tasks enqueued */

+static int sched_idle_cpu(int cpu)
+{
+   struct rq *rq = cpu_rq(cpu);
+
+   return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
+   rq->nr_running);
+}
+

Shouldn't this check if rq->curr is also sched idle? And why not drop the
rq->nr_running non zero check?


Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread

2019-06-28 Thread Subhra Mazumdar



On 6/28/19 4:54 AM, Srikar Dronamraju wrote:

* subhra mazumdar  [2019-06-26 18:29:15]:


Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

While rotating the cpu search window is good, I am not sure if this can find
an idle cpu quickly. The probability of finding an idle cpu should still remain
the same. No?


Signed-off-by: subhra mazumdar 
---
  kernel/sched/fair.c | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

@@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
}
}
  
+	if (per_cpu(next_cpu, target) != -1)

+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();
  
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {

+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;

Shouldn't this assignment be outside the for loop?
With the current code,
1. We keep reassigning multiple times.
2. The last assignment happens for an idle_cpu, and sometimes the
assignment is for a non-idle cpu.

We want the last assignment irrespective of whether it was an idle cpu or
not, since in both cases we want to track the boundary of the search.

Thanks,
Subhra


Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search

2019-06-28 Thread Subhra Mazumdar



On 6/28/19 12:01 PM, Parth Shah wrote:


On 6/27/19 6:59 AM, subhra mazumdar wrote:

Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, removing it improves
throughput. Also there are workloads where disabling it can hurt latency,
so need to have an option.

Signed-off-by: subhra mazumdar 
---
  kernel/sched/fair.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1ca88e..6a74808 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (!sd)
return target;

-   i = select_idle_core(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   if (sched_feat(SIS_CORE)) {
+   i = select_idle_core(p, sd, target);
+   if ((unsigned)i < nr_cpumask_bits)
+   return i;
+   }

This can cause a significant performance loss if disabled. select_idle_core
spreads workloads quickly across the cores; disabling it leaves much of that
work to be offloaded to the load balancer, which has to move tasks across
cores. Latency-sensitive and long-running multi-threaded workloads should see
a regression under these conditions.

Yes in case of SPARC SMT8 I did notice that (see cover letter). That's why
it is a feature that is ON by default, but can be turned OFF for specific
workloads on x86 SMT2 that can benefit from it.

Also, systems like POWER9 have sd_llc as a pair of cores only. So they won't
benefit from the limits, and hence hiding your code in select_idle_cpu behind
static keys would be much preferred.

If it doesn't hurt then I don't see the point.

Thanks,
Subhra



Re: [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu

2019-06-28 Thread Subhra Mazumdar



On 6/28/19 11:47 AM, Parth Shah wrote:


On 6/27/19 6:59 AM, subhra mazumdar wrote:

Put upper and lower limit on cpu search of select_idle_cpu. The lower limit
is amount of cpus in a core while upper limit is twice that. This ensures
for any architecture we will usually search beyond a core. The upper limit
also helps in keeping the search cost low and constant.

Signed-off-by: subhra mazumdar 
---
  kernel/sched/fair.c | 15 +++
  1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b58f08f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, limit, floor, nr = INT_MAX;

	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)

if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   limit = floor << 1;

Is the upper limit an experimental value only, or does it have any
arch-specific significance? Because, AFAIU, systems like POWER9 might benefit
from searching 4 cores due to their different cache model. So it could then be
tuned for arch-specific builds.

The lower bound and upper bound are 1 core and 2 cores respectively. That is
done so as to search beyond one core while at the same time not searching too
much. It is a heuristic that, coupled with the moving-window mechanism, seemed
to work well on all archs. Does 4 cores vs 2 make any difference on your
POWER9? AFAIR it didn't on SPARC SMT8.
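
Spelled out for the SMT widths in question (my arithmetic from the v3 patch
above, not new measurements):

/* floor = max(2, smt_siblings)  ->  never scan less than one core
 * limit = floor << 1            ->  never scan more than two cores
 *
 *   x86 SMT2:    floor = 2, limit = 4
 *   POWER9 SMT4: floor = 4, limit = 8
 *   SPARC SMT8:  floor = 8, limit = 16
 *
 * nr = span_avg / avg_cost is then clamped into [floor, limit].
 */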


Also variable names can be changed for better readability.
floor -> weight_clamp_min
limit -> weight_clamp_max
or something similar

OK.

Thanks,
Subhra




+   if (span_avg > floor*avg_cost) {
nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   if (nr > limit)
+   nr = limit;
+   } else {
+   nr = floor;
+   }
}

time = local_clock();



Best,
Parth



Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread

2019-06-28 Thread Subhra Mazumdar



On 6/28/19 11:36 AM, Parth Shah wrote:

Hi Subhra,

I ran your patch series on IBM POWER systems and this is what I have observed.

On 6/27/19 6:59 AM, subhra mazumdar wrote:

Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar 
---
  kernel/sched/fair.c | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58f08f..c1ca88e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, limit, floor, nr = INT_MAX;
+   int cpu, limit, floor, target_tmp, nr = INT_MAX;

	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
}
}

+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();

-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;

This leads to a problem of cache hotness.
AFAIU, in most cases, `target = prev_cpu` of the task being woken up and
selecting an idle CPU nearest to the prev_cpu is favorable.
But since this doesn't keep track of the last idle cpu per task, it fails to
find the nearest possible idle CPU in cases where the task is woken up after
another scheduled task.


I had tested hackbench on SPARC SMT8 (see numbers in the cover letter) and it
showed improvement with this. Firstly, it's a tradeoff between cache effects
vs. time spent searching for an idle CPU, and both the x86 and SPARC numbers
showed the tradeoff is worth it. Secondly, there is a lot of cache affinity
logic in the beginning of select_idle_sibling. If select_idle_cpu is still
called, that means we are past that and want any idle cpu. I don't think
waking up close to the prev cpu is the intention of starting the search from
there; rather, it is to spread threads across all cpus so that no cpu gets
victimized, as there is no atomicity. The prev cpu just acts as a good seed
for the spreading.

Thanks,
Subhra


Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings

2019-06-27 Thread Subhra Mazumdar



On 6/26/19 11:54 PM, Thomas Gleixner wrote:

On Thu, 27 Jun 2019, Thomas Gleixner wrote:


On Wed, 26 Jun 2019, subhra mazumdar wrote:


Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
This will be used for quick lookup in select_idle_cpu to determine the
limits of search.

Why? The number of siblings is constant at least today unless you play
silly cpu hotplug games. A bit more justification for adding yet another
random storage would be appreciated.


This patch does it only for x86.

# grep 'This patch' Documentation/process/submitting-patches.rst

IOW, we all know already that this is a patch and from the subject prefix
and the diffstat it's pretty obvious that this is x86 only.

So instead of documenting the obvious, please add proper context to justify
the change.

Aside of that the right ordering is to introduce the default fallback in a
separate patch, which explains the reasoning and then in the next one add
the x86 optimized version.

OK. I will also add the extra optimization for other architectures.

Thanks,
Subhra


Thanks,

tglx


Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings

2019-06-27 Thread Subhra Mazumdar



On 6/26/19 11:51 PM, Thomas Gleixner wrote:

On Wed, 26 Jun 2019, subhra mazumdar wrote:


Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
This will be used for quick lookup in select_idle_cpu to determine the
limits of search.

Why? The number of siblings is constant at least today unless you play
silly cpu hotplug games. A bit more justification for adding yet another
random storage would be appreciated.

Using cpumask_weight every time in select_idle_cpu to compute the number of
SMT siblings can be costly, as cpumask_weight may not be O(1) for systems
with a large number of CPUs (e.g. 8 sockets, each socket having lots of
cores). Beyond 512 CPUs the bitmask will span multiple cache lines, and
touching multiple cache lines in the fast path of the scheduler can cost more
than we save from this optimization. Even within a single cache line it loops
over longs. We want to touch O(1) cache lines and do O(1) operations, hence
pre-compute it in a per-CPU variable.
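
As a rough illustration of that cost argument (a user-space sketch with
made-up mask contents, not the kernel implementation): computing the weight
walks every long of the bitmap, while the proposed per-CPU variable is a
single read that only needs refreshing at hotplug time.

#include <stdio.h>

#define NR_CPUS       1024
#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))
#define MASK_LONGS    (NR_CPUS / BITS_PER_LONG)

/* What cpumask_weight() conceptually does: loop over every long of the mask. */
static int mask_weight(const unsigned long *mask)
{
        int i, w = 0;

        for (i = 0; i < MASK_LONGS; i++)
                w += __builtin_popcountl(mask[i]);  /* 16 words for 1024 CPUs */
        return w;
}

/* What the per-CPU variable buys: a single O(1) read, precomputed off the fast path. */
static unsigned int sibling_weight[NR_CPUS];

int main(void)
{
        unsigned long sibling_mask[MASK_LONGS] = { 0x3UL };  /* CPUs 0 and 1 are siblings */
        int cpu = 0;

        sibling_weight[cpu] = mask_weight(sibling_mask);     /* done once, at topology setup */
        printf("weight via loop: %d, via per-cpu cache: %u\n",
               mask_weight(sibling_mask), sibling_weight[cpu]);
        return 0;
}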



This patch does it only for x86.

# grep 'This patch' Documentation/process/submitting-patches.rst

IOW, we all know already that this is a patch and from the subject prefix
and the diffstat it's pretty obvious that this is x86 only.

So instead of documenting the obvious, please add proper context to justify
the change.

OK. The extra per-CPU optimization was done only for x86 as we cared about
it the most and to make it future-proof. I will add it for other
architectures.
  

+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);

Why does this need an export? No module has any reason to access this.

I will remove it



  /* representing HT and core siblings of each logical CPU */
  DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
  
  	if (!has_mp) {

cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) =
+   cpumask_weight(topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
c->booted_cores = 1;
@@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
for_each_cpu(i, cpu_sibling_setup_mask) {
o = &cpu_data(i);
  
-		if ((i == cpu) || (has_smt && match_smt(c, o)))

+   if ((i == cpu) || (has_smt && match_smt(c, o))) {
link_mask(topology_sibling_cpumask, cpu, i);
+   threads = cpumask_weight(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = threads;
+   per_cpu(cpumask_weight_sibling, i) = threads;

This only works for SMT=2, but fails to update the rest for SMT=4.


I guess I assumed that x86 will always be SMT2, will fix this.
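
For what it's worth, a sketch of what that fix could look like (a user-space
illustration with invented helpers, not the actual smpboot code): whenever a
new sibling is linked, refresh the cached weight of every CPU already in the
sibling mask, so SMT4/SMT8 topologies never keep a stale value.

#include <stdio.h>

#define NR_CPUS 8

static unsigned int sibling_mask[NR_CPUS];    /* bit per sibling CPU */
static unsigned int sibling_weight[NR_CPUS];

/* Link @cpu and @i as SMT siblings and refresh the cached weight of
 * every CPU in the resulting mask, not just the pair being linked. */
static void link_siblings(int cpu, int i)
{
        unsigned int mask, s;

        sibling_mask[cpu] |= (1u << cpu) | (1u << i);
        mask = sibling_mask[cpu];

        for (s = 0; s < NR_CPUS; s++) {
                if (mask & (1u << s)) {
                        sibling_mask[s] = mask;
                        sibling_weight[s] = __builtin_popcount(mask);
                }
        }
}

int main(void)
{
        /* Pretend CPUs 0, 2, 4, 6 are the threads of one SMT4 core. */
        link_siblings(0, 2);
        link_siblings(0, 4);
        link_siblings(0, 6);
        printf("weight of cpu2: %u\n", sibling_weight[2]);  /* 4, not a stale 2 */
        return 0;
}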

Thanks,
Subhra




@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
  
  	for_each_cpu(sibling, topology_core_cpumask(cpu)) {

cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+   per_cpu(cpumask_weight_sibling, sibling) =
+   cpumask_weight(topology_sibling_cpumask(sibling));

While remove does the right thing.

Thanks,

tglx


[PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling

2019-06-26 Thread subhra mazumdar
Use per-cpu var cpumask_weight_sibling for quick lookup in select_idle_cpu.
This is the fast path of scheduler and every cycle is worth saving. Usage
of cpumask_weight can result in iterations.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a74808..878f11c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   floor = cpumask_weight(topology_sibling_cpumask(target));
+   floor = topology_sibling_weight(target);
if (floor < 2)
floor = 2;
limit = floor << 1;
-- 
2.9.3



[RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

2019-06-26 Thread subhra mazumdar
 by having SIS_CORE
false.

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile  baseline  %stdev   patch              %stdev
50          94        2.82     93.33 (0.71%)      1.24
75          124       2.13     122.67 (1.08%)     1.7
90          152       1.74     149.33 (1.75%)     2.35
95          171       2.11     167 (2.34%)        2.74
99          512.67    104.96   206 (59.82%)       8.86
99.5        2296      82.55    3121.67 (-35.96%)  97.37
99.9        12517.33  2.38     12592 (-0.6%)      1.67

Changes from v2->v3:
-Use shift operator instead of multiplication to compute limit
-Use per-CPU variable to precompute the number of sibling SMTs for x86

subhra mazumdar (7):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  x86/smpboot: introduce per-cpu variable for HT siblings
  sched: use per-cpu variable cpumask_weight_sibling

 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 include/linux/topology.h|  4 
 kernel/sched/core.c |  2 ++
 kernel/sched/fair.c | 31 +++
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h|  1 +
 8 files changed, 49 insertions(+), 9 deletions(-)

-- 
2.9.3



[PATCH v3 1/7] sched: limit cpu search in select_idle_cpu

2019-06-26 Thread subhra mazumdar
Put upper and lower limit on cpu search of select_idle_cpu. The lower limit
is amount of cpus in a core while upper limit is twice that. This ensures
for any architecture we will usually search beyond a core. The upper limit
also helps in keeping the search cost low and constant.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b58f08f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, limit, floor, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
 
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   limit = floor << 1;
+   if (span_avg > floor*avg_cost) {
nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   if (nr > limit)
+   nr = limit;
+   } else {
+   nr = floor;
+   }
}
 
time = local_clock();
-- 
2.9.3

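As a worked example of the clamping above (all numbers invented for
illustration): on an SMT2 target, floor is 2 and limit is 4, so the scan
depth nr always lands between one and two cores' worth of CPUs.

#include <stdio.h>
#include <stdint.h>

/* Sketch of the nr computation from the patch above, with example numbers. */
int main(void)
{
        uint64_t span_weight = 44;      /* CPUs in the LLC domain (example) */
        uint64_t avg_idle    = 800;     /* example value */
        uint64_t avg_cost    = 600;     /* example value */
        int floor = 2;                  /* SMT siblings of target, at least 2 */
        int limit = floor << 1;         /* upper bound: twice the core width */
        int nr;
        uint64_t span_avg = span_weight * avg_idle;   /* 35200 */

        if (span_avg > (uint64_t)floor * avg_cost) {
                nr = (int)(span_avg / avg_cost);      /* 58 here ... */
                if (nr > limit)
                        nr = limit;                   /* ... clamped to 4 */
        } else {
                nr = floor;
        }
        printf("scan at most %d CPUs\n", nr);         /* prints 4 */
        return 0;
}
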


[PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings

2019-06-26 Thread subhra mazumdar
Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
This will be used for quick lookup in select_idle_cpu to determine the
limits of search. This patch does it only for x86.

Signed-off-by: subhra mazumdar 
---
 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 include/linux/topology.h|  4 
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)   (per_cpu(cpumask_weight_sibling, cpu))
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..20bf676 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,10 @@
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
 
if (!has_mp) {
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) =
+   cpumask_weight(topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
c->booted_cores = 1;
@@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
for_each_cpu(i, cpu_sibling_setup_mask) {
o = &cpu_data(i);
 
-   if ((i == cpu) || (has_smt && match_smt(c, o)))
+   if ((i == cpu) || (has_smt && match_smt(c, o))) {
link_mask(topology_sibling_cpumask, cpu, i);
+   threads = cpumask_weight(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = threads;
+   per_cpu(cpumask_weight_sibling, i) = threads;
+   }
 
if ((i == cpu) || (has_mp && match_llc(c, o)))
link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
else
physid_set_mask_of_physid(0, &phys_cpu_present_map);
cpumask_set_cpu(0, topology_sibling_cpumask(0));
+   per_cpu(cpumask_weight_sibling, 0) =
+   cpumask_weight(topology_sibling_cpumask(0));
cpumask_set_cpu(0, topology_core_cpumask(0));
 }
 
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
 
for_each_cpu(sibling, topology_core_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+   per_cpu(cpumask_weight_sibling, sibling) =
+   cpumask_weight(topology_sibling_cpumask(sibling));
/*/
 * last thread sibling in this cpu core going down
 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = 0;
cpumask_clear(topology_core_cpumask(cpu));
c->cpu_core_id = 0;
c->booted_cores = 0;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)  cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_weight
+#d

[PATCH v3 3/7] sched: rotate the cpu search window for better spread

2019-06-26 Thread subhra mazumdar
Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58f08f..c1ca88e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, limit, floor, nr = INT_MAX;
+   int cpu, limit, floor, target_tmp, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
}
}
 
+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();
 
-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3



[PATCH v3 4/7] sched: add sched feature to disable idle core search

2019-06-26 Thread subhra mazumdar
Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar 
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3



[PATCH v3 5/7] sched: SIS_CORE to disable idle core search

2019-06-26 Thread subhra mazumdar
Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, removing it improves
throughput. Also there are workloads where disabling it can hurt latency,
so need to have an option.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1ca88e..6a74808 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if (!sd)
return target;
 
-   i = select_idle_core(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   if (sched_feat(SIS_CORE)) {
+   i = select_idle_core(p, sd, target);
+   if ((unsigned)i < nr_cpumask_bits)
+   return i;
+   }
 
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3

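As a usage note (not part of the patch): on kernels built with
CONFIG_SCHED_DEBUG, features declared with SCHED_FEAT() are typically
runtime-togglable through debugfs, so a kernel carrying this series could
have idle core search switched off roughly as in the sketch below (requires
root and debugfs mounted; paths and behaviour assumed from the usual
sched_features interface of this era).

#include <stdio.h>

/* Sketch: clear the SIS_CORE scheduler feature at runtime.
 * Assumes CONFIG_SCHED_DEBUG, debugfs at /sys/kernel/debug,
 * and a kernel that actually carries the SIS_CORE patch. */
int main(void)
{
        FILE *f = fopen("/sys/kernel/debug/sched_features", "w");

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* Writing "NO_<FEAT>" clears a feature; writing "<FEAT>" sets it again. */
        fprintf(f, "NO_SIS_CORE");
        fclose(f);
        return 0;
}
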


[PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit

2019-06-26 Thread subhra mazumdar
Introduce a per-cpu variable to track the limit upto which idle cpu search
was done in select_idle_cpu(). This will help to start the search next time
from there. This is necessary for rotating the search window over entire
LLC domain.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..80657fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include 
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
struct rq *rq;
 
+   per_cpu(next_cpu, i) = -1;
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..4cecfa2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -994,6 +994,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)            (&per_cpu(runqueues, (cpu)))
 #define this_rq()              this_cpu_ptr(&runqueues)
-- 
2.9.3



[RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity

2019-06-26 Thread subhra mazumdar
New system call sched_setaffinity2 is introduced for scheduler soft
affinity. It takes an extra parameter to specify hard or soft affinity,
where hard implies same as existing sched_setaffinity. New cpumask
cpus_preferred is introduced for this purpose which is always a subset of
cpus_allowed. A boolean affinity_unequal is used to store if they are
unequal for fast lookup. Setting hard affinity resets soft affinity set to
be equal to it. Soft affinity is only allowed for CFS class threads.

Signed-off-by: subhra mazumdar 
---
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/sched.h  |   5 +-
 include/linux/syscalls.h   |   3 +
 include/uapi/asm-generic/unistd.h  |   4 +-
 include/uapi/linux/sched.h |   3 +
 init/init_task.c   |   2 +
 kernel/compat.c|   2 +-
 kernel/rcu/tree_plugin.h   |   3 +-
 kernel/sched/core.c| 167 -
 9 files changed, 162 insertions(+), 28 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e..1dccdd2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431common  fsconfig__x64_sys_fsconfig
 432common  fsmount __x64_sys_fsmount
 433common  fspick  __x64_sys_fspick
+434common  sched_setaffinity2  __x64_sys_sched_setaffinity2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b863fa8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -652,6 +652,8 @@ struct task_struct {
unsigned intpolicy;
int nr_cpus_allowed;
cpumask_t   cpus_allowed;
+   cpumask_t   cpus_preferred;
+   boolaffinity_unequal;
 
 #ifdef CONFIG_PREEMPT_RCU
int rcu_read_lock_nesting;
@@ -1784,7 +1786,8 @@ static inline void set_task_cpu(struct task_struct *p, 
unsigned int cpu)
 # define vcpu_is_preempted(cpu)false
 #endif
 
-extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
+extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask,
+ int flags);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
 #ifndef TASK_SIZE_OF
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe..147a4e5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -669,6 +669,9 @@ asmlinkage long sys_sched_rr_get_interval(pid_t pid,
struct __kernel_timespec __user *interval);
 asmlinkage long sys_sched_rr_get_interval_time32(pid_t pid,
 struct old_timespec32 __user 
*interval);
+asmlinkage long sys_sched_setaffinity2(pid_t pid, unsigned int len,
+  unsigned long __user *user_mask_ptr,
+  int flags);
 
 /* kernel/signal.c */
 asmlinkage long sys_restart_syscall(void);
diff --git a/include/uapi/asm-generic/unistd.h 
b/include/uapi/asm-generic/unistd.h
index a87904d..d77b366 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_sched_setaffinity2 434
+__SYSCALL(__NR_sched_setaffinity2, sys_sched_setaffinity2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 435
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee17..f910cd5 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -52,6 +52,9 @@
 #define SCHED_FLAG_RECLAIM 0x02
 #define SCHED_FLAG_DL_OVERRUN  0x04
 
+#define SCHED_HARD_AFFINITY0
+#define SCHED_SOFT_AFFINITY1
+
 #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK   | \
 SCHED_FLAG_RECLAIM | \
 SCHED_FLAG_DL_OVERRUN)
diff --git a/init/init_task.c b/init/init_task.c
index c70ef65..aa226a3 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -73,6 +73,8 @@ struct task_struct init_task
.normal_prio= MAX_PRIO - 20,
.policy = SCHED_NORMAL,
.cpus_allowed   = CPU_MASK_ALL,
+   .cpus_preferred = CPU_MASK_ALL,
+   .affinity_unequal = false,
.nr_cpus_allowed= NR_CPUS,
.mm = NULL,
.active_mm  = _mm,
diff --git a/kernel/compat.c b/kernel/compat.c
index b5f7063..96621d7 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c

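To make the proposed interface concrete, here is a hedged usage sketch built
only from what the patch above defines (syscall number 434 on x86-64,
SCHED_SOFT_AFFINITY == 1). The wrapper name is invented, since no libc
wrapper would exist, and the call simply fails with ENOSYS on a kernel
without this series.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values taken from the patch above (proposed, not in mainline). */
#define __NR_sched_setaffinity2 434
#define SCHED_SOFT_AFFINITY     1

/* Hypothetical wrapper around the proposed system call. */
static int sched_setaffinity2(pid_t pid, size_t len, const cpu_set_t *mask, int flags)
{
        return syscall(__NR_sched_setaffinity2, pid, len, mask, flags);
}

int main(void)
{
        cpu_set_t preferred;
        int cpu;

        CPU_ZERO(&preferred);
        for (cpu = 0; cpu < 4; cpu++)           /* prefer CPUs 0-3, but allow spilling */
                CPU_SET(cpu, &preferred);

        if (sched_setaffinity2(0, sizeof(preferred), &preferred, SCHED_SOFT_AFFINITY))
                perror("sched_setaffinity2");
        return 0;
}
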
[RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs

2019-06-26 Thread subhra mazumdar
The soft affinity CPUs present in the cpumask cpus_preferred is used by the
scheduler in two levels of search. First is in determining wake affine
which choses the LLC domain and secondly while searching for idle CPUs in
LLC domain. In the first level it uses cpus_preferred to prune out the
search space. In the second level it first searches the cpus_preferred and
then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
any overhead in the scheduler fast path when soft affinity is not used.
This only changes the wake up path of the scheduler, the idle balancing
is unchanged; together they achieve the "softness" of scheduling.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 137 ++--
 1 file changed, 100 insertions(+), 37 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..53aa7f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,7 +5807,7 @@ static unsigned long capacity_spare_without(int cpu, 
struct task_struct *p)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int sd_flag)
+ int this_cpu, int sd_flag, struct cpumask *cpus)
 {
struct sched_group *idlest = NULL, *group = sd->groups;
struct sched_group *most_spare_sg = NULL;
@@ -5831,7 +5831,7 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
 
/* Skip over this group if it has no CPUs allowed */
if (!cpumask_intersects(sched_group_span(group),
-   &p->cpus_allowed))
+   cpus))
continue;
 
local_group = cpumask_test_cpu(this_cpu,
@@ -5949,7 +5949,8 @@ find_idlest_group(struct sched_domain *sd, struct 
task_struct *p,
  * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
  */
 static int
-find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int 
this_cpu)
+find_idlest_group_cpu(struct sched_group *group, struct task_struct *p,
+ int this_cpu, struct cpumask *cpus)
 {
unsigned long load, min_load = ULONG_MAX;
unsigned int min_exit_latency = UINT_MAX;
@@ -5963,7 +5964,7 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
return cpumask_first(sched_group_span(group));
 
/* Traverse only the allowed CPUs */
-   for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
+   for_each_cpu_and(i, sched_group_span(group), cpus) {
if (available_idle_cpu(i)) {
struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
@@ -5999,7 +6000,8 @@ find_idlest_group_cpu(struct sched_group *group, struct 
task_struct *p, int this
 }
 
 static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct 
*p,
- int cpu, int prev_cpu, int sd_flag)
+ int cpu, int prev_cpu, int sd_flag,
+ struct cpumask *cpus)
 {
int new_cpu = cpu;
 
@@ -6023,13 +6025,14 @@ static inline int find_idlest_cpu(struct sched_domain 
*sd, struct task_struct *p
continue;
}
 
-   group = find_idlest_group(sd, p, cpu, sd_flag);
+   group = find_idlest_group(sd, p, cpu, sd_flag, cpus);
+
if (!group) {
sd = sd->child;
continue;
}
 
-   new_cpu = find_idlest_group_cpu(group, p, cpu);
+   new_cpu = find_idlest_group_cpu(group, p, cpu, cpus);
if (new_cpu == cpu) {
/* Now try balancing at a lower domain level of 'cpu': 
*/
sd = sd->child;
@@ -6104,6 +6107,27 @@ void __update_idle_core(struct rq *rq)
rcu_read_unlock();
 }
 
+static inline int
+scan_cpu_mask_for_idle_cores(struct cpumask *cpus, int target)
+{
+   int core, cpu;
+
+   for_each_cpu_wrap(core, cpus, target) {
+   bool idle = true;
+
+   for_each_cpu(cpu, cpu_smt_mask(core)) {
+   cpumask_clear_cpu(cpu, cpus);
+   if (!idle_cpu(cpu))
+   idle = false;
+   }
+
+   if (idle)
+   return core;
+   }
+
+   return -1;
+}
+
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
@@ -6112,7 +6136,7 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, 
int target)
 {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-   int co

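The two-level search described in the changelog above (scan cpus_preferred
first, then fall back to the rest of cpus_allowed) can be pictured with a
small user-space sketch; the masks and helper names below are illustrative,
not the patch's code.

#include <stdio.h>

#define NR_CPUS 8

static int cpu_idle[NR_CPUS]  = { 0, 0, 0, 0, 1, 0, 1, 0 };
static int preferred[NR_CPUS] = { 1, 1, 1, 1, 0, 0, 0, 0 };   /* soft affinity */
static int allowed[NR_CPUS]   = { 1, 1, 1, 1, 1, 1, 1, 1 };   /* hard affinity */

static int scan(const int *set, int skip_preferred)
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (!set[cpu] || (skip_preferred && preferred[cpu]))
                        continue;
                if (cpu_idle[cpu])
                        return cpu;
        }
        return -1;
}

/* Two-level search: preferred CPUs first, then the remaining allowed CPUs. */
static int select_idle_cpu_soft(void)
{
        int cpu = scan(preferred, 0);

        if (cpu >= 0)
                return cpu;
        return scan(allowed, 1);
}

int main(void)
{
        printf("picked cpu %d\n", select_idle_cpu_soft());  /* 4: no idle preferred CPU */
        return 0;
}
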
[RFC PATCH 0/3] Scheduler Soft Affinity

2019-06-26 Thread subhra mazumdar
Hackbench   %gain with soft affinity
2*4         1.43
2*8         1.36
2*16        1.01
2*32        1.45
1*4         -2.55
1*8         -5.06
1*16        -8
1*32        -7.32

DB          %gain with soft affinity
2*16        0.46
2*24        3.68
2*32        -3.34
1*16        0.08
1*24        1.6
1*32        -1.29

Finally, I measured the overhead of soft affinity when it is NOT used by
comparing it with the baseline kernel for the no-affinity and hard-affinity
cases with Hackbench. The following is the improvement of the soft affinity
kernel w.r.t. the baseline, but the numbers are really within the noise
margin. This shows soft affinity has no overhead when not used.

Hackbench   %diff of no affinity   %diff of hard affinity
2*4         0.11                   0.31
2*8         0.13                   0.55
2*16        0.61                   0.90
2*32        0.86                   1.01
1*4         0.48                   0.43
1*8         0.45                   0.33
1*16        0.61                   0.64
1*32        0.11                   0.63

A final set of experiments was done (numbers not shown) with the memory of
each DB instance spread evenly across both NUMA nodes. This showed similar
improvements with soft affinity for the 2-instance case, thus showing the
improvement is due to saving LLC coherence overhead.

subhra mazumdar (3):
  sched: Introduce new interface for scheduler soft affinity
  sched: change scheduler to give preference to soft affinity CPUs
  sched: introduce tunables to control soft affinity

 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/sched.h  |   5 +-
 include/linux/sched/sysctl.h   |   2 +
 include/linux/syscalls.h   |   3 +
 include/uapi/asm-generic/unistd.h  |   4 +-
 include/uapi/linux/sched.h |   3 +
 init/init_task.c   |   2 +
 kernel/compat.c|   2 +-
 kernel/rcu/tree_plugin.h   |   3 +-
 kernel/sched/core.c| 167 -
 kernel/sched/fair.c| 154 ++
 kernel/sched/sched.h   |   2 +
 kernel/sysctl.c|  14 +++
 13 files changed, 297 insertions(+), 65 deletions(-)

-- 
2.9.3



[RFC PATCH 3/3] sched: introduce tunables to control soft affinity

2019-06-26 Thread subhra mazumdar
For different workloads the optimal "softness" of soft affinity can be
different. Introduce tunables sched_allowed and sched_preferred that can
be tuned via /proc. This allows to chose at what utilization difference
the scheduler will chose cpus_allowed over cpus_preferred in the first
level of search. Depending on the extent of data sharing, cache coherency
overhead of the system etc. the optimal point may vary.

Signed-off-by: subhra mazumdar 
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/fair.c  | 19 ++-
 kernel/sched/sched.h |  2 ++
 kernel/sysctl.c  | 14 ++
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 99ce6d7..0e75602 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -41,6 +41,8 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
+extern __read_mostly unsigned int sysctl_sched_preferred;
+extern __read_mostly unsigned int sysctl_sched_allowed;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53aa7f2..d222d78 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -85,6 +85,8 @@ unsigned int sysctl_sched_wakeup_granularity  
= 1000000UL;
 static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost   = 500000UL;
+const_debug unsigned int sysctl_sched_preferred= 1UL;
+const_debug unsigned int sysctl_sched_allowed  = 100UL;
 
 #ifdef CONFIG_SMP
 /*
@@ -6739,7 +6741,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f
int new_cpu = prev_cpu;
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
-   struct cpumask *cpus = &p->cpus_preferred;
+   int cpux, cpuy;
+   struct cpumask *cpus;
+
+   if (!p->affinity_unequal) {
+   cpus = &p->cpus_allowed;
+   } else {
+   cpux = cpumask_any(&p->cpus_preferred);
+   cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+   cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred);
+   cpuy = cpumask_any(cpus);
+   if (sysctl_sched_preferred * cpu_rq(cpux)->cfs.avg.util_avg >
+   sysctl_sched_allowed * cpu_rq(cpuy)->cfs.avg.util_avg)
+   cpus = &p->cpus_allowed;
+   else
+   cpus = &p->cpus_preferred;
+   }
 
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..f856bdb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1863,6 +1863,8 @@ extern void check_preempt_curr(struct rq *rq, struct 
task_struct *p, int flags);
 
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
+extern const_debug unsigned int sysctl_sched_preferred;
+extern const_debug unsigned int sysctl_sched_allowed;
 
 #ifdef CONFIG_SCHED_HRTICK
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7d1008b..bdffb48 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -383,6 +383,20 @@ static struct ctl_table kern_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec,
},
+   {
+   .procname   = "sched_preferred",
+   .data   = &sysctl_sched_preferred,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "sched_allowed",
+   .data   = &sysctl_sched_allowed,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
 #ifdef CONFIG_SCHEDSTATS
{
.procname   = "sched_schedstats",
-- 
2.9.3

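A worked example of the comparison these tunables feed (the utilization
numbers are invented): with the defaults sched_preferred=1 and
sched_allowed=100, the wakeup path only abandons cpus_preferred when the
sampled preferred CPU looks more than 100 times as utilized as the sampled
non-preferred one; shrinking sched_allowed makes that fallback
correspondingly more eager.

#include <stdio.h>

/* Sketch of the tunable comparison used in select_task_rq_fair above. */
int main(void)
{
        unsigned int sched_preferred = 1;     /* default from the patch */
        unsigned int sched_allowed   = 100;   /* default from the patch */
        unsigned long util_preferred = 900;   /* util_avg of a sampled preferred CPU (example) */
        unsigned long util_other     = 8;     /* util_avg of a sampled non-preferred CPU (example) */

        if (sched_preferred * util_preferred > sched_allowed * util_other)
                printf("fall back to cpus_allowed\n");    /* 900 > 800 with these numbers */
        else
                printf("stay within cpus_preferred\n");
        return 0;
}
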


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-06-13 Thread Subhra Mazumdar



On 6/12/19 9:33 AM, Julien Desfossez wrote:

After reading more traces and trying to understand why only untagged
tasks are starving when there are cpu-intensive tasks running on the
same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In
the case where ‘core_cookie’ is 0, we are supposed to only prefer the
tagged task if it’s priority is higher, but when the priorities are
equal we prefer it as well which causes the starving. ‘pick_task’ is
biased toward selecting its first parameter in case of equality which in
this case was the ‘class_pick’ instead of ‘max’. Reversing the order of
the parameter solves this issue and matches the expected behavior.

So we can get rid of this vruntime_boost concept.

We have tested the fix below and it seems to work well with
tagged/untagged tasks.


My 2-DB-instance runs with this patch are better with CORESCHED_STALL_FIX
than NO_CORESCHED_STALL_FIX in terms of performance, standard deviation and
idleness. Maybe enable it by default?

NO_CORESCHED_STALL_FIX:

users %stdev   %gain %idle
16    25   -42.4 73
24    32   -26.3 67
32    0.2  -48.9 62


CORESCHED_STALL_FIX:

users %stdev   %gain %idle
16    6.5  -23 70
24    0.6  -17 60
32    1.5  -30.2   52


[PATCH v3 4/7] sched: add sched feature to disable idle core search

2019-06-08 Thread subhra mazumdar
Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar 
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3



[PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings

2019-06-08 Thread subhra mazumdar
Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
This will be used for quick lookup in select_idle_cpu to determine the
limits of search. This patch does it only for x86.

Signed-off-by: subhra mazumdar 
---
 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 include/linux/topology.h|  4 
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)   (per_cpu(cpumask_weight_sibling, cpu))
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..20bf676 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,10 @@
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
 
if (!has_mp) {
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) =
+   cpumask_weight(topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
c->booted_cores = 1;
@@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
for_each_cpu(i, cpu_sibling_setup_mask) {
o = &cpu_data(i);
 
-   if ((i == cpu) || (has_smt && match_smt(c, o)))
+   if ((i == cpu) || (has_smt && match_smt(c, o))) {
link_mask(topology_sibling_cpumask, cpu, i);
+   threads = cpumask_weight(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = threads;
+   per_cpu(cpumask_weight_sibling, i) = threads;
+   }
 
if ((i == cpu) || (has_mp && match_llc(c, o)))
link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
else
physid_set_mask_of_physid(0, &phys_cpu_present_map);
cpumask_set_cpu(0, topology_sibling_cpumask(0));
+   per_cpu(cpumask_weight_sibling, 0) =
+   cpumask_weight(topology_sibling_cpumask(0));
cpumask_set_cpu(0, topology_core_cpumask(0));
 }
 
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
 
for_each_cpu(sibling, topology_core_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+   per_cpu(cpumask_weight_sibling, sibling) =
+   cpumask_weight(topology_sibling_cpumask(sibling));
/*/
 * last thread sibling in this cpu core going down
 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
+   per_cpu(cpumask_weight_sibling, cpu) = 0;
cpumask_clear(topology_core_cpumask(cpu));
c->cpu_core_id = 0;
c->booted_cores = 0;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)  cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_weight
+#d

[PATCH v3 5/7] sched: SIS_CORE to disable idle core search

2019-06-08 Thread subhra mazumdar
Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, removing it improves
throughput. Also there are workloads where disabling it can hurt latency,
so need to have an option.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1ca88e..6a74808 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, 
int prev, int target)
if (!sd)
return target;
 
-   i = select_idle_core(p, sd, target);
-   if ((unsigned)i < nr_cpumask_bits)
-   return i;
+   if (sched_feat(SIS_CORE)) {
+   i = select_idle_core(p, sd, target);
+   if ((unsigned)i < nr_cpumask_bits)
+   return i;
+   }
 
i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3



[PATCH v3 1/7] sched: limit cpu search in select_idle_cpu

2019-06-08 Thread subhra mazumdar
Put upper and lower limit on cpu search of select_idle_cpu. The lower limit
is amount of cpus in a core while upper limit is twice that. This ensures
for any architecture we will usually search beyond a core. The upper limit
also helps in keeping the search cost low and constant.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b58f08f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, nr = INT_MAX;
+   int cpu, limit, floor, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, 
struct sched_domain *sd, int t
 
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   if (span_avg > 4*avg_cost)
+   floor = cpumask_weight(topology_sibling_cpumask(target));
+   if (floor < 2)
+   floor = 2;
+   limit = floor << 1;
+   if (span_avg > floor*avg_cost) {
nr = div_u64(span_avg, avg_cost);
-   else
-   nr = 4;
+   if (nr > limit)
+   nr = limit;
+   } else {
+   nr = floor;
+   }
}
 
time = local_clock();
-- 
2.9.3



[PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling

2019-06-08 Thread subhra mazumdar
Use per-cpu var cpumask_weight_sibling for quick lookup in select_idle_cpu.
This is the fast path of scheduler and every cycle is worth saving. Usage
of cpumask_weight can result in iterations.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a74808..878f11c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
 
if (sched_feat(SIS_PROP)) {
u64 span_avg = sd->span_weight * avg_idle;
-   floor = cpumask_weight(topology_sibling_cpumask(target));
+   floor = topology_sibling_weight(target);
if (floor < 2)
floor = 2;
limit = floor << 1;
-- 
2.9.3



[PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit

2019-06-08 Thread subhra mazumdar
Introduce a per-cpu variable to track the limit upto which idle cpu search
was done in select_idle_cpu(). This will help to start the search next time
from there. This is necessary for rotating the search window over entire
LLC domain.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..80657fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include 
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
struct rq *rq;
 
+   per_cpu(next_cpu, i) = -1;
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..4cecfa2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -994,6 +994,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
#define cpu_rq(cpu)            (&per_cpu(runqueues, (cpu)))
#define this_rq()              this_cpu_ptr(&runqueues)
-- 
2.9.3



[PATCH v3 3/7] sched: rotate the cpu search window for better spread

2019-06-08 Thread subhra mazumdar
Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar 
---
 kernel/sched/fair.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58f08f..c1ca88e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
-   int cpu, limit, floor, nr = INT_MAX;
+   int cpu, limit, floor, target_tmp, nr = INT_MAX;
 
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct 
sched_domain *sd, int t
}
}
 
+   if (per_cpu(next_cpu, target) != -1)
+   target_tmp = per_cpu(next_cpu, target);
+   else
+   target_tmp = target;
+
time = local_clock();
 
-   for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+   for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+   per_cpu(next_cpu, target) = cpu;
if (!--nr)
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3



[PATCH v3 0/7] Improve scheduler scalability for fast path

2019-06-08 Thread subhra mazumdar
percentile  baseline  %stdev   patch              %stdev
50          94        2.82     93.33 (0.71%)      1.24
75          124       2.13     122.67 (1.08%)     1.7
90          152       1.74     149.33 (1.75%)     2.35
95          171       2.11     167 (2.34%)        2.74
99          512.67    104.96   206 (59.82%)       8.86
99.5        2296      82.55    3121.67 (-35.96%)  97.37
99.9        12517.33  2.38     12592 (-0.6%)      1.67

Changes from v2->v3:
-Use shift operator instead of multiplication to compute limit
-Use per-CPU variable to precompute the number of sibling SMTs for x86

subhra mazumdar (7):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  x86/smpboot: introduce per-cpu variable for HT siblings
  sched: use per-cpu variable cpumask_weight_sibling

 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 17 -
 include/linux/topology.h|  4 
 kernel/sched/core.c |  2 ++
 kernel/sched/fair.c | 31 +++
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h|  1 +
 8 files changed, 49 insertions(+), 9 deletions(-)

-- 
2.9.3



Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

2019-05-14 Thread Subhra Mazumdar



On 5/14/19 10:27 AM, Subhra Mazumdar wrote:


On 5/14/19 9:03 AM, Steven Sistare wrote:

On 5/13/2019 7:35 AM, Peter Zijlstra wrote:

On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:

On 10-05-19, 09:21, Peter Zijlstra wrote:

I don't hate his per se; but the whole select_idle_sibling() thing is
something that needs looking at.

There was the task stealing thing from Steve that looked 
interesting and

that would render your apporach unfeasible.

I am surely missing something as I don't see how that patchset will
make this patchset perform badly, than what it already does.

Nah; I just misremembered. I know Oracle has a patch set poking at
select_idle_siblings() _somewhere_ (as do I), and I just found the 
wrong

one.

Basically everybody is complaining select_idle_sibling() is too
expensive for checking the entire LLC domain, except for FB (and thus
likely some other workloads too) that depend on it to kill their tail
latency.

But I suppose we could still do this, even if we scan only a subset of
the LLC, just keep track of the last !idle CPU running only SCHED_IDLE
tasks and pick that if you do not (in your limited scan) find a better
candidate.
Subhra posted a patch that incrementally searches for an idle CPU in 
the LLC,
remembering the last CPU examined, and searching a fixed number of 
CPUs from there.
That technique is compatible with the one that Viresh suggests; the 
incremental

search would stop if a SCHED_IDLE cpu was found.

This was the last version of patchset I sent:
https://lkml.org/lkml/2018/6/28/810

Also select_idle_core is a net -ve for certain workloads like OLTP. So I
had put a SCHED_FEAT to be able to disable it.

Forgot to add, the cpumask_weight computation may not be O(1) with large
number of CPUs, so needs to be precomputed in a per-cpu variable to further
optimize. That part is missing from the above patchset.


Thanks,
Subhra


I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap 
of idle CPUs,
updated with atomic operations.  Performance was basically unchanged 
for the
workloads I tested, and I inserted timers around the idle search 
showing it was
a very small fraction of time both before and after my changes.  That 
led me to

ignore the push side and optimize the pull side with task stealing.

I would be very interested in hearing from folks that have workloads 
that demonstrate

that select_idle_sibling is too expensive.

- Steve


Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found

2019-05-14 Thread Subhra Mazumdar



On 5/14/19 9:03 AM, Steven Sistare wrote:

On 5/13/2019 7:35 AM, Peter Zijlstra wrote:

On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote:

On 10-05-19, 09:21, Peter Zijlstra wrote:

I don't hate his per se; but the whole select_idle_sibling() thing is
something that needs looking at.

There was the task stealing thing from Steve that looked interesting and
that would render your apporach unfeasible.

I am surely missing something as I don't see how that patchset will
make this patchset perform badly, than what it already does.

Nah; I just misremembered. I know Oracle has a patch set poking at
select_idle_siblings() _somewhere_ (as do I), and I just found the wrong
one.

Basically everybody is complaining select_idle_sibling() is too
expensive for checking the entire LLC domain, except for FB (and thus
likely some other workloads too) that depend on it to kill their tail
latency.

But I suppose we could still do this, even if we scan only a subset of
the LLC, just keep track of the last !idle CPU running only SCHED_IDLE
tasks and pick that if you do not (in your limited scan) find a better
candidate.

Subhra posted a patch that incrementally searches for an idle CPU in the LLC,
remembering the last CPU examined, and searching a fixed number of CPUs from 
there.
That technique is compatible with the one that Viresh suggests; the incremental
search would stop if a SCHED_IDLE cpu was found.

This was the last version of patchset I sent:
https://lkml.org/lkml/2018/6/28/810

Also select_idle_core is a net -ve for certain workloads like OLTP. So I
had put a SCHED_FEAT to be able to disable it.

Thanks,
Subhra


I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap of idle 
CPUs,
updated with atomic operations.  Performance was basically unchanged for the
workloads I tested, and I inserted timers around the idle search showing it was
a very small fraction of time both before and after my changes.  That led me to
ignore the push side and optimize the pull side with task stealing.

I would be very interested in hearing from folks that have workloads that 
demonstrate
that select_idle_sibling is too expensive.

- Steve


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-09 Thread Subhra Mazumdar




select_task_rq_* seems to be unchanged. So the search logic to find a cpu
to enqueue when a task becomes runnable is same as before and doesn't do
any kind of cookie matching.

Okay, that's true in task wakeup path, and also load_balance seems to pull task
without checking cookie too. But my system is not over loaded when I tested this
patch, so there is none or only one task in rq and on the rq's rb
tree, so this patch
does not make a difference.

I had same hypothesis for my tests.


The question is, should we do cookie checking for task selecting CPU and load
balance CPU pulling task?

The basic issue is keeping the CPUs busy. In case of an overloaded system,
the trivial new idle balancer should be able to find a matching task in case
of forced idle. More problematic is the lower-load scenario, where there
aren't any matching tasks to be found but there are runnable tasks of other
groups. Also, the wakeup code path tries to balance threads across cores
(select_idle_core) first, which is the opposite of what core scheduling
wants. I will re-run my tests with select_idle_core disabled, but the issue
is that on x86 Intel systems (my test rig) the CPU ids are interleaved across
cores, so even select_idle_cpu will balance across cores first. Maybe others
have some better ideas?


Thanks,
-Aubrey


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Subhra Mazumdar



On 5/8/19 6:38 PM, Aubrey Li wrote:

On Thu, May 9, 2019 at 8:29 AM Subhra Mazumdar
 wrote:


On 5/8/19 5:01 PM, Aubrey Li wrote:

On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar
 wrote:

On 5/8/19 11:19 AM, Subhra Mazumdar wrote:

On 5/8/19 8:49 AM, Aubrey Li wrote:

Pawan ran an experiment setting up 2 VMs, with one VM doing a
parallel kernel build and one VM doing sysbench,
limiting both VMs to run on 16 cpu threads (8 physical cores), with
8 vcpu for each VM.
Making the fix did improve kernel build time by 7%.

I'm gonna agree with the patch below, but just wonder if the testing
result is consistent,
as I didn't see any improvement in my testing environment.

IIUC, from the code behavior, especially for 2 VMs case(only 2
different cookies), the
per-rq rb tree unlikely has nodes with different cookies, that is, all
the nodes on this
tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a
match and go the
third branch
- else, no matter we go left or right, we can't find a match, and
we'll return idle thread
finally.

Please correct me if I was wrong.

Thanks,
-Aubrey

This is searching in the per core rb tree (rq->core_tree) which can have
2 different cookies. But having said that, even I didn't see any
improvement with the patch for my DB test case. But logically it is
correct.


Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure
why you think it is unlikely?

Yeah, I meant 2 different cookies on the system, but unlikely 2
different cookies
on one same rq.

If I read the source correctly, for the sched_core_balance path, when try to
steal cookie from another CPU, sched_core_find() uses dst's cookie to search
if there is a cookie match in src's rq, and sched_core_find() returns idle or
matched task, and later put this matched task onto dst's rq (activate_task() in
sched_core_find()). At this moment, the nodes on the rq's rb tree should have
same cookies.

Thanks,
-Aubrey

Yes, but sched_core_find is also called from pick_task to find a local
matching task.

Can a local searching introduce a different cookies? Where is it from?

No. I meant the local search uses the same binary search of sched_core_find
so it has to be correct.



The enqueue side logic of the scheduler is unchanged with
core scheduling,

But only the task with cookies is placed onto this rb tree?


so it is possible that tasks with different cookies are
enqueued on the same rq. So doing the local search for a matching task
correctly should matter.

May I know how exactly?

select_task_rq_* seems to be unchanged. So the search logic to find a cpu
to enqueue when a task becomes runnable is same as before and doesn't do
any kind of cookie matching.


Thanks,
-Aubrey


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Subhra Mazumdar



On 5/8/19 5:01 PM, Aubrey Li wrote:

On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar
 wrote:


On 5/8/19 11:19 AM, Subhra Mazumdar wrote:

On 5/8/19 8:49 AM, Aubrey Li wrote:

Pawan ran an experiment setting up 2 VMs, with one VM doing a
parallel kernel build and one VM doing sysbench,
limiting both VMs to run on 16 cpu threads (8 physical cores), with
8 vcpu for each VM.
Making the fix did improve kernel build time by 7%.

I'm gonna agree with the patch below, but just wonder if the testing
result is consistent,
as I didn't see any improvement in my testing environment.

IIUC, from the code behavior, especially for 2 VMs case(only 2
different cookies), the
per-rq rb tree unlikely has nodes with different cookies, that is, all
the nodes on this
tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a
match and go the
third branch
- else, no matter we go left or right, we can't find a match, and
we'll return idle thread
finally.

Please correct me if I was wrong.

Thanks,
-Aubrey

This is searching in the per core rb tree (rq->core_tree) which can have
2 different cookies. But having said that, even I didn't see any
improvement with the patch for my DB test case. But logically it is
correct.


Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure
why you think it is unlikely?

Yeah, I meant 2 different cookies on the system, but unlikely 2
different cookies
on one same rq.

If I read the source correctly, for the sched_core_balance path, when try to
steal cookie from another CPU, sched_core_find() uses dst's cookie to search
if there is a cookie match in src's rq, and sched_core_find() returns idle or
matched task, and later put this matched task onto dst's rq (activate_task() in
sched_core_find()). At this moment, the nodes on the rq's rb tree should have
same cookies.

Thanks,
-Aubrey

Yes, but sched_core_find is also called from pick_task to find a local
matching task. The enqueue-side logic of the scheduler is unchanged with
core scheduling, so it is possible that tasks with different cookies are
enqueued on the same rq. So doing the local search for a matching task
correctly should matter.
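
For reference, the matching behaviour discussed in this thread (walk a
structure ordered by cookie; return the matching task, or the idle task when
nothing matches) can be sketched in user space with a sorted array standing
in for the per-rq rb tree. This is only an illustration of the described
logic, not the actual sched_core_find(), and the cookies and task names are
made up.

#include <stdio.h>

struct task { unsigned long cookie; const char *name; };

static struct task idle_task = { 0, "idle" };

/* Tasks on one rq, kept ordered by cookie (stand-in for the rb tree). */
static struct task rq_tasks[] = {
        { 10, "vm-A vcpu" },
        { 10, "vm-A vcpu" },
        { 20, "vm-B vcpu" },
};

/* Binary search by cookie: matching task if found, otherwise the idle task. */
static struct task *core_find_sketch(unsigned long cookie)
{
        int lo = 0, hi = (int)(sizeof(rq_tasks) / sizeof(rq_tasks[0])) - 1;

        while (lo <= hi) {
                int mid = (lo + hi) / 2;

                if (rq_tasks[mid].cookie == cookie)
                        return &rq_tasks[mid];
                if (rq_tasks[mid].cookie < cookie)
                        lo = mid + 1;
                else
                        hi = mid - 1;
        }
        return &idle_task;            /* no cookie match: force idle */
}

int main(void)
{
        printf("cookie 20 -> %s\n", core_find_sketch(20)->name);
        printf("cookie 30 -> %s\n", core_find_sketch(30)->name);  /* idle */
        return 0;
}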


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Subhra Mazumdar



On 5/8/19 11:19 AM, Subhra Mazumdar wrote:


On 5/8/19 8:49 AM, Aubrey Li wrote:
Pawan ran an experiment setting up 2 VMs, with one VM doing a 
parallel kernel build and one VM doing sysbench,
limiting both VMs to run on 16 cpu threads (8 physical cores), with 
8 vcpu for each VM.

Making the fix did improve kernel build time by 7%.

I'm gonna agree with the patch below, but just wonder if the testing
result is consistent,
as I didn't see any improvement in my testing environment.

IIUC, from the code behavior, especially for 2 VMs case(only 2
different cookies), the
per-rq rb tree unlikely has nodes with different cookies, that is, all
the nodes on this
tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a
match and go the
third branch
- else, no matter we go left or right, we can't find a match, and
we'll return idle thread
finally.

Please correct me if I was wrong.

Thanks,
-Aubrey

This is searching in the per core rb tree (rq->core_tree) which can have
2 different cookies. But having said that, even I didn't see any
improvement with the patch for my DB test case. But logically it is
correct.


Ah, my bad. It is per rq. But still can have 2 different cookies. Not sure
why you think it is unlikely?


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-05-08 Thread Subhra Mazumdar



On 5/8/19 8:49 AM, Aubrey Li wrote:

Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel 
build and one VM doing sysbench,
limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpu for 
each VM.
Making the fix did improve kernel build time by 7%.

I'm gonna agree with the patch below, but just wonder if the testing
result is consistent,
as I didn't see any improvement in my testing environment.

IIUC, from the code behavior, especially for 2 VMs case(only 2
different cookies), the
per-rq rb tree unlikely has nodes with different cookies, that is, all
the nodes on this
tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a
match and go the
third branch
- else, no matter we go left or right, we can't find a match, and
we'll return idle thread
finally.

Please correct me if I was wrong.

Thanks,
-Aubrey

This is searching in the per core rb tree (rq->core_tree) which can have
2 different cookies. But having said that, even I didn't see any
improvement with the patch for my DB test case. But logically it is
correct.



Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Subhra Mazumdar



On 4/26/19 3:43 AM, Mel Gorman wrote:

On Fri, Apr 26, 2019 at 10:42:22AM +0200, Ingo Molnar wrote:

It should, but it's not perfect. For example, wake_affine_idle does not
take sibling activity into account even though select_idle_sibling *may*
take it into account. Even select_idle_sibling in its fast path may use
an SMT sibling instead of searching.

There are also potential side-effects with cpuidle. Some workloads
migrate around the socket as they are communicating because of how the
search for an idle CPU works. With SMT on, there is potentially a longer
opportunity for a core to reach a deep c-state and incur a bigger wakeup
latency. This is a very weak theory but I've seen cases where latency
sensitive workloads with only two communicating tasks are affected by
CPUs reaching low c-states due to migrations.


Clearly it doesn't.


It's more that it's best effort to wakeup quickly instead of being perfect
by using an expensive search every time.

Yeah, but your numbers suggest that for *most* not heavily interacting
under-utilized CPU bound workloads we hurt in the 5-10% range compared to
no-SMT - more in some cases.


Indeed, it was higher than expected and we can't even use the excuse that
more resources are available to a single logical CPU as the scheduler is
meant to keep them apart.


So we avoid a maybe 0.1% scheduler placement overhead but inflict 5-10%
harm on the workload, and also blow up stddev by randomly co-scheduling
two tasks on the same physical core? Not a good trade-off.

I really think we should implement a relatively strict physical core
placement policy in the under-utilized case, and resist any attempts to
weaken this for special workloads that ping-pong quickly and benefit from
sharing the same physical core.


It's worth a shot at least. Changes should mostly be in the wake_affine
path for most loads of interest.

Doesn't select_idle_sibling already try to do that by calling
select_idle_core? For our OLTP workload we in fact found the cost of
select_idle_core was actually hurting more than it helped to find a fully
idle core, so a net negative.
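To make the cost argument concrete, a simplified version of what
select_idle_core() has to do (adapted loosely from mainline, with details
omitted): for every candidate core it must inspect all SMT siblings before it
can declare the core fully idle, which is what gets expensive on a large LLC
when no idle core exists.

static int select_idle_core_sketch(struct task_struct *p,
				   struct sched_domain *sd, int target)
{
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
	int core, cpu;

	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);

	for_each_cpu_wrap(core, cpus, target) {
		bool idle = true;

		for_each_cpu(cpu, cpu_smt_mask(core)) {
			cpumask_clear_cpu(cpu, cpus);
			if (!available_idle_cpu(cpu))
				idle = false;	/* one busy sibling spoils the core */
		}

		if (idle)
			return core;	/* found a fully idle core */
	}

	return -1;	/* scanned the whole LLC for nothing */
}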



Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.

2019-04-19 Thread Subhra Mazumdar



On 4/19/19 1:40 AM, Ingo Molnar wrote:

* Subhra Mazumdar  wrote:


I see similar improvement with this patch as removing the condition I
earlier mentioned. So that's not needed. I also included the patch for the
priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users
(from earlier emails).


1 DB instance

users  baseline   %idle    core_sched %idle
16 1  84   -4.9% 84
24 1  76   -6.7% 75
32 1  69   -2.4% 69

2 DB instance

users  baseline   %idle    core_sched %idle
16 1  66   -19.5% 69
24 1  54   -9.8% 57
32 1  42   -27.2%    48

So HT disabling slows down the 2DB instance by -22%, while core-sched
slows it down by -27.2%?

Would it be possible to see all the results in two larger tables (1 DB
instance and 2 DB instance) so that we can compare the performance of the
3 kernel variants with each other:

  - "vanilla +HT": Hyperthreading enabled,  vanilla scheduler
  - "vanilla -HT": Hyperthreading disabled, vanilla scheduler
  - "core_sched":  Hyperthreading enabled,  core-scheduling enabled

?

Thanks,

Ingo

Following are the numbers. Disabling HT gives improvement in some cases.

1 DB instance

users  vanilla+HT   core_sched vanilla-HT
16 1    -4.9% -11.7%
24 1    -6.7% +13.7%
32 1    -2.4% +8%

2 DB instance

users  vanilla+HT   core_sched vanilla-HT
16 1    -19.5% +5.6%
24 1    -9.8% +3.5%
32 1    -27.2%    -22.8%


Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.

2019-04-10 Thread Subhra Mazumdar



On 4/9/19 11:38 AM, Julien Desfossez wrote:

We found the source of the major performance regression we discussed
previously. It turns out there was a pattern where a task (a kworker in this
case) could be woken up, but the core could still end up idle before that
task had a chance to run.

Example sequence, cpu0 and cpu1 and siblings on the same core, task1 and
task2 are in the same cgroup with the tag enabled (each following line
happens in the increasing order of time):
- task1 running on cpu0, task2 running on cpu1
- sched_waking(kworker/0, target_cpu=cpu0)
- task1 scheduled out of cpu0
- kworker/0 cannot run on cpu0 because task2 is still running on cpu1
   cpu0 is idle
- task2 scheduled out of cpu1
- cpu1 doesn’t select kworker/0 for cpu0, because the optimization path ends
   the task selection if core_cookie is NULL for currently selected process
   and the cpu1’s runqueue.
- cpu1 is idle
--> both siblings are idle but kworker/0 is still in the run queue of cpu0.
 Cpu0 may stay idle for longer if it goes deep idle.

With the fix below, we ensure to send an IPI to the sibling if it is idle
and has tasks waiting in its runqueue.
This fixes the performance issue we were seeing.

Now here is what we can measure with a disk write-intensive benchmark:
- no performance impact with enabling core scheduling without any tagged
   task,
- 5% overhead if one tagged task is competing with an untagged task,
- 10% overhead if 2 tasks tagged with a different tag are competing
   against each other.

We are starting more scaling tests, but this is very encouraging !


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1fa10561279..02c862a5e973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
  
  trace_printk("unconstrained pick: %s/%d %lx\n",

next->comm, next->pid, 
next->core_cookie);
+   rq->core_pick = NULL;
  
+/*

+* If the sibling is idling, we might want to 
wake it
+* so that it can check for any runnable but 
blocked tasks
+* due to previous task matching.
+*/
+   for_each_cpu(j, smt_mask) {
+   struct rq *rq_j = cpu_rq(j);
+   rq_j->core_pick = NULL;
+   if (j != cpu && is_idle_task(rq_j->curr) 
&& rq_j->nr_running) {
+   resched_curr(rq_j);
+   trace_printk("IPI(%d->%d[%d]) idle 
preempt\n",
+cpu, j, 
rq_j->nr_running);
+   }
+   }
goto done;
}
  

I see similar improvement with this patch as removing the condition I
earlier mentioned. So that's not needed. I also included the patch for the
priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users
(from earlier emails).


1 DB instance

users  baseline   %idle    core_sched %idle
16 1  84   -4.9% 84
24 1  76   -6.7% 75
32 1  69   -2.4% 69

2 DB instance

users  baseline   %idle    core_sched %idle
16 1  66   -19.5% 69
24 1  54   -9.8% 57
32 1  42   -27.2%    48


Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-04-04 Thread Subhra Mazumdar




We tried to comment those lines and it doesn’t seem to get rid of the
performance regression we are seeing.
Can you elaborate a bit more about the test you are performing, what kind of
resources it uses ?

I am running 1 and 2 Oracle DB instances each running TPC-C workload. The
clients driving the instances also run in same node. Each server client
pair is put in each cpu group and tagged.

Can you also try to replicate our test and see if you see the same problem ?

cgcreate -g cpu,cpuset:set1
cat /sys/devices/system/cpu/cpu{0,2,4,6}/topology/thread_siblings_list
0,36
2,38
4,40
6,42

echo "0,2,4,6,36,38,40,42" | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.cpus
echo 0 | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.mems

echo 1 | sudo tee /sys/fs/cgroup/cpu,cpuacct/set1/cpu.tag

sysbench --test=fileio prepare
cgexec -g cpu,cpuset:set1 sysbench --threads=4 --test=fileio \
--file-test-mode=seqwr run

The reason we create a cpuset is to narrow down the investigation to just 4
cores on a highly powerful machine. It might not be needed if testing on a
smaller machine.

With this sysbench test I am not seeing any improvement with removing the
condition. Also with hackbench I found it makes no difference but that has
much lower regression to begin with (18%)


Julien


Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-04-01 Thread Subhra Mazumdar



On 3/29/19 3:23 PM, Subhra Mazumdar wrote:


On 3/29/19 6:35 AM, Julien Desfossez wrote:
On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar 


wrote:

Is the core wide lock primarily responsible for the regression? I ran
upto patch
12 which also has the core wide lock for tagged cgroups and also calls
newidle_balance() from pick_next_task(). I don't see any 
regression.  Of

course
the core sched version of pick_next_task() may be doing more but
comparing with
the __pick_next_task() it doesn't look too horrible.
On further testing and investigation, we also agree that spinlock 
contention
is not the major cause for the regression, but we feel that it should 
be one

of the major contributing factors to this performance loss.



I finally did some code bisection and found the following lines are
basically responsible for the regression. Commenting them out I don't see
the regressions. Can you confirm? I am yet to figure if this is needed 
for

the correctness of core scheduling and if so can we do this better?

>8-

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe3918c..3b3388a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
 * If there weren't no cookies; we 
don't need

 * to bother with the other siblings.
*/
-   if (i == cpu && !rq->core->core_cookie)
-   goto next_class;
+   //if (i == cpu && !rq->core->core_cookie)
+   //goto next_class;

continue;
    }

AFAICT this condition is not needed for correctness as cookie matching will
still be enforced. Peter, any thoughts? I get the following numbers with 1 DB
and 2 DB instance.

1 DB instance
users  baseline   %idle  core_sched %idle
16 1  84 -5.5% 84
24 1  76 -5% 76
32 1  69 -0.45% 69

2 DB instance
users  baseline   %idle  core_sched %idle
16 1  66 -23.8% 69
24 1  54 -3.1% 57
32 1  42 -21.1%  48


Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-03-29 Thread Subhra Mazumdar



On 3/29/19 6:35 AM, Julien Desfossez wrote:

On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar 
wrote:

Is the core wide lock primarily responsible for the regression? I ran
upto patch
12 which also has the core wide lock for tagged cgroups and also calls
newidle_balance() from pick_next_task(). I don't see any regression.  Of
course
the core sched version of pick_next_task() may be doing more but
comparing with
the __pick_next_task() it doesn't look too horrible.

On further testing and investigation, we also agree that spinlock contention
is not the major cause for the regression, but we feel that it should be one
of the major contributing factors to this performance loss.



I finally did some code bisection and found the following lines are
basically responsible for the regression. Commenting them out I don't see
the regressions. Can you confirm? I am yet to figure if this is needed for
the correctness of core scheduling and if so can we do this better?

>8-

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe3918c..3b3388a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct 
*prev, struct rq_flags *rf)
 * If there weren't no cookies; we 
don't need

 * to bother with the other siblings.
*/
-   if (i == cpu && !rq->core->core_cookie)
-   goto next_class;
+   //if (i == cpu && !rq->core->core_cookie)
+   //goto next_class;

continue;
    }


Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-03-26 Thread Subhra Mazumdar



On 3/22/19 5:06 PM, Subhra Mazumdar wrote:


On 3/21/19 2:20 PM, Julien Desfossez wrote:
On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar 


wrote:

On 3/18/19 8:41 AM, Julien Desfossez wrote:

On further investigation, we could see that the contention is mostly 
in the

way rq locks are taken. With this patchset, we lock the whole core if
cpu.tag is set for at least one cgroup. Due to this, __schedule() is 
more or

less serialized for the core and that attributes to the performance loss
that we are seeing. We also saw that newidle_balance() takes 
considerably
long time in load_balance() due to the rq spinlock contention. Do you 
think
it would help if the core-wide locking was only performed when 
absolutely

needed ?

Is the core wide lock primarily responsible for the regression? I ran 
upto patch

12 which also has the core wide lock for tagged cgroups and also calls
newidle_balance() from pick_next_task(). I don't see any regression.  
Of course
the core sched version of pick_next_task() may be doing more but 
comparing with

the __pick_next_task() it doesn't look too horrible.
I gathered some data with only 1 DB instance running (which also has the 52%
slowdown). Following are the numbers of pick_next_task() calls and their avg
cost for patch 12 and patch 15. The total number of calls seems to be similar,
but the avg cost (in us) has more than doubled. For both patches I had put
the DB instance into a cpu tagged cgroup.

                               patch12          patch15
count pick_next_task           62317898         58925395
avg cost pick_next_task (us)   0.6566323209     1.4223810108
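For reference, one way to collect count/average numbers like the ones above is
a pair of per-CPU counters sampled around the call; the counter names below are
made up for illustration and this is not necessarily the instrumentation used.

DEFINE_PER_CPU(u64, pnt_calls);
DEFINE_PER_CPU(u64, pnt_ns);

static struct task_struct *
pick_next_task_timed(struct rq *rq, struct task_struct *prev,
		     struct rq_flags *rf)
{
	u64 t0 = sched_clock();
	struct task_struct *next = pick_next_task(rq, prev, rf);

	this_cpu_inc(pnt_calls);
	this_cpu_add(pnt_ns, sched_clock() - t0);	/* avg = pnt_ns / pnt_calls */
	return next;
}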


Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-03-22 Thread Subhra Mazumdar



On 3/21/19 2:20 PM, Julien Desfossez wrote:

On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar 
wrote:

On 3/18/19 8:41 AM, Julien Desfossez wrote:


On further investigation, we could see that the contention is mostly in the
way rq locks are taken. With this patchset, we lock the whole core if
cpu.tag is set for at least one cgroup. Due to this, __schedule() is more or
less serialized for the core and that attributes to the performance loss
that we are seeing. We also saw that newidle_balance() takes considerably
long time in load_balance() due to the rq spinlock contention. Do you think
it would help if the core-wide locking was only performed when absolutely
needed ?

Is the core wide lock primarily responsible for the regression? I ran 
upto patch

12 which also has the core wide lock for tagged cgroups and also calls
newidle_balance() from pick_next_task(). I don't see any regression.  Of 
course
the core sched version of pick_next_task() may be doing more but 
comparing with

the __pick_next_task() it doesn't look too horrible.


Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

2019-03-19 Thread Subhra Mazumdar



On 3/18/19 8:41 AM, Julien Desfossez wrote:

The case where we try to acquire the lock on 2 runqueues belonging to 2
different cores requires the rq_lockp wrapper as well otherwise we
frequently deadlock in there.

This fixes the crash reported in
1552577311-8218-1-git-send-email-jdesfos...@digitalocean.com

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76fee56..71bb71f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct 
rq *rq2)
raw_spin_lock(rq_lockp(rq1));
__acquire(rq2->lock);/* Fake it out ;) */
} else {
-   if (rq1 < rq2) {
+   if (rq_lockp(rq1) < rq_lockp(rq2)) {
raw_spin_lock(rq_lockp(rq1));
raw_spin_lock_nested(rq_lockp(rq2), 
SINGLE_DEPTH_NESTING);
} else {
With this fix and my previous NULL pointer fix my stress tests are 
surviving. I

re-ran my 2 DB instance setup on 44 core 2 socket system by putting each DB
instance in separate core scheduling group. The numbers look much worse 
now.


users  baseline  %stdev  %idle  core_sched  %stdev %idle
16 1 0.3 66 -73.4%  136.8 82
24 1 1.6 54 -95.8%  133.2 81
32 1 1.5 42 -97.5%  124.3 89

I also notice that if I enable a bunch of debug configs related to mutexes,
spin locks, lockdep etc. (which I did earlier to debug the deadlock), it opens
up a can of worms with multiple crashes.
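The deadlock the rq_lockp() hunk above avoids reduces to the classic
address-ordered double lock: with core scheduling two rq's may share one
underlying lock, so the ordering has to be done on the lock that is actually
taken, not on the rq pointer. A minimal illustration (not the patch itself):

static void double_lock_ordered(raw_spinlock_t *a, raw_spinlock_t *b)
{
	if (a == b) {
		raw_spin_lock(a);			/* shared lock: take it once */
	} else if (a < b) {
		raw_spin_lock(a);
		raw_spin_lock_nested(b, SINGLE_DEPTH_NESTING);
	} else {
		raw_spin_lock(b);
		raw_spin_lock_nested(a, SINGLE_DEPTH_NESTING);
	}
}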


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-11 Thread Subhra Mazumdar



On 3/11/19 5:20 PM, Greg Kerr wrote:

On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
 wrote:


On 3/11/19 11:34 AM, Subhra Mazumdar wrote:

On 3/10/19 9:23 PM, Aubrey Li wrote:

On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
 wrote:

expected. Most of the performance recovery happens in patch 15 which,
unfortunately, is also the one that introduces the hard lockup.


After applied Subhra's patch, the following is triggered by enabling
core sched when a cgroup is
under heavy load.


It seems you are facing some other deadlock where printk is involved.
Can you
drop the last patch (patch 16 sched: Debug bits...) and try?

Thanks,
Subhra


Never Mind, I am seeing the same lockdep deadlock output even w/o patch
16. Btw
the NULL fix had something missing, following works.

Is this panic below, which occurs when I tag the first process,
related or known? If not, I will debug it tomorrow.

[   46.831828] BUG: unable to handle kernel NULL pointer dereference
at 
[   46.831829] core sched enabled
[   46.834261] #PF error: [WRITE]
[   46.834899] PGD 0 P4D 0
[   46.835438] Oops: 0002 [#1] SMP PTI
[   46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
[   46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[   46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
[   46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
00 00 00  0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
00 00
[   46.843000] RSP: 0018:b9d300cabe38 EFLAGS: 00010046
[   46.843744] RAX:  RBX:  RCX: 0004
[   46.844709] RDX: 0001 RSI: aea435ae RDI: 
[   46.845689] RBP: b9d300cabed8 R08:  R09: 00020800
[   46.846651] R10: af603ea0 R11: 0001 R12: af6576c0
[   46.847619] R13: 9a57366c8000 R14: 9a5737401300 R15: ade868f0
[   46.848584] FS:  () GS:9a5737a0()
knlGS:
[   46.849680] CS:  0010 DS:  ES:  CR0: 80050033
[   46.850455] CR2:  CR3: 0001d36fa000 CR4: 06f0
[   46.851415] DR0:  DR1:  DR2: 
[   46.852371] DR3:  DR6: fffe0ff0 DR7: 0400
[   46.853326] Call Trace:
[   46.853678]  __schedule+0x139/0x11f0
[   46.854167]  ? cpumask_next+0x16/0x20
[   46.854668]  ? cpu_stop_queue_work+0xc0/0xc0
[   46.855252]  ? sort_range+0x20/0x20
[   46.855742]  schedule+0x4e/0x60
[   46.856171]  smpboot_thread_fn+0x12a/0x160
[   46.856725]  kthread+0x112/0x120
[   46.857164]  ? kthread_stop+0xf0/0xf0
[   46.857661]  ret_from_fork+0x35/0x40
[   46.858146] Modules linked in:
[   46.858562] CR2: 
[   46.859022] ---[ end trace e9fff08f17bfd2be ]---

- Greg


This seems to be different


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-11 Thread Subhra Mazumdar



On 3/11/19 11:34 AM, Subhra Mazumdar wrote:


On 3/10/19 9:23 PM, Aubrey Li wrote:

On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
 wrote:

expected. Most of the performance recovery happens in patch 15 which,
unfortunately, is also the one that introduces the hard lockup.


After applied Subhra's patch, the following is triggered by enabling
core sched when a cgroup is
under heavy load.

It seems you are facing some other deadlock where printk is involved. 
Can you

drop the last patch (patch 16 sched: Debug bits...) and try?

Thanks,
Subhra

Never Mind, I am seeing the same lockdep deadlock output even w/o patch 
16. Btw

the NULL fix had something missing, following works.

->8

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4..27cbc64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *curr)

 * Avoid running the skip buddy, if running something else can
 * be done without getting too unfair.
*/
-   if (cfs_rq->skip == se) {
+   if (cfs_rq->skip && cfs_rq->skip == se) {
    struct sched_entity *second;

    if (se == curr) {
@@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *curr)

/*
 * Prefer last buddy, try to return the CPU to a preempted task.
*/
-   if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+   if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, 
left)

+   < 1)
    se = cfs_rq->last;

/*
 * Someone really wants this to run. If it's not unfair, run it.
*/
-   if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+   if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, 
left)

+   < 1)
    se = cfs_rq->next;

    clear_buddies(cfs_rq, se);
@@ -6958,6 +6960,9 @@ pick_task_fair(struct rq *rq)

    se = pick_next_entity(cfs_rq, NULL);

+   if (!(se || curr))
+   return NULL;
+
    if (curr) {
    if (se && curr->on_rq)
update_curr(cfs_rq);



Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-11 Thread Subhra Mazumdar



On 3/10/19 9:23 PM, Aubrey Li wrote:

On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
 wrote:

expected. Most of the performance recovery happens in patch 15 which,
unfortunately, is also the one that introduces the hard lockup.


After applied Subhra's patch, the following is triggered by enabling
core sched when a cgroup is
under heavy load.

It seems you are facing some other deadlock where printk is involved. 
Can you

drop the last patch (patch 16 sched: Debug bits...) and try?

Thanks,
Subhra



Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-08 Thread Subhra Mazumdar



On 2/22/19 4:45 AM, Mel Gorman wrote:

On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:

However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.


I tried to do a comparison between tip/master, ht disabled and this series
putting test workloads into a tagged cgroup but unfortunately it failed

[  156.978682] BUG: unable to handle kernel NULL pointer dereference at 
0058
[  156.986597] #PF error: [normal kernel read fault]
[  156.991343] PGD 0 P4D 0
[  156.993905] Oops:  [#1] SMP PTI
[  156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 
5.0.0-rc7-schedcore-v1r1 #1
[  157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 
05/09/2016
[  157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[  157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 
e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00
  53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 
01
[  157.037544] RSP: 0018:c9000c5bbde8 EFLAGS: 00010086
[  157.042819] RAX: 88810f5f6a00 RBX: 0001547f175c RCX: 0001
[  157.050015] RDX: 88bf3bdb0a40 RSI:  RDI: 0001547f175c
[  157.057215] RBP: 88bf7fae32c0 R08: 0001e358 R09: 88810fb9f000
[  157.064410] R10: c9000c5bbe08 R11: 88810fb9f5c4 R12: 
[  157.071611] R13: 88bf4e3ea0c0 R14:  R15: 88bf4e3ea7a8
[  157.078814] FS:  () GS:88bf7f5c() 
knlGS:
[  157.086977] CS:  0010 DS:  ES:  CR0: 80050033
[  157.092779] CR2: 0058 CR3: 0220e005 CR4: 003606e0
[  157.099979] DR0:  DR1:  DR2: 
[  157.109529] DR3:  DR6: fffe0ff0 DR7: 0400
[  157.119058] Call Trace:
[  157.123865]  pick_next_entity+0x61/0x110
[  157.130137]  pick_task_fair+0x4b/0x90
[  157.136124]  __schedule+0x365/0x12c0
[  157.141985]  schedule_idle+0x1e/0x40
[  157.147822]  do_idle+0x166/0x280
[  157.153275]  cpu_startup_entry+0x19/0x20
[  157.159420]  start_secondary+0x17a/0x1d0
[  157.165568]  secondary_startup_64+0xa4/0xb0
[  157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr 
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm 
ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel 
xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd 
ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich 
i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs 
libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast 
i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops 
xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas 
libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod 
scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[  157.258990] CR2: 0058
[  157.264961] ---[ end trace a301ac5e3ee86fde ]---
[  157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[  157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 
be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 
85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
[  157.316121] RSP: 0018:c9000c5bbde8 EFLAGS: 00010086
[  157.324060] RAX: 88810f5f6a00 RBX: 0001547f175c RCX: 0001
[  157.333932] RDX: 88bf3bdb0a40 RSI:  RDI: 0001547f175c
[  157.343795] RBP: 88bf7fae32c0 R08: 0001e358 R09: 88810fb9f000
[  157.353634] R10: c9000c5bbe08 R11: 88810fb9f5c4 R12: 
[  157.363506] R13: 88bf4e3ea0c0 R14:  R15: 88bf4e3ea7a8
[  157.373395] FS:  () GS:88bf7f5c() 
knlGS:
[  157.384238] CS:  0010 DS:  ES:  CR0: 80050033
[  157.392709] CR2: 0058 CR3: 0220e005 CR4: 003606e0
[  157.402601] DR0:  DR1:  DR2: 
[  157.412488] DR3:  DR6: fffe0ff0 DR7: 0400
[  157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
[  158.529804] Shutting down cpus with NMI
[  158.573249] Kernel Offset: disabled
[  158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle 
task! ]---

RIP translates to kernel/sched/fair.c:6819

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
 s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */

 if (vdiff <= 0)
 

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-28 Thread Subhra Mazumdar



On 2/18/19 8:56 AM, Peter Zijlstra wrote:

A much 'demanded' feature: core-scheduling :-(

I still hate it with a passion, and that is part of why it took a little
longer than 'promised'.

While this one doesn't have all the 'features' of the previous (never
published) version and isn't L1TF 'complete', I tend to like the structure
better (relatively speaking: I hate it slightly less).

This one is sched class agnostic and therefore, in principle, doesn't horribly
wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
to force-idle siblings).

Now, as hinted by that, there are semi sane reasons for actually having this.
Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
per core (due to SMT fundamentally sharing caches) and therefore grouping
related tasks on a core makes it more reliable.

However; whichever way around you turn this cookie; it is expensive and nasty.


I am seeing the following hard lockup frequently now. Following is full
kernel output:

[ 5846.412296] drop_caches (8657): drop_caches: 3
[ 5846.624823] drop_caches (8658): drop_caches: 3
[ 5850.604641] hugetlbfs: oracle (8671): Using mlock ulimits for SHM_HUGETL
B is deprecated
[ 5962.930812] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32
[ 5962.930814] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5962.930828] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted
5.0.0-rc7core_sched #1
[ 5962.930828] Hardware name: Oracle Corporation ORACLE SERVER
X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5962.930829] RIP: 0010:try_to_wake_up+0x98/0x470
[ 5962.930830] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 8b 43
3c 8b 73 60 85 f6 0f 85 a6 01 00 00 8b 43 38 85 c0 74 09 f3 90 8b 43 38
<85> c0 75 f7 48 8b 43 10 a8 02 b8 00 00 00 00 0f 85 d5 01 00 00 0f
[ 5962.930831] RSP: 0018:c9000f4dbcb8 EFLAGS: 0002
[ 5962.930832] RAX: 0001 RBX: 88dfb4af1680 RCX:
0041
[ 5962.930832] RDX: 0001 RSI:  RDI:
88dfb4af214c
[ 5962.930833] RBP:  R08: 0001 R09:
c9000f4dbd80
[ 5962.930833] R10: 8880 R11: ea00f0003d80 R12:
88dfb4af214c
[ 5962.930834] R13: 0001 R14: 0046 R15:
0001
[ 5962.930834] FS:  7ff4fabd9ae0() GS:88dfbe28()
knlGS:
[ 5962.930834] CS:  0010 DS:  ES:  CR0: 80050033
[ 5962.930835] CR2: 000f4cc84000 CR3: 003b93d36002 CR4:
003606e0
[ 5962.930835] DR0:  DR1:  DR2:

[ 5962.930836] DR3:  DR6: fffe0ff0 DR7:
0400
[ 5962.930836] Call Trace:
[ 5962.930837]  ? __switch_to_asm+0x34/0x70
[ 5962.930837]  ? __switch_to_asm+0x40/0x70
[ 5962.930838]  ? __switch_to_asm+0x34/0x70
[ 5962.930838]  autoremove_wake_function+0x11/0x50
[ 5962.930838]  __wake_up_common+0x8f/0x160
[ 5962.930839]  ? __switch_to_asm+0x40/0x70
[ 5962.930839]  __wake_up_common_lock+0x7c/0xc0
[ 5962.930840]  pipe_write+0x24e/0x3f0
[ 5962.930840]  __vfs_write+0x127/0x1b0
[ 5962.930840]  vfs_write+0xb3/0x1b0
[ 5962.930841]  ksys_write+0x52/0xc0
[ 5962.930841]  do_syscall_64+0x5b/0x170
[ 5962.930842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5962.930842] RIP: 0033:0x3b5900e7b0
[ 5962.930843] Code: 97 20 00 31 d2 48 29 c2 64 89 11 48 83 c8 ff eb ea 90
90 90 90 90 90 90 90 90 83 3d f1 db 20 00 00 75 10 b8 01 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e fa ff ff 48 89 04 24
[ 5962.930843] RSP: 002b:7ffedbcd93a8 EFLAGS: 0246 ORIG_RAX:
0001
[ 5962.930844] RAX: ffda RBX: 7ff4faa86e24 RCX:
003b5900e7b0
[ 5962.930845] RDX: 028f RSI: 7ff4faa9688e RDI:
000a
[ 5962.930845] RBP: 7ffedbcd93c0 R08: 7ffedbcd9458 R09:
0020
[ 5962.930846] R10:  R11: 0246 R12:
7ffedbcd9458
[ 5962.930847] R13: 7ff4faa9688e R14: 7ff4faa89cc8 R15:
7ff4faa86bd0
[ 5962.930847] Kernel panic - not syncing: Hard LOCKUP
[ 5962.930848] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted
5.0.0-rc7core_sched #1
[ 5962.930848] Hardware name: Oracle Corporation ORACLE
SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5962.930849] Call Trace:
[ 5962.930849]  
[ 5962.930849]  dump_stack+0x5c/0x7b
[ 5962.930850]  panic+0xfe/0x2b2
[ 5962.930850]  nmi_panic+0x35/0x40
[ 5962.930851]  watchdog_overflow_callback+0xef/0x100
[ 5962.930851]  __perf_event_overflow+0x5a/0xe0
[ 5962.930852]  handle_pmi_common+0x1d1/0x280
[ 5962.930852]  ? __set_pte_vaddr+0x32/0x50
[ 5962.930852]  ? 

Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-21 Thread Subhra Mazumdar



On 2/21/19 6:03 AM, Peter Zijlstra wrote:

On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:

On 2/18/19 9:49 AM, Linus Torvalds wrote:

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:

However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.

Linus

I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
This is on baremetal, no virtualization.

I'm thinking oracle schedules quite a bit, right? Then you get massive
overhead (as shown).


Out of curiosity I ran the patchset from Amazon with the same setup to see
if it was any better performance-wise. But it looks equally bad. At 32
users it performed even worse and the idle time increased much more. The only
good thing about it was that it was fair to both instances, as seen in
the low %stdev.

Users  Baseline %stdev  %idle  cosched %stdev %idle
16 1    2.9 66 0.93(-7%)   1.1 69
24 1    11.3    53 0.87(-13%)  11.2 61
32 1    7   41 0.66(-34%)  5.3 54


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-21 Thread Subhra Mazumdar



On 2/21/19 6:03 AM, Peter Zijlstra wrote:

On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:

On 2/18/19 9:49 AM, Linus Torvalds wrote:

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:

However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.

Linus

I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
This is on baremetal, no virtualization.

I'm thinking oracle schedules quite a bit, right? Then you get massive
overhead (as shown).

Yes. In terms of idleness we have:

Users baseline core_sched
16    67% 70%
24    53% 59%
32    41% 49%

So there is more idleness with core sched which is understandable as there
can be forced idleness. The other part contributing to regression is most
likely overhead.


The thing with virt workloads is that if they don't VMEXIT lots, they
also don't schedule lots (the vCPU stays running, nested scheduler
etc..).

I plan to run some VM workloads.


Also; like I wrote, it is quite possible there is some sibling rivalry
here, which can cause excessive rescheduling. Someone would have to
trace a workload and check.

My older patches had a condition that would not preempt a task for a
little while, such that it might make _some_ progress, these patches
don't have that (yet).
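One possible shape of such a condition, purely as an illustration (this is not
the earlier code being referred to): refuse a cookie-driven forced preemption
until the current task has run for a small grace period.

#define COOKIE_PREEMPT_GRACE_NS	(500 * NSEC_PER_USEC)	/* made-up value */

/* caller holds the rq lock */
static bool may_force_preempt(struct rq *rq, struct task_struct *curr)
{
	u64 ran = rq_clock_task(rq) - curr->se.exec_start;

	return ran >= COOKIE_PREEMPT_GRACE_NS;	/* let it make _some_ progress */
}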



Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-20 Thread Subhra Mazumdar



On 2/18/19 9:49 AM, Linus Torvalds wrote:

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:

However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.

   Linus

I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
This is on baremetal, no virtualization.  In all cases I put each DB
instance in separate cpu cgroup. Following are the avg throughput numbers
of the 2 instances. %stdev is the standard deviation between the 2
instances.

Baseline = build w/o CONFIG_SCHED_CORE
core_sched = build w/ CONFIG_SCHED_CORE
HT_disable = offlined sibling HT with baseline

Users  Baseline  %stdev  core_sched %stdev HT_disable   %stdev
16 997768    3.28    808193(-19%)   34 1053888(+5.6%)   2.9
24 1157314   9.4 974555(-15.8%) 40.5 1197904(+3.5%)   4.6
32 1693644   6.4 1237195(-27%)  42.8 1308180(-22.8%)  5.3

The regressions are substantial. Also noticed one of the DB instances was
having much less throughput than the other with core scheduling which
brought down the avg and also reflected in the very high %stdev. Disabling
HT has effect at 32 users but still better than core scheduling both in
terms of avg and %stdev. There are some issue with the DB setup for which
I couldn't go beyond 32 users.


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-20 Thread Subhra Mazumdar



On 2/20/19 1:42 AM, Peter Zijlstra wrote:

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:

Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
quick and dirty cgroup tagging interface," I believe cgroups are used to
define co-scheduling groups in this implementation.

Chrome OS engineers (kerr...@google.com, mpden...@google.com, and
pal...@google.com) are considering an interface that is usable by unprivileged
userspace apps. cgroups are a global resource that require privileged access.
Have you considered an interface that is akin to namespaces? Consider the
following strawperson API proposal (I understand prctl() is generally
used for process
specific actions, so we aren't married to using prctl()):

I don't think we're anywhere near the point where I care about
interfaces with this stuff.

Interfaces are a trivial but tedious matter once the rest works to
satisfaction.

As it happens; there is actually a bug in that very cgroup patch that
can cause undesired scheduling. Try spotting and fixing that.

Another question is if we want to be L1TF complete (and how strict) or
not, and if so, build the missing pieces (for instance we currently
don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
and horrible code and missing for that reason).

I remember asking Paul about this and he mentioned he has an Address Space
Isolation proposal to cover this. So it seems this is out of scope of
core scheduling?


So first; does this provide what we need? If that's sorted we can
bike-shed on uapi/abi.


Re: Gang scheduling

2019-02-12 Thread Subhra Mazumdar

Hi Tim,

On 10/12/18 11:01 AM, Tim Chen wrote:

On 10/10/2018 05:09 PM, Subhra Mazumdar wrote:

Hi,

I was following the Coscheduling patch discussion on lkml and Peter mentioned 
he had a patch series. I found the following on github.

https://github.com/pdxChen/gang/commits/sched_1.23-loadbal

I would like to test this with KVMs. Are the commits from 38d5acb to f019876 
sufficient? Also is there any documentaion on how to use it (any knobs I need 
to turn on for gang scheduling to happen?) or is it enabled by default for KVMs?

Thanks,
Subhra


I would suggest you try
https://github.com/pdxChen/gang/tree/sched_1.23-base
without the load balancing part of gang scheduling.
It is enabled by default for KVMs.

I applied the following 3 patches on 4.19 and tried to install a KVM (with
virt-install). But the kernel hangs with following error:

kernel:watchdog: BUG: soft lockup - CPU#21 stuck for 23s! [kworker/21:1:573]

kvm,sched: Track VCPU threads
x86/kvm,sched: Add fast path for reschedule interrupt
sched: Optimize scheduler_ipi()

The track VCPU patch seems to be the culprit.

Thanks,
Subhra


Due to the constant change in gang scheduling status of the QEMU thread
depending on whether vcpu is loaded or unloaded,
the load balancing part of the code doesn't work very well.

The current version of the code need to be optimized further.  Right now
the QEMU thread constantly does vcpu load and unload during VM enter and exit.
We gang schedule only after vcpu load and register the thread to be gang
scheduled.  When we do vcpu unload, the thread is removed from the set
to be gang scheduled.  Each time there's a synchronization with the
sibling thread that's expensive.

However, for QEMU, there's a one to one correspondence between the QEMU
thread and vcpu.  So we don't have to change the gang scheduling status
for such a thread, which avoids the churn and the sync with the sibling. That should
be helpful for VM with lots of I/O causing constant VM exits.  We're
still working on this optimization.  And the load balancing should be
better after this change.

Tim
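The load/unload pattern described above boils down to something like the
following; gang_sched_add()/gang_sched_remove() are hypothetical helpers, not
an existing KVM API, and are only meant to show where the per-VM-enter/exit
churn and sibling synchronization come from.

static void vcpu_load_sketch(struct kvm_vcpu *vcpu)
{
	gang_sched_add(current);	/* start coscheduling; syncs with sibling */
}

static void vcpu_put_sketch(struct kvm_vcpu *vcpu)
{
	gang_sched_remove(current);	/* stop coscheduling; syncs again */
}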


Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe

2018-11-05 Thread Subhra Mazumdar



On 11/5/18 2:08 AM, Mel Gorman wrote:

Adding Al Viro as per get_maintainers.pl.

On Tue, Sep 25, 2018 at 04:32:40PM -0700, subhra mazumdar wrote:

Introduce pipe_ll_usec field for pipes that indicates the amount of micro
seconds a thread should spin if pipe is empty or full before sleeping. This
is similar to network sockets.

Can you point to what pattern from network sockets you are duplicated
exactly? One would assume it's busy_loop_current_time and busy_loop_timeout
but it should be in the changelog because there are differences in polling
depending on where you are in the network subsystem (which I'm not very
familiar with).

I was referring to the sk_busy_loop_timeout() that uses sk_ll_usec. By
similar I meant having a similar mechanism for pipes to busy wait.



Workloads like hackbench in pipe mode
benefits significantly from this by avoiding the sleep and wakeup overhead.
Other similar usecases can benefit. A tunable pipe_busy_poll is introduced
to enable or disable busy waiting via /proc. The value of it specifies the
amount of spin in microseconds. Default value is 0 indicating no spin.


Your lead mail indicates the spin was set to a "suitable spin time". How
should an administrator select this spin time? What works for hackbench
might not be suitable for another workload. What if the spin time
selected happens to be just slightly longer than the time it takes the
reader to respond? In such a case, the result would be "all spin, no
gain". While networking potentially suffers the same problem, it appears
to be opt-in per socket so it's up to the application not to shoot
itself in the foot.

Even for network, sk_ll_usec is assigned the value of the tunable
sysctl_net_busy_read in sock_init_data() for all sockets initialized by
default. There is a way to do a per-socket setting using sock_setsockopt(); for
pipes that can be added later if needed, in case different apps run
on one system. But there are cases where only one app runs (e.g. big DBs)
and one tunable will suffice. It can be set to a value that is tested to be
beneficial under the operating conditions.
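If a per-pipe setting is ever needed, it could mirror SO_BUSY_POLL with
something like the sketch below; the setter does not exist upstream and is
only meant to show what such an opt-in could look like.

static long pipe_set_busy_poll(struct file *file, unsigned int usec)
{
	struct pipe_inode_info *pipe = file->private_data;

	pipe->pipe_ll_usec = usec;	/* per-pipe override of the global default */
	return 0;
}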



Signed-off-by: subhra mazumdar 
---
  fs/pipe.c | 12 
  include/linux/pipe_fs_i.h |  2 ++
  kernel/sysctl.c   |  7 +++
  3 files changed, 21 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c..35d805b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -26,6 +26,7 @@
  
  #include 

  #include 
+#include 
  
  #include "internal.h"
  
@@ -40,6 +41,7 @@ unsigned int pipe_max_size = 1048576;

   */
  unsigned long pipe_user_pages_hard;
  unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
+unsigned int pipe_busy_poll;
  
  /*

   * We use a start+len construction, which provides full use of the
@@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
  void pipe_wait(struct pipe_inode_info *pipe)
  {
DEFINE_WAIT(wait);
+   u64 start;
  
  	/*

 * Pipes are system-local resources, so sleeping on them
@@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe)
 */
prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
pipe_unlock(pipe);
+   start = local_clock();
+   while (current->state != TASK_RUNNING &&
+  ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
+   cpu_relax();
schedule();
finish_wait(&pipe->wait, &wait);
pipe_lock(pipe);

Networking breaks this out better in terms of options instead of
hard-coding. This does not handle need_resched or signal delivery
properly where as networking does for example.

I don't disable preemption, so I don't think checking need_resched is needed.
Can you point to what you mean by handling signal delivery in case of
networking? Not sure what I am missing. My initial version broke it out
like networking but after Peter's suggestion I clubbed it. I don't feel
strongly either way.
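For comparison, the extra termination checks being referred to would look
roughly like this (an illustration, not the posted patch): the spin also bails
out on need_resched() and on a pending signal, as the networking busy-poll
loop does.

static void pipe_busy_wait_sketch(struct pipe_inode_info *pipe)
{
	u64 start = local_clock();

	while (current->state != TASK_RUNNING &&
	       !need_resched() && !signal_pending(current) &&
	       ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
		cpu_relax();
}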



@@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
struct file *files[2];
int fd[2];
int error;
+   struct pipe_inode_info *pipe;
  
  	error = __do_pipe_flags(fd, files, flags);

if (!error) {
@@ -838,6 +846,10 @@ static int do_pipe2(int __user *fildes, int flags)
fd_install(fd[0], files[0]);
fd_install(fd[1], files[1]);
}
+   pipe = files[0]->private_data;
+   pipe->pipe_ll_usec = pipe_busy_poll;
+   pipe = files[1]->private_data;
+   pipe->pipe_ll_usec = pipe_busy_poll;
}
return error;
  }

You add a pipe field but the value is always based on the sysctl
so the information is redundant (barring a race condition on one
pipe write per sysctl update which is an irrelevant corner case). In
comparison, the network subsystem appears to be explicitly opt-in via
setsockopt(SO_BUSY_POLL) from a glance an

Re: [PATCH 00/10] steal tasks to improve CPU utilization

2018-11-02 Thread Subhra Mazumdar



On 10/22/18 7:59 AM, Steve Sistare wrote:

When a CPU has no more CFS tasks to run, and idle_balance() fails to
find a task, then attempt to steal a task from an overloaded CPU in the
same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
identify candidates.  To minimize search time, steal the first migratable
task that is found when the bitmap is traversed.  For fairness, search
for migratable tasks on an overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle.  idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy.  Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.

The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
reduce cache contention vs the usual bitmap when many threads concurrently
set, clear, and visit elements.


Is the bitmask saving much? I tried a simple stealing that just starts
searching the domain from the current cpu and steals a thread from the
first cpu that has more than one runnable thread. It seems to perform
similar to your patch.

hackbench on X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
groups(tasks)   baseline    %stdev  patch           %stdev
1(40 tasks) 0.5524  2.36    0.5522 (0.045%) 3.82
2(80 tasks) 0.6482  11.4    0.7241 (-11.7%) 20.34
4(160 tasks)    0.9756  0.95    0.8276 (15.1%) 5.8
8(320 tasks)    1.7699  1.62    1.6655 (5.9%) 1.57
16(640 tasks)   3.1018  0.77    2.9858 (3.74%) 1.4
32(1280 tasks)  5.565   0.62    5.3388 (4.1%) 0.72

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Oracle database OLTP, logging _enabled_

Users %speedup
20 1.2
40 -0.41
60 0.83
80 2.37
100 1.54
120 3.0
140 2.24
160 1.82
180 1.94
200 2.23
220 1.49

Below is the patch (not in best shape)

--->8

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f808ddf..1690451 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3540,6 +3540,7 @@ static inline unsigned long cfs_rq_load_avg(struct 
cfs_rq *cfs_rq)

 }

 static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
+static int my_idle_balance(struct rq *this_rq, struct rq_flags *rf);

 static inline unsigned long task_util(struct task_struct *p)
 {
@@ -6619,6 +6620,8 @@ done: __maybe_unused;

 idle:
    new_tasks = idle_balance(rq, rf);
+   if (new_tasks == 0)
+   new_tasks = my_idle_balance(rq, rf);

    /*
 * Because idle_balance() releases (and re-acquires) rq->lock, 
it is

@@ -8434,6 +8437,75 @@ static int should_we_balance(struct lb_env *env)
    return balance_cpu == env->dst_cpu;
 }

+int get_best_cpu(int this_cpu, struct sched_domain *sd)
+{
+   struct rq *this_rq, *rq;
+   int i;
+   int best_cpu = -1;
+
+   this_rq = cpu_rq(this_cpu);
+   for_each_cpu_wrap(i, sched_domain_span(sd), this_cpu) {
+   if (this_rq->nr_running > 0)
+   return (-1);
+   if (i == this_cpu)
+   continue;
+   rq = cpu_rq(i);
+   if (rq->nr_running <= 1)
+   continue;
+   best_cpu = i;
+   break;
+   }
+   return (best_cpu);
+}
+static int my_load_balance(int this_cpu, struct rq *this_rq,
+  struct sched_domain *sd, enum cpu_idle_type idle)
+{
+   int ld_moved = 0;
+   struct rq *busiest;
+   unsigned long flags;
+   struct task_struct *p = NULL;
+   struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+   int best_cpu;
+
+   struct lb_env env = {
+   .sd = sd,
+   .dst_cpu    = this_cpu,
+   .dst_rq = this_rq,
+   .dst_grpmask    = sched_group_span(sd->groups),
+   .idle   = idle,
+   .cpus   = cpus,
+   .tasks  = LIST_HEAD_INIT(env.tasks),
+   };
+
+   if (idle == CPU_NEWLY_IDLE)
+   env.dst_grpmask = NULL;
+
+   cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+
+   best_cpu = get_best_cpu(this_cpu, sd);
+
+   if (best_cpu >= 0)
+   busiest = cpu_rq(best_cpu);
+   else
+   goto out;
+
+   env.src_cpu = busiest->cpu;
+   env.src_rq = busiest;
+   raw_spin_lock_irqsave(&busiest->lock, flags);
+
+   p = detach_one_task(&env);
+   raw_spin_unlock(&busiest->lock);
+   if (p) {
+   attach_one_task(this_rq, p);
+   ld_moved++;
+   }
+   local_irq_restore(flags);
+
+out:
+   return ld_moved;
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is 

Re: [PATCH 00/10] steal tasks to improve CPU utilization

2018-11-02 Thread Subhra Mazumdar



On 10/22/18 7:59 AM, Steve Sistare wrote:

When a CPU has no more CFS tasks to run, and idle_balance() fails to
find a task, then attempt to steal a task from an overloaded CPU in the
same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
identify candidates.  To minimize search time, steal the first migratable
task that is found when the bitmap is traversed.  For fairness, search
for migratable tasks on an overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle.  idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy.  Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.

The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
reduce cache contention vs the usual bitmap when many threads concurrently
set, clear, and visit elements.


Is the bitmask saving much? I tried a simple stealing that just starts
searching the domain from the current cpu and steals a thread from the
first cpu that has more than one runnable thread. It seems to perform
similar to your patch.

hackbench on X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
    baseline    %stdev  patch %stdev
1(40 tasks) 0.5524  2.36    0.5522 (0.045%) 3.82
2(80 tasks) 0.6482  11.4    0.7241 (-11.7%) 20.34
4(160 tasks)    0.9756  0.95    0.8276 (15.1%) 5.8
8(320 tasks)    1.7699  1.62    1.6655 (5.9%) 1.57
16(640 tasks)   3.1018  0.77    2.9858 (3.74%) 1.4
32(1280 tasks)  5.565   0.62    5.3388 (4.1%) 0.72

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Oracle database OLTP, logging _enabled_

Users %speedup
20 1.2
40 -0.41
60 0.83
80 2.37
100 1.54
120 3.0
140 2.24
160 1.82
180 1.94
200 2.23
220 1.49

Below is the patch (not in best shape)

--->8

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f808ddf..1690451 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3540,6 +3540,7 @@ static inline unsigned long cfs_rq_load_avg(struct 
cfs_rq *cfs_rq)

 }

 static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
+static int my_idle_balance(struct rq *this_rq, struct rq_flags *rf);

 static inline unsigned long task_util(struct task_struct *p)
 {
@@ -6619,6 +6620,8 @@ done: __maybe_unused;

 idle:
    new_tasks = idle_balance(rq, rf);
+   if (new_tasks == 0)
+   new_tasks = my_idle_balance(rq, rf);

    /*
 * Because idle_balance() releases (and re-acquires) rq->lock, 
it is

@@ -8434,6 +8437,75 @@ static int should_we_balance(struct lb_env *env)
    return balance_cpu == env->dst_cpu;
 }

+int get_best_cpu(int this_cpu, struct sched_domain *sd)
+{
+   struct rq *this_rq, *rq;
+   int i;
+   int best_cpu = -1;
+
+   this_rq = cpu_rq(this_cpu);
+   for_each_cpu_wrap(i, sched_domain_span(sd), this_cpu) {
+   if (this_rq->nr_running > 0)
+   return (-1);
+   if (i == this_cpu)
+   continue;
+   rq = cpu_rq(i);
+   if (rq->nr_running <= 1)
+   continue;
+   best_cpu = i;
+   break;
+   }
+   return (best_cpu);
+}
+static int my_load_balance(int this_cpu, struct rq *this_rq,
+  struct sched_domain *sd, enum cpu_idle_type idle)
+{
+   int ld_moved = 0;
+   struct rq *busiest;
+   unsigned long flags;
+   struct task_struct *p = NULL;
+   struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+   int best_cpu;
+
+   struct lb_env env = {
+   .sd = sd,
+   .dst_cpu    = this_cpu,
+   .dst_rq = this_rq,
+   .dst_grpmask    = sched_group_span(sd->groups),
+   .idle   = idle,
+   .cpus   = cpus,
+   .tasks  = LIST_HEAD_INIT(env.tasks),
+   };
+
+   if (idle == CPU_NEWLY_IDLE)
+   env.dst_grpmask = NULL;
+
+   cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+
+   best_cpu = get_best_cpu(this_cpu, sd);
+
+   if (best_cpu >= 0)
+   busiest = cpu_rq(best_cpu);
+   else
+   goto out;
+
+   env.src_cpu = busiest->cpu;
+   env.src_rq = busiest;
+   raw_spin_lock_irqsave(&busiest->lock, flags);
+
+   p = detach_one_task(&env);
+   raw_spin_unlock(&busiest->lock);
+   if (p) {
+   attach_one_task(this_rq, p);
+   ld_moved++;
+   }
+   local_irq_restore(flags);
+
+out:
+   return ld_moved;
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is 

Re: [RFC 00/60] Coscheduling for Linux

2018-10-29 Thread Subhra Mazumdar



On 10/26/18 4:44 PM, Jan H. Schönherr wrote:

On 19/10/2018 02.26, Subhra Mazumdar wrote:

Hi Jan,

Hi. Sorry for the delay.


On 9/7/18 2:39 PM, Jan H. Schönherr wrote:

The collective context switch from one coscheduled set of tasks to another
-- while fast -- is not atomic. If a use-case needs the absolute guarantee
that all tasks of the previous set have stopped executing before any task
of the next set starts executing, an additional hand-shake/barrier needs to
be added.


Do you know how much the delay is? i.e., what is the overlap time when a
thread of the new group starts executing on one HT while a thread of another
group is still running on the other HT?

The delay is roughly equivalent to the IPI latency, if we're just talking
about coscheduling at SMT level: one sibling decides to schedule another
group, sends an IPI to the other sibling(s), and may already start
executing a task of that other group, before the IPI is received on the
other end.

Can you point to where the leader is sending the IPI to other siblings?

I did some experiments and the delay seems to be sub-microsecond. I ran 2
threads that just loop, in one cosched group, affinitized to the 2 HTs of a
core. Then another thread in a different cosched group starts running,
affinitized to the first HT of the same core. I timestamped just before
context_switch() in __schedule() for the threads switching from one group to
the other and from one group to idle. Following is what I get on cpus 1 and
45, which are siblings; cpu 1 is where the other thread preempts:

[  403.216625] cpu:45 sub1->idle:403216624579
[  403.238623] cpu:1 sub1->sub2:403238621585
[  403.238624] cpu:45 sub1->idle:403238621787
[  403.260619] cpu:1 sub1->sub2:403260619182
[  403.260620] cpu:45 sub1->idle:403260619413
[  403.282617] cpu:1 sub1->sub2:403282617157
[  403.282618] cpu:45 sub1->idle:403282617317
..

Not sure why the first switch to idle on cpu 45 happened. But from then
onwards the difference in timestamps is less than a microsecond. This is just
a crude way to get a sense of the delay; it may not be exact.
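
Something along these lines would produce the timestamps above (a sketch, not
the exact code used): a trace_printk() placed in __schedule() just before the
call to context_switch(), e.g.

        /* sketch only: prev/next are the outgoing and incoming tasks */
        trace_printk("cpu:%d %s->%s:%llu\n", smp_processor_id(),
                     prev->comm, next->comm,
                     (unsigned long long)local_clock());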

Thanks,
Subhra


Now, there are some things that may delay processing an IPI, but in those
cases the target CPU isn't executing user code.

I've yet to produce some current numbers for SMT-only coscheduling. An
older ballpark number I have is about 2 microseconds for the collective
context switch of one hierarchy level, but take that with a grain of salt.

Regards
Jan



Re: [RFC 00/60] Coscheduling for Linux

2018-10-26 Thread Subhra Mazumdar





D) What can I *not* do with this?
-

Besides the missing load-balancing within coscheduled task-groups, this
implementation has the following properties, which might be considered
short-comings.

This particular implementation focuses on SCHED_OTHER tasks managed by CFS
and allows coscheduling them. Interrupts as well as tasks in higher
scheduling classes are currently out-of-scope: they are assumed to be
negligible interruptions as far as coscheduling is concerned and they do
*not* cause a preemption of a whole group. This implementation could be
extended to cover higher scheduling classes. Interrupts, however, are an
orthogonal issue.

The collective context switch from one coscheduled set of tasks to another
-- while fast -- is not atomic. If a use-case needs the absolute guarantee
that all tasks of the previous set have stopped executing before any task
of the next set starts executing, an additional hand-shake/barrier needs to
be added.


The leader doesn't kick the other cpus _immediately_ to switch to a
different cosched group. So threads from previous cosched group will keep
running in other HTs till their sched_slice is over (in worst case). This
can still keep the window of L1TF vulnerability open?


Re: [RFC 00/60] Coscheduling for Linux

2018-10-18 Thread Subhra Mazumdar

Hi Jan,

On 9/7/18 2:39 PM, Jan H. Schönherr wrote:

The collective context switch from one coscheduled set of tasks to another
-- while fast -- is not atomic. If a use-case needs the absolute guarantee
that all tasks of the previous set have stopped executing before any task
of the next set starts executing, an additional hand-shake/barrier needs to
be added.


Do you know how much the delay is? i.e., what is the overlap time when a
thread of the new group starts executing on one HT while a thread of another
group is still running on the other HT?

Thanks,
Subhra


Re: Gang scheduling

2018-10-15 Thread Subhra Mazumdar




On 10/12/2018 11:01 AM, Tim Chen wrote:

On 10/10/2018 05:09 PM, Subhra Mazumdar wrote:

Hi,

I was following the Coscheduling patch discussion on lkml and Peter mentioned 
he had a patch series. I found the following on github.

https://github.com/pdxChen/gang/commits/sched_1.23-loadbal

I would like to test this with KVMs. Are the commits from 38d5acb to f019876 
sufficient? Also is there any documentation on how to use it (any knobs I need
to turn on for gang scheduling to happen?) or is it enabled by default for KVMs?

Thanks,
Subhra


I would suggest you try
https://github.com/pdxChen/gang/tree/sched_1.23-base
without the load balancing part of gang scheduling.
It is enabled by default for KVMs.

Due to the constant change in gang scheduling status of the QEMU thread
depending on whether vcpu is loaded or unloaded,
the load balancing part of the code doesn't work very well.

Thanks. Does this mean each vcpu thread needs to be affinitized to a cpu?


The current version of the code needs to be optimized further.  Right now
the QEMU thread constantly does vcpu load and unload during VM enter and exit.
We gang schedule only after vcpu load and register the thread to be gang
scheduled.  When we do vcpu unload, the thread is removed from the set
to be gang scheduled.  Each time there's a synchronization with the
sibling thread that's expensive.

However, for QEMU, there's a one to one correspondence between the QEMU
thread and vcpu.  So we don't have to change the gang scheduling status
for such threads to avoid the churn and sync with the sibling. That should
be helpful for VM with lots of I/O causing constant VM exits.  We're
still working on this optimization.  And the load balancing should be
better after this change.

Tim



Also FYI I get the following error while building sched_1.23-base:

ERROR: "sched_ttwu_pending" [arch/x86/kvm/kvm-intel.ko] undefined!
scripts/Makefile.modpost:92: recipe for target '__modpost' failed

Adding the following fixed it:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 46807dc..302b77d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -21,6 +21,7 @@
 #include 

 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+EXPORT_SYMBOL_GPL(sched_ttwu_pending);

 #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
 /*


Gang scheduling

2018-10-10 Thread Subhra Mazumdar

Hi,

I was following the Coscheduling patch discussion on lkml and Peter 
mentioned he had a patch series. I found the following on github.


https://github.com/pdxChen/gang/commits/sched_1.23-loadbal

I would like to test this with KVMs. Are the commits from 38d5acb to 
f019876 sufficient? Also is there any documentation on how to use it (any
knobs I need to turn on for gang scheduling to happen?) or is it enabled 
by default for KVMs?


Thanks,
Subhra



Re: [RFC 00/60] Coscheduling for Linux

2018-09-27 Thread Subhra Mazumdar




On 09/26/2018 02:58 AM, Jan H. Schönherr wrote:

On 09/17/2018 02:25 PM, Peter Zijlstra wrote:

On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:


Assuming, there is a cgroup-less solution that can prevent simultaneous
execution of tasks on a core, when they're not supposed to. How would you
tell the scheduler, which tasks these are?

Specifically for L1TF I hooked into/extended KVM's preempt_notifier
registration interface, which tells us which tasks are VCPUs and to
which VM they belong.

But if we want to actually expose this to userspace, we can either do a
prctl() or extend struct sched_attr.

Both, Peter and Subhra, seem to prefer an interface different than cgroups
to specify what to coschedule.

Can you provide some extra motivation for me, why you feel that way?
(ignoring the current scalability issues with the cpu group controller)


After all, cgroups were designed to create arbitrary groups of tasks and
to attach functionality to those groups.

If we were to introduce a different interface to control that, we'd need to
introduce a whole new group concept, so that you make tasks part of some
group while at the same time preventing unauthorized tasks from joining a
group.


I currently don't see any wins, just a loss in flexibility.

Regards
Jan

I think cgroups will get the job done for any use case. But we have,
e.g., affinity control via both sched_setaffinity and cgroup cpusets. It
will be good to have an alternative way to specify co-scheduling too, for
those who don't want to use cgroups for some reason. It can be added later
on, though; only how one will override the other will need to be sorted out.


Re: [RFC 00/60] Coscheduling for Linux

2018-09-27 Thread Subhra Mazumdar




On 09/24/2018 08:43 AM, Jan H. Schönherr wrote:

On 09/19/2018 11:53 PM, Subhra Mazumdar wrote:

Can we have a more generic interface, like specifying a set of task ids
to be co-scheduled with a particular level rather than tying this with
cgroups? KVMs may not always run with cgroups and there might be other
use cases where we might want co-scheduling that doesn't relate to
cgroups.

Currently: no.

At this point the implementation is tightly coupled to the cpu cgroup
controller. This *might* change, if the task group optimizations mentioned
in other parts of this e-mail thread are done, as I think, that it would
decouple the various mechanisms.

That said, what if you were able to disable the "group-based fairness"
aspect of the cpu cgroup controller? Then you would be able to control
just the coscheduling aspects on their own. Would that satisfy the use
case you have in mind?

Regards
Jan

Yes, that will suffice for the use case. We wish to experiment at some point
with co-scheduling of certain worker threads in DB parallel query and see
if there is any benefit.

Thanks,
Subhra


[RFC PATCH v2 1/1] pipe: busy wait for pipe

2018-09-25 Thread subhra mazumdar
Introduce a pipe_ll_usec field for pipes that indicates the number of
microseconds a thread should spin if the pipe is empty or full before
sleeping. This is similar to network sockets. Workloads like hackbench in
pipe mode benefit significantly from this by avoiding the sleep and wakeup
overhead. Other similar use cases can benefit. A tunable pipe_busy_poll is
introduced to enable or disable busy waiting via /proc. Its value specifies
the amount of spin in microseconds. The default value is 0, indicating no
spin.

Signed-off-by: subhra mazumdar 
---
 fs/pipe.c | 12 
 include/linux/pipe_fs_i.h |  2 ++
 kernel/sysctl.c   |  7 +++
 3 files changed, 21 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c..35d805b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -26,6 +26,7 @@
 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -40,6 +41,7 @@ unsigned int pipe_max_size = 1048576;
  */
 unsigned long pipe_user_pages_hard;
 unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
+unsigned int pipe_busy_poll;
 
 /*
  * We use a start+len construction, which provides full use of the 
@@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 void pipe_wait(struct pipe_inode_info *pipe)
 {
DEFINE_WAIT(wait);
+   u64 start;
 
/*
 * Pipes are system-local resources, so sleeping on them
@@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe)
 */
prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
pipe_unlock(pipe);
+   start = local_clock();
+   while (current->state != TASK_RUNNING &&
+  ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
+   cpu_relax();
schedule();
finish_wait(&pipe->wait, &wait);
pipe_lock(pipe);
@@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
struct file *files[2];
int fd[2];
int error;
+   struct pipe_inode_info *pipe;
 
error = __do_pipe_flags(fd, files, flags);
if (!error) {
@@ -838,6 +846,10 @@ static int do_pipe2(int __user *fildes, int flags)
fd_install(fd[0], files[0]);
fd_install(fd[1], files[1]);
}
+   pipe = files[0]->private_data;
+   pipe->pipe_ll_usec = pipe_busy_poll;
+   pipe = files[1]->private_data;
+   pipe->pipe_ll_usec = pipe_busy_poll;
}
return error;
 }
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 5a3bb3b..73267d2 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -55,6 +55,7 @@ struct pipe_inode_info {
unsigned int waiting_writers;
unsigned int r_counter;
unsigned int w_counter;
+   unsigned int pipe_ll_usec;
struct page *tmp_page;
struct fasync_struct *fasync_readers;
struct fasync_struct *fasync_writers;
@@ -170,6 +171,7 @@ void pipe_double_lock(struct pipe_inode_info *, struct 
pipe_inode_info *);
 extern unsigned int pipe_max_size;
 extern unsigned long pipe_user_pages_hard;
 extern unsigned long pipe_user_pages_soft;
+extern unsigned int pipe_busy_poll;
 
 /* Drop the inode semaphore and wait for a pipe event, atomically */
 void pipe_wait(struct pipe_inode_info *pipe);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cc02050..0e9ce0c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1863,6 +1863,13 @@ static struct ctl_table fs_table[] = {
.proc_handler   = proc_doulongvec_minmax,
},
{
+   .procname   = "pipe-busy-poll",
+   .data   = &pipe_busy_poll,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   },
+   {
.procname   = "mount-max",
.data   = &sysctl_mount_max,
.maxlen = sizeof(unsigned int),
-- 
2.9.3



[RFC PATCH v2 0/1] Pipe busy wait

2018-09-25 Thread subhra mazumdar
This patch introduces busy waiting for pipes, similar to network sockets.
When a pipe is full or empty, a thread busy waits for some microseconds
before sleeping. This avoids the sleep and wakeup overhead and improves
performance when the wakeup happens very quickly. It uses a new field in
pipe_inode_info to decide how long to spin. As different workloads on
different systems can have different optimum spin times, it is configurable
via a tunable that can be set via /proc. The default value is 0, which
indicates no spin.
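
For example, the spin time could be set to 10 microseconds from user space
like this (a sketch; the path /proc/sys/fs/pipe-busy-poll is assumed from the
fs_table entry added by the patch):

#include <stdio.h>

int main(void)
{
        /* Write the desired spin time (in microseconds) to the new sysctl. */
        FILE *f = fopen("/proc/sys/fs/pipe-busy-poll", "w");

        if (!f) {
                perror("pipe-busy-poll");
                return 1;
        }
        fprintf(f, "10\n");
        fclose(f);
        return 0;
}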

Following are hackbench (process mode, using pipes) and Unixbench performance
numbers with the baseline and a suitable spin time.

Hackbench on 2 socket, 36 core and 72 threads Intel x86 machine
(lower is better). It also shows the usr+sys time as shown by time:
groups  baseline  usr+sys   patch(spin=10us)  usr+sys
1       0.6742    17.0212   0.6842 (-1.48%)   20.2765
2       0.7794    35.1954   0.7116 (8.7%)     38.1318
4       0.9744    55.6513   0.8247 (15.36%)   50.301
8       2.0382    129.519   1.572 (22.87%)    103.722
16      5.5712    427.096   2.7989 (49.76%)   194.162
24      7.9314    566.327   3.962 (50.05%)    278.368

Unixbench on 2 socket, 36 core and 72 threads Intel x86 machine (higher is
better). It shows pipe-based context switching test improving by 107.7% for
1 copy and by 51.3% for 72 parallel copies. This is expected as context
switch overhead is avoided altogether by busy wait.

72 CPUs in system; running 1 parallel copy of tests
System Benchmarks Index Values          baseline    patch(spin=10us)
Dhrystone 2 using register variables    2837.5      2842.0
Double-Precision Whetstone              753.3       753.5
Execl Throughput                        1056.8      1079.2
File Copy 1024 bufsize 2000 maxblocks   1794.0      1805.4
File Copy 256 bufsize 500 maxblocks     .4          1117.6
File Copy 4096 bufsize 8000 maxblocks   4136.7      4091.7
Pipe Throughput                         752.9       753.3
Pipe-based Context Switching            372.2       772.9
Process Creation                        840.1       847.6
Shell Scripts (1 concurrent)            1771.0      1747.5
Shell Scripts (8 concurrent)            7316.6      7308.3
System Call Overhead                    578.0       578.0

System Benchmarks Index Score           1337.8      1424.0

72 CPUs in system; running 72 parallel copies of tests
System Benchmarks Index Values          baseline    patch(spin=10us)
Dhrystone 2 using register variables    112849.7    112429.8
Double-Precision Whetstone              44971.1     44929.7
Execl Throughput                        11257.6     11258.7
File Copy 1024 bufsize 2000 maxblocks   1514.6      1471.3
File Copy 256 bufsize 500 maxblocks     919.8       917.4
File Copy 4096 bufsize 8000 maxblocks   3543.8      3355.5
Pipe Throughput                         34530.6     34492.9
Pipe-based Context Switching            11298.2     17089.9
Process Creation                        9422.5      9408.1
Shell Scripts (1 concurrent)            38764.9     38610.4
Shell Scripts (8 concurrent)            32570.5     32458.8
System Call Overhead                    2607.4      2672.7

System Benchmarks Index Score           11077.2     11393.5

Changes from v1->v2:
-Removed preemption disable in the busy wait loop
-Changed busy spin condition to TASK_RUNNING state of the current thread
-Added usr+sys time for hackbench runs as shown by time command

subhra mazumdar (1):
  pipe: busy wait for pipe

 fs/pipe.c | 12 
 include/linux/pipe_fs_i.h |  2 ++
 kernel/sysctl.c   |  7 +++
 3 files changed, 21 insertions(+)

-- 
2.9.3


