Re: [RFC PATCH 3/9] sched: add sched feature to disable idle core search
On 9/5/19 3:17 AM, Patrick Bellasi wrote:
> On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...
>> Add a new sched feature SIS_CORE to have an option to disable idle core
>> search (select_idle_core).
>>
>> Signed-off-by: subhra mazumdar
>> ---
>>  kernel/sched/features.h | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>> index 858589b..de4d506 100644
>> --- a/kernel/sched/features.h
>> +++ b/kernel/sched/features.h
>> @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>>   */
>>  SCHED_FEAT(SIS_AVG_CPU, false)
>>  SCHED_FEAT(SIS_PROP, true)
>> +SCHED_FEAT(SIS_CORE, true)
>>  
>>  /*
>>   * Issue a WARN when we do multiple update_rq_clock() calls
>
> Why do we need a sched_feature? If you think there are systems in which
> the usage of latency-nice does not make sense in "Select Idle Sibling",
> then we should probably better add a new Kconfig option.

This is not for latency-nice but to be able to disable a different aspect
of the scheduler, i.e. searching for idle cores. This can be made part of
latency-nice (i.e. not do idle core search if latency-nice is below a
certain value), but even then having a feature to disable it doesn't hurt.

> If that's the case, you can probably use the init/Kconfig's "Scheduler
> features" section, recently added by:
>
>   commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
>
> Best,
> Patrick
Re: [RFC PATCH 7/9] sched: search SMT before LLC domain
On 9/5/19 2:31 AM, Peter Zijlstra wrote:
> On Fri, Aug 30, 2019 at 10:49:42AM -0700, subhra mazumdar wrote:
>> Search SMT siblings before all CPUs in LLC domain for idle CPU. This
>> helps in L1 cache locality.
>> ---
>>  kernel/sched/fair.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8856503..94dd4a32 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>  		return i;
>>  	}
>>
>> -	i = select_idle_cpu(p, sd, target);
>> +	i = select_idle_smt(p, target);
>>  	if ((unsigned)i < nr_cpumask_bits)
>>  		return i;
>>
>> -	i = select_idle_smt(p, target);
>> +	i = select_idle_cpu(p, sd, target);
>>  	if ((unsigned)i < nr_cpumask_bits)
>>  		return i;
>
> But it is absolutely conceptually wrong. An idle core is a much better
> target than an idle sibling.

This is select_idle_smt, not select_idle_core.
[RFC PATCH 3/9] sched: add sched feature to disable idle core search
Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3
[RFC PATCH 6/9] x86/smpboot: Optimize cpumask_weight_sibling macro for x86
Use a per-CPU variable for the cpumask_weight_sibling macro on x86 for
fast lookup in select_idle_cpu. This avoids reading multiple cache lines
on systems with large numbers of CPUs, where the bitmask can span
multiple cache lines. Even if the bitmask spans only one cache line, this
avoids looping through it to count the bits and gets the weight in O(1).

Signed-off-by: subhra mazumdar
---
 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 ++++++++++++++++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)	(per_cpu(cpumask_weight_sibling, cpu))
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()		(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..57ad88d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,9 @@
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +523,8 @@ void set_cpu_sibling_map(int cpu)
 
 	if (!has_mp) {
 		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+		per_cpu(cpumask_weight_sibling, cpu) =
+				cpumask_weight(topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
 		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
 		c->booted_cores = 1;
@@ -529,8 +534,13 @@ void set_cpu_sibling_map(int cpu)
 	for_each_cpu(i, cpu_sibling_setup_mask) {
 		o = &cpu_data(i);
 
-		if ((i == cpu) || (has_smt && match_smt(c, o)))
+		if ((i == cpu) || (has_smt && match_smt(c, o))) {
 			link_mask(topology_sibling_cpumask, cpu, i);
+			per_cpu(cpumask_weight_sibling, cpu) =
+			    cpumask_weight(topology_sibling_cpumask(cpu));
+			per_cpu(cpumask_weight_sibling, i) =
+			    cpumask_weight(topology_sibling_cpumask(i));
+		}
 
 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
 	else
 		physid_set_mask_of_physid(0, &phys_cpu_present_map);
 	cpumask_set_cpu(0, topology_sibling_cpumask(0));
+	per_cpu(cpumask_weight_sibling, 0) =
+			cpumask_weight(topology_sibling_cpumask(0));
 	cpumask_set_cpu(0, topology_core_cpumask(0));
 }
 
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
 
 	for_each_cpu(sibling, topology_core_cpumask(cpu)) {
 		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+		per_cpu(cpumask_weight_sibling, sibling) =
+		    cpumask_weight(topology_sibling_cpumask(sibling));
 		/*
 		 * last thread sibling in this cpu core going down
 		 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
+	per_cpu(cpumask_weight_sibling, cpu) = 0;
 	cpumask_clear(topology_core_cpumask(cpu));
 	c->cpu_core_id = 0;
 	c->booted_cores = 0;
-- 
2.9.3
[RFC PATCH 5/9] sched: Define macro for number of CPUs in core
Introduce macro topology_sibling_weight for the number of sibling CPUs in
a core and use it in select_idle_cpu.

Signed-off-by: subhra mazumdar
---
 include/linux/topology.h | 4 ++++
 kernel/sched/fair.c      | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)	cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_weight
+#define topology_sibling_weight(cpu)	\
+		cpumask_weight(topology_sibling_cpumask(cpu))
+#endif
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)	cpumask_of(cpu)
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23ec9c6..8856503 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6205,7 +6205,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		return -1;
 
 	if (sched_feat(SIS_PROP)) {
-		floor = cpumask_weight(topology_sibling_cpumask(target));
+		floor = topology_sibling_weight(target);
 		if (floor < 2)
 			floor = 2;
 		nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
-- 
2.9.3
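The generic fallback above recomputes the weight with a popcount over the
sibling mask on every call; the x86 patch later in the series caches it
per CPU instead. A minimal user-space sketch of the two lookups (names and
the fixed-size array are illustrative, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8

/* generic fallback: count bits in the sibling mask on every lookup */
static int mask_weight(uint64_t mask)
{
	int w = 0;

	while (mask) {
		mask &= mask - 1;	/* clear lowest set bit */
		w++;
	}
	return w;
}

/* cached variant: weight maintained when the topology changes, O(1) reads */
static int sibling_weight_cache[NR_CPUS];

static void update_sibling_weight(int cpu, uint64_t sibling_mask)
{
	sibling_weight_cache[cpu] = mask_weight(sibling_mask);
}

static int topology_sibling_weight(int cpu)
{
	return sibling_weight_cache[cpu];
}
```

With SMT2, update_sibling_weight(0, 0x3) makes topology_sibling_weight(0)
return 2 without walking the mask again, which is the point of the x86
optimization in patch 6/9.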
[RFC PATCH 0/9] Task latency-nice
Introduce a new per-task property, latency-nice, for controlling
scalability in the scheduler's idle CPU search path. Valid latency-nice
values are from 1 to 100, indicating a 1% to 100% search of the LLC domain
in select_idle_cpu. A new CPU cgroup file, cpu.latency-nice, is added as
an interface to set and get it. All tasks in the same cgroup share the
same latency-nice value. Using a lower latency-nice value can help
latency-intolerant tasks, e.g. very short running OLTP threads, where the
full LLC search cost can be significant compared to the run time of the
threads. The default latency-nice value is 5.

In addition to latency-nice, the series adds a new sched feature SIS_CORE
to be able to disable idle core search altogether, which is costly and
hurts more than it helps for short running workloads.

Finally, it introduces a new per-cpu variable next_cpu to track the limit
of the search so that every search starts from where the previous one
ended. This rotating search window over the cpus in the LLC domain ensures
that idle cpus are eventually found in case of high load.
Uperf pingpong on a 2 socket, 44 core and 88 thread Intel x86 machine with
message size = 8k (higher is better):

threads   baseline   latency-nice=5,SIS_CORE   latency-nice=5,NO_SIS_CORE
8         64.66      64.38  (-0.43%)           64.79  (0.2%)
16        123.34     122.88 (-0.37%)           125.87 (2.05%)
32        215.18     215.55 (0.17%)            247.77 (15.15%)
48        278.56     321.6  (15.45%)           321.2  (15.3%)
64        259.99     319.45 (22.87%)           333.95 (28.44%)
128       431.1      437.69 (1.53%)            431.09 (0%)

subhra mazumdar (9):
  sched,cgroup: Add interface for latency-nice
  sched: add search limit as per latency-nice
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  sched: Define macro for number of CPUs in core
  x86/smpboot: Optimize cpumask_weight_sibling macro for x86
  sched: search SMT before LLC domain
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread

 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 +++++++++-
 include/linux/sched.h           |  1 +
 include/linux/topology.h        |  4 +++
 kernel/sched/core.c             | 42 ++++++++++++++++++++++-
 kernel/sched/fair.c             | 34 ++++++++++++++-----
 kernel/sched/features.h         |  1 +
 kernel/sched/sched.h            |  9 +++++
 9 files changed, 97 insertions(+), 13 deletions(-)

-- 
2.9.3
[RFC PATCH 7/9] sched: search SMT before LLC domain
Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
in L1 cache locality.
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8856503..94dd4a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		return i;
 	}
 
-	i = select_idle_cpu(p, sd, target);
+	i = select_idle_smt(p, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-	i = select_idle_smt(p, target);
+	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
-- 
2.9.3
[RFC PATCH 8/9] sched: introduce per-cpu var next_cpu to track search limit
Introduce a per-cpu variable to track the limit up to which the idle cpu
search was done in select_idle_cpu(). This will help to start the search
next time from there. This is necessary for rotating the search window
over the entire LLC domain.

Signed-off-by: subhra mazumdar
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 47969bc..5862d54 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
+		per_cpu(next_cpu, i) = -1;
 		rq = cpu_rq(i);
 		raw_spin_lock_init(&rq->lock);
 		rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 365c928..cca2b09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1002,6 +1002,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		this_cpu_ptr(&runqueues)
-- 
2.9.3
[RFC PATCH 4/9] sched: SIS_CORE to disable idle core search
Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, and removing it
improves throughput. There are also workloads where disabling it can hurt
latency, so we need to have an option.

Signed-off-by: subhra mazumdar
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c31082d..23ec9c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+	if (sched_feat(SIS_CORE)) {
+		i = select_idle_core(p, sd, target);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
 
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3
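A toy user-space model of the resulting fast-path order (with patch 7/9's
swap of the SMT and LLC searches also applied); the names and the `*_hit`
flags are illustrative stand-ins for "that search found an idle CPU", not
the kernel code itself:

```c
#include <assert.h>

enum stage { STAGE_CORE, STAGE_SMT, STAGE_LLC, STAGE_NONE };

/*
 * Model of select_idle_sibling()'s search order with the SIS_CORE
 * toggle: the costly idle-core scan first (skipped when SIS_CORE is
 * off), then the SMT siblings of the target, then the bounded
 * LLC-wide scan.
 */
static enum stage pick_stage(int sis_core, int core_hit, int smt_hit,
			     int llc_hit)
{
	if (sis_core && core_hit)
		return STAGE_CORE;	/* whole idle core found */
	if (smt_hit)
		return STAGE_SMT;	/* idle HT sibling, good L1 locality */
	if (llc_hit)
		return STAGE_LLC;	/* idle CPU somewhere in the LLC */
	return STAGE_NONE;		/* fall back to the target CPU */
}
```

The point of the toggle is visible here: with NO_SIS_CORE the core scan is
skipped entirely, even when an idle core exists, trading placement quality
for a cheaper wakeup path.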
[RFC PATCH 2/9] sched: add search limit as per latency-nice
Put upper and lower limits on the CPU search in select_idle_cpu. The lower
limit is set to the number of CPUs in a core, while the upper limit is
derived from the latency-nice of the thread. This ensures that on any
architecture we will usually search beyond a core. Changing the
latency-nice value will change the search cost, making it appropriate for
a given workload.

Signed-off-by: subhra mazumdar
---
 kernel/sched/fair.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b08d00c..c31082d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, floor, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6205,11 +6205,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		return -1;
 
 	if (sched_feat(SIS_PROP)) {
-		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
-			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+		floor = cpumask_weight(topology_sibling_cpumask(target));
+		if (floor < 2)
+			floor = 2;
+		nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
+		if (nr < floor)
+			nr = floor;
 	}
 
 	time = local_clock();
-- 
2.9.3
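The resulting clamp can be checked in isolation. A small sketch of the
limit computation, assuming LATENCY_NICE_MAX = 100 as defined in patch
1/9 (user-space mirror, not the kernel function):

```c
#include <assert.h>

#define LATENCY_NICE_MAX	100

/*
 * Mirror of the SIS_PROP limit from this patch: search at least one
 * core's worth of CPUs (minimum 2), and at most latency_nice percent
 * of the LLC domain's span.
 */
static int search_limit(int latency_nice, int span_weight, int sibling_weight)
{
	int floor = sibling_weight;
	int nr;

	if (floor < 2)
		floor = 2;
	nr = (latency_nice * span_weight) / LATENCY_NICE_MAX;
	if (nr < floor)
		nr = floor;
	return nr;
}
```

For example, with the default latency-nice of 5 and an 88-CPU span,
search_limit(5, 88, 2) caps the scan at 4 CPUs, while latency-nice 100
scans the whole span.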
[RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice
Add a cgroup interface for latency-nice. Each CPU cgroup adds a new file
"latency-nice" which is shared by all the threads in that cgroup.

Signed-off-by: subhra mazumdar
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  1 +
 kernel/sched/sched.h  |  8 ++++++++
 4 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b4a79c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -631,6 +631,7 @@ struct task_struct {
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
+	u64				latency_nice;
 
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..47969bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5976,6 +5976,7 @@ void __init sched_init(void)
 		init_dl_rq(&rq->dl);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
+		root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
 		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
 		/*
@@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	 */
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
+	tsk->latency_nice = tg->latency_nice;
 	tg = autogroup_task_group(tsk, tg);
 	tsk->sched_task_group = tg;
 
@@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return tg->latency_nice;
+}
+
+static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
+				      struct cftype *cft, u64 latency_nice)
+{
+	struct task_group *tg = css_tg(css);
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
+		return -ERANGE;
+
+	tg->latency_nice = latency_nice;
+
+	css_task_iter_start(css, 0, &it);
+	while ((p = css_task_iter_next(&it)))
+		p->latency_nice = latency_nice;
+	css_task_iter_end(&it);
+
+	return 0;
+}
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+	{
+		.name = "latency-nice",
+		.read_u64 = cpu_latency_nice_read_u64,
+		.write_u64 = cpu_latency_nice_write_u64,
+	},
 	{ }	/* Terminate */
 };
 
@@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
 		.write = cpu_max_write,
 	},
 #endif
+	{
+		.name = "latency-nice",
+		.read_u64 = cpu_latency_nice_read_u64,
+		.write_u64 = cpu_latency_nice_write_u64,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b08d00c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		goto err;
 
 	tg->shares = NICE_0_LOAD;
+	tg->latency_nice = LATENCY_NICE_DEFAULT;
 
 	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..365c928 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
 #define NICE_0_LOAD		(1L << NICE_0_LOAD_SHIFT)
 
 /*
+ * Latency-nice default value
+ */
+#define	LATENCY_NICE_DEFAULT	5
+#define	LATENCY_NICE_MIN	1
+#define	LATENCY_NICE_MAX	100
+
+/*
  * Single value that decides SCHED_DEADLINE internal math precision.
  * 10 -> just above 1us
  * 9  -> just above 0.5us
@@ -362,6 +369,7 @@ struct cfs_bandwidth {
 /* Task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
+	u64			latency_nice;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
-- 
2.9.3
[RFC PATCH 9/9] sched: rotate the cpu search window for better spread
Rotate the cpu search window for a better spread of threads. This will
ensure an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar
---
 kernel/sched/fair.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94dd4a32..7419b47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, floor, nr = INT_MAX;
+	int cpu, floor, target_tmp, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6213,9 +6213,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			nr = floor;
 	}
 
+	if (per_cpu(next_cpu, target) != -1)
+		target_tmp = per_cpu(next_cpu, target);
+	else
+		target_tmp = target;
+
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+		per_cpu(next_cpu, target) = cpu;
 		if (!--nr)
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3
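A user-space sketch of the rotating window, assuming dense CPU numbering
and an idle[] array standing in for the runqueue checks (all names here
are illustrative, not the kernel code):

```c
#include <assert.h>
#include <limits.h>

#define SPAN 4

static int next_cpu = -1;	/* models per_cpu(next_cpu, target) */
static int idle[SPAN];		/* models available_idle_cpu() */

static void set_idle(int cpu, int v)
{
	idle[cpu] = v;
}

/*
 * Wrapped scan over SPAN cpus starting where the previous scan left
 * off, with an iteration budget nr; the stop point is remembered so
 * the next call continues from there instead of restarting at the
 * same low-numbered cpus.
 */
static int rotate_search(int nr)
{
	int start = (next_cpu != -1) ? next_cpu : 0;
	int i;

	for (i = 0; i < SPAN; i++) {
		int cpu = (start + i) % SPAN;

		next_cpu = cpu;
		if (!--nr)
			return -1;	/* budget exhausted */
		if (idle[cpu])
			return cpu;
	}
	return -1;
}
```

Two consecutive searches resume from the remembered position, so under
high load wakeups spread across the span instead of piling onto the first
few cpus of the domain.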
Panic on v5.3-rc4
I am getting the following panic during boot of tag v5.3-rc4 of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git. I don't
see the panic on tag v5.2 on the same rig. Is it a bug, or did something
legitimately change?

Thanks,
Subhra

[  147.184948] dracut Warning: No root device "block:/dev/mapper/vg_paex623-lv_root" found
[  147.282665] dracut Warning: LVM vg_paex623/lv_swap not found
[  147.354854] dracut Warning: LVM vg_paex623/lv_root not found
[  147.432099] dracut Warning: Boot has failed. To debug this issue add "rdshell" to the kernel command line.
[  147.549737] dracut Warning: Signal caught!
[  147.600145] dracut Warning: LVM vg_paex623/lv_swap not found
[  147.670692] dracut Warning: LVM vg_paex623/lv_root not found
[  147.738593] dracut Warning: Boot has failed. To debug this issue add "rdshell" to the kernel command line.
[  147.856206] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0100
[  147.947859] CPU: 3 PID: 1 Comm: init Not tainted 5.3.0-rc4latency_nice_BL #3
[  148.032225] Hardware name: Oracle Corporation ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[  148.149879] Call Trace:
[  148.179117]  dump_stack+0x5c/0x7b
[  148.218760]  panic+0xfe/0x2e2
[  148.254242]  do_exit+0xbd8/0xbe0
[  148.292842]  do_group_exit+0x3a/0xa0
[  148.335597]  __x64_sys_exit_group+0x14/0x20
[  148.385644]  do_syscall_64+0x5b/0x1d0
[  148.429448]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  148.489900] RIP: 0033:0x3a5ccad018
[  148.530582] Code: Bad RIP value.
[  148.569178] RSP: 002b:7ffc147a5b48 EFLAGS: 0246 ORIG_RAX: 00e7
[  148.659791] RAX: ffda RBX: 0004 RCX: 003a5ccad018
[  148.745197] RDX: 0001 RSI: 003c RDI: 0001
[  148.830605] RBP: R08: 00e7 R09: ffa8
[  148.916011] R10: 0001 R11: 0246 R12: 00401fb0
[  149.001414] R13: 7ffc147a5e20 R14: R15:
[  149.086929] Kernel Offset: disabled
[  149.132815] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0100 ]---
Re: [RFC PATCH 3/3] sched: introduce tunables to control soft affinity
On 7/18/19 3:38 PM, Srikar Dronamraju wrote:
> * subhra mazumdar [2019-06-26 15:47:18]:
>
>> For different workloads the optimal "softness" of soft affinity can be
>> different. Introduce tunables sched_allowed and sched_preferred that
>> can be tuned via /proc. This allows choosing at what utilization
>> difference the scheduler will choose cpus_allowed over cpus_preferred
>> in the first level of search. Depending on the extent of data sharing,
>> the cache coherency overhead of the system etc., the optimal point may
>> vary.
>>
>> Signed-off-by: subhra mazumdar
>> ---
>
> Correct me, but this patchset only seems to be concentrated on the
> wakeup path; I don't see any changes in the regular load balancer or
> the numa-balancer. If the system is loaded or tasks are CPU intensive,
> then wouldn't these tasks be moved to cpus_allowed instead of
> cpus_preferred, and hence break this soft affinity?

The new-idle path is purposefully unchanged. If threads get stolen to the
allowed set from the preferred set, that's intended; together with the
enqueue side it will achieve the softness of affinity.
Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs
On 7/18/19 5:07 PM, Peter Zijlstra wrote:
> On Wed, Jul 17, 2019 at 08:31:25AM +0530, Subhra Mazumdar wrote:
>> On 7/2/19 10:58 PM, Peter Zijlstra wrote:
>>> On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:
>>>> The soft affinity CPUs present in the cpumask cpus_preferred are
>>>> used by the scheduler in two levels of search. The first is in
>>>> determining wake affine, which chooses the LLC domain, and the
>>>> second is while searching for idle CPUs in the LLC domain. In the
>>>> first level it uses cpus_preferred to prune the search space. In the
>>>> second level it first searches cpus_preferred and then cpus_allowed.
>>>> Using the affinity_unequal flag it breaks out early to avoid any
>>>> overhead in the scheduler fast path when soft affinity is not used.
>>>> This only changes the wake up path of the scheduler; the idle
>>>> balancing is unchanged; together they achieve the "softness" of
>>>> scheduling.
>>>
>>> I really dislike this implementation. I thought the idea was to
>>> remain work conserving (in so far as that we're that anyway), so
>>> changing select_idle_sibling() doesn't make sense to me. If there is
>>> idle, we use it. Same for newidle; which you already retained.
>>
>> The scheduler is already not work conserving in many ways. Soft
>> affinity is only for those who want to use it and has no side effects
>> when not used. Also, the way the scheduler is implemented, it may not
>> be possible to do the first level of search in a work conserving way;
>> I am open to ideas.
>
> I really don't understand the premise of this soft affinity stuff then.
> I understood it was to allow spreading if under-utilized, but group
> when over-utilized, but you're arguing for the exact opposite, which
> doesn't make sense.

You are right on the premise. The whole knob thing came into existence
because I couldn't make the first level of search work conserving. I am
concerned that trying to make it work conserving can introduce
significant latency in the code path when SA is used. I have made the
second level of search work conserving when we search the LLC domain.

Having said that, SA need not necessarily be binary, i.e. only spill over
to the allowed set if the preferred set is 100% utilized (work
conserving). The spill over can happen before that, and SA can have a
degree of softness. The above two points made me go down the knob path
for the first level of search.
Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs
On 7/2/19 10:58 PM, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:
>> The soft affinity CPUs present in the cpumask cpus_preferred are used
>> by the scheduler in two levels of search. The first is in determining
>> wake affine, which chooses the LLC domain, and the second is while
>> searching for idle CPUs in the LLC domain. In the first level it uses
>> cpus_preferred to prune the search space. In the second level it first
>> searches cpus_preferred and then cpus_allowed. Using the
>> affinity_unequal flag it breaks out early to avoid any overhead in the
>> scheduler fast path when soft affinity is not used. This only changes
>> the wake up path of the scheduler; the idle balancing is unchanged;
>> together they achieve the "softness" of scheduling.
>
> I really dislike this implementation. I thought the idea was to remain
> work conserving (in so far as that we're that anyway), so changing
> select_idle_sibling() doesn't make sense to me. If there is idle, we
> use it. Same for newidle; which you already retained.

The scheduler is already not work conserving in many ways. Soft affinity
is only for those who want to use it and has no side effects when not
used. Also, the way the scheduler is implemented, it may not be possible
to do the first level of search in a work conserving way; I am open to
ideas.

> This then leaves regular balancing, and for that we can fudge with
> can_migrate_task() and nr_balance_failed or something.

Possibly, but I don't know if similar performance behavior can be
achieved by the periodic load balancer. Do you want a performance
comparison of the two approaches?

> And I also really don't want a second utilization tipping point; we
> already have the overloaded thing.

The numbers in the cover letter show that a static tipping point will not
work for all workloads. What soft affinity is doing is essentially
trading off cache coherence for more CPU. The optimum tradeoff point will
vary from workload to workload and with system metrics like coherence
overhead. If we just use the domain overload, that becomes a static
definition of the tipping point; we need something tunable that captures
this tradeoff. The ratio of CPU util seemed to work well and capture
that.

> I also still dislike how you never looked into the numa balancer, which
> already has preferred_nid stuff.

Not sure if you mean using the existing NUMA balancer or enhancing it. If
the former, I have numbers in the cover letter that show the NUMA
balancer is not making any difference. I allocated the memory of each DB
instance to one NUMA node using numactl, but the NUMA balancer still
migrated pages, so numactl only seems to control the initial allocation.
Secondly, even though the NUMA balancer migrated pages, it had no
performance benefit compared to disabling it.
Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
On 7/4/19 6:04 PM, Parth Shah wrote:
> The same experiment with hackbench and with perf analysis shows an
> increase in L1 cache miss rate with these patches (lower is better):
>
>                         Baseline(%)   Patch(%)
> ---------------------   -----------   ---------
> Total cache miss rate   17.01         19 (-11%)
> L1 icache miss rate     5.45          6.7 (-22%)
>
> So is it possible for the idle_cpu search to try checking target_cpu
> first and then go to the sliding window if not found? The below diff
> works as expected on an IBM POWER9 system and resolves the problem of
> far wakeups to a large extent.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff2e9b5c3ac5..fae035ce1162 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6161,6 +6161,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	u64 time, cost;
>  	s64 delta;
>  	int cpu, limit, floor, target_tmp, nr = INT_MAX;
> +	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>
>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>  	if (!this_sd)
> @@ -6198,16 +6199,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
>  	time = local_clock();
>
> -	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
> +	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
> +	for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
> +		__cpumask_clear_cpu(cpu, cpus);
> +		if (available_idle_cpu(cpu))
> +			goto idle_cpu_exit;
> +	}
> +
> +	for_each_cpu_wrap(cpu, cpus, target_tmp) {
>  		per_cpu(next_cpu, target) = cpu;
>  		if (!--nr)
>  			return -1;
> -		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> -			continue;
>  		if (available_idle_cpu(cpu))
>  			break;
>  	}
>
> +idle_cpu_exit:
>  	time = local_clock() - time;
>  	cost = this_sd->avg_scan_cost;
>  	delta = (s64)(time - cost) / 8;
>
> Best,
> Parth

How about calling select_idle_smt before select_idle_cpu from
select_idle_sibling? That should have the same effect.
Re: [RFC 0/2] Optimize the idle CPU search
On 7/9/19 11:08 AM, Parth Shah wrote:
> On 7/9/19 5:38 AM, Subhra Mazumdar wrote:
>> On 7/8/19 10:24 AM, Parth Shah wrote:
>>> When searching for an idle_sibling, the scheduler first iterates to
>>> search for an idle core and then for an idle CPU. By maintaining the
>>> idle CPU mask while iterating through idle cores, we can mark
>>> non-idle CPUs which the idle CPU search would not have to iterate
>>> through again. This is especially true in a moderately loaded system.
>>>
>>> Optimize the idle CPU search by marking already found non-idle CPUs
>>> during the idle core search. This results in a lower iteration count
>>> when searching for idle CPUs.
>>
>> I believe this can co-exist with latency-nice? We can derive the 'nr'
>> in select_idle_cpu from latency-nice and use the new mask to iterate.
>
> I agree, it can be done with latency-nice. Maybe something like below?
>
> smt = nr_cpus / nr_cores
> nr = smt + (p->latency_nice * (total_cpus - smt) / max_latency_nice)
>
> This limits the lower bound to 1 core and goes through all the cores if
> latency_nice is maximum for a task.

Yes, I had something similar in mind.
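Parth's proposed mapping is easy to sanity-check in isolation. A sketch
with the names from the formula, assuming the 1..100 latency-nice range
from the cover letter (a toy model, not a proposed kernel implementation):

```c
#include <assert.h>

#define MAX_LATENCY_NICE	100

/*
 * Proposed mapping: always search at least one core's worth of CPUs
 * (smt), scaling linearly up to the whole LLC at the maximum
 * latency-nice value.
 */
static int latency_nr(int latency_nice, int smt, int total_cpus)
{
	return smt + (latency_nice * (total_cpus - smt)) / MAX_LATENCY_NICE;
}
```

On an SMT8 machine with 176 CPUs, latency_nr(1, 8, 176) stays near one
core while latency_nr(100, 8, 176) covers all 176 CPUs, matching the
stated lower and upper bounds.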
Re: [RFC 0/2] Optimize the idle CPU search
On 7/8/19 1:38 PM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 10:24:30AM +0530, Parth Shah wrote:
>> When searching for an idle_sibling, the scheduler first iterates to
>> search for an idle core and then for an idle CPU. By maintaining the
>> idle CPU mask while iterating through idle cores, we can mark non-idle
>> CPUs which the idle CPU search would not have to iterate through
>> again. This is especially true in a moderately loaded system.
>>
>> Optimize the idle CPU search by marking already found non-idle CPUs
>> during the idle core search. This results in a lower iteration count
>> when searching for idle CPUs.
>
> Have you seen these patches:
>
>   https://lkml.kernel.org/r/20180530142236.667774...@infradead.org
>
> I've meant to get back to that, but never quite had the time :/

The most relevant bit of those was folding select_idle_core and
select_idle_cpu. But it may be good to keep them separate, as workloads
which just want any idle cpu can find one and break early by disabling
the idle core search. And this can still work with latency-nice, which
can moderate both the idle core and the idle cpu search.
Re: [RFC 0/2] Optimize the idle CPU search
On 7/8/19 10:24 AM, Parth Shah wrote: When searching for an idle_sibling, the scheduler first iterates to search for an idle core and then for an idle CPU. By maintaining the idle CPU mask while iterating through idle cores, we can mark non-idle CPUs which the idle CPU search would then not have to iterate through again. This is especially true in a moderately loaded system. Optimize the idle CPU search by marking already found non-idle CPUs during the idle core search. This reduces the iteration count when searching for idle CPUs. I believe this can co-exist with latency-nice? We can derive the 'nr' in select_idle_cpu from latency-nice and use the new mask to iterate.
Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
On 7/2/19 1:54 AM, Patrick Bellasi wrote: Wondering if searching and preempting needs will ever be conflicting? I guess the winning point is that we don't commit behaviors to userspace, but just abstract concepts which are turned into biases. I don't see conflicts right now: if you are latency tolerant that means you can spend more time to try finding a better CPU (e.g. we can use the energy model to compare multiple CPUs) _and/or_ give the current task a better chance to complete by delaying its preemption. OK Otherwise sounds like a good direction to me. For the searching aspect, can we map latency-nice values to the % of cores we search in select_idle_cpu? Thus the search cost can be controlled by the latency-nice value. I guess that's worth a try, the only caveat I see is that it's turning the bias into something very platform specific. Meaning, the same latency-nice value on different machines can have very different results. Would it not be better to try finding a more platform-independent mapping? Maybe something time bounded, e.g. the higher the latency-nice the more time we can spend looking for CPUs? The issue I see is: suppose we have a range of latency-nice values, then it should cover the entire range of search (one core to all cores). As Peter said, some workloads will want to search the LLC fully. If we use absolute time, the mapping from the latency-nice value range to it will be arbitrary. If you have something in mind let me know, maybe I am thinking differently. But the issue is, if more latency-tolerant workloads are set to search less, we still need some mechanism to achieve a good spread of threads. I don't get this example: why should more latency-tolerant workloads require less search? I guess I got the definition of "latency tolerant" backwards. Can we keep the sliding window mechanism in that case? Which one? Sorry, I did not go through the patches, can you briefly resume the idea? If a workload has set it to low latency tolerant, then the search will be less. 
That can lead to localization of threads on a few CPUs as we are not searching the entire LLC even if there are idle CPUs available. For this I had introduced a per-CPU variable (for the target CPU) to track the boundary of the search so that every time it will start from the boundary, thus sliding the window. So even if we are searching very little, the search window keeps shifting and gives us a good spread. This is orthogonal to the latency-nice thing. Also, will latency-nice do anything for select_idle_core and select_idle_smt? I guess in principle the same bias can be used at different levels, maybe with different mappings. Doing it for select_idle_core will have the issue that the dynamic flag (whether an idle core is present or not) can only be updated by threads which are doing the full search. Thanks, Subhra In the mobile world use-case we will likely use it only to switch from select_idle_sibling to the energy-aware slow path. And perhaps to see if we can bias the wakeup preemption granularity. Best, Patrick
Re: [PATCH V3 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
On 7/2/19 1:35 AM, Peter Zijlstra wrote: On Mon, Jul 01, 2019 at 03:08:41PM -0700, Subhra Mazumdar wrote: On 7/1/19 1:03 AM, Viresh Kumar wrote: On 28-06-19, 18:16, Subhra Mazumdar wrote: On 6/25/19 10:06 PM, Viresh Kumar wrote: @@ -5376,6 +5376,15 @@ static struct { #endif /* CONFIG_NO_HZ_COMMON */ +/* CPU only has SCHED_IDLE tasks enqueued */ +static int sched_idle_cpu(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running && + rq->nr_running); +} + Shouldn't this check if rq->curr is also sched idle? Why wouldn't the current set of checks be enough to guarantee that ? I thought nr_running does not include the on-cpu thread. It very much does. And why not drop the rq->nr_running non zero check? Because the CPU isn't sched-idle if nr_running and idle_h_nr_running are both 0, i.e. it is an IDLE cpu in that case. And so I thought it is important to have this check as well. idle_cpu() not only checks that nr_running is 0 but also that rq->curr == rq->idle. idle_cpu() will try very hard to declare a CPU !idle. But I don't see how that is relevant. sched_idle_cpu() will only return true if there are only SCHED_IDLE tasks on the CPU. Viresh's test is simple and straightforward. OK makes sense. Thanks, Subhra
Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
On 7/1/19 6:55 AM, Patrick Bellasi wrote: On 01-Jul 11:02, Peter Zijlstra wrote: On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote: Hi, Resending this patchset, will be good to get some feedback. Any suggestions that will make it more acceptable are welcome. We have been shipping this with Unbreakable Enterprise Kernel in Oracle Linux. Current select_idle_sibling first tries to find a fully idle core using select_idle_core which can potentially search all cores and if it fails it finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially search all cpus in the llc domain. This doesn't scale for large llc domains and will only get worse with more cores in future. This patch solves the scalability problem by: - Setting an upper and lower limit of idle cpu search in select_idle_cpu to keep search time low and constant - Adding a new sched feature SIS_CORE to disable select_idle_core Additionally it also introduces a new per-cpu variable next_cpu to track the limit of search so that every time search starts from where it ended. This rotating search window over cpus in LLC domain ensures that idle cpus are eventually found in case of high load. Right, so we had a wee conversation about this patch series at OSPM, and I don't see any of that reflected here :-( Specifically, given that some people _really_ want the whole L3 mask scanned to reduce tail latency over raw throughput, while you guys prefer the other way around, it was proposed to extend the task model. Specifically something like a latency-nice was mentioned (IIRC) where a Right, AFAIR PaulT suggested to add support for the concept of a task being "latency tolerant": meaning we can spend more time to search for a CPU and/or avoid preempting the current task. Wondering if searching and preempting needs will ever be conflicting? Otherwise sounds like a good direction to me. For the searching aspect, can we map latency nice values to the % of cores we search in select_idle_cpu? 
Thus the search cost can be controlled by latency nice value. But the issue is if more latency tolerant workloads set to less search, we still need some mechanism to achieve good spread of threads. Can we keep the sliding window mechanism in that case? Also will latency nice do anything for select_idle_core and select_idle_smt?
Re: [PATCH V3 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
On 7/1/19 1:03 AM, Viresh Kumar wrote: On 28-06-19, 18:16, Subhra Mazumdar wrote: On 6/25/19 10:06 PM, Viresh Kumar wrote: We try to find an idle CPU to run the next task, but in case we don't find an idle CPU it is better to pick a CPU which will run the task the soonest, for performance reason. A CPU which isn't idle but has only SCHED_IDLE activity queued on it should be a good target based on this criteria as any normal fair task will most likely preempt the currently running SCHED_IDLE task immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one shall give better results as it should be able to run the task sooner than an idle CPU (which requires to be woken up from an idle state). This patch updates both fast and slow paths with this optimization. Signed-off-by: Viresh Kumar --- kernel/sched/fair.c | 43 +-- 1 file changed, 33 insertions(+), 10 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1277adc3e7ed..2e0527fd468c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5376,6 +5376,15 @@ static struct { #endif /* CONFIG_NO_HZ_COMMON */ +/* CPU only has SCHED_IDLE tasks enqueued */ +static int sched_idle_cpu(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running && + rq->nr_running); +} + Shouldn't this check if rq->curr is also sched idle? Why wouldn't the current set of checks be enough to guarantee that ? I thought nr_running does not include the on-cpu thread. And why not drop the rq->nr_running non zero check? Because CPU isn't sched-idle if nr_running and idle_h_nr_running are both 0, i.e. it is an IDLE cpu in that case. And so I thought it is important to have this check as well. idle_cpu() not only checks nr_running is 0 but also rq->curr == rq->idle
Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
Also, systems like POWER9 have sd_llc as a pair of cores only. So it won't benefit from the limits and hence also hiding your code in select_idle_cpu behind static keys will be much preferred. If it doesn't hurt then I don't see the point. So this is the result from a POWER9 system with your patches: System configuration: 2 Socket, 44 cores, 176 CPUs Experiment setup:
===
=> Setup 1:
- 44 tasks doing just while(1), this is to make select_idle_core return -1 most times
- perf bench sched messaging -g 1 -l 100

+----------+--------+---------------+--------+
| Baseline | stddev |     Patch     | stddev |
+----------+--------+---------------+--------+
|   135    |  3.21  | 158 (-17.03%) |  4.69  |
+----------+--------+---------------+--------+

=> Setup 2:
- schbench -m44 -t 1

+=======+==========+========+=========+========+
| %ile  | Baseline | stddev |  patch  | stddev |
+=======+==========+========+=========+========+
|  50   |    10    |  3.49  |   10    |  2.29  |
+-------+----------+--------+---------+--------+
|  95   |   467    |  4.47  |   469   |  0.81  |
+-------+----------+--------+---------+--------+
|  99   |   571    | 21.32  |   584   | 18.69  |
+-------+----------+--------+---------+--------+
| 99.5  |   629    | 30.05  |   641   | 20.95  |
+-------+----------+--------+---------+--------+
| 99.9  |   780    | 40.38  |   773   |  44.2  |
+-------+----------+--------+---------+--------+

I guess it doesn't make much difference in schbench results but hackbench (perf bench) seems to have an observable regression. Best, Parth If POWER9 sd_llc has only 2 cores, the behavior shouldn't change much with the select_idle_cpu changes as the limits are 1 and 2 cores. Previously the lower bound was 4 cpus and the upper bound was calculated by the prop. Now it is 1 core (4 cpus on SMT4) and the upper bound 2 cores. Could it be the extra computation of cpumask_weight causing the regression rather than the sliding window itself (one way to check this would be to hardcode 4 in place of topology_sibling_weight)? Or is it the L1 cache coherency? I am a bit surprised because SPARC SMT8, which has more cores in sd_llc and an L1 cache per core, showed improvement with hackbench. Thanks, Subhra
Re: [PATCH V3 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
On 6/25/19 10:06 PM, Viresh Kumar wrote: We try to find an idle CPU to run the next task, but in case we don't find an idle CPU it is better to pick a CPU which will run the task the soonest, for performance reason. A CPU which isn't idle but has only SCHED_IDLE activity queued on it should be a good target based on this criteria as any normal fair task will most likely preempt the currently running SCHED_IDLE task immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one shall give better results as it should be able to run the task sooner than an idle CPU (which requires to be woken up from an idle state). This patch updates both fast and slow paths with this optimization. Signed-off-by: Viresh Kumar --- kernel/sched/fair.c | 43 +-- 1 file changed, 33 insertions(+), 10 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1277adc3e7ed..2e0527fd468c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5376,6 +5376,15 @@ static struct { #endif /* CONFIG_NO_HZ_COMMON */ +/* CPU only has SCHED_IDLE tasks enqueued */ +static int sched_idle_cpu(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running && + rq->nr_running); +} + Shouldn't this check if rq->curr is also sched idle? And why not drop the rq->nr_running non zero check?
Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread
On 6/28/19 4:54 AM, Srikar Dronamraju wrote: * subhra mazumdar [2019-06-26 18:29:15]: Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. While rotating the cpu search window is good, not sure if this can find an idle cpu quickly. The probability of finding an idle cpu should still remain the same. No? Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; Shouldn't this assignment be outside the for loop? With the current code, 1. We keep reassigning multiple times. 2. The last assignment happens for an idle cpu and sometimes the assignment is for a non-idle cpu. We want the last assignment irrespective of whether it was an idle cpu or not, since in both cases we want to track the boundary of the search. Thanks, Subhra
Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
On 6/28/19 12:01 PM, Parth Shah wrote: On 6/27/19 6:59 AM, subhra mazumdar wrote: Use SIS_CORE to disable idle core search. For some workloads select_idle_core becomes a scalability bottleneck, removing it improves throughput. Also there are workloads where disabling it can hurt latency, so we need to have an option. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c1ca88e..6a74808 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (!sd) return target; - i = select_idle_core(p, sd, target); - if ((unsigned)i < nr_cpumask_bits) - return i; + if (sched_feat(SIS_CORE)) { + i = select_idle_core(p, sd, target); + if ((unsigned)i < nr_cpumask_bits) + return i; + } This can cause significant performance loss if disabled. select_idle_core spreads workloads quickly across the cores, hence disabling this leaves much of the work to be offloaded to the load balancer to move tasks across the cores. Latency-sensitive and long-running multi-threaded workloads should see a regression under these conditions. Yes, in the case of SPARC SMT8 I did notice that (see cover letter). That's why it is a feature that is ON by default, but can be turned OFF for specific workloads on x86 SMT2 that can benefit from it. Also, systems like POWER9 have sd_llc as a pair of cores only. So it won't benefit from the limits and hence also hiding your code in select_idle_cpu behind static keys will be much preferred. If it doesn't hurt then I don't see the point. Thanks, Subhra
Re: [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
On 6/28/19 11:47 AM, Parth Shah wrote: On 6/27/19 6:59 AM, subhra mazumdar wrote: Put upper and lower limits on the cpu search of select_idle_cpu. The lower limit is the number of cpus in a core while the upper limit is twice that. This ensures for any architecture we will usually search beyond a core. The upper limit also helps in keeping the search cost low and constant. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..b58f08f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, nr = INT_MAX; + int cpu, limit, floor, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(_llc)); if (!this_sd) @@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - if (span_avg > 4*avg_cost) + floor = cpumask_weight(topology_sibling_cpumask(target)); + if (floor < 2) + floor = 2; + limit = floor << 1; Is the upper limit an experimental value only, or does it have any arch-specific significance? Because, AFAIU, systems like POWER9 might benefit from searching 4 cores due to its different cache model. So it can be tuned for arch-specific builds then. The lower bound and upper bound were 1 core and 2 cores respectively. That is done so as to search beyond one core and at the same time not to search too much. It is a heuristic that seemed to work well on all archs, coupled with the moving window mechanism. Does 4 vs 2 make any difference on your POWER9? AFAIR it didn't on SPARC SMT8. Also, variable names can be changed for better readability. floor -> weight_clamp_min limit -> weight_clamp_max or something similar OK. 
Thanks, Subhra + if (span_avg > floor*avg_cost) { nr = div_u64(span_avg, avg_cost); - else - nr = 4; + if (nr > limit) + nr = limit; + } else { + nr = floor; + } } time = local_clock(); Best, Parth
Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread
On 6/28/19 11:36 AM, Parth Shah wrote: Hi Subhra, I ran your patch series on IBM POWER systems and this is what I have observed. On 6/27/19 6:59 AM, subhra mazumdar wrote: Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b58f08f..c1ca88e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, limit, floor, nr = INT_MAX; + int cpu, limit, floor, target_tmp, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(_llc)); if (!this_sd) @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; This leads to a problem of cache hotness. AFAIU, in most cases, `target = prev_cpu` of the task being woken up and selecting an idle CPU nearest to the prev_cpu is favorable. But since this doesn't keep track of last idle cpu per task, it fails to find the nearest possible idle CPU in cases when the task is being woken up after other scheduled task. I had tested hackbench on SPARC SMT8 (see numbers in cover letter) and showed improvement with this. Firstly it's a tradeoff between cache effects vs time spent in searching idle CPU, and both x86 and SPARC numbers showed tradeoff is worth it. Secondly there is a lot of cache affinity logic in the beginning of select_idle_sibling. 
If select_idle_cpu is still called, that means we are past that and want any idle cpu. I don't think waking up close to the prev cpu is the intention for starting the search from there; rather, it is to spread threads across all cpus so that no cpu gets victimized, as there is no atomicity. The prev cpu just acts as a good seed to do the spreading. Thanks, Subhra
Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
On 6/26/19 11:54 PM, Thomas Gleixner wrote: On Thu, 27 Jun 2019, Thomas Gleixner wrote: On Wed, 26 Jun 2019, subhra mazumdar wrote: Introduce a per-cpu variable to keep the number of HT siblings of a cpu. This will be used for quick lookup in select_idle_cpu to determine the limits of search. Why? The number of siblings is constant at least today unless you play silly cpu hotplug games. A bit more justification for adding yet another random storage would be appreciated. This patch does it only for x86. # grep 'This patch' Documentation/process/submitting-patches.rst IOW, we all know already that this is a patch and from the subject prefix and the diffstat it's pretty obvious that this is x86 only. So instead of documenting the obvious, please add proper context to justify the change. Aside of that the right ordering is to introduce the default fallback in a separate patch, which explains the reasoning and then in the next one add the x86 optimized version. OK. I will also add the extra optimization for other architectures. Thanks, Subhra Thanks, tglx
Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
On 6/26/19 11:51 PM, Thomas Gleixner wrote: On Wed, 26 Jun 2019, subhra mazumdar wrote: Introduce a per-cpu variable to keep the number of HT siblings of a cpu. This will be used for quick lookup in select_idle_cpu to determine the limits of search. Why? The number of siblings is constant at least today unless you play silly cpu hotplug games. A bit more justification for adding yet another random storage would be appreciated. Using cpumask_weight every time in select_idle_cpu to compute the no. of SMT siblings can be costly, as cpumask_weight may not be O(1) for systems with a large no. of CPUs (e.g. 8 sockets, each socket having lots of cores). Over 512 CPUs the bitmask will span multiple cache lines, and touching multiple cache lines in the fast path of the scheduler can cost more than we save from this optimization. Even within a single cache line it loops over longs. We want to touch O(1) cache lines and do O(1) operations, hence pre-compute it in a per-CPU variable. This patch does it only for x86. # grep 'This patch' Documentation/process/submitting-patches.rst IOW, we all know already that this is a patch and from the subject prefix and the diffstat it's pretty obvious that this is x86 only. So instead of documenting the obvious, please add proper context to justify the change. Ok. The extra per-CPU optimization was done only for x86 as we cared about it the most, and to make it future proof. I will add it for other architectures. +/* representing number of HT siblings of each CPU */ +DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); +EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling); Why does this need an export? No module has any reason to access this. 
I will remove it /* representing HT and core siblings of each logical CPU */ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); @@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu) if (!has_mp) { cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = + cpumask_weight(topology_sibling_cpumask(cpu)); cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)); cpumask_set_cpu(cpu, topology_core_cpumask(cpu)); c->booted_cores = 1; @@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu) for_each_cpu(i, cpu_sibling_setup_mask) { o = _data(i); - if ((i == cpu) || (has_smt && match_smt(c, o))) + if ((i == cpu) || (has_smt && match_smt(c, o))) { link_mask(topology_sibling_cpumask, cpu, i); + threads = cpumask_weight(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = threads; + per_cpu(cpumask_weight_sibling, i) = threads; This only works for SMT=2, but fails to update the rest for SMT=4. I guess I assumed that x86 will always be SMT2, will fix this. Thanks, Subhra @@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu) for_each_cpu(sibling, topology_core_cpumask(cpu)) { cpumask_clear_cpu(cpu, topology_core_cpumask(sibling)); + per_cpu(cpumask_weight_sibling, sibling) = + cpumask_weight(topology_sibling_cpumask(sibling)); While remove does the right thing. Thanks, tglx
[PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling
Use the per-cpu var cpumask_weight_sibling for quick lookup in select_idle_cpu. This is the fast path of the scheduler and every cycle is worth saving. Using cpumask_weight can result in iterating over the words of the bitmask. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6a74808..878f11c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - floor = cpumask_weight(topology_sibling_cpumask(target)); + floor = topology_sibling_weight(target); if (floor < 2) floor = 2; limit = floor << 1; -- 2.9.3
[RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
by having SIS_CORE false. Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44 tasks (lower is better):

percentile   baseline   %stdev   patch               %stdev
50           94         2.82     93.33 (0.71%)       1.24
75           124        2.13     122.67 (1.08%)      1.7
90           152        1.74     149.33 (1.75%)      2.35
95           171        2.11     167 (2.34%)         2.74
99           512.67     104.96   206 (59.82%)        8.86
99.5         2296       82.55    3121.67 (-35.96%)   97.37
99.9         12517.33   2.38     12592 (-0.6%)       1.67

Changes from v2->v3:
- Use shift operator instead of multiplication to compute limit
- Use per-CPU variable to precompute the number of sibling SMTs for x86

subhra mazumdar (7):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  x86/smpboot: introduce per-cpu variable for HT siblings
  sched: use per-cpu variable cpumask_weight_sibling

 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 -
 include/linux/topology.h        |  4 
 kernel/sched/core.c             |  2 ++
 kernel/sched/fair.c             | 31 +++
 kernel/sched/features.h         |  1 +
 kernel/sched/sched.h            |  1 +
 8 files changed, 49 insertions(+), 9 deletions(-)

-- 2.9.3
[PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
Put upper and lower limit on cpu search of select_idle_cpu. The lower limit is amount of cpus in a core while upper limit is twice that. This ensures for any architecture we will usually search beyond a core. The upper limit also helps in keeping the search cost low and constant. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..b58f08f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, nr = INT_MAX; + int cpu, limit, floor, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(_llc)); if (!this_sd) @@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - if (span_avg > 4*avg_cost) + floor = cpumask_weight(topology_sibling_cpumask(target)); + if (floor < 2) + floor = 2; + limit = floor << 1; + if (span_avg > floor*avg_cost) { nr = div_u64(span_avg, avg_cost); - else - nr = 4; + if (nr > limit) + nr = limit; + } else { + nr = floor; + } } time = local_clock(); -- 2.9.3
[PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
Introduce a per-cpu variable to keep the number of HT siblings of a cpu. This will be used for quick lookup in select_idle_cpu to determine the limits of search. This patch does it only for x86. Signed-off-by: subhra mazumdar --- arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 1 + arch/x86/kernel/smpboot.c | 17 - include/linux/topology.h| 4 4 files changed, 22 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index da545df..1e90cbd 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -22,6 +22,7 @@ extern int smp_num_siblings; extern unsigned int num_processors; DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); +DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); /* cpus sharing the last level cache: */ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map); diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 453cf38..dd19c71 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #ifdef CONFIG_SMP #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) +#define topology_sibling_weight(cpu) (per_cpu(cpumask_weight_sibling, cpu)) extern unsigned int __max_logical_packages; #define topology_max_packages()(__max_logical_packages) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 362dd89..20bf676 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -85,6 +85,10 @@ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); +/* representing number of HT siblings of each CPU */ +DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); +EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling); + /* 
representing HT and core siblings of each logical CPU */ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); @@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu) if (!has_mp) { cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = + cpumask_weight(topology_sibling_cpumask(cpu)); cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)); cpumask_set_cpu(cpu, topology_core_cpumask(cpu)); c->booted_cores = 1; @@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu) for_each_cpu(i, cpu_sibling_setup_mask) { o = _data(i); - if ((i == cpu) || (has_smt && match_smt(c, o))) + if ((i == cpu) || (has_smt && match_smt(c, o))) { link_mask(topology_sibling_cpumask, cpu, i); + threads = cpumask_weight(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = threads; + per_cpu(cpumask_weight_sibling, i) = threads; + } if ((i == cpu) || (has_mp && match_llc(c, o))) link_mask(cpu_llc_shared_mask, cpu, i); @@ -1173,6 +1183,8 @@ static __init void disable_smp(void) else physid_set_mask_of_physid(0, _cpu_present_map); cpumask_set_cpu(0, topology_sibling_cpumask(0)); + per_cpu(cpumask_weight_sibling, 0) = + cpumask_weight(topology_sibling_cpumask(0)); cpumask_set_cpu(0, topology_core_cpumask(0)); } @@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu) for_each_cpu(sibling, topology_core_cpumask(cpu)) { cpumask_clear_cpu(cpu, topology_core_cpumask(sibling)); + per_cpu(cpumask_weight_sibling, sibling) = + cpumask_weight(topology_sibling_cpumask(sibling)); /* * last thread sibling in this cpu core going down */ @@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu) cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling)); cpumask_clear(cpu_llc_shared_mask(cpu)); cpumask_clear(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = 0; cpumask_clear(topology_core_cpumask(cpu)); c->cpu_core_id = 0; c->booted_cores = 0; diff --git a/include/linux/topology.h 
b/include/linux/topology.h index cb0775e..a85aea1 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_sibling_cpumask #define topology_sibling_cpumask(cpu) cpumask_of(cpu) #endif +#ifndef topology_sibling_weight +#d
[PATCH v3 3/7] sched: rotate the cpu search window for better spread
Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b58f08f..c1ca88e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, limit, floor, nr = INT_MAX; + int cpu, limit, floor, target_tmp, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(_llc)); if (!this_sd) @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; if (!--nr) return -1; if (!cpumask_test_cpu(cpu, >cpus_allowed)) -- 2.9.3
[PATCH v3 4/7] sched: add sched feature to disable idle core search
Add a new sched feature SIS_CORE to have an option to disable idle core search (select_idle_core). Signed-off-by: subhra mazumdar --- kernel/sched/features.h | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 858589b..de4d506 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true) */ SCHED_FEAT(SIS_AVG_CPU, false) SCHED_FEAT(SIS_PROP, true) +SCHED_FEAT(SIS_CORE, true) /* * Issue a WARN when we do multiple update_rq_clock() calls -- 2.9.3
[PATCH v3 5/7] sched: SIS_CORE to disable idle core search
Use SIS_CORE to disable idle core search. For some workloads select_idle_core becomes a scalability bottleneck, removing it improves throughput. Also there are workloads where disabling it can hurt latency, so need to have an option. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c1ca88e..6a74808 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (!sd) return target; - i = select_idle_core(p, sd, target); - if ((unsigned)i < nr_cpumask_bits) - return i; + if (sched_feat(SIS_CORE)) { + i = select_idle_core(p, sd, target); + if ((unsigned)i < nr_cpumask_bits) + return i; + } i = select_idle_cpu(p, sd, target); if ((unsigned)i < nr_cpumask_bits) -- 2.9.3
[PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit
Introduce a per-cpu variable to track the limit upto which idle cpu search was done in select_idle_cpu(). This will help to start the search next time from there. This is necessary for rotating the search window over entire LLC domain. Signed-off-by: subhra mazumdar --- kernel/sched/core.c | 2 ++ kernel/sched/sched.h | 1 + 2 files changed, 3 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 874c427..80657fc 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -24,6 +24,7 @@ #include DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL) /* @@ -5966,6 +5967,7 @@ void __init sched_init(void) for_each_possible_cpu(i) { struct rq *rq; + per_cpu(next_cpu, i) = -1; rq = cpu_rq(i); raw_spin_lock_init(>lock); rq->nr_running = 0; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..4cecfa2 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -994,6 +994,7 @@ static inline void update_idle_core(struct rq *rq) { } #endif DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #define cpu_rq(cpu)(_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr() -- 2.9.3
[RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity
New system call sched_setaffinity2 is introduced for scheduler soft affinity. It takes an extra parameter to specify hard or soft affinity, where hard implies same as existing sched_setaffinity. New cpumask cpus_preferred is introduced for this purpose which is always a subset of cpus_allowed. A boolean affinity_unequal is used to store if they are unequal for fast lookup. Setting hard affinity resets soft affinity set to be equal to it. Soft affinity is only allowed for CFS class threads. Signed-off-by: subhra mazumdar --- arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/sched.h | 5 +- include/linux/syscalls.h | 3 + include/uapi/asm-generic/unistd.h | 4 +- include/uapi/linux/sched.h | 3 + init/init_task.c | 2 + kernel/compat.c| 2 +- kernel/rcu/tree_plugin.h | 3 +- kernel/sched/core.c| 167 - 9 files changed, 162 insertions(+), 28 deletions(-) diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index b4e6f9e..1dccdd2 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -355,6 +355,7 @@ 431common fsconfig__x64_sys_fsconfig 432common fsmount __x64_sys_fsmount 433common fspick __x64_sys_fspick +434common sched_setaffinity2 __x64_sys_sched_setaffinity2 # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/sched.h b/include/linux/sched.h index 1183741..b863fa8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -652,6 +652,8 @@ struct task_struct { unsigned intpolicy; int nr_cpus_allowed; cpumask_t cpus_allowed; + cpumask_t cpus_preferred; + boolaffinity_unequal; #ifdef CONFIG_PREEMPT_RCU int rcu_read_lock_nesting; @@ -1784,7 +1786,8 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) # define vcpu_is_preempted(cpu)false #endif -extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask); +extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask, + int flags); 
extern long sched_getaffinity(pid_t pid, struct cpumask *mask); #ifndef TASK_SIZE_OF diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index e2870fe..147a4e5 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -669,6 +669,9 @@ asmlinkage long sys_sched_rr_get_interval(pid_t pid, struct __kernel_timespec __user *interval); asmlinkage long sys_sched_rr_get_interval_time32(pid_t pid, struct old_timespec32 __user *interval); +asmlinkage long sys_sched_setaffinity2(pid_t pid, unsigned int len, + unsigned long __user *user_mask_ptr, + int flags); /* kernel/signal.c */ asmlinkage long sys_restart_syscall(void); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index a87904d..d77b366 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig) __SYSCALL(__NR_fsmount, sys_fsmount) #define __NR_fspick 433 __SYSCALL(__NR_fspick, sys_fspick) +#define __NR_sched_setaffinity2 434 +__SYSCALL(__NR_sched_setaffinity2, sys_sched_setaffinity2) #undef __NR_syscalls -#define __NR_syscalls 434 +#define __NR_syscalls 435 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index ed4ee17..f910cd5 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -52,6 +52,9 @@ #define SCHED_FLAG_RECLAIM 0x02 #define SCHED_FLAG_DL_OVERRUN 0x04 +#define SCHED_HARD_AFFINITY0 +#define SCHED_SOFT_AFFINITY1 + #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \ SCHED_FLAG_RECLAIM | \ SCHED_FLAG_DL_OVERRUN) diff --git a/init/init_task.c b/init/init_task.c index c70ef65..aa226a3 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -73,6 +73,8 @@ struct task_struct init_task .normal_prio= MAX_PRIO - 20, .policy = SCHED_NORMAL, .cpus_allowed = CPU_MASK_ALL, + .cpus_preferred = CPU_MASK_ALL, + .affinity_unequal = false, .nr_cpus_allowed= NR_CPUS, .mm = NULL, .active_mm = 
&init_mm, diff --git a/kernel/compat.c b/kernel/compat.c index b5f7063..96621d7 100644 --- a/kernel/compat.c +++ b/kernel/compat.c
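The mask bookkeeping described in the commit message (cpus_preferred always a subset of cpus_allowed, affinity_unequal cached for the fast path, hard affinity resetting the soft mask) can be sketched with a toy bitmask model. All names here are illustrative, not the kernel's:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the soft-affinity bookkeeping: cpus_preferred is kept a
 * subset of cpus_allowed, and affinity_unequal caches whether the two
 * masks differ so the scheduler fast path can bail out early. */
struct toy_task {
    uint64_t cpus_allowed;
    uint64_t cpus_preferred;
    bool affinity_unequal;
};

#define TOY_HARD_AFFINITY 0
#define TOY_SOFT_AFFINITY 1

/* Returns 0 on success, -1 if a soft mask is not a subset of allowed. */
static int toy_setaffinity(struct toy_task *t, uint64_t mask, int flags)
{
    if (flags == TOY_HARD_AFFINITY) {
        /* Hard affinity resets the soft mask to match it. */
        t->cpus_allowed = mask;
        t->cpus_preferred = mask;
    } else {
        if ((mask & t->cpus_allowed) != mask)
            return -1;      /* soft mask must stay within allowed */
        t->cpus_preferred = mask;
    }
    t->affinity_unequal = (t->cpus_allowed != t->cpus_preferred);
    return 0;
}
```

The subset check mirrors the constraint stated above; the real syscall additionally validates the user buffer and the scheduling class.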
[RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs
The soft affinity CPUs present in the cpumask cpus_preferred is used by the scheduler in two levels of search. First is in determining wake affine which choses the LLC domain and secondly while searching for idle CPUs in LLC domain. In the first level it uses cpus_preferred to prune out the search space. In the second level it first searches the cpus_preferred and then cpus_allowed. Using affinity_unequal flag it breaks early to avoid any overhead in the scheduler fast path when soft affinity is not used. This only changes the wake up path of the scheduler, the idle balancing is unchanged; together they achieve the "softness" of scheduling. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 137 ++-- 1 file changed, 100 insertions(+), 37 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..53aa7f2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5807,7 +5807,7 @@ static unsigned long capacity_spare_without(int cpu, struct task_struct *p) */ static struct sched_group * find_idlest_group(struct sched_domain *sd, struct task_struct *p, - int this_cpu, int sd_flag) + int this_cpu, int sd_flag, struct cpumask *cpus) { struct sched_group *idlest = NULL, *group = sd->groups; struct sched_group *most_spare_sg = NULL; @@ -5831,7 +5831,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, /* Skip over this group if it has no CPUs allowed */ if (!cpumask_intersects(sched_group_span(group), - >cpus_allowed)) + cpus)) continue; local_group = cpumask_test_cpu(this_cpu, @@ -5949,7 +5949,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group. 
*/ static int -find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, + int this_cpu, struct cpumask *cpus) { unsigned long load, min_load = ULONG_MAX; unsigned int min_exit_latency = UINT_MAX; @@ -5963,7 +5964,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this return cpumask_first(sched_group_span(group)); /* Traverse only the allowed CPUs */ - for_each_cpu_and(i, sched_group_span(group), >cpus_allowed) { + for_each_cpu_and(i, sched_group_span(group), cpus) { if (available_idle_cpu(i)) { struct rq *rq = cpu_rq(i); struct cpuidle_state *idle = idle_get_state(rq); @@ -5999,7 +6000,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this } static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, - int cpu, int prev_cpu, int sd_flag) + int cpu, int prev_cpu, int sd_flag, + struct cpumask *cpus) { int new_cpu = cpu; @@ -6023,13 +6025,14 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p continue; } - group = find_idlest_group(sd, p, cpu, sd_flag); + group = find_idlest_group(sd, p, cpu, sd_flag, cpus); + if (!group) { sd = sd->child; continue; } - new_cpu = find_idlest_group_cpu(group, p, cpu); + new_cpu = find_idlest_group_cpu(group, p, cpu, cpus); if (new_cpu == cpu) { /* Now try balancing at a lower domain level of 'cpu': */ sd = sd->child; @@ -6104,6 +6107,27 @@ void __update_idle_core(struct rq *rq) rcu_read_unlock(); } +static inline int +scan_cpu_mask_for_idle_cores(struct cpumask *cpus, int target) +{ + int core, cpu; + + for_each_cpu_wrap(core, cpus, target) { + bool idle = true; + + for_each_cpu(cpu, cpu_smt_mask(core)) { + cpumask_clear_cpu(cpu, cpus); + if (!idle_cpu(cpu)) + idle = false; + } + + if (idle) + return core; + } + + return -1; +} + /* * Scan the entire LLC domain for idle cores; this dynamically switches off if * there are 
no idle cores left in the system; tracked through @@ -6112,7 +6136,7 @@ void __update_idle_core(struct rq *rq) static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target) { struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); - int co
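The idle-core scan added above (scan_cpu_mask_for_idle_cores) walks candidate CPUs, treats each core's SMT siblings as a unit, clears visited siblings so they are not re-examined, and reports a core only if every sibling is idle. A minimal sketch of that logic, assuming CPUs 2k and 2k+1 are siblings and using plain bitmasks instead of cpumasks:

```c
#include <stdint.h>

/* Illustrative model of the idle-core scan: `candidates` and `idle` are
 * bitmasks of CPUs; return a CPU of the first fully idle core, or -1. */
static int toy_scan_idle_core(uint64_t candidates, uint64_t idle, int ncpus)
{
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (!(candidates & (1ULL << cpu)))
            continue;
        int base = cpu & ~1;              /* first sibling of this core */
        uint64_t core_mask = 3ULL << base;
        candidates &= ~core_mask;         /* don't revisit these siblings */
        if ((idle & core_mask) == core_mask)
            return base;                  /* whole core is idle */
    }
    return -1;
}
```

A core with one busy sibling is skipped entirely, which is exactly why this pass is more expensive than the plain idle-CPU scan: it may reject many partially idle cores before finding a fully idle one.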
[RFC PATCH 0/3] Scheduler Soft Affinity
Hackbench   %gain with soft affinity
2*4         1.43
2*8         1.36
2*16        1.01
2*32        1.45
1*4         -2.55
1*8         -5.06
1*16        -8
1*32        -7.32

DB          %gain with soft affinity
2*16        0.46
2*24        3.68
2*32        -3.34
1*16        0.08
1*24        1.6
1*32        -1.29

Finally I measured the overhead of soft affinity when it is NOT used, by comparing against the baseline kernel for the no-affinity and hard-affinity cases with Hackbench. The following is the improvement of the soft affinity kernel w.r.t baseline, but really the numbers are in the noise margin. This shows soft affinity has no overhead when not used.

Hackbench   %diff of no affinity   %diff of hard affinity
2*4         0.11                   0.31
2*8         0.13                   0.55
2*16        0.61                   0.90
2*32        0.86                   1.01
1*4         0.48                   0.43
1*8         0.45                   0.33
1*16        0.61                   0.64
1*32        0.11                   0.63

A final set of experiments were done (numbers not shown) having the memory of each DB instance spread evenly across both NUMA nodes. This showed similar improvements with soft affinity for the 2 instance case, thus proving the improvement is due to saving LLC coherence overhead.

subhra mazumdar (3):
  sched: Introduce new interface for scheduler soft affinity
  sched: change scheduler to give preference to soft affinity CPUs
  sched: introduce tunables to control soft affinity

 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/sched.h                  |   5 +-
 include/linux/sched/sysctl.h           |   2 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/sched.h             |   3 +
 init/init_task.c                       |   2 +
 kernel/compat.c                        |   2 +-
 kernel/rcu/tree_plugin.h               |   3 +-
 kernel/sched/core.c                    | 167 -
 kernel/sched/fair.c                    | 154 ++
 kernel/sched/sched.h                   |   2 +
 kernel/sysctl.c                        |  14 +++
 13 files changed, 297 insertions(+), 65 deletions(-)

-- 2.9.3
[RFC PATCH 3/3] sched: introduce tunables to control soft affinity
For different workloads the optimal "softness" of soft affinity can be different. Introduce tunables sched_allowed and sched_preferred that can be tuned via /proc. This allows to chose at what utilization difference the scheduler will chose cpus_allowed over cpus_preferred in the first level of search. Depending on the extent of data sharing, cache coherency overhead of the system etc. the optimal point may vary. Signed-off-by: subhra mazumdar --- include/linux/sched/sysctl.h | 2 ++ kernel/sched/fair.c | 19 ++- kernel/sched/sched.h | 2 ++ kernel/sysctl.c | 14 ++ 4 files changed, 36 insertions(+), 1 deletion(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 99ce6d7..0e75602 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -41,6 +41,8 @@ extern unsigned int sysctl_numa_balancing_scan_size; #ifdef CONFIG_SCHED_DEBUG extern __read_mostly unsigned int sysctl_sched_migration_cost; extern __read_mostly unsigned int sysctl_sched_nr_migrate; +extern __read_mostly unsigned int sysctl_sched_preferred; +extern __read_mostly unsigned int sysctl_sched_allowed; int sched_proc_update_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 53aa7f2..d222d78 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -85,6 +85,8 @@ unsigned int sysctl_sched_wakeup_granularity = 100UL; static unsigned int normalized_sysctl_sched_wakeup_granularity = 100UL; const_debug unsigned int sysctl_sched_migration_cost = 50UL; +const_debug unsigned int sysctl_sched_preferred= 1UL; +const_debug unsigned int sysctl_sched_allowed = 100UL; #ifdef CONFIG_SMP /* @@ -6739,7 +6741,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f int new_cpu = prev_cpu; int want_affine = 0; int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); - struct cpumask *cpus = >cpus_preferred; + int cpux, cpuy; + struct cpumask 
*cpus; + + if (!p->affinity_unequal) { + cpus = &p->cpus_allowed; + } else { + cpux = cpumask_any(&p->cpus_preferred); + cpus = this_cpu_cpumask_var_ptr(select_idle_mask); + cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred); + cpuy = cpumask_any(cpus); + if (sysctl_sched_preferred * cpu_rq(cpux)->cfs.avg.util_avg > + sysctl_sched_allowed * cpu_rq(cpuy)->cfs.avg.util_avg) + cpus = &p->cpus_allowed; + else + cpus = &p->cpus_preferred; + } if (sd_flag & SD_BALANCE_WAKE) { record_wakee(p); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..f856bdb 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1863,6 +1863,8 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags); extern const_debug unsigned int sysctl_sched_nr_migrate; extern const_debug unsigned int sysctl_sched_migration_cost; +extern const_debug unsigned int sysctl_sched_preferred; +extern const_debug unsigned int sysctl_sched_allowed; #ifdef CONFIG_SCHED_HRTICK diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 7d1008b..bdffb48 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -383,6 +383,20 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, + { + .procname = "sched_preferred", + .data = &sysctl_sched_preferred, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "sched_allowed", + .data = &sysctl_sched_allowed, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, #ifdef CONFIG_SCHEDSTATS { .procname = "sched_schedstats", -- 2.9.3
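The decision the two tunables encode is a ratio test done with a cross-multiplication instead of a division: spill from cpus_preferred to cpus_allowed only when util(preferred) / util(allowed) exceeds sched_allowed / sched_preferred. With the patch's defaults of preferred=1 and allowed=100, the preferred side has to be more than 100x as utilized before the scheduler widens the search. A minimal sketch of that comparison (the helper itself is illustrative, only the names and defaults come from the patch):

```c
#include <stdbool.h>

/* Cross-multiplied ratio test: true means search cpus_allowed instead of
 * the soft-affinity set cpus_preferred. */
static bool spill_to_allowed(unsigned int sched_preferred,
                             unsigned int sched_allowed,
                             unsigned long util_preferred,
                             unsigned long util_allowed)
{
    return sched_preferred * util_preferred >
           sched_allowed * util_allowed;
}
```

Raising sched_preferred (or lowering sched_allowed) makes the affinity "softer", i.e. spills earlier.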
Re: [RFC PATCH v3 00/16] Core scheduling v3
On 6/12/19 9:33 AM, Julien Desfossez wrote: After reading more traces and trying to understand why only untagged tasks are starving when there are cpu-intensive tasks running on the same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In the case where ‘core_cookie’ is 0, we are supposed to prefer the tagged task only if its priority is higher, but when the priorities are equal we prefer it as well, which causes the starvation. ‘pick_task’ is biased toward selecting its first parameter in case of equality, which in this case was the ‘class_pick’ instead of ‘max’. Reversing the order of the parameters solves this issue and matches the expected behavior. So we can get rid of this vruntime_boost concept. We have tested the fix below and it seems to work well with tagged/untagged tasks. My 2 DB instance runs with this patch are better with CORESCHED_STALL_FIX than NO_CORESCHED_STALL_FIX in terms of performance, std deviation and idleness. Maybe enable it by default?

NO_CORESCHED_STALL_FIX:
users   %stdev   %gain   %idle
16      25       -42.4   73
24      32       -26.3   67
32      0.2      -48.9   62

CORESCHED_STALL_FIX:
users   %stdev   %gain   %idle
16      6.5      -23     70
24      0.6      -17     60
32      1.5      -30.2   52
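The tie-breaking bug discussed above comes down to a picker that returns its first argument when priorities are equal, so which task wins a tie depends purely on argument order. A minimal model of that behavior (hypothetical types, not the core-scheduling code):

```c
/* Toy priority picker: higher prio wins, ties go to the FIRST argument.
 * This is the bias described in the mail; swapping the call's argument
 * order flips who wins an exact tie. */
struct toy_entity { int prio; };

static struct toy_entity *toy_pick(struct toy_entity *a, struct toy_entity *b)
{
    return (b->prio > a->prio) ? b : a;   /* tie => a, the first argument */
}
```

Calling toy_pick(class_pick, max) therefore keeps preferring the tagged class_pick on equal priority (the starvation case), while toy_pick(max, class_pick) prefers max, matching the fix.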
[PATCH v3 4/7] sched: add sched feature to disable idle core search
Add a new sched feature SIS_CORE to have an option to disable idle core search (select_idle_core). Signed-off-by: subhra mazumdar --- kernel/sched/features.h | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 858589b..de4d506 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true) */ SCHED_FEAT(SIS_AVG_CPU, false) SCHED_FEAT(SIS_PROP, true) +SCHED_FEAT(SIS_CORE, true) /* * Issue a WARN when we do multiple update_rq_clock() calls -- 2.9.3
[PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
Introduce a per-cpu variable to keep the number of HT siblings of a cpu. This will be used for quick lookup in select_idle_cpu to determine the limits of search. This patch does it only for x86. Signed-off-by: subhra mazumdar --- arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 1 + arch/x86/kernel/smpboot.c | 17 - include/linux/topology.h| 4 4 files changed, 22 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index da545df..1e90cbd 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -22,6 +22,7 @@ extern int smp_num_siblings; extern unsigned int num_processors; DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); +DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); /* cpus sharing the last level cache: */ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map); diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 453cf38..dd19c71 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu); #ifdef CONFIG_SMP #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) +#define topology_sibling_weight(cpu) (per_cpu(cpumask_weight_sibling, cpu)) extern unsigned int __max_logical_packages; #define topology_max_packages()(__max_logical_packages) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 362dd89..20bf676 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -85,6 +85,10 @@ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map); EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); +/* representing number of HT siblings of each CPU */ +DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling); +EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling); + /* 
representing HT and core siblings of each logical CPU */ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); @@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu) if (!has_mp) { cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = + cpumask_weight(topology_sibling_cpumask(cpu)); cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)); cpumask_set_cpu(cpu, topology_core_cpumask(cpu)); c->booted_cores = 1; @@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu) for_each_cpu(i, cpu_sibling_setup_mask) { o = _data(i); - if ((i == cpu) || (has_smt && match_smt(c, o))) + if ((i == cpu) || (has_smt && match_smt(c, o))) { link_mask(topology_sibling_cpumask, cpu, i); + threads = cpumask_weight(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = threads; + per_cpu(cpumask_weight_sibling, i) = threads; + } if ((i == cpu) || (has_mp && match_llc(c, o))) link_mask(cpu_llc_shared_mask, cpu, i); @@ -1173,6 +1183,8 @@ static __init void disable_smp(void) else physid_set_mask_of_physid(0, _cpu_present_map); cpumask_set_cpu(0, topology_sibling_cpumask(0)); + per_cpu(cpumask_weight_sibling, 0) = + cpumask_weight(topology_sibling_cpumask(0)); cpumask_set_cpu(0, topology_core_cpumask(0)); } @@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu) for_each_cpu(sibling, topology_core_cpumask(cpu)) { cpumask_clear_cpu(cpu, topology_core_cpumask(sibling)); + per_cpu(cpumask_weight_sibling, sibling) = + cpumask_weight(topology_sibling_cpumask(sibling)); /*/ * last thread sibling in this cpu core going down */ @@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu) cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling)); cpumask_clear(cpu_llc_shared_mask(cpu)); cpumask_clear(topology_sibling_cpumask(cpu)); + per_cpu(cpumask_weight_sibling, cpu) = 0; cpumask_clear(topology_core_cpumask(cpu)); c->cpu_core_id = 0; c->booted_cores = 0; diff --git a/include/linux/topology.h 
b/include/linux/topology.h index cb0775e..a85aea1 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_sibling_cpumask #define topology_sibling_cpumask(cpu) cpumask_of(cpu) #endif +#ifndef topology_sibling_weight +#d
[PATCH v3 5/7] sched: SIS_CORE to disable idle core search
Use SIS_CORE to disable idle core search. For some workloads select_idle_core becomes a scalability bottleneck, removing it improves throughput. Also there are workloads where disabling it can hurt latency, so need to have an option. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c1ca88e..6a74808 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (!sd) return target; - i = select_idle_core(p, sd, target); - if ((unsigned)i < nr_cpumask_bits) - return i; + if (sched_feat(SIS_CORE)) { + i = select_idle_core(p, sd, target); + if ((unsigned)i < nr_cpumask_bits) + return i; + } i = select_idle_cpu(p, sd, target); if ((unsigned)i < nr_cpumask_bits) -- 2.9.3
[PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
Put upper and lower limits on the cpu search of select_idle_cpu. The lower limit is the number of cpus in a core while the upper limit is twice that. This ensures that for any architecture we will usually search beyond a core. The upper limit also helps in keeping the search cost low and constant. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..b58f08f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, nr = INT_MAX; + int cpu, limit, floor, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - if (span_avg > 4*avg_cost) + floor = cpumask_weight(topology_sibling_cpumask(target)); + if (floor < 2) + floor = 2; + limit = floor << 1; + if (span_avg > floor*avg_cost) { nr = div_u64(span_avg, avg_cost); - if (nr > limit) + nr = limit; + } else { + nr = floor; + } } time = local_clock(); -- 2.9.3
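The nr computation in the hunk above can be sketched in plain C: the floor is the SMT width of the target's core (at least 2), the ceiling is twice that, and in between nr scales with avg_idle/avg_cost. An illustrative helper, not the kernel function:

```c
/* Compute how many CPUs select_idle_cpu may probe, per the patch:
 *   floor = max(smt_weight, 2), limit = 2*floor,
 *   nr = clamp(span_avg / avg_cost, floor, limit). */
static int toy_search_nr(unsigned int smt_weight,
                         unsigned long long span_avg,
                         unsigned long long avg_cost)
{
    int floor = smt_weight < 2 ? 2 : (int)smt_weight;
    int limit = floor << 1;        /* upper bound: twice a core */
    int nr = floor;

    if (span_avg > (unsigned long long)floor * avg_cost) {
        nr = (int)(span_avg / avg_cost);
        if (nr > limit)
            nr = limit;
    }
    return nr;
}
```

So on a 2-way SMT machine the scan is always between 2 and 4 CPUs, and on 8-way SMT between 8 and 16, which is what keeps the cost roughly constant per architecture.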
[PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling
Use the per-cpu var cpumask_weight_sibling for quick lookup in select_idle_cpu. This is the fast path of the scheduler and every cycle is worth saving; cpumask_weight has to iterate over the words of the mask to count set bits, whereas the precomputed per-cpu value is a single load. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6a74808..878f11c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t if (sched_feat(SIS_PROP)) { u64 span_avg = sd->span_weight * avg_idle; - floor = cpumask_weight(topology_sibling_cpumask(target)); + floor = topology_sibling_weight(target); if (floor < 2) floor = 2; limit = floor << 1; -- 2.9.3
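The trade-off this patch makes is between recomputing a population count on every wakeup versus caching it whenever the sibling mask changes. A toy illustration of both sides (illustrative names, not kernel code):

```c
#include <stdint.h>

/* What cpumask_weight amounts to: a popcount over the mask words,
 * one loop step per set bit (Kernighan's trick). */
static unsigned int toy_mask_weight(const uint64_t *mask, int nwords)
{
    unsigned int w = 0;
    for (int i = 0; i < nwords; i++) {
        uint64_t x = mask[i];
        while (x) {
            x &= x - 1;   /* clear lowest set bit */
            w++;
        }
    }
    return w;
}

/* The patch's approach: recompute only when the sibling mask changes,
 * so the wakeup fast path just reads sibling_weight. */
struct toy_cpu { uint64_t sibling_mask; unsigned int sibling_weight; };

static void toy_update_siblings(struct toy_cpu *c, uint64_t mask)
{
    c->sibling_mask = mask;
    c->sibling_weight = toy_mask_weight(&c->sibling_mask, 1);
}
```

Sibling masks change only on CPU hotplug, so moving the count to that slow path is essentially free.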
[PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit
Introduce a per-cpu variable to track the limit up to which the idle cpu search was done in select_idle_cpu(). This will help to start the search next time from there. This is necessary for rotating the search window over the entire LLC domain. Signed-off-by: subhra mazumdar --- kernel/sched/core.c | 2 ++ kernel/sched/sched.h | 1 + 2 files changed, 3 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 874c427..80657fc 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -24,6 +24,7 @@ #include DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL) /* @@ -5966,6 +5967,7 @@ void __init sched_init(void) for_each_possible_cpu(i) { struct rq *rq; + per_cpu(next_cpu, i) = -1; rq = cpu_rq(i); raw_spin_lock_init(&rq->lock); rq->nr_running = 0; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..4cecfa2 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -994,6 +994,7 @@ static inline void update_idle_core(struct rq *rq) { } #endif DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu); #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) -- 2.9.3
[PATCH v3 3/7] sched: rotate the cpu search window for better spread
Rotate the cpu search window for better spread of threads. This will ensure an idle cpu will quickly be found if one exists. Signed-off-by: subhra mazumdar --- kernel/sched/fair.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b58f08f..c1ca88e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t u64 avg_cost, avg_idle; u64 time, cost; s64 delta; - int cpu, limit, floor, nr = INT_MAX; + int cpu, limit, floor, target_tmp, nr = INT_MAX; this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } } + if (per_cpu(next_cpu, target) != -1) + target_tmp = per_cpu(next_cpu, target); + else + target_tmp = target; + time = local_clock(); - for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { + for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) { + per_cpu(next_cpu, target) = cpu; if (!--nr) return -1; if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) -- 2.9.3
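The rotation works by remembering where the last bounded scan of the LLC span stopped (the per-cpu next_cpu from patch 2/7) and starting the next wrapped walk from there, so successive wakeups probe different CPUs instead of repeatedly hammering the ones nearest the target. A sketch with a plain array standing in for the sched_domain span (illustrative model, not the kernel function):

```c
#define TOY_NCPUS 8

static int toy_next_start = -1;   /* stands in for per_cpu(next_cpu, target) */

/* Scan at most `nr` CPUs, wrapping around the span; `idle` flags which
 * CPUs are idle. Records where the scan stopped so the next call resumes
 * from there. Returns the first idle CPU found, or -1. */
static int toy_select_idle_cpu(const int idle[TOY_NCPUS], int target, int nr)
{
    int start = (toy_next_start != -1) ? toy_next_start : target;

    for (int i = 0; i < TOY_NCPUS && nr > 0; i++, nr--) {
        int cpu = (start + i) % TOY_NCPUS;
        toy_next_start = cpu;         /* next search resumes here */
        if (idle[cpu])
            return cpu;
    }
    return -1;
}
```

Two consecutive calls with a small nr cover different windows of the span, which is the "better spread" the commit message refers to.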
[PATCH v3 0/7] Improve scheduler scalability for fast path
2.82 93.33 (0.71%) 1.24 75 124 2.13 122.67 (1.08%) 1.7 90 152 1.74 149.33 (1.75%) 2.35 95 171 2.11 167 (2.34%) 2.74 99 512.67104.96 206 (59.82%)8.86 99.52296 82.553121.67 (-35.96%) 97.37 99.912517.33 2.38 12592 (-0.6%) 1.67 Changes from v2->v3: -Use shift operator instead of multiplication to compute limit -Use per-CPU variable to precompute the number of sibling SMTs for x86 subhra mazumdar (7): sched: limit cpu search in select_idle_cpu sched: introduce per-cpu var next_cpu to track search limit sched: rotate the cpu search window for better spread sched: add sched feature to disable idle core search sched: SIS_CORE to disable idle core search x86/smpboot: introduce per-cpu variable for HT siblings sched: use per-cpu variable cpumask_weight_sibling arch/x86/include/asm/smp.h | 1 + arch/x86/include/asm/topology.h | 1 + arch/x86/kernel/smpboot.c | 17 - include/linux/topology.h| 4 kernel/sched/core.c | 2 ++ kernel/sched/fair.c | 31 +++ kernel/sched/features.h | 1 + kernel/sched/sched.h| 1 + 8 files changed, 49 insertions(+), 9 deletions(-) -- 2.9.3
Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
On 5/14/19 10:27 AM, Subhra Mazumdar wrote: On 5/14/19 9:03 AM, Steven Sistare wrote: On 5/13/2019 7:35 AM, Peter Zijlstra wrote: On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote: On 10-05-19, 09:21, Peter Zijlstra wrote: I don't hate his per se; but the whole select_idle_sibling() thing is something that needs looking at. There was the task stealing thing from Steve that looked interesting and that would render your apporach unfeasible. I am surely missing something as I don't see how that patchset will make this patchset perform badly, than what it already does. Nah; I just misremembered. I know Oracle has a patch set poking at select_idle_siblings() _somewhere_ (as do I), and I just found the wrong one. Basically everybody is complaining select_idle_sibling() is too expensive for checking the entire LLC domain, except for FB (and thus likely some other workloads too) that depend on it to kill their tail latency. But I suppose we could still do this, even if we scan only a subset of the LLC, just keep track of the last !idle CPU running only SCHED_IDLE tasks and pick that if you do not (in your limited scan) find a better candidate. Subhra posted a patch that incrementally searches for an idle CPU in the LLC, remembering the last CPU examined, and searching a fixed number of CPUs from there. That technique is compatible with the one that Viresh suggests; the incremental search would stop if a SCHED_IDLE cpu was found. This was the last version of patchset I sent: https://lkml.org/lkml/2018/6/28/810 Also select_idle_core is a net -ve for certain workloads like OLTP. So I had put a SCHED_FEAT to be able to disable it. Forgot to add, the cpumask_weight computation may not be O(1) with large number of CPUs, so needs to be precomputed in a per-cpu variable to further optimize. That part is missing from the above patchset. Thanks, Subhra I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap of idle CPUs, updated with atomic operations. 
Performance was basically unchanged for the workloads I tested, and I inserted timers around the idle search showing it was a very small fraction of time both before and after my changes. That led me to ignore the push side and optimize the pull side with task stealing. I would be very interested in hearing from folks that have workloads that demonstrate that select_idle_sibling is too expensive. - Steve
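The incremental search described above (remember the last CPU examined, scan a fixed number of CPUs from there) can be sketched in plain userspace C. Everything here is illustrative: the names, the per-LLC cursor, and the idle-state array are stand-ins for the real runqueue state, and a kernel version would use per-cpu variables and cpumask helpers instead.

```c
#include <assert.h>

#define NR_CPUS    16   /* toy LLC domain size */
#define SCAN_LIMIT 4    /* fixed per-wakeup search budget */

/* Hypothetical per-LLC cursor: where the previous search left off. */
static int per_llc_next;

/* 1 = idle, 0 = busy; stands in for reading the real rq state. */
static int cpu_is_idle[NR_CPUS];

/*
 * Limited incremental scan: examine at most SCAN_LIMIT CPUs starting
 * from where the previous search stopped, wrapping around the LLC.
 * Returns an idle CPU id, or -1 if none is found within the budget.
 */
static int select_idle_cpu_limited(void)
{
    int i, cpu = per_llc_next;

    for (i = 0; i < SCAN_LIMIT; i++) {
        if (cpu_is_idle[cpu]) {
            per_llc_next = (cpu + 1) % NR_CPUS;
            return cpu;
        }
        cpu = (cpu + 1) % NR_CPUS;
    }
    per_llc_next = cpu;   /* resume here on the next wakeup */
    return -1;
}
```

The point of the cursor is that consecutive wakeups cover different slices of the LLC, so the whole domain is eventually examined without any single wakeup paying the full O(LLC) cost.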
Re: [RFC V2 2/2] sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
On 5/14/19 9:03 AM, Steven Sistare wrote: On 5/13/2019 7:35 AM, Peter Zijlstra wrote: On Mon, May 13, 2019 at 03:04:18PM +0530, Viresh Kumar wrote: On 10-05-19, 09:21, Peter Zijlstra wrote: I don't hate this per se; but the whole select_idle_sibling() thing is something that needs looking at. There was the task stealing thing from Steve that looked interesting, and that would render your approach unfeasible. I am surely missing something, as I don't see how that patchset will make this patchset perform worse than it already does. Nah; I just misremembered. I know Oracle has a patch set poking at select_idle_siblings() _somewhere_ (as do I), and I just found the wrong one. Basically everybody is complaining that select_idle_sibling() is too expensive because it checks the entire LLC domain, except for FB (and thus likely some other workloads too) that depend on it to kill their tail latency. But I suppose we could still do this, even if we scan only a subset of the LLC: just keep track of the last !idle CPU running only SCHED_IDLE tasks and pick that if you do not (in your limited scan) find a better candidate. Subhra posted a patch that incrementally searches for an idle CPU in the LLC, remembering the last CPU examined and searching a fixed number of CPUs from there. That technique is compatible with the one Viresh suggests; the incremental search would stop if a SCHED_IDLE cpu was found. This was the last version of the patchset I sent: https://lkml.org/lkml/2018/6/28/810 Also, select_idle_core is a net negative for certain workloads like OLTP, so I had put a SCHED_FEAT in place to be able to disable it. Thanks, Subhra I also fiddled with select_idle_sibling, maintaining a per-LLC bitmap of idle CPUs, updated with atomic operations. Performance was basically unchanged for the workloads I tested, and I inserted timers around the idle search showing it was a very small fraction of time both before and after my changes.
That led me to ignore the push side and optimize the pull side with task stealing. I would be very interested in hearing from folks that have workloads that demonstrate that select_idle_sibling is too expensive. - Steve
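A per-LLC idle bitmap of the kind Steve mentions can be sketched with C11 atomics. This is a hypothetical userspace illustration (a single 64-bit mask and the GCC/Clang `__builtin_ctzl` builtin), not Steve's actual patch; the kernel would use its own atomic bitops and cpumask types.

```c
#include <assert.h>
#include <stdatomic.h>

/* One 64-bit word covers an LLC of up to 64 CPUs. */
static _Atomic unsigned long llc_idle_mask;

/* Called when a CPU enters idle. */
static void cpu_set_idle(int cpu)
{
    atomic_fetch_or(&llc_idle_mask, 1UL << cpu);
}

/* Called when a CPU picks up work. */
static void cpu_clear_idle(int cpu)
{
    atomic_fetch_and(&llc_idle_mask, ~(1UL << cpu));
}

/* O(1) pick: lowest-numbered idle CPU in the LLC, or -1 if none. */
static int pick_idle_cpu(void)
{
    unsigned long mask = atomic_load(&llc_idle_mask);

    return mask ? __builtin_ctzl(mask) : -1;
}
```

The trade-off Steve's result hints at: the pick itself becomes cheap, but every idle entry/exit now does an atomic read-modify-write on a shared cache line, which is why the bitmap may not help when the plain search is already a small fraction of time.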
Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks
select_task_rq_* seems to be unchanged. So the search logic to find a cpu to enqueue when a task becomes runnable is the same as before and doesn't do any kind of cookie matching. Okay, that's true in the task wakeup path, and load_balance also seems to pull tasks without checking the cookie. But my system was not overloaded when I tested this patch, so there was none or only one task in the rq and on the rq's rb tree, so this patch does not make a difference. I had the same hypothesis for my tests. The question is, should we do cookie checking when a task selects a CPU and when load balance pulls a task? The basic issue is keeping the CPUs busy. In the case of an overloaded system, the trivial new idle balancer should be able to find a matching task in case of forced idle. More problematic is the lower-load scenario, when there aren't any matching tasks to be found but there are runnable tasks of other groups. Also, the wakeup code path tries to balance threads across cores (select_idle_core) first, which is the opposite of what core scheduling wants. I will re-run my tests with select_idle_core disabled, but the issue is that on x86 Intel systems (my test rig) the CPU ids are interleaved across cores, so even select_idle_cpu will balance across cores first. Maybe others have some better ideas? Thanks, -Aubrey
Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks
On 5/8/19 6:38 PM, Aubrey Li wrote: On Thu, May 9, 2019 at 8:29 AM Subhra Mazumdar wrote: On 5/8/19 5:01 PM, Aubrey Li wrote: On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar wrote: On 5/8/19 11:19 AM, Subhra Mazumdar wrote: On 5/8/19 8:49 AM, Aubrey Li wrote: Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel build and one VM doing sysbench, limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpus for each VM. Making the fix did improve kernel build time by 7%. I'm gonna agree with the patch below, but just wonder if the testing result is consistent, as I didn't see any improvement in my testing environment. IIUC, from the code behavior, especially for the 2 VMs case (only 2 different cookies), the per-rq rb tree is unlikely to have nodes with different cookies; that is, all the nodes on this tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a match and go to the third branch
- else, no matter whether we go left or right, we can't find a match, and we'll return the idle thread finally.
Please correct me if I was wrong. Thanks, -Aubrey This is searching in the per-core rb tree (rq->core_tree), which can have 2 different cookies. But having said that, even I didn't see any improvement with the patch for my DB test case. But logically it is correct. Ah, my bad. It is per rq. But it can still have 2 different cookies. Not sure why you think it is unlikely? Yeah, I meant 2 different cookies on the system, but unlikely 2 different cookies on one same rq. If I read the source correctly, for the sched_core_balance path, when trying to steal a cookie from another CPU, sched_core_find() uses dst's cookie to search if there is a cookie match in src's rq, and sched_core_find() returns the idle or matched task, which is later put onto dst's rq (activate_task() in sched_core_find()). At this moment, the nodes on the rq's rb tree should have the same cookies.
Thanks, -Aubrey Yes, but sched_core_find is also called from pick_task to find a local matching task. Can a local search introduce different cookies? Where would they come from? No. I meant the local search uses the same binary search as sched_core_find, so it has to be correct. The enqueue side logic of the scheduler is unchanged with core scheduling, But only tasks with cookies are placed onto this rb tree? so it is possible tasks with different cookies are enqueued on the same rq. So while searching for a matching task locally, doing it correctly should matter. May I know how exactly? select_task_rq_* seems to be unchanged. So the search logic to find a cpu to enqueue when a task becomes runnable is the same as before and doesn't do any kind of cookie matching. Thanks, -Aubrey
Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks
On 5/8/19 5:01 PM, Aubrey Li wrote: On Thu, May 9, 2019 at 2:41 AM Subhra Mazumdar wrote: On 5/8/19 11:19 AM, Subhra Mazumdar wrote: On 5/8/19 8:49 AM, Aubrey Li wrote: Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel build and one VM doing sysbench, limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpus for each VM. Making the fix did improve kernel build time by 7%. I'm gonna agree with the patch below, but just wonder if the testing result is consistent, as I didn't see any improvement in my testing environment. IIUC, from the code behavior, especially for the 2 VMs case (only 2 different cookies), the per-rq rb tree is unlikely to have nodes with different cookies; that is, all the nodes on this tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a match and go to the third branch
- else, no matter whether we go left or right, we can't find a match, and we'll return the idle thread finally.
Please correct me if I was wrong. Thanks, -Aubrey This is searching in the per-core rb tree (rq->core_tree), which can have 2 different cookies. But having said that, even I didn't see any improvement with the patch for my DB test case. But logically it is correct. Ah, my bad. It is per rq. But it can still have 2 different cookies. Not sure why you think it is unlikely? Yeah, I meant 2 different cookies on the system, but unlikely 2 different cookies on one same rq. If I read the source correctly, for the sched_core_balance path, when trying to steal a cookie from another CPU, sched_core_find() uses dst's cookie to search if there is a cookie match in src's rq, and sched_core_find() returns the idle or matched task, which is later put onto dst's rq (activate_task() in sched_core_find()). At this moment, the nodes on the rq's rb tree should have the same cookies. Thanks, -Aubrey Yes, but sched_core_find is also called from pick_task to find a local matching task.
The enqueue side logic of the scheduler is unchanged with core scheduling, so it is possible tasks with different cookies are enqueued on the same rq. So while searching for a matching task locally, doing it correctly should matter.
Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks
On 5/8/19 11:19 AM, Subhra Mazumdar wrote: On 5/8/19 8:49 AM, Aubrey Li wrote: Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel build and one VM doing sysbench, limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpus for each VM. Making the fix did improve kernel build time by 7%. I'm gonna agree with the patch below, but just wonder if the testing result is consistent, as I didn't see any improvement in my testing environment. IIUC, from the code behavior, especially for the 2 VMs case (only 2 different cookies), the per-rq rb tree is unlikely to have nodes with different cookies; that is, all the nodes on this tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a match and go to the third branch
- else, no matter whether we go left or right, we can't find a match, and we'll return the idle thread finally.
Please correct me if I was wrong. Thanks, -Aubrey This is searching in the per-core rb tree (rq->core_tree), which can have 2 different cookies. But having said that, even I didn't see any improvement with the patch for my DB test case. But logically it is correct. Ah, my bad. It is per rq. But it can still have 2 different cookies. Not sure why you think it is unlikely?
Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks
On 5/8/19 8:49 AM, Aubrey Li wrote: Pawan ran an experiment setting up 2 VMs, with one VM doing a parallel kernel build and one VM doing sysbench, limiting both VMs to run on 16 cpu threads (8 physical cores), with 8 vcpus for each VM. Making the fix did improve kernel build time by 7%. I'm gonna agree with the patch below, but just wonder if the testing result is consistent, as I didn't see any improvement in my testing environment. IIUC, from the code behavior, especially for the 2 VMs case (only 2 different cookies), the per-rq rb tree is unlikely to have nodes with different cookies; that is, all the nodes on this tree should have the same cookie, so:
- if the parameter cookie is equal to the rb tree cookie, we meet a match and go to the third branch
- else, no matter whether we go left or right, we can't find a match, and we'll return the idle thread finally.
Please correct me if I was wrong. Thanks, -Aubrey This is searching in the per-core rb tree (rq->core_tree), which can have 2 different cookies. But having said that, even I didn't see any improvement with the patch for my DB test case. But logically it is correct.
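The three-branch search discussed above can be sketched as an ordinary binary search keyed by cookie. This is a simplified userspace illustration, not the kernel's rb-tree code: `struct task`, `core_find`, and the NULL-for-idle convention below are stand-ins for the real `sched_core_find()` machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a task node keyed by its core cookie. */
struct task {
    unsigned long cookie;
    struct task *left, *right;
    const char *name;
};

/*
 * Descend the tree comparing cookies: go left or right while the
 * cookies differ, return on the first match (the "third branch"),
 * or NULL (standing in for the idle task) when no cookie matches.
 */
static struct task *core_find(struct task *node, unsigned long cookie)
{
    while (node) {
        if (cookie < node->cookie)
            node = node->left;
        else if (cookie > node->cookie)
            node = node->right;
        else
            return node;   /* third branch: cookie match */
    }
    return NULL;           /* no match: pick idle */
}
```

Aubrey's observation in these terms: if every node in the tree carries the same cookie, the search degenerates to "match at the root or fall off the tree", so a tree that can genuinely hold two cookies is what makes the left/right branches matter.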
Re: [RFC PATCH v2 00/17] Core scheduling v2
On 4/26/19 3:43 AM, Mel Gorman wrote: On Fri, Apr 26, 2019 at 10:42:22AM +0200, Ingo Molnar wrote: It should, but it's not perfect. For example, wake_affine_idle does not take sibling activity into account even though select_idle_sibling *may* take it into account. Even select_idle_sibling in its fast path may use an SMT sibling instead of searching. There are also potential side-effects with cpuidle. Some workloads migrate around the socket as they are communicating because of how the search for an idle CPU works. With SMT on, there is potentially a longer opportunity for a core to reach a deep c-state and incur a bigger wakeup latency. This is a very weak theory, but I've seen cases where latency-sensitive workloads with only two communicating tasks are affected by CPUs reaching low c-states due to migrations. Clearly it doesn't. It's more that it's a best effort to wake up quickly instead of being perfect by using an expensive search every time. Yeah, but your numbers suggest that for *most* not-heavily-interacting under-utilized CPU-bound workloads we hurt in the 5-10% range compared to no-SMT - more in some cases. Indeed, it was higher than expected, and we can't even use the excuse that more resources are available to a single logical CPU, as the scheduler is meant to keep them apart. So we avoid a maybe 0.1% scheduler placement overhead but inflict 5-10% harm on the workload, and also blow up stddev by randomly co-scheduling two tasks on the same physical core? Not a good trade-off. I really think we should implement a relatively strict physical core placement policy in the under-utilized case, and resist any attempts to weaken this for special workloads that ping-pong quickly and benefit from sharing the same physical core. It's worth a shot at least. Changes should mostly be in the wake_affine path for most loads of interest. Doesn't select_idle_sibling already try to do that by calling select_idle_core? For our OLTP workload we in fact found the cost of select_idle_core was actually hurting more than it helped to find a fully idle core, so a net negative.
Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
On 4/19/19 1:40 AM, Ingo Molnar wrote: * Subhra Mazumdar wrote: I see similar improvement with this patch as removing the condition I earlier mentioned. So that's not needed. I also included the patch for the priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users (from earlier emails).

1 DB instance
users  baseline  %idle  core_sched  %idle
16     1         84     -4.9%       84
24     1         76     -6.7%       75
32     1         69     -2.4%       69

2 DB instance
users  baseline  %idle  core_sched  %idle
16     1         66     -19.5%      69
24     1         54     -9.8%       57
32     1         42     -27.2%      48

So HT disabling slows down the 2 DB instance by -22%, while core-sched slows it down by -27.2%? Would it be possible to see all the results in two larger tables (1 DB instance and 2 DB instance) so that we can compare the performance of the 3 kernel variants with each other:
- "vanilla +HT": Hyperthreading enabled, vanilla scheduler
- "vanilla -HT": Hyperthreading disabled, vanilla scheduler
- "core_sched": Hyperthreading enabled, core-scheduling enabled
? Thanks, Ingo

Following are the numbers. Disabling HT gives improvement in some cases.

1 DB instance
users  vanilla+HT  core_sched  vanilla-HT
16     1           -4.9%       -11.7%
24     1           -6.7%       +13.7%
32     1           -2.4%       +8%

2 DB instance
users  vanilla+HT  core_sched  vanilla-HT
16     1           -19.5%      +5.6%
24     1           -9.8%       +3.5%
32     1           -27.2%      -22.8%
Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
On 4/9/19 11:38 AM, Julien Desfossez wrote: We found the source of the major performance regression we discussed previously. It turns out there was a pattern where a task (a kworker in this case) could be woken up, but the core could still end up idle before that task had a chance to run. Example sequence, with cpu0 and cpu1 siblings on the same core, and task1 and task2 in the same cgroup with the tag enabled (each following line happens in increasing order of time):
- task1 running on cpu0, task2 running on cpu1
- sched_waking(kworker/0, target_cpu=cpu0)
- task1 scheduled out of cpu0
- kworker/0 cannot run on cpu0 because task2 is still running on cpu1; cpu0 is idle
- task2 scheduled out of cpu1
- cpu1 doesn’t select kworker/0 for cpu0, because the optimization path ends the task selection if core_cookie is NULL for the currently selected process and cpu1’s runqueue
- cpu1 is idle
--> both siblings are idle but kworker/0 is still in the run queue of cpu0. cpu0 may stay idle for longer if it goes into deep idle.
With the fix below, we ensure an IPI is sent to the sibling if it is idle and has tasks waiting in its runqueue. This fixes the performance issue we were seeing. Now here is what we can measure with a disk write-intensive benchmark:
- no performance impact with core scheduling enabled and no tagged task,
- 5% overhead if one tagged task is competing with an untagged task,
- 10% overhead if 2 tasks tagged with different tags are competing against each other.
We are starting more scaling tests, but this is very encouraging!
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1fa10561279..02c862a5e973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			trace_printk("unconstrained pick: %s/%d %lx\n",
 				     next->comm, next->pid, next->core_cookie);

+			rq->core_pick = NULL;
+			/*
+			 * If the sibling is idling, we might want to wake it
+			 * so that it can check for any runnable but blocked tasks
+			 * due to previous task matching.
+			 */
+			for_each_cpu(j, smt_mask) {
+				struct rq *rq_j = cpu_rq(j);
+
+				rq_j->core_pick = NULL;
+				if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running) {
+					resched_curr(rq_j);
+					trace_printk("IPI(%d->%d[%d]) idle preempt\n",
+						     cpu, j, rq_j->nr_running);
+				}
+			}
 			goto done;
 		}

I see similar improvement with this patch as removing the condition I earlier mentioned. So that's not needed. I also included the patch for the priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users (from earlier emails).

1 DB instance
users  baseline  %idle  core_sched  %idle
16     1         84     -4.9%       84
24     1         76     -6.7%       75
32     1         69     -2.4%       69

2 DB instance
users  baseline  %idle  core_sched  %idle
16     1         66     -19.5%      69
24     1         54     -9.8%       57
32     1         42     -27.2%      48
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
We tried to comment those lines and it doesn’t seem to get rid of the performance regression we are seeing. Can you elaborate a bit more about the test you are performing, and what kind of resources it uses? I am running 1 and 2 Oracle DB instances, each running a TPC-C workload. The clients driving the instances also run on the same node. Each server-client pair is put in its own cpu group and tagged. Can you also try to replicate our test and see if you see the same problem?

cgcreate -g cpu,cpuset:set1
cat /sys/devices/system/cpu/cpu{0,2,4,6}/topology/thread_siblings_list
0,36
2,38
4,40
6,42
echo "0,2,4,6,36,38,40,42" | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.cpus
echo 0 | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.mems
echo 1 | sudo tee /sys/fs/cgroup/cpu,cpuacct/set1/cpu.tag
sysbench --test=fileio prepare
cgexec -g cpu,cpuset:set1 sysbench --threads=4 --test=fileio \
    --file-test-mode=seqwr run

The reason we create a cpuset is to narrow down the investigation to just 4 cores on a highly powerful machine. It might not be needed if testing on a smaller machine. With this sysbench test I am not seeing any improvement from removing the condition. Also, with hackbench I found it makes no difference, but that has a much lower regression to begin with (18%). Julien
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
On 3/29/19 3:23 PM, Subhra Mazumdar wrote: On 3/29/19 6:35 AM, Julien Desfossez wrote: On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar wrote: Is the core wide lock primarily responsible for the regression? I ran up to patch 12, which also has the core wide lock for tagged cgroups and also calls newidle_balance() from pick_next_task(). I don't see any regression. Of course the core sched version of pick_next_task() may be doing more, but compared with __pick_next_task() it doesn't look too horrible. On further testing and investigation, we also agree that spinlock contention is not the major cause for the regression, but we feel that it should be one of the major contributing factors to this performance loss. I finally did some code bisection and found the following lines are basically responsible for the regression. Commenting them out, I don't see the regressions. Can you confirm? I am yet to figure out if this is needed for the correctness of core scheduling and, if so, whether we can do this better.

>8-
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe3918c..3b3388a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * If there weren't no cookies; we don't need
 			 * to bother with the other siblings.
 			 */
-			if (i == cpu && !rq->core->core_cookie)
-				goto next_class;
+			//if (i == cpu && !rq->core->core_cookie)
+			//	goto next_class;

 			continue;
 		}

AFAICT this condition is not needed for correctness as cookie matching will still be enforced. Peter, any thoughts? I get the following numbers with 1 DB and 2 DB instances.

1 DB instance
users  baseline  %idle  core_sched  %idle
16     1         84     -5.5%       84
24     1         76     -5%         76
32     1         69     -0.45%      69

2 DB instance
users  baseline  %idle  core_sched  %idle
16     1         66     -23.8%      69
24     1         54     -3.1%       57
32     1         42     -21.1%      48
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
On 3/29/19 6:35 AM, Julien Desfossez wrote: On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar wrote: Is the core wide lock primarily responsible for the regression? I ran up to patch 12, which also has the core wide lock for tagged cgroups and also calls newidle_balance() from pick_next_task(). I don't see any regression. Of course the core sched version of pick_next_task() may be doing more, but compared with __pick_next_task() it doesn't look too horrible. On further testing and investigation, we also agree that spinlock contention is not the major cause for the regression, but we feel that it should be one of the major contributing factors to this performance loss. I finally did some code bisection and found the following lines are basically responsible for the regression. Commenting them out, I don't see the regressions. Can you confirm? I am yet to figure out if this is needed for the correctness of core scheduling and, if so, whether we can do this better.

>8-
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe3918c..3b3388a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * If there weren't no cookies; we don't need
 			 * to bother with the other siblings.
 			 */
-			if (i == cpu && !rq->core->core_cookie)
-				goto next_class;
+			//if (i == cpu && !rq->core->core_cookie)
+			//	goto next_class;

 			continue;
 		}
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
On 3/22/19 5:06 PM, Subhra Mazumdar wrote: On 3/21/19 2:20 PM, Julien Desfossez wrote: On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar wrote: On 3/18/19 8:41 AM, Julien Desfossez wrote: On further investigation, we could see that the contention is mostly in the way rq locks are taken. With this patchset, we lock the whole core if cpu.tag is set for at least one cgroup. Due to this, __schedule() is more or less serialized for the core, and that accounts for the performance loss that we are seeing. We also saw that newidle_balance() takes a considerably long time in load_balance() due to the rq spinlock contention. Do you think it would help if the core-wide locking was only performed when absolutely needed? Is the core wide lock primarily responsible for the regression? I ran up to patch 12, which also has the core wide lock for tagged cgroups and also calls newidle_balance() from pick_next_task(). I don't see any regression. Of course the core sched version of pick_next_task() may be doing more, but compared with __pick_next_task() it doesn't look too horrible. I gathered some data with only 1 DB instance running (which also has a 52% slowdown). Following are the numbers of pick_next_task() calls and their avg cost for patch 12 and patch 15. The total number of calls seems to be similar, but the avg cost (in us) has more than doubled. For both patches I had put the DB instance into a cpu-tagged cgroup.

                         patch12       patch15
count pick_next_task     62317898      58925395
avg cost pick_next_task  0.6566323209  1.4223810108
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
On 3/21/19 2:20 PM, Julien Desfossez wrote: On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar wrote: On 3/18/19 8:41 AM, Julien Desfossez wrote: On further investigation, we could see that the contention is mostly in the way rq locks are taken. With this patchset, we lock the whole core if cpu.tag is set for at least one cgroup. Due to this, __schedule() is more or less serialized for the core, and that accounts for the performance loss that we are seeing. We also saw that newidle_balance() takes a considerably long time in load_balance() due to the rq spinlock contention. Do you think it would help if the core-wide locking was only performed when absolutely needed? Is the core wide lock primarily responsible for the regression? I ran up to patch 12, which also has the core wide lock for tagged cgroups and also calls newidle_balance() from pick_next_task(). I don't see any regression. Of course the core sched version of pick_next_task() may be doing more, but compared with __pick_next_task() it doesn't look too horrible.
Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
On 3/18/19 8:41 AM, Julien Desfossez wrote: The case where we try to acquire the lock on 2 runqueues belonging to 2 different cores requires the rq_lockp wrapper as well, otherwise we frequently deadlock in there. This fixes the crash reported in 1552577311-8218-1-git-send-email-jdesfos...@digitalocean.com

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76fee56..71bb71f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 		raw_spin_lock(rq_lockp(rq1));
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
 			raw_spin_lock(rq_lockp(rq1));
 			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
 		} else {

With this fix and my previous NULL pointer fix, my stress tests are surviving. I re-ran my 2 DB instance setup on a 44 core, 2 socket system by putting each DB instance in a separate core scheduling group. The numbers look much worse now.

users  baseline  %stdev  %idle  core_sched  %stdev  %idle
16     1         0.3     66     -73.4%      136.8   82
24     1         1.6     54     -95.8%      133.2   81
32     1         1.5     42     -97.5%      124.3   89

I also notice that if I enable a bunch of debug configs related to mutexes, spin locks, lockdep etc. (which I did earlier to debug the deadlock), it opens up a can of worms with multiple crashes.
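Why the fix compares lock pointers rather than rq pointers can be illustrated with a userspace sketch: once several runqueues can share one lock (as SMT siblings do under core scheduling), only an ordering on the lock addresses themselves gives every path a globally consistent acquisition order. The `struct rq` and helper names below are toy stand-ins using pthreads, not the kernel implementation.

```c
#include <assert.h>
#include <pthread.h>

/* Toy rq: under core scheduling several rqs can map to one lock. */
struct rq {
    pthread_mutex_t *lock;   /* what rq_lockp() would return */
};

/*
 * Acquire both locks in a consistent global order.  Ordering by the
 * *lock* address matters once two rqs can share a lock: comparing rq
 * pointers could let two CPUs take the same pair of locks in opposite
 * order and deadlock, and would also double-lock a shared lock.
 */
static void double_rq_lock(struct rq *rq1, struct rq *rq2)
{
    if (rq1->lock == rq2->lock) {
        pthread_mutex_lock(rq1->lock);       /* one shared lock */
    } else if (rq1->lock < rq2->lock) {
        pthread_mutex_lock(rq1->lock);
        pthread_mutex_lock(rq2->lock);
    } else {
        pthread_mutex_lock(rq2->lock);
        pthread_mutex_lock(rq1->lock);
    }
}

static void double_rq_unlock(struct rq *rq1, struct rq *rq2)
{
    pthread_mutex_unlock(rq1->lock);
    if (rq1->lock != rq2->lock)
        pthread_mutex_unlock(rq2->lock);
}
```

With the broken ordering, CPU A locking (rq1, rq2) and CPU B locking (rq2, rq3), where rq2 and rq3 share a lock, can each hold one lock of the same pair while waiting for the other, which matches the frequent deadlock Julien describes.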
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 3/11/19 5:20 PM, Greg Kerr wrote: On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar wrote: On 3/11/19 11:34 AM, Subhra Mazumdar wrote: On 3/10/19 9:23 PM, Aubrey Li wrote: On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar wrote: expected. Most of the performance recovery happens in patch 15 which, unfortunately, is also the one that introduces the hard lockup. After applying Subhra's patch, the following is triggered by enabling core sched when a cgroup is under heavy load. It seems you are facing some other deadlock where printk is involved. Can you drop the last patch (patch 16 sched: Debug bits...) and try? Thanks, Subhra Never mind, I am seeing the same lockdep deadlock output even w/o patch 16. Btw, the NULL fix had something missing; the following works. Is this panic below, which occurs when I tag the first process, related or known? If not, I will debug it tomorrow.
[ 46.831828] BUG: unable to handle kernel NULL pointer dereference at
[ 46.831829] core sched enabled
[ 46.834261] #PF error: [WRITE]
[ 46.834899] PGD 0 P4D 0
[ 46.835438] Oops: 0002 [#1] SMP PTI
[ 46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted 5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
[ 46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
[ 46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0 ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01 00 00 00 0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84 00 00
[ 46.843000] RSP: 0018:b9d300cabe38 EFLAGS: 00010046
[ 46.843744] RAX: RBX: RCX: 0004
[ 46.844709] RDX: 0001 RSI: aea435ae RDI:
[ 46.845689] RBP: b9d300cabed8 R08: R09: 00020800
[ 46.846651] R10: af603ea0 R11: 0001 R12: af6576c0
[ 46.847619] R13: 9a57366c8000 R14: 9a5737401300 R15: ade868f0
[ 46.848584] FS: () GS:9a5737a0() knlGS:
[ 46.849680] CS: 0010 DS: ES: CR0: 80050033
[ 46.850455] CR2: CR3: 0001d36fa000 CR4: 06f0
[ 46.851415] DR0: DR1: DR2:
[ 46.852371] DR3: DR6: fffe0ff0 DR7: 0400
[ 46.853326] Call Trace:
[ 46.853678] __schedule+0x139/0x11f0
[ 46.854167] ? cpumask_next+0x16/0x20
[ 46.854668] ? cpu_stop_queue_work+0xc0/0xc0
[ 46.855252] ? sort_range+0x20/0x20
[ 46.855742] schedule+0x4e/0x60
[ 46.856171] smpboot_thread_fn+0x12a/0x160
[ 46.856725] kthread+0x112/0x120
[ 46.857164] ? kthread_stop+0xf0/0xf0
[ 46.857661] ret_from_fork+0x35/0x40
[ 46.858146] Modules linked in:
[ 46.858562] CR2:
[ 46.859022] ---[ end trace e9fff08f17bfd2be ]---
- Greg This seems to be different
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 3/11/19 11:34 AM, Subhra Mazumdar wrote: On 3/10/19 9:23 PM, Aubrey Li wrote: On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar wrote: expected. Most of the performance recovery happens in patch 15 which, unfortunately, is also the one that introduces the hard lockup. After applying Subhra's patch, the following is triggered by enabling core sched when a cgroup is under heavy load. It seems you are facing some other deadlock where printk is involved. Can you drop the last patch (patch 16 sched: Debug bits...) and try? Thanks, Subhra Never mind, I am seeing the same lockdep deadlock output even w/o patch 16. Btw, the NULL fix had something missing; the following works.

->8
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4..27cbc64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
-	if (cfs_rq->skip == se) {
+	if (cfs_rq->skip && cfs_rq->skip == se) {
 		struct sched_entity *second;

 		if (se == curr) {
@@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	/*
 	 * Prefer last buddy, try to return the CPU to a preempted task.
 	 */
-	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left)
+	    < 1)
 		se = cfs_rq->last;

 	/*
 	 * Someone really wants this to run. If it's not unfair, run it.
 	 */
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left)
+	    < 1)
 		se = cfs_rq->next;

 	clear_buddies(cfs_rq, se);
@@ -6958,6 +6960,9 @@ pick_task_fair(struct rq *rq)

 	se = pick_next_entity(cfs_rq, NULL);

+	if (!(se || curr))
+		return NULL;
+
 	if (curr) {
 		if (se && curr->on_rq)
 			update_curr(cfs_rq);
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 3/10/19 9:23 PM, Aubrey Li wrote: On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar wrote: expected. Most of the performance recovery happens in patch 15 which, unfortunately, is also the one that introduces the hard lockup. After applied Subhra's patch, the following is triggered by enabling core sched when a cgroup is under heavy load. It seems you are facing some other deadlock where printk is involved. Can you drop the last patch (patch 16 sched: Debug bits...) and try? Thanks, Subhra
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 2/22/19 4:45 AM, Mel Gorman wrote: On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote: On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra wrote: However; whichever way around you turn this cookie; it is expensive and nasty. Do you (or anybody else) have numbers for real loads? Because performance is all that matters. If performance is bad, then it's pointless, since just turning off SMT is the answer. I tried to do a comparison between tip/master, ht disabled and this series putting test workloads into a tagged cgroup but unfortunately it failed
[ 156.978682] BUG: unable to handle kernel NULL pointer dereference at 0058
[ 156.986597] #PF error: [normal kernel read fault]
[ 156.991343] PGD 0 P4D 0
[ 156.993905] Oops: [#1] SMP PTI
[ 156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.0.0-rc7-schedcore-v1r1 #1
[ 157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
[ 157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[ 157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
[ 157.037544] RSP: 0018:c9000c5bbde8 EFLAGS: 00010086
[ 157.042819] RAX: 88810f5f6a00 RBX: 0001547f175c RCX: 0001
[ 157.050015] RDX: 88bf3bdb0a40 RSI: RDI: 0001547f175c
[ 157.057215] RBP: 88bf7fae32c0 R08: 0001e358 R09: 88810fb9f000
[ 157.064410] R10: c9000c5bbe08 R11: 88810fb9f5c4 R12:
[ 157.071611] R13: 88bf4e3ea0c0 R14: R15: 88bf4e3ea7a8
[ 157.078814] FS: () GS:88bf7f5c() knlGS:
[ 157.086977] CS: 0010 DS: ES: CR0: 80050033
[ 157.092779] CR2: 0058 CR3: 0220e005 CR4: 003606e0
[ 157.099979] DR0: DR1: DR2:
[ 157.109529] DR3: DR6: fffe0ff0 DR7: 0400
[ 157.119058] Call Trace:
[ 157.123865] pick_next_entity+0x61/0x110
[ 157.130137] pick_task_fair+0x4b/0x90
[ 157.136124] __schedule+0x365/0x12c0
[ 157.141985] schedule_idle+0x1e/0x40
[ 157.147822] do_idle+0x166/0x280
[ 157.153275] cpu_startup_entry+0x19/0x20
[ 157.159420] start_secondary+0x17a/0x1d0
[ 157.165568] secondary_startup_64+0xa4/0xb0
[ 157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[ 157.258990] CR2: 0058
[ 157.264961] ---[ end trace a301ac5e3ee86fde ]---
[ 157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[ 157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
[ 157.316121] RSP: 0018:c9000c5bbde8 EFLAGS: 00010086
[ 157.324060] RAX: 88810f5f6a00 RBX: 0001547f175c RCX: 0001
[ 157.333932] RDX: 88bf3bdb0a40 RSI: RDI: 0001547f175c
[ 157.343795] RBP: 88bf7fae32c0 R08: 0001e358 R09: 88810fb9f000
[ 157.353634] R10: c9000c5bbe08 R11: 88810fb9f5c4 R12:
[ 157.363506] R13: 88bf4e3ea0c0 R14: R15: 88bf4e3ea7a8
[ 157.373395] FS: () GS:88bf7f5c() knlGS:
[ 157.384238] CS: 0010 DS: ES: CR0: 80050033
[ 157.392709] CR2: 0058 CR3: 0220e005 CR4: 003606e0
[ 157.402601] DR0: DR1: DR2:
[ 157.412488] DR3: DR6: fffe0ff0 DR7: 0400
[ 157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
[ 158.529804] Shutting down cpus with NMI
[ 158.573249] Kernel Offset: disabled
[ 158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

RIP translates to kernel/sched/fair.c:6819

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;	/* LINE 6819 */

	if (vdiff <= 0)
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 2/18/19 8:56 AM, Peter Zijlstra wrote: A much 'demanded' feature: core-scheduling :-( I still hate it with a passion, and that is part of why it took a little longer than 'promised'. While this one doesn't have all the 'features' of the previous (never published) version and isn't L1TF 'complete', I tend to like the structure better (relatively speaking: I hate it slightly less). This one is sched class agnostic and therefore, in principle, doesn't horribly wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task' to force-idle siblings). Now, as hinted by that, there are semi sane reasons for actually having this. Various hardware features like Intel RDT - Memory Bandwidth Allocation, work per core (due to SMT fundamentally sharing caches) and therefore grouping related tasks on a core makes it more reliable. However; whichever way around you turn this cookie; it is expensive and nasty. I am seeing the following hard lockup frequently now. Following is full kernel output: [ 5846.412296] drop_caches (8657): drop_caches: 3 [ 5846.624823] drop_caches (8658): drop_caches: 3 [ 5850.604641] hugetlbfs: oracle (8671): Using mlock ulimits for SHM_HUGETL B is deprecated [ 5962.930812] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32 [ 5962.930814] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [ 5962.930828] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted 5.0.0-rc7core_sched #1 [ 5962.930828] Hardware name: Oracle Corporation ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016 [ 5962.930829] RIP: 0010:try_to_wake_up+0x98/0x470 [ 5962.930830] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 
0f 1f 44 00 00 8b 43 3c 8b 73 60 85 f6 0f 85 a6 01 00 00 8b 43 38 85 c0 74 09 f3 90 8b 43 38 <85> c0 75 f7 48 8b 43 10 a8 02 b8 00 00 00 00 0f 85 d5 01 00 00 0f [ 5962.930831] RSP: 0018:c9000f4dbcb8 EFLAGS: 0002 [ 5962.930832] RAX: 0001 RBX: 88dfb4af1680 RCX: 0041 [ 5962.930832] RDX: 0001 RSI: RDI: 88dfb4af214c [ 5962.930833] RBP: R08: 0001 R09: c9000f4dbd80 [ 5962.930833] R10: 8880 R11: ea00f0003d80 R12: 88dfb4af214c [ 5962.930834] R13: 0001 R14: 0046 R15: 0001 [ 5962.930834] FS: 7ff4fabd9ae0() GS:88dfbe28() knlGS: [ 5962.930834] CS: 0010 DS: ES: CR0: 80050033 [ 5962.930835] CR2: 000f4cc84000 CR3: 003b93d36002 CR4: 003606e0 [ 5962.930835] DR0: DR1: DR2: [ 5962.930836] DR3: DR6: fffe0ff0 DR7: 0400 [ 5962.930836] Call Trace: [ 5962.930837] ? __switch_to_asm+0x34/0x70 [ 5962.930837] ? __switch_to_asm+0x40/0x70 [ 5962.930838] ? __switch_to_asm+0x34/0x70 [ 5962.930838] autoremove_wake_function+0x11/0x50 [ 5962.930838] __wake_up_common+0x8f/0x160 [ 5962.930839] ? __switch_to_asm+0x40/0x70 [ 5962.930839] __wake_up_common_lock+0x7c/0xc0 [ 5962.930840] pipe_write+0x24e/0x3f0 [ 5962.930840] __vfs_write+0x127/0x1b0 [ 5962.930840] vfs_write+0xb3/0x1b0 [ 5962.930841] ksys_write+0x52/0xc0 [ 5962.930841] do_syscall_64+0x5b/0x170 [ 5962.930842] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 5962.930842] RIP: 0033:0x3b5900e7b0 [ 5962.930843] Code: 97 20 00 31 d2 48 29 c2 64 89 11 48 83 c8 ff eb ea 90 90 90 90 90 90 90 90 90 83 3d f1 db 20 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e fa ff ff 48 89 04 24 [ 5962.930843] RSP: 002b:7ffedbcd93a8 EFLAGS: 0246 ORIG_RAX: 0001 [ 5962.930844] RAX: ffda RBX: 7ff4faa86e24 RCX: 003b5900e7b0 [ 5962.930845] RDX: 028f RSI: 7ff4faa9688e RDI: 000a [ 5962.930845] RBP: 7ffedbcd93c0 R08: 7ffedbcd9458 R09: 0020 [ 5962.930846] R10: R11: 0246 R12: 7ffedbcd9458 [ 5962.930847] R13: 7ff4faa9688e R14: 7ff4faa89cc8 R15: 7ff4faa86bd0 [ 5962.930847] Kernel panic - not syncing: Hard LOCKUP [ 5962.930848] CPU: 32 PID: 10333 Comm: 
oracle_10333_tp Not tainted 5.0.0-rc7core_sched #1 [ 5962.930848] Hardware name: Oracle Corporation ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016 [ 5962.930849] Call Trace: [ 5962.930849] [ 5962.930849] dump_stack+0x5c/0x7b [ 5962.930850] panic+0xfe/0x2b2 [ 5962.930850] nmi_panic+0x35/0x40 [ 5962.930851] watchdog_overflow_callback+0xef/0x100 [ 5962.930851] __perf_event_overflow+0x5a/0xe0 [ 5962.930852] handle_pmi_common+0x1d1/0x280 [ 5962.930852] ? __set_pte_vaddr+0x32/0x50 [ 5962.930852] ?
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 2/21/19 6:03 AM, Peter Zijlstra wrote: On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote: On 2/18/19 9:49 AM, Linus Torvalds wrote: On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra wrote: However; whichever way around you turn this cookie; it is expensive and nasty. Do you (or anybody else) have numbers for real loads? Because performance is all that matters. If performance is bad, then it's pointless, since just turning off SMT is the answer. Linus I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system. This is on baremetal, no virtualization. I'm thinking oracle schedules quite a bit, right? Then you get massive overhead (as shown). Out of curiosity I ran the patchset from Amazon with the same setup to see if it was any better performance-wise, but it looks equally bad. At 32 users it performed even worse and the idle time increased much more. The only good thing about it was that it was fair to both instances, as seen in the low %stdev:

Users  Baseline  %stdev  %idle  cosched      %stdev  %idle
16     1         2.9     66     0.93 (-7%)   1.1     69
24     1         11.3    53     0.87 (-13%)  11.2    61
32     1         7       41     0.66 (-34%)  5.3     54
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 2/21/19 6:03 AM, Peter Zijlstra wrote: On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote: On 2/18/19 9:49 AM, Linus Torvalds wrote: On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra wrote: However; whichever way around you turn this cookie; it is expensive and nasty. Do you (or anybody else) have numbers for real loads? Because performance is all that matters. If performance is bad, then it's pointless, since just turning off SMT is the answer. Linus I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system. This is on baremetal, no virtualization. I'm thinking oracle schedules quite a bit, right? Then you get massive overhead (as shown). Yes. In terms of idleness we have:

Users  baseline  core_sched
16     67%       70%
24     53%       59%
32     41%       49%

So there is more idleness with core sched, which is understandable as there can be forced idleness. The other part contributing to the regression is most likely overhead. The thing with virt workloads is that if they don't VMEXIT lots, they also don't schedule lots (the vCPU stays running, nested scheduler etc..). I plan to run some VM workloads. Also; like I wrote, it is quite possible there is some sibling rivalry here, which can cause excessive rescheduling. Someone would have to trace a workload and check. My older patches had a condition that would not preempt a task for a little while, such that it might make _some_ progress; these patches don't have that (yet).
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 2/18/19 9:49 AM, Linus Torvalds wrote: On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra wrote: However; whichever way around you turn this cookie; it is expensive and nasty. Do you (or anybody else) have numbers for real loads? Because performance is all that matters. If performance is bad, then it's pointless, since just turning off SMT is the answer. Linus I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system. This is on baremetal, no virtualization. In all cases I put each DB instance in a separate cpu cgroup. Following are the avg throughput numbers of the 2 instances. %stdev is the standard deviation between the 2 instances.

Baseline   = build w/o CONFIG_SCHED_CORE
core_sched = build w/  CONFIG_SCHED_CORE
HT_disable = offlined sibling HT with baseline

Users  Baseline  %stdev  core_sched       %stdev  HT_disable        %stdev
16     997768    3.28    808193 (-19%)    34      1053888 (+5.6%)   2.9
24     1157314   9.4     974555 (-15.8%)  40.5    1197904 (+3.5%)   4.6
32     1693644   6.4     1237195 (-27%)   42.8    1308180 (-22.8%)  5.3

The regressions are substantial. I also noticed one of the DB instances had much less throughput than the other with core scheduling, which brought down the avg and is also reflected in the very high %stdev. Disabling HT has an effect at 32 users but is still better than core scheduling, both in terms of avg and %stdev. There are some issues with the DB setup due to which I couldn't go beyond 32 users.
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 2/20/19 1:42 AM, Peter Zijlstra wrote: A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing in e-mail? On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote: Thanks for posting this patchset Peter. Based on the patch titled, "sched: A quick and dirty cgroup tagging interface," I believe cgroups are used to define co-scheduling groups in this implementation. Chrome OS engineers (kerr...@google.com, mpden...@google.com, and pal...@google.com) are considering an interface that is usable by unprivileged userspace apps. cgroups are a global resource that require privileged access. Have you considered an interface that is akin to namespaces? Consider the following strawperson API proposal (I understand prctl() is generally used for process specific actions, so we aren't married to using prctl()): I don't think we're anywhere near the point where I care about interfaces with this stuff. Interfaces are a trivial but tedious matter once the rest works to satisfaction. As it happens; there is actually a bug in that very cgroup patch that can cause undesired scheduling. Try spotting and fixing that. Another question is if we want to be L1TF complete (and how strict) or not, and if so, build the missing pieces (for instance we currently don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty and horrible code and missing for that reason). I remember asking Paul about this and he mentioned he has an Address Space Isolation proposal to cover this. So it seems this is out of scope of core scheduling? So first; does this provide what we need? If that's sorted we can bike-shed on uapi/abi.
Re: Gang scheduling
Hi Tim, On 10/12/18 11:01 AM, Tim Chen wrote: On 10/10/2018 05:09 PM, Subhra Mazumdar wrote: Hi, I was following the Coscheduling patch discussion on lkml and Peter mentioned he had a patch series. I found the following on github. https://github.com/pdxChen/gang/commits/sched_1.23-loadbal I would like to test this with KVMs. Are the commits from 38d5acb to f019876 sufficient? Also, is there any documentation on how to use it (any knobs I need to turn on for gang scheduling to happen?) or is it enabled by default for KVMs? Thanks, Subhra I would suggest you try https://github.com/pdxChen/gang/tree/sched_1.23-base without the load balancing part of gang scheduling. It is enabled by default for KVMs. I applied the following 3 patches on 4.19 and tried to install a KVM (with virt-install). But the kernel hangs with the following error: kernel:watchdog: BUG: soft lockup - CPU#21 stuck for 23s! [kworker/21:1:573] kvm,sched: Track VCPU threads x86/kvm,sched: Add fast path for reschedule interrupt sched: Optimize scheduler_ipi() The track VCPU patch seems to be the culprit. Thanks, Subhra Due to the constant change in gang scheduling status of the QEMU thread depending on whether the vcpu is loaded or unloaded, the load balancing part of the code doesn't work very well. The current version of the code needs to be optimized further. Right now the QEMU thread constantly does vcpu load and unload during VM enter and exit. We gang schedule only after vcpu load and register the thread to be gang scheduled. When we do vcpu unload, the thread is removed from the set to be gang scheduled. Each time there's a synchronization with the sibling thread, which is expensive. However, for QEMU, there's a one to one correspondence between the QEMU thread and the vcpu. So we don't have to change the gang scheduling status for such a thread, to avoid the churn and sync with the sibling. That should be helpful for VMs with lots of I/O causing constant VM exits. We're still working on this optimization.
And the load balancing should be better after this change. Tim
Re: [RFC PATCH v2 1/1] pipe: busy wait for pipe
On 11/5/18 2:08 AM, Mel Gorman wrote: Adding Al Viro as per get_maintainers.pl. On Tue, Sep 25, 2018 at 04:32:40PM -0700, subhra mazumdar wrote: Introduce pipe_ll_usec field for pipes that indicates the amount of micro seconds a thread should spin if pipe is empty or full before sleeping. This is similar to network sockets. Can you point to what pattern from network sockets you are duplicating exactly? One would assume it's busy_loop_current_time and busy_loop_timeout but it should be in the changelog because there are differences in polling depending on where you are in the network subsystem (which I'm not very familiar with). I was referring to the sk_busy_loop_timeout() that uses sk_ll_usec. By similar I meant having a similar mechanism for pipes to busy wait. Workloads like hackbench in pipe mode benefit significantly from this by avoiding the sleep and wakeup overhead. Other similar usecases can benefit. A tunable pipe_busy_poll is introduced to enable or disable busy waiting via /proc. The value of it specifies the amount of spin in microseconds. Default value is 0 indicating no spin. Your lead mail indicates the spin was set to a "suitable spin time". How should an administrator select this spin time? What works for hackbench might not be suitable for another workload. What if the spin time selected happens to be just slightly longer than the time it takes the reader to respond? In such a case, the result would be "all spin, no gain". While networking potentially suffers the same problem, it appears to be opt-in per socket so it's up to the application not to shoot itself in the foot. Even for network, sk_ll_usec is assigned the value of the tunable sysctl_net_busy_read in sock_init_data() for all sockets initialized by default. There is a way for per-socket setting using sock_setsockopt(); for pipes that can be added later if needed, in case different apps run in one system.
But there are cases where only one app runs (e.g. big DBs) and one tunable will suffice. It can be set to a value that is tested to be beneficial under the operating conditions.

Signed-off-by: subhra mazumdar
---
 fs/pipe.c                 | 12
 include/linux/pipe_fs_i.h |  2 ++
 kernel/sysctl.c           |  7 +++
 3 files changed, 21 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c..35d805b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -26,6 +26,7 @@
 #include
 #include
+#include
 #include "internal.h"
@@ -40,6 +41,7 @@ unsigned int pipe_max_size = 1048576;
  */
 unsigned long pipe_user_pages_hard;
 unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
+unsigned int pipe_busy_poll;
 /*
  * We use a start+len construction, which provides full use of the
@@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 void pipe_wait(struct pipe_inode_info *pipe)
 {
 	DEFINE_WAIT(wait);
+	u64 start;
 	/*
 	 * Pipes are system-local resources, so sleeping on them
@@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe)
 	 */
 	prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
 	pipe_unlock(pipe);
+	start = local_clock();
+	while (current->state != TASK_RUNNING &&
+	       ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
+		cpu_relax();
 	schedule();
 	finish_wait(&pipe->wait, &wait);
 	pipe_lock(pipe);

Networking breaks this out better in terms of options instead of hard-coding. This does not handle need_resched or signal delivery properly whereas networking does, for example. I don't disable preemption, so I don't think checking need_resched is needed. Can you point to what you mean by handling signal delivery in case of networking? Not sure what I am missing. My initial version broke it out like networking but after Peter's suggestion I clubbed it. I don't feel strongly either way.
@@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
 	struct file *files[2];
 	int fd[2];
 	int error;
+	struct pipe_inode_info *pipe;
 	error = __do_pipe_flags(fd, files, flags);
 	if (!error) {
@@ -838,6 +846,10 @@
 		fd_install(fd[0], files[0]);
 		fd_install(fd[1], files[1]);
 	}
+	pipe = files[0]->private_data;
+	pipe->pipe_ll_usec = pipe_busy_poll;
+	pipe = files[1]->private_data;
+	pipe->pipe_ll_usec = pipe_busy_poll;
 	}
 	return error;
 }

You add a pipe field but the value is always based on the sysctl, so the information is redundant (barring a race condition on one pipe write per sysctl update, which is an irrelevant corner case). In comparison, the network subsystem appears to be explicitly opt-in via setsockopt(SO_BUSY_POLL) from a glance an
Re: [PATCH 00/10] steal tasks to improve CPU utilization
On 10/22/18 7:59 AM, Steve Sistare wrote: When a CPU has no more CFS tasks to run, and idle_balance() fails to find a task, then attempt to steal a task from an overloaded CPU in the same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently identify candidates. To minimize search time, steal the first migratable task that is found when the bitmap is traversed. For fairness, search for migratable tasks on an overloaded CPU in order of next to run. This simple stealing yields a higher CPU utilization than idle_balance() alone, because the search is cheap, so it may be called every time the CPU is about to go idle. idle_balance() does more work because it searches widely for the busiest queue, so to limit its CPU consumption, it declines to search if the system is too busy. Simple stealing does not offload the globally busiest queue, but it is much better than running nothing at all. The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to reduce cache contention vs the usual bitmap when many threads concurrently set, clear, and visit elements. Is the bitmap saving much? I tried a simple stealing scheme that just starts searching the domain from the current cpu and steals a thread from the first cpu that has more than one runnable thread. It seems to perform similarly to your patch.
hackbench on X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs

                 baseline  %stdev  patch            %stdev
1  (40 tasks)    0.5524    2.36    0.5522 (0.045%)  3.82
2  (80 tasks)    0.6482    11.4    0.7241 (-11.7%)  20.34
4  (160 tasks)   0.9756    0.95    0.8276 (15.1%)   5.8
8  (320 tasks)   1.7699    1.62    1.6655 (5.9%)    1.57
16 (640 tasks)   3.1018    0.77    2.9858 (3.74%)   1.4
32 (1280 tasks)  5.565     0.62    5.3388 (4.1%)    0.72

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Oracle database OLTP, logging _enabled_

Users  %speedup
20     1.2
40     -0.41
60     0.83
80     2.37
100    1.54
120    3.0
140    2.24
160    1.82
180    1.94
200    2.23
220    1.49

Below is the patch (not in best shape):

--->8

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f808ddf..1690451 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3540,6 +3540,7 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 }
 static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
+static int my_idle_balance(struct rq *this_rq, struct rq_flags *rf);
 static inline unsigned long task_util(struct task_struct *p)
 {
@@ -6619,6 +6620,8 @@ done: __maybe_unused;
 idle:
 	new_tasks = idle_balance(rq, rf);
+	if (new_tasks == 0)
+		new_tasks = my_idle_balance(rq, rf);
 	/*
 	 * Because idle_balance() releases (and re-acquires) rq->lock, it is
@@ -8434,6 +8437,75 @@ static int should_we_balance(struct lb_env *env)
 	return balance_cpu == env->dst_cpu;
 }
+int get_best_cpu(int this_cpu, struct sched_domain *sd)
+{
+	struct rq *this_rq, *rq;
+	int i;
+	int best_cpu = -1;
+
+	this_rq = cpu_rq(this_cpu);
+	for_each_cpu_wrap(i, sched_domain_span(sd), this_cpu) {
+		if (this_rq->nr_running > 0)
+			return (-1);
+		if (i == this_cpu)
+			continue;
+		rq = cpu_rq(i);
+		if (rq->nr_running <= 1)
+			continue;
+		best_cpu = i;
+		break;
+	}
+	return (best_cpu);
+}
+
+static int my_load_balance(int this_cpu, struct rq *this_rq,
+			   struct sched_domain *sd, enum cpu_idle_type idle)
+{
+	int ld_moved = 0;
+	struct rq *busiest;
+	unsigned long flags;
+	struct task_struct *p = NULL;
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+	int best_cpu;
+
+	struct lb_env env = {
+		.sd		= sd,
+		.dst_cpu	= this_cpu,
+		.dst_rq		= this_rq,
+		.dst_grpmask	= sched_group_span(sd->groups),
+		.idle		= idle,
+		.cpus		= cpus,
+		.tasks		= LIST_HEAD_INIT(env.tasks),
+	};
+
+	if (idle == CPU_NEWLY_IDLE)
+		env.dst_grpmask = NULL;
+
+	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+
+	best_cpu = get_best_cpu(this_cpu, sd);
+	if (best_cpu >= 0)
+		busiest = cpu_rq(best_cpu);
+	else
+		goto out;
+
+	env.src_cpu = busiest->cpu;
+	env.src_rq = busiest;
+	raw_spin_lock_irqsave(&busiest->lock, flags);
+
+	p = detach_one_task(&env);
+	raw_spin_unlock(&busiest->lock);
+	if (p) {
+		attach_one_task(this_rq, p);
+		ld_moved++;
+	}
+	local_irq_restore(flags);
+
+out:
+	return ld_moved;
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is
Re: [RFC 00/60] Coscheduling for Linux
On 10/26/18 4:44 PM, Jan H. Schönherr wrote:
On 19/10/2018 02.26, Subhra Mazumdar wrote:

Hi Jan,

Hi. Sorry for the delay.

On 9/7/18 2:39 PM, Jan H. Schönherr wrote:
The collective context switch from one coscheduled set of tasks to another -- while fast -- is not atomic. If a use-case needs the absolute guarantee that all tasks of the previous set have stopped executing before any task of the next set starts executing, an additional hand-shake/barrier needs to be added.

Do you know how much the delay is? I.e., what is the overlap time when a thread of the new group starts executing on one HT while a thread of another group is still running on the other HT?

The delay is roughly equivalent to the IPI latency, if we're just talking about coscheduling at SMT level: one sibling decides to schedule another group, sends an IPI to the other sibling(s), and may already start executing a task of that other group before the IPI is received on the other end.

Can you point to where the leader is sending the IPI to the other siblings? I did some experiments and the delay seems to be sub-microsecond. I ran 2 threads that just loop in one cosched group, affinitized to the 2 HTs of a core. Then another thread in a different cosched group starts running, affinitized to the first HT of the same core. I time stamped just before context_switch() in __schedule() for the threads switching from one group to another and from one group to idle. Following is what I get on cpus 1 and 45, which are siblings; cpu 1 is where the other thread preempts:

[  403.216625] cpu:45 sub1->idle:403216624579
[  403.238623] cpu:1  sub1->sub2:403238621585
[  403.238624] cpu:45 sub1->idle:403238621787
[  403.260619] cpu:1  sub1->sub2:403260619182
[  403.260620] cpu:45 sub1->idle:403260619413
[  403.282617] cpu:1  sub1->sub2:403282617157
[  403.282618] cpu:45 sub1->idle:403282617317
..

Not sure why the first switch to idle on cpu 45 happened. But from then on the difference in timestamps is less than a microsecond.
This is just a crude way to get a sense of the delay; it may not be exact.

Thanks,
Subhra

Now, there are some things that may delay processing an IPI, but in those cases the target CPU isn't executing user code. I've yet to produce current numbers for SMT-only coscheduling. An older ballpark number I have is about 2 microseconds for the collective context switch of one hierarchy level, but take that with a grain of salt.

Regards
Jan
Re: [RFC 00/60] Coscheduling for Linux
D) What can I *not* do with this?

Besides the missing load-balancing within coscheduled task-groups, this implementation has the following properties, which might be considered short-comings.

This particular implementation focuses on SCHED_OTHER tasks managed by CFS and allows coscheduling them. Interrupts as well as tasks in higher scheduling classes are currently out-of-scope: they are assumed to be negligible interruptions as far as coscheduling is concerned, and they do *not* cause a preemption of a whole group. This implementation could be extended to cover higher scheduling classes. Interrupts, however, are an orthogonal issue.

The collective context switch from one coscheduled set of tasks to another -- while fast -- is not atomic. If a use-case needs the absolute guarantee that all tasks of the previous set have stopped executing before any task of the next set starts executing, an additional hand-shake/barrier needs to be added.

The leader doesn't kick the other cpus _immediately_ to switch to a different cosched group. So threads from the previous cosched group will keep running on the other HTs till their sched_slice is over (in the worst case). Can this still keep the window of L1TF vulnerability open?
Re: [RFC 00/60] Coscheduling for Linux
Hi Jan,

On 9/7/18 2:39 PM, Jan H. Schönherr wrote:
The collective context switch from one coscheduled set of tasks to another -- while fast -- is not atomic. If a use-case needs the absolute guarantee that all tasks of the previous set have stopped executing before any task of the next set starts executing, an additional hand-shake/barrier needs to be added.

Do you know how much the delay is? I.e., what is the overlap time when a thread of the new group starts executing on one HT while a thread of another group is still running on the other HT?

Thanks,
Subhra
Re: Gang scheduling
On 10/12/2018 11:01 AM, Tim Chen wrote:
On 10/10/2018 05:09 PM, Subhra Mazumdar wrote:

Hi,

I was following the Coscheduling patch discussion on lkml and Peter mentioned he had a patch series. I found the following on github.
https://github.com/pdxChen/gang/commits/sched_1.23-loadbal

I would like to test this with KVMs. Are the commits from 38d5acb to f019876 sufficient? Also, is there any documentation on how to use it (any knobs I need to turn on for gang scheduling to happen?) or is it enabled by default for KVMs?

Thanks,
Subhra

I would suggest you try https://github.com/pdxChen/gang/tree/sched_1.23-base without the load balancing part of gang scheduling. It is enabled by default for KVMs. Due to the constant change in gang scheduling status of the QEMU thread depending on whether the vcpu is loaded or unloaded, the load balancing part of the code doesn't work very well.

Thanks. Does this mean each vcpu thread needs to be affinitized to a cpu?

The current version of the code needs to be optimized further. Right now the QEMU thread constantly does vcpu load and unload during VM enter and exit. We gang schedule only after vcpu load and register the thread to be gang scheduled. When we do vcpu unload, the thread is removed from the set to be gang scheduled. Each time there's a synchronization with the sibling thread that's expensive.

However, for QEMU, there's a one-to-one correspondence between the QEMU thread and the vcpu. So we don't have to change the gang scheduling status for such a thread, which avoids the churn and sync with the sibling. That should be helpful for VMs with lots of I/O causing constant VM exits. We're still working on this optimization. And the load balancing should be better after this change.

Tim

Also FYI I get the following error while building sched_1.23-base:

ERROR: "sched_ttwu_pending" [arch/x86/kvm/kvm-intel.ko] undefined!
scripts/Makefile.modpost:92: recipe for target '__modpost' failed

Adding the following fixed it:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 46807dc..302b77d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -21,6 +21,7 @@
 #include

 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+EXPORT_SYMBOL_GPL(sched_ttwu_pending);

 #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
 /*
Gang scheduling
Hi,

I was following the Coscheduling patch discussion on lkml and Peter mentioned he had a patch series. I found the following on github.
https://github.com/pdxChen/gang/commits/sched_1.23-loadbal

I would like to test this with KVMs. Are the commits from 38d5acb to f019876 sufficient? Also, is there any documentation on how to use it (any knobs I need to turn on for gang scheduling to happen?) or is it enabled by default for KVMs?

Thanks,
Subhra
Re: [RFC 00/60] Coscheduling for Linux
On 09/26/2018 02:58 AM, Jan H. Schönherr wrote:
On 09/17/2018 02:25 PM, Peter Zijlstra wrote:
On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:

Assuming there is a cgroup-less solution that can prevent simultaneous execution of tasks on a core when they're not supposed to, how would you tell the scheduler which tasks these are?

Specifically for L1TF I hooked into/extended KVM's preempt_notifier registration interface, which tells us which tasks are VCPUs and to which VM they belong. But if we want to actually expose this to userspace, we can either do a prctl() or extend struct sched_attr.

Both Peter and Subhra seem to prefer an interface different than cgroups to specify what to coschedule. Can you provide some extra motivation for me, why you feel that way? (Ignoring the current scalability issues with the cpu group controller.)

After all, cgroups were designed to create arbitrary groups of tasks and to attach functionality to those groups. If we were to introduce a different interface to control that, we'd need to introduce a whole new group concept, so that you make tasks part of some group while at the same time preventing unauthorized tasks from joining a group. I currently don't see any wins, just a loss in flexibility.

Regards
Jan

I think cgroups will get the job done for any use case. But we have, e.g., affinity control via both sched_setaffinity and cgroup cpusets. It would be good to have an alternative way to specify co-scheduling too, for those who don't want to use cgroups for some reason. It can be added later on, though; only how one will override the other will need to be sorted out.
Re: [RFC 00/60] Coscheduling for Linux
On 09/24/2018 08:43 AM, Jan H. Schönherr wrote:
On 09/19/2018 11:53 PM, Subhra Mazumdar wrote:

Can we have a more generic interface, like specifying a set of task ids to be co-scheduled at a particular level, rather than tying this to cgroups? KVMs may not always run with cgroups, and there might be other use cases where we might want co-scheduling that doesn't relate to cgroups.

Currently: no. At this point the implementation is tightly coupled to the cpu cgroup controller. This *might* change if the task group optimizations mentioned in other parts of this e-mail thread are done, as I think that would decouple the various mechanisms.

That said, what if you were able to disable the "group-based fairness" aspect of the cpu cgroup controller? Then you would be able to control just the coscheduling aspects on their own. Would that satisfy the use case you have in mind?

Regards
Jan

Yes, that will suffice for the use case. We wish to experiment at some point with co-scheduling of certain worker threads in DB parallel query and see if there is any benefit.

Thanks,
Subhra
[RFC PATCH v2 1/1] pipe: busy wait for pipe
Introduce a pipe_ll_usec field for pipes that indicates the number of microseconds a thread should spin if the pipe is empty or full before sleeping. This is similar to network sockets. Workloads like hackbench in pipe mode benefit significantly from this by avoiding the sleep and wakeup overhead. Other similar use cases can benefit. A tunable pipe_busy_poll is introduced to enable or disable busy waiting via /proc. Its value specifies the amount of spin in microseconds. The default value is 0, indicating no spin.

Signed-off-by: subhra mazumdar

---
 fs/pipe.c                 | 12 ++++++++++++
 include/linux/pipe_fs_i.h |  2 ++
 kernel/sysctl.c           |  7 +++++++
 3 files changed, 21 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c..35d805b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -26,6 +26,7 @@
 #include
 #include
+#include

 #include "internal.h"

@@ -40,6 +41,7 @@ unsigned int pipe_max_size = 1048576;
  */
 unsigned long pipe_user_pages_hard;
 unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
+unsigned int pipe_busy_poll;

 /*
  * We use a start+len construction, which provides full use of the
@@ -106,6 +108,7 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 void pipe_wait(struct pipe_inode_info *pipe)
 {
 	DEFINE_WAIT(wait);
+	u64 start;

 	/*
 	 * Pipes are system-local resources, so sleeping on them
@@ -113,6 +116,10 @@ void pipe_wait(struct pipe_inode_info *pipe)
 	 */
 	prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
 	pipe_unlock(pipe);
+	start = local_clock();
+	while (current->state != TASK_RUNNING &&
+	       ((local_clock() - start) >> 10) < pipe->pipe_ll_usec)
+		cpu_relax();
 	schedule();
 	finish_wait(&pipe->wait, &wait);
 	pipe_lock(pipe);
@@ -825,6 +832,7 @@ static int do_pipe2(int __user *fildes, int flags)
 	struct file *files[2];
 	int fd[2];
 	int error;
+	struct pipe_inode_info *pipe;

 	error = __do_pipe_flags(fd, files, flags);
 	if (!error) {
@@ -838,6 +846,10 @@ static int do_pipe2(int __user *fildes, int flags)
 			fd_install(fd[0], files[0]);
 			fd_install(fd[1], files[1]);
 		}
+		pipe = files[0]->private_data;
+		pipe->pipe_ll_usec = pipe_busy_poll;
+		pipe = files[1]->private_data;
+		pipe->pipe_ll_usec = pipe_busy_poll;
 	}
 	return error;
 }
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 5a3bb3b..73267d2 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -55,6 +55,7 @@ struct pipe_inode_info {
 	unsigned int waiting_writers;
 	unsigned int r_counter;
 	unsigned int w_counter;
+	unsigned int pipe_ll_usec;
 	struct page *tmp_page;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
@@ -170,6 +171,7 @@ void pipe_double_lock(struct pipe_inode_info *, struct pipe_inode_info *);
 extern unsigned int pipe_max_size;
 extern unsigned long pipe_user_pages_hard;
 extern unsigned long pipe_user_pages_soft;
+extern unsigned int pipe_busy_poll;

 /* Drop the inode semaphore and wait for a pipe event, atomically */
 void pipe_wait(struct pipe_inode_info *pipe);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cc02050..0e9ce0c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1863,6 +1863,13 @@ static struct ctl_table fs_table[] = {
 		.proc_handler	= proc_doulongvec_minmax,
 	},
 	{
+		.procname	= "pipe-busy-poll",
+		.data		= &pipe_busy_poll,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+	},
+	{
 		.procname	= "mount-max",
 		.data		= &sysctl_mount_max,
 		.maxlen		= sizeof(unsigned int),
--
2.9.3
[RFC PATCH v2 0/1] Pipe busy wait
This patch introduces busy waiting for pipes similar to network sockets. When the pipe is full or empty, a thread busy waits for some microseconds before sleeping. This avoids the sleep and wakeup overhead and improves performance in cases where the wakeup happens very fast. It uses a new field in pipe_inode_info to decide how much to spin. As different workloads on different systems can have different optimum spin times, it is configurable via a tunable that can be set via /proc. The default value is 0, which indicates no spin.

Following are the hackbench (process mode, using pipes) and Unixbench performance numbers with baseline and a suitable spin time.

Hackbench on a 2 socket, 36 core and 72 threads Intel x86 machine (lower is better). It also shows the usr+sys time as reported by time:

groups  baseline  usr+sys   patch(spin=10us)  usr+sys
1       0.6742    17.0212   0.6842 (-1.48%)   20.2765
2       0.7794    35.1954   0.7116 (8.7%)     38.1318
4       0.9744    55.6513   0.8247 (15.36%)   50.301
8       2.0382    129.519   1.572 (22.87%)    103.722
16      5.5712    427.096   2.7989 (49.76%)   194.162
24      7.9314    566.327   3.962 (50.05%)    278.368

Unixbench on a 2 socket, 36 core and 72 threads Intel x86 machine (higher is better). It shows the pipe-based context switching test improving by 107.7% for 1 copy and by 51.3% for 72 parallel copies. This is expected, as the context switch overhead is avoided altogether by the busy wait.
72 CPUs in system; running 1 parallel copy of tests

System Benchmarks Index Values            baseline    patch(spin=10us)
Dhrystone 2 using register variables      2837.5      2842.0
Double-Precision Whetstone                753.3       753.5
Execl Throughput                          1056.8      1079.2
File Copy 1024 bufsize 2000 maxblocks     1794.0      1805.4
File Copy 256 bufsize 500 maxblocks       .4          1117.6
File Copy 4096 bufsize 8000 maxblocks     4136.7      4091.7
Pipe Throughput                           752.9       753.3
Pipe-based Context Switching              372.2       772.9
Process Creation                          840.1       847.6
Shell Scripts (1 concurrent)              1771.0      1747.5
Shell Scripts (8 concurrent)              7316.6      7308.3
System Call Overhead                      578.0       578.0
System Benchmarks Index Score             1337.8      1424.0

72 CPUs in system; running 72 parallel copies of tests

System Benchmarks Index Values            baseline    patch(spin=10us)
Dhrystone 2 using register variables      112849.7    112429.8
Double-Precision Whetstone                44971.1     44929.7
Execl Throughput                          11257.6     11258.7
File Copy 1024 bufsize 2000 maxblocks     1514.6      1471.3
File Copy 256 bufsize 500 maxblocks       919.8       917.4
File Copy 4096 bufsize 8000 maxblocks     3543.8      3355.5
Pipe Throughput                           34530.6     34492.9
Pipe-based Context Switching              11298.2     17089.9
Process Creation                          9422.5      9408.1
Shell Scripts (1 concurrent)              38764.9     38610.4
Shell Scripts (8 concurrent)              32570.5     32458.8
System Call Overhead                      2607.4      2672.7
System Benchmarks Index Score             11077.2     11393.5

Changes from v1->v2:
- Removed preemption disable in the busy wait loop
- Changed busy spin condition to TASK_RUNNING state of the current thread
- Added usr+sys time for hackbench runs as shown by the time command

subhra mazumdar (1):
  pipe: busy wait for pipe

 fs/pipe.c                 | 12 ++++++++++++
 include/linux/pipe_fs_i.h |  2 ++
 kernel/sysctl.c           |  7 +++++++
 3 files changed, 21 insertions(+)

--
2.9.3