Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
On Thu, Feb 08, 2018 at 02:19:55PM -0800, Rohit Jain wrote:
> This patch introduces the sysctl for sched_domain based migration costs.
> These in turn can be used for performance tuning of workloads.

Smells like a bad attempt to (again) revive commit:

  0437e109e184 ("sched: zap the migration init / cache-hot balancing code")

Yes, the migration cost would ideally be per domain, in practice it all
sucks because more tunables is more confusion.

And as that commit states, runtime measurements suck too, they cause
run-to-run variation which causes repeatability issues and degrade boot
times.

Static numbers suck worse, because they'll be wrong for everyone.
Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
On Fri, 2018-02-09 at 12:33 -0500, Steven Sistare wrote:
> On 2/9/2018 12:08 PM, Mike Galbraith wrote:
> >
> > Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
> > number was generated via measurement, but the end result was just as
> > bogus as a number pulled out of the ether.  How much bandwidth you have
> > when blasting data to/from wherever says nothing about misses you avoid
> > vs those you generate.
>
> Yes, yes and yes.  I cannot make the original tunable less bogus.  Using a
> smaller cost for closer caches still makes logical sense and is supported
> by the data.

You forgot to write "microscopic" before "data" :)  I'm mostly agnostic
about this, but don't like the yet more knobs that 99.99% won't touch.

	-Mike
Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
On 2/9/2018 12:08 PM, Mike Galbraith wrote:
> On Fri, 2018-02-09 at 11:10 -0500, Steven Sistare wrote:
>> On 2/8/2018 10:54 PM, Mike Galbraith wrote:
>>> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>>>> This patch introduces the sysctl for sched_domain based migration costs.
>>>> These in turn can be used for performance tuning of workloads.
>>>
>>> With this patch, we trade 1 completely bogus constant (cost is really
>>> highly variable) for 3, twiddling of which has zero effect unless you
>>> trigger a domain rebuild afterward, which is neither mentioned in the
>>> changelog, nor documented.
>>>
>>> bogo-numbers++ is kinda hard to love.
>>
>> Yup, the domain rebuild is missing.
>>
>> I am no fan of tunables, the fewer the better, but one of the several flaws
>> of the single figure for migration cost is that it ignores the very large
>> difference in cost when migrating between near vs far levels of the cache
>> hierarchy.
>> Migration between CPUs of the same core should be free, as they share L1
>> cache.
>> Rohit defined a tunable for it, but IMO it could be hard coded to 0.
>
> That cost is never really 0 in the context of load balancing, as the
> load balancing machinery is non-free.  When the idle_balance() throttle
> was added, that was done to mitigate the (at that time) quite high cost
> to high frequency cross core scheduling ala localhost communication.

I was imprecise.  The cache-loss component of cost as represented by
sched_migration_cost should be 0 in this case.  The cost of the machinery
is non-zero and remains in the code, and can still prevent migration.

>> Migration
>> between CPUs in different sockets is the most expensive and is represented by
>> the existing sysctl_sched_migration_cost tunable.  Migration between CPUs in
>> the same core cluster, or in the same socket, is somewhere in between, as
>> they share L2 or L3 cache.  We could avoid a separate tunable by setting it to
>> sysctl_sched_migration_cost / 10.
>
> Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
> number was generated via measurement, but the end result was just as
> bogus as a number pulled out of the ether.  How much bandwidth you have
> when blasting data to/from wherever says nothing about misses you avoid
> vs those you generate.

Yes, yes and yes.  I cannot make the original tunable less bogus.  Using a
smaller cost for closer caches still makes logical sense and is supported
by the data.

- Steve
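
The distinction drawn in this reply (a zero cache-loss cost does not make migration unconditional) can be illustrated with a small userspace model. This is only a sketch, assuming the per-domain cost feeds a cache-hot test that compares the time since the task last ran against that cost. The name task_is_cache_hot and its parameters are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of a cache-hot test driven by a
 * per-domain migration cost. All names are illustrative, not kernel API.
 */
static bool task_is_cache_hot(uint64_t now_ns, uint64_t exec_start_ns,
			      unsigned int domain_cost_ns)
{
	/* The caller must pass a clock that does not run backwards. */
	assert(now_ns >= exec_start_ns);

	/* A zero cost means the cache-loss component never blocks a move. */
	if (domain_cost_ns == 0)
		return false;

	/* Hot if the task last ran within the cost window. */
	return (now_ns - exec_start_ns) < domain_cost_ns;
}
```

With a cost of 0 the model never reports a task as cache-hot, so any remaining reluctance to migrate would have to come from other checks in the balancing path, which this sketch does not model.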
Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
On Fri, 2018-02-09 at 11:10 -0500, Steven Sistare wrote:
> On 2/8/2018 10:54 PM, Mike Galbraith wrote:
> > On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> >> This patch introduces the sysctl for sched_domain based migration costs.
> >> These in turn can be used for performance tuning of workloads.
> >
> > With this patch, we trade 1 completely bogus constant (cost is really
> > highly variable) for 3, twiddling of which has zero effect unless you
> > trigger a domain rebuild afterward, which is neither mentioned in the
> > changelog, nor documented.
> >
> > bogo-numbers++ is kinda hard to love.
>
> Yup, the domain rebuild is missing.
>
> I am no fan of tunables, the fewer the better, but one of the several flaws
> of the single figure for migration cost is that it ignores the very large
> difference in cost when migrating between near vs far levels of the cache
> hierarchy.
> Migration between CPUs of the same core should be free, as they share L1
> cache.
> Rohit defined a tunable for it, but IMO it could be hard coded to 0.

That cost is never really 0 in the context of load balancing, as the
load balancing machinery is non-free.  When the idle_balance() throttle
was added, that was done to mitigate the (at that time) quite high cost
to high frequency cross core scheduling ala localhost communication.

> Migration
> between CPUs in different sockets is the most expensive and is represented by
> the existing sysctl_sched_migration_cost tunable.  Migration between CPUs in
> the same core cluster, or in the same socket, is somewhere in between, as
> they share L2 or L3 cache.  We could avoid a separate tunable by setting it to
> sysctl_sched_migration_cost / 10.

Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
number was generated via measurement, but the end result was just as
bogus as a number pulled out of the ether.  How much bandwidth you have
when blasting data to/from wherever says nothing about misses you avoid
vs those you generate.

	-Mike
Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
On 2/8/2018 10:54 PM, Mike Galbraith wrote:
> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>> This patch introduces the sysctl for sched_domain based migration costs.
>> These in turn can be used for performance tuning of workloads.
>
> With this patch, we trade 1 completely bogus constant (cost is really
> highly variable) for 3, twiddling of which has zero effect unless you
> trigger a domain rebuild afterward, which is neither mentioned in the
> changelog, nor documented.
>
> bogo-numbers++ is kinda hard to love.

Yup, the domain rebuild is missing.

I am no fan of tunables, the fewer the better, but one of the several flaws
of the single figure for migration cost is that it ignores the very large
difference in cost when migrating between near vs far levels of the cache
hierarchy.

Migration between CPUs of the same core should be free, as they share L1
cache.  Rohit defined a tunable for it, but IMO it could be hard coded to 0.
Migration between CPUs in different sockets is the most expensive and is
represented by the existing sysctl_sched_migration_cost tunable.  Migration
between CPUs in the same core cluster, or in the same socket, is somewhere
in between, as they share L2 or L3 cache.  We could avoid a separate tunable
by setting it to sysctl_sched_migration_cost / 10.

- Steve
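
The single-tunable scheme proposed in this message could be sketched roughly as follows. This is a userspace illustration under stated assumptions, not code from the patch: the enum share_level and the function derived_migration_cost are hypothetical names, and base_cost stands in for sysctl_sched_migration_cost.

```c
#include <assert.h>

/* Hypothetical cache-sharing levels, nearest first. */
enum share_level {
	SHARE_CORE,	/* CPUs of the same core, sharing L1 */
	SHARE_PACKAGE,	/* same core cluster or socket, sharing L2/L3 */
	SHARE_NONE,	/* different sockets */
};

/*
 * Derive a per-level cost from one base value, following the proposal:
 * 0 for same-core, base / 10 for shared L2/L3, base across sockets.
 */
static unsigned int derived_migration_cost(enum share_level level,
					   unsigned int base_cost)
{
	switch (level) {
	case SHARE_CORE:
		return 0;
	case SHARE_PACKAGE:
		return base_cost / 10;
	case SHARE_NONE:
		return base_cost;
	}

	/* Not reached for valid enum values. */
	assert(0);
	return base_cost;
}
```

With a base of 500, the sketch yields 0, 50 and 500, which matches the three constants shown on the removed lines of the posted patch.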
Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> This patch introduces the sysctl for sched_domain based migration costs.
> These in turn can be used for performance tuning of workloads.

With this patch, we trade 1 completely bogus constant (cost is really
highly variable) for 3, twiddling of which has zero effect unless you
trigger a domain rebuild afterward, which is neither mentioned in the
changelog, nor documented.

bogo-numbers++ is kinda hard to love.

	-Mike
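
The rebuild problem raised here follows from the patch copying each tunable into the domain by value at init time. A minimal userspace model of that pattern is sketched below. The names tunable_core_cost, struct domain_model, domain_init and tunable_write are hypothetical stand-ins, and re-running domain_init only models what a domain rebuild would do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one of the new tunables. */
static unsigned int tunable_core_cost = 50;

/* Hypothetical stand-in for a sched_domain. */
struct domain_model {
	unsigned int migration_cost;	/* snapshot taken at init time */
};

/* Mirrors the pattern in the patch: the tunable is copied once, by value. */
static void domain_init(struct domain_model *sd)
{
	assert(sd != NULL);
	sd->migration_cost = tunable_core_cost;
}

/* Models a sysctl write: only the global variable changes. */
static void tunable_write(unsigned int value)
{
	tunable_core_cost = value;
}
```

In this model, calling tunable_write() after domain_init() leaves migration_cost at its old value until domain_init() runs again, which is the "zero effect unless you trigger a domain rebuild" behaviour described above.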
[RFC 2/2] Introduce sysctl(s) for the migration costs
This patch introduces the sysctl for sched_domain based migration costs.
These in turn can be used for performance tuning of workloads.

Signed-off-by: Rohit Jain
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/fair.c          |  4 +++-
 kernel/sched/topology.c      |  8 ++++----
 kernel/sysctl.c              | 14 ++++++++++++++
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 1c1a151..d597f6c 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -39,6 +39,8 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
+extern __read_mostly unsigned int sysctl_sched_core_migration_cost;
+extern __read_mostly unsigned int sysctl_sched_thread_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
 extern __read_mostly unsigned int sysctl_sched_time_avg;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61d3508..f395adc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -99,7 +99,9 @@ unsigned int sysctl_sched_child_runs_first __read_mostly;
 unsigned int sysctl_sched_wakeup_granularity		= 100UL;
 unsigned int normalized_sysctl_sched_wakeup_granularity	= 100UL;
 
-const_debug unsigned int sysctl_sched_migration_cost	= 50UL;
+const_debug unsigned int sysctl_sched_migration_cost		= 50UL;
+const_debug unsigned int sysctl_sched_core_migration_cost	= 50UL;
+const_debug unsigned int sysctl_sched_thread_migration_cost	= 0UL;
 
 #ifdef CONFIG_SMP
 /*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bcd8c64..fc147db 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1148,14 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
-		sd->sched_migration_cost = 0;
+		sd->sched_migration_cost = sysctl_sched_thread_migration_cost;
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 117;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
-		sd->sched_migration_cost = 50UL;
+		sd->sched_migration_cost = sysctl_sched_core_migration_cost;
 
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
@@ -1164,7 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->idle_idx = 2;
 
 		sd->flags |= SD_SERIALIZE;
-		sd->sched_migration_cost = 500UL;
+		sd->sched_migration_cost = sysctl_sched_migration_cost;
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
@@ -1177,7 +1177,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 		sd->idle_idx = 1;
-		sd->sched_migration_cost = 500UL;
+		sd->sched_migration_cost = sysctl_sched_migration_cost;
 	}
 
 	/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 557d467..0920795 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -356,6 +356,20 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "sched_core_migration_cost_ns",
+		.data		= &sysctl_sched_core_migration_cost,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_thread_migration_cost_ns",
+		.data		= &sysctl_sched_thread_migration_cost,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_nr_migrate",
 		.data		= &sysctl_sched_nr_migrate,
 		.maxlen		= sizeof(unsigned int),
-- 
2.7.4