Re: [RFC 2/2] Introduce sysctl(s) for the migration costs

2018-02-12 Thread Peter Zijlstra
On Thu, Feb 08, 2018 at 02:19:55PM -0800, Rohit Jain wrote:
> This patch introduces the sysctl for sched_domain based migration costs.
> These in turn can be used for performance tuning of workloads.

Smells like a bad attempt to (again) revive commit:

  0437e109e184 ("sched: zap the migration init / cache-hot balancing code")

Yes, the migration cost would ideally be per domain, but in practice it all
sucks because more tunables is more confusion. And as that commit
states, runtime measurements suck too: they cause run-to-run variation,
which causes repeatability issues, and they degrade boot times.

Static numbers suck worse, because they'll be wrong for everyone.


Re: [RFC 2/2] Introduce sysctl(s) for the migration costs

2018-02-09 Thread Mike Galbraith
On Fri, 2018-02-09 at 12:33 -0500, Steven Sistare wrote:
> On 2/9/2018 12:08 PM, Mike Galbraith wrote:
> 
> > Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
> > number was generated via measurement, but the end result was just as
> > bogus as a number pulled out of the ether.  How much bandwidth you have
> > when blasting data to/from wherever says nothing about misses you avoid
> > vs those you generate.
> 
> Yes, yes and yes. I cannot make the original tunable less bogus.  Using a
> smaller cost for closer caches still makes logical sense and is supported
> by the data.

You forgot to write "microscopic" before "data" :)  I'm mostly agnostic
about this, but don't like the yet more knobs that 99.99% won't touch.

-Mike


Re: [RFC 2/2] Introduce sysctl(s) for the migration costs

2018-02-09 Thread Steven Sistare
On 2/9/2018 12:08 PM, Mike Galbraith wrote:
> On Fri, 2018-02-09 at 11:10 -0500, Steven Sistare wrote:
>> On 2/8/2018 10:54 PM, Mike Galbraith wrote:
>>> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>>>> This patch introduces the sysctl for sched_domain based migration costs.
>>>> These in turn can be used for performance tuning of workloads.
>>>
>>> With this patch, we trade 1 completely bogus constant (cost is really
>>> highly variable) for 3, twiddling of which has zero effect unless you
>>> trigger a domain rebuild afterward, which is neither mentioned in the
>>> changelog, nor documented.
>>>
>>> bogo-numbers++ is kinda hard to love.
>>
>> Yup, the domain rebuild is missing.
>>
>> I am no fan of tunables, the fewer the better, but one of the several flaws
>> of the single figure for migration cost is that it ignores the very large
>> difference in cost when migrating between near vs far levels of the cache
>> hierarchy.
>> Migration between CPUs of the same core should be free, as they share L1 cache.
>> Rohit defined a tunable for it, but IMO it could be hard coded to 0.
> 
> That cost is never really 0 in the context of load balancing, as the
> load balancing machinery is non-free.  When the idle_balance() throttle
> was added, that was done to mitigate the (at that time) quite high cost
> to high-frequency cross-core scheduling à la localhost communication.

I was imprecise.  The cache-loss component of cost as represented by 
sched_migration_cost should be 0 in this case.  The cost of the machinery
is non-zero and remains in the code, and can still prevent migration.

>> Migration between CPUs in different sockets is the most expensive and is
>> represented by the existing sysctl_sched_migration_cost tunable.  Migration
>> between CPUs in the same core cluster, or in the same socket, is somewhere in
>> between, as they share L2 or L3 cache.  We could avoid a separate tunable by
>> setting it to sysctl_sched_migration_cost / 10.
> 
> Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
> number was generated via measurement, but the end result was just as
> bogus as a number pulled out of the ether.  How much bandwidth you have
> when blasting data to/from wherever says nothing about misses you avoid
> vs those you generate.

Yes, yes and yes. I cannot make the original tunable less bogus.  Using a
smaller cost for closer caches still makes logical sense and is supported by
the data.

- Steve



Re: [RFC 2/2] Introduce sysctl(s) for the migration costs

2018-02-09 Thread Mike Galbraith
On Fri, 2018-02-09 at 11:10 -0500, Steven Sistare wrote:
> On 2/8/2018 10:54 PM, Mike Galbraith wrote:
> > On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> >> This patch introduces the sysctl for sched_domain based migration costs.
> >> These in turn can be used for performance tuning of workloads.
> > 
> > With this patch, we trade 1 completely bogus constant (cost is really
> > highly variable) for 3, twiddling of which has zero effect unless you
> > trigger a domain rebuild afterward, which is neither mentioned in the
> > changelog, nor documented.
> > 
> > bogo-numbers++ is kinda hard to love.
> 
> Yup, the domain rebuild is missing.
> 
> I am no fan of tunables, the fewer the better, but one of the several flaws
> of the single figure for migration cost is that it ignores the very large
> difference in cost when migrating between near vs far levels of the cache 
> hierarchy.
> Migration between CPUs of the same core should be free, as they share L1 
> cache.
> Rohit defined a tunable for it, but IMO it could be hard coded to 0.

That cost is never really 0 in the context of load balancing, as the
load balancing machinery is non-free.  When the idle_balance() throttle
was added, that was done to mitigate the (at that time) quite high cost
to high-frequency cross-core scheduling à la localhost communication.

> Migration between CPUs in different sockets is the most expensive and is
> represented by the existing sysctl_sched_migration_cost tunable.  Migration
> between CPUs in the same core cluster, or in the same socket, is somewhere in
> between, as they share L2 or L3 cache.  We could avoid a separate tunable by
> setting it to sysctl_sched_migration_cost / 10.

Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
number was generated via measurement, but the end result was just as
bogus as a number pulled out of the ether.  How much bandwidth you have
when blasting data to/from wherever says nothing about misses you avoid
vs those you generate.

-Mike
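The idle_balance() throttle Mike mentions, and Steve's point that the balancing machinery keeps its own cost even when the cache-loss component is zero, can be illustrated with a simplified sketch. It is modelled loosely on kernels from around this time, where idle_balance() skips newidle balancing when the runqueue's average idle time is below sysctl_sched_migration_cost, and stops walking domains once the accumulated balancing cost plus a domain's max_newidle_lb_cost would exceed that idle time. The struct `domain_cost` and the function `idle_balance_domains()` are hypothetical simplifications, not kernel code, and charging each domain its worst-case cost (rather than the measured cost of the attempt) is an assumption made to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-domain state. */
struct domain_cost {
	uint64_t max_newidle_lb_cost;	/* worst observed cost of balancing this domain, ns */
};

/* 500000 ns (0.5 ms) mirrors the kernel's default migration cost. */
static uint64_t sysctl_sched_migration_cost = 500000ULL;

/*
 * Return how many domains (innermost first) a newly idle CPU would try to
 * balance. Zero means the throttle skipped idle balancing entirely.
 * Assumption: each attempt is charged its domain's worst-case cost.
 */
static size_t idle_balance_domains(uint64_t avg_idle, bool overloaded,
				   const struct domain_cost *sd, size_t nr_sd)
{
	uint64_t curr_cost = 0;
	size_t i;

	assert(sd != NULL || nr_sd == 0);

	/* Expected idle period too short to repay a pull, or nothing to pull. */
	if (avg_idle < sysctl_sched_migration_cost || !overloaded)
		return 0;

	for (i = 0; i < nr_sd; i++) {
		/* The machinery itself is not free: stop once it would cost
		 * more than the idle time we expect to fill. */
		if (avg_idle < curr_cost + sd[i].max_newidle_lb_cost)
			break;
		curr_cost += sd[i].max_newidle_lb_cost;
	}
	return i;
}
```

With domains costing 100, 200 and 400 µs and an average idle time of 600 µs, the sketch balances the two innermost domains and stops before the third; with an average idle time below the 500 µs default it does not balance at all.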


Re: [RFC 2/2] Introduce sysctl(s) for the migration costs

2018-02-09 Thread Steven Sistare
On 2/8/2018 10:54 PM, Mike Galbraith wrote:
> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>> This patch introduces the sysctl for sched_domain based migration costs.
>> These in turn can be used for performance tuning of workloads.
> 
> With this patch, we trade 1 completely bogus constant (cost is really
> highly variable) for 3, twiddling of which has zero effect unless you
> trigger a domain rebuild afterward, which is neither mentioned in the
> changelog, nor documented.
> 
> bogo-numbers++ is kinda hard to love.

Yup, the domain rebuild is missing.

I am no fan of tunables, the fewer the better, but one of the several flaws
of the single figure for migration cost is that it ignores the very large
difference in cost when migrating between near vs far levels of the cache 
hierarchy.
Migration between CPUs of the same core should be free, as they share L1 cache.
Rohit defined a tunable for it, but IMO it could be hard coded to 0. Migration 
between CPUs in different sockets is the most expensive and is represented by
the existing sysctl_sched_migration_cost tunable.  Migration between CPUs in
the same core cluster, or in the same socket, is somewhere in between, as
they share L2 or L3 cache.  We could avoid a separate tunable by setting it to
sysctl_sched_migration_cost / 10.

- Steve
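Steve's per-level scheme (free within a core, a fraction of the cost within a socket, the full existing cost across sockets) can be sketched as below. This is a minimal illustration under stated assumptions, not the RFC's code: the enum `migr_level`, `level_migration_cost()` and `task_is_cache_hot()` are hypothetical names; the intermediate cost is derived as sysctl_sched_migration_cost / 10, as suggested, instead of adding a separate tunable; and the cache-hot test is modelled loosely on the scheduler's task_hot(), which compares the time since a task last ran against the migration cost. The 500000 ns value mirrors the kernel's default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical topology levels, nearest to farthest. */
enum migr_level {
	MIGR_SMT,	/* hardware threads of one core, shared L1 */
	MIGR_LLC,	/* same core cluster or socket, shared L2/L3 */
	MIGR_NUMA,	/* different socket */
	MIGR_NR_LEVELS
};

/* Single global cost in ns; 500000 mirrors the kernel's default. */
static uint64_t sysctl_sched_migration_cost = 500000ULL;

/*
 * Cache-loss cost of migrating across a level, per the proposal:
 * 0 within a core, base / 10 within a socket, full base across sockets.
 */
static uint64_t level_migration_cost(enum migr_level level)
{
	assert(level < MIGR_NR_LEVELS);

	switch (level) {
	case MIGR_SMT:
		return 0;
	case MIGR_LLC:
		return sysctl_sched_migration_cost / 10;
	default:
		return sysctl_sched_migration_cost;
	}
}

/*
 * A task that ran less than level_migration_cost() ns ago is treated as
 * cache-hot, i.e. not worth migrating across that level.
 */
static bool task_is_cache_hot(uint64_t ns_since_last_ran, enum migr_level level)
{
	return ns_since_last_ran < level_migration_cost(level);
}
```

For a task that last ran 100 µs ago, the sketch treats it as cache-hot across sockets (100 µs < 500 µs) but not within a socket (100 µs is not below 50 µs), and never within a core.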


Re: [RFC 2/2] Introduce sysctl(s) for the migration costs

2018-02-08 Thread Mike Galbraith
On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> This patch introduces the sysctl for sched_domain based migration costs.
> These in turn can be used for performance tuning of workloads.

With this patch, we trade 1 completely bogus constant (cost is really
highly variable) for 3, twiddling of which has zero effect unless you
trigger a domain rebuild afterward, which is neither mentioned in the
changelog, nor documented.

bogo-numbers++ is kinda hard to love.

-Mike
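Mike's point about the missing domain rebuild can be illustrated with a minimal sketch, assuming (as his remark implies) that each sched_domain copies its migration cost when the domains are built. A later sysctl write then only changes the global value, and the per-domain copy stays stale until the domains are rebuilt. All names here (`sysctl_level_cost`, `struct sd_sketch`, `build_domains()`, `write_level_cost()`) are hypothetical stand-ins; the RFC's actual sysctl names and handlers are not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_LEVELS 3

/* Hypothetical per-level tunables, as a sysctl would expose them (ns). */
static uint64_t sysctl_level_cost[NR_LEVELS] = { 0, 50000ULL, 500000ULL };

/* Hypothetical domain: holds a private copy taken at build time. */
struct sd_sketch {
	uint64_t migration_cost;
};

static struct sd_sketch domains[NR_LEVELS];

/* Domain (re)build: the only place the tunables are read. */
static void build_domains(void)
{
	for (size_t i = 0; i < NR_LEVELS; i++)
		domains[i].migration_cost = sysctl_level_cost[i];
}

/* A sysctl write handler that only updates the global value. */
static void write_level_cost(size_t level, uint64_t ns)
{
	assert(level < NR_LEVELS);
	sysctl_level_cost[level] = ns;
	/* No rebuild is triggered here, so domains[] keeps the old value. */
}
```

Calling `write_level_cost(2, 250000)` leaves `domains[2].migration_cost` at 500000 until `build_domains()` runs again, which is the behaviour that would need to be either documented or handled by triggering a rebuild from the write handler.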

