Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 15:30 +0100, Mike Galbraith wrote: 
> On Fri, 2013-02-22 at 14:06 +0100, Ingo Molnar wrote: 

> > I think it might be better to measure the scheduling rate all 
> > the time, and save the _shortest_ cross-cpu-wakeup and 
> > same-cpu-wakeup latencies (since bootup) as a reference number. 
> > 
> > We might be able to pull this off pretty cheaply as the 
> > scheduler clock is running all the time and we have all the 
> > timestamps needed.
> 
> Yeah, that might work.  We have some quick kthreads, so saving ctx
> distance may get close enough to scheduler cost to be good enough.

Or better, shortest idle-to-idle time; that would include the current (bad)
nohz cost, and automatically shrink away when that cost shrinks.

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 14:06 +0100, Ingo Molnar wrote: 
> * Mike Galbraith  wrote:
> 
> > > > No, that's too high, you lose too much of the pretty
> > > > face. [...]
> > > 
> > > Then a logical proportion of it - such as half of it?
> > 
> > Hm.  Better would maybe be a quick boot time benchmark, and 
> > use some multiple of your cross core pipe ping-pong time?  
> > That we know is a complete waste of cycles, because almost all 
> > cycles are scheduler cycles with no other work to be done, 
> > making firing up another scheduler rather pointless.  If we're 
> > approaching that rate, we're approaching a bad idea.
> 
> Well, one problem with such dynamic boot time measurements is 
> that it introduces a certain amount of uncertainty that persists 
> for the whole lifetime of the booted up box - and it also sucks 
> in any sort of non-deterministic execution environment, such as 
> virtualized systems.

Ok, bad idea.

> I think it might be better to measure the scheduling rate all 
> the time, and save the _shortest_ cross-cpu-wakeup and 
> same-cpu-wakeup latencies (since bootup) as a reference number. 
> 
> We might be able to pull this off pretty cheaply as the 
> scheduler clock is running all the time and we have all the 
> timestamps needed.

Yeah, that might work.  We have some quick kthreads, so saving ctx
distance may get close enough to scheduler cost to be good enough.

> Pretty quickly after bootup this 'shortest latency' would settle 
> down to a very system specific (and pretty accurate) value.
> 
> [ One downside would be an increased sensitivity to the accuracy
>   and monotonicity of the scheduler clock - but that's something 
>   we want to improve on anyway - and 'worst case' we get too 
>   short latencies and we are where we are today. So it can only 
>   improve the situation IMO. ]
> 
> Would you be interested in trying to hack on an auto-tuning 
> feature like this?

Yeah, should be easy, but a rainy day has to happen so I have time to
measure, twiddle, measure, measure, tweak.. ;-)

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Ingo Molnar

* Mike Galbraith  wrote:

> > > No, that's too high, you lose too much of the pretty
> > > face. [...]
> > 
> > Then a logical proportion of it - such as half of it?
> 
> Hm.  Better would maybe be a quick boot time benchmark, and 
> use some multiple of your cross core pipe ping-pong time?  
> That we know is a complete waste of cycles, because almost all 
> cycles are scheduler cycles with no other work to be done, 
> making firing up another scheduler rather pointless.  If we're 
> > approaching that rate, we're approaching a bad idea.

Well, one problem with such dynamic boot time measurements is 
that it introduces a certain amount of uncertainty that persists 
for the whole lifetime of the booted up box - and it also sucks 
in any sort of non-deterministic execution environment, such as 
virtualized systems.

I think it might be better to measure the scheduling rate all 
the time, and save the _shortest_ cross-cpu-wakeup and 
same-cpu-wakeup latencies (since bootup) as a reference number. 

We might be able to pull this off pretty cheaply as the 
scheduler clock is running all the time and we have all the 
timestamps needed.

Pretty quickly after bootup this 'shortest latency' would settle 
down to a very system specific (and pretty accurate) value.

[ One downside would be an increased sensitivity to the accuracy
  and monotonicity of the scheduler clock - but that's something 
  we want to improve on anyway - and 'worst case' we get too 
  short latencies and we are where we are today. So it can only 
  improve the situation IMO. ]

Would you be interested in trying to hack on an auto-tuning 
feature like this?

Thanks,

Ingo
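
A minimal kernel-style sketch of the bookkeeping Ingo describes might look
like the code below. It is illustrative only and not from any tree:
record_wakeup_latency(), the queued_ns argument and the per-cpu variable
names are all made up here, and a real patch would have to decide exactly
where in the wakeup path the two timestamps get taken.

/*
 * Illustrative sketch of the auto-tuning idea: on every wakeup, compute
 * the delay from the wakeup request to the task hitting a CPU, and keep
 * the shortest same-cpu and cross-cpu values seen since boot as
 * reference numbers.  All names here are hypothetical.
 */
static DEFINE_PER_CPU(u64, min_same_cpu_wakeup_ns)  = ~0ULL;
static DEFINE_PER_CPU(u64, min_cross_cpu_wakeup_ns) = ~0ULL;

static void record_wakeup_latency(int waker_cpu, int wakee_cpu, u64 queued_ns)
{
	u64 delta = sched_clock() - queued_ns;
	u64 *min;

	if (waker_cpu == wakee_cpu)
		min = &per_cpu(min_same_cpu_wakeup_ns, wakee_cpu);
	else
		min = &per_cpu(min_cross_cpu_wakeup_ns, wakee_cpu);

	/* settles quickly to a system-specific floor after boot */
	if (delta < *min)
		*min = delta;
}

As Ingo notes, this leans on the accuracy of the scheduler clock; with
per-cpu sched_clock() drift the cross-cpu delta is only approximate, which
is exactly the sensitivity mentioned in his bracketed aside.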


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 13:11 +0100, Ingo Molnar wrote: 
> * Mike Galbraith  wrote:
> 
> > On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote: 
> > > * Mike Galbraith  wrote:
> > > 
> > > > On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
> > > > > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > > > > But that's really a benefit that's hard to estimate, especially when
> > > > > > the workload is heavy; the cost of wake_affine() calculating each se
> > > > > > one by one is very high. Is that worth it for a benefit we can't
> > > > > > promise?
> > > > > 
> > > > > Look at something like pipe-test.. wake_affine() used to 
> > > > > ensure both client/server ran on the same cpu, but then I 
> > > > > think we added select_idle_sibling() and wrecked it again :/
> > > > 
> > > > Yeah, that's the absolute worst case for 
> > > > select_idle_sibling(), 100% synchronous, absolutely nothing to 
> > > > be gained by cross cpu scheduling. Fortunately, most tasks do 
> > > > more than that, but nonetheless, select_idle_sibling() 
> > > > definitely is a two-faced little b*tch.  I'd like to see the 
> > > > evil b*tch die, but something needs to replace its pretty 
> > > > face.  One thing that you can do is simply not call it when 
> > > > the context switch rate is incredible.. its job is to recover 
> > > > overlap, if you're scheduling near your max, there's no win 
> > > > worth the cost.
> > > 
> > > Couldn't we make the cutoff dependent on sched_migration_cost? 
> > > If the wakeup comes in faster than that then don't spread.
> > 
> > No, that's too high, you lose too much of the pretty face.
> > [...]
> 
> Then a logical proportion of it - such as half of it?

Hm.  Better would maybe be a quick boot time benchmark, and use some
multiple of your cross core pipe ping-pong time?  That we know is a
complete waste of cycles, because almost all cycles are scheduler cycles
with no other work to be done, making firing up another scheduler rather
pointless.  If we're approaching that rate, we're approaching a bad idea.

-Mike
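
For reference, the cross-core pipe ping-pong number Mike is talking about is
easy to approximate from userspace; perf bench sched pipe under taskset (as
used elsewhere in this thread) is the standard tool, but a minimal
stand-alone sketch could look like the program below. The CPU numbers and
loop count are arbitrary and would need adjusting per machine.

/*
 * Rough userspace approximation of the cross-core pipe ping-pong cost:
 * pin parent and child to different CPUs and measure the round trip.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
}

int main(void)
{
	int p2c[2], c2p[2], loops = 100000;
	char buf = 0;
	struct timespec t0, t1;

	if (pipe(p2c) || pipe(c2p))
		return 1;

	if (fork() == 0) {		/* child: echo every byte back */
		pin_to_cpu(1);
		for (int i = 0; i < loops; i++) {
			if (read(p2c[0], &buf, 1) != 1 ||
			    write(c2p[1], &buf, 1) != 1)
				break;
		}
		_exit(0);
	}

	pin_to_cpu(0);			/* parent: drive the ping-pong */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < loops; i++) {
		if (write(p2c[1], &buf, 1) != 1 ||
		    read(c2p[0], &buf, 1) != 1)
			break;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	wait(NULL);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per cross-core round trip\n", ns / loops);
	return 0;
}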



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Ingo Molnar

* Mike Galbraith  wrote:

> On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote: 
> > * Mike Galbraith  wrote:
> > 
> > > On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
> > > > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > > > But that's really a benefit that's hard to estimate, especially when
> > > > > the workload is heavy; the cost of wake_affine() calculating each se
> > > > > one by one is very high. Is that worth it for a benefit we can't
> > > > > promise?
> > > > 
> > > > Look at something like pipe-test.. wake_affine() used to 
> > > > ensure both client/server ran on the same cpu, but then I 
> > > > think we added select_idle_sibling() and wrecked it again :/
> > > 
> > > Yeah, that's the absolute worst case for 
> > > select_idle_sibling(), 100% synchronous, absolutely nothing to 
> > > be gained by cross cpu scheduling. Fortunately, most tasks do 
> > > more than that, but nonetheless, select_idle_sibling() 
> > > definitely is a two-faced little b*tch.  I'd like to see the 
> > > evil b*tch die, but something needs to replace its pretty 
> > > face.  One thing that you can do is simply not call it when 
> > > the context switch rate is incredible.. its job is to recover 
> > > overlap, if you're scheduling near your max, there's no win 
> > > worth the cost.
> > 
> > Couldn't we make the cutoff dependent on sched_migration_cost? 
> > If the wakeup comes in faster than that then don't spread.
> 
> No, that's too high, you lose too much of the pretty face.
> [...]

Then a logical proportion of it - such as half of it?

Thanks,

Ingo
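
To make the proposal concrete, the cutoff being discussed would gate the
idle-sibling search on how quickly a task's wakeups arrive, compared against
sysctl_sched_migration_cost (500 usec by default at the time) or a fraction
of it. The sketch below is illustrative only; last_wakeup_ns is a
hypothetical per-task field, not an existing member of struct task_struct,
and the call site comment is only a rough paraphrase of select_task_rq_fair().

/*
 * Illustrative gate, not an actual patch: skip select_idle_sibling() when
 * this task's wakeups arrive faster than a fraction of the migration cost.
 */
static int wakeups_too_fast(struct task_struct *p, u64 now)
{
	u64 interval = now - p->last_wakeup_ns;	/* hypothetical field */

	p->last_wakeup_ns = now;

	/* "a logical proportion of it - such as half of it" */
	return interval < sysctl_sched_migration_cost / 2;
}

/*
 * In select_task_rq_fair(), roughly:
 *
 *	if (affine_sd && !wakeups_too_fast(p, sched_clock_cpu(cpu)))
 *		new_cpu = select_idle_sibling(p, prev_cpu);
 */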


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Michael Wang
On 02/22/2013 05:57 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 17:11 +0800, Michael Wang wrote:
> 
>> Ok, it does look like wake_affine() lost its value...
> 
> I'm not sure we can say that based on this one benchmark; there's a
> preemption advantage to running on a single cpu for pipe-test as well.
> We'd need to create a better benchmark to test this, one that has some
> actual data payload and control over the initial spread of the tasks or
> so.
> 
>>> Now as far as I can see there are two options: either we find there's
>>> absolutely no benefit in wake_affine() as it stands today and we simply
>>> disable/remove it, or we go fix it. What we don't do is completely
>>> wreck it at atrocious cost.
>>
>> I get your point: we should replace wake_affine() with some feature
>> which can really achieve the goal of keeping client and server on the same cpu.
>>
>> But does the logic that the waker/wakee are server/client (or the reverse)
>> still hold now? That sounds a little arbitrary to me...
> 
> Ah, it's never really been about server/client per se. It's just a
> specific example -- one that breaks down with the 1:n pgbench
> situation.
> 
> Wakeups in general can be considered a relation: suppose a
> hardware interrupt that received some data from a device issues a
> wakeup to a task to consume this data. What CPU would be better suited
> to process this data than the one where it's already cache hot?

I see. Honestly, I realized that I had underestimated the benefit we
gain from it when I saw your testing results...

We do need some better approach to replace wake_affine(), hmm... I need a
drafting board now...

Regards,
Michael Wang

> 



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote: 
> * Mike Galbraith  wrote:
> 
> > On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
> > > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > > But that's really a benefit that's hard to estimate, especially when
> > > > the workload is heavy; the cost of wake_affine() calculating each se
> > > > one by one is very high. Is that worth it for a benefit we can't
> > > > promise?
> > > 
> > > Look at something like pipe-test.. wake_affine() used to 
> > > ensure both client/server ran on the same cpu, but then I 
> > > think we added select_idle_sibling() and wrecked it again :/
> > 
> > Yeah, that's the absolute worst case for 
> > select_idle_sibling(), 100% synchronous, absolutely nothing to 
> > be gained by cross cpu scheduling. Fortunately, most tasks do 
> > more than that, but nonetheless, select_idle_sibling() 
> > definitely is a two-faced little b*tch.  I'd like to see the 
> > evil b*tch die, but something needs to replace its pretty 
> > face.  One thing that you can do is simply not call it when 
> > the context switch rate is incredible.. its job is to recover 
> > overlap, if you're scheduling near your max, there's no win 
> > worth the cost.
> 
> Couldn't we make the cutoff dependent on sched_migration_cost? 
> If the wakeup comes in faster than that then don't spread.

No, that's too high, you lose too much of the pretty face.  It's a real
problem.  On AMD, the breakeven is much higher than on Intel, it seems.
My E5620 can turn in a win on both tbench and even netperf
TCP_RR!! iff nohz is throttled.  For the Opterons I've played with, it's
a loser even at tbench's context switch rate; it needs to be cut off earlier.

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Michael Wang
On 02/22/2013 05:39 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 17:10 +0800, Michael Wang wrote:
>> On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
>>> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>>>> According to my understanding, in the old world, wake_affine() will only
>>>> be used if curr_cpu and prev_cpu share cache, which means they are in
>>>> one package; whether we search in the llc sd of curr_cpu or of prev_cpu,
>>>> we won't have the chance to spread the task out of that package.
>>>
>>> Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
>>> don't have it set, but 'small' NUMA systems will have it set over the
>>> entire domain tree.
>>
>> Oh, I missed that point...
>>
>> But I don't get the reason to make the NUMA level affine; do cpus in different
>> nodes share cache? It doesn't make sense...
> 
> On the contrary, it makes more sense: the more expensive it is to run
> 'remote', the better it is to pull 'related' tasks together.

It increases the range that bounds the task, from one node to several, but it
also increases the range of target cpus, from one node's to several's. I still
can't estimate the benefit, but I think I get the purpose: trying to
keep related tasks as close as possible, is that right?

Let me think about this point; I do believe there will be a better way to
take care of this purpose.

Regards,
Michael Wang




Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Peter Zijlstra
On Fri, 2013-02-22 at 17:11 +0800, Michael Wang wrote:

> Ok, it does look like wake_affine() lost its value...

I'm not sure we can say that based on this one benchmark; there's a
preemption advantage to running on a single cpu for pipe-test as well.
We'd need to create a better benchmark to test this, one that has some
actual data payload and control over the initial spread of the tasks or
so.

> > Now as far as I can see there are two options: either we find there's
> > absolutely no benefit in wake_affine() as it stands today and we simply
> > disable/remove it, or we go fix it. What we don't do is completely
> > wreck it at atrocious cost.
> 
> I get your point: we should replace wake_affine() with some feature
> which can really achieve the goal of keeping client and server on the same cpu.
> 
> But does the logic that the waker/wakee are server/client (or the reverse)
> still hold now? That sounds a little arbitrary to me...

Ah, it's never really been about server/client per se. It's just a
specific example -- one that breaks down with the 1:n pgbench
situation.

Wakeups in general can be considered a relation: suppose a
hardware interrupt that received some data from a device issues a
wakeup to a task to consume this data. What CPU would be better suited
to process this data than the one where it's already cache hot?



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Ingo Molnar

* Mike Galbraith  wrote:

> On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
> > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > But that's really a benefit that's hard to estimate, especially when
> > > the workload is heavy; the cost of wake_affine() calculating each se
> > > one by one is very high. Is that worth it for a benefit we can't
> > > promise?
> > 
> > Look at something like pipe-test.. wake_affine() used to 
> > ensure both client/server ran on the same cpu, but then I 
> > think we added select_idle_sibling() and wrecked it again :/
> 
> Yeah, that's the absolute worst case for 
> select_idle_sibling(), 100% synchronous, absolutely nothing to 
> be gained by cross cpu scheduling. Fortunately, most tasks do 
> more than that, but nonetheless, select_idle_sibling() 
> definitely is a two-faced little b*tch.  I'd like to see the 
> evil b*tch die, but something needs to replace its pretty 
> face.  One thing that you can do is simply not call it when 
> the context switch rate is incredible.. its job is to recover 
> overlap, if you're scheduling near your max, there's no win 
> worth the cost.

Couldn't we make the cutoff dependent on sched_migration_cost? 
If the wakeup comes in faster than that then don't spread.

Thanks,

Ingo


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > But that's really a benefit that's hard to estimate, especially when
> > the workload is heavy; the cost of wake_affine() calculating each se
> > one by one is very high. Is that worth it for a benefit we can't
> > promise?
> 
> Look at something like pipe-test.. wake_affine() used to ensure both
> client/server ran on the same cpu, but then I think we added
> select_idle_sibling() and wrecked it again :/

Yeah, that's the absolute worst case for select_idle_sibling(), 100%
synchronous, absolutely nothing to be gained by cross cpu scheduling.
Fortunately, most tasks do more than that, but nonetheless,
select_idle_sibling() definitely is a two-faced little b*tch.  I'd like
to see the evil b*tch die, but something needs to replace its pretty
face.  One thing that you can do is simply not call it when the
context switch rate is incredible.. its job is to recover overlap, if
you're scheduling near your max, there's no win worth the cost.

> $ taskset 1 perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
> 
>  Total time: 3.761 [sec]
> 
>3.761725 usecs/op
>  265835 ops/sec
> 
> $ perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
> 
>  Total time: 29.809 [sec]
> 
>   29.809720 usecs/op
>   33546 ops/sec

Gak!  Hm, are you running a kernel without the thinko fix?  It's not
good for this extreme testcase, but it doesn't suck _that_ bad ;-)

nohz isn't exactly your friend with ultra switchers either.

Q6600:
marge:~ # taskset -c 3 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

 Total time: 3.395 [sec]

   3.395357 usecs/op
 294519 ops/sec
marge:~ # perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

 Total time: 4.212 [sec]

   4.212413 usecs/op
 237393 ops/sec

E5620:
rtbox:~ # taskset -c 0 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

 Total time: 2.558 [sec]

   2.558237 usecs/op
 390894 ops/sec
rtbox:~ # perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

 Total time: 4.588 [sec]

   4.588702 usecs/op
 217926 ops/sec



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Peter Zijlstra
On Fri, 2013-02-22 at 17:10 +0800, Michael Wang wrote:
> On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
> > On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
> >> According to my understanding, in the old world, wake_affine() will only
> >> be used if curr_cpu and prev_cpu share cache, which means they are in
> >> one package; whether we search in the llc sd of curr_cpu or of prev_cpu,
> >> we won't have the chance to spread the task out of that package.
> > 
> > Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
> > don't have it set, but 'small' NUMA systems will have it set over the
> > entire domain tree.
> 
> Oh, I missed that point...
> 
> But I don't get the reason to make the NUMA level affine; do cpus in different
> nodes share cache? It doesn't make sense...

On the contrary, it makes more sense: the more expensive it is to run
'remote', the better it is to pull 'related' tasks together.





Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Michael Wang
On 02/22/2013 04:36 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>> But that's really a benefit that's hard to estimate, especially when
>> the workload is heavy; the cost of wake_affine() calculating each se
>> one by one is very high. Is that worth it for a benefit we can't
>> promise?
> 
> Look at something like pipe-test.. wake_affine() used to ensure both
> client/server ran on the same cpu, but then I think we added
> select_idle_sibling() and wrecked it again :/
> 
> $ taskset 1 perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
> 
>  Total time: 3.761 [sec]
> 
>3.761725 usecs/op
>  265835 ops/sec
> 
> $ perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
> 
>  Total time: 29.809 [sec]
> 
>   29.809720 usecs/op
>   33546 ops/sec
> 

Ok, it does look like wake_affine() lost its value...

> 
> Now as far as I can see there are two options: either we find there's
> absolutely no benefit in wake_affine() as it stands today and we simply
> disable/remove it, or we go fix it. What we don't do is completely
> wreck it at atrocious cost.

I get your point: we should replace wake_affine() with some feature
which can really achieve the goal of keeping client and server on the same cpu.

But does the logic that the waker/wakee are server/client (or the reverse)
still hold now? That sounds a little arbitrary to me...

Regards,
Michael Wang



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Michael Wang
On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>> According to my understanding, in the old world, wake_affine() will only
>> be used if curr_cpu and prev_cpu share cache, which means they are in
>> one package; whether we search in the llc sd of curr_cpu or of prev_cpu,
>> we won't have the chance to spread the task out of that package.
> 
> Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
> don't have it set, but 'small' NUMA systems will have it set over the
> entire domain tree.

Oh, I missed that point...

But I don't get the reason to make the NUMA level affine; do cpus in different
nodes share cache? It doesn't make sense...

Regards,
Michael Wang



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Peter Zijlstra
On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> But that's really a benefit that's hard to estimate, especially when
> the workload is heavy; the cost of wake_affine() calculating each se
> one by one is very high. Is that worth it for a benefit we can't
> promise?

Look at something like pipe-test.. wake_affine() used to ensure both
client/server ran on the same cpu, but then I think we added
select_idle_sibling() and wrecked it again :/

$ taskset 1 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

 Total time: 3.761 [sec]

   3.761725 usecs/op
 265835 ops/sec

$ perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

 Total time: 29.809 [sec]

  29.809720 usecs/op
  33546 ops/sec


Now as far as I can see there are two options: either we find there's
absolutely no benefit in wake_affine() as it stands today and we simply
disable/remove it, or we go fix it. What we don't do is completely
wreck it at atrocious cost.



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Michael Wang
On 02/22/2013 04:17 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 14:42 +0800, Michael Wang wrote:
> 
>> So this is trying to take care of the condition where curr_cpu (local) and
>> prev_cpu (remote) are on different nodes, in which case, in the old world,
>> wake_affine() won't be invoked, correct?
> 
> It'll be called any time this_cpu and prev_cpu aren't one and the same.
> It'd be pretty silly to ask whether to pull_here or leave_there when
> here and there are identical.

Agree :)

> 
>> Hmm... I think this may be a good additional check before entering the balance
>> path, but I can't estimate the cost of recording the relationship at
>> this moment...
> 
> It'd be pretty cheap, but I'd hate adding any cycles to the fast path
> unless those cycles have one hell of a good payoff, so the caching would
> have to show most excellent cold hard numbers (talk crazy ideas walk;).

It sounds like a good idea. I'm not sure whether it's cheap and how much
benefit we could gain, but it's worth some research.

I will think more about it after finishing the sbm work.

Regards,
Michael Wang

> 
> -Mike
> 



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 14:42 +0800, Michael Wang wrote:

> So this is trying to take care of the condition where curr_cpu (local) and
> prev_cpu (remote) are on different nodes, in which case, in the old world,
> wake_affine() won't be invoked, correct?

It'll be called any time this_cpu and prev_cpu aren't one and the same.
It'd be pretty silly to ask whether to pull_here or leave_there when
here and there are identical.

> Hmm... I think this may be a good additional check before entering the balance
> path, but I can't estimate the cost of recording the relationship at
> this moment...

It'd be pretty cheap, but I'd hate adding any cycles to the fast path
unless those cycles have one hell of a good payoff, so the caching would
have to show most excellent cold hard numbers (talk crazy ideas walk;).

-Mike
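
The "relationship" being discussed could be cached as cheaply as one field
per task: remember which node last woke the task, and only fall into the
heavier pull-here-or-leave-there evaluation when that changes. The sketch
below is purely illustrative; last_waker_node is a hypothetical field and
nothing like it exists in the code under discussion.

/*
 * Hypothetical sketch of caching the wakeup relation: one store on the
 * fast path, and the expensive decision is only re-evaluated when the
 * waking node actually changes.
 */
static inline bool wake_relation_unchanged(struct task_struct *p, int waker_cpu)
{
	int node = cpu_to_node(waker_cpu);

	if (p->last_waker_node == node)	/* hypothetical field */
		return true;

	p->last_waker_node = node;
	return false;
}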



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Peter Zijlstra
On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
> According to my understanding, in the old world, wake_affine() will only
> be used if curr_cpu and prev_cpu share cache, which means they are in
> one package; whether we search in the llc sd of curr_cpu or of prev_cpu,
> we won't have the chance to spread the task out of that package.

Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
don't have it set, but 'small' NUMA systems will have it set over the
entire domain tree.
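
For readers following along, the walk Peter points at is the domain loop at
the top of select_task_rq_fair(). Paraphrased (not verbatim) from the fair.c
of that era, it looks roughly like the fragment below: on a 'small' NUMA box
an SD_WAKE_AFFINE domain spanning both cpu and prev_cpu is found even when
the two sit on different nodes.

	for_each_domain(cpu, tmp) {
		if (!(tmp->flags & SD_LOAD_BALANCE))
			continue;

		/*
		 * If both cpu and prev_cpu are part of this domain,
		 * cpu is a valid SD_WAKE_AFFINE candidate.
		 */
		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
			affine_sd = tmp;
			break;
		}

		if (tmp->flags & sd_flag)
			sd = tmp;
	}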





Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Peter Zijlstra
On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
 According to my understanding, in the old world, wake_affine() will
 only
 be used if curr_cpu and prev_cpu share cache, which means they are in
 one package, whatever search in llc sd of curr_cpu or prev_cpu, we
 won't
 have the chance to spread the task out of that package.

Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
don't have it set, but 'small' NUMA systems will have it set over the
entire domain tree.



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 14:42 +0800, Michael Wang wrote:

 So this is trying to take care the condition when curr_cpu(local) and
 prev_cpu(remote) are on different nodes, which in the old world,
 wake_affine() won't be invoked, correct?

It'll be called any time this_cpu and prev_cpu aren't one and the same.
It'd be pretty silly to asking whether to pull_here or leave_there when
here and there are identical.

 Hmm...I think this maybe a good additional checking before enter balance
 path, but I could not estimate the cost to record the relationship at
 this moment of time...

It'd be pretty cheap, but I'd hate adding any cycles to the fast path
unless those cycles have one hell of a good payoff, so the caching would
have to show most excellent cold hard numbers (talk crazy ideas walk;).

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Michael Wang
On 02/22/2013 04:17 PM, Mike Galbraith wrote:
 On Fri, 2013-02-22 at 14:42 +0800, Michael Wang wrote:
 
 So this is trying to take care the condition when curr_cpu(local) and
 prev_cpu(remote) are on different nodes, which in the old world,
 wake_affine() won't be invoked, correct?
 
 It'll be called any time this_cpu and prev_cpu aren't one and the same.
 It'd be pretty silly to asking whether to pull_here or leave_there when
 here and there are identical.

Agree :)

 
 Hmm...I think this maybe a good additional checking before enter balance
 path, but I could not estimate the cost to record the relationship at
 this moment of time...
 
 It'd be pretty cheap, but I'd hate adding any cycles to the fast path
 unless those cycles have one hell of a good payoff, so the caching would
 have to show most excellent cold hard numbers (talk crazy ideas walk;).

It sounds like a good idea, I'm not sure whether it's cheap and how many
benefit we could gain, but it worth some research.

I will thinking more about it after finished the sbm work.

Regards,
Michael Wang

 
 -Mike
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Peter Zijlstra
On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
 But that's really some benefit hardly to be estimate, especially when
 the workload is heavy, the cost of wake_affine() is very high to
 calculated se one by one, is that worth for some benefit we could not
 promise?

Look at something like pipe-test.. wake_affine() used to ensure both
client/server ran on the same cpu, but then I think we added
select_idle_sibling() and wrecked it again :/

$ taskset 1 perf bench sched pipe
# Running sched/pipe benchmark...
# Extecuted 100 pipe operations between two tasks

 Total time: 3.761 [sec]

   3.761725 usecs/op
 265835 ops/sec

$ perf bench sched pipe
# Running sched/pipe benchmark...
# Extecuted 100 pipe operations between two tasks

 Total time: 29.809 [sec]

  29.809720 usecs/op
  33546 ops/sec


Now as far as I can see there's two options, either we find there's
absolutely no benefit in wake_affine() as it stands today and we simply
disable/remove it, or we go fix it. What we don't do is completely
wreck it at atrocious cost.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Michael Wang
On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
 On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
 According to my understanding, in the old world, wake_affine() will
 only
 be used if curr_cpu and prev_cpu share cache, which means they are in
 one package, whatever search in llc sd of curr_cpu or prev_cpu, we
 won't
 have the chance to spread the task out of that package.
 
 Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
 don't have it set, but 'small' NUMA systems will have it set over the
 entire domain tree.

Oh, I missed that point...

But I don't get the reason to make NUMA level affine, cpus in different
nodes share cache? doesn't make sense...

Regards,
Michael Wang

 
 
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Michael Wang
On 02/22/2013 04:36 PM, Peter Zijlstra wrote:
 On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
 But that's really some benefit hardly to be estimate, especially when
 the workload is heavy, the cost of wake_affine() is very high to
 calculated se one by one, is that worth for some benefit we could not
 promise?
 
 Look at something like pipe-test.. wake_affine() used to ensure both
 client/server ran on the same cpu, but then I think we added
 select_idle_sibling() and wrecked it again :/
 
 $ taskset 1 perf bench sched pipe
 # Running sched/pipe benchmark...
 # Extecuted 100 pipe operations between two tasks
 
  Total time: 3.761 [sec]
 
3.761725 usecs/op
  265835 ops/sec
 
 $ perf bench sched pipe
 # Running sched/pipe benchmark...
 # Extecuted 100 pipe operations between two tasks
 
  Total time: 29.809 [sec]
 
   29.809720 usecs/op
   33546 ops/sec
 

Ok, it do looks like wake_affine() lost it's value...

 
 Now as far as I can see there's two options, either we find there's
 absolutely no benefit in wake_affine() as it stands today and we simply
 disable/remove it, or we go fix it. What we don't do is completely
 wreck it at atrocious cost.

I get your point, we should replace wake_affine() with some feature
which could really achieve the goal to make client and server on same cpu.

But is the logical that the waker/wakee are server/client(or reversed)
still works now? that sounds a little arbitrary to me...

Regards,
Michael Wang

 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Peter Zijlstra
On Fri, 2013-02-22 at 17:10 +0800, Michael Wang wrote:
 On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
  On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
  According to my understanding, in the old world, wake_affine() will
  only
  be used if curr_cpu and prev_cpu share cache, which means they are in
  one package, whatever search in llc sd of curr_cpu or prev_cpu, we
  won't
  have the chance to spread the task out of that package.
  
  Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
  don't have it set, but 'small' NUMA systems will have it set over the
  entire domain tree.
 
 Oh, I missed that point...
 
 But I don't get the reason to make NUMA level affine, cpus in different
 nodes share cache? doesn't make sense...

Contrary, it makes more sense, the more expensive it is to run 'remote'
the better it is to pull 'related' tasks together.



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
 On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
  But that's really some benefit hardly to be estimate, especially when
  the workload is heavy, the cost of wake_affine() is very high to
  calculated se one by one, is that worth for some benefit we could not
  promise?
 
 Look at something like pipe-test.. wake_affine() used to ensure both
 client/server ran on the same cpu, but then I think we added
 select_idle_sibling() and wrecked it again :/

Yeah, that's the absolute worst case for select_idle_sibling(), 100%
synchronous, absolutely nothing to be gained by cross cpu scheduling.
Fortunately, most tasks do more than that, but nonetheless,
select_idle_sibling() definitely is a two faced little b*tch.  I'd like
to see the evil b*tch die, but something needs to replace it's pretty
face.  One thing that you can do is simply don't call it when the
context switch rate is incredible.. its job is to recover overlap, if
you're scheduling near your max, there's no win worth the cost.

 $ taskset 1 perf bench sched pipe
 # Running sched/pipe benchmark...
 # Extecuted 100 pipe operations between two tasks
 
  Total time: 3.761 [sec]
 
3.761725 usecs/op
  265835 ops/sec
 
 $ perf bench sched pipe
 # Running sched/pipe benchmark...
 # Extecuted 100 pipe operations between two tasks
 
  Total time: 29.809 [sec]
 
   29.809720 usecs/op
   33546 ops/sec

Gak!  Hm, are you running a kernel without the thinko fix?  It's not
good for this extreme testcase, but it doesn't suck _that_ bad ;-)

nohz isn't exactly your friend with ultra switchers either.

Q6600:
marge:~ # taskset -c 3 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 100 pipe operations between two tasks

 Total time: 3.395 [sec]

   3.395357 usecs/op
 294519 ops/sec
marge:~ # perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 100 pipe operations between two tasks

 Total time: 4.212 [sec]

   4.212413 usecs/op
 237393 ops/sec

E5620:
rtbox:~ # taskset -c 0 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 100 pipe operations between two tasks

 Total time: 2.558 [sec]

   2.558237 usecs/op
 390894 ops/sec
rtbox:~ # perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 100 pipe operations between two tasks

 Total time: 4.588 [sec]

   4.588702 usecs/op
 217926 ops/sec

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Ingo Molnar

* Mike Galbraith efa...@gmx.de wrote:

 On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
  On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
   But that's really some benefit hardly to be estimate, especially when
   the workload is heavy, the cost of wake_affine() is very high to
   calculated se one by one, is that worth for some benefit we could not
   promise?
  
  Look at something like pipe-test.. wake_affine() used to 
  ensure both client/server ran on the same cpu, but then I 
  think we added select_idle_sibling() and wrecked it again :/
 
 Yeah, that's the absolute worst case for 
 select_idle_sibling(), 100% synchronous, absolutely nothing to 
 be gained by cross cpu scheduling. Fortunately, most tasks do 
 more than that, but nonetheless, select_idle_sibling() 
 definitely is a two faced little b*tch.  I'd like to see the 
 evil b*tch die, but something needs to replace it's pretty 
 face.  One thing that you can do is simply don't call it when 
 the context switch rate is incredible.. its job is to recover 
 overlap, if you're scheduling near your max, there's no win 
 worth the cost.

Couldn't we make the cutoff dependent on sched_migration_cost? 
If the wakeup comes in faster than that then don't spread.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Peter Zijlstra
On Fri, 2013-02-22 at 17:11 +0800, Michael Wang wrote:

 Ok, it do looks like wake_affine() lost it's value...

I'm not sure we can say that on this one benchmark, there's a
preemption advantage to running on a single cpu for pipe-test as well.
We'd need to create a better benchmark to test this, one that has some
actual data payload and control over the initial spread of the tasks or
so.

  Now as far as I can see there's two options, either we find there's
  absolutely no benefit in wake_affine() as it stands today and we simply
  disable/remove it, or we go fix it. What we don't do is completely
  wreck it at atrocious cost.
 
 I get your point, we should replace wake_affine() with some feature
 which could really achieve the goal to make client and server on same cpu.
 
 But is the logical that the waker/wakee are server/client(or reversed)
 still works now? that sounds a little arbitrary to me...

Ah, its never really been about server/client per-se. Its just a
specific example -- one that breaks down with the 1:n pgbench
situation.

Wakeups in general can be considered to be a relation, suppose a
hardware interrupt that received some data from a device and issues a
wakeup to a task to consume this data. What CPU would be better suited
to process this data then the one where its already cache hot.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Michael Wang
On 02/22/2013 05:39 PM, Peter Zijlstra wrote:
 On Fri, 2013-02-22 at 17:10 +0800, Michael Wang wrote:
 On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
 On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
 According to my understanding, in the old world, wake_affine() will
 only
 be used if curr_cpu and prev_cpu share cache, which means they are in
 one package, whatever search in llc sd of curr_cpu or prev_cpu, we
 won't
 have the chance to spread the task out of that package.

 Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
 don't have it set, but 'small' NUMA systems will have it set over the
 entire domain tree.

 Oh, I missed that point...

 But I don't get the reason to make NUMA level affine, cpus in different
 nodes share cache? doesn't make sense...
 
 Contrary, it makes more sense, the more expensive it is to run 'remote'
 the better it is to pull 'related' tasks together.

It increase the range to bound task, from one node to several, but also
increase the range of target cpus, from one node's to several's, I still
can't estimate the benefit, but I think I get the purpose, trying to
make related tasks as close as possible, is that right?

Let me think about this point, I do believe there will be better way to
take care of this purpose.

Regards,
Michael Wang


 
 
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote: 
 * Mike Galbraith efa...@gmx.de wrote:
 
  On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
   On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
But that's really some benefit hardly to be estimate, especially when
the workload is heavy, the cost of wake_affine() is very high to
calculated se one by one, is that worth for some benefit we could not
promise?
   
   Look at something like pipe-test.. wake_affine() used to 
   ensure both client/server ran on the same cpu, but then I 
   think we added select_idle_sibling() and wrecked it again :/
  
  Yeah, that's the absolute worst case for 
  select_idle_sibling(), 100% synchronous, absolutely nothing to 
  be gained by cross cpu scheduling. Fortunately, most tasks do 
  more than that, but nonetheless, select_idle_sibling() 
  definitely is a two faced little b*tch.  I'd like to see the 
  evil b*tch die, but something needs to replace it's pretty 
  face.  One thing that you can do is simply don't call it when 
  the context switch rate is incredible.. its job is to recover 
  overlap, if you're scheduling near your max, there's no win 
  worth the cost.
 
 Couldn't we make the cutoff dependent on sched_migration_cost? 
 If the wakeup comes in faster than that then don't spread.

No, that's too high, you loose too much of the pretty face.  It's a real
problem.  On AMD, the breakeven is much higher than Intel it seems as
well.  My E5620 can turn in a win on both tbench and even netperf
TCP_RR!! iff nohz is throttled.  For the Opterons I've played with, it's
a loser at even tbench context switch rate, needs to be cut off earlier.

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Michael Wang
On 02/22/2013 05:57 PM, Peter Zijlstra wrote:
 On Fri, 2013-02-22 at 17:11 +0800, Michael Wang wrote:
 
 Ok, it do looks like wake_affine() lost it's value...
 
 I'm not sure we can say that on this one benchmark, there's a
 preemption advantage to running on a single cpu for pipe-test as well.
 We'd need to create a better benchmark to test this, one that has some
 actual data payload and control over the initial spread of the tasks or
 so.
 
 Now as far as I can see there's two options, either we find there's
 absolutely no benefit in wake_affine() as it stands today and we simply
 disable/remove it, or we go fix it. What we don't do is completely
 wreck it at atrocious cost.

 I get your point, we should replace wake_affine() with some feature
 which could really achieve the goal to make client and server on same cpu.

 But is the logical that the waker/wakee are server/client(or reversed)
 still works now? that sounds a little arbitrary to me...
 
 Ah, its never really been about server/client per-se. Its just a
 specific example -- one that breaks down with the 1:n pgbench
 situation.
 
 Wakeups in general can be considered to be a relation, suppose a
 hardware interrupt that received some data from a device and issues a
 wakeup to a task to consume this data. What CPU would be better suited
 to process this data then the one where its already cache hot.

I see, honestly, I realized that I have underestimated the benefit we
gain from it when saw your testing results...

We do need some better approach to replace wake_affine(), hmm...I need a
draft board now...

Regards,
Michael Wang

 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Ingo Molnar

* Mike Galbraith efa...@gmx.de wrote:

 On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote: 
  * Mike Galbraith efa...@gmx.de wrote:
  
   On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
 But that's really some benefit hardly to be estimate, especially when
 the workload is heavy, the cost of wake_affine() is very high to
 calculated se one by one, is that worth for some benefit we could not
 promise?

Look at something like pipe-test.. wake_affine() used to 
ensure both client/server ran on the same cpu, but then I 
think we added select_idle_sibling() and wrecked it again :/
   
   Yeah, that's the absolute worst case for 
   select_idle_sibling(), 100% synchronous, absolutely nothing to 
   be gained by cross cpu scheduling. Fortunately, most tasks do 
   more than that, but nonetheless, select_idle_sibling() 
   definitely is a two faced little b*tch.  I'd like to see the 
   evil b*tch die, but something needs to replace it's pretty 
   face.  One thing that you can do is simply don't call it when 
   the context switch rate is incredible.. its job is to recover 
   overlap, if you're scheduling near your max, there's no win 
   worth the cost.
  
  Couldn't we make the cutoff dependent on sched_migration_cost? 
  If the wakeup comes in faster than that then don't spread.
 
 No, that's too high, you loose too much of the pretty face. 
 [...]

Then a logical proportion of it - such as half of it?

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 13:11 +0100, Ingo Molnar wrote: 
 * Mike Galbraith efa...@gmx.de wrote:
 
  On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote: 
   * Mike Galbraith efa...@gmx.de wrote:
   
On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote: 
 On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
  But that's really some benefit hardly to be estimate, especially 
  when
  the workload is heavy, the cost of wake_affine() is very high to
  calculated se one by one, is that worth for some benefit we could 
  not
  promise?
 
 Look at something like pipe-test.. wake_affine() used to 
 ensure both client/server ran on the same cpu, but then I 
 think we added select_idle_sibling() and wrecked it again :/

Yeah, that's the absolute worst case for 
select_idle_sibling(), 100% synchronous, absolutely nothing to 
be gained by cross cpu scheduling. Fortunately, most tasks do 
more than that, but nonetheless, select_idle_sibling() 
definitely is a two faced little b*tch.  I'd like to see the 
evil b*tch die, but something needs to replace it's pretty 
face.  One thing that you can do is simply don't call it when 
the context switch rate is incredible.. its job is to recover 
overlap, if you're scheduling near your max, there's no win 
worth the cost.
   
   Couldn't we make the cutoff dependent on sched_migration_cost? 
   If the wakeup comes in faster than that then don't spread.
  
  No, that's too high, you loose too much of the pretty face. 
  [...]
 
 Then a logical proportion of it - such as half of it?

Hm.  Better would maybe be a quick boot time benchmark, and use some
multiple of your cross core pipe ping-pong time?  That we know is a
complete waste of cycles, because almost all cycles are scheduler cycles
with no other work to be done, making firing up another scheduler rather
pointless.  If we're approaching that rate, we're approaching bad idea.

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Ingo Molnar

* Mike Galbraith efa...@gmx.de wrote:

> > > No, that's too high, you loose too much of the pretty 
> > > face. [...]
> > 
> > Then a logical proportion of it - such as half of it?
> 
> Hm.  Better would maybe be a quick boot time benchmark, and 
> use some multiple of your cross core pipe ping-pong time?  
> That we know is a complete waste of cycles, because almost all 
> cycles are scheduler cycles with no other work to be done, 
> making firing up another scheduler rather pointless.  If we're 
> approaching that rate, we're approaching bad idea.

Well, one problem with such dynamic boot time measurements is 
that it introduces a certain amount of uncertainty that persists 
for the whole lifetime of the booted up box - and it also sucks 
in any sort of non-deterministic execution environment, such as 
virtualized systems.

I think it might be better to measure the scheduling rate all 
the time, and save the _shortest_ cross-cpu-wakeup and 
same-cpu-wakeup latencies (since bootup) as a reference number. 

We might be able to pull this off pretty cheaply as the 
scheduler clock is running all the time and we have all the 
timestamps needed.

Pretty quickly after bootup this 'shortest latency' would settle 
down to a very system specific (and pretty accurate) value.

[ One downside would be an increased sensitivity to the accuracy
  and monotonicity of the scheduler clock - but that's something 
  we want to improve on anyway - and 'worst case' we get too 
  short latencies and we are where we are today. So it can only 
  improve the situation IMO. ]
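Roughly, the bookkeeping could be as small as the sketch below; this is a
standalone model with invented names, while the real thing would sit in the
wakeup path and use the scheduler clock timestamps we already have:

#include <stdint.h>

/* Invented globals: the shortest observed same-cpu and cross-cpu wakeup
 * latencies since boot, per the idea above.  They only ever shrink, so
 * they settle quickly to a system-specific reference value. */
static uint64_t min_same_cpu_ns  = UINT64_MAX;
static uint64_t min_cross_cpu_ns = UINT64_MAX;

/*
 * Hypothetically called when a woken task first runs: 'queued_ns' is when
 * it was woken, 'now_ns' when it actually hit the cpu.
 */
static void note_wakeup_latency(uint64_t queued_ns, uint64_t now_ns,
				int waker_cpu, int wakee_cpu)
{
	uint64_t lat = now_ns - queued_ns;

	if (waker_cpu == wakee_cpu) {
		if (lat < min_same_cpu_ns)
			min_same_cpu_ns = lat;
	} else {
		if (lat < min_cross_cpu_ns)
			min_cross_cpu_ns = lat;
	}
}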

Would you be interested in trying to hack on an auto-tuning 
feature like this?

Thanks,

Ingo


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 14:06 +0100, Ingo Molnar wrote: 
> * Mike Galbraith efa...@gmx.de wrote:
> 
> > > > No, that's too high, you loose too much of the pretty 
> > > > face. [...]
> > > 
> > > Then a logical proportion of it - such as half of it?
> > 
> > Hm.  Better would maybe be a quick boot time benchmark, and 
> > use some multiple of your cross core pipe ping-pong time?  
> > That we know is a complete waste of cycles, because almost all 
> > cycles are scheduler cycles with no other work to be done, 
> > making firing up another scheduler rather pointless.  If we're 
> > approaching that rate, we're approaching bad idea.
> 
> Well, one problem with such dynamic boot time measurements is 
> that it introduces a certain amount of uncertainty that persists 
> for the whole lifetime of the booted up box - and it also sucks 
> in any sort of non-deterministic execution environment, such as 
> virtualized systems.

Ok, bad idea.

> I think it might be better to measure the scheduling rate all 
> the time, and save the _shortest_ cross-cpu-wakeup and 
> same-cpu-wakeup latencies (since bootup) as a reference number. 
> 
> We might be able to pull this off pretty cheaply as the 
> scheduler clock is running all the time and we have all the 
> timestamps needed.

Yeah, that might work.  We have some quick kthreads, so saving ctx
distance may get close enough to scheduler cost to be good enough.

> Pretty quickly after bootup this 'shortest latency' would settle 
> down to a very system specific (and pretty accurate) value.
> 
> [ One downside would be an increased sensitivity to the accuracy
>   and monotonicity of the scheduler clock - but that's something 
>   we want to improve on anyway - and 'worst case' we get too 
>   short latencies and we are where we are today. So it can only 
>   improve the situation IMO. ]
> 
> Would you be interested in trying to hack on an auto-tuning 
> feature like this?

Yeah, should be easy, but rainy day has to happen so I have time to
measure twiddle measure measure curse tweak.. ;-)

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-22 Thread Mike Galbraith
On Fri, 2013-02-22 at 15:30 +0100, Mike Galbraith wrote: 
> On Fri, 2013-02-22 at 14:06 +0100, Ingo Molnar wrote: 
> 
> > I think it might be better to measure the scheduling rate all 
> > the time, and save the _shortest_ cross-cpu-wakeup and 
> > same-cpu-wakeup latencies (since bootup) as a reference number. 
> > 
> > We might be able to pull this off pretty cheaply as the 
> > scheduler clock is running all the time and we have all the 
> > timestamps needed.
> 
> Yeah, that might work.  We have some quick kthreads, so saving ctx
> distance may get close enough to scheduler cost to be good enough.

Or better, shortest idle to idle, that would include the current (bad)
nohz cost, and automatically shrink away when that cost shrinks.

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/22/2013 01:02 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: 
>> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
>>>
>>>> But is this patch set really cause regression on your Q6600? It may
>>>> sacrificed some thing, but I still think it will benefit far more,
>>>> especially on huge systems.
>>>
>>> We spread on FORK/EXEC, and will no longer will pull communicating tasks
>>> back to a shared cache with the new logic preferring to leave wakee
>>> remote, so while no, I haven't tested (will try to find round tuit) it
>>> seems  it _must_ hurt.  Dragging data from one llc to the other on Q6600
>>> hurts a LOT.  Every time a client and server are cross llc, it's a huge
>>> hit.  The previous logic pulled communicating tasks together right when
>>> it matters the most, intermittent load... or interactive use.
>>
>> I agree that this is a problem need to be solved, but don't agree that
>> wake_affine() is the solution.
> 
> It's not perfect, but it's better than no countering force at all.  It's
> a relic of the dark ages, when affine meant L2, ie this cpu.  Now days,
> affine has a whole new meaning, L3, so it could be done differently, but
> _some_ kind of opposing force is required.
> 
>> According to my understanding, in the old world, wake_affine() will only
>> be used if curr_cpu and prev_cpu share cache, which means they are in
>> one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't
>> have the chance to spread the task out of that package.
> 
> ? affine_sd is the first domain spanning both cpus, that may be NODE.
> True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is
> set that is.  Would be nice to be able to do that without shredding
> performance.

That's right, we need two conditions in each select instance:
1. prev_cpu and curr_cpu are not affine
2. SD_WAKE_BALANCE

> 
> Off the top of my pointy head, I can think of a way to _maybe_ improve
> the "affine" wakeup criteria:  Add a small (package size? and very fast)
> FIFO queue to task struct, record waker/wakee relationship.  If
> relationship exists in that queue (rbtree), try to wake local, if not,
> wake remote.  The thought is to identify situations ala 1:N pgbench
> where you really need to keep the load spread.  That need arises when
> the sum wakees + waker won't fit in one cache.  True buddies would
> always hit (hm, hit rate), always try to become affine where they
> thrive.  1:N stuff starts missing when client count exceeds package
> size, starts expanding it's horizons. 'Course you would still need to
> NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls
> and whatnot.  With a little more smarts, we could have happy 1:N, and
> buddies don't have to chat through 2m thick walls to make 1:N scale as
> well as it can before it dies of stupidity.

So this is trying to take care of the condition where curr_cpu (local) and
prev_cpu (remote) are on different nodes, in which case, in the old world,
wake_affine() won't be invoked, correct?

Hmm... I think this may be a good additional check before entering the
balance path, but I could not estimate the cost of recording the
relationship at this moment...

Whatever the case, after applying the affine logic to the new world, it
will gain the ability to spread tasks across nodes just like the old world;
your idea may be an optimization, but that logic is outside the changes in
this patch set, which means if it benefits anything, the beneficiary will
be not only the new code but also the old.

Regards,
Michael Wang

> 
> -Mike
> 


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Fri, 2013-02-22 at 14:06 +0800, Michael Wang wrote: 
> On 02/22/2013 01:08 PM, Mike Galbraith wrote:
> > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > 
> >> According to the testing result, I could not agree this purpose of
> >> wake_affine() benefit us, but I'm sure that wake_affine() is a terrible
> >> performance killer when system is busy.
> > 
> > (hm, result is singular.. pgbench in 1:N mode only?)
> 
> I'm not sure how pgbench is implemented; all I know is it will create
> several instances and access the database, which I suppose is no different
> from several threads accessing the database (1 server and N clients?).

It's user switchable.

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Fri, 2013-02-22 at 13:26 +0800, Michael Wang wrote:

> Just confirm that I'm not on the wrong way, did the 1:N mode here means
> 1 task forked N threads, and child always talk with father?

Yes, one server, many clients.

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/22/2013 01:08 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> 
>> According to the testing result, I could not agree this purpose of
>> wake_affine() benefit us, but I'm sure that wake_affine() is a terrible
>> performance killer when system is busy.
> 
> (hm, result is singular.. pgbench in 1:N mode only?)

I'm not sure how pgbench is implemented; all I know is it will create
several instances and access the database, which I suppose is no different
from several threads accessing the database (1 server and N clients?).

There is an improvement since, when the system is busy, wake_affine() will
be skipped.

And in the old world, when the system is busy, wake_affine() will only be
skipped if prev_cpu and curr_cpu belong to different nodes.

Regards,
Michael Wang

> 


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/22/2013 01:02 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: 
>> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
>>>
>>>> But is this patch set really cause regression on your Q6600? It may
>>>> sacrificed some thing, but I still think it will benefit far more,
>>>> especially on huge systems.
>>>
>>> We spread on FORK/EXEC, and will no longer will pull communicating tasks
>>> back to a shared cache with the new logic preferring to leave wakee
>>> remote, so while no, I haven't tested (will try to find round tuit) it
>>> seems  it _must_ hurt.  Dragging data from one llc to the other on Q6600
>>> hurts a LOT.  Every time a client and server are cross llc, it's a huge
>>> hit.  The previous logic pulled communicating tasks together right when
>>> it matters the most, intermittent load... or interactive use.
>>
>> I agree that this is a problem need to be solved, but don't agree that
>> wake_affine() is the solution.
> 
> It's not perfect, but it's better than no countering force at all.  It's
> a relic of the dark ages, when affine meant L2, ie this cpu.  Now days,
> affine has a whole new meaning, L3, so it could be done differently, but
> _some_ kind of opposing force is required.
> 
>> According to my understanding, in the old world, wake_affine() will only
>> be used if curr_cpu and prev_cpu share cache, which means they are in
>> one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't
>> have the chance to spread the task out of that package.
> 
> ? affine_sd is the first domain spanning both cpus, that may be NODE.
> True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is
> set that is.  Would be nice to be able to do that without shredding
> performance.
> 
> Off the top of my pointy head, I can think of a way to _maybe_ improve
> the "affine" wakeup criteria:  Add a small (package size? and very fast)
> FIFO queue to task struct, record waker/wakee relationship.  If
> relationship exists in that queue (rbtree), try to wake local, if not,
> wake remote.  The thought is to identify situations ala 1:N pgbench
> where you really need to keep the load spread.  That need arises when
> the sum wakees + waker won't fit in one cache.  True buddies would
> always hit (hm, hit rate), always try to become affine where they
> thrive.  1:N stuff starts missing when client count exceeds package
> size, starts expanding it's horizons. 'Course you would still need to
> NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls
> and whatnot.  With a little more smarts, we could have happy 1:N, and
> buddies don't have to chat through 2m thick walls to make 1:N scale as
> well as it can before it dies of stupidity.

Just confirm that I'm not on the wrong way, did the 1:N mode here means
1 task forked N threads, and child always talk with father?

Regards,
Michael Wang

> 
> -Mike
> 


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:

> According to the testing result, I could not agree this purpose of
> wake_affine() benefit us, but I'm sure that wake_affine() is a terrible
> performance killer when system is busy.

(hm, result is singular.. pgbench in 1:N mode only?)



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: 
> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
> > On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
> > 
> >> But is this patch set really cause regression on your Q6600? It may
> >> sacrificed some thing, but I still think it will benefit far more,
> >> especially on huge systems.
> > 
> > We spread on FORK/EXEC, and will no longer will pull communicating tasks
> > back to a shared cache with the new logic preferring to leave wakee
> > remote, so while no, I haven't tested (will try to find round tuit) it
> > seems  it _must_ hurt.  Dragging data from one llc to the other on Q6600
> > hurts a LOT.  Every time a client and server are cross llc, it's a huge
> > hit.  The previous logic pulled communicating tasks together right when
> > it matters the most, intermittent load... or interactive use.
> 
> I agree that this is a problem need to be solved, but don't agree that
> wake_affine() is the solution.

It's not perfect, but it's better than no countering force at all.  It's
a relic of the dark ages, when affine meant L2, ie this cpu.  Now days,
affine has a whole new meaning, L3, so it could be done differently, but
_some_ kind of opposing force is required.

> According to my understanding, in the old world, wake_affine() will only
> be used if curr_cpu and prev_cpu share cache, which means they are in
> one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't
> have the chance to spread the task out of that package.

? affine_sd is the first domain spanning both cpus, that may be NODE.
True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is
set that is.  Would be nice to be able to do that without shredding
performance.

Off the top of my pointy head, I can think of a way to _maybe_ improve
the "affine" wakeup criteria:  Add a small (package size? and very fast)
FIFO queue to task struct, record waker/wakee relationship.  If
relationship exists in that queue (rbtree), try to wake local, if not,
wake remote.  The thought is to identify situations ala 1:N pgbench
where you really need to keep the load spread.  That need arises when
the sum wakees + waker won't fit in one cache.  True buddies would
always hit (hm, hit rate), always try to become affine where they
thrive.  1:N stuff starts missing when client count exceeds package
size, starts expanding it's horizons. 'Course you would still need to
NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls
and whatnot.  With a little more smarts, we could have happy 1:N, and
buddies don't have to chat through 2m thick walls to make 1:N scale as
well as it can before it dies of stupidity.
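Purely to make that concrete, a toy version of such a per-task "recent
wakers" record could look like the sketch below; the size and names are
invented, and none of the locking or per-cpu issues the real thing would
have to solve are handled:

#include <stdbool.h>

#define NR_RECENT_WAKERS 8		/* "package size?" - arbitrary here */

/* Hypothetical per-task record of who woke this task recently. */
struct wakee_hist {
	int waker_pid[NR_RECENT_WAKERS];	/* small FIFO of recent waker pids */
	int head;
};

/* Remember that 'waker' just woke this task. */
static void note_waker(struct wakee_hist *h, int waker)
{
	h->waker_pid[h->head] = waker;
	h->head = (h->head + 1) % NR_RECENT_WAKERS;
}

/*
 * At wakeup: a repeat waker suggests true buddies, so try to place the
 * wakee near the waker (affine).  An unknown waker suggests 1:N style
 * traffic, so leave the wakee remote and keep the load spread out.
 */
static bool prefer_affine_wakeup(const struct wakee_hist *h, int waker)
{
	for (int i = 0; i < NR_RECENT_WAKERS; i++)
		if (h->waker_pid[i] == waker)
			return true;
	return false;
}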

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/21/2013 06:20 PM, Peter Zijlstra wrote:
> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
>> The old logical when locate affine_sd is:
>>
>> if prev_cpu != curr_cpu
>>         if wake_affine()
>>                 prev_cpu = curr_cpu
>> new_cpu = select_idle_sibling(prev_cpu)
>> return new_cpu
>>
>> The new logical is same to the old one if prev_cpu == curr_cpu, so
>> let's
>> simplify the old logical like:
>>
>> if wake_affine()
>>         new_cpu = select_idle_sibling(curr_cpu)
>> else
>>         new_cpu = select_idle_sibling(prev_cpu)
>>
>> return new_cpu
>>
>> Actually that doesn't make sense.
> 
> It does :-)
> 
>> I think wake_affine() is trying to check whether move a task from
>> prev_cpu to curr_cpu will break the balance in affine_sd or not, but
>> why
>> won't break balance means curr_cpu is better than prev_cpu for
>> searching
>> the idle cpu?
> 
> It doesn't, the whole affine wakeup stuff is meant to pull waking tasks
> towards the cpu that does the wakeup, we limit this by putting bounds on
> the imbalance this may create.
> 
> The reason we want to run tasks on the cpu that does the wakeup is
> because that cpu 'obviously' is running something related and it seems
> like a good idea to run related tasks close together.
> 
> So look at affine wakeups as a force that groups related tasks.

That's right, and it's one point I've missed when judging the
wake_affine()...

But that's really a benefit that is hard to estimate, and especially when
the workload is heavy the cost of wake_affine() calculating each se one by
one is very high; is that worth it for a benefit we cannot promise?

According to the testing result, I could not agree this purpose of
wake_affine() benefit us, but I'm sure that wake_affine() is a terrible
performance killer when system is busy.

> 
>> So the new logical in this patch set is:
>>
>> new_cpu = select_idle_sibling(prev_cpu)
>> if idle_cpu(new_cpu)
>>         return new_cpu
>>
>> new_cpu = select_idle_sibling(curr_cpu)
>> if idle_cpu(new_cpu) {
>>         if wake_affine()
>>                 return new_cpu
>> }
>>
>> return prev_cpu
>>
>> And now, unless we are really going to move load from prev_cpu to
>> curr_cpu, we won't use wake_affine() any more.
> 
> That completely breaks stuff, not cool.

Could you please give more details on what's the point you think is bad?

Regards,
Michael Wang

> 


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/21/2013 05:43 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
> 
>> But is this patch set really cause regression on your Q6600? It may
>> sacrificed some thing, but I still think it will benefit far more,
>> especially on huge systems.
> 
> We spread on FORK/EXEC, and will no longer will pull communicating tasks
> back to a shared cache with the new logic preferring to leave wakee
> remote, so while no, I haven't tested (will try to find round tuit) it
> seems  it _must_ hurt.  Dragging data from one llc to the other on Q6600
> hurts a LOT.  Every time a client and server are cross llc, it's a huge
> hit.  The previous logic pulled communicating tasks together right when
> it matters the most, intermittent load... or interactive use.

I agree that this is a problem need to be solved, but don't agree that
wake_affine() is the solution.

According to my understanding, in the old world, wake_affine() will only
be used if curr_cpu and prev_cpu share cache, which means they are in
one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't
have the chance to spread the task out of that package.

I'm going to recover the logical that only do select_idle_sibling() when
prev_cpu and curr_cpu are affine, so now the new logical will only
prefer leaving task in old package if both prev_cpu and curr_cpu are in
that package, I think this could solve the problem, isn't it?
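If I read that right, the resulting flow would roughly be the sketch below;
these are simplified stand-in signatures, not a patch, and only
cpus_share_cache() is meant as the existing "are affine" test, while the
other helpers are placeholders for the current code paths:

/* Sketch only: simplified stand-ins for the real helpers, declared here
 * just so the proposed flow is concrete. */
extern int cpus_share_cache(int this_cpu, int that_cpu);
extern int idle_cpu(int cpu);
extern int select_idle_sibling_near(int cpu);			/* hypothetical */
extern int wake_affine_allows(int prev_cpu, int curr_cpu);	/* hypothetical */
extern int wake_balance_path(int prev_cpu, int curr_cpu);	/* hypothetical */

static int pick_wakeup_cpu(int prev_cpu, int curr_cpu)
{
	int new_cpu;

	/* Different packages: keep the old behaviour, go the balance path. */
	if (!cpus_share_cache(prev_cpu, curr_cpu))
		return wake_balance_path(prev_cpu, curr_cpu);

	/* Same package: prefer an idle cpu close to where the task last ran. */
	new_cpu = select_idle_sibling_near(prev_cpu);
	if (idle_cpu(new_cpu))
		return new_cpu;

	new_cpu = select_idle_sibling_near(curr_cpu);
	if (idle_cpu(new_cpu) && wake_affine_allows(prev_cpu, curr_cpu))
		return new_cpu;

	return prev_cpu;
}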

Regards,
Michael Wang



> 
> -Mike
> 


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Peter Zijlstra
On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
> The old logical when locate affine_sd is:
> 
> if prev_cpu != curr_cpu
>         if wake_affine()
>                 prev_cpu = curr_cpu
> new_cpu = select_idle_sibling(prev_cpu)
> return new_cpu
> 
> The new logical is same to the old one if prev_cpu == curr_cpu, so
> let's
> simplify the old logical like:
> 
> if wake_affine()
>         new_cpu = select_idle_sibling(curr_cpu)
> else
>         new_cpu = select_idle_sibling(prev_cpu)
> 
> return new_cpu
> 
> Actually that doesn't make sense.

It does :-)

> I think wake_affine() is trying to check whether move a task from
> prev_cpu to curr_cpu will break the balance in affine_sd or not, but
> why
> won't break balance means curr_cpu is better than prev_cpu for
> searching
> the idle cpu?

It doesn't, the whole affine wakeup stuff is meant to pull waking tasks
towards the cpu that does the wakeup, we limit this by putting bounds on
the imbalance this may create.

The reason we want to run tasks on the cpu that does the wakeup is
because that cpu 'obviously' is running something related and it seems
like a good idea to run related tasks close together.

So look at affine wakeups as a force that groups related tasks.
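For illustration only, that bounded pull can be pictured with a
stripped-down model like the one below; it is nothing like the real
effective_load() math, just the shape of the check, with invented names and
an example 25% slack:

/*
 * Toy model of the affine-wakeup bound: pull the wakee to the waking cpu
 * only if the waking cpu, with the task added, would not be clearly
 * busier than the previous cpu with the task removed.
 */
static int pull_to_waker(unsigned long waker_load, unsigned long prev_load,
			 unsigned long task_load, unsigned int imbalance_pct)
{
	/* imbalance_pct = 125 tolerates the waking side being 25% heavier. */
	unsigned long this_eff = waker_load + task_load;
	unsigned long prev_eff = prev_load > task_load ? prev_load - task_load : 0;

	return this_eff * 100 <= prev_eff * imbalance_pct;
}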

> So the new logical in this patch set is:
> 
> new_cpu = select_idle_sibling(prev_cpu)
> if idle_cpu(new_cpu)
>         return new_cpu
> 
> new_cpu = select_idle_sibling(curr_cpu)
> if idle_cpu(new_cpu) {
>         if wake_affine()
>                 return new_cpu
> }
> 
> return prev_cpu
> 
> And now, unless we are really going to move load from prev_cpu to
> curr_cpu, we won't use wake_affine() any more.

That completely breaks stuff, not cool.



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:

> But is this patch set really cause regression on your Q6600? It may
> sacrificed some thing, but I still think it will benefit far more,
> especially on huge systems.

We spread on FORK/EXEC, and will no longer will pull communicating tasks
back to a shared cache with the new logic preferring to leave wakee
remote, so while no, I haven't tested (will try to find round tuit) it
seems  it _must_ hurt.  Dragging data from one llc to the other on Q6600
hurts a LOT.  Every time a client and server are cross llc, it's a huge
hit.  The previous logic pulled communicating tasks together right when
it matters the most, intermittent load... or interactive use.

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/21/2013 04:10 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: 
>> On 02/21/2013 02:11 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
>>>> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
>>>> [snip]
>> [snip]
>>>>
>>>>    if wake_affine()
>>>>            new_cpu = select_idle_sibling(curr_cpu)
>>>>    else
>>>>            new_cpu = select_idle_sibling(prev_cpu)
>>>>
>>>>    return new_cpu
>>>>
>>>> Actually that doesn't make sense.
>>>>
>>>> I think wake_affine() is trying to check whether move a task from
>>>> prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
>>>> won't break balance means curr_cpu is better than prev_cpu for searching
>>>> the idle cpu?
>>>
>>> You could argue that it's impossible to break balance by moving any task
>>> to any idle cpu, but that would mean bouncing tasks cross node on every
>>> wakeup is fine, which it isn't.
>>
>> I don't get it... could you please give me more detail on how
>> wake_affine() related with bouncing?
> 
> If we didn't ever ask if it's ok, we'd always pull, and stack load up on
> one package if there's the tiniest of holes to stuff a task into,
> periodic balance forcibly rips buddies back apart, repeat.  At least
> with wake_affine() in the loop, there's somewhat of a damper. 

Oh, I think I got the reason why old logical check the affine_sd
firstly, so when prev_cpu and curr_cpu belong to different package,
there will be a chance to enter balance path, that seems like a solution
for this problem, amazing ;-)

You are right, with out this logical, chances to balance load between
packages will missed, I will apply it in next version.

Regards,
Michael Wang

> 
>>>> So the new logical in this patch set is:
>>>>
>>>>    new_cpu = select_idle_sibling(prev_cpu)
>>>>    if idle_cpu(new_cpu)
>>>>            return new_cpu
>>>
>>> So you tilted the scales in favor of leaving tasks in their current
>>> package, which should benefit large footprint tasks, but should also
>>> penalize light communicating tasks.
>>
>> Yes, I'd prefer to wakeup the task on a cpu which:
>> 1. idle
>> 2. close to prev_cpu
>>
>> So if both curr_cpu and prev_cpu have idle cpu in their topology, which
>> one is better? that depends on how task benefit from cache and the
>> balance situation, whatever, I don't think the benefit worth the high
>> cost of wake_affine() in most cases...
> 
> We've always used wake_affine() before, yet been able to schedule at
> high frequency, so I don't see that it can be _that_ expensive.  I
> haven't actually measured lately (lng time) though.
> 
> WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
> some will gain, some will lose.  On a modern single processor box, it
> just doesn't matter, there's only one llc (two s_i_s() calls = oopsie),
> but on my beloved old Q6600 or a big box, it'll matter a lot to
> something.  NUMA balance will deal with big boxen, my trusty old Q6600
> will likely get all upset with some localhost stuff.
> 
> -Mike
> 


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/21/2013 04:10 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: 
>> On 02/21/2013 02:11 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
>>>> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
>>>> [snip]
>> [snip]
>>>>
>>>>    if wake_affine()
>>>>            new_cpu = select_idle_sibling(curr_cpu)
>>>>    else
>>>>            new_cpu = select_idle_sibling(prev_cpu)
>>>>
>>>>    return new_cpu
>>>>
>>>> Actually that doesn't make sense.
>>>>
>>>> I think wake_affine() is trying to check whether move a task from
>>>> prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
>>>> won't break balance means curr_cpu is better than prev_cpu for searching
>>>> the idle cpu?
>>>
>>> You could argue that it's impossible to break balance by moving any task
>>> to any idle cpu, but that would mean bouncing tasks cross node on every
>>> wakeup is fine, which it isn't.
>>
>> I don't get it... could you please give me more detail on how
>> wake_affine() related with bouncing?
> 
> If we didn't ever ask if it's ok, we'd always pull, and stack load up on
> one package if there's the tiniest of holes to stuff a task into,
> periodic balance forcibly rips buddies back apart, repeat.  At least
> with wake_affine() in the loop, there's somewhat of a damper. 

I think I got your point: it is a question about the possibility of
locating an idle cpu which belongs to a different package from prev_cpu,
and how wake_affine() helps with that.

The old logic requires prev_cpu and curr_cpu to be affine, which means they
share caches on some level, usually meaning they are in the same package,
and select_idle_sibling() starts its search from the top cache-sharing
level, which usually means it searches all the cpus in one package; so,
usually, both cases will only locate an idle cpu in their own package.

I could not figure out in which case wake_affine() could benefit a lot, but
it does harm a lot according to the testing results.

> 
>>>> So the new logical in this patch set is:
>>>>
>>>>    new_cpu = select_idle_sibling(prev_cpu)
>>>>    if idle_cpu(new_cpu)
>>>>            return new_cpu
>>>
>>> So you tilted the scales in favor of leaving tasks in their current
>>> package, which should benefit large footprint tasks, but should also
>>> penalize light communicating tasks.
>>
>> Yes, I'd prefer to wakeup the task on a cpu which:
>> 1. idle
>> 2. close to prev_cpu
>>
>> So if both curr_cpu and prev_cpu have idle cpu in their topology, which
>> one is better? that depends on how task benefit from cache and the
>> balance situation, whatever, I don't think the benefit worth the high
>> cost of wake_affine() in most cases...
> 
> We've always used wake_affine() before, yet been able to schedule at
> high frequency, so I don't see that it can be _that_ expensive.  I
> haven't actually measured lately (lng time) though.
> 
> WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
> some will gain, some will lose.  On a modern single processor box, it
> just doesn't matter, there's only one llc (two s_i_s() calls = oopsie),
> but on my beloved old Q6600 or a big box, it'll matter a lot to
> something.  NUMA balance will deal with big boxen, my trusty old Q6600
> will likely get all upset with some localhost stuff.

But is this patch set really cause regression on your Q6600? It may
sacrificed some thing, but I still think it will benefit far more,
especially on huge systems.

Regards,
Michael Wang


> 
> -Mike
> 


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: 
> On 02/21/2013 02:11 PM, Mike Galbraith wrote:
> > On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
> >> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
> >> [snip]
> [snip]
> >>
> >>    if wake_affine()
> >>            new_cpu = select_idle_sibling(curr_cpu)
> >>    else
> >>            new_cpu = select_idle_sibling(prev_cpu)
> >>
> >>    return new_cpu
> >>
> >> Actually that doesn't make sense.
> >>
> >> I think wake_affine() is trying to check whether move a task from
> >> prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
> >> won't break balance means curr_cpu is better than prev_cpu for searching
> >> the idle cpu?
> > 
> > You could argue that it's impossible to break balance by moving any task
> > to any idle cpu, but that would mean bouncing tasks cross node on every
> > wakeup is fine, which it isn't.
> 
> I don't get it... could you please give me more detail on how
> wake_affine() related with bouncing?

If we didn't ever ask if it's ok, we'd always pull, and stack load up on
one package if there's the tiniest of holes to stuff a task into,
periodic balance forcibly rips buddies back apart, repeat.  At least
with wake_affine() in the loop, there's somewhat of a damper. 

> >> So the new logical in this patch set is:
> >>
> >>    new_cpu = select_idle_sibling(prev_cpu)
> >>    if idle_cpu(new_cpu)
> >>            return new_cpu
> > 
> > So you tilted the scales in favor of leaving tasks in their current
> > package, which should benefit large footprint tasks, but should also
> > penalize light communicating tasks.
> 
> Yes, I'd prefer to wakeup the task on a cpu which:
> 1. idle
> 2. close to prev_cpu
> 
> So if both curr_cpu and prev_cpu have idle cpu in their topology, which
> one is better? that depends on how task benefit from cache and the
> balance situation, whatever, I don't think the benefit worth the high
> cost of wake_affine() in most cases...

We've always used wake_affine() before, yet been able to schedule at
high frequency, so I don't see that it can be _that_ expensive.  I
haven't actually measured lately (lng time) though.

WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
some will gain, some will lose.  On a modern single processor box, it
just doesn't matter, there's only one llc (two s_i_s() calls = oopsie),
but on my beloved old Q6600 or a big box, it'll matter a lot to
something.  NUMA balance will deal with big boxen, my trusty old Q6600
will likely get all upset with some localhost stuff.

-Mike



Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: 
 On 02/21/2013 02:11 PM, Mike Galbraith wrote:
  On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
  On 02/20/2013 06:49 PM, Ingo Molnar wrote:
  [snip]
 [snip]
 
 if wake_affine()
 new_cpu = select_idle_sibling(curr_cpu)
 else
 new_cpu = select_idle_sibling(prev_cpu)
 
 return new_cpu
 
  Actually that doesn't make sense.
 
  I think wake_affine() is trying to check whether move a task from
  prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
  won't break balance means curr_cpu is better than prev_cpu for searching
  the idle cpu?
  
  You could argue that it's impossible to break balance by moving any task
  to any idle cpu, but that would mean bouncing tasks cross node on every
  wakeup is fine, which it isn't.
 
 I don't get it... could you please give me more detail on how
 wake_affine() related with bouncing?

If we didn't ever ask if it's ok, we'd always pull, and stack load up on
one package if there's the tiniest of holes to stuff a task into,
periodic balance forcibly rips buddies back apart, repeat.  At least
with wake_affine() in the loop, there's somewhat of a damper. 

  So the new logical in this patch set is:
 
 new_cpu = select_idle_sibling(prev_cpu)
 if idle_cpu(new_cpu)
 return new_cpu
  
  So you tilted the scales in favor of leaving tasks in their current
  package, which should benefit large footprint tasks, but should also
  penalize light communicating tasks.
 
 Yes, I'd prefer to wakeup the task on a cpu which:
 1. idle
 2. close to prev_cpu
 
 So if both curr_cpu and prev_cpu have idle cpu in their topology, which
 one is better? that depends on how task benefit from cache and the
 balance situation, whatever, I don't think the benefit worth the high
 cost of wake_affine() in most cases...

We've always used wake_affine() before, yet been able to schedule at
high frequency, so I don't see that it can be _that_ expensive.  I
haven't actually measured lately (lng time) though.

WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
some will gain, some will lose.  On a modern single processor box, it
just doesn't matter, there's only one llc (two s_i_s() calls = oopsie),
but on my beloved old Q6600 or a big box, it'll matter a lot to
something.  NUMA balance will deal with big boxen, my trusty old Q6600
will likely get all upset with some localhost stuff.

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/21/2013 04:10 PM, Mike Galbraith wrote:
 On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: 
 On 02/21/2013 02:11 PM, Mike Galbraith wrote:
 On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
 On 02/20/2013 06:49 PM, Ingo Molnar wrote:
 [snip]
 [snip]

if wake_affine()
new_cpu = select_idle_sibling(curr_cpu)
else
new_cpu = select_idle_sibling(prev_cpu)

return new_cpu

 Actually that doesn't make sense.

 I think wake_affine() is trying to check whether move a task from
 prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
 won't break balance means curr_cpu is better than prev_cpu for searching
 the idle cpu?

 You could argue that it's impossible to break balance by moving any task
 to any idle cpu, but that would mean bouncing tasks cross node on every
 wakeup is fine, which it isn't.

 I don't get it... could you please give me more detail on how
 wake_affine() related with bouncing?
 
 If we didn't ever ask if it's ok, we'd always pull, and stack load up on
 one package if there's the tiniest of holes to stuff a task into,
 periodic balance forcibly rips buddies back apart, repeat.  At least
 with wake_affine() in the loop, there's somewhat of a damper. 

I think I got your point, a question about the possibility to locate an
idle cpu which belong to different package from prev_cpu, and how
wake_affine() help on it.

Old logical require prev_cpu and curr_cpu to be affine, which means they
share caches on some level, usually means they are in the same package,
and the select_idle_sibling() start search from the top cache share
level, usually means search all the smp in one package, so, usually,
both of the cases will only locate idle cpu in their package.

I could not figure out what's the case that the wake_affine() could
benefit a lot, but it do harm a lot according to the testing results.

 
 So the new logical in this patch set is:

new_cpu = select_idle_sibling(prev_cpu)
if idle_cpu(new_cpu)
return new_cpu

 So you tilted the scales in favor of leaving tasks in their current
 package, which should benefit large footprint tasks, but should also
 penalize light communicating tasks.

 Yes, I'd prefer to wakeup the task on a cpu which:
 1. idle
 2. close to prev_cpu

 So if both curr_cpu and prev_cpu have idle cpu in their topology, which
 one is better? that depends on how task benefit from cache and the
 balance situation, whatever, I don't think the benefit worth the high
 cost of wake_affine() in most cases...
 
 We've always used wake_affine() before, yet been able to schedule at
 high frequency, so I don't see that it can be _that_ expensive.  I
 haven't actually measured lately (lng time) though.
 
 WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
 some will gain, some will lose.  On a modern single processor box, it
 just doesn't matter, there's only one llc (two s_i_s() calls = oopsie),
 but on my beloved old Q6600 or a big box, it'll matter a lot to
 something.  NUMA balance will deal with big boxen, my trusty old Q6600
 will likely get all upset with some localhost stuff.

But is this patch set really cause regression on your Q6600? It may
sacrificed some thing, but I still think it will benefit far more,
especially on huge systems.

Regards,
Michael Wang


 
 -Mike
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/21/2013 04:10 PM, Mike Galbraith wrote:
 On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: 
 On 02/21/2013 02:11 PM, Mike Galbraith wrote:
 On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
 On 02/20/2013 06:49 PM, Ingo Molnar wrote:
 [snip]
 [snip]

if wake_affine()
new_cpu = select_idle_sibling(curr_cpu)
else
new_cpu = select_idle_sibling(prev_cpu)

return new_cpu

 Actually that doesn't make sense.

 I think wake_affine() is trying to check whether move a task from
 prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
 won't break balance means curr_cpu is better than prev_cpu for searching
 the idle cpu?

 You could argue that it's impossible to break balance by moving any task
 to any idle cpu, but that would mean bouncing tasks cross node on every
 wakeup is fine, which it isn't.

 I don't get it... could you please give me more detail on how
 wake_affine() related with bouncing?
 
 If we didn't ever ask if it's ok, we'd always pull, and stack load up on
 one package if there's the tiniest of holes to stuff a task into,
 periodic balance forcibly rips buddies back apart, repeat.  At least
 with wake_affine() in the loop, there's somewhat of a damper. 

Oh, I think I got the reason why old logical check the affine_sd
firstly, so when prev_cpu and curr_cpu belong to different package,
there will be a chance to enter balance path, that seems like a solution
for this problem, amazing ;-)

You are right, with out this logical, chances to balance load between
packages will missed, I will apply it in next version.

Regards,
Michael Wang

 
 So the new logical in this patch set is:

new_cpu = select_idle_sibling(prev_cpu)
if idle_cpu(new_cpu)
return new_cpu

 So you tilted the scales in favor of leaving tasks in their current
 package, which should benefit large footprint tasks, but should also
 penalize light communicating tasks.

 Yes, I'd prefer to wakeup the task on a cpu which:
 1. idle
 2. close to prev_cpu

 So if both curr_cpu and prev_cpu have idle cpu in their topology, which
 one is better? that depends on how task benefit from cache and the
 balance situation, whatever, I don't think the benefit worth the high
 cost of wake_affine() in most cases...
 
 We've always used wake_affine() before, yet been able to schedule at
 high frequency, so I don't see that it can be _that_ expensive.  I
 haven't actually measured lately (lng time) though.
 
 WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
 some will gain, some will lose.  On a modern single processor box, it
 just doesn't matter, there's only one llc (two s_i_s() calls = oopsie),
 but on my beloved old Q6600 or a big box, it'll matter a lot to
 something.  NUMA balance will deal with big boxen, my trusty old Q6600
 will likely get all upset with some localhost stuff.
 
 -Mike
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:

 But is this patch set really cause regression on your Q6600? It may
 sacrificed some thing, but I still think it will benefit far more,
 especially on huge systems.

We spread on FORK/EXEC, and will no longer will pull communicating tasks
back to a shared cache with the new logic preferring to leave wakee
remote, so while no, I haven't tested (will try to find round tuit) it
seems  it _must_ hurt.  Dragging data from one llc to the other on Q6600
hurts a LOT.  Every time a client and server are cross llc, it's a huge
hit.  The previous logic pulled communicating tasks together right when
it matters the most, intermittent load... or interactive use.

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Peter Zijlstra
On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
 The old logical when locate affine_sd is:
 
 if prev_cpu != curr_cpu
 if wake_affine()
 prev_cpu = curr_cpu
 new_cpu = select_idle_sibling(prev_cpu)
 return new_cpu
 
 The new logical is same to the old one if prev_cpu == curr_cpu, so
 let's
 simplify the old logical like:
 
 if wake_affine()
 new_cpu = select_idle_sibling(curr_cpu)
 else
 new_cpu = select_idle_sibling(prev_cpu)
 
 return new_cpu
 
 Actually that doesn't make sense.

It does :-)

 I think wake_affine() is trying to check whether move a task from
 prev_cpu to curr_cpu will break the balance in affine_sd or not, but
 why
 won't break balance means curr_cpu is better than prev_cpu for
 searching
 the idle cpu?

It doesn't, the whole affine wakeup stuff is meant to pull waking tasks
towards the cpu that does the wakeup, we limit this by putting bounds on
the imbalance this is may create.

The reason we want to run tasks on the cpu that does the wakeup is
because that cpu 'obviously' is running something related and it seems
like a good idea to run related tasks close together.

So look at affine wakeups as a force that groups related tasks.

 So the new logical in this patch set is:
 
 new_cpu = select_idle_sibling(prev_cpu)
 if idle_cpu(new_cpu)
 return new_cpu
 
 new_cpu = select_idle_sibling(curr_cpu)
 if idle_cpu(new_cpu) {
 if wake_affine()
 return new_cpu
 }
 
 return prev_cpu
 
 And now, unless we are really going to move load from prev_cpu to
 curr_cpu, we won't use wake_affine() any more.

That's completely breaks stuff, not cool.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/21/2013 05:43 PM, Mike Galbraith wrote:
 On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
 
 But is this patch set really cause regression on your Q6600? It may
 sacrificed some thing, but I still think it will benefit far more,
 especially on huge systems.
 
 We spread on FORK/EXEC, and will no longer will pull communicating tasks
 back to a shared cache with the new logic preferring to leave wakee
 remote, so while no, I haven't tested (will try to find round tuit) it
 seems  it _must_ hurt.  Dragging data from one llc to the other on Q6600
 hurts a LOT.  Every time a client and server are cross llc, it's a huge
 hit.  The previous logic pulled communicating tasks together right when
 it matters the most, intermittent load... or interactive use.

I agree that this is a problem need to be solved, but don't agree that
wake_affine() is the solution.

According to my understanding, in the old world, wake_affine() will only
be used if curr_cpu and prev_cpu share cache, which means they are in
one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't
have the chance to spread the task out of that package.

I'm going to recover the logical that only do select_idle_sibling() when
prev_cpu and curr_cpu are affine, so now the new logical will only
prefer leaving task in old package if both prev_cpu and curr_cpu are in
that package, I think this could solve the problem, isn't it?

Regards,
Michael Wang



 
 -Mike
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/21/2013 06:20 PM, Peter Zijlstra wrote:
 On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
 The old logical when locate affine_sd is:

 if prev_cpu != curr_cpu
 if wake_affine()
 prev_cpu = curr_cpu
 new_cpu = select_idle_sibling(prev_cpu)
 return new_cpu

 The new logical is same to the old one if prev_cpu == curr_cpu, so
 let's
 simplify the old logical like:

 if wake_affine()
 new_cpu = select_idle_sibling(curr_cpu)
 else
 new_cpu = select_idle_sibling(prev_cpu)

 return new_cpu

 Actually that doesn't make sense.
 
 It does :-)
 
 I think wake_affine() is trying to check whether move a task from
 prev_cpu to curr_cpu will break the balance in affine_sd or not, but
 why
 won't break balance means curr_cpu is better than prev_cpu for
 searching
 the idle cpu?
 
 It doesn't, the whole affine wakeup stuff is meant to pull waking tasks
 towards the cpu that does the wakeup, we limit this by putting bounds on
 the imbalance this is may create.
 
 The reason we want to run tasks on the cpu that does the wakeup is
 because that cpu 'obviously' is running something related and it seems
 like a good idea to run related tasks close together.
 
 So look at affine wakeups as a force that groups related tasks.

That's right, and it's one point I've missed when judging the
wake_affine()...

But that's really some benefit hardly to be estimate, especially when
the workload is heavy, the cost of wake_affine() is very high to
calculated se one by one, is that worth for some benefit we could not
promise?

According to the testing result, I could not agree this purpose of
wake_affine() benefit us, but I'm sure that wake_affine() is a terrible
performance killer when system is busy.

 
 So the new logical in this patch set is:

 new_cpu = select_idle_sibling(prev_cpu)
 if idle_cpu(new_cpu)
 return new_cpu

 new_cpu = select_idle_sibling(curr_cpu)
 if idle_cpu(new_cpu) {
 if wake_affine()
 return new_cpu
 }

 return prev_cpu

 And now, unless we are really going to move load from prev_cpu to
 curr_cpu, we won't use wake_affine() any more.
 
 That's completely breaks stuff, not cool.

Could you please give more details on what's the point you think is bad?

Regards,
Michael Wang

 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: 
 On 02/21/2013 05:43 PM, Mike Galbraith wrote:
  On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
  
  But is this patch set really cause regression on your Q6600? It may
  sacrificed some thing, but I still think it will benefit far more,
  especially on huge systems.
  
  We spread on FORK/EXEC, and will no longer will pull communicating tasks
  back to a shared cache with the new logic preferring to leave wakee
  remote, so while no, I haven't tested (will try to find round tuit) it
  seems  it _must_ hurt.  Dragging data from one llc to the other on Q6600
  hurts a LOT.  Every time a client and server are cross llc, it's a huge
  hit.  The previous logic pulled communicating tasks together right when
  it matters the most, intermittent load... or interactive use.
 
 I agree that this is a problem need to be solved, but don't agree that
 wake_affine() is the solution.

It's not perfect, but it's better than no countering force at all.  It's
a relic of the dark ages, when affine meant L2, i.e. this cpu.  Nowadays,
affine has a whole new meaning, L3, so it could be done differently, but
_some_ kind of opposing force is required.

 According to my understanding, in the old world, wake_affine() will only
 be used if curr_cpu and prev_cpu share cache, which means they are in
 one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't
 have the chance to spread the task out of that package.

? affine_sd is the first domain spanning both cpus, and that may be NODE.
True, we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is
set, that is.  It would be nice to be able to do that without shredding
performance.
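
For reference, the existing affine_sd selection is roughly the loop below
(paraphrased from the current fair.c from memory, details may differ); it
stops at the first domain spanning both cpus, which on a multi-package box
can well be the NODE level:

for_each_domain(cpu, tmp) {
        if (!(tmp->flags & SD_LOAD_BALANCE))
                continue;

        /* The first domain containing both the waking cpu and prev_cpu
         * becomes the affine wakeup domain. */
        if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
            cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
                affine_sd = tmp;
                break;
        }

        if (tmp->flags & sd_flag)
                sd = tmp;
}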

Off the top of my pointy head, I can think of a way to _maybe_ improve
the affine wakeup criteria:  Add a small (package size? and very fast)
FIFO queue to the task struct, and record the waker/wakee relationship.  If
the relationship exists in that queue (rbtree), try to wake local, if not,
wake remote.  The thought is to identify situations a la 1:N pgbench
where you really need to keep the load spread.  That need arises when
the sum of wakees + waker won't fit in one cache.  True buddies would
always hit (hm, hit rate), and always try to become affine where they
thrive.  1:N stuff starts missing when the client count exceeds package
size, and starts expanding its horizons. 'Course you would still need to
NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls
and whatnot.  With a little more smarts, we could have happy 1:N, and
buddies don't have to chat through 2m thick walls to make 1:N scale as
well as it can before it dies of stupidity.
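
Something like this minimal, standalone sketch (the structure name, size,
and helpers are made up for illustration; nothing like this exists today):

#define WAKEE_FIFO_SIZE 8               /* roughly "package size" */

/* Per-task record of the last few tasks this task has woken. */
struct wakee_fifo {
        int pids[WAKEE_FIFO_SIZE];
        unsigned int head;
};

static void wakee_fifo_init(struct wakee_fifo *f)
{
        unsigned int i;

        for (i = 0; i < WAKEE_FIFO_SIZE; i++)
                f->pids[i] = -1;
        f->head = 0;
}

static void wakee_fifo_record(struct wakee_fifo *f, int wakee_pid)
{
        f->pids[f->head] = wakee_pid;
        f->head = (f->head + 1) % WAKEE_FIFO_SIZE;
}

static int wakee_fifo_contains(const struct wakee_fifo *f, int wakee_pid)
{
        unsigned int i;

        for (i = 0; i < WAKEE_FIFO_SIZE; i++)
                if (f->pids[i] == wakee_pid)
                        return 1;
        return 0;
}

/* On wakeup: if the wakee is a recent buddy of the waker, prefer a cpu
 * near the waker (affine), otherwise leave it remote.  In a 1:N load the
 * fifo keeps missing once N exceeds the fifo size, so the load naturally
 * stays spread. */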

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:

 According to the testing result, I could not agree this purpose of
 wake_affine() benefit us, but I'm sure that wake_affine() is a terrible
 performance killer when system is busy.

(hm, result is singular.. pgbench in 1:N mode only?)

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/22/2013 01:02 PM, Mike Galbraith wrote:
 On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: 
 On 02/21/2013 05:43 PM, Mike Galbraith wrote:
 On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:

 But is this patch set really cause regression on your Q6600? It may
 sacrificed some thing, but I still think it will benefit far more,
 especially on huge systems.

 We spread on FORK/EXEC, and will no longer will pull communicating tasks
 back to a shared cache with the new logic preferring to leave wakee
 remote, so while no, I haven't tested (will try to find round tuit) it
 seems  it _must_ hurt.  Dragging data from one llc to the other on Q6600
 hurts a LOT.  Every time a client and server are cross llc, it's a huge
 hit.  The previous logic pulled communicating tasks together right when
 it matters the most, intermittent load... or interactive use.

 I agree that this is a problem need to be solved, but don't agree that
 wake_affine() is the solution.
 
 It's not perfect, but it's better than no countering force at all.  It's
 a relic of the dark ages, when affine meant L2, ie this cpu.  Now days,
 affine has a whole new meaning, L3, so it could be done differently, but
 _some_ kind of opposing force is required.
 
 According to my understanding, in the old world, wake_affine() will only
 be used if curr_cpu and prev_cpu share cache, which means they are in
 one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't
 have the chance to spread the task out of that package.
 
 ? affine_sd is the first domain spanning both cpus, that may be NODE.
 True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is
 set that is.  Would be nice to be able to do that without shredding
 performance.
 
 Off the top of my pointy head, I can think of a way to _maybe_ improve
 the affine wakeup criteria:  Add a small (package size? and very fast)
 FIFO queue to task struct, record waker/wakee relationship.  If
 relationship exists in that queue (rbtree), try to wake local, if not,
 wake remote.  The thought is to identify situations ala 1:N pgbench
 where you really need to keep the load spread.  That need arises when
 the sum wakees + waker won't fit in one cache.  True buddies would
 always hit (hm, hit rate), always try to become affine where they
 thrive.  1:N stuff starts missing when client count exceeds package
 size, starts expanding it's horizons. 'Course you would still need to
 NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls
 and whatnot.  With a little more smarts, we could have happy 1:N, and
 buddies don't have to chat through 2m thick walls to make 1:N scale as
 well as it can before it dies of stupidity.

Just to confirm that I'm not on the wrong track: does the 1:N mode here mean
1 task forked N threads, and the children always talk with the parent?

Regards,
Michael Wang

 
 -Mike
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/22/2013 01:08 PM, Mike Galbraith wrote:
 On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
 
 According to the testing result, I could not agree this purpose of
 wake_affine() benefit us, but I'm sure that wake_affine() is a terrible
 performance killer when system is busy.
 
 (hm, result is singular.. pgbench in 1:N mode only?)

I'm not sure how pgbench is implemented; all I know is that it will create
several instances which access the database, which I suppose is no different
from several threads accessing the database (1 server and N clients?).

There is an improvement because, when the system is busy, wake_affine() will
be skipped.

And in the old world, when the system is busy, wake_affine() will only be
skipped if prev_cpu and curr_cpu belong to different nodes.

Regards,
Michael Wang

 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Fri, 2013-02-22 at 13:26 +0800, Michael Wang wrote:

 Just confirm that I'm not on the wrong way, did the 1:N mode here means
 1 task forked N threads, and child always talk with father?

Yes, one server, many clients.

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Mike Galbraith
On Fri, 2013-02-22 at 14:06 +0800, Michael Wang wrote: 
 On 02/22/2013 01:08 PM, Mike Galbraith wrote:
  On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
  
  According to the testing result, I could not agree this purpose of
  wake_affine() benefit us, but I'm sure that wake_affine() is a terrible
  performance killer when system is busy.
  
  (hm, result is singular.. pgbench in 1:N mode only?)
 
 I'm not sure about how pgbench implemented, all I know is it will create
 several instance and access the database, I suppose no different from
 several threads access database (1 server and N clients?).

It's user switchable.

-Mike

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-21 Thread Michael Wang
On 02/22/2013 01:02 PM, Mike Galbraith wrote:
 On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: 
 On 02/21/2013 05:43 PM, Mike Galbraith wrote:
 On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:

 But is this patch set really cause regression on your Q6600? It may
 sacrificed some thing, but I still think it will benefit far more,
 especially on huge systems.

 We spread on FORK/EXEC, and will no longer will pull communicating tasks
 back to a shared cache with the new logic preferring to leave wakee
 remote, so while no, I haven't tested (will try to find round tuit) it
 seems  it _must_ hurt.  Dragging data from one llc to the other on Q6600
 hurts a LOT.  Every time a client and server are cross llc, it's a huge
 hit.  The previous logic pulled communicating tasks together right when
 it matters the most, intermittent load... or interactive use.

 I agree that this is a problem need to be solved, but don't agree that
 wake_affine() is the solution.
 
 It's not perfect, but it's better than no countering force at all.  It's
 a relic of the dark ages, when affine meant L2, ie this cpu.  Now days,
 affine has a whole new meaning, L3, so it could be done differently, but
 _some_ kind of opposing force is required.
 
 According to my understanding, in the old world, wake_affine() will only
 be used if curr_cpu and prev_cpu share cache, which means they are in
 one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't
 have the chance to spread the task out of that package.
 
 ? affine_sd is the first domain spanning both cpus, that may be NODE.
 True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is
 set that is.  Would be nice to be able to do that without shredding
 performance.

That's right; we need two conditions to hold in each select instance:
1. prev_cpu and curr_cpu are not affine
2. SD_WAKE_BALANCE is set

 
 Off the top of my pointy head, I can think of a way to _maybe_ improve
 the affine wakeup criteria:  Add a small (package size? and very fast)
 FIFO queue to task struct, record waker/wakee relationship.  If
 relationship exists in that queue (rbtree), try to wake local, if not,
 wake remote.  The thought is to identify situations ala 1:N pgbench
 where you really need to keep the load spread.  That need arises when
 the sum wakees + waker won't fit in one cache.  True buddies would
 always hit (hm, hit rate), always try to become affine where they
 thrive.  1:N stuff starts missing when client count exceeds package
 size, starts expanding it's horizons. 'Course you would still need to
 NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls
 and whatnot.  With a little more smarts, we could have happy 1:N, and
 buddies don't have to chat through 2m thick walls to make 1:N scale as
 well as it can before it dies of stupidity.

So this is trying to take care of the case where curr_cpu (local) and
prev_cpu (remote) are on different nodes, in which case, in the old world,
wake_affine() won't be invoked, correct?

Hmm... I think this may be a good additional check before entering the
balance path, but I cannot estimate the cost of recording the relationship
at this moment...

Anyway, after applying the affine logic to the new world, it gains the
ability to spread tasks across nodes just like the old world.  Your idea may
be an optimization, but that logic is outside the changes in this patch set,
which means that if it benefits anything, the beneficiary will be not only
the new code but also the old.

Regards,
Michael Wang

 
 -Mike
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-20 Thread Michael Wang
On 02/21/2013 02:11 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
>> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
>> [snip]
[snip]
>>
>>  if wake_affine()
>>  new_cpu = select_idle_sibling(curr_cpu)
>>  else
>>  new_cpu = select_idle_sibling(prev_cpu)
>>
>>  return new_cpu
>>
>> Actually that doesn't make sense.
>>
>> I think wake_affine() is trying to check whether move a task from
>> prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
>> won't break balance means curr_cpu is better than prev_cpu for searching
>> the idle cpu?
> 
> You could argue that it's impossible to break balance by moving any task
> to any idle cpu, but that would mean bouncing tasks cross node on every
> wakeup is fine, which it isn't.

I don't get it... could you please give me more detail on how
wake_affine() is related to the bouncing?

> 
>> So the new logical in this patch set is:
>>
>>  new_cpu = select_idle_sibling(prev_cpu)
>>  if idle_cpu(new_cpu)
>>  return new_cpu
> 
> So you tilted the scales in favor of leaving tasks in their current
> package, which should benefit large footprint tasks, but should also
> penalize light communicating tasks.

Yes, I'd prefer to wake the task up on a cpu which is:
1. idle
2. close to prev_cpu

So if both curr_cpu and prev_cpu have an idle cpu in their topology, which
one is better?  That depends on how much the task benefits from cache and on
the balance situation; either way, I don't think the benefit is worth the
high cost of wake_affine() in most cases...

Regards,
Michael Wang

> 
> I suspect that much of the pgbench improvement comes from the preemption
> mitigation from keeping 1:N load maximally spread, which is the perfect
> thing to do with such loads.  In all the testing I ever did with it in
> 1:N mode, preemption dominated performance numbers.  Keep server away
> from clients, it has fewer fair competition worries, can consume more
> CPU preemption free, pushing the load collapse point strongly upward.
> 
> -Mike
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-20 Thread Mike Galbraith
On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: 
> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
> [snip]
> > 
> > The changes look clean and reasoable, any ideas exactly *why* it 
> > speeds up?
> > 
> > I.e. are there one or two key changes in the before/after logic 
> > and scheduling patterns that you can identify as causing the 
> > speedup?
> 
> Hi, Ingo
> 
> Thanks for your reply, please let me point out the key changes here
> (forgive me for haven't wrote a good description in cover).
> 
> The performance improvement from this patch set is:
> 1. delay the invoke on wake_affine().
> 2. save the circle to gain proper sd.
> 
> The second point is obviously, and will benefit a lot when the sd
> topology is deep (NUMA is suppose to make it deeper on large system).
> 
> So in my testing on a 12 cpu box, actually most of the benefit comes
> from the first point, and please let me introduce it in detail.
> 
> The old logical when locate affine_sd is:
> 
>   if prev_cpu != curr_cpu
>   if wake_affine()
>   prev_cpu = curr_cpu
>   new_cpu = select_idle_sibling(prev_cpu)
>   return new_cpu
> 
> The new logical is same to the old one if prev_cpu == curr_cpu, so let's
> simplify the old logical like:
> 
>   if wake_affine()
>   new_cpu = select_idle_sibling(curr_cpu)
>   else
>   new_cpu = select_idle_sibling(prev_cpu)
> 
>   return new_cpu
> 
> Actually that doesn't make sense.
> 
> I think wake_affine() is trying to check whether move a task from
> prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
> won't break balance means curr_cpu is better than prev_cpu for searching
> the idle cpu?

You could argue that it's impossible to break balance by moving any task
to any idle cpu, but that would mean bouncing tasks cross node on every
wakeup is fine, which it isn't.

> So the new logical in this patch set is:
> 
>   new_cpu = select_idle_sibling(prev_cpu)
>   if idle_cpu(new_cpu)
>   return new_cpu

So you tilted the scales in favor of leaving tasks in their current
package, which should benefit large footprint tasks, but should also
penalize light communicating tasks.

I suspect that much of the pgbench improvement comes from the preemption
mitigation from keeping 1:N load maximally spread, which is the perfect
thing to do with such loads.  In all the testing I ever did with it in
1:N mode, preemption dominated performance numbers.  Keep server away
from clients, it has fewer fair competition worries, can consume more
CPU preemption free, pushing the load collapse point strongly upward.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-20 Thread Michael Wang
On 02/20/2013 10:05 PM, Mike Galbraith wrote:
> On Wed, 2013-02-20 at 14:32 +0100, Peter Zijlstra wrote: 
>> On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:
>>
>>> The changes look clean and reasoable, 
>>
>> I don't necessarily agree, note that O(n^2) storage requirement that
>> Michael failed to highlight ;-)
> 
> (yeah, I mentioned that needs to shrink.. a lot)

Exactly, and I'm going to apply the suggestion now :)

> 
>>> any ideas exactly *why* it speeds up?
>>
>> That is indeed the most interesting part.. There's two parts to
>> select_task_rq_fair(), the 'regular' affine wakeup path, and the
>> fork/exec find_idlest_goo() path. At the very least we need to quantify
>> which of these two parts contributes most to the speedup.
>>
>> In the power balancing discussion we already noted that the
>> find_idlest_goo() is in need of attention.
> 
> Yup, even little stuff like break off the search when load is zero..

Agreed, searching through a bunch of idle cpus and their subsets doesn't
make sense...
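
Something like this minimal sketch of the early exit (the names are
illustrative, assuming a find_idlest_group()-style scan, not the actual
kernel code):

/* Illustrative only: stop scanning candidate groups once one is fully idle. */
struct candidate_group {
        unsigned long avg_load;
        struct candidate_group *next;
};

static struct candidate_group *find_idlest_sketch(struct candidate_group *groups)
{
        struct candidate_group *g, *idlest = NULL;
        unsigned long min_load = ~0UL;

        for (g = groups; g; g = g->next) {
                if (g->avg_load == 0)
                        return g;       /* can't beat idle, break off the search */
                if (g->avg_load < min_load) {
                        min_load = g->avg_load;
                        idlest = g;
                }
        }
        return idlest;
}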

Regards,
Michael Wang

> unless someone is planning on implementing anti-idle 'course ;-)
> 
> -Mike
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-20 Thread Michael Wang
On 02/20/2013 09:32 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:
> 
>> The changes look clean and reasoable, 
> 
> I don't necessarily agree, note that O(n^2) storage requirement that
> Michael failed to highlight ;-)

Forgive me for not explaining this point in the cover letter, but it's
really not a big deal in my opinion...

And I'm going to apply Mike's suggestion to do the allocation when a cpu
becomes active; that will save some space :)
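
A rough sketch of that allocate-on-active idea, using a cpu hotplug
notifier; the sbm_data array and the sbm_row_size() helper are illustrative
assumptions about the schedule balance map, not the actual patch:

/* Illustrative only: allocate a cpu's balance-map row when it comes up. */
static void *sbm_data[NR_CPUS];

static int sbm_cpu_callback(struct notifier_block *nb,
                            unsigned long action, void *hcpu)
{
        int cpu = (long)hcpu;

        switch (action & ~CPU_TASKS_FROZEN) {
        case CPU_UP_PREPARE:
                if (!sbm_data[cpu])
                        sbm_data[cpu] = kzalloc(sbm_row_size(), GFP_KERNEL);
                break;
        case CPU_DEAD:
                kfree(sbm_data[cpu]);
                sbm_data[cpu] = NULL;
                break;
        }
        return NOTIFY_OK;
}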

Regards,
Michael Wang

> 
>> any ideas exactly *why* it speeds up?
> 
> That is indeed the most interesting part.. There's two parts to
> select_task_rq_fair(), the 'regular' affine wakeup path, and the
> fork/exec find_idlest_goo() path. At the very least we need to quantify
> which of these two parts contributes most to the speedup.
> 
> In the power balancing discussion we already noted that the
> find_idlest_goo() is in need of attention.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-20 Thread Michael Wang
On 02/20/2013 06:49 PM, Ingo Molnar wrote:
[snip]
> 
> The changes look clean and reasoable, any ideas exactly *why* it 
> speeds up?
> 
> I.e. are there one or two key changes in the before/after logic 
> and scheduling patterns that you can identify as causing the 
> speedup?

Hi, Ingo

Thanks for your reply; please let me point out the key changes here
(forgive me for not having written a good description in the cover letter).

The performance improvement from this patch set comes from:
1. delaying the invocation of wake_affine().
2. saving the cycles spent locating the proper sd.

The second point is obvious, and will benefit a lot when the sd
topology is deep (NUMA is supposed to make it deeper on large systems).

In my testing on a 12 cpu box, most of the benefit actually comes
from the first point, so please let me introduce it in detail.

The old logic when locating affine_sd is:

if prev_cpu != curr_cpu
if wake_affine()
prev_cpu = curr_cpu
new_cpu = select_idle_sibling(prev_cpu)
return new_cpu

The new logic is the same as the old one if prev_cpu == curr_cpu, so let's
simplify the old logic like:

if wake_affine()
new_cpu = select_idle_sibling(curr_cpu)
else
new_cpu = select_idle_sibling(prev_cpu)

return new_cpu

Actually that doesn't make sense.

I think wake_affine() is trying to check whether moving a task from
prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
does 'won't break the balance' mean curr_cpu is better than prev_cpu for
searching for the idle cpu?

So the new logic in this patch set is:

new_cpu = select_idle_sibling(prev_cpu)
if idle_cpu(new_cpu)
return new_cpu

new_cpu = select_idle_sibling(curr_cpu)
if idle_cpu(new_cpu) {
if wake_affine()
return new_cpu
}

return prev_cpu

And now, unless we are really going to move load from prev_cpu to
curr_cpu, we won't use wake_affine() any more.

So we avoid wake_affine() when the system load is low or high.  For middle
load, the worst case is when we fail to locate an idle cpu in prev_cpu's
topology but succeed in locating one in curr_cpu's, but that rarely
happens, and the benchmark results prove that point.
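
Expressed as a C-style sketch (the prototypes just mirror the pseudocode
above; this is a simplification of the patch, not its literal code):

/* Sketch only: these mirror the pseudocode, not the real prototypes. */
int select_idle_sibling(int cpu);   /* an idle cpu near 'cpu', or 'cpu' */
int idle_cpu(int cpu);              /* 1 if 'cpu' is currently idle */
int wake_affine(void);              /* the (expensive) imbalance check */

static int select_wake_cpu(int prev_cpu, int curr_cpu)
{
        int new_cpu;

        /* Try to stay close to the wakee's old cache first. */
        new_cpu = select_idle_sibling(prev_cpu);
        if (idle_cpu(new_cpu))
                return new_cpu;

        /* Only if that fails, look near the waker, and only then pay
         * for the wake_affine() check. */
        new_cpu = select_idle_sibling(curr_cpu);
        if (idle_cpu(new_cpu) && wake_affine())
                return new_cpu;

        /* Otherwise leave the task where it was. */
        return prev_cpu;
}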

Some comparisons below:

1. system load is low
	old logic cost:
		wake_affine()
		select_idle_sibling()
	new logic cost:
		select_idle_sibling()

2. system load is high
	old logic cost:
		wake_affine()
		select_idle_sibling()
	new logic cost:
		select_idle_sibling()
		select_idle_sibling()

3. system load is middle
	hard to say in advance

Case 1 saves the cost of wake_affine(); for case 3 the benchmarks show at
least no regression.

For case 2, it is a comparison between wake_affine() and
select_idle_sibling(); since the system load is high, wake_affine() costs
far more than select_idle_sibling(), so we save a lot according to the
benchmark results.
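
For reference, select_idle_sibling() roughly does the following (a
paraphrase of the idea only; llc_span() stands in for the real
last-level-cache domain lookup and is not an actual kernel helper):

/* Cheap scan for an idle cpu sharing the LLC with 'target'. */
int idle_cpu(int cpu);
const struct cpumask *llc_span(int cpu);    /* illustrative stand-in */

static int select_idle_sibling_sketch(int target)
{
        int cpu;

        if (idle_cpu(target))
                return target;

        for_each_cpu(cpu, llc_span(target))
                if (idle_cpu(cpu))
                        return cpu;

        return target;
}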

> 
> Such changes also typically have a chance to cause regressions 
> in other workloads - when that happens we need this kind of 
> information to be able to enact plan-B.

The benefit comes from avoiding unnecessary work, and the patch set is
supposed to only reduce the cost of the key function with the least logic
change.  I cannot promise it benefits all workloads, but so far I have not
found a regression.

Regards,
Michael Wang

> 
> Thanks,
> 
>   Ingo
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-20 Thread Mike Galbraith
On Wed, 2013-02-20 at 14:32 +0100, Peter Zijlstra wrote: 
> On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:
> 
> > The changes look clean and reasoable, 
> 
> I don't necessarily agree, note that O(n^2) storage requirement that
> Michael failed to highlight ;-)

(yeah, I mentioned that needs to shrink.. a lot)

> > any ideas exactly *why* it speeds up?
> 
> That is indeed the most interesting part.. There's two parts to
> select_task_rq_fair(), the 'regular' affine wakeup path, and the
> fork/exec find_idlest_goo() path. At the very least we need to quantify
> which of these two parts contributes most to the speedup.
> 
> In the power balancing discussion we already noted that the
> find_idlest_goo() is in need of attention.

Yup, even little stuff like break off the search when load is zero..
unless someone is planning on implementing anti-idle 'course ;-)

-Mike


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-20 Thread Peter Zijlstra
On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:

> The changes look clean and reasoable, 

I don't necessarily agree, note that O(n^2) storage requirement that
Michael failed to highlight ;-)

> any ideas exactly *why* it speeds up?

That is indeed the most interesting part.. There's two parts to
select_task_rq_fair(), the 'regular' affine wakeup path, and the
fork/exec find_idlest_goo() path. At the very least we need to quantify
which of these two parts contributes most to the speedup.

In the power balancing discussion we already noted that the
find_idlest_goo() is in need of attention.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-20 Thread Ingo Molnar

* Michael Wang  wrote:

> v3 change log:
>   Fix small logical issues (Thanks to Mike Galbraith).
>   Change the way of handling WAKE.
> 
> This patch set is trying to simplify the select_task_rq_fair() 
> with schedule balance map.
> 
> After get rid of the complex code and reorganize the logical, 
> pgbench show the improvement, more the clients, bigger the 
> improvement.
> 
>   Prev:   Post:
> 
>   | db_size | clients |   |  tps  |   |  tps  |
> +-+-+   +---+   +---+
> | 22 MB   |   1 |   | 10788 |   | 10881 |
> | 22 MB   |   2 |   | 21617 |   | 21837 |
> | 22 MB   |   4 |   | 41597 |   | 42645 |
> | 22 MB   |   8 |   | 54622 |   | 57808 |
> | 22 MB   |  12 |   | 50753 |   | 54527 |
> | 22 MB   |  16 |   | 50433 |   | 56368 | +11.77%
> | 22 MB   |  24 |   | 46725 |   | 54319 | +16.25%
> | 22 MB   |  32 |   | 43498 |   | 54650 | +25.64%
> | 7484 MB |   1 |   |  7894 |   |  8301 |
> | 7484 MB |   2 |   | 19477 |   | 19622 |
> | 7484 MB |   4 |   | 36458 |   | 38242 |
> | 7484 MB |   8 |   | 48423 |   | 50796 |
> | 7484 MB |  12 |   | 46042 |   | 49938 |
> | 7484 MB |  16 |   | 46274 |   | 50507 | +9.15%
> | 7484 MB |  24 |   | 42583 |   | 49175 | +15.48%
> | 7484 MB |  32 |   | 36413 |   | 49148 | +34.97%
> | 15 GB   |   1 |   |  7742 |   |  7876 |
> | 15 GB   |   2 |   | 19339 |   | 19531 |
> | 15 GB   |   4 |   | 36072 |   | 37389 |
> | 15 GB   |   8 |   | 48549 |   | 50570 |
> | 15 GB   |  12 |   | 45716 |   | 49542 |
> | 15 GB   |  16 |   | 46127 |   | 49647 | +7.63%
> | 15 GB   |  24 |   | 42539 |   | 48639 | +14.34%
> | 15 GB   |  32 |   | 36038 |   | 48560 | +34.75%
> 
> Please check the patch for more details about schedule balance map.

The changes look clean and reasonable, any ideas exactly *why* it 
speeds up?

I.e. are there one or two key changes in the before/after logic 
and scheduling patterns that you can identify as causing the 
speedup?

Such changes also typically have a chance to cause regressions 
in other workloads - when that happens we need this kind of 
information to be able to enact plan-B.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

2013-02-17 Thread Michael Wang
On 01/29/2013 05:08 PM, Michael Wang wrote:
> v3 change log:
>   Fix small logical issues (Thanks to Mike Galbraith).
>   Change the way of handling WAKE.
> 
> This patch set is trying to simplify the select_task_rq_fair() with
> schedule balance map.
> 
> After get rid of the complex code and reorganize the logical, pgbench show
> the improvement, more the clients, bigger the improvement.
> 
>   Prev:   Post:
> 
>   | db_size | clients |   |  tps  |   |  tps  |
> +-+-+   +---+   +---+
> | 22 MB   |   1 |   | 10788 |   | 10881 |
> | 22 MB   |   2 |   | 21617 |   | 21837 |
> | 22 MB   |   4 |   | 41597 |   | 42645 |
> | 22 MB   |   8 |   | 54622 |   | 57808 |
> | 22 MB   |  12 |   | 50753 |   | 54527 |
> | 22 MB   |  16 |   | 50433 |   | 56368 | +11.77%
> | 22 MB   |  24 |   | 46725 |   | 54319 | +16.25%
> | 22 MB   |  32 |   | 43498 |   | 54650 | +25.64%
> | 7484 MB |   1 |   |  7894 |   |  8301 |
> | 7484 MB |   2 |   | 19477 |   | 19622 |
> | 7484 MB |   4 |   | 36458 |   | 38242 |
> | 7484 MB |   8 |   | 48423 |   | 50796 |
> | 7484 MB |  12 |   | 46042 |   | 49938 |
> | 7484 MB |  16 |   | 46274 |   | 50507 | +9.15%
> | 7484 MB |  24 |   | 42583 |   | 49175 | +15.48%
> | 7484 MB |  32 |   | 36413 |   | 49148 | +34.97%
> | 15 GB   |   1 |   |  7742 |   |  7876 |
> | 15 GB   |   2 |   | 19339 |   | 19531 |
> | 15 GB   |   4 |   | 36072 |   | 37389 |
> | 15 GB   |   8 |   | 48549 |   | 50570 |
> | 15 GB   |  12 |   | 45716 |   | 49542 |
> | 15 GB   |  16 |   | 46127 |   | 49647 | +7.63%
> | 15 GB   |  24 |   | 42539 |   | 48639 | +14.34%
> | 15 GB   |  32 |   | 36038 |   | 48560 | +34.75%
> 
> Please check the patch for more details about schedule balance map.
> 
> Support the NUMA domain but not well tested.
> Support the rebuild of domain but not tested.

Hi, Ingo, Peter

I've finished the tests I could figure out (NUMA, domain rebuild...); no
issue appears on my box.

I think this patch set will benefit the system, especially when there
is a huge number of cpus.

What do you think about this idea?  Do you have any comments on the patch set?

Regards,
Michael Wang

> 
> Comments are very welcomed.
> 
> Behind the v3:
>   Some changes has been applied to the way of handling WAKE.
> 
>   And that's all around one question, whether we should do load balance
>   for WAKE or not?
> 
>   In the old world, the only chance to do load balance for WAKE is when
>   prev cpu and curr cpu are not cache affine, but that doesn't make sense.
> 
>   I suppose the real meaning behind that logical is, do balance only if
>   cache benefit nothing after changing cpu.
> 
>   However, select_idle_sibling() is not only designed for the purpose to
>   take care of cache, it also benefit latency, and cost less than the
>   balance path.
> 
>   Besides, it's impossible to estimate the benefit of doing load balance
>   at that point of time.
> 
>   And that's come out the v3, no load balance for WAKE.
> 
> Test with:
>   12 cpu X86 server and linux-next 3.8.0-rc3.
> 
> Michael Wang (3):
>   [RFC PATCH v3 1/3] sched: schedule balance map foundation
>   [RFC PATCH v3 2/3] sched: build schedule balance map
>   [RFC PATCH v3 3/3] sched: simplify select_task_rq_fair() with schedule 
> balance map
> 
> Signed-off-by: Michael Wang 
> ---
>  b/kernel/sched/core.c  |   44 +++
>  b/kernel/sched/fair.c  |  135 
> ++---
>  b/kernel/sched/sched.h |   14 +
>  kernel/sched/core.c|   67 
>  4 files changed, 199 insertions(+), 61 deletions(-)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

