Re: [PATCH] sched: wakeup buddy
[snip]
>> It could bring the same benefit but at lower overhead, what's the point
>> of computing the same value over and over again? Also, the rate limit
>> thing naturally works for the soft/hard-irq case.
>
> Just to confirm my understanding, we are going to do something like:
>
>	if (now - wakee->last > time_limit && wakeup_affine())
>		wakee->last = now
>		select_idle_sibling(curr_cpu)
>	else
>		select_idle_sibling(prev_cpu)
>
> And time_limit is some static value with respect to the rate of load
> balance, is that correct?
>
> Currently I haven't found a regression from reducing the rate, but if
> we find such a benchmark, we may still need a way (knob or CONFIG) to
> disable this limitation.

I've done some quick tests of this proposal on my 12-cpu box with the
pgbench 32-client test: with a 1000ms time_limit the benefit is just
like the 8-ref wakeup buddy, with a 10ms time_limit the benefit drops
by half, and with a 1ms time_limit the benefit is less than 10%.

	              tps
	original      43404
	wakeup-buddy  63024   +45.20%
	1s-limit      62359   +43.67%
	100ms-limit   57547   +32.58%
	10ms-limit    52258   +20.40%
	1ms-limit     46535    +7.21%

The other pgbench test items show corresponding results, and the other
benchmarks remain insensitive to the changes.

I'm planning to make a new patch for this approach later, in which
time_limit will be a knob with a default value of 1ms (usually the
initial value of balance_interval and the value of min_interval), based
on the latest tip tree.

Regards,
Michael Wang
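For reference, a minimal C sketch of the throttle logic under discussion
could look like the following. It is an illustration only, not the
actual patch: the wakee->last field, the time_limit parameter and the
wake_affine()/select_idle_sibling() signatures are simplified
assumptions.

	/*
	 * Sketch: rate-limit the wake_affine() decision per wakee.
	 * Only when the last successful affine pull is older than
	 * time_limit do we pay for wake_affine() again; on success the
	 * wakee is pulled toward the waker's cpu, otherwise it stays
	 * near its previous cpu.
	 */
	static int select_wake_target(struct task_struct *wakee,
				      int curr_cpu, int prev_cpu,
				      u64 now, u64 time_limit)
	{
		if (now - wakee->last > time_limit && wake_affine(wakee)) {
			wakee->last = now;
			return select_idle_sibling(wakee, curr_cpu);
		}

		return select_idle_sibling(wakee, prev_cpu);
	}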
Re: [PATCH] sched: wakeup buddy
On 03/14/2013 06:58 PM, Peter Zijlstra wrote:
> On Wed, 2013-03-13 at 11:07 +0800, Michael Wang wrote:
>
>> However, we have already figured out the logic that wakeup-related
>> tasks could benefit from running closely, which promises us a
>> somewhat reliable benefit.
>
> I'm not convinced that the 2 task wakeup scenario is the only sane
> scenario. Imagine a ring of 3 tasks that wake each other, if the
> machine is loaded enough, those 3 tasks might fit a single cpu just
> fine -- esp. if one of those is always idle.
>
> But your strict 1:1 wakeup relation thing will completely fail to
> detect this.

Hmm... yeah, I see your point here: some optimization timing will
always be missed however we twiddle the knob, unless it is 0.

You are right, that's a problem; although currently there is no
workload to prove it, we have the theory...

>
>> IMHO, that sounds a little easier for users than making them decide
>> "tell me how long to pull tasks together"; they may be confused...
>
> Users shouldn't ever need/want/etc.. rely on this. They should just run
> their programs and be (reasonably) happy.
>
>> In summary, I think we have two points here that need to be considered:
>>
>> 1. what about the missed optimization timing, which may benefit
>>    some workload (although we haven't found such a workload yet).
>
> Missed optimization; as in not calling wake_affine() due to the
> throttle? If we're calling it at such a high frequency it is very
> likely the next call isn't very far away.
>
>> 2. how much benefit could the wake_affine() stuff bring us;
>>    if limiting its rate benefits us, why don't we remove it?
>
> It could bring the same benefit but at lower overhead, what's the point
> of computing the same value over and over again? Also, the rate limit
> thing naturally works for the soft/hard-irq case.

Just to confirm my understanding, we are going to do something like:

	if (now - wakee->last > time_limit && wakeup_affine())
		wakee->last = now
		select_idle_sibling(curr_cpu)
	else
		select_idle_sibling(prev_cpu)

And time_limit is some static value with respect to the rate of load
balance, is that correct?

Currently I haven't found a regression from reducing the rate, but if
we find such a benchmark, we may still need a way (knob or CONFIG) to
disable this limitation.

>
> Now, there's another detail I thought up, one could only limit the
> wake_affine() calls once it starts returning false.

Hmm... if wake_affine() keeps succeeding, then there will be no
difference? I do believe pgbench matches that case, since the
wake_affine() stuff makes it suffer... and the more it suffers, the
more often wake_affine() succeeds and pulls unrelated tasks together.

I really can't see how that could help... did I miss something?

Regards,
Michael Wang
Re: [PATCH] sched: wakeup buddy
On Wed, 2013-03-13 at 11:07 +0800, Michael Wang wrote:
> However, we have already figured out the logic that wakeup-related
> tasks could benefit from running closely, which promises us a
> somewhat reliable benefit.

I'm not convinced that the 2 task wakeup scenario is the only sane
scenario. Imagine a ring of 3 tasks that wake each other, if the
machine is loaded enough, those 3 tasks might fit a single cpu just
fine -- esp. if one of those is always idle.

But your strict 1:1 wakeup relation thing will completely fail to
detect this.

> IMHO, that sounds a little easier for users than making them decide
> "tell me how long to pull tasks together"; they may be confused...

Users shouldn't ever need/want/etc.. rely on this. They should just run
their programs and be (reasonably) happy.

> In summary, I think we have two points here that need to be considered:
>
> 1. what about the missed optimization timing, which may benefit
>    some workload (although we haven't found such a workload yet).

Missed optimization; as in not calling wake_affine() due to the
throttle? If we're calling it at such a high frequency it is very
likely the next call isn't very far away.

> 2. how much benefit could the wake_affine() stuff bring us;
>    if limiting its rate benefits us, why don't we remove it?

It could bring the same benefit but at lower overhead, what's the point
of computing the same value over and over again? Also, the rate limit
thing naturally works for the soft/hard-irq case.

Now, there's another detail I thought up, one could only limit the
wake_affine() calls once it starts returning false.
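A sketch of that last variant, under the same simplifying assumptions
as before (the affine_failed flag is hypothetical, purely for
illustration):

	/*
	 * Sketch: engage the throttle only after wake_affine() has
	 * started returning false; while it keeps succeeding, behave
	 * exactly as today.
	 */
	static int want_wake_affine(struct task_struct *wakee,
				    u64 now, u64 time_limit)
	{
		if (wakee->affine_failed && now - wakee->last < time_limit)
			return 0;	/* throttled, keep prev_cpu */

		if (wake_affine(wakee)) {
			wakee->affine_failed = 0;
			return 1;
		}

		wakee->affine_failed = 1;
		wakee->last = now;
		return 0;
	}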
Re: [PATCH] sched: wakeup buddy
On 03/12/2013 06:08 PM, Peter Zijlstra wrote:
> Does something simple like a per-task throttle of wake_affine() gain
> similar benefits? Say something like only do wake_affine() once every
> 10 ms or so (counting on the wakee, not waker).
>
> The rationale being that wake_affine() relies on load-balance
> statistics that aren't updated that much faster, so calling it more
> often than that seems pointless.

That could work; if we do not have any logic to rely on and just want
to limit the rate of pulls, this could be a good solution.

However, we have already figured out the logic that wakeup-related
tasks could benefit from running closely, which promises us a somewhat
reliable benefit.

It's just like: tell me how many times two tasks continuously waking
each other up means they are related, and I will try to pull them
together, since they have a big chance of benefiting from cache.

IMHO, that sounds a little easier for users than making them decide
"tell me how long to pull tasks together"; they may be confused...

In summary, I think we have two points here that need to be considered:

1. what about the missed optimization timing, which may benefit
   some workload (although we haven't found such a workload yet).

2. how much benefit could the wake_affine() stuff bring us;
   if limiting its rate benefits us, why don't we remove it?

Point 1 could be solved by disabling wakeup buddy with a 0 limitation,
and when users complain about their database performance, we just say
"Calm down and take a try on this knob ;-)".

Point 2 is about the wake_affine() stuff itself. Later we could try to
make the stuff better, but I have the feeling that some key info is not
there (maybe we need Ingo's work-atom idea here, just wondering...).
Whatever the case, I think we still need a knob in the end, since this
doesn't sound like a general optimization that benefits all cases.

And I don't agree with removing the stuff, since we have many theories
that it could benefit us; but until it really shows the benefit in all
cases, providing a way to keep it quiet sounds necessary...

Regards,
Michael Wang

>
> Something like that should take a lot less lines to implement.
>
> Just wondering..
Re: [PATCH] sched: wakeup buddy
Does something simple like a per-task throttle of wake_affine() gain
similar benefits? Say something like only do wake_affine() once every
10 ms or so (counting on the wakee, not waker).

The rationale being that wake_affine() relies on load-balance
statistics that aren't updated that much faster, so calling it more
often than that seems pointless.

Something like that should take a lot less lines to implement.

Just wondering..
Re: [PATCH] sched: wakeup buddy
On 03/12/2013 04:48 PM, Ingo Molnar wrote:
[snip]
>>> Instrumentation/stats/profiling will also double check the
>>> correctness of this data: if developers/users start relying on the
>>> work metric as a substitute benchmark number, then app writers will
>>> have an additional incentive to make them correct.
>>
>> I see. I could not figure out how to use the info wisely at the
>> moment, but I have the feeling that it will make the scheduler very
>> different ;-)
>>
>> Maybe we could implement the API and get that info ready first (along
>> with a new sched-pipe which provides work-tick info), then think about
>> the way to use it in the scheduler; are there any patches on the way?
>
> Absolutely.
>
> Beyond the new prctl no new API is needed: a perf soft event could be
> added, and/or a tracepoint. Then perf stat and perf record could be
> used with it. 'perf bench' could be extended to generate the work tick
> in its 'perf bench sched ...' workloads - and for 'perf bench mem numa'
> as well.

Nice :)

>
> vsyscall-accelerating it could be a separate, more complex step: it
> needs a per thread writable vsyscall data area to make the overhead to
> applications near zero. Performance critical apps won't call an extra
> syscall.

If it really brings a benefit, I think they will consider it; whatever
the case, that's the developers'/users' decision, and what we need to
do is just make the stuff attractive.

Regards,
Michael Wang

>
> Thanks,
>
> Ingo
Re: [PATCH] sched: wakeup buddy
* Michael Wang wrote:

> On 03/11/2013 05:40 PM, Ingo Molnar wrote:
>>
>> * Michael Wang wrote:
>>
>>> Hi, Ingo
>>>
>>> On 03/11/2013 04:21 PM, Ingo Molnar wrote:
>>> [snip]
>>>>
>>>> I have actually written the prctl() approach before, for
>>>> instrumentation purposes, and it does wonders to system analysis.
>>>
>>> The idea sounds great, we could get a lot of new info to implement a
>>> smarter scheduler, that's amazing :)
>>>
>>>>
>>>> Any objections?
>>>
>>> Just one concern, maybe I have misunderstood you, but will it cause
>>> trouble if prctl() is used indiscriminately by some applications?
>>> Will we get fake data?
>>
>> It's their problem: overusing it will increase their CPU overhead. The
>> two boundary worst-cases are that they either call it too frequently
>> or too rarely:
>>
>> - too frequently: it approximates the current cpu-runtime work metric
>>
>> - too infrequently: we just ignore it and fall back to a runtime
>>   metric if it does not change.
>>
>> It's not like it can be used to get preferential treatment - we don't
>> ever balance other tasks against these tasks based on work throughput,
>> we try to maximize this workload's work throughput.
>>
>> What could happen is if an app is 'optimized' for a buggy scheduler by
>> changing the work metric frequency. We offer no guarantee - apps will
>> be best off (and users will be least annoyed) if apps honestly report
>> their work metric.
>>
>> Instrumentation/stats/profiling will also double check the
>> correctness of this data: if developers/users start relying on the
>> work metric as a substitute benchmark number, then app writers will
>> have an additional incentive to make them correct.
>
> I see. I could not figure out how to use the info wisely at the
> moment, but I have the feeling that it will make the scheduler very
> different ;-)
>
> Maybe we could implement the API and get that info ready first (along
> with a new sched-pipe which provides work-tick info), then think about
> the way to use it in the scheduler; are there any patches on the way?

Absolutely.

Beyond the new prctl no new API is needed: a perf soft event could be
added, and/or a tracepoint. Then perf stat and perf record could be
used with it. 'perf bench' could be extended to generate the work tick
in its 'perf bench sched ...' workloads - and for 'perf bench mem numa'
as well.

vsyscall-accelerating it could be a separate, more complex step: it
needs a per thread writable vsyscall data area to make the overhead to
applications near zero. Performance critical apps won't call an extra
syscall.

Thanks,

	Ingo
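As an aside, a hedged sketch of what such a tracepoint definition could
look like; the sched_work_tick event name is made up and nothing like
it exists upstream:

	TRACE_EVENT(sched_work_tick,

		TP_PROTO(struct task_struct *p),

		TP_ARGS(p),

		TP_STRUCT__entry(
			__array(char,	comm,	TASK_COMM_LEN)
			__field(pid_t,	pid)
		),

		TP_fast_assign(
			/* Record who completed a work unit. */
			memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
			__entry->pid = p->pid;
		),

		TP_printk("comm=%s pid=%d", __entry->comm, __entry->pid)
	);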
Re: [PATCH] sched: wakeup buddy
On 03/11/2013 05:40 PM, Ingo Molnar wrote:
>
> * Michael Wang wrote:
>
>> Hi, Ingo
>>
>> On 03/11/2013 04:21 PM, Ingo Molnar wrote:
>> [snip]
>>>
>>> I have actually written the prctl() approach before, for
>>> instrumentation purposes, and it does wonders to system analysis.
>>
>> The idea sounds great, we could get a lot of new info to implement a
>> smarter scheduler, that's amazing :)
>>
>>>
>>> Any objections?
>>
>> Just one concern, maybe I have misunderstood you, but will it cause
>> trouble if prctl() is used indiscriminately by some applications?
>> Will we get fake data?
>
> It's their problem: overusing it will increase their CPU overhead. The
> two boundary worst-cases are that they either call it too frequently
> or too rarely:
>
> - too frequently: it approximates the current cpu-runtime work metric
>
> - too infrequently: we just ignore it and fall back to a runtime
>   metric if it does not change.
>
> It's not like it can be used to get preferential treatment - we don't
> ever balance other tasks against these tasks based on work throughput,
> we try to maximize this workload's work throughput.
>
> What could happen is if an app is 'optimized' for a buggy scheduler by
> changing the work metric frequency. We offer no guarantee - apps will
> be best off (and users will be least annoyed) if apps honestly report
> their work metric.
>
> Instrumentation/stats/profiling will also double check the correctness
> of this data: if developers/users start relying on the work metric as
> a substitute benchmark number, then app writers will have an
> additional incentive to make them correct.

I see. I could not figure out how to use the info wisely at the moment,
but I have the feeling that it will make the scheduler very
different ;-)

Maybe we could implement the API and get that info ready first (along
with a new sched-pipe which provides work-tick info), then think about
the way to use it in the scheduler; are there any patches on the way?

Regards,
Michael Wang

>
> Thanks,
>
> Ingo
Re: [PATCH] sched: wakeup buddy
On 03/11/2013 06:36 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-08 at 10:50 +0800, Michael Wang wrote:
>
>>> OK, so there's two issues I have with all this are:
>>>
>>> - it completely wrecks task placement for things like interrupts
>>>   (sadly I don't have a good idea about a benchmark where this
>>>   matters).
>>
>> I don't get this point... could you please give more details?
>
> Take for instance a network workload, the hardirq notifies us there's
> data buffered, then softirq does a ton of network layer demuxing and
> eventually wakes one (or more) tasks to process data. You want these
> tasks to move to the waking cpu, it has lots of this data in cache.
>
> Now, neither hard- nor softirq runs in task context (except for -rt) so
> it completely fails on what you propose.

I got it; exactly, the new feature with the current ref limit will miss
this optimization timing if the data is not related to current...

However, we are still gambling here, and some workloads suffer.

Actually this feature's purpose is to provide a way for users who want
to balance the risk and benefit on their own demand; it provides
flexibility, not restriction...

So what about making the default sysctl_sched_wakeup_buddy_ref 0 and
making everything just like the old world?

And those who want to make the decision more carefully could turn the
knob to make it bigger, and gain the benefit.

>
> We could simply add something like in_softirq() || in_irq() etc.. to
> re-enable wake_affine() for those cases unconditionally, but not sure
> that's the right thing either.
>
>>> - yet another random number.. :/
>>
>> Correct... well, but that also means flexibility; I suppose different
>> systems and workloads will need some tuning on this knob to gain more
>> benefit, but by default they will gain some benefit, small or big.
>
> Nah, it just means another knob nobody knows how to twiddle.

Right, and we already have many such knobs... since some formulas are
impossible to write down.

A mysterious knob: I twiddle it and wait for something to drop from the
sky, candy, cake or rock, whatever; there are adventures there ;-)

And I believe that once the cake drops from the sky, there will be more
adventurers, and it won't be a mysterious knob any more.

>
>>> Also, I'm starting to dislike the buddy name; its somewhat over-used.
>>
>> I have to agree :), any suggestions?
>
> Nah.. I suppose it depends a bit on the shape the final solution takes,

What about defaulting sysctl_sched_wakeup_buddy_ref = 0 and keeping the
old logic?

Regards,
Michael Wang

> but I'll think about it.
Re: [PATCH] sched: wakeup buddy
On Fri, 2013-03-08 at 10:50 +0800, Michael Wang wrote:
>> OK, so there's two issues I have with all this are:
>>
>> - it completely wrecks task placement for things like interrupts
>>   (sadly I don't have a good idea about a benchmark where this
>>   matters).
>
> I don't get this point... could you please give more details?

Take for instance a network workload, the hardirq notifies us there's
data buffered, then softirq does a ton of network layer demuxing and
eventually wakes one (or more) tasks to process data. You want these
tasks to move to the waking cpu, it has lots of this data in cache.

Now, neither hard- nor softirq runs in task context (except for -rt) so
it completely fails on what you propose.

We could simply add something like in_softirq() || in_irq() etc.. to
re-enable wake_affine() for those cases unconditionally, but not sure
that's the right thing either.

>> - yet another random number.. :/
>
> Correct... well, but that also means flexibility; I suppose different
> systems and workloads will need some tuning on this knob to gain more
> benefit, but by default they will gain some benefit, small or big.

Nah, it just means another knob nobody knows how to twiddle.

>> Also, I'm starting to dislike the buddy name; its somewhat over-used.
>
> I have to agree :), any suggestions?

Nah.. I suppose it depends a bit on the shape the final solution takes,
but I'll think about it.
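A minimal sketch of that unconditional re-enable, assuming the throttle
sits in the wakeup path; wake_affine_throttled() is a hypothetical
helper and the wake_affine() signature is simplified:

	/*
	 * Sketch: a wakeup issued from hard-/soft-irq context has the
	 * freshly demuxed data in the waking cpu's cache, so bypass
	 * the per-wakee throttle for it unconditionally.
	 */
	if (in_irq() || in_softirq())
		affine = wake_affine(p);
	else
		affine = !wake_affine_throttled(p) && wake_affine(p);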
Re: [PATCH] sched: wakeup buddy
* Michael Wang wrote:

> Hi, Ingo
>
> On 03/11/2013 04:21 PM, Ingo Molnar wrote:
> [snip]
>>
>> I have actually written the prctl() approach before, for
>> instrumentation purposes, and it does wonders to system analysis.
>
> The idea sounds great, we could get a lot of new info to implement a
> smarter scheduler, that's amazing :)
>
>>
>> Any objections?
>
> Just one concern, maybe I have misunderstood you, but will it cause
> trouble if prctl() is used indiscriminately by some applications?
> Will we get fake data?

It's their problem: overusing it will increase their CPU overhead. The
two boundary worst-cases are that they either call it too frequently or
too rarely:

 - too frequently: it approximates the current cpu-runtime work metric

 - too infrequently: we just ignore it and fall back to a runtime
   metric if it does not change.

It's not like it can be used to get preferential treatment - we don't
ever balance other tasks against these tasks based on work throughput,
we try to maximize this workload's work throughput.

What could happen is if an app is 'optimized' for a buggy scheduler by
changing the work metric frequency. We offer no guarantee - apps will
be best off (and users will be least annoyed) if apps honestly report
their work metric.

Instrumentation/stats/profiling will also double check the correctness
of this data: if developers/users start relying on the work metric as a
substitute benchmark number, then app writers will have an additional
incentive to make them correct.

Thanks,

	Ingo
Re: [PATCH] sched: wakeup buddy
Hi, Ingo

On 03/11/2013 04:21 PM, Ingo Molnar wrote:
[snip]
>
> I have actually written the prctl() approach before, for
> instrumentation purposes, and it does wonders to system analysis.

The idea sounds great, we could get a lot of new info to implement a
smarter scheduler, that's amazing :)

>
> Any objections?

Just one concern, maybe I have misunderstood you, but will it cause
trouble if prctl() is used indiscriminately by some applications?
Will we get fake data?

Regards,
Michael Wang

>
> Thanks,
>
> Ingo
Re: [PATCH] sched: wakeup buddy
* Peter Zijlstra wrote:

> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>
>> wake_affine() stuff is trying to bind related tasks closely, but it
>> doesn't work well according to the test on 'perf bench sched pipe'
>> (thanks to Peter).
>
> so sched-pipe is a poor benchmark for this..
>
> Ideally we'd write a new benchmark that has some actual data footprint
> and we'd measure the cost of tasks being apart on the various cache
> metrics and see what affine wakeup does for it.

Ideally we'd offer applications a new, lightweight vsyscall:

	void sys_sched_work_tick(void)

Or, to speed up adoption, a new, vsyscall-accelerated prctl():

	prctl(PR_WORK_TICK);

which applications could call in each basic work unit they are
performing. Sysbench would call it for every transaction completed,
sched-pipe would call it for every pipe message sent, hackbench would
call it for messages, etc. etc.

This is a minimal application level change, but gives *huge*
information to the scheduler: we could balance tasks to maximize their
observed work rate.

The scheduler could also do other things, like observe the wakeup/sleep
patterns within a 'work atom', observe execution overlap between work
atoms and place tasks accordingly, etc. etc.

Today we approximate work atoms by saying that scheduling atoms == work
atoms. But that approximation breaks down in a number of important
cases.

If we had such a design we'd be able to fix pretty much everything,
without the catch-22 problems we are facing normally.

An added bonus would be increased instrumentation: we could trace,
time, profile work atom rates and could collect work atom profiles.
We'd see work atom execution histograms, etc. etc. - stuff that is
simply not possible today without extensive application-dependent
instrumentation.

We could also use utrace scripts to define work atoms without modifying
the application: for many applications we know which particular
function call means that a basic work unit was completed.

I have actually written the prctl() approach before, for
instrumentation purposes, and it does wonders to system analysis.

Any objections?

Thanks,

	Ingo
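As an illustration of how small the application-side change would be, a
hedged sketch follows. PR_WORK_TICK and its request number are
hypothetical; no such prctl exists, so this is only what adoption might
look like.

	#include <sys/prctl.h>

	/* Hypothetical request number -- not a real prctl option. */
	#define PR_WORK_TICK	41

	static void transaction_completed(void)
	{
		/* Report one basic work unit to the scheduler. */
		prctl(PR_WORK_TICK, 0, 0, 0, 0);
	}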
Re: [PATCH] sched: wakeup buddy
On 03/08/2013 04:26 PM, Mike Galbraith wrote:
> On Fri, 2013-03-08 at 15:30 +0800, Michael Wang wrote:
>> On 03/08/2013 02:44 PM, Mike Galbraith wrote:
>
>>> In general, I think things would work better if we'd just rate limit
>>> how frequently we can wakeup migrate each individual task.
>>
>> Isn't the wakeup buddy already limiting the rate? And by turning the
>> knob, we could change the rate on demand.
>
> I was referring to the existing kernel, not as altered.
>
>>> We want jabbering tasks to share L3, but we don't really want to
>>> trash L2 at an awesome rate.
>>
>> I don't get it... it's a task which has slept for some time; unless
>> there is no task running on prev_cpu while it's sleeping, whatever
>> the new cpu is, we will trash L2, won't we?
>
> I'm thinking if you wake it to its old home after a microscopic sleep,
> it has a good chance of evicting the current resident, rescuing its L2.
> If tasks which do microscopic sleep can't move around at a high rate,
> they'll poke holes in fewer preempt victims. If they're _really_ fast
> switchers, always wake affine. They can't hurt much, they don't do much
> other than schedule off.

I get your point, it's about the task which sleeps frequently for a
very short time; you are right, keeping it around prev_cpu could gain
some benefit.

There are still many factors that need to be considered. For example,
if the cpus around prev_cpu are busier than those around curr_cpu, then
pulling from prev to curr is still likely to be a good choice... for
the sake of the future.

Also, for sure, that depends on what workload is in the system, and how
the tasks cooperate with each other.

IMHO, the wakeup buddy is a compromise solution for this case; before
we can figure out the correct formula (and an efficient way to collect
all the info we rely on), such a flexible optimization is not bad.

Regards,
Michael Wang

>
> -Mike
Re: [PATCH] sched: wakeup buddy
On Fri, 2013-03-08 at 15:30 +0800, Michael Wang wrote:
> On 03/08/2013 02:44 PM, Mike Galbraith wrote:
>
>> In general, I think things would work better if we'd just rate limit how
>> frequently we can wakeup migrate each individual task.
>
> Isn't the wakeup buddy already limiting the rate? And by turning the
> knob, we could change the rate on demand.

I was referring to the existing kernel, not as altered.

>> We want jabbering tasks to share L3, but we don't really want to trash
>> L2 at an awesome rate.
>
> I don't get it... it's a task which has slept for some time; unless
> there is no task running on prev_cpu while it's sleeping, then whatever
> the new cpu is, we will trash L2, won't we?

I'm thinking if you wake it to its old home after a microscopic sleep,
it has a good chance of evicting the current resident, rescuing its L2.
If tasks which do microscopic sleeps can't move around at a high rate,
they'll poke holes in fewer preempt victims. If they're _really_ fast
switchers, always wake affine. They can't hurt much, they don't do much
other than schedule off.

-Mike
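A minimal sketch of the per-task rate limit Mike is proposing, assuming
a hypothetical last_migrate_ns field in task_struct and an illustrative
1ms threshold; this is one possible shape of the idea, not code from
any actual patch:

static int wake_migrate_allowed(struct rq *rq, struct task_struct *p)
{
	/* Illustrative limit: roughly one wakeup migration per ms,
	 * on the order of the default balance interval. */
	const u64 limit_ns = 1 * NSEC_PER_MSEC;

	/* 'last_migrate_ns' is a hypothetical per-task field. */
	if (rq->clock - p->last_migrate_ns < limit_ns)
		return 0;	/* migrated too recently, keep prev_cpu */

	p->last_migrate_ns = rq->clock;
	return 1;
}

select_task_rq_fair() would then only consider pulling the wakee toward
the waker's cache domain when wake_migrate_allowed() says yes, falling
back to prev_cpu otherwise.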
Re: [PATCH] sched: wakeup buddy
On 03/08/2013 02:44 PM, Mike Galbraith wrote:
> On Fri, 2013-03-08 at 10:37 +0800, Michael Wang wrote:
>> On 03/07/2013 05:43 PM, Mike Galbraith wrote:
>>> On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
>>>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>>>
>>>>> wake_affine() stuff is trying to bind related tasks closely, but it
>>>>> doesn't work well according to the test on 'perf bench sched pipe'
>>>>> (thanks to Peter).
>>>>
>>>> so sched-pipe is a poor benchmark for this..
>>>>
>>>> Ideally we'd write a new benchmark that has some actual data footprint
>>>> and we'd measure the cost of tasks being apart on the various cache
>>>> metrics and see what affine wakeup does for it.
>>>>
>>>> Before doing something like what you're proposing, I'd have a hard look
>>>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>>>> for sync wakeups.
>>>
>>> If nobody beats me to it, I'm going to try tracking shortest round trip
>>> to idle, and use a multiple of that to shut select_idle_sibling() down.
>>> If avg_idle approaches round trip time, there's no win to be had, we're
>>> just wasting cycles.
>>
>> That's great if we have it. I'm a little doubtful whether it is possible
>> to find a better way to replace select_idle_sibling() (look at the way
>> it locates an idle cpu...) in some cases, but I'm looking forward to
>> it ;-)
>
> I'm not going to replace it, only stop it from wasting cycles when
> there's very likely nothing to gain. Save task wakeup time, then gate
> on delta rq->clock - p->last_wakeup < N*shortest_idle or some such very
> cheap metric. Wake ultra switchers L2 affine if allowed, only go
> hunting for an idle L3 if the thing is on another package.
>
> In general, I think things would work better if we'd just rate limit how
> frequently we can wakeup migrate each individual task.

Isn't the wakeup buddy already limiting the rate? And by turning the
knob, we could change the rate on demand.

> We want jabbering tasks to share L3, but we don't really want to trash
> L2 at an awesome rate.

I don't get it... it's a task which has slept for some time; unless
there is no task running on prev_cpu while it's sleeping, then whatever
the new cpu is, we will trash L2, won't we?

Regards,
Michael Wang

> -Mike
Re: [PATCH] sched: wakeup buddy
On Fri, 2013-03-08 at 10:37 +0800, Michael Wang wrote:
> On 03/07/2013 05:43 PM, Mike Galbraith wrote:
>> On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
>>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>>
>>>> wake_affine() stuff is trying to bind related tasks closely, but it
>>>> doesn't work well according to the test on 'perf bench sched pipe'
>>>> (thanks to Peter).
>>>
>>> so sched-pipe is a poor benchmark for this..
>>>
>>> Ideally we'd write a new benchmark that has some actual data footprint
>>> and we'd measure the cost of tasks being apart on the various cache
>>> metrics and see what affine wakeup does for it.
>>>
>>> Before doing something like what you're proposing, I'd have a hard look
>>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>>> for sync wakeups.
>>
>> If nobody beats me to it, I'm going to try tracking shortest round trip
>> to idle, and use a multiple of that to shut select_idle_sibling() down.
>> If avg_idle approaches round trip time, there's no win to be had, we're
>> just wasting cycles.
>
> That's great if we have it. I'm a little doubtful whether it is possible
> to find a better way to replace select_idle_sibling() (look at the way
> it locates an idle cpu...) in some cases, but I'm looking forward to
> it ;-)

I'm not going to replace it, only stop it from wasting cycles when
there's very likely nothing to gain. Save task wakeup time, then gate
on delta rq->clock - p->last_wakeup < N*shortest_idle or some such very
cheap metric. Wake ultra switchers L2 affine if allowed, only go
hunting for an idle L3 if the thing is on another package.

In general, I think things would work better if we'd just rate limit how
frequently we can wakeup migrate each individual task. We want jabbering
tasks to share L3, but we don't really want to trash L2 at an awesome
rate.

-Mike
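A hedged sketch of that cheap gate; last_wakeup and shortest_idle are
hypothetical fields that would have to be maintained elsewhere, and N
is an illustrative multiplier, not a value from any patch:

static int worth_idle_search(struct rq *rq, struct task_struct *p)
{
	const u64 N = 4;	/* illustrative multiplier */

	/*
	 * A task that ran this recently is an ultra switcher: hunting
	 * for an idle cpu would cost more than it could save, so keep
	 * it where its cache is warm.
	 */
	return rq->clock - p->last_wakeup >= N * rq->shortest_idle;
}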
Re: [PATCH] sched: wakeup buddy
On 03/08/2013 01:27 AM, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>> @@ -3351,7 +3420,13 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>> 	}
>>
>> 	if (affine_sd) {
>> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> +		/*
>> +		 * If current and p are wakeup related, and balance is
>> +		 * guaranteed, we will try to make them running closely
>> +		 * to gain cache benefit.
>> +		 */
>> +		if (cpu != prev_cpu && wakeup_related(p) &&
>> +					wake_affine(affine_sd, p, sync))
>> 			prev_cpu = cpu;
>
> OK, so there are two issues I have with all this:
>
>  - it completely wrecks task placement for things like interrupts (sadly
>    I don't have a good idea about a benchmark where this matters).

I don't get this point... could you please give more details?

>  - yet another random number.. :/

Correct... well, but that also means flexibility. I suppose different
systems and workloads will need some tuning of this knob to gain more
benefit; by default, they will gain some benefit, small or big.

> Also, I'm starting to dislike the buddy name; it's somewhat over-used.

I have to agree :), any suggestions?

Regards,
Michael Wang
Re: [PATCH] sched: wakeup buddy
On 03/07/2013 05:43 PM, Mike Galbraith wrote:
> On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>
>>> wake_affine() stuff is trying to bind related tasks closely, but it
>>> doesn't work well according to the test on 'perf bench sched pipe'
>>> (thanks to Peter).
>>
>> so sched-pipe is a poor benchmark for this..
>>
>> Ideally we'd write a new benchmark that has some actual data footprint
>> and we'd measure the cost of tasks being apart on the various cache
>> metrics and see what affine wakeup does for it.
>>
>> Before doing something like what you're proposing, I'd have a hard look
>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>> for sync wakeups.
>
> If nobody beats me to it, I'm going to try tracking shortest round trip
> to idle, and use a multiple of that to shut select_idle_sibling() down.
> If avg_idle approaches round trip time, there's no win to be had, we're
> just wasting cycles.

That's great if we have it. I'm a little doubtful whether it is possible
to find a better way to replace select_idle_sibling() (look at the way
it locates an idle cpu...) in some cases, but I'm looking forward to
it ;-)

Regards,
Michael Wang

> -Mike
Re: [PATCH] sched: wakeup buddy
On 03/08/2013 01:21 AM, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>> +static inline int wakeup_related(struct task_struct *p)
>> +{
>> +	if (wakeup_buddy(p, current)) {
>> +		/*
>> +		 * Now check whether current still focus on his buddy.
>> +		 */
>> +		if (wakeup_buddy(current, p))
>> +			return 1;
>> +	}
>> +
>> +	return 0;
>> +}
>
> Not commenting on the thing in general, but:
>
> static inline bool wakeup_related(struct task_struct *p)
> {
> 	return wakeup_buddy(p, current) && wakeup_buddy(current, p);
> }
>
> is far shorter and easier to read :-)

Right, I will correct it :)

Regards,
Michael Wang
Re: [PATCH] sched: wakeup buddy
On 03/08/2013 12:52 AM, Peter Zijlstra wrote:
> On Thu, 2013-03-07 at 17:46 +0800, Michael Wang wrote:
>> On 03/07/2013 04:36 PM, Peter Zijlstra wrote:
>>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>>
>>>> wake_affine() stuff is trying to bind related tasks closely, but it
>>>> doesn't work well according to the test on 'perf bench sched pipe'
>>>> (thanks to Peter).
>>>
>>> so sched-pipe is a poor benchmark for this..
>>>
>>> Ideally we'd write a new benchmark that has some actual data footprint
>>> and we'd measure the cost of tasks being apart on the various cache
>>> metrics and see what affine wakeup does for it.
>>
>> I think sched-pipe is still somewhat capable,
>
> Yeah, it's not entirely crap for this, but it's not ideal either. The
> very big difference you see between it running on a single cpu and on
> say two threads of a single core is mostly due to preemption
> 'artifacts' though. Not because of cache.
>
> So we have 2 tasks -- let's call them A and B -- involved in a single
> word ping-pong. So we're both doing write(); read(); loops. Now what
> happens on a single cpu is that A's write()->wakeup() of B makes B
> preempt A before A hits read() and blocks. This in turn ensures that
> B's write()->wakeup() of A finds an already running A and doesn't
> actually need to do the full (and expensive) wakeup thing (and vice
> versa).

Exactly. I used to think that making them run on the same cpu only
gains benefit when one is going to sleep, but you are right: in that
case, they get latency and performance at the same time, amazing ;-)

One concern in my mind is that this cooperation is somewhat fragile; if
another task C joins the fight on that cpu, it will be broken. But if we
have a way to detect that in select_task_rq_fair(), we will be able to
win the gamble.

> So by constantly preempting one another they avoid the expensive bit of
> going to sleep and waking up again.

Amazing point :)

> wake_affine() OTOH still has a (supposed) benefit if it gets the tasks
> running 'closer' (in a cache hierarchy sense) since then the data
> sharing is less expensive.
>
>> the problem is that the
>> select_idle_sibling() doesn't take care of the wakeup-related case, it
>> doesn't contain the logic to locate an idle cpu closely.
>
> I'm not entirely sure if I understand what you mean, do you mean to say
> its idea of 'closely' is not quite correct? If so, I tend to agree, see
> further down.
>
>> So even if we detect the relationship successfully, select_idle_sibling()
>> can only help to make sure the target cpu won't be outside of the
>> current package; it's a package-level bind, not mc or smp level.
>
> That is the entire point of select_idle_sibling(), selecting a cpu
> 'near' the target cpu that is currently idle.
>
> Not too long ago we had a bit of a discussion on the unholy mess that
> is select_idle_sibling() and if it actually does the right thing.
> Arguably it doesn't for machines that have an effective L2 cache. The
> issue is that the arch<->sched interface only knows about
> last-level-cache (L3 on anything modern) and SMT.

Yeah, that's the point I'm concerned about: we make sure that the cpu
returned by select_idle_sibling() at least shares the L3 cache with the
target, but there is a chance to miss a better candidate which shares
L2 with the target.

> Expanding the topology description in a way that makes sense (and
> doesn't make it a bigger mess) is somewhere on the todo-list.
>
>>> Before doing something like what you're proposing, I'd have a hard look
>>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>>> for sync wakeups.
>>
>> The patch is supposed to stop using wake_affine() blindly, not improve
>> the wake_affine() stuff itself; the whole stuff still works, but since
>> we rely on select_idle_sibling() to make the choice, the benefit is not
>> so significant, especially on my one-node box...
>
> OK, I'll have to go read the actual patch for that, I'll get back to
> you on that :-)
>
>>> The idea behind sync wakeups is that we try and detect the case where
>>> we wake up one task only to go to sleep ourselves and try and avoid
>>> the regular ping-pong this would otherwise create on account of the
>>> waking task still being alive and so the current cpu isn't actually
>>> idle yet but we know it's going to be idle soon.
>>
>> Are you suggesting that we should handle the wakeup-related case
>> separately, not just pass the current cpu to select_idle_sibling()?
>
> Depends a bit on what you're trying to fix, so far I'm just trying to
> write down what I remember about stuff and reacting to half-read
> changelogs ;-)

I see, and I get some good points here for me to think deeper about,
that's great ;-)

Regards,
Michael Wang
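Peter's description maps onto a few lines of userspace. Below is a
minimal stand-alone sketch of the single-word ping-pong, in the spirit
of (but not copied from) 'perf bench sched pipe'; run it pinned to one
cpu (taskset -c 0) versus two siblings (taskset -c 0,1) to observe the
preemption effect he describes:

/*
 * Minimal single-word pipe ping-pong; an illustration only, not the
 * actual perf source.
 */
#include <unistd.h>
#include <sys/wait.h>

#define LOOPS 1000000

int main(void)
{
	int ab[2], ba[2], word = 0, i;

	if (pipe(ab) || pipe(ba))
		return 1;

	if (fork() == 0) {
		/* Task B: read a word, echo it straight back. */
		for (i = 0; i < LOOPS; i++) {
			if (read(ab[0], &word, sizeof(word)) != sizeof(word))
				return 1;
			if (write(ba[1], &word, sizeof(word)) != sizeof(word))
				return 1;
		}
		return 0;
	}

	/* Task A: the write(); read(); loop described above. */
	for (i = 0; i < LOOPS; i++) {
		if (write(ab[1], &word, sizeof(word)) != sizeof(word))
			return 1;
		if (read(ba[0], &word, sizeof(word)) != sizeof(word))
			return 1;
	}
	wait(NULL);
	return 0;
}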
Re: [PATCH] sched: wakeup buddy
On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> @@ -3351,7 +3420,13 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> 	}
>
> 	if (affine_sd) {
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		/*
> +		 * If current and p are wakeup related, and balance is
> +		 * guaranteed, we will try to make them running closely
> +		 * to gain cache benefit.
> +		 */
> +		if (cpu != prev_cpu && wakeup_related(p) &&
> +					wake_affine(affine_sd, p, sync))
> 			prev_cpu = cpu;

OK, so there are two issues I have with all this:

 - it completely wrecks task placement for things like interrupts (sadly
   I don't have a good idea about a benchmark where this matters).

 - yet another random number.. :/

Also, I'm starting to dislike the buddy name; it's somewhat over-used.
Re: [PATCH] sched: wakeup buddy
On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> +static inline int wakeup_related(struct task_struct *p)
> +{
> +	if (wakeup_buddy(p, current)) {
> +		/*
> +		 * Now check whether current still focus on his buddy.
> +		 */
> +		if (wakeup_buddy(current, p))
> +			return 1;
> +	}
> +
> +	return 0;
> +}

Not commenting on the thing in general, but:

static inline bool wakeup_related(struct task_struct *p)
{
	return wakeup_buddy(p, current) && wakeup_buddy(current, p);
}

is far shorter and easier to read :-)
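wakeup_buddy() itself is not quoted in this message; purely as a
reading aid, here is a hedged sketch of one plausible shape for it,
with hypothetical wakee/wakee_ref fields and a threshold knob -- the
actual patch may well differ:

/*
 * Hedged sketch only, NOT the code from the actual patch: 'wakee',
 * 'wakee_ref' and 'buddy_ref' are all illustrative names.
 */
static inline int wakeup_buddy(struct task_struct *p, struct task_struct *other)
{
	/*
	 * 'p' treats 'other' as its buddy once it has woken it (or
	 * been woken by it) 'buddy_ref' times in a row.
	 */
	return p->wakee == other && p->wakee_ref >= buddy_ref;
}

where buddy_ref would be the tunable knob discussed elsewhere in the
thread.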
Re: [PATCH] sched: wakeup buddy
On Thu, 2013-03-07 at 17:46 +0800, Michael Wang wrote:
> On 03/07/2013 04:36 PM, Peter Zijlstra wrote:
>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>
>>> wake_affine() stuff is trying to bind related tasks closely, but it
>>> doesn't work well according to the test on 'perf bench sched pipe'
>>> (thanks to Peter).
>>
>> so sched-pipe is a poor benchmark for this..
>>
>> Ideally we'd write a new benchmark that has some actual data footprint
>> and we'd measure the cost of tasks being apart on the various cache
>> metrics and see what affine wakeup does for it.
>
> I think sched-pipe is still somewhat capable,

Yeah, it's not entirely crap for this, but it's not ideal either. The
very big difference you see between it running on a single cpu and on
say two threads of a single core is mostly due to preemption
'artifacts' though. Not because of cache.

So we have 2 tasks -- let's call them A and B -- involved in a single
word ping-pong. So we're both doing write(); read(); loops. Now what
happens on a single cpu is that A's write()->wakeup() of B makes B
preempt A before A hits read() and blocks. This in turn ensures that
B's write()->wakeup() of A finds an already running A and doesn't
actually need to do the full (and expensive) wakeup thing (and vice
versa).

So by constantly preempting one another they avoid the expensive bit of
going to sleep and waking up again.

wake_affine() OTOH still has a (supposed) benefit if it gets the tasks
running 'closer' (in a cache hierarchy sense) since then the data
sharing is less expensive.

> the problem is that the
> select_idle_sibling() doesn't take care of the wakeup-related case, it
> doesn't contain the logic to locate an idle cpu closely.

I'm not entirely sure if I understand what you mean, do you mean to say
its idea of 'closely' is not quite correct? If so, I tend to agree, see
further down.

> So even if we detect the relationship successfully, select_idle_sibling()
> can only help to make sure the target cpu won't be outside of the
> current package; it's a package-level bind, not mc or smp level.

That is the entire point of select_idle_sibling(), selecting a cpu
'near' the target cpu that is currently idle.

Not too long ago we had a bit of a discussion on the unholy mess that
is select_idle_sibling() and if it actually does the right thing.
Arguably it doesn't for machines that have an effective L2 cache. The
issue is that the arch<->sched interface only knows about
last-level-cache (L3 on anything modern) and SMT.

Expanding the topology description in a way that makes sense (and
doesn't make it a bigger mess) is somewhere on the todo-list.

>> Before doing something like what you're proposing, I'd have a hard look
>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>> for sync wakeups.
>
> The patch is supposed to stop using wake_affine() blindly, not improve
> the wake_affine() stuff itself; the whole stuff still works, but since
> we rely on select_idle_sibling() to make the choice, the benefit is not
> so significant, especially on my one-node box...

OK, I'll have to go read the actual patch for that, I'll get back to
you on that :-)

>> The idea behind sync wakeups is that we try and detect the case where
>> we wake up one task only to go to sleep ourselves and try and avoid
>> the regular ping-pong this would otherwise create on account of the
>> waking task still being alive and so the current cpu isn't actually
>> idle yet but we know it's going to be idle soon.
>
> Are you suggesting that we should handle the wakeup-related case
> separately, not just pass the current cpu to select_idle_sibling()?

Depends a bit on what you're trying to fix, so far I'm just trying to
write down what I remember about stuff and reacting to half-read
changelogs ;-)
Re: [PATCH] sched: wakeup buddy
Hi, Peter

Thanks for your reply.

On 03/07/2013 04:36 PM, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>
>> wake_affine() stuff is trying to bind related tasks closely, but it
>> doesn't work well according to the test on 'perf bench sched pipe'
>> (thanks to Peter).
>
> so sched-pipe is a poor benchmark for this..
>
> Ideally we'd write a new benchmark that has some actual data footprint
> and we'd measure the cost of tasks being apart on the various cache
> metrics and see what affine wakeup does for it.

I think sched-pipe is still somewhat capable; the problem is that
select_idle_sibling() doesn't take care of the wakeup-related case, it
doesn't contain the logic to locate an idle cpu closely.

So even if we detect the relationship successfully, select_idle_sibling()
can only help to make sure the target cpu won't be outside of the
current package; it's a package-level bind, not mc or smp level.

> Before doing something like what you're proposing, I'd have a hard look
> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
> for sync wakeups.

The patch is supposed to stop using wake_affine() blindly, not improve
the wake_affine() stuff itself; the whole stuff still works, but since
we rely on select_idle_sibling() to make the choice, the benefit is not
so significant, especially on my one-node box...

> The idea behind sync wakeups is that we try and detect the case where
> we wake up one task only to go to sleep ourselves and try and avoid
> the regular ping-pong this would otherwise create on account of the
> waking task still being alive and so the current cpu isn't actually
> idle yet but we know it's going to be idle soon.

Are you suggesting that we should handle the wakeup-related case
separately, not just pass the current cpu to select_idle_sibling()?

Regards,
Michael Wang
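For context on the 'package-level bind' point, an abbreviated,
from-memory sketch of what select_idle_sibling() looked like around
v3.8 (affinity-mask details elided, so treat it as approximate); note
the walk starts at the last-level-cache domain and never leaves it:

/* Abbreviated from memory, v3.8-era; approximate, details elided. */
static int select_idle_sibling(struct task_struct *p, int target)
{
	struct sched_domain *sd;
	struct sched_group *sg;
	int i;

	if (idle_cpu(target))
		return target;

	/* Walk down the LLC domain looking for a fully idle group. */
	sd = rcu_dereference(per_cpu(sd_llc, target));
	for_each_lower_domain(sd) {
		sg = sd->groups;
		do {
			for_each_cpu(i, sched_group_cpus(sg)) {
				if (!idle_cpu(i))
					goto next;
			}
			return cpumask_first(sched_group_cpus(sg));
next:
			sg = sg->next;
		} while (sg != sd->groups);
	}
	return target;
}

Any idle cpu it returns shares L3 with 'target', but nothing steers the
choice toward an L2 sibling first -- which is exactly the miss described
above.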
Re: [PATCH] sched: wakeup buddy
On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>
>> wake_affine() stuff is trying to bind related tasks closely, but it
>> doesn't work well according to the test on 'perf bench sched pipe'
>> (thanks to Peter).
>
> so sched-pipe is a poor benchmark for this..
>
> Ideally we'd write a new benchmark that has some actual data footprint
> and we'd measure the cost of tasks being apart on the various cache
> metrics and see what affine wakeup does for it.
>
> Before doing something like what you're proposing, I'd have a hard look
> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
> for sync wakeups.

If nobody beats me to it, I'm going to try tracking shortest round trip
to idle, and use a multiple of that to shut select_idle_sibling() down.
If avg_idle approaches round trip time, there's no win to be had, we're
just wasting cycles.

-Mike
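A hedged sketch of the gate Mike describes, not kernel code: rq->avg_idle
is a real field, but shortest_round_trip is a hypothetical one that his
experiment would have to track:

static int idle_search_is_futile(struct rq *this_rq)
{
	const u64 N = 2;	/* illustrative multiplier */

	/*
	 * If cpus stay idle barely longer than a wakeup round trip,
	 * any idle cpu found will be busy again before the migrated
	 * task can profit, so skip select_idle_sibling() entirely.
	 */
	return this_rq->avg_idle < N * this_rq->shortest_round_trip;
}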
Re: [PATCH] sched: wakeup buddy
On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> wake_affine() stuff is trying to bind related tasks closely, but it
> doesn't work well according to the test on 'perf bench sched pipe'
> (thanks to Peter).

so sched-pipe is a poor benchmark for this..

Ideally we'd write a new benchmark that has some actual data footprint
and we'd measure the cost of tasks being apart on the various cache
metrics and see what affine wakeup does for it.

Before doing something like what you're proposing, I'd have a hard look
at WF_SYNC, it is possible we should disable/fix select_idle_sibling
for sync wakeups.

The idea behind sync wakeups is that we try and detect the case where
we wake up one task only to go to sleep ourselves and try and avoid
the regular ping-pong this would otherwise create on account of the
waking task still being alive and so the current cpu isn't actually
idle yet but we know it's going to be idle soon.
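To make the sync idea concrete, a hedged illustration of how a WF_SYNC
hint can be honoured; cpu_load() and task_load() stand in for the real
load-tracking machinery, and this shows the concept only, not the
kernel's actual wake_affine():

static int sync_wake_affine(struct task_struct *p, int this_cpu, int sync)
{
	unsigned long this_load = cpu_load(this_cpu);

	/*
	 * A sync waker promises to sleep right after the wakeup, so
	 * discount its own load when judging whether this_cpu can
	 * take the wakee.
	 */
	if (sync)
		this_load -= task_load(current);

	return this_load <= cpu_load(task_cpu(p));
}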