Re: [PATCH] sched: wakeup buddy
[snip]
>> It could bring the same benefit but at lower overhead, what's the point
>> of computing the same value over and over again? Also, the rate limit
>> thing naturally works for the soft/hard-irq case.
>
> Just to confirm my understanding, we are going to do something like:
>
>	if (now - wakee->last > time_limit && wakeup_affine())
>		wakee->last = now
>		select_idle_sibling(curr_cpu)
>	else
>		select_idle_sibling(prev_cpu)
>
> And time_limit is some static value with respect to the rate of load
> balance, is that correct?
>
> Currently I haven't found a regression from reducing the rate, but if
> we find such a benchmark, we may still need a way (knob or CONFIG) to
> disable this limitation.

I've done some quick tests of this proposal on my 12-cpu box with the
pgbench 32-client test: with a 1000ms time_limit the benefit is just
like the 8-ref wakeup buddy, with a 10ms time_limit the benefit drops
by half, and with a 1ms time_limit the benefit is less than 10%.

	              tps
	original      43404
	wakeup-buddy  63024   +45.20%
	1s-limit      62359   +43.67%
	100ms-limit   57547   +32.58%
	10ms-limit    52258   +20.40%
	1ms-limit     46535    +7.21%

The other pgbench test items show corresponding results, and the other
benchmarks remain insensitive to the changes.

I'm planning to make a new patch for this approach later, in which
time_limit will be a knob with a default value of 1ms (usually the
initial value of balance_interval and the value of min_interval), based
on the latest tip tree.

Regards,
Michael Wang
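For reference, a minimal C sketch of the throttle logic under discussion
could look like the following. It is an illustration only, not the
actual patch: the wakee->last field, the time_limit parameter and the
wake_affine()/select_idle_sibling() signatures are simplified
assumptions.

	/*
	 * Sketch: rate-limit the wake_affine() decision per wakee.
	 * Only when the last successful affine pull is older than
	 * time_limit do we pay for wake_affine() again; on success the
	 * wakee is pulled toward the waker's cpu, otherwise it stays
	 * near its previous cpu.
	 */
	static int select_wake_target(struct task_struct *wakee,
				      int curr_cpu, int prev_cpu,
				      u64 now, u64 time_limit)
	{
		if (now - wakee->last > time_limit && wake_affine(wakee)) {
			wakee->last = now;
			return select_idle_sibling(wakee, curr_cpu);
		}

		return select_idle_sibling(wakee, prev_cpu);
	}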
Re: [PATCH] sched: wakeup buddy
On 03/14/2013 06:58 PM, Peter Zijlstra wrote:
> On Wed, 2013-03-13 at 11:07 +0800, Michael Wang wrote:
>
>> However, we have already figured out the logic that wakeup-related
>> tasks could benefit from running closely, which promises us a
>> somewhat reliable benefit.
>
> I'm not convinced that the 2 task wakeup scenario is the only sane
> scenario. Imagine a ring of 3 tasks that wake each other, if the
> machine is loaded enough, those 3 tasks might fit a single cpu just
> fine -- esp. if one of those is always idle.
>
> But your strict 1:1 wakeup relation thing will completely fail to
> detect this.

Hmm... yeah, I see your point here: some optimization timing will
always be missed however we twiddle the knob, unless it is 0.

You are right, that's a problem; although currently there is no
workload to prove it, we have the theory...

>
>> IMHO, that sounds a little easier for users than making them decide
>> "tell me how long to pull tasks together"; they may be confused...
>
> Users shouldn't ever need/want/etc.. rely on this. They should just run
> their programs and be (reasonably) happy.
>
>> In summary, I think we have two points here that need to be considered:
>>
>> 1. what about the missed optimization timing, which may benefit
>>    some workload (although we haven't found such a workload yet).
>
> Missed optimization; as in not calling wake_affine() due to the
> throttle? If we're calling it at such a high frequency it is very
> likely the next call isn't very far away.
>
>> 2. how much benefit could the wake_affine() stuff bring us;
>>    if limiting its rate benefits us, why don't we remove it?
>
> It could bring the same benefit but at lower overhead, what's the point
> of computing the same value over and over again? Also, the rate limit
> thing naturally works for the soft/hard-irq case.

Just to confirm my understanding, we are going to do something like:

	if (now - wakee->last > time_limit && wakeup_affine())
		wakee->last = now
		select_idle_sibling(curr_cpu)
	else
		select_idle_sibling(prev_cpu)

And time_limit is some static value with respect to the rate of load
balance, is that correct?

Currently I haven't found a regression from reducing the rate, but if
we find such a benchmark, we may still need a way (knob or CONFIG) to
disable this limitation.

>
> Now, there's another detail I thought up, one could only limit the
> wake_affine() calls once it starts returning false.

Hmm... if wake_affine() keeps succeeding, then there will be no
difference? I do believe pgbench matches that case, since the
wake_affine() stuff makes it suffer... and the more it suffers, the
more often wake_affine() succeeds and pulls unrelated tasks together.

I really can't see how that could help... did I miss something?

Regards,
Michael Wang
Re: [PATCH] sched: wakeup buddy
On Wed, 2013-03-13 at 11:07 +0800, Michael Wang wrote:
> However, we have already figured out the logic that wakeup-related
> tasks could benefit from running closely, which promises us a
> somewhat reliable benefit.

I'm not convinced that the 2 task wakeup scenario is the only sane
scenario. Imagine a ring of 3 tasks that wake each other, if the
machine is loaded enough, those 3 tasks might fit a single cpu just
fine -- esp. if one of those is always idle.

But your strict 1:1 wakeup relation thing will completely fail to
detect this.

> IMHO, that sounds a little easier for users than making them decide
> "tell me how long to pull tasks together"; they may be confused...

Users shouldn't ever need/want/etc.. rely on this. They should just run
their programs and be (reasonably) happy.

> In summary, I think we have two points here that need to be considered:
>
> 1. what about the missed optimization timing, which may benefit
>    some workload (although we haven't found such a workload yet).

Missed optimization; as in not calling wake_affine() due to the
throttle? If we're calling it at such a high frequency it is very
likely the next call isn't very far away.

> 2. how much benefit could the wake_affine() stuff bring us;
>    if limiting its rate benefits us, why don't we remove it?

It could bring the same benefit but at lower overhead, what's the point
of computing the same value over and over again? Also, the rate limit
thing naturally works for the soft/hard-irq case.

Now, there's another detail I thought up, one could only limit the
wake_affine() calls once it starts returning false.
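A sketch of that last variant, under the same simplifying assumptions
as before (the affine_failed flag is hypothetical, purely for
illustration):

	/*
	 * Sketch: engage the throttle only after wake_affine() has
	 * started returning false; while it keeps succeeding, behave
	 * exactly as today.
	 */
	static int want_wake_affine(struct task_struct *wakee,
				    u64 now, u64 time_limit)
	{
		if (wakee->affine_failed && now - wakee->last < time_limit)
			return 0;	/* throttled, keep prev_cpu */

		if (wake_affine(wakee)) {
			wakee->affine_failed = 0;
			return 1;
		}

		wakee->affine_failed = 1;
		wakee->last = now;
		return 0;
	}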
Re: [PATCH] sched: wakeup buddy
On 03/12/2013 06:08 PM, Peter Zijlstra wrote:
> Does something simple like a per-task throttle of wake_affine() gain
> similar benefits? Say something like only do wake_affine() once every
> 10 ms or so (counting on the wakee, not waker).
>
> The rationale being that wake_affine() relies on load-balance
> statistics that aren't updated that much faster, so calling it more
> often than that seems pointless.

That could work; if we do not have any logic to rely on and just want
to limit the rate of pulls, this could be a good solution.

However, we have already figured out the logic that wakeup-related
tasks could benefit from running closely, which promises us a somewhat
reliable benefit.

It's just like: tell me how many times two tasks continuously waking
each other up means they are related, and I will try to pull them
together, since they have a big chance of benefiting from cache.

IMHO, that sounds a little easier for users than making them decide
"tell me how long to pull tasks together"; they may be confused...

In summary, I think we have two points here that need to be considered:

1. what about the missed optimization timing, which may benefit
   some workload (although we haven't found such a workload yet).

2. how much benefit could the wake_affine() stuff bring us;
   if limiting its rate benefits us, why don't we remove it?

Point 1 could be solved by disabling wakeup buddy with a 0 limitation,
and when users complain about their database performance, we just say
"Calm down and take a try on this knob ;-)".

Point 2 is about the wake_affine() stuff itself. Later we could try to
make the stuff better, but I have the feeling that some key info is not
there (maybe we need Ingo's work-atom idea here, just wondering...).
Whatever the case, I think we still need a knob in the end, since this
doesn't sound like a general optimization that benefits all cases.

And I don't agree with removing the stuff, since we have many theories
that it could benefit us; but until it really shows the benefit in all
cases, providing a way to keep it quiet sounds necessary...

Regards,
Michael Wang

>
> Something like that should take a lot less lines to implement.
>
> Just wondering..
Re: [PATCH] sched: wakeup buddy
Does something simple like a per-task throttle of wake_affine() gain
similar benefits? Say something like only do wake_affine() once every
10 ms or so (counting on the wakee, not waker).

The rationale being that wake_affine() relies on load-balance
statistics that aren't updated that much faster, so calling it more
often than that seems pointless.

Something like that should take a lot less lines to implement.

Just wondering..
Re: [PATCH] sched: wakeup buddy
On 03/12/2013 04:48 PM, Ingo Molnar wrote:
[snip]
>>> Instrumentation/stats/profiling will also double check the
>>> correctness of this data: if developers/users start relying on the
>>> work metric as a substitute benchmark number, then app writers will
>>> have an additional incentive to make them correct.
>>
>> I see. I could not figure out how to use the info wisely at the
>> moment, but I have the feeling that it will make the scheduler very
>> different ;-)
>>
>> Maybe we could implement the API and get that info ready first (along
>> with a new sched-pipe which provides work-tick info), then think about
>> the way to use it in the scheduler; are there any patches on the way?
>
> Absolutely.
>
> Beyond the new prctl no new API is needed: a perf soft event could be
> added, and/or a tracepoint. Then perf stat and perf record could be
> used with it. 'perf bench' could be extended to generate the work tick
> in its 'perf bench sched ...' workloads - and for 'perf bench mem numa'
> as well.

Nice :)

>
> vsyscall-accelerating it could be a separate, more complex step: it
> needs a per thread writable vsyscall data area to make the overhead to
> applications near zero. Performance critical apps won't call an extra
> syscall.

If it really brings a benefit, I think they will consider it; whatever
the case, that's the developers'/users' decision, and what we need to
do is just make the stuff attractive.

Regards,
Michael Wang

>
> Thanks,
>
> Ingo
Re: [PATCH] sched: wakeup buddy
* Michael Wang wrote:

> On 03/11/2013 05:40 PM, Ingo Molnar wrote:
>>
>> * Michael Wang wrote:
>>
>>> Hi, Ingo
>>>
>>> On 03/11/2013 04:21 PM, Ingo Molnar wrote:
>>> [snip]
>>>>
>>>> I have actually written the prctl() approach before, for
>>>> instrumentation purposes, and it does wonders to system analysis.
>>>
>>> The idea sounds great, we could get a lot of new info to implement a
>>> smarter scheduler, that's amazing :)
>>>
>>>>
>>>> Any objections?
>>>
>>> Just one concern, maybe I have misunderstood you, but will it cause
>>> trouble if prctl() is used indiscriminately by some applications?
>>> Will we get fake data?
>>
>> It's their problem: overusing it will increase their CPU overhead. The
>> two boundary worst-cases are that they either call it too frequently
>> or too rarely:
>>
>> - too frequently: it approximates the current cpu-runtime work metric
>>
>> - too infrequently: we just ignore it and fall back to a runtime
>>   metric if it does not change.
>>
>> It's not like it can be used to get preferential treatment - we don't
>> ever balance other tasks against these tasks based on work throughput,
>> we try to maximize this workload's work throughput.
>>
>> What could happen is if an app is 'optimized' for a buggy scheduler by
>> changing the work metric frequency. We offer no guarantee - apps will
>> be best off (and users will be least annoyed) if apps honestly report
>> their work metric.
>>
>> Instrumentation/stats/profiling will also double check the
>> correctness of this data: if developers/users start relying on the
>> work metric as a substitute benchmark number, then app writers will
>> have an additional incentive to make them correct.
>
> I see. I could not figure out how to use the info wisely at the
> moment, but I have the feeling that it will make the scheduler very
> different ;-)
>
> Maybe we could implement the API and get that info ready first (along
> with a new sched-pipe which provides work-tick info), then think about
> the way to use it in the scheduler; are there any patches on the way?

Absolutely.

Beyond the new prctl no new API is needed: a perf soft event could be
added, and/or a tracepoint. Then perf stat and perf record could be
used with it. 'perf bench' could be extended to generate the work tick
in its 'perf bench sched ...' workloads - and for 'perf bench mem numa'
as well.

vsyscall-accelerating it could be a separate, more complex step: it
needs a per thread writable vsyscall data area to make the overhead to
applications near zero. Performance critical apps won't call an extra
syscall.

Thanks,

	Ingo
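As an aside, a hedged sketch of what such a tracepoint definition could
look like; the sched_work_tick event name is made up and nothing like
it exists upstream:

	TRACE_EVENT(sched_work_tick,

		TP_PROTO(struct task_struct *p),

		TP_ARGS(p),

		TP_STRUCT__entry(
			__array(char,	comm,	TASK_COMM_LEN)
			__field(pid_t,	pid)
		),

		TP_fast_assign(
			/* Record who completed a work unit. */
			memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
			__entry->pid = p->pid;
		),

		TP_printk("comm=%s pid=%d", __entry->comm, __entry->pid)
	);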
Re: [PATCH] sched: wakeup buddy
On 03/11/2013 05:40 PM, Ingo Molnar wrote:
>
> * Michael Wang wrote:
>
>> Hi, Ingo
>>
>> On 03/11/2013 04:21 PM, Ingo Molnar wrote:
>> [snip]
>>>
>>> I have actually written the prctl() approach before, for
>>> instrumentation purposes, and it does wonders to system analysis.
>>
>> The idea sounds great, we could get a lot of new info to implement a
>> smarter scheduler, that's amazing :)
>>
>>>
>>> Any objections?
>>
>> Just one concern, maybe I have misunderstood you, but will it cause
>> trouble if prctl() is used indiscriminately by some applications?
>> Will we get fake data?
>
> It's their problem: overusing it will increase their CPU overhead. The
> two boundary worst-cases are that they either call it too frequently
> or too rarely:
>
> - too frequently: it approximates the current cpu-runtime work metric
>
> - too infrequently: we just ignore it and fall back to a runtime
>   metric if it does not change.
>
> It's not like it can be used to get preferential treatment - we don't
> ever balance other tasks against these tasks based on work throughput,
> we try to maximize this workload's work throughput.
>
> What could happen is if an app is 'optimized' for a buggy scheduler by
> changing the work metric frequency. We offer no guarantee - apps will
> be best off (and users will be least annoyed) if apps honestly report
> their work metric.
>
> Instrumentation/stats/profiling will also double check the correctness
> of this data: if developers/users start relying on the work metric as
> a substitute benchmark number, then app writers will have an
> additional incentive to make them correct.

I see. I could not figure out how to use the info wisely at the moment,
but I have the feeling that it will make the scheduler very
different ;-)

Maybe we could implement the API and get that info ready first (along
with a new sched-pipe which provides work-tick info), then think about
the way to use it in the scheduler; are there any patches on the way?

Regards,
Michael Wang

>
> Thanks,
>
> Ingo
Re: [PATCH] sched: wakeup buddy
On 03/11/2013 06:36 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-08 at 10:50 +0800, Michael Wang wrote:
>
>>> OK, so there's two issues I have with all this are:
>>>
>>> - it completely wrecks task placement for things like interrupts
>>>   (sadly I don't have a good idea about a benchmark where this
>>>   matters).
>>
>> I don't get this point... could you please give more details?
>
> Take for instance a network workload, the hardirq notifies us there's
> data buffered, then softirq does a ton of network layer demuxing and
> eventually wakes one (or more) tasks to process data. You want these
> tasks to move to the waking cpu, it has lots of this data in cache.
>
> Now, neither hard- nor softirq runs in task context (except for -rt) so
> it completely fails on what you propose.

I got it; exactly, the new feature with the current ref limit will miss
this optimization timing if the data is not related to current...

However, we are still gambling here, and some workloads suffer.

Actually this feature's purpose is to provide a way for users who want
to balance the risk and benefit on their own demand; it provides
flexibility, not restriction...

So what about making the default sysctl_sched_wakeup_buddy_ref 0 and
making everything just like the old world?

And those who want to make the decision more carefully could turn the
knob to make it bigger, and gain the benefit.

>
> We could simply add something like in_softirq() || in_irq() etc.. to
> re-enable wake_affine() for those cases unconditionally, but not sure
> that's the right thing either.
>
>>> - yet another random number.. :/
>>
>> Correct... well, but that also means flexibility; I suppose different
>> systems and workloads will need some tuning on this knob to gain more
>> benefit, but by default they will gain some benefit, small or big.
>
> Nah, it just means another knob nobody knows how to twiddle.

Right, and we already have many such knobs... since some formulas are
impossible to write down.

A mysterious knob: I twiddle it and wait for something to drop from the
sky, candy, cake or rock, whatever; there are adventures there ;-)

And I believe that once the cake drops from the sky, there will be more
adventurers, and it won't be a mysterious knob any more.

>
>>> Also, I'm starting to dislike the buddy name; its somewhat over-used.
>>
>> I have to agree :), any suggestions?
>
> Nah.. I suppose it depends a bit on the shape the final solution takes,

What about defaulting sysctl_sched_wakeup_buddy_ref = 0 and keeping the
old logic?

Regards,
Michael Wang

> but I'll think about it.
Re: [PATCH] sched: wakeup buddy
On Fri, 2013-03-08 at 10:50 +0800, Michael Wang wrote:
>> OK, so there's two issues I have with all this are:
>>
>> - it completely wrecks task placement for things like interrupts
>>   (sadly I don't have a good idea about a benchmark where this
>>   matters).
>
> I don't get this point... could you please give more details?

Take for instance a network workload, the hardirq notifies us there's
data buffered, then softirq does a ton of network layer demuxing and
eventually wakes one (or more) tasks to process data. You want these
tasks to move to the waking cpu, it has lots of this data in cache.

Now, neither hard- nor softirq runs in task context (except for -rt) so
it completely fails on what you propose.

We could simply add something like in_softirq() || in_irq() etc.. to
re-enable wake_affine() for those cases unconditionally, but not sure
that's the right thing either.

>> - yet another random number.. :/
>
> Correct... well, but that also means flexibility; I suppose different
> systems and workloads will need some tuning on this knob to gain more
> benefit, but by default they will gain some benefit, small or big.

Nah, it just means another knob nobody knows how to twiddle.

>> Also, I'm starting to dislike the buddy name; its somewhat over-used.
>
> I have to agree :), any suggestions?

Nah.. I suppose it depends a bit on the shape the final solution takes,
but I'll think about it.
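A minimal sketch of that unconditional re-enable, assuming the throttle
sits in the wakeup path; wake_affine_throttled() is a hypothetical
helper and the wake_affine() signature is simplified:

	/*
	 * Sketch: a wakeup issued from hard-/soft-irq context has the
	 * freshly demuxed data in the waking cpu's cache, so bypass
	 * the per-wakee throttle for it unconditionally.
	 */
	if (in_irq() || in_softirq())
		affine = wake_affine(p);
	else
		affine = !wake_affine_throttled(p) && wake_affine(p);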
Re: [PATCH] sched: wakeup buddy
* Michael Wang wrote:

> Hi, Ingo
>
> On 03/11/2013 04:21 PM, Ingo Molnar wrote:
> [snip]
>>
>> I have actually written the prctl() approach before, for
>> instrumentation purposes, and it does wonders to system analysis.
>
> The idea sounds great, we could get a lot of new info to implement a
> smarter scheduler, that's amazing :)
>
>>
>> Any objections?
>
> Just one concern, maybe I have misunderstood you, but will it cause
> trouble if prctl() is used indiscriminately by some applications?
> Will we get fake data?

It's their problem: overusing it will increase their CPU overhead. The
two boundary worst-cases are that they either call it too frequently or
too rarely:

 - too frequently: it approximates the current cpu-runtime work metric

 - too infrequently: we just ignore it and fall back to a runtime
   metric if it does not change.

It's not like it can be used to get preferential treatment - we don't
ever balance other tasks against these tasks based on work throughput,
we try to maximize this workload's work throughput.

What could happen is if an app is 'optimized' for a buggy scheduler by
changing the work metric frequency. We offer no guarantee - apps will
be best off (and users will be least annoyed) if apps honestly report
their work metric.

Instrumentation/stats/profiling will also double check the correctness
of this data: if developers/users start relying on the work metric as a
substitute benchmark number, then app writers will have an additional
incentive to make them correct.

Thanks,

	Ingo
Re: [PATCH] sched: wakeup buddy
Hi, Ingo

On 03/11/2013 04:21 PM, Ingo Molnar wrote:
[snip]
>
> I have actually written the prctl() approach before, for
> instrumentation purposes, and it does wonders to system analysis.

The idea sounds great, we could get a lot of new info to implement a
smarter scheduler, that's amazing :)

>
> Any objections?

Just one concern, maybe I have misunderstood you, but will it cause
trouble if prctl() is used indiscriminately by some applications?
Will we get fake data?

Regards,
Michael Wang

>
> Thanks,
>
> Ingo
Re: [PATCH] sched: wakeup buddy
* Peter Zijlstra wrote:

> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>
>> wake_affine() stuff is trying to bind related tasks closely, but it
>> doesn't work well according to the test on 'perf bench sched pipe'
>> (thanks to Peter).
>
> so sched-pipe is a poor benchmark for this..
>
> Ideally we'd write a new benchmark that has some actual data footprint
> and we'd measure the cost of tasks being apart on the various cache
> metrics and see what affine wakeup does for it.

Ideally we'd offer applications a new, lightweight vsyscall:

	void sys_sched_work_tick(void)

Or, to speed up adoption, a new, vsyscall-accelerated prctl():

	prctl(PR_WORK_TICK);

which applications could call in each basic work unit they are
performing. Sysbench would call it for every transaction completed,
sched-pipe would call it for every pipe message sent, hackbench would
call it for messages, etc. etc.

This is a minimal application level change, but gives *huge*
information to the scheduler: we could balance tasks to maximize their
observed work rate.

The scheduler could also do other things, like observe the wakeup/sleep
patterns within a 'work atom', observe execution overlap between work
atoms and place tasks accordingly, etc. etc.

Today we approximate work atoms by saying that scheduling atoms == work
atoms. But that approximation breaks down in a number of important
cases.

If we had such a design we'd be able to fix pretty much everything,
without the catch-22 problems we are facing normally.

An added bonus would be increased instrumentation: we could trace,
time, profile work atom rates and could collect work atom profiles.
We'd see work atom execution histograms, etc. etc. - stuff that is
simply not possible today without extensive application-dependent
instrumentation.

We could also use utrace scripts to define work atoms without modifying
the application: for many applications we know which particular
function call means that a basic work unit was completed.

I have actually written the prctl() approach before, for
instrumentation purposes, and it does wonders to system analysis.

Any objections?

Thanks,

	Ingo
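As an illustration of how small the application-side change would be, a
hedged sketch follows. PR_WORK_TICK and its request number are
hypothetical; no such prctl exists, so this is only what adoption might
look like.

	#include <sys/prctl.h>

	/* Hypothetical request number -- not a real prctl option. */
	#define PR_WORK_TICK	41

	static void transaction_completed(void)
	{
		/* Report one basic work unit to the scheduler. */
		prctl(PR_WORK_TICK, 0, 0, 0, 0);
	}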
Re: [PATCH] sched: wakeup buddy
On 03/08/2013 04:26 PM, Mike Galbraith wrote:
> On Fri, 2013-03-08 at 15:30 +0800, Michael Wang wrote:
>> On 03/08/2013 02:44 PM, Mike Galbraith wrote:
>
>>> In general, I think things would work better if we'd just rate limit
>>> how frequently we can wakeup migrate each individual task.
>>
>> Isn't the wakeup buddy already limiting the rate? And by turning the
>> knob, we could change the rate on demand.
>
> I was referring to the existing kernel, not as altered.
>
>>> We want jabbering tasks to share L3, but we don't really want to
>>> trash L2 at an awesome rate.
>>
>> I don't get it... it's a task which has slept for some time; unless
>> there is no task running on prev_cpu while it's sleeping, whatever
>> the new cpu is, we will trash L2, won't we?
>
> I'm thinking if you wake it to its old home after a microscopic sleep,
> it has a good chance of evicting the current resident, rescuing its L2.
> If tasks which do microscopic sleep can't move around at a high rate,
> they'll poke holes in fewer preempt victims. If they're _really_ fast
> switchers, always wake affine. They can't hurt much, they don't do much
> other than schedule off.

I get your point, it's about the task which sleeps frequently for a
very short time; you are right, keeping it around prev_cpu could gain
some benefit.

There are still many factors that need to be considered. For example,
if the cpus around prev_cpu are busier than those around curr_cpu, then
pulling from prev to curr is still likely to be a good choice... for
the sake of the future.

Also, for sure, that depends on what workload is in the system, and how
the tasks cooperate with each other.

IMHO, the wakeup buddy is a compromise solution for this case; before
we can figure out the correct formula (and an efficient way to collect
all the info we rely on), such a flexible optimization is not bad.

Regards,
Michael Wang

>
> -Mike
Re: [PATCH] sched: wakeup buddy
On Fri, 2013-03-08 at 15:30 +0800, Michael Wang wrote:
> On 03/08/2013 02:44 PM, Mike Galbraith wrote:
>
>> In general, I think things would work better if we'd just rate limit how
>> frequently we can wakeup migrate each individual task.
>
> Isn't the wakeup buddy already limiting the rate? And by turning the
> knob, we could change the rate on demand.

I was referring to the existing kernel, not as altered.

>> We want jabbering tasks to share L3, but we don't really want to trash
>> L2 at an awesome rate.
>
> I don't get it... it's a task which has slept for some time; unless
> there is no task running on prev_cpu while it's sleeping, then whatever
> the new cpu is, we will trash L2, won't we?

I'm thinking if you wake it to its old home after a microscopic sleep,
it has a good chance of evicting the current resident, rescuing its L2.
If tasks which do microscopic sleeps can't move around at a high rate,
they'll poke holes in fewer preempt victims. If they're _really_ fast
switchers, always wake affine. They can't hurt much, they don't do much
other than schedule off.

-Mike
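A minimal sketch of the per-task rate limit Mike is proposing, assuming
a hypothetical last_migrate_ns field in task_struct and an illustrative
1ms threshold; this is one possible shape of the idea, not code from
any actual patch:

static int wake_migrate_allowed(struct rq *rq, struct task_struct *p)
{
	/* Illustrative limit: roughly one wakeup migration per ms,
	 * on the order of the default balance interval. */
	const u64 limit_ns = 1 * NSEC_PER_MSEC;

	/* 'last_migrate_ns' is a hypothetical per-task field. */
	if (rq->clock - p->last_migrate_ns < limit_ns)
		return 0;	/* migrated too recently, keep prev_cpu */

	p->last_migrate_ns = rq->clock;
	return 1;
}

select_task_rq_fair() would then only consider pulling the wakee toward
the waker's cache domain when wake_migrate_allowed() says yes, falling
back to prev_cpu otherwise.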
Re: [PATCH] sched: wakeup buddy
On 03/08/2013 02:44 PM, Mike Galbraith wrote:
> On Fri, 2013-03-08 at 10:37 +0800, Michael Wang wrote:
>> On 03/07/2013 05:43 PM, Mike Galbraith wrote:
>>> On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
>>>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>>>
>>>>> wake_affine() stuff is trying to bind related tasks closely, but it
>>>>> doesn't work well according to the test on 'perf bench sched pipe'
>>>>> (thanks to Peter).
>>>>
>>>> so sched-pipe is a poor benchmark for this..
>>>>
>>>> Ideally we'd write a new benchmark that has some actual data footprint
>>>> and we'd measure the cost of tasks being apart on the various cache
>>>> metrics and see what affine wakeup does for it.
>>>>
>>>> Before doing something like what you're proposing, I'd have a hard look
>>>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>>>> for sync wakeups.
>>>
>>> If nobody beats me to it, I'm going to try tracking shortest round trip
>>> to idle, and use a multiple of that to shut select_idle_sibling() down.
>>> If avg_idle approaches round trip time, there's no win to be had, we're
>>> just wasting cycles.
>>
>> That's great if we have it. I'm a little doubtful whether it is possible
>> to find a better way to replace select_idle_sibling() (look at the way
>> it locates an idle cpu...) in some cases, but I'm looking forward to
>> it ;-)
>
> I'm not going to replace it, only stop it from wasting cycles when
> there's very likely nothing to gain. Save task wakeup time, then gate
> on delta rq->clock - p->last_wakeup < N*shortest_idle or some such very
> cheap metric. Wake ultra switchers L2 affine if allowed, only go
> hunting for an idle L3 if the thing is on another package.
>
> In general, I think things would work better if we'd just rate limit how
> frequently we can wakeup migrate each individual task.

Isn't the wakeup buddy already limiting the rate? And by turning the
knob, we could change the rate on demand.

> We want jabbering tasks to share L3, but we don't really want to trash
> L2 at an awesome rate.

I don't get it... it's a task which has slept for some time; unless
there is no task running on prev_cpu while it's sleeping, then whatever
the new cpu is, we will trash L2, won't we?

Regards,
Michael Wang

> -Mike
Re: [PATCH] sched: wakeup buddy
On Fri, 2013-03-08 at 10:37 +0800, Michael Wang wrote:
> On 03/07/2013 05:43 PM, Mike Galbraith wrote:
>> On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
>>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>>
>>>> wake_affine() stuff is trying to bind related tasks closely, but it
>>>> doesn't work well according to the test on 'perf bench sched pipe'
>>>> (thanks to Peter).
>>>
>>> so sched-pipe is a poor benchmark for this..
>>>
>>> Ideally we'd write a new benchmark that has some actual data footprint
>>> and we'd measure the cost of tasks being apart on the various cache
>>> metrics and see what affine wakeup does for it.
>>>
>>> Before doing something like what you're proposing, I'd have a hard look
>>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>>> for sync wakeups.
>>
>> If nobody beats me to it, I'm going to try tracking shortest round trip
>> to idle, and use a multiple of that to shut select_idle_sibling() down.
>> If avg_idle approaches round trip time, there's no win to be had, we're
>> just wasting cycles.
>
> That's great if we have it. I'm a little doubtful whether it is possible
> to find a better way to replace select_idle_sibling() (look at the way
> it locates an idle cpu...) in some cases, but I'm looking forward to
> it ;-)

I'm not going to replace it, only stop it from wasting cycles when
there's very likely nothing to gain. Save task wakeup time, then gate
on delta rq->clock - p->last_wakeup < N*shortest_idle or some such very
cheap metric. Wake ultra switchers L2 affine if allowed, only go
hunting for an idle L3 if the thing is on another package.

In general, I think things would work better if we'd just rate limit how
frequently we can wakeup migrate each individual task. We want jabbering
tasks to share L3, but we don't really want to trash L2 at an awesome
rate.

-Mike
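A hedged sketch of that cheap gate; last_wakeup and shortest_idle are
hypothetical fields that would have to be maintained elsewhere, and N
is an illustrative multiplier, not a value from any patch:

static int worth_idle_search(struct rq *rq, struct task_struct *p)
{
	const u64 N = 4;	/* illustrative multiplier */

	/*
	 * A task that ran this recently is an ultra switcher: hunting
	 * for an idle cpu would cost more than it could save, so keep
	 * it where its cache is warm.
	 */
	return rq->clock - p->last_wakeup >= N * rq->shortest_idle;
}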
Re: [PATCH] sched: wakeup buddy
On 03/08/2013 01:27 AM, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>> @@ -3351,7 +3420,13 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>> 	}
>>
>> 	if (affine_sd) {
>> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> +		/*
>> +		 * If current and p are wakeup related, and balance is
>> +		 * guaranteed, we will try to make them running closely
>> +		 * to gain cache benefit.
>> +		 */
>> +		if (cpu != prev_cpu && wakeup_related(p) &&
>> +					wake_affine(affine_sd, p, sync))
>> 			prev_cpu = cpu;
>
> OK, so there are two issues I have with all this:
>
>  - it completely wrecks task placement for things like interrupts (sadly
>    I don't have a good idea about a benchmark where this matters).

I don't get this point... could you please give more details?

>  - yet another random number.. :/

Correct... well, but that also means flexibility. I suppose different
systems and workloads will need some tuning of this knob to gain more
benefit; by default, they will gain some benefit, small or big.

> Also, I'm starting to dislike the buddy name; it's somewhat over-used.

I have to agree :), any suggestions?

Regards,
Michael Wang
Re: [PATCH] sched: wakeup buddy
On 03/07/2013 05:43 PM, Mike Galbraith wrote:
> On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>
>>> wake_affine() stuff is trying to bind related tasks closely, but it
>>> doesn't work well according to the test on 'perf bench sched pipe'
>>> (thanks to Peter).
>>
>> so sched-pipe is a poor benchmark for this..
>>
>> Ideally we'd write a new benchmark that has some actual data footprint
>> and we'd measure the cost of tasks being apart on the various cache
>> metrics and see what affine wakeup does for it.
>>
>> Before doing something like what you're proposing, I'd have a hard look
>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>> for sync wakeups.
>
> If nobody beats me to it, I'm going to try tracking shortest round trip
> to idle, and use a multiple of that to shut select_idle_sibling() down.
> If avg_idle approaches round trip time, there's no win to be had, we're
> just wasting cycles.

That's great if we have it. I'm a little doubtful whether it is possible
to find a better way to replace select_idle_sibling() (look at the way
it locates an idle cpu...) in some cases, but I'm looking forward to
it ;-)

Regards,
Michael Wang

> -Mike
Re: [PATCH] sched: wakeup buddy
On 03/08/2013 01:21 AM, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>> +static inline int wakeup_related(struct task_struct *p)
>> +{
>> +	if (wakeup_buddy(p, current)) {
>> +		/*
>> +		 * Now check whether current still focus on his buddy.
>> +		 */
>> +		if (wakeup_buddy(current, p))
>> +			return 1;
>> +	}
>> +
>> +	return 0;
>> +}
>
> Not commenting on the thing in general, but:
>
> static inline bool wakeup_related(struct task_struct *p)
> {
> 	return wakeup_buddy(p, current) && wakeup_buddy(current, p);
> }
>
> is far shorter and easier to read :-)

Right, I will correct it :)

Regards,
Michael Wang
Re: [PATCH] sched: wakeup buddy
On 03/08/2013 12:52 AM, Peter Zijlstra wrote:
> On Thu, 2013-03-07 at 17:46 +0800, Michael Wang wrote:
>> On 03/07/2013 04:36 PM, Peter Zijlstra wrote:
>>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>>
>>>> wake_affine() stuff is trying to bind related tasks closely, but it
>>>> doesn't work well according to the test on 'perf bench sched pipe'
>>>> (thanks to Peter).
>>>
>>> so sched-pipe is a poor benchmark for this..
>>>
>>> Ideally we'd write a new benchmark that has some actual data footprint
>>> and we'd measure the cost of tasks being apart on the various cache
>>> metrics and see what affine wakeup does for it.
>>
>> I think sched-pipe is still somewhat capable,
>
> Yeah, it's not entirely crap for this, but it's not ideal either. The
> very big difference you see between it running on a single cpu and on
> say two threads of a single core is mostly due to preemption
> 'artifacts' though. Not because of cache.
>
> So we have 2 tasks -- let's call them A and B -- involved in a single
> word ping-pong. So we're both doing write(); read(); loops. Now what
> happens on a single cpu is that A's write()->wakeup() of B makes B
> preempt A before A hits read() and blocks. This in turn ensures that
> B's write()->wakeup() of A finds an already running A and doesn't
> actually need to do the full (and expensive) wakeup thing (and vice
> versa).

Exactly. I used to think that making them run on the same cpu only
gains benefit when one is going to sleep, but you are right: in that
case, they get latency and performance at the same time, amazing ;-)

One concern in my mind is that this cooperation is somewhat fragile; if
another task C joins the fight on that cpu, it will be broken. But if we
have a way to detect that in select_task_rq_fair(), we will be able to
win the gamble.

> So by constantly preempting one another they avoid the expensive bit of
> going to sleep and waking up again.

Amazing point :)

> wake_affine() OTOH still has a (supposed) benefit if it gets the tasks
> running 'closer' (in a cache hierarchy sense) since then the data
> sharing is less expensive.
>
>> the problem is that the
>> select_idle_sibling() doesn't take care of the wakeup-related case, it
>> doesn't contain the logic to locate an idle cpu closely.
>
> I'm not entirely sure if I understand what you mean, do you mean to say
> its idea of 'closely' is not quite correct? If so, I tend to agree, see
> further down.
>
>> So even if we detect the relationship successfully, select_idle_sibling()
>> can only help to make sure the target cpu won't be outside of the
>> current package; it's a package-level bind, not mc or smp level.
>
> That is the entire point of select_idle_sibling(), selecting a cpu
> 'near' the target cpu that is currently idle.
>
> Not too long ago we had a bit of a discussion on the unholy mess that
> is select_idle_sibling() and if it actually does the right thing.
> Arguably it doesn't for machines that have an effective L2 cache. The
> issue is that the arch<->sched interface only knows about
> last-level-cache (L3 on anything modern) and SMT.

Yeah, that's the point I'm concerned about: we make sure that the cpu
returned by select_idle_sibling() at least shares the L3 cache with the
target, but there is a chance to miss a better candidate which shares
L2 with the target.

> Expanding the topology description in a way that makes sense (and
> doesn't make it a bigger mess) is somewhere on the todo-list.
>
>>> Before doing something like what you're proposing, I'd have a hard look
>>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>>> for sync wakeups.
>>
>> The patch is supposed to stop using wake_affine() blindly, not improve
>> the wake_affine() stuff itself; the whole stuff still works, but since
>> we rely on select_idle_sibling() to make the choice, the benefit is not
>> so significant, especially on my one-node box...
>
> OK, I'll have to go read the actual patch for that, I'll get back to
> you on that :-)
>
>>> The idea behind sync wakeups is that we try and detect the case where
>>> we wake up one task only to go to sleep ourselves and try and avoid
>>> the regular ping-pong this would otherwise create on account of the
>>> waking task still being alive and so the current cpu isn't actually
>>> idle yet but we know it's going to be idle soon.
>>
>> Are you suggesting that we should handle the wakeup-related case
>> separately, not just pass the current cpu to select_idle_sibling()?
>
> Depends a bit on what you're trying to fix, so far I'm just trying to
> write down what I remember about stuff and reacting to half-read
> changelogs ;-)

I see, and I get some good points here for me to think deeper about,
that's great ;-)

Regards,
Michael Wang
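Peter's description maps onto a few lines of userspace. Below is a
minimal stand-alone sketch of the single-word ping-pong, in the spirit
of (but not copied from) 'perf bench sched pipe'; run it pinned to one
cpu (taskset -c 0) versus two siblings (taskset -c 0,1) to observe the
preemption effect he describes:

/*
 * Minimal single-word pipe ping-pong; an illustration only, not the
 * actual perf source.
 */
#include <unistd.h>
#include <sys/wait.h>

#define LOOPS 1000000

int main(void)
{
	int ab[2], ba[2], word = 0, i;

	if (pipe(ab) || pipe(ba))
		return 1;

	if (fork() == 0) {
		/* Task B: read a word, echo it straight back. */
		for (i = 0; i < LOOPS; i++) {
			if (read(ab[0], &word, sizeof(word)) != sizeof(word))
				return 1;
			if (write(ba[1], &word, sizeof(word)) != sizeof(word))
				return 1;
		}
		return 0;
	}

	/* Task A: the write(); read(); loop described above. */
	for (i = 0; i < LOOPS; i++) {
		if (write(ab[1], &word, sizeof(word)) != sizeof(word))
			return 1;
		if (read(ba[0], &word, sizeof(word)) != sizeof(word))
			return 1;
	}
	wait(NULL);
	return 0;
}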
Re: [PATCH] sched: wakeup buddy
On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> @@ -3351,7 +3420,13 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> 	}
>
> 	if (affine_sd) {
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		/*
> +		 * If current and p are wakeup related, and balance is
> +		 * guaranteed, we will try to make them running closely
> +		 * to gain cache benefit.
> +		 */
> +		if (cpu != prev_cpu && wakeup_related(p) &&
> +					wake_affine(affine_sd, p, sync))
> 			prev_cpu = cpu;

OK, so there are two issues I have with all this:

 - it completely wrecks task placement for things like interrupts (sadly
   I don't have a good idea about a benchmark where this matters).

 - yet another random number.. :/

Also, I'm starting to dislike the buddy name; it's somewhat over-used.
Re: [PATCH] sched: wakeup buddy
On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> +static inline int wakeup_related(struct task_struct *p)
> +{
> +	if (wakeup_buddy(p, current)) {
> +		/*
> +		 * Now check whether current still focus on his buddy.
> +		 */
> +		if (wakeup_buddy(current, p))
> +			return 1;
> +	}
> +
> +	return 0;
> +}

Not commenting on the thing in general, but:

static inline bool wakeup_related(struct task_struct *p)
{
	return wakeup_buddy(p, current) && wakeup_buddy(current, p);
}

is far shorter and easier to read :-)
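wakeup_buddy() itself is not quoted in this message; purely as a
reading aid, here is a hedged sketch of one plausible shape for it,
with hypothetical wakee/wakee_ref fields and a threshold knob -- the
actual patch may well differ:

/*
 * Hedged sketch only, NOT the code from the actual patch: 'wakee',
 * 'wakee_ref' and 'buddy_ref' are all illustrative names.
 */
static inline int wakeup_buddy(struct task_struct *p, struct task_struct *other)
{
	/*
	 * 'p' treats 'other' as its buddy once it has woken it (or
	 * been woken by it) 'buddy_ref' times in a row.
	 */
	return p->wakee == other && p->wakee_ref >= buddy_ref;
}

where buddy_ref would be the tunable knob discussed elsewhere in the
thread.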
Re: [PATCH] sched: wakeup buddy
On Thu, 2013-03-07 at 17:46 +0800, Michael Wang wrote:
> On 03/07/2013 04:36 PM, Peter Zijlstra wrote:
>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>
>>> wake_affine() stuff is trying to bind related tasks closely, but it
>>> doesn't work well according to the test on 'perf bench sched pipe'
>>> (thanks to Peter).
>>
>> so sched-pipe is a poor benchmark for this..
>>
>> Ideally we'd write a new benchmark that has some actual data footprint
>> and we'd measure the cost of tasks being apart on the various cache
>> metrics and see what affine wakeup does for it.
>
> I think sched-pipe is still somewhat capable,

Yeah, it's not entirely crap for this, but it's not ideal either. The
very big difference you see between it running on a single cpu and on
say two threads of a single core is mostly due to preemption
'artifacts' though. Not because of cache.

So we have 2 tasks -- let's call them A and B -- involved in a single
word ping-pong. So we're both doing write(); read(); loops. Now what
happens on a single cpu is that A's write()->wakeup() of B makes B
preempt A before A hits read() and blocks. This in turn ensures that
B's write()->wakeup() of A finds an already running A and doesn't
actually need to do the full (and expensive) wakeup thing (and vice
versa).

So by constantly preempting one another they avoid the expensive bit of
going to sleep and waking up again.

wake_affine() OTOH still has a (supposed) benefit if it gets the tasks
running 'closer' (in a cache hierarchy sense) since then the data
sharing is less expensive.

> the problem is that the
> select_idle_sibling() doesn't take care of the wakeup-related case, it
> doesn't contain the logic to locate an idle cpu closely.

I'm not entirely sure if I understand what you mean, do you mean to say
its idea of 'closely' is not quite correct? If so, I tend to agree, see
further down.

> So even if we detect the relationship successfully, select_idle_sibling()
> can only help to make sure the target cpu won't be outside of the
> current package; it's a package-level bind, not mc or smp level.

That is the entire point of select_idle_sibling(), selecting a cpu
'near' the target cpu that is currently idle.

Not too long ago we had a bit of a discussion on the unholy mess that
is select_idle_sibling() and if it actually does the right thing.
Arguably it doesn't for machines that have an effective L2 cache. The
issue is that the arch<->sched interface only knows about
last-level-cache (L3 on anything modern) and SMT.

Expanding the topology description in a way that makes sense (and
doesn't make it a bigger mess) is somewhere on the todo-list.

>> Before doing something like what you're proposing, I'd have a hard look
>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>> for sync wakeups.
>
> The patch is supposed to stop using wake_affine() blindly, not improve
> the wake_affine() stuff itself; the whole stuff still works, but since
> we rely on select_idle_sibling() to make the choice, the benefit is not
> so significant, especially on my one-node box...

OK, I'll have to go read the actual patch for that, I'll get back to
you on that :-)

>> The idea behind sync wakeups is that we try and detect the case where
>> we wake up one task only to go to sleep ourselves and try and avoid
>> the regular ping-pong this would otherwise create on account of the
>> waking task still being alive and so the current cpu isn't actually
>> idle yet but we know it's going to be idle soon.
>
> Are you suggesting that we should handle the wakeup-related case
> separately, not just pass the current cpu to select_idle_sibling()?

Depends a bit on what you're trying to fix, so far I'm just trying to
write down what I remember about stuff and reacting to half-read
changelogs ;-)
Re: [PATCH] sched: wakeup buddy
Hi, Peter

Thanks for your reply.

On 03/07/2013 04:36 PM, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>
>> wake_affine() stuff is trying to bind related tasks closely, but it
>> doesn't work well according to the test on 'perf bench sched pipe'
>> (thanks to Peter).
>
> so sched-pipe is a poor benchmark for this..
>
> Ideally we'd write a new benchmark that has some actual data footprint
> and we'd measure the cost of tasks being apart on the various cache
> metrics and see what affine wakeup does for it.

I think sched-pipe is still somewhat capable; the problem is that
select_idle_sibling() doesn't take care of the wakeup-related case, it
doesn't contain the logic to locate an idle cpu closely.

So even if we detect the relationship successfully, select_idle_sibling()
can only help to make sure the target cpu won't be outside of the
current package; it's a package-level bind, not mc or smp level.

> Before doing something like what you're proposing, I'd have a hard look
> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
> for sync wakeups.

The patch is supposed to stop using wake_affine() blindly, not improve
the wake_affine() stuff itself; the whole stuff still works, but since
we rely on select_idle_sibling() to make the choice, the benefit is not
so significant, especially on my one-node box...

> The idea behind sync wakeups is that we try and detect the case where
> we wake up one task only to go to sleep ourselves and try and avoid
> the regular ping-pong this would otherwise create on account of the
> waking task still being alive and so the current cpu isn't actually
> idle yet but we know it's going to be idle soon.

Are you suggesting that we should handle the wakeup-related case
separately, not just pass the current cpu to select_idle_sibling()?

Regards,
Michael Wang
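For context on the 'package-level bind' point, an abbreviated,
from-memory sketch of what select_idle_sibling() looked like around
v3.8 (affinity-mask details elided, so treat it as approximate); note
the walk starts at the last-level-cache domain and never leaves it:

/* Abbreviated from memory, v3.8-era; approximate, details elided. */
static int select_idle_sibling(struct task_struct *p, int target)
{
	struct sched_domain *sd;
	struct sched_group *sg;
	int i;

	if (idle_cpu(target))
		return target;

	/* Walk down the LLC domain looking for a fully idle group. */
	sd = rcu_dereference(per_cpu(sd_llc, target));
	for_each_lower_domain(sd) {
		sg = sd->groups;
		do {
			for_each_cpu(i, sched_group_cpus(sg)) {
				if (!idle_cpu(i))
					goto next;
			}
			return cpumask_first(sched_group_cpus(sg));
next:
			sg = sg->next;
		} while (sg != sd->groups);
	}
	return target;
}

Any idle cpu it returns shares L3 with 'target', but nothing steers the
choice toward an L2 sibling first -- which is exactly the miss described
above.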
Re: [PATCH] sched: wakeup buddy
On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>
>> wake_affine() stuff is trying to bind related tasks closely, but it
>> doesn't work well according to the test on 'perf bench sched pipe'
>> (thanks to Peter).
>
> so sched-pipe is a poor benchmark for this..
>
> Ideally we'd write a new benchmark that has some actual data footprint
> and we'd measure the cost of tasks being apart on the various cache
> metrics and see what affine wakeup does for it.
>
> Before doing something like what you're proposing, I'd have a hard look
> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
> for sync wakeups.

If nobody beats me to it, I'm going to try tracking shortest round trip
to idle, and use a multiple of that to shut select_idle_sibling() down.
If avg_idle approaches round trip time, there's no win to be had, we're
just wasting cycles.

-Mike
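A hedged sketch of the gate Mike describes, not kernel code: rq->avg_idle
is a real field, but shortest_round_trip is a hypothetical one that his
experiment would have to track:

static int idle_search_is_futile(struct rq *this_rq)
{
	const u64 N = 2;	/* illustrative multiplier */

	/*
	 * If cpus stay idle barely longer than a wakeup round trip,
	 * any idle cpu found will be busy again before the migrated
	 * task can profit, so skip select_idle_sibling() entirely.
	 */
	return this_rq->avg_idle < N * this_rq->shortest_round_trip;
}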
Re: [PATCH] sched: wakeup buddy
On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> wake_affine() stuff is trying to bind related tasks closely, but it
> doesn't work well according to the test on 'perf bench sched pipe'
> (thanks to Peter).

so sched-pipe is a poor benchmark for this..

Ideally we'd write a new benchmark that has some actual data footprint
and we'd measure the cost of tasks being apart on the various cache
metrics and see what affine wakeup does for it.

Before doing something like what you're proposing, I'd have a hard look
at WF_SYNC, it is possible we should disable/fix select_idle_sibling
for sync wakeups.

The idea behind sync wakeups is that we try and detect the case where
we wake up one task only to go to sleep ourselves and try and avoid
the regular ping-pong this would otherwise create on account of the
waking task still being alive and so the current cpu isn't actually
idle yet but we know it's going to be idle soon.
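To make the sync idea concrete, a hedged illustration of how a WF_SYNC
hint can be honoured; cpu_load() and task_load() stand in for the real
load-tracking machinery, and this shows the concept only, not the
kernel's actual wake_affine():

static int sync_wake_affine(struct task_struct *p, int this_cpu, int sync)
{
	unsigned long this_load = cpu_load(this_cpu);

	/*
	 * A sync waker promises to sleep right after the wakeup, so
	 * discount its own load when judging whether this_cpu can
	 * take the wakee.
	 */
	if (sync)
		this_load -= task_load(current);

	return this_load <= cpu_load(task_cpu(p));
}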