Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 15:30 +0100, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 14:06 +0100, Ingo Molnar wrote:
> > I think it might be better to measure the scheduling rate all
> > the time, and save the _shortest_ cross-cpu-wakeup and
> > same-cpu-wakeup latencies (since bootup) as a reference number.
> >
> > We might be able to pull this off pretty cheaply as the
> > scheduler clock is running all the time and we have all the
> > timestamps needed.
>
> Yeah, that might work. We have some quick kthreads, so saving ctx
> distance may get close enough to scheduler cost to be good enough.

Or better, shortest idle to idle, that would include the current (bad)
nohz cost, and automatically shrink away when that cost shrinks.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 14:06 +0100, Ingo Molnar wrote:
> * Mike Galbraith wrote:
>
> > > > No, that's too high, you lose too much of the pretty
> > > > face. [...]
> > >
> > > Then a logical proportion of it - such as half of it?
> >
> > Hm. Better would maybe be a quick boot time benchmark, and
> > use some multiple of your cross core pipe ping-pong time?
> > That we know is a complete waste of cycles, because almost all
> > cycles are scheduler cycles with no other work to be done,
> > making firing up another scheduler rather pointless. If we're
> > approaching that rate, we're approaching bad idea.
>
> Well, one problem with such dynamic boot time measurements is
> that it introduces a certain amount of uncertainty that persists
> for the whole lifetime of the booted up box - and it also sucks
> in any sort of non-deterministic execution environment, such as
> virtualized systems.

Ok, bad idea.

> I think it might be better to measure the scheduling rate all
> the time, and save the _shortest_ cross-cpu-wakeup and
> same-cpu-wakeup latencies (since bootup) as a reference number.
>
> We might be able to pull this off pretty cheaply as the
> scheduler clock is running all the time and we have all the
> timestamps needed.

Yeah, that might work. We have some quick kthreads, so saving ctx
distance may get close enough to scheduler cost to be good enough.

> Pretty quickly after bootup this 'shortest latency' would settle
> down to a very system specific (and pretty accurate) value.
>
> [ One downside would be an increased sensitivity to the accuracy
>   and monotonicity of the scheduler clock - but that's something
>   we want to improve on anyway - and 'worst case' we get too
>   short latencies and we are where we are today. So it can only
>   improve the situation IMO. ]
>
> Would you be interested in trying to hack on an auto-tuning
> feature like this?

Yeah, should be easy, but rainy day has to happen so I have time to
measure twiddle measure measure tweak.. ;-)

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
* Mike Galbraith wrote:

> > > No, that's too high, you lose too much of the pretty
> > > face. [...]
> >
> > Then a logical proportion of it - such as half of it?
>
> Hm. Better would maybe be a quick boot time benchmark, and
> use some multiple of your cross core pipe ping-pong time?
> That we know is a complete waste of cycles, because almost all
> cycles are scheduler cycles with no other work to be done,
> making firing up another scheduler rather pointless. If we're
> approaching that rate, we're approaching bad idea.

Well, one problem with such dynamic boot time measurements is
that it introduces a certain amount of uncertainty that persists
for the whole lifetime of the booted up box - and it also sucks
in any sort of non-deterministic execution environment, such as
virtualized systems.

I think it might be better to measure the scheduling rate all
the time, and save the _shortest_ cross-cpu-wakeup and
same-cpu-wakeup latencies (since bootup) as a reference number.

We might be able to pull this off pretty cheaply as the
scheduler clock is running all the time and we have all the
timestamps needed.

Pretty quickly after bootup this 'shortest latency' would settle
down to a very system specific (and pretty accurate) value.

[ One downside would be an increased sensitivity to the accuracy
  and monotonicity of the scheduler clock - but that's something
  we want to improve on anyway - and 'worst case' we get too
  short latencies and we are where we are today. So it can only
  improve the situation IMO. ]

Would you be interested in trying to hack on an auto-tuning
feature like this?

Thanks,

	Ingo
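[ Editor's note: a minimal userspace sketch of the bookkeeping Ingo describes above. All names (`wakeup_ref`, `wakeup_ref_update`) are hypothetical, not actual kernel code; the point is only the shape of the idea: feed it sched_clock()-style latency deltas and it remembers the shortest same-cpu and cross-cpu wakeup latencies seen since boot. ]

```c
#include <stdint.h>

/* Hypothetical sketch, not kernel code: keep the shortest observed
 * same-cpu and cross-cpu wakeup latency since boot as reference
 * values, derived from timestamps the scheduler clock already has. */
struct wakeup_ref {
	uint64_t min_same_cpu_ns;	/* shortest same-cpu wakeup seen */
	uint64_t min_cross_cpu_ns;	/* shortest cross-cpu wakeup seen */
};

static void wakeup_ref_init(struct wakeup_ref *ref)
{
	ref->min_same_cpu_ns = UINT64_MAX;
	ref->min_cross_cpu_ns = UINT64_MAX;
}

/* Called with the delta between the wakeup being issued and the wakee
 * hitting the CPU; one compare and (rarely) one store, so cheap enough
 * to run all the time. */
static void wakeup_ref_update(struct wakeup_ref *ref,
			      uint64_t latency_ns, int same_cpu)
{
	uint64_t *min = same_cpu ? &ref->min_same_cpu_ns
				 : &ref->min_cross_cpu_ns;

	if (latency_ns < *min)
		*min = latency_ns;
}
```

As Ingo notes, the minima settle quickly after boot to system-specific values, and a too-short minimum only degrades to today's behaviour.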
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 13:11 +0100, Ingo Molnar wrote:
> * Mike Galbraith wrote:
>
> > On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote:
> > > * Mike Galbraith wrote:
> > >
> > > > On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote:
> > > > > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > > > > But that's really some benefit hardly to be estimated,
> > > > > > especially when the workload is heavy, the cost of
> > > > > > wake_affine() is very high to calculate se one by one;
> > > > > > is that worth it for some benefit we could not promise?
> > > > >
> > > > > Look at something like pipe-test.. wake_affine() used to
> > > > > ensure both client/server ran on the same cpu, but then I
> > > > > think we added select_idle_sibling() and wrecked it again :/
> > > >
> > > > Yeah, that's the absolute worst case for
> > > > select_idle_sibling(), 100% synchronous, absolutely nothing to
> > > > be gained by cross cpu scheduling. Fortunately, most tasks do
> > > > more than that, but nonetheless, select_idle_sibling()
> > > > definitely is a two faced little b*tch. I'd like to see the
> > > > evil b*tch die, but something needs to replace its pretty
> > > > face. One thing that you can do is simply not call it when
> > > > the context switch rate is incredible.. its job is to recover
> > > > overlap; if you're scheduling near your max, there's no win
> > > > worth the cost.
> > >
> > > Couldn't we make the cutoff dependent on sched_migration_cost?
> > > If the wakeup comes in faster than that then don't spread.
> >
> > No, that's too high, you lose too much of the pretty face.
> > [...]
>
> Then a logical proportion of it - such as half of it?

Hm. Better would maybe be a quick boot time benchmark, and use some
multiple of your cross core pipe ping-pong time? That we know is a
complete waste of cycles, because almost all cycles are scheduler
cycles with no other work to be done, making firing up another
scheduler rather pointless. If we're approaching that rate, we're
approaching bad idea.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
* Mike Galbraith wrote:

> On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote:
> > * Mike Galbraith wrote:
> >
> > > On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote:
> > > > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > > > But that's really some benefit hardly to be estimated,
> > > > > especially when the workload is heavy, the cost of
> > > > > wake_affine() is very high to calculate se one by one;
> > > > > is that worth it for some benefit we could not promise?
> > > >
> > > > Look at something like pipe-test.. wake_affine() used to
> > > > ensure both client/server ran on the same cpu, but then I
> > > > think we added select_idle_sibling() and wrecked it again :/
> > >
> > > Yeah, that's the absolute worst case for
> > > select_idle_sibling(), 100% synchronous, absolutely nothing to
> > > be gained by cross cpu scheduling. Fortunately, most tasks do
> > > more than that, but nonetheless, select_idle_sibling()
> > > definitely is a two faced little b*tch. I'd like to see the
> > > evil b*tch die, but something needs to replace its pretty
> > > face. One thing that you can do is simply not call it when
> > > the context switch rate is incredible.. its job is to recover
> > > overlap; if you're scheduling near your max, there's no win
> > > worth the cost.
> >
> > Couldn't we make the cutoff dependent on sched_migration_cost?
> > If the wakeup comes in faster than that then don't spread.
>
> No, that's too high, you lose too much of the pretty face.
> [...]

Then a logical proportion of it - such as half of it?

Thanks,

	Ingo
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 05:57 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 17:11 +0800, Michael Wang wrote:
>
>> Ok, it does look like wake_affine() lost its value...
>
> I'm not sure we can say that on this one benchmark, there's a
> preemption advantage to running on a single cpu for pipe-test as well.
> We'd need to create a better benchmark to test this, one that has some
> actual data payload and control over the initial spread of the tasks or
> so.
>
>>> Now as far as I can see there's two options, either we find there's
>>> absolutely no benefit in wake_affine() as it stands today and we simply
>>> disable/remove it, or we go fix it. What we don't do is completely
>>> wreck it at atrocious cost.
>>
>> I get your point, we should replace wake_affine() with some feature
>> which could really achieve the goal of putting client and server on the
>> same cpu.
>>
>> But does the logic that the waker/wakee are server/client (or reversed)
>> still work now? That sounds a little arbitrary to me...
>
> Ah, it's never really been about server/client per-se. It's just a
> specific example -- one that breaks down with the 1:n pgbench
> situation.
>
> Wakeups in general can be considered to be a relation; suppose a
> hardware interrupt that received some data from a device issues a
> wakeup to a task to consume this data. What CPU would be better suited
> to process this data than the one where it's already cache hot?

I see; honestly, I realized that I had underestimated the benefit we
gain from it when I saw your testing results...

We do need some better approach to replace wake_affine(), hmm... I need
a drawing board now...

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote:
> * Mike Galbraith wrote:
>
> > On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote:
> > > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > > But that's really some benefit hardly to be estimated,
> > > > especially when the workload is heavy, the cost of
> > > > wake_affine() is very high to calculate se one by one;
> > > > is that worth it for some benefit we could not promise?
> > >
> > > Look at something like pipe-test.. wake_affine() used to
> > > ensure both client/server ran on the same cpu, but then I
> > > think we added select_idle_sibling() and wrecked it again :/
> >
> > Yeah, that's the absolute worst case for
> > select_idle_sibling(), 100% synchronous, absolutely nothing to
> > be gained by cross cpu scheduling. Fortunately, most tasks do
> > more than that, but nonetheless, select_idle_sibling()
> > definitely is a two faced little b*tch. I'd like to see the
> > evil b*tch die, but something needs to replace its pretty
> > face. One thing that you can do is simply not call it when
> > the context switch rate is incredible.. its job is to recover
> > overlap; if you're scheduling near your max, there's no win
> > worth the cost.
>
> Couldn't we make the cutoff dependent on sched_migration_cost?
> If the wakeup comes in faster than that then don't spread.

No, that's too high, you lose too much of the pretty face. It's a real
problem. On AMD, the breakeven is much higher than on Intel, it seems,
as well. My E5620 can turn in a win on both tbench and even netperf
TCP_RR!! iff nohz is throttled. For the Opterons I've played with, it's
a loser at even tbench context switch rate; it needs to be cut off
earlier.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 05:39 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 17:10 +0800, Michael Wang wrote:
>> On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
>>> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>>>> According to my understanding, in the old world, wake_affine() will
>>>> only be used if curr_cpu and prev_cpu share cache, which means they
>>>> are in one package; whether we search in the llc sd of curr_cpu or
>>>> prev_cpu, we won't have the chance to spread the task out of that
>>>> package.
>>>
>>> Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
>>> don't have it set, but 'small' NUMA systems will have it set over the
>>> entire domain tree.
>>
>> Oh, I missed that point...
>>
>> But I don't get the reason to make the NUMA level affine; do cpus in
>> different nodes share cache? Doesn't make sense...
>
> Contrary, it makes more sense: the more expensive it is to run 'remote',
> the better it is to pull 'related' tasks together.

It increases the range to bind a task, from one node to several, but
also increases the range of target cpus, from one node's to several's.
I still can't estimate the benefit, but I think I get the purpose:
trying to keep related tasks as close as possible, is that right?

Let me think about this point; I do believe there will be a better way
to take care of this purpose.

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 17:11 +0800, Michael Wang wrote:
> Ok, it does look like wake_affine() lost its value...

I'm not sure we can say that on this one benchmark; there's a
preemption advantage to running on a single cpu for pipe-test as well.
We'd need to create a better benchmark to test this, one that has some
actual data payload and control over the initial spread of the tasks or
so.

> > Now as far as I can see there's two options, either we find there's
> > absolutely no benefit in wake_affine() as it stands today and we simply
> > disable/remove it, or we go fix it. What we don't do is completely
> > wreck it at atrocious cost.
>
> I get your point, we should replace wake_affine() with some feature
> which could really achieve the goal of putting client and server on the
> same cpu.
>
> But does the logic that the waker/wakee are server/client (or reversed)
> still work now? That sounds a little arbitrary to me...

Ah, it's never really been about server/client per-se. It's just a
specific example -- one that breaks down with the 1:n pgbench
situation.

Wakeups in general can be considered to be a relation; suppose a
hardware interrupt that received some data from a device issues a
wakeup to a task to consume this data. What CPU would be better suited
to process this data than the one where it's already cache hot?
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
* Mike Galbraith wrote:

> On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote:
> > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > > But that's really some benefit hardly to be estimated,
> > > especially when the workload is heavy, the cost of
> > > wake_affine() is very high to calculate se one by one;
> > > is that worth it for some benefit we could not promise?
> >
> > Look at something like pipe-test.. wake_affine() used to
> > ensure both client/server ran on the same cpu, but then I
> > think we added select_idle_sibling() and wrecked it again :/
>
> Yeah, that's the absolute worst case for
> select_idle_sibling(), 100% synchronous, absolutely nothing to
> be gained by cross cpu scheduling. Fortunately, most tasks do
> more than that, but nonetheless, select_idle_sibling()
> definitely is a two faced little b*tch. I'd like to see the
> evil b*tch die, but something needs to replace its pretty
> face. One thing that you can do is simply not call it when
> the context switch rate is incredible.. its job is to recover
> overlap; if you're scheduling near your max, there's no win
> worth the cost.

Couldn't we make the cutoff dependent on sched_migration_cost?
If the wakeup comes in faster than that then don't spread.

Thanks,

	Ingo
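[ Editor's note: a hedged sketch of what such a cutoff could look like. The function name is made up, and the divisor of 2 reflects the "half of it" proportion Ingo floats elsewhere in the thread; it is an illustration of the idea, not proposed kernel code. ]

```c
#include <stdint.h>

/* Illustrative default: sched_migration_cost is 0.5 ms on mainline. */
static uint64_t sysctl_sched_migration_cost_ns = 500000;

/* Hypothetical helper: return nonzero when the interval since the
 * task's last wakeup is long enough that searching for an idle
 * sibling could recover some overlap. If wakeups arrive faster than
 * the cutoff, the task is scheduling near its maximum rate and the
 * search is pure cost. */
static int worth_searching_idle_sibling(uint64_t now_ns,
					uint64_t last_wakeup_ns)
{
	/* "a logical proportion of it - such as half of it" */
	uint64_t cutoff_ns = sysctl_sched_migration_cost_ns / 2;

	return (now_ns - last_wakeup_ns) >= cutoff_ns;
}
```

Mike's objection below is precisely about this constant: the right breakeven point is hardware dependent, which is what motivates the auto-tuning discussion later in the thread.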
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> > But that's really some benefit hardly to be estimated, especially
> > when the workload is heavy, the cost of wake_affine() is very high
> > to calculate se one by one; is that worth it for some benefit we
> > could not promise?
>
> Look at something like pipe-test.. wake_affine() used to ensure both
> client/server ran on the same cpu, but then I think we added
> select_idle_sibling() and wrecked it again :/

Yeah, that's the absolute worst case for select_idle_sibling(), 100%
synchronous, absolutely nothing to be gained by cross cpu scheduling.
Fortunately, most tasks do more than that, but nonetheless,
select_idle_sibling() definitely is a two faced little b*tch. I'd like
to see the evil b*tch die, but something needs to replace its pretty
face. One thing that you can do is simply not call it when the context
switch rate is incredible.. its job is to recover overlap; if you're
scheduling near your max, there's no win worth the cost.

> $ taskset 1 perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
>
>      Total time: 3.761 [sec]
>
>        3.761725 usecs/op
>          265835 ops/sec
>
> $ perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
>
>      Total time: 29.809 [sec]
>
>       29.809720 usecs/op
>           33546 ops/sec

Gak! Hm, are you running a kernel without the thinko fix? It's not
good for this extreme testcase, but it doesn't suck _that_ bad ;-)
nohz isn't exactly your friend with ultra switchers either.

Q6600:

marge:~ # taskset -c 3 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 3.395 [sec]

       3.395357 usecs/op
         294519 ops/sec

marge:~ # perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 4.212 [sec]

       4.212413 usecs/op
         237393 ops/sec

E5620:

rtbox:~ # taskset -c 0 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 2.558 [sec]

       2.558237 usecs/op
         390894 ops/sec

rtbox:~ # perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 4.588 [sec]

       4.588702 usecs/op
         217926 ops/sec
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 17:10 +0800, Michael Wang wrote:
> On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
> > On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
> > > According to my understanding, in the old world, wake_affine() will
> > > only be used if curr_cpu and prev_cpu share cache, which means they
> > > are in one package; whether we search in the llc sd of curr_cpu or
> > > prev_cpu, we won't have the chance to spread the task out of that
> > > package.
> >
> > Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
> > don't have it set, but 'small' NUMA systems will have it set over the
> > entire domain tree.
>
> Oh, I missed that point...
>
> But I don't get the reason to make the NUMA level affine; do cpus in
> different nodes share cache? Doesn't make sense...

Contrary, it makes more sense: the more expensive it is to run 'remote',
the better it is to pull 'related' tasks together.
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 04:36 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>> But that's really some benefit hardly to be estimated, especially
>> when the workload is heavy, the cost of wake_affine() is very high
>> to calculate se one by one; is that worth it for some benefit we
>> could not promise?
>
> Look at something like pipe-test.. wake_affine() used to ensure both
> client/server ran on the same cpu, but then I think we added
> select_idle_sibling() and wrecked it again :/
>
> $ taskset 1 perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
>
>      Total time: 3.761 [sec]
>
>        3.761725 usecs/op
>          265835 ops/sec
>
> $ perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
>
>      Total time: 29.809 [sec]
>
>       29.809720 usecs/op
>           33546 ops/sec

Ok, it does look like wake_affine() lost its value...

> Now as far as I can see there's two options, either we find there's
> absolutely no benefit in wake_affine() as it stands today and we simply
> disable/remove it, or we go fix it. What we don't do is completely
> wreck it at atrocious cost.

I get your point, we should replace wake_affine() with some feature
which could really achieve the goal of putting client and server on the
same cpu.

But does the logic that the waker/wakee are server/client (or reversed)
still work now? That sounds a little arbitrary to me...

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>> According to my understanding, in the old world, wake_affine() will
>> only be used if curr_cpu and prev_cpu share cache, which means they
>> are in one package; whether we search in the llc sd of curr_cpu or
>> prev_cpu, we won't have the chance to spread the task out of that
>> package.
>
> Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
> don't have it set, but 'small' NUMA systems will have it set over the
> entire domain tree.

Oh, I missed that point...

But I don't get the reason to make the NUMA level affine; do cpus in
different nodes share cache? Doesn't make sense...

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> But that's really some benefit hardly to be estimated, especially
> when the workload is heavy, the cost of wake_affine() is very high
> to calculate se one by one; is that worth it for some benefit we
> could not promise?

Look at something like pipe-test.. wake_affine() used to ensure both
client/server ran on the same cpu, but then I think we added
select_idle_sibling() and wrecked it again :/

$ taskset 1 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 3.761 [sec]

       3.761725 usecs/op
         265835 ops/sec

$ perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 29.809 [sec]

      29.809720 usecs/op
          33546 ops/sec

Now as far as I can see there's two options: either we find there's
absolutely no benefit in wake_affine() as it stands today and we simply
disable/remove it, or we go fix it. What we don't do is completely
wreck it at atrocious cost.
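[ Editor's note: for readers unfamiliar with pipe-test, the pattern `perf bench sched pipe` exercises boils down to the sketch below (an illustration, not the perf source). One byte bounces back and forth between two tasks, so exactly one task is runnable at any moment -- the 100% synchronous case discussed above, with no overlap for select_idle_sibling() to recover. ]

```c
#include <unistd.h>

/* One half of a pipe ping-pong step: send one byte through wr_fd to
 * wake the partner, then block on rd_fd until a byte comes back.
 * Each round trip costs two wakeups and two context switches and
 * does no other work, which is why it is the worst case for any
 * cross-cpu wakeup placement logic. Returns 0 on success. */
static int pingpong_once(int wr_fd, int rd_fd)
{
	char c = 'x';

	if (write(wr_fd, &c, 1) != 1)
		return -1;
	return read(rd_fd, &c, 1) == 1 ? 0 : -1;
}
```

In the real benchmark two processes each run this in a loop over a pair of pipes; `taskset 1` simply pins both onto one cpu, which is what makes the affine numbers above so much better.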
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 04:17 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 14:42 +0800, Michael Wang wrote:
>
>> So this is trying to take care of the condition when curr_cpu (local)
>> and prev_cpu (remote) are on different nodes, for which in the old
>> world wake_affine() won't be invoked, correct?
>
> It'll be called any time this_cpu and prev_cpu aren't one and the same.
> It'd be pretty silly to be asking whether to pull_here or leave_there
> when here and there are identical.

Agree :)

>> Hmm... I think this may be a good additional check before entering the
>> balance path, but I could not estimate the cost of recording the
>> relationship at this moment...
>
> It'd be pretty cheap, but I'd hate adding any cycles to the fast path
> unless those cycles have one hell of a good payoff, so the caching would
> have to show most excellent cold hard numbers (talk crazy ideas walk;).

It sounds like a good idea; I'm not sure whether it's cheap and how much
benefit we could gain, but it's worth some research.

I will think more about it after finishing the sbm work.

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 14:42 +0800, Michael Wang wrote:
> So this is trying to take care of the condition when curr_cpu (local)
> and prev_cpu (remote) are on different nodes, for which in the old
> world wake_affine() won't be invoked, correct?

It'll be called any time this_cpu and prev_cpu aren't one and the same.
It'd be pretty silly to be asking whether to pull_here or leave_there
when here and there are identical.

> Hmm... I think this may be a good additional check before entering the
> balance path, but I could not estimate the cost of recording the
> relationship at this moment...

It'd be pretty cheap, but I'd hate adding any cycles to the fast path
unless those cycles have one hell of a good payoff, so the caching would
have to show most excellent cold hard numbers (talk crazy ideas walk;).

-Mike
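[ Editor's note: one cheap shape the waker/wakee "relationship" cache floated above could take -- every name and the threshold below are made up for illustration, and nothing like this is in the code under discussion. The idea: remember each task's last waker and count how often it flips; a stable 1:1 pairing suggests an affine wakeup may pay off, a churning 1:n pattern suggests leaving placement alone. ]

```c
/* Hypothetical per-task wakee cache: two words in the fast path. */
struct wakee_cache {
	int last_waker;		/* pid of the task that last woke us */
	unsigned int flips;	/* count of waker changes; decays when stable */
};

/* Update on every wakeup. Returns nonzero when the pairing looks
 * stable enough that an affine wakeup is worth considering. */
static int wakee_cache_update(struct wakee_cache *wc, int waker)
{
	if (wc->last_waker != waker) {
		wc->last_waker = waker;
		wc->flips++;		/* a new waker: relationship churns */
	} else if (wc->flips > 0) {
		wc->flips--;		/* same waker again: reward stability */
	}
	return wc->flips < 4;		/* threshold is an arbitrary placeholder */
}
```

This is exactly the kind of thing Mike is skeptical about: two extra words and a branch per wakeup are cheap, but they would still have to earn their keep with hard numbers.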
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
> According to my understanding, in the old world, wake_affine() will
> only be used if curr_cpu and prev_cpu share cache, which means they
> are in one package; whether we search in the llc sd of curr_cpu or
> prev_cpu, we won't have the chance to spread the task out of that
> package.

Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains
don't have it set, but 'small' NUMA systems will have it set over the
entire domain tree.
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: According to my understanding, in the old world, wake_affine() will only be used if curr_cpu and prev_cpu share cache, which means they are in one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't have the chance to spread the task out of that package. Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains don't have it set, but 'small' NUMA systems will have it set over the entire domain tree. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 14:42 +0800, Michael Wang wrote: So this is trying to take care the condition when curr_cpu(local) and prev_cpu(remote) are on different nodes, which in the old world, wake_affine() won't be invoked, correct? It'll be called any time this_cpu and prev_cpu aren't one and the same. It'd be pretty silly to asking whether to pull_here or leave_there when here and there are identical. Hmm...I think this maybe a good additional checking before enter balance path, but I could not estimate the cost to record the relationship at this moment of time... It'd be pretty cheap, but I'd hate adding any cycles to the fast path unless those cycles have one hell of a good payoff, so the caching would have to show most excellent cold hard numbers (talk crazy ideas walk;). -Mike -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 04:17 PM, Mike Galbraith wrote:

> On Fri, 2013-02-22 at 14:42 +0800, Michael Wang wrote:
>> So this is trying to take care the condition when curr_cpu(local) and prev_cpu(remote) are on different nodes, which in the old world, wake_affine() won't be invoked, correct?
>
> It'll be called any time this_cpu and prev_cpu aren't one and the same. It'd be pretty silly to asking whether to pull_here or leave_there when here and there are identical.

Agree :)

>> Hmm...I think this maybe a good additional checking before enter balance path, but I could not estimate the cost to record the relationship at this moment of time...
>
> It'd be pretty cheap, but I'd hate adding any cycles to the fast path unless those cycles have one hell of a good payoff, so the caching would have to show most excellent cold hard numbers (talk crazy ideas walk;).

It sounds like a good idea, I'm not sure whether it's cheap and how much benefit we could gain, but it's worth some research. I will think more about it after finishing the sbm work.

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:

> But that's really some benefit hardly to be estimate, especially when the workload is heavy, the cost of wake_affine() is very high to calculated se one by one, is that worth for some benefit we could not promise?

Look at something like pipe-test.. wake_affine() used to ensure both client/server ran on the same cpu, but then I think we added select_idle_sibling() and wrecked it again :/

$ taskset 1 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 3.761 [sec]

       3.761725 usecs/op
         265835 ops/sec

$ perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 29.809 [sec]

      29.809720 usecs/op
          33546 ops/sec

Now as far as I can see there's two options, either we find there's absolutely no benefit in wake_affine() as it stands today and we simply disable/remove it, or we go fix it.

What we don't do is completely wreck it at atrocious cost.
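[ For reference, the pipe-test pattern being measured boils down to two tasks bouncing a token through a pair of pipes, so every iteration costs two wakeups. A minimal user-space sketch of that pattern follows; it is a hypothetical stand-in for `perf bench sched pipe`, not its actual source, and `pipe_pingpong_usecs` is an invented name: ]

```c
#include <assert.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Bounce one byte back and forth between parent and child 'iters'
 * times and return the average round-trip cost in microseconds.
 * Each read() blocks until the peer writes, so every iteration
 * forces two wakeups -- exactly the placement decision that
 * wake_affine() vs. select_idle_sibling() is arguing about.
 */
static double pipe_pingpong_usecs(int iters)
{
	int ping[2], pong[2];
	char b = 0;
	struct timeval t0, t1;

	assert(pipe(ping) == 0 && pipe(pong) == 0);

	pid_t pid = fork();
	assert(pid >= 0);

	if (pid == 0) {		/* child: echo whatever arrives */
		for (int i = 0; i < iters; i++) {
			assert(read(ping[0], &b, 1) == 1);
			assert(write(pong[1], &b, 1) == 1);
		}
		_exit(0);
	}

	gettimeofday(&t0, NULL);
	for (int i = 0; i < iters; i++) {
		assert(write(ping[1], &b, 1) == 1);
		assert(read(pong[0], &b, 1) == 1);
	}
	gettimeofday(&t1, NULL);
	waitpid(pid, NULL, 0);

	double us = (t1.tv_sec - t0.tv_sec) * 1e6
		  + (t1.tv_usec - t0.tv_usec);
	return us / iters;
}
```

[ Running it pinned with `taskset -c 0` versus unpinned reproduces the same comparison as the perf numbers above. ]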
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 04:21 PM, Peter Zijlstra wrote:

> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>> According to my understanding, in the old world, wake_affine() will only be used if curr_cpu and prev_cpu share cache, which means they are in one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't have the chance to spread the task out of that package.
>
> Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains don't have it set, but 'small' NUMA systems will have it set over the entire domain tree.

Oh, I missed that point...

But I don't get the reason to make NUMA level affine, cpus in different nodes share cache? doesn't make sense...

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 04:36 PM, Peter Zijlstra wrote:

> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>> But that's really some benefit hardly to be estimate, especially when the workload is heavy, the cost of wake_affine() is very high to calculated se one by one, is that worth for some benefit we could not promise?
>
> Look at something like pipe-test.. wake_affine() used to ensure both client/server ran on the same cpu, but then I think we added select_idle_sibling() and wrecked it again :/
>
> $ taskset 1 perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
>      Total time: 3.761 [sec]
>        3.761725 usecs/op
>          265835 ops/sec
>
> $ perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
>      Total time: 29.809 [sec]
>       29.809720 usecs/op
>          33546 ops/sec

Ok, it do looks like wake_affine() lost it's value...

> Now as far as I can see there's two options, either we find there's absolutely no benefit in wake_affine() as it stands today and we simply disable/remove it, or we go fix it.
>
> What we don't do is completely wreck it at atrocious cost.

I get your point, we should replace wake_affine() with some feature which could really achieve the goal to make client and server on same cpu.

But is the logical that the waker/wakee are server/client(or reversed) still works now? that sounds a little arbitrary to me...

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 17:10 +0800, Michael Wang wrote:

> On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
>> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>>> According to my understanding, in the old world, wake_affine() will only be used if curr_cpu and prev_cpu share cache, which means they are in one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't have the chance to spread the task out of that package.
>>
>> Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains don't have it set, but 'small' NUMA systems will have it set over the entire domain tree.
>
> Oh, I missed that point...
>
> But I don't get the reason to make NUMA level affine, cpus in different nodes share cache? doesn't make sense...

Contrary, it makes more sense, the more expensive it is to run 'remote' the better it is to pull 'related' tasks together.
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote:

> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>> But that's really some benefit hardly to be estimate, especially when the workload is heavy, the cost of wake_affine() is very high to calculated se one by one, is that worth for some benefit we could not promise?
>
> Look at something like pipe-test.. wake_affine() used to ensure both client/server ran on the same cpu, but then I think we added select_idle_sibling() and wrecked it again :/

Yeah, that's the absolute worst case for select_idle_sibling(), 100% synchronous, absolutely nothing to be gained by cross cpu scheduling. Fortunately, most tasks do more than that, but nonetheless, select_idle_sibling() definitely is a two faced little b*tch. I'd like to see the evil b*tch die, but something needs to replace it's pretty face. One thing that you can do is simply don't call it when the context switch rate is incredible.. its job is to recover overlap, if you're scheduling near your max, there's no win worth the cost.

> $ taskset 1 perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
>      Total time: 3.761 [sec]
>        3.761725 usecs/op
>          265835 ops/sec
>
> $ perf bench sched pipe
> # Running sched/pipe benchmark...
> # Executed 1000000 pipe operations between two tasks
>      Total time: 29.809 [sec]
>       29.809720 usecs/op
>          33546 ops/sec

Gak! Hm, are you running a kernel without the thinko fix? It's not good for this extreme testcase, but it doesn't suck _that_ bad ;-) nohz isn't exactly your friend with ultra switchers either.

Q6600:

marge:~ # taskset -c 3 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks
     Total time: 3.395 [sec]
       3.395357 usecs/op
         294519 ops/sec

marge:~ # perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks
     Total time: 4.212 [sec]
       4.212413 usecs/op
         237393 ops/sec

E5620:

rtbox:~ # taskset -c 0 perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks
     Total time: 2.558 [sec]
       2.558237 usecs/op
         390894 ops/sec

rtbox:~ # perf bench sched pipe
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks
     Total time: 4.588 [sec]
       4.588702 usecs/op
         217926 ops/sec
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
* Mike Galbraith <efa...@gmx.de> wrote:

> On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote:
>> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>>> But that's really some benefit hardly to be estimate, especially when the workload is heavy, the cost of wake_affine() is very high to calculated se one by one, is that worth for some benefit we could not promise?
>>
>> Look at something like pipe-test.. wake_affine() used to ensure both client/server ran on the same cpu, but then I think we added select_idle_sibling() and wrecked it again :/
>
> Yeah, that's the absolute worst case for select_idle_sibling(), 100% synchronous, absolutely nothing to be gained by cross cpu scheduling. Fortunately, most tasks do more than that, but nonetheless, select_idle_sibling() definitely is a two faced little b*tch. I'd like to see the evil b*tch die, but something needs to replace it's pretty face. One thing that you can do is simply don't call it when the context switch rate is incredible.. its job is to recover overlap, if you're scheduling near your max, there's no win worth the cost.

Couldn't we make the cutoff dependent on sched_migration_cost? If the wakeup comes in faster than that then don't spread.

Thanks,

	Ingo
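[ The cutoff Ingo suggests could be sketched roughly as below; all structure and function names are illustrative, not actual kernel code. The idea: keep a cheap decaying average of wakeup spacing, and skip the idle-sibling search once wakeups arrive faster than some fraction of sched_migration_cost: ]

```c
#include <stdbool.h>

typedef unsigned long long u64;

/* Hypothetical per-cpu bookkeeping; names are made up for illustration. */
struct wakeup_stats {
	u64 last_wakeup_ns;	/* timestamp of the previous wakeup */
	u64 avg_interval_ns;	/* decaying average of wakeup spacing */
};

static u64 sched_migration_cost_ns = 500000;	/* 0.5ms default */

/*
 * Decide whether an idle-sibling search is worth the cycles.  If
 * wakeups come in faster than (sched_migration_cost / 2), we are
 * scheduling near our max and spreading cannot recover any overlap.
 */
static bool worth_searching_idle_sibling(struct wakeup_stats *ws, u64 now_ns)
{
	u64 delta = now_ns - ws->last_wakeup_ns;

	/* cheap decaying average: 7/8 old + 1/8 new */
	ws->avg_interval_ns = (ws->avg_interval_ns * 7 + delta) / 8;
	ws->last_wakeup_ns = now_ns;

	return ws->avg_interval_ns > sched_migration_cost_ns / 2;
}
```

[ With the half-of-migration-cost threshold proposed later in the thread, a slow switcher keeps the search, while a pipe-test-like ultra switcher quickly decays below the threshold and stops paying for it. ]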
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 17:11 +0800, Michael Wang wrote:

> Ok, it do looks like wake_affine() lost it's value...

I'm not sure we can say that on this one benchmark, there's a preemption advantage to running on a single cpu for pipe-test as well. We'd need to create a better benchmark to test this, one that has some actual data payload and control over the initial spread of the tasks or so.

>> Now as far as I can see there's two options, either we find there's absolutely no benefit in wake_affine() as it stands today and we simply disable/remove it, or we go fix it.
>>
>> What we don't do is completely wreck it at atrocious cost.
>
> I get your point, we should replace wake_affine() with some feature which could really achieve the goal to make client and server on same cpu.
>
> But is the logical that the waker/wakee are server/client(or reversed) still works now? that sounds a little arbitrary to me...

Ah, its never really been about server/client per-se. Its just a specific example -- one that breaks down with the 1:n pgbench situation.

Wakeups in general can be considered to be a relation, suppose a hardware interrupt that received some data from a device and issues a wakeup to a task to consume this data. What CPU would be better suited to process this data then the one where its already cache hot.
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 05:39 PM, Peter Zijlstra wrote:

> On Fri, 2013-02-22 at 17:10 +0800, Michael Wang wrote:
>> On 02/22/2013 04:21 PM, Peter Zijlstra wrote:
>>> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>>>> According to my understanding, in the old world, wake_affine() will only be used if curr_cpu and prev_cpu share cache, which means they are in one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't have the chance to spread the task out of that package.
>>>
>>> Nah, look at where SD_WAKE_AFFINE is set. Only 'remote/big' NUMA domains don't have it set, but 'small' NUMA systems will have it set over the entire domain tree.
>>
>> Oh, I missed that point...
>>
>> But I don't get the reason to make NUMA level affine, cpus in different nodes share cache? doesn't make sense...
>
> Contrary, it makes more sense, the more expensive it is to run 'remote' the better it is to pull 'related' tasks together.

It increase the range to bound task, from one node to several, but also increase the range of target cpus, from one node's to several's, I still can't estimate the benefit, but I think I get the purpose, trying to make related tasks as close as possible, is that right?

Let me think about this point, I do believe there will be better way to take care of this purpose.

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote:

> * Mike Galbraith <efa...@gmx.de> wrote:
>> On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote:
>>> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>>>> But that's really some benefit hardly to be estimate, especially when the workload is heavy, the cost of wake_affine() is very high to calculated se one by one, is that worth for some benefit we could not promise?
>>>
>>> Look at something like pipe-test.. wake_affine() used to ensure both client/server ran on the same cpu, but then I think we added select_idle_sibling() and wrecked it again :/
>>
>> Yeah, that's the absolute worst case for select_idle_sibling(), 100% synchronous, absolutely nothing to be gained by cross cpu scheduling. Fortunately, most tasks do more than that, but nonetheless, select_idle_sibling() definitely is a two faced little b*tch. I'd like to see the evil b*tch die, but something needs to replace it's pretty face. One thing that you can do is simply don't call it when the context switch rate is incredible.. its job is to recover overlap, if you're scheduling near your max, there's no win worth the cost.
>
> Couldn't we make the cutoff dependent on sched_migration_cost? If the wakeup comes in faster than that then don't spread.

No, that's too high, you loose too much of the pretty face. It's a real problem. On AMD, the breakeven is much higher than Intel it seems as well. My E5620 can turn in a win on both tbench and even netperf TCP_RR!! iff nohz is throttled. For the Opterons I've played with, it's a loser at even tbench context switch rate, needs to be cut off earlier.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 05:57 PM, Peter Zijlstra wrote:

> On Fri, 2013-02-22 at 17:11 +0800, Michael Wang wrote:
>> Ok, it do looks like wake_affine() lost it's value...
>
> I'm not sure we can say that on this one benchmark, there's a preemption advantage to running on a single cpu for pipe-test as well. We'd need to create a better benchmark to test this, one that has some actual data payload and control over the initial spread of the tasks or so.
>
>>> Now as far as I can see there's two options, either we find there's absolutely no benefit in wake_affine() as it stands today and we simply disable/remove it, or we go fix it.
>>>
>>> What we don't do is completely wreck it at atrocious cost.
>>
>> I get your point, we should replace wake_affine() with some feature which could really achieve the goal to make client and server on same cpu.
>>
>> But is the logical that the waker/wakee are server/client(or reversed) still works now? that sounds a little arbitrary to me...
>
> Ah, its never really been about server/client per-se. Its just a specific example -- one that breaks down with the 1:n pgbench situation.
>
> Wakeups in general can be considered to be a relation, suppose a hardware interrupt that received some data from a device and issues a wakeup to a task to consume this data. What CPU would be better suited to process this data then the one where its already cache hot.

I see, honestly, I realized that I have underestimated the benefit we gain from it when saw your testing results...

We do need some better approach to replace wake_affine(), hmm...I need a draft board now...

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
* Mike Galbraith <efa...@gmx.de> wrote:

> On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote:
>> * Mike Galbraith <efa...@gmx.de> wrote:
>>> On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote:
>>>> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>>>>> But that's really some benefit hardly to be estimate, especially when the workload is heavy, the cost of wake_affine() is very high to calculated se one by one, is that worth for some benefit we could not promise?
>>>>
>>>> Look at something like pipe-test.. wake_affine() used to ensure both client/server ran on the same cpu, but then I think we added select_idle_sibling() and wrecked it again :/
>>>
>>> Yeah, that's the absolute worst case for select_idle_sibling(), 100% synchronous, absolutely nothing to be gained by cross cpu scheduling. Fortunately, most tasks do more than that, but nonetheless, select_idle_sibling() definitely is a two faced little b*tch. I'd like to see the evil b*tch die, but something needs to replace it's pretty face. One thing that you can do is simply don't call it when the context switch rate is incredible.. its job is to recover overlap, if you're scheduling near your max, there's no win worth the cost.
>>
>> Couldn't we make the cutoff dependent on sched_migration_cost? If the wakeup comes in faster than that then don't spread.
>
> No, that's too high, you loose too much of the pretty face. [...]

Then a logical proportion of it - such as half of it?

Thanks,

	Ingo
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 13:11 +0100, Ingo Molnar wrote:

> * Mike Galbraith <efa...@gmx.de> wrote:
>> On Fri, 2013-02-22 at 10:54 +0100, Ingo Molnar wrote:
>>> * Mike Galbraith <efa...@gmx.de> wrote:
>>>> On Fri, 2013-02-22 at 09:36 +0100, Peter Zijlstra wrote:
>>>>> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>>>>>> But that's really some benefit hardly to be estimate, especially when the workload is heavy, the cost of wake_affine() is very high to calculated se one by one, is that worth for some benefit we could not promise?
>>>>>
>>>>> Look at something like pipe-test.. wake_affine() used to ensure both client/server ran on the same cpu, but then I think we added select_idle_sibling() and wrecked it again :/
>>>>
>>>> Yeah, that's the absolute worst case for select_idle_sibling(), 100% synchronous, absolutely nothing to be gained by cross cpu scheduling. Fortunately, most tasks do more than that, but nonetheless, select_idle_sibling() definitely is a two faced little b*tch. I'd like to see the evil b*tch die, but something needs to replace it's pretty face. One thing that you can do is simply don't call it when the context switch rate is incredible.. its job is to recover overlap, if you're scheduling near your max, there's no win worth the cost.
>>>
>>> Couldn't we make the cutoff dependent on sched_migration_cost? If the wakeup comes in faster than that then don't spread.
>>
>> No, that's too high, you loose too much of the pretty face. [...]
>
> Then a logical proportion of it - such as half of it?

Hm. Better would maybe be a quick boot time benchmark, and use some multiple of your cross core pipe ping-pong time? That we know is a complete waste of cycles, because almost all cycles are scheduler cycles with no other work to be done, making firing up another scheduler rather pointless. If we're approaching that rate, we're approaching bad idea.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
* Mike Galbraith <efa...@gmx.de> wrote:

>>> No, that's too high, you loose too much of the pretty face. [...]
>>
>> Then a logical proportion of it - such as half of it?
>
> Hm. Better would maybe be a quick boot time benchmark, and use some multiple of your cross core pipe ping-pong time? That we know is a complete waste of cycles, because almost all cycles are scheduler cycles with no other work to be done, making firing up another scheduler rather pointless. If we're approaching that rate, we're approaching bad idea.

Well, one problem with such dynamic boot time measurements is that it introduces a certain amount of uncertainty that persists for the whole lifetime of the booted up box - and it also sucks in any sort of non-deterministic execution environment, such as virtualized systems.

I think it might be better to measure the scheduling rate all the time, and save the _shortest_ cross-cpu-wakeup and same-cpu-wakeup latencies (since bootup) as a reference number.

We might be able to pull this off pretty cheaply as the scheduler clock is running all the time and we have all the timestamps needed.

Pretty quickly after bootup this 'shortest latency' would settle down to a very system specific (and pretty accurate) value.

[ One downside would be an increased sensitivity to the accuracy and monotonicity of the scheduler clock - but that's something we want to improve on anyway - and 'worst case' we get too short latencies and we are where we are today. So it can only improve the situation IMO. ]

Would you be interested in trying to hack on an auto-tuning feature like this?

Thanks,

	Ingo
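[ A rough user-space model of the auto-tuning described above, with all names invented for illustration; the kernel side would hang off sched_clock() timestamps in the wakeup path. Keep the shortest observed same-cpu and cross-cpu wakeup latencies since boot, and derive the reference cross-cpu penalty from them: ]

```c
typedef unsigned long long u64;

/* Shortest observed latencies since boot; hypothetical globals. */
static u64 min_same_cpu_wakeup_ns = ~0ULL;
static u64 min_cross_cpu_wakeup_ns = ~0ULL;

/*
 * Called on every wakeup with the delta between the wakeup timestamp
 * and the task actually hitting the cpu.  A minimum only ever
 * decreases, so it settles quickly to a system-specific floor; a
 * non-monotonic clock can only push the floor too low, which merely
 * degrades to today's behaviour.
 */
static void record_wakeup_latency(u64 latency_ns, int same_cpu)
{
	u64 *min = same_cpu ? &min_same_cpu_wakeup_ns
			    : &min_cross_cpu_wakeup_ns;
	if (latency_ns < *min)
		*min = latency_ns;
}

/* Reference cost of a cross-cpu wakeup relative to a same-cpu one. */
static u64 cross_cpu_penalty_ns(void)
{
	if (min_cross_cpu_wakeup_ns == ~0ULL ||
	    min_same_cpu_wakeup_ns == ~0ULL)
		return 0;	/* not settled yet */
	if (min_cross_cpu_wakeup_ns <= min_same_cpu_wakeup_ns)
		return 0;	/* clock jitter; fail safe */
	return min_cross_cpu_wakeup_ns - min_same_cpu_wakeup_ns;
}
```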
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 14:06 +0100, Ingo Molnar wrote:

> * Mike Galbraith <efa...@gmx.de> wrote:
>>>> No, that's too high, you loose too much of the pretty face. [...]
>>>
>>> Then a logical proportion of it - such as half of it?
>>
>> Hm. Better would maybe be a quick boot time benchmark, and use some multiple of your cross core pipe ping-pong time? That we know is a complete waste of cycles, because almost all cycles are scheduler cycles with no other work to be done, making firing up another scheduler rather pointless. If we're approaching that rate, we're approaching bad idea.
>
> Well, one problem with such dynamic boot time measurements is that it introduces a certain amount of uncertainty that persists for the whole lifetime of the booted up box - and it also sucks in any sort of non-deterministic execution environment, such as virtualized systems.

Ok, bad idea.

> I think it might be better to measure the scheduling rate all the time, and save the _shortest_ cross-cpu-wakeup and same-cpu-wakeup latencies (since bootup) as a reference number.
>
> We might be able to pull this off pretty cheaply as the scheduler clock is running all the time and we have all the timestamps needed.

Yeah, that might work. We have some quick kthreads, so saving ctx distance may get close enough to scheduler cost to be good enough.

> Pretty quickly after bootup this 'shortest latency' would settle down to a very system specific (and pretty accurate) value.
>
> [ One downside would be an increased sensitivity to the accuracy and monotonicity of the scheduler clock - but that's something we want to improve on anyway - and 'worst case' we get too short latencies and we are where we are today. So it can only improve the situation IMO. ]
>
> Would you be interested in trying to hack on an auto-tuning feature like this?

Yeah, should be easy, but rainy day has to happen so I have time to measure twiddle measure measure curse tweak.. ;-)

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 15:30 +0100, Mike Galbraith wrote:

> On Fri, 2013-02-22 at 14:06 +0100, Ingo Molnar wrote:
>> I think it might be better to measure the scheduling rate all the time, and save the _shortest_ cross-cpu-wakeup and same-cpu-wakeup latencies (since bootup) as a reference number.
>>
>> We might be able to pull this off pretty cheaply as the scheduler clock is running all the time and we have all the timestamps needed.
>
> Yeah, that might work. We have some quick kthreads, so saving ctx distance may get close enough to scheduler cost to be good enough.

Or better, shortest idle to idle, that would include the current (bad) nohz cost, and automatically shrink away when that cost shrinks.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 01:02 PM, Mike Galbraith wrote:

> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
>>>> But is this patch set really cause regression on your Q6600? It may sacrificed some thing, but I still think it will benefit far more, especially on huge systems.
>>>
>>> We spread on FORK/EXEC, and will no longer will pull communicating tasks back to a shared cache with the new logic preferring to leave wakee remote, so while no, I haven't tested (will try to find round tuit) it seems it _must_ hurt. Dragging data from one llc to the other on Q6600 hurts a LOT. Every time a client and server are cross llc, it's a huge hit. The previous logic pulled communicating tasks together right when it matters the most, intermittent load... or interactive use.
>>
>> I agree that this is a problem need to be solved, but don't agree that wake_affine() is the solution.
>
> It's not perfect, but it's better than no countering force at all. It's a relic of the dark ages, when affine meant L2, ie this cpu. Now days, affine has a whole new meaning, L3, so it could be done differently, but _some_ kind of opposing force is required.
>
>> According to my understanding, in the old world, wake_affine() will only be used if curr_cpu and prev_cpu share cache, which means they are in one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't have the chance to spread the task out of that package.
>
> ? affine_sd is the first domain spanning both cpus, that may be NODE. True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is set that is. Would be nice to be able to do that without shredding performance.

That's right, we need two conditions in each select instance:
1. prev_cpu and curr_cpu are not affine
2. SD_WAKE_BALANCE

> Off the top of my pointy head, I can think of a way to _maybe_ improve the "affine" wakeup criteria: Add a small (package size? and very fast) FIFO queue to task struct, record waker/wakee relationship. If relationship exists in that queue (rbtree), try to wake local, if not, wake remote. The thought is to identify situations ala 1:N pgbench where you really need to keep the load spread. That need arises when the sum wakees + waker won't fit in one cache. True buddies would always hit (hm, hit rate), always try to become affine where they thrive. 1:N stuff starts missing when client count exceeds package size, starts expanding it's horizons. 'Course you would still need to NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls and whatnot. With a little more smarts, we could have happy 1:N, and buddies don't have to chat through 2m thick walls to make 1:N scale as well as it can before it dies of stupidity.

So this is trying to take care the condition when curr_cpu(local) and prev_cpu(remote) are on different nodes, which in the old world, wake_affine() won't be invoked, correct?

Hmm...I think this maybe a good additional checking before enter balance path, but I could not estimate the cost to record the relationship at this moment of time...

Whatever, after applied the affine logical into new world, it will gain the ability to spread tasks cross nodes just like the old world, your idea may be an optimization, but the logical is out of the changing in this patch set, which means if it benefits, the beneficiary will be not only new but also old.

Regards,
Michael Wang
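[ Mike's FIFO idea quoted above could be modelled like this; a user-space sketch with made-up names, where the real thing would live in task_struct and need locking and accounting care. Each task keeps a tiny ring of recent waker ids, and a wakeup is treated as "affine" only when the pair already has history: ]

```c
#include <stdbool.h>

#define WAKEE_RING_SIZE 8	/* roughly package size; an assumption */

/* Hypothetical per-task record of recent wakeup partners. */
struct wakee_ring {
	int ids[WAKEE_RING_SIZE];
	int head;
	int count;
};

/*
 * Returns true if 'waker' already appears in the wakee's recent
 * history (a buddy: try to wake local), false otherwise (wake
 * remote).  A miss records the waker, so a 1:N waker stops hitting
 * once the client count overflows the ring, which is exactly the
 * point where keeping the load spread starts to pay.
 */
static bool wakeup_is_affine(struct wakee_ring *r, int waker)
{
	for (int i = 0; i < r->count; i++)
		if (r->ids[i] == waker)
			return true;

	r->ids[r->head] = waker;
	r->head = (r->head + 1) % WAKEE_RING_SIZE;
	if (r->count < WAKEE_RING_SIZE)
		r->count++;
	return false;
}
```

[ True buddies keep hitting and stay affine; a pgbench-style 1:N server floods the ring, its entries get evicted, and its wakeups go remote, which matches the behaviour described in the quoted paragraph. ]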
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 14:06 +0800, Michael Wang wrote:

> On 02/22/2013 01:08 PM, Mike Galbraith wrote:
>> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>>> According to the testing result, I could not agree this purpose of wake_affine() benefit us, but I'm sure that wake_affine() is a terrible performance killer when system is busy.
>>
>> (hm, result is singular.. pgbench in 1:N mode only?)
>
> I'm not sure about how pgbench implemented, all I know is it will create several instance and access the database, I suppose no different from several threads access database (1 server and N clients?).

It's user switchable.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 13:26 +0800, Michael Wang wrote: > Just confirm that I'm not on the wrong way, did the 1:N mode here means > 1 task forked N threads, and child always talk with father? Yes, one server, many clients. -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 01:08 PM, Mike Galbraith wrote: > On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote: > >> According to the testing result, I could not agree this purpose of >> wake_affine() benefit us, but I'm sure that wake_affine() is a terrible >> performance killer when system is busy. > > (hm, result is singular.. pgbench in 1:N mode only?) I'm not sure how pgbench is implemented; all I know is that it creates several instances which access the database, which I suppose is no different from several threads accessing the database (1 server and N clients?). There is an improvement because, when the system is busy, wake_affine() will be skipped. In the old world, when the system is busy, wake_affine() would only be skipped if prev_cpu and curr_cpu belonged to different nodes. Regards, Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 01:02 PM, Mike Galbraith wrote: > On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: >> On 02/21/2013 05:43 PM, Mike Galbraith wrote: >>> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote: >>> But is this patch set really cause regression on your Q6600? It may sacrificed some thing, but I still think it will benefit far more, especially on huge systems. >>> >>> We spread on FORK/EXEC, and will no longer will pull communicating tasks >>> back to a shared cache with the new logic preferring to leave wakee >>> remote, so while no, I haven't tested (will try to find round tuit) it >>> seems it _must_ hurt. Dragging data from one llc to the other on Q6600 >>> hurts a LOT. Every time a client and server are cross llc, it's a huge >>> hit. The previous logic pulled communicating tasks together right when >>> it matters the most, intermittent load... or interactive use. >> >> I agree that this is a problem need to be solved, but don't agree that >> wake_affine() is the solution. > > It's not perfect, but it's better than no countering force at all. It's > a relic of the dark ages, when affine meant L2, ie this cpu. Now days, > affine has a whole new meaning, L3, so it could be done differently, but > _some_ kind of opposing force is required. > >> According to my understanding, in the old world, wake_affine() will only >> be used if curr_cpu and prev_cpu share cache, which means they are in >> one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't >> have the chance to spread the task out of that package. > > ? affine_sd is the first domain spanning both cpus, that may be NODE. > True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is > set that is. Would be nice to be able to do that without shredding > performance. > > Off the top of my pointy head, I can think of a way to _maybe_ improve > the "affine" wakeup criteria: Add a small (package size? 
and very fast) > FIFO queue to task struct, record waker/wakee relationship. If > relationship exists in that queue (rbtree), try to wake local, if not, > wake remote. The thought is to identify situations ala 1:N pgbench > where you really need to keep the load spread. That need arises when > the sum wakees + waker won't fit in one cache. True buddies would > always hit (hm, hit rate), always try to become affine where they > thrive. 1:N stuff starts missing when client count exceeds package > size, starts expanding its horizons. 'Course you would still need to > NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls > and whatnot. With a little more smarts, we could have happy 1:N, and > buddies don't have to chat through 2m thick walls to make 1:N scale as > well as it can before it dies of stupidity. Just to confirm that I'm not on the wrong track: does the 1:N mode here mean 1 task forks N threads, and the children always talk with the father? Regards, Michael Wang > > -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote: > According to the testing result, I could not agree this purpose of > wake_affine() benefit us, but I'm sure that wake_affine() is a terrible > performance killer when system is busy. (hm, result is singular.. pgbench in 1:N mode only?)
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote: > On 02/21/2013 05:43 PM, Mike Galbraith wrote: > > On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote: > > > >> But is this patch set really cause regression on your Q6600? It may > >> sacrificed some thing, but I still think it will benefit far more, > >> especially on huge systems. > > > > We spread on FORK/EXEC, and will no longer pull communicating tasks > > back to a shared cache with the new logic preferring to leave wakee > > remote, so while no, I haven't tested (will try to find round tuit) it > > seems it _must_ hurt. Dragging data from one llc to the other on Q6600 > > hurts a LOT. Every time a client and server are cross llc, it's a huge > > hit. The previous logic pulled communicating tasks together right when > > it matters the most, intermittent load... or interactive use. > > I agree that this is a problem need to be solved, but don't agree that > wake_affine() is the solution. It's not perfect, but it's better than no countering force at all. It's a relic of the dark ages, when affine meant L2, ie this cpu. Nowadays, affine has a whole new meaning, L3, so it could be done differently, but _some_ kind of opposing force is required. > According to my understanding, in the old world, wake_affine() will only > be used if curr_cpu and prev_cpu share cache, which means they are in > one package, whatever search in llc sd of curr_cpu or prev_cpu, we won't > have the chance to spread the task out of that package. ? affine_sd is the first domain spanning both cpus, that may be NODE. True we won't ever spread in the wakeup path unless SD_WAKE_BALANCE is set that is. Would be nice to be able to do that without shredding performance. Off the top of my pointy head, I can think of a way to _maybe_ improve the "affine" wakeup criteria: Add a small (package size? and very fast) FIFO queue to task struct, record waker/wakee relationship. 
If relationship exists in that queue (rbtree), try to wake local, if not, wake remote. The thought is to identify situations ala 1:N pgbench where you really need to keep the load spread. That need arises when the sum wakees + waker won't fit in one cache. True buddies would always hit (hm, hit rate), always try to become affine where they thrive. 1:N stuff starts missing when client count exceeds package size, starts expanding its horizons. 'Course you would still need to NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls and whatnot. With a little more smarts, we could have happy 1:N, and buddies don't have to chat through 2m thick walls to make 1:N scale as well as it can before it dies of stupidity. -Mike
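[Editor's note: the FIFO idea above is only a sketch, but its intended behaviour can be illustrated with a toy userspace model. All names here are invented for illustration; nothing like this exists in task_struct. The point is that true buddy pairs keep hitting the small ring and stay affine, while a 1:N server with N larger than the ring keeps missing, so its clients stay spread.]

```c
#include <stdbool.h>

#define BUDDY_SLOTS 4   /* "package size?" -- small and fast, per the sketch above */

/* Hypothetical per-task ring of recently woken task ids. */
struct buddy_fifo {
    int pid[BUDDY_SLOTS];
    int head;
};

static void buddy_record(struct buddy_fifo *f, int wakee)
{
    f->pid[f->head] = wakee;
    f->head = (f->head + 1) % BUDDY_SLOTS;
}

static bool buddy_hit(const struct buddy_fifo *f, int wakee)
{
    for (int i = 0; i < BUDDY_SLOTS; i++)
        if (f->pid[i] == wakee)
            return true;
    return false;
}

/* Decide local vs remote wakeup: a hit means an established
 * relationship, so try to wake local; a miss means wake remote.
 * A 1:N server overflows the ring once N exceeds the slot count,
 * so most of its clients miss and the load stays spread. */
static bool wake_local(struct buddy_fifo *f, int wakee)
{
    bool hit = buddy_hit(f, wakee);
    buddy_record(f, wakee);
    return hit;
}
```

The imbalance NAK and the NUMA NAK Mike mentions would sit on top of this check; they are omitted here.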
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/21/2013 06:20 PM, Peter Zijlstra wrote: > On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: >> The old logical when locate affine_sd is: >> >> if prev_cpu != curr_cpu >> if wake_affine() >> prev_cpu = curr_cpu >> new_cpu = select_idle_sibling(prev_cpu) >> return new_cpu >> >> The new logical is same to the old one if prev_cpu == curr_cpu, so >> let's >> simplify the old logical like: >> >> if wake_affine() >> new_cpu = select_idle_sibling(curr_cpu) >> else >> new_cpu = select_idle_sibling(prev_cpu) >> >> return new_cpu >> >> Actually that doesn't make sense. > > It does :-) > >> I think wake_affine() is trying to check whether move a task from >> prev_cpu to curr_cpu will break the balance in affine_sd or not, but >> why >> won't break balance means curr_cpu is better than prev_cpu for >> searching >> the idle cpu? > > It doesn't, the whole affine wakeup stuff is meant to pull waking tasks > towards the cpu that does the wakeup, we limit this by putting bounds on > the imbalance this may create. > > The reason we want to run tasks on the cpu that does the wakeup is > because that cpu 'obviously' is running something related and it seems > like a good idea to run related tasks close together. > > So look at affine wakeups as a force that groups related tasks. That's right, and it's one point I had missed when judging wake_affine()... But that is a benefit which is hard to estimate, and especially when the workload is heavy, the cost of wake_affine() calculating the se loads one by one is very high. Is that worth a benefit we cannot promise? According to the testing results, I cannot agree that this purpose of wake_affine() benefits us, but I'm sure that wake_affine() is a terrible performance killer when the system is busy. 
> >> So the new logical in this patch set is: >> >> new_cpu = select_idle_sibling(prev_cpu) >> if idle_cpu(new_cpu) >> return new_cpu >> >> new_cpu = select_idle_sibling(curr_cpu) >> if idle_cpu(new_cpu) { >> if wake_affine() >> return new_cpu >> } >> >> return prev_cpu >> >> And now, unless we are really going to move load from prev_cpu to >> curr_cpu, we won't use wake_affine() any more. > > That completely breaks stuff, not cool. Could you please give more detail on which point you think is bad? Regards, Michael Wang
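[Editor's note: the "bounds on the imbalance" Peter describes can be illustrated with a hedged toy check. The 25% slack and the load units are illustrative only; the real wake_affine() weighs per-entity loads quite differently, which is exactly the per-se cost Michael complains about.]

```c
#include <stdbool.h>

/* Toy affine-wakeup gate: allow pulling the wakee toward the waker's
 * cpu only if the waker's cpu, after the pull, would carry no more
 * than 125% of the load prev_cpu was carrying. Integer math avoids
 * floating point, as kernel code would. */
static bool affine_ok(long this_load, long prev_load, long wakee_load)
{
    /* tolerate up to 25% imbalance toward the waker's cpu */
    return (this_load + wakee_load) * 100 <= prev_load * 125;
}
```

Without some such bound, every wakeup would pull, stacking load on one package; with it, the pull is damped, which is the "opposing force" Mike argues for elsewhere in the thread.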
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/21/2013 05:43 PM, Mike Galbraith wrote: > On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote: > >> But is this patch set really cause regression on your Q6600? It may >> sacrificed some thing, but I still think it will benefit far more, >> especially on huge systems. > > We spread on FORK/EXEC, and will no longer pull communicating tasks > back to a shared cache with the new logic preferring to leave wakee > remote, so while no, I haven't tested (will try to find round tuit) it > seems it _must_ hurt. Dragging data from one llc to the other on Q6600 > hurts a LOT. Every time a client and server are cross llc, it's a huge > hit. The previous logic pulled communicating tasks together right when > it matters the most, intermittent load... or interactive use. I agree that this is a problem that needs to be solved, but I don't agree that wake_affine() is the solution. According to my understanding, in the old world, wake_affine() will only be used if curr_cpu and prev_cpu share cache, which means they are in one package; whether we search in the llc sd of curr_cpu or of prev_cpu, we won't have the chance to spread the task out of that package. I'm going to restore the logic of only doing select_idle_sibling() when prev_cpu and curr_cpu are affine, so the new logic will only prefer leaving the task in its old package if both prev_cpu and curr_cpu are in that package. I think this could solve the problem, couldn't it? Regards, Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: > The old logical when locate affine_sd is: > > if prev_cpu != curr_cpu > if wake_affine() > prev_cpu = curr_cpu > new_cpu = select_idle_sibling(prev_cpu) > return new_cpu > > The new logical is same to the old one if prev_cpu == curr_cpu, so > let's > simplify the old logical like: > > if wake_affine() > new_cpu = select_idle_sibling(curr_cpu) > else > new_cpu = select_idle_sibling(prev_cpu) > > return new_cpu > > Actually that doesn't make sense. It does :-) > I think wake_affine() is trying to check whether move a task from > prev_cpu to curr_cpu will break the balance in affine_sd or not, but > why > won't break balance means curr_cpu is better than prev_cpu for > searching > the idle cpu? It doesn't, the whole affine wakeup stuff is meant to pull waking tasks towards the cpu that does the wakeup, we limit this by putting bounds on the imbalance this may create. The reason we want to run tasks on the cpu that does the wakeup is because that cpu 'obviously' is running something related and it seems like a good idea to run related tasks close together. So look at affine wakeups as a force that groups related tasks. > So the new logical in this patch set is: > > new_cpu = select_idle_sibling(prev_cpu) > if idle_cpu(new_cpu) > return new_cpu > > new_cpu = select_idle_sibling(curr_cpu) > if idle_cpu(new_cpu) { > if wake_affine() > return new_cpu > } > > return prev_cpu > > And now, unless we are really going to move load from prev_cpu to > curr_cpu, we won't use wake_affine() any more. That completely breaks stuff, not cool.
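[Editor's note: for readers following along, the two selection flows quoted above can be sketched as plain C against a toy model. wake_affine(), idle_cpu() and select_idle_sibling() are stubbed here; the real kernel versions walk the sched domains, so this shows only the control flow under discussion, not kernel code.]

```c
#include <stdbool.h>

/* Toy model of the machine state the two flows consult. */
struct toy {
    bool idle[4];    /* per-cpu idle state */
    bool affine_ok;  /* stubbed wake_affine() verdict */
};

/* Stand-in: the kernel scans the cache-sharing domain around
 * 'target' for an idle cpu; here we just hand back the target. */
static int select_idle_sibling(const struct toy *t, int target)
{
    (void)t;
    return target;
}

static bool idle_cpu(const struct toy *t, int cpu)  { return t->idle[cpu]; }
static bool wake_affine(const struct toy *t)        { return t->affine_ok; }

/* Old flow: wake_affine() decides the search origin up front. */
static int old_pick(const struct toy *t, int prev_cpu, int curr_cpu)
{
    if (prev_cpu != curr_cpu && wake_affine(t))
        prev_cpu = curr_cpu;
    return select_idle_sibling(t, prev_cpu);
}

/* New flow (the patch set): prefer an idle cpu near prev_cpu, and
 * consult wake_affine() only when we would actually migrate toward
 * the waker -- the point Peter objects to. */
static int new_pick(const struct toy *t, int prev_cpu, int curr_cpu)
{
    int cpu = select_idle_sibling(t, prev_cpu);
    if (idle_cpu(t, cpu))
        return cpu;
    cpu = select_idle_sibling(t, curr_cpu);
    if (idle_cpu(t, cpu) && wake_affine(t))
        return cpu;
    return prev_cpu;
}
```

The behavioural difference is visible in the toy: the old flow lets the affine verdict move the wakee toward the waker even when prev_cpu has room, while the new flow only asks wake_affine() after prev_cpu's side has come up empty.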
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote: > But is this patch set really cause regression on your Q6600? It may > sacrificed some thing, but I still think it will benefit far more, > especially on huge systems. We spread on FORK/EXEC, and will no longer pull communicating tasks back to a shared cache with the new logic preferring to leave the wakee remote, so while no, I haven't tested (will try to find round tuit) it seems it _must_ hurt. Dragging data from one llc to the other on Q6600 hurts a LOT. Every time a client and server are cross llc, it's a huge hit. The previous logic pulled communicating tasks together right when it matters the most: intermittent load... or interactive use. -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/21/2013 04:10 PM, Mike Galbraith wrote: > On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: >> On 02/21/2013 02:11 PM, Mike Galbraith wrote: >>> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: On 02/20/2013 06:49 PM, Ingo Molnar wrote: [snip] >> [snip] if wake_affine() new_cpu = select_idle_sibling(curr_cpu) else new_cpu = select_idle_sibling(prev_cpu) return new_cpu Actually that doesn't make sense. I think wake_affine() is trying to check whether move a task from prev_cpu to curr_cpu will break the balance in affine_sd or not, but why won't break balance means curr_cpu is better than prev_cpu for searching the idle cpu? >>> >>> You could argue that it's impossible to break balance by moving any task >>> to any idle cpu, but that would mean bouncing tasks cross node on every >>> wakeup is fine, which it isn't. >> >> I don't get it... could you please give me more detail on how >> wake_affine() related with bouncing? > > If we didn't ever ask if it's ok, we'd always pull, and stack load up on > one package if there's the tiniest of holes to stuff a task into, > periodic balance forcibly rips buddies back apart, repeat. At least > with wake_affine() in the loop, there's somewhat of a damper. Oh, I think I got the reason why the old logic checks the affine_sd first: when prev_cpu and curr_cpu belong to different packages, there will be a chance to enter the balance path. That seems like a solution for this problem, amazing ;-) You are right, without this logic, chances to balance load between packages would be missed; I will apply it in the next version. Regards, Michael Wang > So the new logical in this patch set is: new_cpu = select_idle_sibling(prev_cpu) if idle_cpu(new_cpu) return new_cpu >>> >>> So you tilted the scales in favor of leaving tasks in their current >>> package, which should benefit large footprint tasks, but should also >>> penalize light communicating tasks. >> >> Yes, I'd prefer to wakeup the task on a cpu which: >> 1. idle >> 2. 
close to prev_cpu >> >> So if both curr_cpu and prev_cpu have idle cpu in their topology, which >> one is better? that depends on how task benefit from cache and the >> balance situation, whatever, I don't think the benefit worth the high >> cost of wake_affine() in most cases... > > We've always used wake_affine() before, yet been able to schedule at > high frequency, so I don't see that it can be _that_ expensive. I > haven't actually measured lately (lng time) though. > > WRT cost/benefit of migration, yeah, it depends entirely on the tasks, > some will gain, some will lose. On a modern single processor box, it > just doesn't matter, there's only one llc (two s_i_s() calls = oopsie), > but on my beloved old Q6600 or a big box, it'll matter a lot to > something. NUMA balance will deal with big boxen, my trusty old Q6600 > will likely get all upset with some localhost stuff. > > -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/21/2013 04:10 PM, Mike Galbraith wrote: > On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: >> On 02/21/2013 02:11 PM, Mike Galbraith wrote: >>> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: On 02/20/2013 06:49 PM, Ingo Molnar wrote: [snip] >> [snip] if wake_affine() new_cpu = select_idle_sibling(curr_cpu) else new_cpu = select_idle_sibling(prev_cpu) return new_cpu Actually that doesn't make sense. I think wake_affine() is trying to check whether move a task from prev_cpu to curr_cpu will break the balance in affine_sd or not, but why won't break balance means curr_cpu is better than prev_cpu for searching the idle cpu? >>> >>> You could argue that it's impossible to break balance by moving any task >>> to any idle cpu, but that would mean bouncing tasks cross node on every >>> wakeup is fine, which it isn't. >> >> I don't get it... could you please give me more detail on how >> wake_affine() related with bouncing? > > If we didn't ever ask if it's ok, we'd always pull, and stack load up on > one package if there's the tiniest of holes to stuff a task into, > periodic balance forcibly rips buddies back apart, repeat. At least > with wake_affine() in the loop, there's somewhat of a damper. I think I got your point: it is a question about the possibility of locating an idle cpu which belongs to a different package from prev_cpu, and how wake_affine() helps with that. The old logic requires prev_cpu and curr_cpu to be affine, which means they share caches on some level, usually meaning they are in the same package, and select_idle_sibling() starts searching from the top cache-sharing level, which usually means searching all the cpus in one package. So, usually, both cases will only locate an idle cpu in their own package. I could not figure out in which case wake_affine() brings a large benefit, but it does a lot of harm according to the testing results. 
> So the new logical in this patch set is: new_cpu = select_idle_sibling(prev_cpu) if idle_cpu(new_cpu) return new_cpu >>> >>> So you tilted the scales in favor of leaving tasks in their current >>> package, which should benefit large footprint tasks, but should also >>> penalize light communicating tasks. >> >> Yes, I'd prefer to wakeup the task on a cpu which: >> 1. idle >> 2. close to prev_cpu >> >> So if both curr_cpu and prev_cpu have idle cpu in their topology, which >> one is better? that depends on how task benefit from cache and the >> balance situation, whatever, I don't think the benefit worth the high >> cost of wake_affine() in most cases... > > We've always used wake_affine() before, yet been able to schedule at > high frequency, so I don't see that it can be _that_ expensive. I > haven't actually measured lately (lng time) though. > > WRT cost/benefit of migration, yeah, it depends entirely on the tasks, > some will gain, some will lose. On a modern single processor box, it > just doesn't matter, there's only one llc (two s_i_s() calls = oopsie), > but on my beloved old Q6600 or a big box, it'll matter a lot to > something. NUMA balance will deal with big boxen, my trusty old Q6600 > will likely get all upset with some localhost stuff. But does this patch set really cause a regression on your Q6600? It may sacrifice some things, but I still think it will benefit far more, especially on huge systems. Regards, Michael Wang
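[Editor's note: Michael's observation that the idle search never leaves the package can be illustrated with a toy topology. Two packages of four cpus is an assumed layout; the real select_idle_sibling() walks sched-domain spans, not a fixed array.]

```c
#include <stdbool.h>

#define NR_CPUS 8
#define CPUS_PER_PKG 4   /* assumed: two packages of four cpus each */

/* Toy select_idle_sibling(): the search is confined to the target's
 * cache-sharing package, so it can never land on an idle cpu in the
 * other package -- which is why, in the old world, wake_affine()'s
 * choice of origin decides which package the wakee ends up in. */
static int select_idle_sibling(const bool idle[NR_CPUS], int target)
{
    int pkg_first = (target / CPUS_PER_PKG) * CPUS_PER_PKG;

    if (idle[target])
        return target;
    for (int cpu = pkg_first; cpu < pkg_first + CPUS_PER_PKG; cpu++)
        if (idle[cpu])
            return cpu;
    return target;   /* no idle sibling: stay put */
}
```

With target in package 0 and the only idle cpu in package 1, the search comes back empty-handed, matching Michael's point that both the prev_cpu and curr_cpu searches stay package-local.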
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: > On 02/21/2013 02:11 PM, Mike Galbraith wrote: > > On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: > >> On 02/20/2013 06:49 PM, Ingo Molnar wrote: > >> [snip] > [snip] > >> > >>if wake_affine() > >>new_cpu = select_idle_sibling(curr_cpu) > >>else > >>new_cpu = select_idle_sibling(prev_cpu) > >> > >>return new_cpu > >> > >> Actually that doesn't make sense. > >> > >> I think wake_affine() is trying to check whether move a task from > >> prev_cpu to curr_cpu will break the balance in affine_sd or not, but why > >> won't break balance means curr_cpu is better than prev_cpu for searching > >> the idle cpu? > > > > You could argue that it's impossible to break balance by moving any task > > to any idle cpu, but that would mean bouncing tasks cross node on every > > wakeup is fine, which it isn't. > > I don't get it... could you please give me more detail on how > wake_affine() related with bouncing? If we didn't ever ask if it's ok, we'd always pull, and stack load up on one package if there's the tiniest of holes to stuff a task into, periodic balance forcibly rips buddies back apart, repeat. At least with wake_affine() in the loop, there's somewhat of a damper. > >> So the new logical in this patch set is: > >> > >>new_cpu = select_idle_sibling(prev_cpu) > >>if idle_cpu(new_cpu) > >>return new_cpu > > > > So you tilted the scales in favor of leaving tasks in their current > > package, which should benefit large footprint tasks, but should also > > penalize light communicating tasks. > > Yes, I'd prefer to wakeup the task on a cpu which: > 1. idle > 2. close to prev_cpu > > So if both curr_cpu and prev_cpu have idle cpu in their topology, which > one is better? that depends on how task benefit from cache and the > balance situation, whatever, I don't think the benefit worth the high > cost of wake_affine() in most cases... 
We've always used wake_affine() before, yet been able to schedule at high frequency, so I don't see that it can be _that_ expensive. I haven't actually measured lately (long time) though. WRT cost/benefit of migration, yeah, it depends entirely on the tasks, some will gain, some will lose. On a modern single processor box, it just doesn't matter, there's only one llc (two s_i_s() calls = oopsie), but on my beloved old Q6600 or a big box, it'll matter a lot to something. NUMA balance will deal with big boxen, my trusty old Q6600 will likely get all upset with some localhost stuff. -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: On 02/21/2013 02:11 PM, Mike Galbraith wrote: On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: On 02/20/2013 06:49 PM, Ingo Molnar wrote: [snip] [snip] if wake_affine() new_cpu = select_idle_sibling(curr_cpu) else new_cpu = select_idle_sibling(prev_cpu) return new_cpu Actually that doesn't make sense. I think wake_affine() is trying to check whether move a task from prev_cpu to curr_cpu will break the balance in affine_sd or not, but why won't break balance means curr_cpu is better than prev_cpu for searching the idle cpu? You could argue that it's impossible to break balance by moving any task to any idle cpu, but that would mean bouncing tasks cross node on every wakeup is fine, which it isn't. I don't get it... could you please give me more detail on how wake_affine() related with bouncing? If we didn't ever ask if it's ok, we'd always pull, and stack load up on one package if there's the tiniest of holes to stuff a task into, periodic balance forcibly rips buddies back apart, repeat. At least with wake_affine() in the loop, there's somewhat of a damper. So the new logical in this patch set is: new_cpu = select_idle_sibling(prev_cpu) if idle_cpu(new_cpu) return new_cpu So you tilted the scales in favor of leaving tasks in their current package, which should benefit large footprint tasks, but should also penalize light communicating tasks. Yes, I'd prefer to wakeup the task on a cpu which: 1. idle 2. close to prev_cpu So if both curr_cpu and prev_cpu have idle cpu in their topology, which one is better? that depends on how task benefit from cache and the balance situation, whatever, I don't think the benefit worth the high cost of wake_affine() in most cases... We've always used wake_affine() before, yet been able to schedule at high frequency, so I don't see that it can be _that_ expensive. I haven't actually measured lately (lng time) though. 
WRT cost/benefit of migration, yeah, it depends entirely on the tasks, some will gain, some will lose. On a modern single processor box, it just doesn't matter, there's only one llc (two s_i_s() calls = oopsie), but on my beloved old Q6600 or a big box, it'll matter a lot to something. NUMA balance will deal with big boxen, my trusty old Q6600 will likely get all upset with some localhost stuff. -Mike -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/21/2013 04:10 PM, Mike Galbraith wrote: On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote: On 02/21/2013 02:11 PM, Mike Galbraith wrote: On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote: On 02/20/2013 06:49 PM, Ingo Molnar wrote: [snip] [snip] if wake_affine() new_cpu = select_idle_sibling(curr_cpu) else new_cpu = select_idle_sibling(prev_cpu) return new_cpu Actually that doesn't make sense. I think wake_affine() is trying to check whether move a task from prev_cpu to curr_cpu will break the balance in affine_sd or not, but why won't break balance means curr_cpu is better than prev_cpu for searching the idle cpu? You could argue that it's impossible to break balance by moving any task to any idle cpu, but that would mean bouncing tasks cross node on every wakeup is fine, which it isn't. I don't get it... could you please give me more detail on how wake_affine() related with bouncing? If we didn't ever ask if it's ok, we'd always pull, and stack load up on one package if there's the tiniest of holes to stuff a task into, periodic balance forcibly rips buddies back apart, repeat. At least with wake_affine() in the loop, there's somewhat of a damper. I think I got your point, a question about the possibility to locate an idle cpu which belong to different package from prev_cpu, and how wake_affine() help on it. Old logical require prev_cpu and curr_cpu to be affine, which means they share caches on some level, usually means they are in the same package, and the select_idle_sibling() start search from the top cache share level, usually means search all the smp in one package, so, usually, both of the cases will only locate idle cpu in their package. I could not figure out what's the case that the wake_affine() could benefit a lot, but it do harm a lot according to the testing results. 
So the new logical in this patch set is: new_cpu = select_idle_sibling(prev_cpu) if idle_cpu(new_cpu) return new_cpu So you tilted the scales in favor of leaving tasks in their current package, which should benefit large footprint tasks, but should also penalize light communicating tasks. Yes, I'd prefer to wakeup the task on a cpu which: 1. idle 2. close to prev_cpu So if both curr_cpu and prev_cpu have idle cpu in their topology, which one is better? that depends on how task benefit from cache and the balance situation, whatever, I don't think the benefit worth the high cost of wake_affine() in most cases... We've always used wake_affine() before, yet been able to schedule at high frequency, so I don't see that it can be _that_ expensive. I haven't actually measured lately (lng time) though. WRT cost/benefit of migration, yeah, it depends entirely on the tasks, some will gain, some will lose. On a modern single processor box, it just doesn't matter, there's only one llc (two s_i_s() calls = oopsie), but on my beloved old Q6600 or a big box, it'll matter a lot to something. NUMA balance will deal with big boxen, my trusty old Q6600 will likely get all upset with some localhost stuff. But is this patch set really cause regression on your Q6600? It may sacrificed some thing, but I still think it will benefit far more, especially on huge systems. Regards, Michael Wang -Mike -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/21/2013 04:10 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote:
>> On 02/21/2013 02:11 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
>>>> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
>>>> [snip]
>>>>
>>>>         if wake_affine()
>>>>                 new_cpu = select_idle_sibling(curr_cpu)
>>>>         else
>>>>                 new_cpu = select_idle_sibling(prev_cpu)
>>>>
>>>>         return new_cpu
>>>>
>>>> Actually that doesn't make sense.
>>>>
>>>> I think wake_affine() is trying to check whether moving a task from
>>>> prev_cpu to curr_cpu will break the balance in affine_sd or not, but
>>>> why does "won't break balance" mean curr_cpu is better than prev_cpu
>>>> for searching for an idle cpu?
>>>
>>> You could argue that it's impossible to break balance by moving any
>>> task to any idle cpu, but that would mean bouncing tasks cross node
>>> on every wakeup is fine, which it isn't.
>>
>> I don't get it... could you please give me more detail on how
>> wake_affine() is related to bouncing?
>
> If we didn't ever ask if it's ok, we'd always pull, and stack load up
> on one package if there's the tiniest of holes to stuff a task into,
> periodic balance forcibly rips buddies back apart, repeat. At least
> with wake_affine() in the loop, there's somewhat of a damper.

Oh, I think I got the reason why the old logic checks the affine_sd first: when prev_cpu and curr_cpu belong to different packages, there is a chance to enter the balance path. That seems like a solution for this problem, amazing ;-)

You are right: without this logic, chances to balance load between packages would be missed. I will apply it in the next version.

Regards,
Michael Wang

>>>> So the new logic in this patch set is:
>>>>
>>>>         new_cpu = select_idle_sibling(prev_cpu)
>>>>         if idle_cpu(new_cpu)
>>>>                 return new_cpu
>>>
>>> So you tilted the scales in favor of leaving tasks in their current
>>> package, which should benefit large footprint tasks, but should also
>>> penalize light communicating tasks.
>>
>> Yes, I'd prefer to wake up the task on a cpu which is:
>> 1. idle
>> 2. close to prev_cpu
>>
>> So if both curr_cpu and prev_cpu have an idle cpu in their topology,
>> which one is better? That depends on how much the task benefits from
>> cache and on the balance situation; whatever the case, I don't think
>> the benefit is worth the high cost of wake_affine() in most cases...
>
> We've always used wake_affine() before, yet been able to schedule at
> high frequency, so I don't see that it can be _that_ expensive. I
> haven't actually measured lately (long time) though.
>
> WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
> some will gain, some will lose. On a modern single processor box, it
> just doesn't matter, there's only one llc (two s_i_s() calls =
> oopsie), but on my beloved old Q6600 or a big box, it'll matter a lot
> to something. NUMA balance will deal with big boxen, my trusty old
> Q6600 will likely get all upset with some localhost stuff.
>
> -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
> But does this patch set really cause a regression on your Q6600? It
> may sacrifice some things, but I still think it will benefit far more,
> especially on huge systems.

We spread on FORK/EXEC, and will no longer pull communicating tasks back to a shared cache with the new logic preferring to leave the wakee remote, so while no, I haven't tested (will try to find a round tuit), it seems it _must_ hurt. Dragging data from one llc to the other on Q6600 hurts a LOT. Every time a client and server are cross llc, it's a huge hit.

The previous logic pulled communicating tasks together right when it matters the most: intermittent load... or interactive use.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
> The old logic when locating affine_sd is:
>
>         if prev_cpu != curr_cpu
>                 if wake_affine()
>                         prev_cpu = curr_cpu
>         new_cpu = select_idle_sibling(prev_cpu)
>         return new_cpu
>
> The new logic is the same as the old one if prev_cpu == curr_cpu, so
> let's simplify the old logic like:
>
>         if wake_affine()
>                 new_cpu = select_idle_sibling(curr_cpu)
>         else
>                 new_cpu = select_idle_sibling(prev_cpu)
>
>         return new_cpu
>
> Actually that doesn't make sense.

It does :-)

> I think wake_affine() is trying to check whether moving a task from
> prev_cpu to curr_cpu will break the balance in affine_sd or not, but
> why does "won't break balance" mean curr_cpu is better than prev_cpu
> for searching for an idle cpu?

It doesn't; the whole affine wakeup stuff is meant to pull waking tasks towards the cpu that does the wakeup, and we limit this by putting bounds on the imbalance this may create.

The reason we want to run tasks on the cpu that does the wakeup is because that cpu 'obviously' is running something related and it seems like a good idea to run related tasks close together. So look at affine wakeups as a force that groups related tasks.

> So the new logic in this patch set is:
>
>         new_cpu = select_idle_sibling(prev_cpu)
>         if idle_cpu(new_cpu)
>                 return new_cpu
>
>         new_cpu = select_idle_sibling(curr_cpu)
>         if idle_cpu(new_cpu) {
>                 if wake_affine()
>                         return new_cpu
>         }
>
>         return prev_cpu
>
> And now, unless we are really going to move load from prev_cpu to
> curr_cpu, we won't use wake_affine() any more.

That completely breaks stuff, not cool.
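[Editorial note] The two decision flows being argued about can be put side by side as a runnable sketch. The helper callables here merely stand in for the real wake_affine(), select_idle_sibling() and idle_cpu(), which take different arguments in the kernel; treat this as the shape of the logic, not the implementation:

```python
def old_select(prev_cpu, curr_cpu, wake_affine, select_idle_sibling):
    """Old flow: wake_affine() picks the search origin up front."""
    if wake_affine():
        return select_idle_sibling(curr_cpu)
    return select_idle_sibling(prev_cpu)

def new_select(prev_cpu, curr_cpu, wake_affine, select_idle_sibling,
               idle_cpu):
    """New flow: prefer an idle cpu near prev_cpu, and consult
    wake_affine() only when the wakeup would actually migrate."""
    new_cpu = select_idle_sibling(prev_cpu)
    if idle_cpu(new_cpu):
        return new_cpu
    new_cpu = select_idle_sibling(curr_cpu)
    if idle_cpu(new_cpu) and wake_affine():
        return new_cpu
    return prev_cpu

# Demo: wakee last ran on cpu 5, waker is on cpu 7, and cpu 5 is idle.
sis = lambda c: c            # pretend the search finds `c` itself idle
print(new_select(5, 7, lambda: True, sis, lambda c: c == 5))  # -> 5
```

With an idle cpu available near prev_cpu, the new flow returns without ever consulting wake_affine(), which is both where the reported savings come from and exactly the behavioral change Peter objects to: the grouping force of affine wakeups now applies only as a fallback.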
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/21/2013 05:43 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
>> But does this patch set really cause a regression on your Q6600? It
>> may sacrifice some things, but I still think it will benefit far
>> more, especially on huge systems.
>
> We spread on FORK/EXEC, and will no longer pull communicating tasks
> back to a shared cache with the new logic preferring to leave the
> wakee remote, so while no, I haven't tested (will try to find a round
> tuit), it seems it _must_ hurt. Dragging data from one llc to the
> other on Q6600 hurts a LOT. Every time a client and server are cross
> llc, it's a huge hit. The previous logic pulled communicating tasks
> together right when it matters the most: intermittent load... or
> interactive use.

I agree that this is a problem that needs to be solved, but I don't agree that wake_affine() is the solution.

According to my understanding, in the old world wake_affine() will only be used if curr_cpu and prev_cpu share cache, which means they are in one package; whether we search in the llc sd of curr_cpu or of prev_cpu, we won't have the chance to spread the task out of that package.

I'm going to recover the logic that only does select_idle_sibling() when prev_cpu and curr_cpu are affine, so the new logic will only prefer leaving the task in the old package if both prev_cpu and curr_cpu are in that package. I think this could solve the problem, couldn't it?

Regards,
Michael Wang

> -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/21/2013 06:20 PM, Peter Zijlstra wrote:
> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
>> The old logic when locating affine_sd is:
>>
>>         if prev_cpu != curr_cpu
>>                 if wake_affine()
>>                         prev_cpu = curr_cpu
>>         new_cpu = select_idle_sibling(prev_cpu)
>>         return new_cpu
>>
>> The new logic is the same as the old one if prev_cpu == curr_cpu, so
>> let's simplify the old logic like:
>>
>>         if wake_affine()
>>                 new_cpu = select_idle_sibling(curr_cpu)
>>         else
>>                 new_cpu = select_idle_sibling(prev_cpu)
>>
>>         return new_cpu
>>
>> Actually that doesn't make sense.
>
> It does :-)
>
>> I think wake_affine() is trying to check whether moving a task from
>> prev_cpu to curr_cpu will break the balance in affine_sd or not, but
>> why does "won't break balance" mean curr_cpu is better than prev_cpu
>> for searching for an idle cpu?
>
> It doesn't; the whole affine wakeup stuff is meant to pull waking
> tasks towards the cpu that does the wakeup, and we limit this by
> putting bounds on the imbalance this may create.
>
> The reason we want to run tasks on the cpu that does the wakeup is
> because that cpu 'obviously' is running something related and it seems
> like a good idea to run related tasks close together. So look at
> affine wakeups as a force that groups related tasks.

That's right, and it's one point I missed when judging wake_affine()...

But that's really a benefit that is hard to estimate, especially when the workload is heavy: the cost of wake_affine(), calculating se loads one by one, is very high. Is that worth it for a benefit we cannot promise?

According to the testing result, I could not agree that this purpose of wake_affine() benefits us, but I'm sure that wake_affine() is a terrible performance killer when the system is busy.

>> So the new logic in this patch set is:
>>
>>         new_cpu = select_idle_sibling(prev_cpu)
>>         if idle_cpu(new_cpu)
>>                 return new_cpu
>>
>>         new_cpu = select_idle_sibling(curr_cpu)
>>         if idle_cpu(new_cpu) {
>>                 if wake_affine()
>>                         return new_cpu
>>         }
>>
>>         return prev_cpu
>>
>> And now, unless we are really going to move load from prev_cpu to
>> curr_cpu, we won't use wake_affine() any more.
>
> That completely breaks stuff, not cool.

Could you please give more details on what point you think is bad?

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
>> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
>>> But does this patch set really cause a regression on your Q6600? It
>>> may sacrifice some things, but I still think it will benefit far
>>> more, especially on huge systems.
>>
>> We spread on FORK/EXEC, and will no longer pull communicating tasks
>> back to a shared cache with the new logic preferring to leave the
>> wakee remote, so while no, I haven't tested (will try to find a round
>> tuit), it seems it _must_ hurt. Dragging data from one llc to the
>> other on Q6600 hurts a LOT. Every time a client and server are cross
>> llc, it's a huge hit. The previous logic pulled communicating tasks
>> together right when it matters the most: intermittent load... or
>> interactive use.
>
> I agree that this is a problem that needs to be solved, but I don't
> agree that wake_affine() is the solution.

It's not perfect, but it's better than no countering force at all. It's a relic of the dark ages, when affine meant L2, ie this cpu. Nowadays, affine has a whole new meaning, L3, so it could be done differently, but _some_ kind of opposing force is required.

> According to my understanding, in the old world wake_affine() will
> only be used if curr_cpu and prev_cpu share cache, which means they
> are in one package; whether we search in the llc sd of curr_cpu or of
> prev_cpu, we won't have the chance to spread the task out of that
> package.

? affine_sd is the first domain spanning both cpus, and that may be NODE. True, we won't ever spread in the wakeup path (unless SD_WAKE_BALANCE is set, that is). Would be nice to be able to do that without shredding performance.

Off the top of my pointy head, I can think of a way to _maybe_ improve the affine wakeup criteria: add a small (package size? and very fast) FIFO queue to the task struct, and record the waker/wakee relationship. If the relationship exists in that queue (rbtree), try to wake local; if not, wake remote.

The thought is to identify situations ala 1:N pgbench where you really need to keep the load spread. That need arises when the sum wakees + waker won't fit in one cache. True buddies would always hit (hm, hit rate), always try to become affine where they thrive. 1:N stuff starts missing when client count exceeds package size, and starts expanding its horizons.

'Course you would still need to NAK if imbalanced too badly, and let NUMA stuff NAK touching lard-balls and whatnot. With a little more smarts, we could have happy 1:N, and buddies don't have to chat through 2m thick walls to make 1:N scale as well as it can before it dies of stupidity.

-Mike
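[Editorial note] Mike's buddy-tracking idea can be sketched as a tiny bounded FIFO of recent wakee pids hung off each task. This is a toy only; the structure, its size, and every name here are assumptions about a design that was never written, not actual kernel code:

```python
from collections import deque

class Task:
    """Toy task struct carrying a bounded FIFO of recent wakee pids."""
    def __init__(self, pid, fifo_size=4):      # "package size?" per Mike
        self.pid = pid
        self.buddies = deque(maxlen=fifo_size)

    def record_wakeup(self, wakee):
        self.buddies.append(wakee.pid)          # oldest entry falls off

    def prefers_affine(self, wakee):
        """Wake local only if this looks like a recurring buddy."""
        return wakee.pid in self.buddies

# 1:N: a server waking eight clients through a size-4 FIFO.
server = Task(1)
clients = [Task(pid) for pid in range(2, 10)]
for c in clients:
    server.record_wakeup(c)
print(server.prefers_affine(clients[-1]))  # True: woken recently
print(server.prefers_affine(clients[0]))   # False: pushed out, wake remote
```

This captures the intended behavior: a true 1:1 buddy always hits and stays affine, while a 1:N server whose client count exceeds the FIFO size keeps missing and therefore keeps the load spread.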
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
> According to the testing result, I could not agree that this purpose
> of wake_affine() benefits us, but I'm sure that wake_affine() is a
> terrible performance killer when the system is busy.

(hm, result is singular.. pgbench in 1:N mode only?)
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 01:02 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
>>>> But does this patch set really cause a regression on your Q6600?
>>>> It may sacrifice some things, but I still think it will benefit far
>>>> more, especially on huge systems.
>>>
>>> We spread on FORK/EXEC, and will no longer pull communicating tasks
>>> back to a shared cache with the new logic preferring to leave the
>>> wakee remote, so while no, I haven't tested (will try to find a
>>> round tuit), it seems it _must_ hurt. Dragging data from one llc to
>>> the other on Q6600 hurts a LOT. Every time a client and server are
>>> cross llc, it's a huge hit. The previous logic pulled communicating
>>> tasks together right when it matters the most: intermittent load...
>>> or interactive use.
>>
>> I agree that this is a problem that needs to be solved, but I don't
>> agree that wake_affine() is the solution.
>
> It's not perfect, but it's better than no countering force at all.
> It's a relic of the dark ages, when affine meant L2, ie this cpu.
> Nowadays, affine has a whole new meaning, L3, so it could be done
> differently, but _some_ kind of opposing force is required.
>
>> According to my understanding, in the old world wake_affine() will
>> only be used if curr_cpu and prev_cpu share cache, which means they
>> are in one package; whether we search in the llc sd of curr_cpu or of
>> prev_cpu, we won't have the chance to spread the task out of that
>> package.
>
> ? affine_sd is the first domain spanning both cpus, and that may be
> NODE. True, we won't ever spread in the wakeup path (unless
> SD_WAKE_BALANCE is set, that is). Would be nice to be able to do that
> without shredding performance.
>
> Off the top of my pointy head, I can think of a way to _maybe_ improve
> the affine wakeup criteria: add a small (package size? and very fast)
> FIFO queue to the task struct, and record the waker/wakee
> relationship. If the relationship exists in that queue (rbtree), try
> to wake local; if not, wake remote.
>
> The thought is to identify situations ala 1:N pgbench where you really
> need to keep the load spread. That need arises when the sum wakees +
> waker won't fit in one cache. True buddies would always hit (hm, hit
> rate), always try to become affine where they thrive. 1:N stuff starts
> missing when client count exceeds package size, and starts expanding
> its horizons.
>
> 'Course you would still need to NAK if imbalanced too badly, and let
> NUMA stuff NAK touching lard-balls and whatnot. With a little more
> smarts, we could have happy 1:N, and buddies don't have to chat
> through 2m thick walls to make 1:N scale as well as it can before it
> dies of stupidity.

Just to confirm that I'm not on the wrong way: does the 1:N mode here mean 1 task forked N threads, and the children always talk with the father?

Regards,
Michael Wang

> -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 01:08 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>> According to the testing result, I could not agree that this purpose
>> of wake_affine() benefits us, but I'm sure that wake_affine() is a
>> terrible performance killer when the system is busy.
>
> (hm, result is singular.. pgbench in 1:N mode only?)

I'm not sure how pgbench is implemented; all I know is that it creates several instances to access the database, which I suppose is no different from several threads accessing the database (1 server and N clients?).

There is improvement since, when the system is busy, wake_affine() will be skipped, while in the old world, when the system is busy, wake_affine() would only be skipped if prev_cpu and curr_cpu belonged to different nodes.

Regards,
Michael Wang
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 13:26 +0800, Michael Wang wrote:
> Just to confirm that I'm not on the wrong way: does the 1:N mode here
> mean 1 task forked N threads, and the children always talk with the
> father?

Yes, one server, many clients.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Fri, 2013-02-22 at 14:06 +0800, Michael Wang wrote:
> On 02/22/2013 01:08 PM, Mike Galbraith wrote:
>> On Fri, 2013-02-22 at 10:37 +0800, Michael Wang wrote:
>>> According to the testing result, I could not agree that this purpose
>>> of wake_affine() benefits us, but I'm sure that wake_affine() is a
>>> terrible performance killer when the system is busy.
>>
>> (hm, result is singular.. pgbench in 1:N mode only?)
>
> I'm not sure how pgbench is implemented; all I know is that it creates
> several instances to access the database, which I suppose is no
> different from several threads accessing the database (1 server and N
> clients?).

It's user switchable.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/22/2013 01:02 PM, Mike Galbraith wrote:
> On Fri, 2013-02-22 at 10:36 +0800, Michael Wang wrote:
>> On 02/21/2013 05:43 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 17:08 +0800, Michael Wang wrote:
>>>> But does this patch set really cause a regression on your Q6600?
>>>> It may sacrifice some things, but I still think it will benefit far
>>>> more, especially on huge systems.
>>>
>>> We spread on FORK/EXEC, and will no longer pull communicating tasks
>>> back to a shared cache with the new logic preferring to leave the
>>> wakee remote, so while no, I haven't tested (will try to find a
>>> round tuit), it seems it _must_ hurt. Dragging data from one llc to
>>> the other on Q6600 hurts a LOT. Every time a client and server are
>>> cross llc, it's a huge hit. The previous logic pulled communicating
>>> tasks together right when it matters the most: intermittent load...
>>> or interactive use.
>>
>> I agree that this is a problem that needs to be solved, but I don't
>> agree that wake_affine() is the solution.
>
> It's not perfect, but it's better than no countering force at all.
> It's a relic of the dark ages, when affine meant L2, ie this cpu.
> Nowadays, affine has a whole new meaning, L3, so it could be done
> differently, but _some_ kind of opposing force is required.
>
>> According to my understanding, in the old world wake_affine() will
>> only be used if curr_cpu and prev_cpu share cache, which means they
>> are in one package; whether we search in the llc sd of curr_cpu or of
>> prev_cpu, we won't have the chance to spread the task out of that
>> package.
>
> ? affine_sd is the first domain spanning both cpus, and that may be
> NODE. True, we won't ever spread in the wakeup path (unless
> SD_WAKE_BALANCE is set, that is). Would be nice to be able to do that
> without shredding performance.

That's right; we need two conditions in each select instance:
1. prev_cpu and curr_cpu are not affine
2. SD_WAKE_BALANCE is set

> Off the top of my pointy head, I can think of a way to _maybe_ improve
> the affine wakeup criteria: add a small (package size? and very fast)
> FIFO queue to the task struct, and record the waker/wakee
> relationship. If the relationship exists in that queue (rbtree), try
> to wake local; if not, wake remote.
>
> The thought is to identify situations ala 1:N pgbench where you really
> need to keep the load spread. That need arises when the sum wakees +
> waker won't fit in one cache. True buddies would always hit (hm, hit
> rate), always try to become affine where they thrive. 1:N stuff starts
> missing when client count exceeds package size, and starts expanding
> its horizons.
>
> 'Course you would still need to NAK if imbalanced too badly, and let
> NUMA stuff NAK touching lard-balls and whatnot. With a little more
> smarts, we could have happy 1:N, and buddies don't have to chat
> through 2m thick walls to make 1:N scale as well as it can before it
> dies of stupidity.

So this is trying to take care of the condition where curr_cpu (local) and prev_cpu (remote) are on different nodes, in which case, in the old world, wake_affine() would not be invoked, correct?

Hmm... I think this may be a good additional check before entering the balance path, but I cannot estimate the cost of recording the relationship at this moment...

Whatever the case, after the affine logic is applied to the new world, it will gain the ability to spread tasks across nodes just like the old world. Your idea may be an optimization, but that logic is outside the changes in this patch set, which means that if it benefits, the beneficiary will be not only the new world but also the old.

Regards,
Michael Wang

> -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/21/2013 02:11 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
>> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
>> [snip]
>>
>>         if wake_affine()
>>                 new_cpu = select_idle_sibling(curr_cpu)
>>         else
>>                 new_cpu = select_idle_sibling(prev_cpu)
>>
>>         return new_cpu
>>
>> Actually that doesn't make sense.
>>
>> I think wake_affine() is trying to check whether moving a task from
>> prev_cpu to curr_cpu will break the balance in affine_sd or not, but
>> why does "won't break balance" mean curr_cpu is better than prev_cpu
>> for searching for an idle cpu?
>
> You could argue that it's impossible to break balance by moving any
> task to any idle cpu, but that would mean bouncing tasks cross node on
> every wakeup is fine, which it isn't.

I don't get it... could you please give me more detail on how wake_affine() is related to bouncing?

>> So the new logic in this patch set is:
>>
>>         new_cpu = select_idle_sibling(prev_cpu)
>>         if idle_cpu(new_cpu)
>>                 return new_cpu
>
> So you tilted the scales in favor of leaving tasks in their current
> package, which should benefit large footprint tasks, but should also
> penalize light communicating tasks.

Yes, I'd prefer to wake up the task on a cpu which is:
1. idle
2. close to prev_cpu

So if both curr_cpu and prev_cpu have an idle cpu in their topology, which one is better? That depends on how much the task benefits from cache and on the balance situation; whatever the case, I don't think the benefit is worth the high cost of wake_affine() in most cases...

Regards,
Michael Wang

> I suspect that much of the pgbench improvement comes from the
> preemption mitigation of keeping 1:N load maximally spread, which is
> the perfect thing to do with such loads. In all the testing I ever did
> with it in 1:N mode, preemption dominated performance numbers. Keep
> the server away from clients: it has fewer fair competition worries,
> can consume more CPU preemption free, pushing the load collapse point
> strongly upward.
>
> -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
> [snip]
>>
>> The changes look clean and reasonable, any ideas exactly *why* it
>> speeds up?
>>
>> I.e. are there one or two key changes in the before/after logic and
>> scheduling patterns that you can identify as causing the speedup?
>
> Hi, Ingo
>
> Thanks for your reply, please let me point out the key changes here
> (forgive me for not having written a good description in the cover
> letter).
>
> The performance improvement from this patch set is:
> 1. delay the invocation of wake_affine().
> 2. save the cycles needed to gain the proper sd.
>
> The second point is obvious, and will benefit a lot when the sd
> topology is deep (NUMA is supposed to make it deeper on large
> systems).
>
> So in my testing on a 12 cpu box, actually most of the benefit comes
> from the first point, and please let me introduce it in detail.
>
> The old logic when locating affine_sd is:
>
>         if prev_cpu != curr_cpu
>                 if wake_affine()
>                         prev_cpu = curr_cpu
>         new_cpu = select_idle_sibling(prev_cpu)
>         return new_cpu
>
> The new logic is the same as the old one if prev_cpu == curr_cpu, so
> let's simplify the old logic like:
>
>         if wake_affine()
>                 new_cpu = select_idle_sibling(curr_cpu)
>         else
>                 new_cpu = select_idle_sibling(prev_cpu)
>
>         return new_cpu
>
> Actually that doesn't make sense.
>
> I think wake_affine() is trying to check whether moving a task from
> prev_cpu to curr_cpu will break the balance in affine_sd or not, but
> why does "won't break balance" mean curr_cpu is better than prev_cpu
> for searching for an idle cpu?

You could argue that it's impossible to break balance by moving any task to any idle cpu, but that would mean bouncing tasks cross node on every wakeup is fine, which it isn't.

> So the new logic in this patch set is:
>
>         new_cpu = select_idle_sibling(prev_cpu)
>         if idle_cpu(new_cpu)
>                 return new_cpu

So you tilted the scales in favor of leaving tasks in their current package, which should benefit large footprint tasks, but should also penalize light communicating tasks.

I suspect that much of the pgbench improvement comes from the preemption mitigation of keeping 1:N load maximally spread, which is the perfect thing to do with such loads. In all the testing I ever did with it in 1:N mode, preemption dominated performance numbers. Keep the server away from clients: it has fewer fair competition worries, can consume more CPU preemption free, pushing the load collapse point strongly upward.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/20/2013 10:05 PM, Mike Galbraith wrote:
> On Wed, 2013-02-20 at 14:32 +0100, Peter Zijlstra wrote:
>> On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:
>>
>>> The changes look clean and reasonable,
>>
>> I don't necessarily agree, note that O(n^2) storage requirement that
>> Michael failed to highlight ;-)
>
> (yeah, I mentioned that needs to shrink.. a lot)

Exactly, and I'm going to apply the suggestion now :)

>>> any ideas exactly *why* it speeds up?
>>
>> That is indeed the most interesting part.. There's two parts to
>> select_task_rq_fair(), the 'regular' affine wakeup path, and the
>> fork/exec find_idlest_goo() path. At the very least we need to
>> quantify which of these two parts contributes most to the speedup.
>>
>> In the power balancing discussion we already noted that the
>> find_idlest_goo() is in need of attention.
>
> Yup, even little stuff like break off the search when load is zero..

Agree, searching in a bunch of idle cpus and their subsets doesn't make sense...

Regards,
Michael Wang

> unless someone is planning on implementing anti-idle 'course ;-)
>
> -Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/20/2013 09:32 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:
>
>> The changes look clean and reasonable,
>
> I don't necessarily agree, note that O(n^2) storage requirement that
> Michael failed to highlight ;-)

Forgive me for not explaining this point in the cover, but it's really not a big deal in my opinion...

And I'm going to apply Mike's suggestion to do the allocation when a cpu goes active, which will save some space :)

Regards,
Michael Wang

>> any ideas exactly *why* it speeds up?
>
> That is indeed the most interesting part.. There's two parts to
> select_task_rq_fair(), the 'regular' affine wakeup path, and the
> fork/exec find_idlest_goo() path. At the very least we need to
> quantify which of these two parts contributes most to the speedup.
>
> In the power balancing discussion we already noted that the
> find_idlest_goo() is in need of attention.
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/20/2013 06:49 PM, Ingo Molnar wrote:
[snip]
>
> The changes look clean and reasonable, any ideas exactly *why* it
> speeds up?
>
> I.e. are there one or two key changes in the before/after logic
> and scheduling patterns that you can identify as causing the
> speedup?

Hi, Ingo

Thanks for your reply, please let me point out the key changes here
(forgive me for not having written a good description in the cover
letter).

The performance improvement from this patch set comes from:

1. delaying the invocation of wake_affine().
2. saving the loop used to locate the proper sd.

The second point is obvious, and will benefit a lot when the sd
topology is deep (NUMA is supposed to make it deeper on large systems).

So in my testing on a 12 cpu box, most of the benefit actually comes
from the first point; please let me introduce it in detail.

The old logic when locating affine_sd is:

	if prev_cpu != curr_cpu
		if wake_affine()
			prev_cpu = curr_cpu
	new_cpu = select_idle_sibling(prev_cpu)
	return new_cpu

The new logic is the same as the old one if prev_cpu == curr_cpu, so
let's simplify the old logic like:

	if wake_affine()
		new_cpu = select_idle_sibling(curr_cpu)
	else
		new_cpu = select_idle_sibling(prev_cpu)
	return new_cpu

Actually that doesn't make sense. I think wake_affine() is trying to
check whether moving a task from prev_cpu to curr_cpu will break the
balance in affine_sd or not, but why would "won't break balance" mean
curr_cpu is better than prev_cpu for searching for an idle cpu?

So the new logic in this patch set is:

	new_cpu = select_idle_sibling(prev_cpu)
	if idle_cpu(new_cpu)
		return new_cpu

	new_cpu = select_idle_sibling(curr_cpu)
	if idle_cpu(new_cpu) {
		if wake_affine()
			return new_cpu
	}

	return prev_cpu

And now, unless we are really going to move load from prev_cpu to
curr_cpu, we won't invoke wake_affine() any more.
So we avoid wake_affine() when system load is low or high. For middle
load, the worst case is when we fail to locate an idle cpu in
prev_cpu's topology but succeed in locating one in curr_cpu's, but that
rarely happens and the benchmark results prove the point.

Some comparison below:

1. system load is low

	old logic cost:	wake_affine()	select_idle_sibling()
	new logic cost:			select_idle_sibling()

2. system load is high

	old logic cost:	wake_affine()	select_idle_sibling()
	new logic cost:			select_idle_sibling()	select_idle_sibling()

3. system load is middle

	don't know

Case 1 saves the cost of wake_affine(); for case 3, the benchmarks
prove at least no regression.

For case 2, it's a comparison between wake_affine() and
select_idle_sibling(): since the system load is high, wake_affine()
costs far more than select_idle_sibling(), and we saved a lot according
to the benchmark results.

>
> Such changes also typically have a chance to cause regressions
> in other workloads - when that happens we need this kind of
> information to be able to enact plan-B.

The benefit comes from avoiding unnecessary work, and the patch set is
supposed to only reduce the cost of the key function with the least
logic change. I can't promise it benefits all workloads, but till now,
I've not found a regression.

Regards,
Michael Wang

>
> Thanks,
>
> 	Ingo
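The reordered decision above can be modeled as a tiny standalone
program. The 4-CPU/2-package topology, `package_peer()`, and the boolean
standing in for wake_affine() are all illustrative assumptions, not
kernel code:

```c
#include <assert.h>

static int idle[4];              /* 1 = cpu is currently idle */
static int wake_affine_ok;       /* stand-in for the wake_affine() verdict */

/* cpus {0,1} and {2,3} share a package in this toy topology */
static int package_peer(int cpu) { return cpu ^ 1; }

/* toy select_idle_sibling(): an idle cpu in 'cpu's package, else 'cpu' */
static int select_idle_sibling(int cpu)
{
	if (idle[cpu])
		return cpu;
	if (idle[package_peer(cpu)])
		return package_peer(cpu);
	return cpu;
}

/* the new ordering: try prev_cpu's package first, and consult
 * wake_affine() only when we would actually migrate toward curr_cpu */
static int pick_cpu(int prev_cpu, int curr_cpu)
{
	int new_cpu = select_idle_sibling(prev_cpu);
	if (idle[new_cpu])
		return new_cpu;

	new_cpu = select_idle_sibling(curr_cpu);
	if (idle[new_cpu] && wake_affine_ok)
		return new_cpu;

	return prev_cpu;
}
```

Note how wake_affine() never runs when an idle cpu exists near prev_cpu,
which is exactly the low-load saving in comparison 1 above.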
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Wed, 2013-02-20 at 14:32 +0100, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:
>
>> The changes look clean and reasonable,
>
> I don't necessarily agree, note that O(n^2) storage requirement that
> Michael failed to highlight ;-)

(yeah, I mentioned that needs to shrink.. a lot)

>> any ideas exactly *why* it speeds up?
>
> That is indeed the most interesting part.. There's two parts to
> select_task_rq_fair(), the 'regular' affine wakeup path, and the
> fork/exec find_idlest_goo() path. At the very least we need to quantify
> which of these two parts contributes most to the speedup.
>
> In the power balancing discussion we already noted that the
> find_idlest_goo() is in need of attention.

Yup, even little stuff like break off the search when load is zero..
unless someone is planning on implementing anti-idle 'course ;-)

-Mike
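Mike's "break off the search when load is zero" is a one-line early
exit; a minimal standalone sketch, where the array of per-group loads is
a stand-in for find_idlest_goo()'s iteration rather than the actual
kernel loop:

```c
#include <assert.h>

/* return the index of the least-loaded group, stopping as soon as a
 * zero-load (fully idle) group is seen: nothing can beat idle */
static int find_idlest(const long *load, int n)
{
	int best = 0;

	for (int i = 0; i < n; i++) {
		if (load[i] == 0)
			return i;          /* early exit: search can't improve */
		if (load[i] < load[best])
			best = i;
	}
	return best;
}
```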
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Wed, 2013-02-20 at 11:49 +0100, Ingo Molnar wrote:
> The changes look clean and reasonable,

I don't necessarily agree, note that O(n^2) storage requirement that
Michael failed to highlight ;-)

> any ideas exactly *why* it speeds up?

That is indeed the most interesting part.. There's two parts to
select_task_rq_fair(), the 'regular' affine wakeup path, and the
fork/exec find_idlest_goo() path. At the very least we need to quantify
which of these two parts contributes most to the speedup.

In the power balancing discussion we already noted that the
find_idlest_goo() is in need of attention.
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
* Michael Wang wrote:

> v3 change log:
> 	Fix small logical issues (Thanks to Mike Galbraith).
> 	Change the way of handling WAKE.
>
> This patch set is trying to simplify the select_task_rq_fair()
> with a schedule balance map.
>
> After getting rid of the complex code and reorganizing the logic,
> pgbench shows the improvement; the more the clients, the bigger
> the improvement.
>
> 	                      Prev:       Post:
>
> 	| db_size | clients |  | tps   |   | tps   |
> 	+---------+---------+  +-------+   +-------+
> 	| 22 MB   |       1 |  | 10788 |   | 10881 |
> 	| 22 MB   |       2 |  | 21617 |   | 21837 |
> 	| 22 MB   |       4 |  | 41597 |   | 42645 |
> 	| 22 MB   |       8 |  | 54622 |   | 57808 |
> 	| 22 MB   |      12 |  | 50753 |   | 54527 |
> 	| 22 MB   |      16 |  | 50433 |   | 56368 |  +11.77%
> 	| 22 MB   |      24 |  | 46725 |   | 54319 |  +16.25%
> 	| 22 MB   |      32 |  | 43498 |   | 54650 |  +25.64%
> 	| 7484 MB |       1 |  |  7894 |   |  8301 |
> 	| 7484 MB |       2 |  | 19477 |   | 19622 |
> 	| 7484 MB |       4 |  | 36458 |   | 38242 |
> 	| 7484 MB |       8 |  | 48423 |   | 50796 |
> 	| 7484 MB |      12 |  | 46042 |   | 49938 |
> 	| 7484 MB |      16 |  | 46274 |   | 50507 |  +9.15%
> 	| 7484 MB |      24 |  | 42583 |   | 49175 |  +15.48%
> 	| 7484 MB |      32 |  | 36413 |   | 49148 |  +34.97%
> 	| 15 GB   |       1 |  |  7742 |   |  7876 |
> 	| 15 GB   |       2 |  | 19339 |   | 19531 |
> 	| 15 GB   |       4 |  | 36072 |   | 37389 |
> 	| 15 GB   |       8 |  | 48549 |   | 50570 |
> 	| 15 GB   |      12 |  | 45716 |   | 49542 |
> 	| 15 GB   |      16 |  | 46127 |   | 49647 |  +7.63%
> 	| 15 GB   |      24 |  | 42539 |   | 48639 |  +14.34%
> 	| 15 GB   |      32 |  | 36038 |   | 48560 |  +34.75%
>
> Please check the patch for more details about the schedule balance map.

The changes look clean and reasonable, any ideas exactly *why* it
speeds up?

I.e. are there one or two key changes in the before/after logic
and scheduling patterns that you can identify as causing the
speedup?

Such changes also typically have a chance to cause regressions
in other workloads - when that happens we need this kind of
information to be able to enact plan-B.
Thanks,

	Ingo
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
[snip]
>
> Actually that doesn't make sense. I think wake_affine() is trying to
> check whether moving a task from prev_cpu to curr_cpu will break the
> balance in affine_sd or not, but why would "won't break balance" mean
> curr_cpu is better than prev_cpu for searching for an idle cpu?

You could argue that it's impossible to break balance by moving any
task to any idle cpu, but that would mean bouncing tasks cross node on
every wakeup is fine, which it isn't.

> So the new logic in this patch set is:
>
> 	new_cpu = select_idle_sibling(prev_cpu)
> 	if idle_cpu(new_cpu)
> 		return new_cpu

So you tilted the scales in favor of leaving tasks in their current
package, which should benefit large footprint tasks, but should also
penalize light communicating tasks.

I suspect that much of the pgbench improvement comes from the
preemption mitigation of keeping the 1:N load maximally spread, which
is the perfect thing to do with such loads. In all the testing I ever
did with it in 1:N mode, preemption dominated performance numbers.
Kept away from its clients, the server has fewer fair competition
worries and can consume more CPU preemption free, pushing the load
collapse point strongly upward.

-Mike
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 02/21/2013 02:11 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
>> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
>> [snip]
[snip]
>
> You could argue that it's impossible to break balance by moving any
> task to any idle cpu, but that would mean bouncing tasks cross node on
> every wakeup is fine, which it isn't.

I don't get it... could you please give me more detail on how
wake_affine() relates to bouncing?

>> So the new logic in this patch set is:
>>
>> 	new_cpu = select_idle_sibling(prev_cpu)
>> 	if idle_cpu(new_cpu)
>> 		return new_cpu
>
> So you tilted the scales in favor of leaving tasks in their current
> package, which should benefit large footprint tasks, but should also
> penalize light communicating tasks.

Yes, I'd prefer to wake up the task on a cpu which is:

1. idle
2. close to prev_cpu

So if both curr_cpu and prev_cpu have an idle cpu in their topology,
which one is better? That depends on how much the task benefits from
cache and on the balance situation. Whatever the answer, I don't think
the benefit is worth the high cost of wake_affine() in most cases...

Regards,
Michael Wang

> I suspect that much of the pgbench improvement comes from the
> preemption mitigation of keeping the 1:N load maximally spread, which
> is the perfect thing to do with such loads. In all the testing I ever
> did with it in 1:N mode, preemption dominated performance numbers.
Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()
On 01/29/2013 05:08 PM, Michael Wang wrote:
> v3 change log:
> 	Fix small logical issues (Thanks to Mike Galbraith).
> 	Change the way of handling WAKE.
>
> This patch set is trying to simplify the select_task_rq_fair() with a
> schedule balance map.
>
> After getting rid of the complex code and reorganizing the logic,
> pgbench shows the improvement; the more the clients, the bigger the
> improvement.
>
[pgbench tps table snipped; identical to the one quoted in Ingo's reply
earlier in the thread]
>
> Please check the patch for more details about the schedule balance map.
>
> Support the NUMA domain but not well tested.
> Support the rebuild of domain but not tested.

Hi, Ingo, Peter

I've finished the tests I could figure out (NUMA, domain rebuild...),
and no issue appeared on my box.

I think this patch set will benefit the system, especially when there
is a huge amount of cpus.

What do you think about this idea? Do you have any comments on the
patch set?

Regards,
Michael Wang

> Comments are very welcomed.
>
> Behind the v3:
> 	Some changes have been applied to the way of handling WAKE.
>
> And that's all around one question: should we do load balance for
> WAKE or not?
>
> In the old world, the only chance to do load balance for WAKE is when
> prev cpu and curr cpu are not cache affine, but that doesn't make
> sense.
>
> I suppose the real meaning behind that logic is: do balance only if
> cache benefits nothing after changing cpu.
>
> However, select_idle_sibling() is not only designed for the purpose
> of taking care of cache, it also benefits latency, and costs less
> than the balance path.
>
> Besides, it's impossible to estimate the benefit of doing load
> balance at that point of time.
>
> And that's how the v3 came out: no load balance for WAKE.
>
> Test with:
> 	12 cpu X86 server and linux-next 3.8.0-rc3.
>
> Michael Wang (3):
> 	[RFC PATCH v3 1/3] sched: schedule balance map foundation
> 	[RFC PATCH v3 2/3] sched: build schedule balance map
> 	[RFC PATCH v3 3/3] sched: simplify select_task_rq_fair() with
> 	                   schedule balance map
>
> Signed-off-by: Michael Wang
> ---
>  b/kernel/sched/core.c  |  44 +++
>  b/kernel/sched/fair.c  | 135 ++---
>  b/kernel/sched/sched.h |  14 +
>  kernel/sched/core.c    |  67 
>  4 files changed, 199 insertions(+), 61 deletions(-)