Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12/13/2012 07:35 PM, Borislav Petkov wrote:
> On Thu, Dec 13, 2012 at 11:07:43AM +0800, Alex Shi wrote:
>>>> now, on the other hand, if you have two threads of a process that
>>>> share a bunch of data structures, and you'd spread these over 2
>>>> sockets, you end up bouncing data between the two sockets a lot,
>>>> running inefficient --> bad for power.
>>>
>>> Yeah, that should be addressed by the NUMA patches people are
>>> working on right now.
>>
>> Yes, as for the balance/powersaving policy, we can pack tasks tightly
>> first, then NUMA balancing will make memory follow us.
>>
>> BTW, NUMA balancing is more related to pages in memory, not the LLC.
>
> Sure, let's look at the worst and best cases:
>
> * worst case: you have memory shared by multiple threads on one node
> *and* working set doesn't fit in LLC.
>
> Here, if you pack threads tightly only on one node, you still suffer the
> working set kicking out parts of itself out of LLC.
>
> If you spread threads around, you still cannot avoid the LLC thrashing
> because the LLC of the node containing the shared memory needs to cache
> all those transactions. *In* *addition*, you get the cross-node traffic
> because the shared pages are on the first node.
>
> Major suckage.
>
> Does it matter? I don't know. It can be decided on a case-by-case basis.
> If people care about singlethread perf, they would likely want to spread
> around and buy in the cross-node traffic.
>
> If they care for power, then maybe they don't want to turn on the second
> socket yet.
>
> * the optimal case is where memory follows threads and gets spread
> around such that LLC doesn't get thrashed and cross-node traffic gets
> avoided.
>
> Now, you can think of all those other scenarios in between :-/

You are right. Thanks for the explanation! :)

Actually, what I meant to say is that the target of NUMA balancing is
pages in the memory of different nodes, but of course it may also
improve LLC performance.

>
> Thanks.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On Thu, Dec 13, 2012 at 11:07:43AM +0800, Alex Shi wrote:
> >> now, on the other hand, if you have two threads of a process that
> >> share a bunch of data structures, and you'd spread these over 2
> >> sockets, you end up bouncing data between the two sockets a lot,
> >> running inefficient --> bad for power.
> >
> > Yeah, that should be addressed by the NUMA patches people are
> > working on right now.
>
> Yes, as for the balance/powersaving policy, we can pack tasks tightly
> first, then NUMA balancing will make memory follow us.
>
> BTW, NUMA balancing is more related to pages in memory, not the LLC.

Sure, let's look at the worst and best cases:

* worst case: you have memory shared by multiple threads on one node
*and* working set doesn't fit in LLC.

Here, if you pack threads tightly only on one node, you still suffer the
working set kicking out parts of itself out of LLC.

If you spread threads around, you still cannot avoid the LLC thrashing
because the LLC of the node containing the shared memory needs to cache
all those transactions. *In* *addition*, you get the cross-node traffic
because the shared pages are on the first node.

Major suckage.

Does it matter? I don't know. It can be decided on a case-by-case basis.
If people care about singlethread perf, they would likely want to spread
around and buy in the cross-node traffic.

If they care for power, then maybe they don't want to turn on the second
socket yet.

* the optimal case is where memory follows threads and gets spread
around such that LLC doesn't get thrashed and cross-node traffic gets
avoided.

Now, you can think of all those other scenarios in between :-/

Thanks.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
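The worst and best cases in the message above can be written down as a small decision table. What follows is only a toy sketch: the function name `placement_penalties`, its parameters, the two-node split, and the rule "a working set larger than the LLC means thrashing" are all assumptions made for illustration and do not come from the patchset or from the kernel.

```python
def placement_penalties(working_set_mb, llc_mb, threads_spread, memory_follows_threads):
    """Return the penalties one thread placement pays, per the cases above.

    Toy model (assumption): a single shared working set on a two-node box;
    an LLC thrashes when the data it has to cache is larger than llc_mb.
    """
    penalties = set()
    if threads_spread and not memory_follows_threads:
        # Shared pages stay on the first node: its LLC still has to cache
        # everything, and the remote threads add cross-node traffic.
        if working_set_mb > llc_mb:
            penalties.add("llc_thrash")
        penalties.add("cross_node_traffic")
    elif threads_spread and memory_follows_threads:
        # Optimal case: the working set is split across the two nodes.
        if working_set_mb / 2 > llc_mb:
            penalties.add("llc_thrash")
    else:
        # Packed on one node: only that node's LLC is available.
        if working_set_mb > llc_mb:
            penalties.add("llc_thrash")
    return penalties


if __name__ == "__main__":
    # Hypothetical sizes: 16 MB working set, 8 MB LLC per node.
    for spread, follows in [(False, False), (True, False), (True, True)]:
        result = placement_penalties(16, 8, spread, follows)
        print(f"spread={spread} memory_follows={follows} -> {sorted(result)}")
```

With these made-up sizes, packing thrashes the LLC, spreading without migrating memory adds cross-node traffic on top of the thrashing, and only the "memory follows threads" placement avoids both.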
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
>> now, on the other hand, if you have two threads of a process that
>> share a bunch of data structures, and you'd spread these over 2
>> sockets, you end up bouncing data between the two sockets a lot,
>> running inefficient --> bad for power.
>
> Yeah, that should be addressed by the NUMA patches people are working on
> right now.

Yes, as for the balance/powersaving policy, we can pack tasks tightly
first, then NUMA balancing will make memory follow us.

BTW, NUMA balancing is more related to pages in memory, not the LLC.

>
>> having said all this, if you have two tasks that don't have such
>> cache effects, the most efficient way of running things will be on 2
>> hyperthreading halves... it's very hard to beat the power efficiency
>> of that. But this assumes the tasks don't compete with resources much
>> on the HT level, and achieve good scaling. and this still has to
>> compete with "race to halt", because if you're done quicker, you can
>> put the memory in self refresh quicker.
>
> Right, how are we addressing the breakeven in that case? AFAIK, we
> do schedule them now on two different cores (not HT threads, i.e. no
> resource sharing besides L2) so that we get done faster, i.e. race to

That's what the balance policy is for. :)

> idle in the performance case. And in the powersavings' case we leave
> them as tightly packed as possible.
>
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12/12/2012 10:21 PM, Vincent Guittot wrote:
>>> If Linux is to continue to work efficiently on heterogeneous
>>> multi-processing platforms, it needs to provide scheduling mechanisms
>>> that can be exploited as per the demands of the HW architecture.
>>
>> Linus definitely disagrees with such ideas. :) So we need to summarize
>> the logic that holds across all hardware.
>>
>>> An example is the "small task packing (and spreading)" for which Vincent
>>> Guittot has posted a patchset[1] earlier and so has Alex now.
>>
>> Sure. I just thought my patchset should handle the 'small task
>> packing' scenario. Would you guys like to give it a try?
> Hi Alex,
>
> Yes, I will give your patchset a try when I have some spare time.

Thanks, Vincent! The balance and powersaving policies should both have
an effect.
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On Tue, Dec 11, 2012 at 08:40:40AM -0800, Arjan van de Ven wrote:
> >Let me try to understand what this means: so "performance" above with
> >8 threads means that those threads are spread out across more than one
> >socket, no?
> >
> >If so, this would mean that you have a smaller amount of tasks on each
> >socket, thus the smaller wattage.
> >
> >The "powersaving" method OTOH fills up the one socket up to the brim,
> >thus the slightly higher consumption due to all threads being occupied.
> >
> >Is that it?
>
> not sure.
>
> by and large, power efficiency is the same as performance efficiency,
> with some twists. or to reword that to be more clear if you waste
> performance due to something that becomes inefficient, you're wasting
> power as well. now, you might have some hardware effects that can
> then save you power... but those effects then first need to overcome
> the waste from the performance inefficiency... and that almost never
> happens.
>
> for example, if you have two workloads that each fit barely inside
> the last level cache... it's much more efficient to spread these over
> two sockets... where each has its own full LLC to use. If you'd group
> these together, both would thrash the cache all the time and run
> inefficient --> bad for power.

Hmm, are you saying that powering up the second socket so that the
working set fully fits in the LLC still uses less power than the cost of
going out to memory and bringing those lines back in?

I'd say there's a breakeven point depending on the workload duration,
no?

Which means that we need to be able to look into the future in order to
know what to do... ;-/

> now, on the other hand, if you have two threads of a process that
> share a bunch of data structures, and you'd spread these over 2
> sockets, you end up bouncing data between the two sockets a lot,
> running inefficient --> bad for power.

Yeah, that should be addressed by the NUMA patches people are working on
right now.

> having said all this, if you have two tasks that don't have such
> cache effects, the most efficient way of running things will be on 2
> hyperthreading halves... it's very hard to beat the power efficiency
> of that. But this assumes the tasks don't compete with resources much
> on the HT level, and achieve good scaling. and this still has to
> compete with "race to halt", because if you're done quicker, you can
> put the memory in self refresh quicker.

Right, how are we addressing the breakeven in that case? AFAIK, we
do schedule them now on two different cores (not HT threads, i.e. no
resource sharing besides L2) so that we get done faster, i.e. race to
idle in the performance case. And in the powersavings' case we leave
them as tightly packed as possible.

> none of this stuff is easy for humans or computer programs to
> determine ahead of time... or sometimes even afterwards. heck, even
> for just performance it's really really hard already, never mind
> adding power.
>
> my personal gut feeling is that we should just optimize this scheduler
> stuff for performance, and that we're going to be doing quite well on
> power already if we achieve that.

Probably. I wonder if there is a way to measure power consumption of
different workloads in perf and then run those with different scheduling
policies.

Thanks.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
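The breakeven question raised above comes down to comparing energy, that is, average power multiplied by runtime, for the packed and the spread placement. The sketch below only illustrates that arithmetic: the function names and all wattage and runtime figures are hypothetical, and it deliberately ignores what the machine draws after the job finishes (the "race to halt" term), which a real comparison would have to include.

```python
def energy_joules(avg_watts, seconds):
    """Energy used by a run: average power times runtime."""
    return avg_watts * seconds


def spreading_saves_energy(packed_watts, packed_seconds, spread_watts, spread_seconds):
    """True if the spread placement uses less energy than the packed one."""
    return energy_joules(spread_watts, spread_seconds) < energy_joules(packed_watts, packed_seconds)


def breakeven_speedup(packed_watts, spread_watts):
    """Speedup (packed runtime / spread runtime) at which both use equal energy."""
    return spread_watts / packed_watts


if __name__ == "__main__":
    # Hypothetical figures: one socket draws 100 W, two sockets draw 130 W.
    print(breakeven_speedup(100, 130))                 # speedup needed to break even
    print(spreading_saves_energy(100, 100, 130, 70))   # spread run is fast enough
    print(spreading_saves_energy(100, 100, 130, 80))   # spread run is not fast enough
```

Under these assumed numbers, waking the second socket only pays off if the better cache fit makes the job at least 1.3 times faster, which is why the answer depends on the workload.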
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12 December 2012 14:55, Alex Shi wrote:
>>>>> well... it's not always beneficial to group or to spread out
>>>>> it depends on cache behavior mostly which is best
>>>>
>>>> Let me try to understand what this means: so "performance" above with
>>>> 8 threads means that those threads are spread out across more than one
>>>> socket, no?
>>>>
>>>> If so, this would mean that you have a smaller amount of tasks on each
>>>> socket, thus the smaller wattage.
>>>>
>>>> The "powersaving" method OTOH fills up the one socket up to the brim,
>>>> thus the slightly higher consumption due to all threads being occupied.
>>>>
>>>> Is that it?
>>>
>>>
>>> not sure.
>>>
>>> by and large, power efficiency is the same as performance efficiency, with
>>> some twists.
>>> or to reword that to be more clear
>>> if you waste performance due to something that becomes inefficient, you're
>>> wasting power as well.
>>> now, you might have some hardware effects that can then save you power...
>>> but those effects
>>> then first need to overcome the waste from the performance inefficiency...
>>> and that almost never happens.
>>>
>>> for example, if you have two workloads that each fit barely inside the last
>>> level cache...
>>> it's much more efficient to spread these over two sockets... where each has
>>> its own full LLC
>>> to use.
>>> If you'd group these together, both would thrash the cache all the time and
>>> run inefficient --> bad for power.
>>>
>>> now, on the other hand, if you have two threads of a process that share a
>>> bunch of data structures,
>>> and you'd spread these over 2 sockets, you end up bouncing data between the
>>> two sockets a lot,
>>> running inefficient --> bad for power.
>>>
>>
>> Agree with all of the above. However..
>>
>>> having said all this, if you have two tasks that don't have such cache
>>> effects, the most efficient way
>>> of running things will be on 2 hyperthreading halves... it's very hard to
>>> beat the power efficiency of that.
>>
>> .. there are alternatives to hyperthreading. On ARM's big.LITTLE
>> architecture you could simply schedule them on the LITTLE cores. The
>> big cores just can't beat the power efficiency of the LITTLE ones even
>> with 'race to halt' that you allude to below. And usecases like mp3
>> playback simply don't require the kind of performance that the big
>> cores can offer.
>>
>>> But this assumes the tasks don't compete with resources much on the HT
>>> level, and achieve good scaling.
>>> and this still has to compete with "race to halt", because if you're done
>>> quicker, you can put the memory
>>> in self refresh quicker.
>>>
>>> none of this stuff is easy for humans or computer programs to determine
>>> ahead of time... or sometimes even afterwards.
>>> heck, even for just performance it's really really hard already, never mind
>>> adding power.
>>>
>>> my personal gut feeling is that we should just optimize this scheduler stuff
>>> for performance, and that
>>> we're going to be doing quite well on power already if we achieve that.
>>
>> If Linux is to continue to work efficiently on heterogeneous
>> multi-processing platforms, it needs to provide scheduling mechanisms
>> that can be exploited as per the demands of the HW architecture.
>
> Linus definitely disagrees with such ideas. :) So we need to summarize
> the logic that holds across all hardware.
>
>> An example is the "small task packing (and spreading)" for which Vincent
>> Guittot has posted a patchset[1] earlier and so has Alex now.
>
> Sure. I just thought my patchset should handle the 'small task
> packing' scenario. Would you guys like to give it a try?

Hi Alex,

Yes, I will give your patchset a try when I have some spare time.

Vincent

>>
>> [1] http://lwn.net/Articles/518834/
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
>>>> well... it's not always beneficial to group or to spread out
>>>> it depends on cache behavior mostly which is best
>>>
>>>
>>> Let me try to understand what this means: so "performance" above with
>>> 8 threads means that those threads are spread out across more than one
>>> socket, no?
>>>
>>> If so, this would mean that you have a smaller amount of tasks on each
>>> socket, thus the smaller wattage.
>>>
>>> The "powersaving" method OTOH fills up the one socket up to the brim,
>>> thus the slightly higher consumption due to all threads being occupied.
>>>
>>> Is that it?
>>
>>
>> not sure.
>>
>> by and large, power efficiency is the same as performance efficiency, with
>> some twists.
>> or to reword that to be more clear
>> if you waste performance due to something that becomes inefficient, you're
>> wasting power as well.
>> now, you might have some hardware effects that can then save you power...
>> but those effects
>> then first need to overcome the waste from the performance inefficiency...
>> and that almost never happens.
>>
>> for example, if you have two workloads that each fit barely inside the last
>> level cache...
>> it's much more efficient to spread these over two sockets... where each has
>> its own full LLC
>> to use.
>> If you'd group these together, both would thrash the cache all the time and
>> run inefficient --> bad for power.
>>
>> now, on the other hand, if you have two threads of a process that share a
>> bunch of data structures,
>> and you'd spread these over 2 sockets, you end up bouncing data between the
>> two sockets a lot,
>> running inefficient --> bad for power.
>>
>
> Agree with all of the above. However..
>
>> having said all this, if you have two tasks that don't have such cache
>> effects, the most efficient way
>> of running things will be on 2 hyperthreading halves... it's very hard to
>> beat the power efficiency of that.
>
> .. there are alternatives to hyperthreading. On ARM's big.LITTLE
> architecture you could simply schedule them on the LITTLE cores. The
> big cores just can't beat the power efficiency of the LITTLE ones even
> with 'race to halt' that you allude to below. And usecases like mp3
> playback simply don't require the kind of performance that the big
> cores can offer.
>
>> But this assumes the tasks don't compete with resources much on the HT
>> level, and achieve good scaling.
>> and this still has to compete with "race to halt", because if you're done
>> quicker, you can put the memory
>> in self refresh quicker.
>>
>> none of this stuff is easy for humans or computer programs to determine
>> ahead of time... or sometimes even afterwards.
>> heck, even for just performance it's really really hard already, never mind
>> adding power.
>>
>> my personal gut feeling is that we should just optimize this scheduler stuff
>> for performance, and that
>> we're going to be doing quite well on power already if we achieve that.
>
> If Linux is to continue to work efficiently on heterogeneous
> multi-processing platforms, it needs to provide scheduling mechanisms
> that can be exploited as per the demands of the HW architecture.

Linus definitely disagrees with such ideas. :) So we need to summarize
the logic that holds across all hardware.

> An example is the "small task packing (and spreading)" for which Vincent
> Guittot has posted a patchset[1] earlier and so has Alex now.

Sure. I just thought my patchset should handle the 'small task
packing' scenario. Would you guys like to give it a try?

>
> [1] http://lwn.net/Articles/518834/
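The big.LITTLE point quoted above, that a fixed-rate job such as mp3 playback can be cheaper on a LITTLE core even when the big core races to idle, can be illustrated with a duty-cycle model. This is a sketch under stated assumptions: the function `average_power` and every throughput and wattage figure are hypothetical and are not measurements of any real SoC.

```python
def average_power(work_per_second, core_throughput, active_watts, idle_watts):
    """Average watts for a periodic task that needs work_per_second units each second.

    The core is busy for work/throughput of every second and idle for the rest.
    """
    busy_fraction = work_per_second / core_throughput
    if busy_fraction > 1:
        raise ValueError("core is too slow for this workload")
    return busy_fraction * active_watts + (1 - busy_fraction) * idle_watts


if __name__ == "__main__":
    # Hypothetical cores: the big core is 2.5x faster but draws 5x the active power.
    playback = 0.2  # work units per second, made up
    little = average_power(playback, core_throughput=1.0, active_watts=0.3, idle_watts=0.01)
    big = average_power(playback, core_throughput=2.5, active_watts=1.5, idle_watts=0.05)
    print(f"LITTLE: {little:.3f} W, big: {big:.3f} W")
```

In this model the big core finishes each burst sooner, but its higher active and idle power more than cancel that advantage. Whether this holds on real hardware depends on the actual power figures.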
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On Tue, Dec 11, 2012 at 10:10 PM, Arjan van de Ven wrote:
> On 12/11/2012 8:13 AM, Borislav Petkov wrote:
>>
>> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>>>
>>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>>>>
>>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>>>
>>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>>> results show we get much better performance/power with powersaving and
>>>>> balance policy:
>>>>>
>>>>> testing command:
>>>>> #pigz -k -c -p$x -r linux* &> /dev/null
>>>>>
>>>>> On a NHM EP box
>>>>>          powersaving       balance           performance
>>>>> x = 4    166.516 /88 68    170.515 /82 71    165.283 /103 58
>>>>> x = 8    173.654 /61 94    177.693 /60 93    172.31 /76 76
>>>>
>>>> This looks funny: so "performance" is eating less watts than
>>>> "powersaving" and "balance" on NHM. Could it be that the average watts
>>>> measurements on NHM are not correct/precise..? On SNB they look as
>>>> expected, according to your scheme.
>>>
>>>
>>> well... it's not always beneficial to group or to spread out
>>> it depends on cache behavior mostly which is best
>>
>>
>> Let me try to understand what this means: so "performance" above with
>> 8 threads means that those threads are spread out across more than one
>> socket, no?
>>
>> If so, this would mean that you have a smaller amount of tasks on each
>> socket, thus the smaller wattage.
>>
>> The "powersaving" method OTOH fills up the one socket up to the brim,
>> thus the slightly higher consumption due to all threads being occupied.
>>
>> Is that it?
>
>
> not sure.
>
> by and large, power efficiency is the same as performance efficiency, with
> some twists.
> or to reword that to be more clear
> if you waste performance due to something that becomes inefficient, you're
> wasting power as well.
> now, you might have some hardware effects that can then save you power...
> but those effects
> then first need to overcome the waste from the performance inefficiency...
> and that almost never happens.
>
> for example, if you have two workloads that each fit barely inside the last
> level cache...
> it's much more efficient to spread these over two sockets... where each has
> its own full LLC
> to use.
> If you'd group these together, both would thrash the cache all the time and
> run inefficient --> bad for power.
>
> now, on the other hand, if you have two threads of a process that share a
> bunch of data structures,
> and you'd spread these over 2 sockets, you end up bouncing data between the
> two sockets a lot,
> running inefficient --> bad for power.
>

Agree with all of the above. However..

> having said all this, if you have two tasks that don't have such cache
> effects, the most efficient way
> of running things will be on 2 hyperthreading halves... it's very hard to
> beat the power efficiency of that.

.. there are alternatives to hyperthreading. On ARM's big.LITTLE
architecture you could simply schedule them on the LITTLE cores. The
big cores just can't beat the power efficiency of the LITTLE ones even
with 'race to halt' that you allude to below. And usecases like mp3
playback simply don't require the kind of performance that the big
cores can offer.

> But this assumes the tasks don't compete with resources much on the HT
> level, and achieve good scaling.
> and this still has to compete with "race to halt", because if you're done
> quicker, you can put the memory
> in self refresh quicker.
>
> none of this stuff is easy for humans or computer programs to determine
> ahead of time... or sometimes even afterwards.
> heck, even for just performance it's really really hard already, never mind
> adding power.
>
> my personal gut feeling is that we should just optimize this scheduler stuff
> for performance, and that
> we're going to be doing quite well on power already if we achieve that.

If Linux is to continue to work efficiently on heterogeneous
multi-processing platforms, it needs to provide scheduling mechanisms
that can be exploited as per the demands of the HW architecture. An
example is the "small task packing (and spreading)" for which Vincent
Guittot has posted a patchset[1] earlier and so has Alex now.

[1] http://lwn.net/Articles/518834/
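The pigz table quoted in the message above does not say what the three numbers in each cell mean. The sketch below assumes each cell reads "average watts /seconds score", and that the score is 1,000,000 divided by (watts × seconds), truncated to an integer. Both the column meaning and the formula are assumptions, though under them the computed scores match all six printed cells.

```python
# Cells from the NHM EP table: (average watts, seconds, printed score).
# The interpretation of the columns is an assumption, not stated in the thread.
NHM_EP = {
    (4, "powersaving"): (166.516, 88, 68),
    (4, "balance"): (170.515, 82, 71),
    (4, "performance"): (165.283, 103, 58),
    (8, "powersaving"): (173.654, 61, 94),
    (8, "balance"): (177.693, 60, 93),
    (8, "performance"): (172.31, 76, 76),
}


def perf_per_watt(watts, seconds):
    """Scaled performance per watt: 1e6 / (watts * seconds), truncated."""
    return int(1_000_000 / (watts * seconds))


if __name__ == "__main__":
    for (threads, policy), (watts, seconds, printed) in NHM_EP.items():
        energy = watts * seconds  # joules for the whole run
        print(f"x={threads} {policy:12s} energy={energy:9.1f} J "
              f"computed={perf_per_watt(watts, seconds)} printed={printed}")
```

Read this way, the "performance" policy draws the fewest watts in both rows but also runs longest, so it uses the most total energy. That would be consistent with the lower wattage Borislav found odd and with Alex's claim of better performance/power for the other two policies.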
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On Tue, Dec 11, 2012 at 10:10 PM, Arjan van de Ven ar...@linux.intel.com wrote: On 12/11/2012 8:13 AM, Borislav Petkov wrote: On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote: On 12/11/2012 7:48 AM, Borislav Petkov wrote: On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote: Another testing of parallel compress with pigz on Linus' git tree. results show we get much better performance/power with powersaving and balance policy: testing command: #pigz -k -c -p$x -r linux* /dev/null On a NHM EP box powersaving balance performance x = 4166.516 /88 68 170.515 /82 71 165.283 /103 58 x = 8173.654 /61 94 177.693 /60 93 172.31 /76 76 This looks funny: so performance is eating less watts than powersaving and balance on NHM. Could it be that the average watts measurements on NHM are not correct/precise..? On SNB they look as expected, according to your scheme. well... it's not always beneficial to group or to spread out it depends on cache behavior mostly which is best Let me try to understand what this means: so performance above with 8 threads means that those threads are spread out across more than one socket, no? If so, this would mean that you have a smaller amount of tasks on each socket, thus the smaller wattage. The powersaving method OTOH fills up the one socket up to the brim, thus the slightly higher consumption due to all threads being occupied. Is that it? not sure. by and large, power efficiency is the same as performance efficiency, with some twists. or to reword that to be more clear if you waste performance due to something that becomes inefficient, you're wasting power as well. now, you might have some hardware effects that can then save you power... but those effects then first need to overcome the waste from the performance inefficiency... and that almost never happens. for example, if you have two workloads that each fit barely inside the last level cache... it's much more efficient to spread these over two sockets... 
where each has its own full LLC to use. If you'd group these together, both would thrash the cache all the time and run inefficient -- bad for power. now, on the other hand, if you have two threads of a process that share a bunch of data structures, and you'd spread these over 2 sockets, you end up bouncing data between the two sockets a lot, running inefficient -- bad for power. Agree with all of the above. However.. having said all this, if you have to tasks that don't have such cache effects, the most efficient way of running things will be on 2 hyperthreading halves... it's very hard to beat the power efficiency of that. .. there are alternatives to hyperthreading. On ARM's big.LITTLE architecture you could simply schedule them on the LITTLE cores. The big cores just can't beat the power efficiency of the LITTLE ones even with 'race to halt' that you allude to below. And usecases like mp3 playback simply don't require the kind of performance that the big cores can offer. But this assumes the tasks don't compete with resources much on the HT level, and achieve good scaling. and this still has to compete with race to halt, because if you're done quicker, you can put the memory in self refresh quicker. none of this stuff is easy for humans or computer programs to determine ahead of time... or sometimes even afterwards. heck, even for just performance it's really really hard already, never mind adding power. my personal gut feeling is that we should just optimize this scheduler stuff for performance, and that we're going to be doing quite well on power already if we achieve that. If Linux is to continue to work efficiently on heterogeneous multi-processing platforms, it needs to provide scheduling mechanisms that can be exploited as per the demands of the HW architecture. An example is the small task packing (and spreading) for which Vincent Guittot has posted a patchset[1] earlier and so has Alex now. 
[1] http://lwn.net/Articles/518834/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
>>>> well... it's not always beneficial to group or to spread out
>>>> it depends on cache behavior mostly which is best
>>>
>>> Let me try to understand what this means: so "performance" above with
>>> 8 threads means that those threads are spread out across more than one
>>> socket, no?
>>>
>>> If so, this would mean that you have a smaller amount of tasks on each
>>> socket, thus the smaller wattage.
>>>
>>> The "powersaving" method OTOH fills up the one socket up to the brim,
>>> thus the slightly higher consumption due to all threads being occupied.
>>>
>>> Is that it?
>>
>> not sure.
>>
>> by and large, power efficiency is the same as performance efficiency,
>> with some twists.
>> or to reword that to be more clear
>> if you waste performance due to something that becomes inefficient,
>> you're wasting power as well.
>> now, you might have some hardware effects that can then save you
>> power... but those effects then first need to overcome the waste from
>> the performance inefficiency... and that almost never happens.
>>
>> for example, if you have two workloads that each fit barely inside the
>> last level cache... it's much more efficient to spread these over two
>> sockets... where each has its own full LLC to use.
>> If you'd group these together, both would thrash the cache all the
>> time and run inefficient --> bad for power.
>>
>> now, on the other hand, if you have two threads of a process that
>> share a bunch of data structures, and you'd spread these over 2
>> sockets, you end up bouncing data between the two sockets a lot,
>> running inefficient --> bad for power.
>
> Agree with all of the above. However..
>
>> having said all this, if you have two tasks that don't have such cache
>> effects, the most efficient way of running things will be on 2
>> hyperthreading halves... it's very hard to beat the power efficiency
>> of that.
>
> .. there are alternatives to hyperthreading. On ARM's big.LITTLE
> architecture you could simply schedule them on the LITTLE cores. The
> big cores just can't beat the power efficiency of the LITTLE ones even
> with 'race to halt' that you allude to below. And usecases like mp3
> playback simply don't require the kind of performance that the big
> cores can offer.
>
>> But this assumes the tasks don't compete with resources much on the HT
>> level, and achieve good scaling.
>> and this still has to compete with "race to halt", because if you're
>> done quicker, you can put the memory in self refresh quicker.
>>
>> none of this stuff is easy for humans or computer programs to
>> determine ahead of time... or sometimes even afterwards.
>> heck, even for just performance it's really really hard already,
>> never mind adding power.
>>
>> my personal gut feeling is that we should just optimize this scheduler
>> stuff for performance, and that we're going to be doing quite well on
>> power already if we achieve that.
>
> If Linux is to continue to work efficiently on heterogeneous
> multi-processing platforms, it needs to provide scheduling mechanisms
> that can be exploited as per the demands of the HW architecture.

Linus definitely disagrees with such ideas. :)
So we need to summarize the logic that holds beyond any particular
hardware.

> example is the small task packing (and spreading) for which Vincent
> Guittot has posted a patchset[1] earlier and so has Alex now.

Sure. I think my patchset should already handle the 'small task packing'
scenario. Would you guys like to give it a try?

> [1] http://lwn.net/Articles/518834/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12 December 2012 14:55, Alex Shi <lkml.a...@gmail.com> wrote:
>>>>> well... it's not always beneficial to group or to spread out
>>>>> it depends on cache behavior mostly which is best
>>>>
>>>> Let me try to understand what this means: so "performance" above with
>>>> 8 threads means that those threads are spread out across more than one
>>>> socket, no?
>>>>
>>>> If so, this would mean that you have a smaller amount of tasks on each
>>>> socket, thus the smaller wattage.
>>>>
>>>> The "powersaving" method OTOH fills up the one socket up to the brim,
>>>> thus the slightly higher consumption due to all threads being occupied.
>>>>
>>>> Is that it?
>>>
>>> not sure.
>>>
>>> by and large, power efficiency is the same as performance efficiency,
>>> with some twists.
>>> or to reword that to be more clear
>>> if you waste performance due to something that becomes inefficient,
>>> you're wasting power as well.
>>> now, you might have some hardware effects that can then save you
>>> power... but those effects then first need to overcome the waste from
>>> the performance inefficiency... and that almost never happens.
>>>
>>> for example, if you have two workloads that each fit barely inside the
>>> last level cache... it's much more efficient to spread these over two
>>> sockets... where each has its own full LLC to use.
>>> If you'd group these together, both would thrash the cache all the
>>> time and run inefficient --> bad for power.
>>>
>>> now, on the other hand, if you have two threads of a process that
>>> share a bunch of data structures, and you'd spread these over 2
>>> sockets, you end up bouncing data between the two sockets a lot,
>>> running inefficient --> bad for power.
>>
>> Agree with all of the above. However..
>>
>>> having said all this, if you have two tasks that don't have such cache
>>> effects, the most efficient way of running things will be on 2
>>> hyperthreading halves... it's very hard to beat the power efficiency
>>> of that.
>>
>> .. there are alternatives to hyperthreading. On ARM's big.LITTLE
>> architecture you could simply schedule them on the LITTLE cores. The
>> big cores just can't beat the power efficiency of the LITTLE ones even
>> with 'race to halt' that you allude to below. And usecases like mp3
>> playback simply don't require the kind of performance that the big
>> cores can offer.
>>
>>> But this assumes the tasks don't compete with resources much on the HT
>>> level, and achieve good scaling.
>>> and this still has to compete with "race to halt", because if you're
>>> done quicker, you can put the memory in self refresh quicker.
>>>
>>> none of this stuff is easy for humans or computer programs to
>>> determine ahead of time... or sometimes even afterwards.
>>> heck, even for just performance it's really really hard already,
>>> never mind adding power.
>>>
>>> my personal gut feeling is that we should just optimize this scheduler
>>> stuff for performance, and that we're going to be doing quite well on
>>> power already if we achieve that.
>>
>> If Linux is to continue to work efficiently on heterogeneous
>> multi-processing platforms, it needs to provide scheduling mechanisms
>> that can be exploited as per the demands of the HW architecture.
>
> Linus definitely disagrees with such ideas. :)
> So we need to summarize the logic that holds beyond any particular
> hardware.
>
>> example is the small task packing (and spreading) for which Vincent
>> Guittot has posted a patchset[1] earlier and so has Alex now.
>
> Sure. I think my patchset should already handle the 'small task packing'
> scenario. Would you guys like to give it a try?

Hi Alex,

Yes, I will give your patchset a try when I have some spare time.

Vincent

>> [1] http://lwn.net/Articles/518834/
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On Tue, Dec 11, 2012 at 08:40:40AM -0800, Arjan van de Ven wrote:
>> Let me try to understand what this means: so "performance" above with
>> 8 threads means that those threads are spread out across more than one
>> socket, no?
>>
>> If so, this would mean that you have a smaller amount of tasks on each
>> socket, thus the smaller wattage.
>>
>> The "powersaving" method OTOH fills up the one socket up to the brim,
>> thus the slightly higher consumption due to all threads being occupied.
>>
>> Is that it?
>
> not sure.
>
> by and large, power efficiency is the same as performance efficiency,
> with some twists.
> or to reword that to be more clear
> if you waste performance due to something that becomes inefficient,
> you're wasting power as well.
> now, you might have some hardware effects that can then save you
> power... but those effects then first need to overcome the waste from
> the performance inefficiency... and that almost never happens.
>
> for example, if you have two workloads that each fit barely inside the
> last level cache... it's much more efficient to spread these over two
> sockets... where each has its own full LLC to use.
> If you'd group these together, both would thrash the cache all the
> time and run inefficient --> bad for power.

Hmm, are you saying that powering up the second socket so that the
working set fully fits in the LLC is still less power used than the cost
of going up to memory and bringing those lines back in?

I'd say there's a breakeven point depending on the workload duration, no?
Which means that we need to be able to look into the future in order to
know what to do... ;-/

> now, on the other hand, if you have two threads of a process that
> share a bunch of data structures, and you'd spread these over 2
> sockets, you end up bouncing data between the two sockets a lot,
> running inefficient --> bad for power.

Yeah, that should be addressed by the NUMA patches people are working on
right now.

> having said all this, if you have two tasks that don't have such cache
> effects, the most efficient way of running things will be on 2
> hyperthreading halves... it's very hard to beat the power efficiency
> of that.
> But this assumes the tasks don't compete with resources much on the HT
> level, and achieve good scaling.
> and this still has to compete with "race to halt", because if you're
> done quicker, you can put the memory in self refresh quicker.

Right, how are we addressing the breakeven in that case?

AFAIK, we do schedule them now on two different cores (not HT threads,
i.e. no resource sharing besides L2) so that we get done faster, i.e.
race to idle in the performance case. And in the powersavings' case we
leave them as tightly packed as possible.

> none of this stuff is easy for humans or computer programs to
> determine ahead of time... or sometimes even afterwards.
> heck, even for just performance it's really really hard already,
> never mind adding power.
>
> my personal gut feeling is that we should just optimize this scheduler
> stuff for performance, and that we're going to be doing quite well on
> power already if we achieve that.

Probably. I wonder if there is a way to measure power consumption of
different workloads in perf and then run those with different scheduling
policies.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12/12/2012 10:21 PM, Vincent Guittot wrote:
>>> If Linux is to continue to work efficiently on heterogeneous
>>> multi-processing platforms, it needs to provide scheduling mechanisms
>>> that can be exploited as per the demands of the HW architecture.
>>
>> Linus definitely disagrees with such ideas. :)
>> So we need to summarize the logic that holds beyond any particular
>> hardware.
>>
>>> example is the small task packing (and spreading) for which Vincent
>>> Guittot has posted a patchset[1] earlier and so has Alex now.
>>
>> Sure. I think my patchset should already handle the 'small task packing'
>> scenario. Would you guys like to give it a try?
>
> Hi Alex,
>
> Yes, I will give your patchset a try when I have some spare time.

Thanks, Vincent! The balance and powersaving policies should both have
an effect.
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
>> now, on the other hand, if you have two threads of a process that
>> share a bunch of data structures, and you'd spread these over 2
>> sockets, you end up bouncing data between the two sockets a lot,
>> running inefficient --> bad for power.
>
> Yeah, that should be addressed by the NUMA patches people are working on
> right now.

Yes. With the balance/powersaving policies we can pack tasks tightly
first, and then NUMA balancing will make memory follow us.

BTW, NUMA balancing is more about pages in memory, not the LLC.

>> having said all this, if you have two tasks that don't have such cache
>> effects, the most efficient way of running things will be on 2
>> hyperthreading halves... it's very hard to beat the power efficiency
>> of that.
>> But this assumes the tasks don't compete with resources much on the HT
>> level, and achieve good scaling.
>> and this still has to compete with "race to halt", because if you're
>> done quicker, you can put the memory in self refresh quicker.
>
> Right, how are we addressing the breakeven in that case?
>
> AFAIK, we do schedule them now on two different cores (not HT threads,
> i.e. no resource sharing besides L2) so that we get done faster, i.e. race to

That's what the balance policy is for. :)

> idle in the performance case. And in the powersavings' case we leave them
> as tightly packed as possible.
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12/12/2012 12:13 AM, Borislav Petkov wrote:
> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>> results show we get much better performance/power with powersaving and
>>>> balance policy:
>>>>
>>>> testing command:
>>>> #pigz -k -c -p$x -r linux* &> /dev/null
>>>>
>>>> On a NHM EP box
>>>>          powersaving        balance            performance
>>>> x = 4    166.516 /88 68     170.515 /82 71     165.283 /103 58
>>>> x = 8    173.654 /61 94     177.693 /60 93     172.31  /76  76
>>>
>>> This looks funny: so "performance" is eating less watts than
>>> "powersaving" and "balance" on NHM. Could it be that the average watts
>>> measurements on NHM are not correct/precise..? On SNB they look as
>>> expected, according to your scheme.
>>
>> well... it's not always beneficial to group or to spread out
>> it depends on cache behavior mostly which is best
>
> Let me try to understand what this means: so "performance" above with
> 8 threads means that those threads are spread out across more than one
> socket, no?
>
> If so, this would mean that you have a smaller amount of tasks on each
> socket, thus the smaller wattage.
>
> The "powersaving" method OTOH fills up the one socket up to the brim,
> thus the slightly higher consumption due to all threads being occupied.
>

As Arjan said, we know the performance increase should be due to cache
sharing in the LLC. As for the power consumption of powersaving versus
performance: when we keep 2 sockets busy, the CPU load is not 100%, so
some LCPUs still have time to go idle or to run at a low frequency, and
that can also save some power.

That is just the general situation. Different hardware and different
CPUs may be tuned differently in the CPU package, core, uncore parts etc.
So the results also differ from one benchmark to another.

> Is that it?
>
> Thanks.
>
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12/11/2012 8:13 AM, Borislav Petkov wrote:
> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>> results show we get much better performance/power with powersaving and
>>>> balance policy:
>>>>
>>>> testing command:
>>>> #pigz -k -c -p$x -r linux* &> /dev/null
>>>>
>>>> On a NHM EP box
>>>>          powersaving        balance            performance
>>>> x = 4    166.516 /88 68     170.515 /82 71     165.283 /103 58
>>>> x = 8    173.654 /61 94     177.693 /60 93     172.31  /76  76
>>>
>>> This looks funny: so "performance" is eating less watts than
>>> "powersaving" and "balance" on NHM. Could it be that the average watts
>>> measurements on NHM are not correct/precise..? On SNB they look as
>>> expected, according to your scheme.
>>
>> well... it's not always beneficial to group or to spread out
>> it depends on cache behavior mostly which is best
>
> Let me try to understand what this means: so "performance" above with
> 8 threads means that those threads are spread out across more than one
> socket, no?
>
> If so, this would mean that you have a smaller amount of tasks on each
> socket, thus the smaller wattage.
>
> The "powersaving" method OTOH fills up the one socket up to the brim,
> thus the slightly higher consumption due to all threads being occupied.
>
> Is that it?

not sure.

by and large, power efficiency is the same as performance efficiency,
with some twists.
or to reword that to be more clear
if you waste performance due to something that becomes inefficient,
you're wasting power as well.
now, you might have some hardware effects that can then save you power...
but those effects then first need to overcome the waste from the
performance inefficiency... and that almost never happens.

for example, if you have two workloads that each fit barely inside the
last level cache... it's much more efficient to spread these over two
sockets... where each has its own full LLC to use.
If you'd group these together, both would thrash the cache all the time
and run inefficient --> bad for power.

now, on the other hand, if you have two threads of a process that share
a bunch of data structures, and you'd spread these over 2 sockets, you
end up bouncing data between the two sockets a lot, running
inefficient --> bad for power.

having said all this, if you have two tasks that don't have such cache
effects, the most efficient way of running things will be on 2
hyperthreading halves... it's very hard to beat the power efficiency of
that.
But this assumes the tasks don't compete with resources much on the HT
level, and achieve good scaling.
and this still has to compete with "race to halt", because if you're
done quicker, you can put the memory in self refresh quicker.

none of this stuff is easy for humans or computer programs to determine
ahead of time... or sometimes even afterwards.
heck, even for just performance it's really really hard already, never
mind adding power.

my personal gut feeling is that we should just optimize this scheduler
stuff for performance, and that we're going to be doing quite well on
power already if we achieve that.
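A toy model may help make the "race to halt" trade-off above concrete. It is only a sketch: the function name and every number below are hypothetical and not taken from any measurement in this thread. Over a fixed time window, a run that finishes early spends the rest of the window at idle power, so which strategy uses less energy depends on how low the idle power is.

```python
def window_energy(active_watts, busy_seconds, idle_watts, window_seconds):
    """Energy in joules over a fixed window: busy phase plus idle remainder.

    Hypothetical helper for illustration; real platforms have several
    idle states and non-constant power draw.
    """
    if busy_seconds > window_seconds:
        raise ValueError("busy time exceeds the observation window")
    return active_watts * busy_seconds + idle_watts * (window_seconds - busy_seconds)


# Hypothetical runs: "fast" burns more power but finishes sooner.
FAST = (200.0, 60.0)   # watts while busy, seconds busy
SLOW = (170.0, 80.0)
WINDOW = 100.0         # seconds observed in both cases

for idle_watts in (100.0, 80.0, 50.0):
    fast = window_energy(*FAST, idle_watts, WINDOW)
    slow = window_energy(*SLOW, idle_watts, WINDOW)
    print(f"idle={idle_watts:5.1f} W  fast={fast:7.0f} J  slow={slow:7.0f} J")
```

With these made-up numbers the two strategies break even at 80 W idle power. Below that, finishing early wins, which matches the point that a quicker finish lets memory enter self refresh sooner.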
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>> Another testing of parallel compress with pigz on Linus' git tree.
>>> results show we get much better performance/power with powersaving and
>>> balance policy:
>>>
>>> testing command:
>>> #pigz -k -c -p$x -r linux* &> /dev/null
>>>
>>> On a NHM EP box
>>>          powersaving        balance            performance
>>> x = 4    166.516 /88 68     170.515 /82 71     165.283 /103 58
>>> x = 8    173.654 /61 94     177.693 /60 93     172.31  /76  76
>>
>> This looks funny: so "performance" is eating less watts than
>> "powersaving" and "balance" on NHM. Could it be that the average watts
>> measurements on NHM are not correct/precise..? On SNB they look as
>> expected, according to your scheme.
>
> well... it's not always beneficial to group or to spread out
> it depends on cache behavior mostly which is best

Let me try to understand what this means: so "performance" above with
8 threads means that those threads are spread out across more than one
socket, no?

If so, this would mean that you have a smaller amount of tasks on each
socket, thus the smaller wattage.

The "powersaving" method OTOH fills up the one socket up to the brim,
thus the slightly higher consumption due to all threads being occupied.

Is that it?

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12/11/2012 7:48 AM, Borislav Petkov wrote:
> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>> Another testing of parallel compress with pigz on Linus' git tree.
>> results show we get much better performance/power with powersaving and
>> balance policy:
>>
>> testing command:
>> #pigz -k -c -p$x -r linux* &> /dev/null
>>
>> On a NHM EP box
>>          powersaving        balance            performance
>> x = 4    166.516 /88 68     170.515 /82 71     165.283 /103 58
>> x = 8    173.654 /61 94     177.693 /60 93     172.31  /76  76
>
> This looks funny: so "performance" is eating less watts than
> "powersaving" and "balance" on NHM. Could it be that the average watts
> measurements on NHM are not correct/precise..? On SNB they look as
> expected, according to your scheme.

well... it's not always beneficial to group or to spread out
it depends on cache behavior mostly which is best
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
> Another testing of parallel compress with pigz on Linus' git tree.
> results show we get much better performance/power with powersaving and
> balance policy:
>
> testing command:
> #pigz -k -c -p$x -r linux* &> /dev/null
>
> On a NHM EP box
>          powersaving        balance            performance
> x = 4    166.516 /88 68     170.515 /82 71     165.283 /103 58
> x = 8    173.654 /61 94     177.693 /60 93     172.31  /76  76

This looks funny: so "performance" is eating less watts than
"powersaving" and "balance" on NHM. Could it be that the average watts
measurements on NHM are not correct/precise..? On SNB they look as
expected, according to your scheme.

Also, shouldn't you have the shortest compress times with "performance"?

>
> On a 2 sockets SNB EP box.
>          powersaving        balance            performance
> x = 4    190.995 /149 35    200.6   /129 38    208.561 /135 35
> x = 8    197.969 /108 46    208.885 /103 46    213.96  /108 43
> x = 16   205.163 /76  64    212.144 /91  51    229.287 /97  44

Ditto here, compress times with "performance" are not the shortest. Or
does "performance" mean something else? :-)

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
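One way to read the NHM numbers questioned above is to compare energy to completion instead of average watts. The sketch below multiplies the posted average power by the posted compress time for the x = 4 row. The helper name is made up for illustration, and it assumes the posted watts are averages over the whole run.

```python
def energy_joules(avg_watts, seconds):
    """Energy to completion = average power * run time (hypothetical helper)."""
    return avg_watts * seconds


# NHM EP box, x = 4, values copied from the posted table: (avg watts, seconds)
nhm_x4 = {
    "powersaving": (166.516, 88),
    "balance": (170.515, 82),
    "performance": (165.283, 103),
}

for policy, (watts, secs) in nhm_x4.items():
    print(f"{policy:12s} {energy_joules(watts, secs):8.0f} J")
```

On that row "performance" has the lowest average watts but the longest run time, so its energy to completion is the highest of the three.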
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12/11/2012 08:51 AM, Alex Shi wrote:
> On Mon, Dec 10, 2012 at 4:22 PM, Alex Shi wrote:
>> This patchset is based on the tip/sched/core tree for now, since it is
>> more stable than tip/master, and it is easy to rebase onto tip/master.
>>
>> It includes 3 parts:
>>
>> 1, simplified fork, patch 1~4, which simplifies the fork/exec/wake logic
>> in find_idlest_group and select_task_rq_fair. It can increase hackbench
>> process and thread performance by 10+% on our 4 sockets SNB EP machine.
>>
>> 2, enable load average into LB, patch 5~9, which uses load average in
>> load balancing, with a runnable load value initialization bug fix and a
>> new fork task load contrib enhancement.
>>
>> 3, power awareness scheduling, patch 10~18,
>> Defines 2 new power aware policies, balance and powersaving, and then
>> tries to spread or shrink tasks on CPU units according to the selected
>> scheduler policy. That can save much power when the number of tasks in
>> the system is no more than the number of CPUs.
>
> Tried the sysbench fileio test in rndrw mode, with half as many threads
> as LCPUs: performance is similar, and power savings are about 5~10 Watts
> on 2 sockets SNB EP and NHM EP boxes.

Another test: parallel compression with pigz on Linus' git tree.
The results show we get much better performance/power with the
powersaving and balance policies:

testing command:
#pigz -k -c -p$x -r linux* &> /dev/null

On a NHM EP box
         powersaving        balance            performance
x = 4    166.516 /88 68     170.515 /82 71     165.283 /103 58
x = 8    173.654 /61 94     177.693 /60 93     172.31  /76  76

On a 2 sockets SNB EP box.
         powersaving        balance            performance
x = 4    190.995 /149 35    200.6   /129 38    208.561 /135 35
x = 8    197.969 /108 46    208.885 /103 46    213.96  /108 43
x = 16   205.163 /76  64    212.144 /91  51    229.287 /97  44

data format is: 166.516 /88 68
166.516: average Watts
88: seconds (compress time)
68: scaled performance/power = 100 / time / power
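The "scaled performance/power" column can be reproduced with a few lines. This is a sketch under one assumption that the message does not state: the posted scores match 100 / time / power multiplied by 10**4 and truncated to an integer. The function name is made up for illustration.

```python
def scaled_score(avg_watts, seconds):
    """100 / time / power, scaled by 10**4 and truncated.

    The 10**4 factor and the truncation are inferred from the posted
    numbers, not taken from the original message.
    """
    return int(100 / seconds / avg_watts * 10_000)


# (avg watts, seconds, posted score) copied from the two tables above
posted = [
    (166.516, 88, 68), (170.515, 82, 71), (165.283, 103, 58),   # NHM, x = 4
    (173.654, 61, 94), (177.693, 60, 93), (172.31, 76, 76),     # NHM, x = 8
    (190.995, 149, 35), (200.6, 129, 38), (208.561, 135, 35),   # SNB, x = 4
    (197.969, 108, 46), (208.885, 103, 46), (213.96, 108, 43),  # SNB, x = 8
    (205.163, 76, 64), (212.144, 91, 51), (229.287, 97, 44),    # SNB, x = 16
]

for watts, secs, score in posted:
    print(watts, secs, scaled_score(watts, secs), score)
```

A higher score means less energy per run, since the score is inversely proportional to watts multiplied by seconds.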
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12/11/2012 8:13 AM, Borislav Petkov wrote:
> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>> results show we get much better performance/power with powersaving
>>>> and balance policy:
>>>>
>>>> testing command:
>>>> #pigz -k -c -p$x -r linux* /dev/null
>>>>
>>>> On a NHM EP box
>>>>          powersaving        balance            performance
>>>> x = 4    166.516 /88 68     170.515 /82 71     165.283 /103 58
>>>> x = 8    173.654 /61 94     177.693 /60 93     172.31 /76 76
>>>
>>> This looks funny: so performance is eating less watts than
>>> powersaving and balance on NHM. Could it be that the average watts
>>> measurements on NHM are not correct/precise..? On SNB they look as
>>> expected, according to your scheme.
>>
>> well... it's not always beneficial to group or to spread out
>> it depends on cache behavior mostly which is best
>
> Let me try to understand what this means: so performance above with
> 8 threads means that those threads are spread out across more than one
> socket, no?
>
> If so, this would mean that you have a smaller amount of tasks on each
> socket, thus the smaller wattage.
>
> The powersaving method OTOH fills up the one socket up to the brim,
> thus the slightly higher consumption due to all threads being occupied.
>
> Is that it?

not sure.

by and large, power efficiency is the same as performance efficiency, with some twists.
or to reword that to be more clear:
if you waste performance due to something that becomes inefficient, you're wasting power as well.
now, you might have some hardware effects that can then save you power... but those effects then first need to overcome the waste from the performance inefficiency... and that almost never happens.

for example, if you have two workloads that each barely fit inside the last level cache... it's much more efficient to spread these over two sockets... where each has its own full LLC to use.
If you'd group these together, both would thrash the cache all the time and run inefficiently -- bad for power.

now, on the other hand, if you have two threads of a process that share a bunch of data structures, and you'd spread these over 2 sockets, you end up bouncing data between the two sockets a lot, running inefficiently -- bad for power.

having said all this, if you have two tasks that don't have such cache effects, the most efficient way of running things will be on 2 hyperthreading halves... it's very hard to beat the power efficiency of that.
But this assumes the tasks don't compete for resources much at the HT level, and achieve good scaling.
and this still has to compete with race to halt, because if you're done quicker, you can put the memory in self refresh quicker.

none of this stuff is easy for humans or computer programs to determine ahead of time... or sometimes even afterwards.
heck, even for just performance it's really really hard already, never mind adding power.

my personal gut feeling is that we should just optimize this scheduler stuff for performance, and that we're going to be doing quite well on power already if we achieve that.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
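A small worked example may help with the "wasted performance is wasted power" argument, using the pigz figures quoted in this message. The thread never labels the three numbers in each cell, so the following is only a sketch under an assumption: the first number is average watts, the number after the slash is run time in seconds, and the last number is a score. Under that reading the last number happens to equal 10^6 divided by (watts x seconds), truncated, for all six cells. The names `energy_joules` and `work_per_energy` are made up for this illustration.

```python
# Figures as quoted in the thread for the NHM EP box.
# ASSUMPTION: each cell is (average watts, run time in seconds, third number);
# the thread itself does not label these columns.
QUOTED = {
    ("powersaving", 4): (166.516, 88, 68),
    ("balance", 4): (170.515, 82, 71),
    ("performance", 4): (165.283, 103, 58),
    ("powersaving", 8): (173.654, 61, 94),
    ("balance", 8): (177.693, 60, 93),
    ("performance", 8): (172.31, 76, 76),
}


def energy_joules(watts, seconds):
    """Energy used by the whole run: average power times duration."""
    return watts * seconds


def work_per_energy(watts, seconds):
    """Hypothetical score: 10^6 divided by the energy in joules, truncated."""
    return int(1_000_000 / energy_joules(watts, seconds))


def report():
    """Return (policy, threads, joules, computed score, quoted score) rows."""
    rows = []
    for (policy, threads), (watts, seconds, quoted) in QUOTED.items():
        rows.append((policy, threads, energy_joules(watts, seconds),
                     work_per_energy(watts, seconds), quoted))
    return rows


if __name__ == "__main__":
    for policy, threads, joules, score, quoted in report():
        print(f"{policy:12s} x={threads}  {joules:9.1f} J  "
              f"score={score}  quoted={quoted}")
```

If the assumption about the columns holds, the x = 4 performance run has the lowest average wattage of the three (165.283) but also the longest run time (103), so its watts x seconds product is the largest. That would fit the point above: average watts alone say little, and a run that takes longer tends to cost more energy overall.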
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On 12/12/2012 12:13 AM, Borislav Petkov wrote:
> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>> results show we get much better performance/power with powersaving
>>>> and balance policy:
>>>>
>>>> testing command:
>>>> #pigz -k -c -p$x -r linux* /dev/null
>>>>
>>>> On a NHM EP box
>>>>          powersaving        balance            performance
>>>> x = 4    166.516 /88 68     170.515 /82 71     165.283 /103 58
>>>> x = 8    173.654 /61 94     177.693 /60 93     172.31 /76 76
>>>
>>> This looks funny: so performance is eating less watts than
>>> powersaving and balance on NHM. Could it be that the average watts
>>> measurements on NHM are not correct/precise..? On SNB they look as
>>> expected, according to your scheme.
>>
>> well... it's not always beneficial to group or to spread out
>> it depends on cache behavior mostly which is best
>
> Let me try to understand what this means: so performance above with
> 8 threads means that those threads are spread out across more than one
> socket, no?
>
> If so, this would mean that you have a smaller amount of tasks on each
> socket, thus the smaller wattage.
>
> The powersaving method OTOH fills up the one socket up to the brim,
> thus the slightly higher consumption due to all threads being occupied.

As Arjan said, the performance increase should be due to cache sharing in the LLC. As for the power consumption difference between powersaving and performance: when we use both sockets, the CPU load is not 100%, so some logical CPUs still have time to go idle or to run at a lower frequency, and that can also save some power.

That is just the general situation. Different hardware and different CPUs may have different tuning in the CPU package, core, uncore parts etc., so the results also differ between benchmarks.

> Is that it?
>
> Thanks.
Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
On Mon, Dec 10, 2012 at 4:22 PM, Alex Shi wrote:
> This patchset base on tip/sched/core tree temporary, since it is more
> steady than tip/master. and it's easy to rebase on tip/master.
>
> It includes 3 parts changes.
>
> 1, simplified fork, patch 1~4, that simplified the fork/exec/wake log in
> find_idlest_group and select_task_rq_fair. it can increase 10+%
> hackbench process and thread performance on our 4 sockets SNB EP machine.
>
> 2, enable load average into LB, patch 5~9, that using load average in
> load balancing, with a runnable load value industrialization bug fix and
> new fork task load contrib enhancement.
>
> 3, power awareness scheduling, patch 10~18,
> Defined 2 new power aware policy balance and
> powersaving, and then try to spread or shrink tasks on CPU unit
> according the different scheduler policy. That can save much power when
> task number in system is no more then cpu number.

Tried with the sysbench fileio test in rndrw mode, with half as many threads as logical CPUs: performance is similar, and power drops by about 5~10 watts on 2-socket SNB EP and NHM EP boxes.

Any comments? :)

>
> Any comments are appreciated!
>
> Best regards!
> Alex
>
> [PATCH 01/18] sched: select_task_rq_fair clean up
> [PATCH 02/18] sched: fix find_idlest_group mess logical
> [PATCH 03/18] sched: don't need go to smaller sched domain
> [PATCH 04/18] sched: remove domain iterations in fork/exec/wake
> [PATCH 05/18] sched: load tracking bug fix
> [PATCH 06/18] sched: set initial load avg of new forked task as its
> [PATCH 07/18] sched: compute runnable load avg in cpu_load and
> [PATCH 08/18] sched: consider runnable load average in move_tasks
> [PATCH 09/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
> [PATCH 10/18] sched: add sched_policy in kernel
> [PATCH 11/18] sched: add sched_policy and it's sysfs interface
> [PATCH 12/18] sched: log the cpu utilization at rq
> [PATCH 13/18] sched: add power aware scheduling in fork/exec/wake
> [PATCH 14/18] sched: add power/performance balance allowed flag
> [PATCH 15/18] sched: don't care if the local group has capacity
> [PATCH 16/18] sched: pull all tasks from source group
> [PATCH 17/18] sched: power aware load balance,
> [PATCH 18/18] sched: lazy powersaving balance
[PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
This patchset is based on the tip/sched/core tree for now, since it is more stable than tip/master, and it is easy to rebase onto tip/master.

It includes 3 parts.

1, simplified fork, patches 1~4, which simplify the fork/exec/wake logic in find_idlest_group and select_task_rq_fair. This can increase hackbench process and thread performance by 10+% on our 4-socket SNB EP machine.

2, enable load average in LB, patches 5~9, which use load average in load balancing, with a runnable load value initialization bug fix and a new fork task load contrib enhancement.

3, power awareness scheduling, patches 10~18, which define 2 new power-aware policies, balance and powersaving, and then try to spread or shrink tasks on CPU units according to the selected scheduler policy. That can save much power when the number of tasks in the system is no more than the number of CPUs.

Any comments are appreciated!

Best regards!
Alex

[PATCH 01/18] sched: select_task_rq_fair clean up
[PATCH 02/18] sched: fix find_idlest_group mess logical
[PATCH 03/18] sched: don't need go to smaller sched domain
[PATCH 04/18] sched: remove domain iterations in fork/exec/wake
[PATCH 05/18] sched: load tracking bug fix
[PATCH 06/18] sched: set initial load avg of new forked task as its
[PATCH 07/18] sched: compute runnable load avg in cpu_load and
[PATCH 08/18] sched: consider runnable load average in move_tasks
[PATCH 09/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
[PATCH 10/18] sched: add sched_policy in kernel
[PATCH 11/18] sched: add sched_policy and it's sysfs interface
[PATCH 12/18] sched: log the cpu utilization at rq
[PATCH 13/18] sched: add power aware scheduling in fork/exec/wake
[PATCH 14/18] sched: add power/performance balance allowed flag
[PATCH 15/18] sched: don't care if the local group has capacity
[PATCH 16/18] sched: pull all tasks from source group
[PATCH 17/18] sched: power aware load balance,
[PATCH 18/18] sched: lazy powersaving balance
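To make the "spread or shrink tasks" idea in part 3 a bit more concrete, here is a toy placement model. It is not the patchset's code and does not use any kernel interface; it only mirrors how the policies are described in this thread: powersaving fills one socket before touching the next, while performance spreads tasks across sockets. The balance policy is left out because this cover letter does not say how it differs. The function name `place_tasks` and the machine shape (2 sockets with 8 logical CPUs each) are assumptions for illustration.

```python
def place_tasks(n_tasks, n_sockets, lcpus_per_socket, policy):
    """Toy model: return how many tasks land on each socket.

    ASSUMPTION: one task per logical CPU at most, which matches the case
    the cover letter cares about (task count <= CPU count).
    """
    if n_tasks > n_sockets * lcpus_per_socket:
        raise ValueError("toy model only covers task count <= logical CPU count")

    counts = [0] * n_sockets
    if policy == "powersaving":
        # Pack: fill the first socket completely before using the next one.
        remaining = n_tasks
        for socket in range(n_sockets):
            counts[socket] = min(remaining, lcpus_per_socket)
            remaining -= counts[socket]
    elif policy == "performance":
        # Spread: hand tasks out round-robin across the sockets.
        for task in range(n_tasks):
            counts[task % n_sockets] += 1
    else:
        raise ValueError(f"policy not modelled here: {policy}")
    return counts


if __name__ == "__main__":
    for policy in ("powersaving", "performance"):
        print(policy, place_tasks(8, 2, 8, policy))
```

With 8 tasks on the assumed 2 x 8 machine, the packed placement leaves the second socket with no tasks at all, which is the situation in which a whole package could stay idle. Whether that actually saves power depends on the cache and race-to-halt effects discussed in the replies.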