Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-13 Thread Alex Shi
On 12/13/2012 07:35 PM, Borislav Petkov wrote:
> On Thu, Dec 13, 2012 at 11:07:43AM +0800, Alex Shi wrote:
>>>> now, on the other hand, if you have two threads of a process that
>>>> share a bunch of data structures, and you'd spread these over 2
>>>> sockets, you end up bouncing data between the two sockets a lot,
>>>> running inefficient --> bad for power.
>>>
>>> Yeah, that should be addressed by the NUMA patches people are
>>> working on right now.
>>
>> Yes, as to balance/powersaving policy, we can tight pack tasks
>> firstly, then NUMA balancing will make memory follow us.
>>
>> BTW, NUMA balancing is more related with page in memory. not LLC.
> 
> Sure, let's look at the worst and best cases:
> 
> * worst case: you have memory shared by multiple threads on one node
> *and* working set doesn't fit in LLC.
> 
> Here, if you pack threads tightly only on one node, you still suffer the
> working set kicking out parts of itself out of LLC.
> 
> If you spread threads around, you still cannot avoid the LLC thrashing
> because the LLC of the node containing the shared memory needs to cache
> all those transactions. *In* *addition*, you get the cross-node traffic
> because the shared pages are on the first node.
> 
> Major suckage.
> 
> Does it matter? I don't know. It can be decided on a case-by-case basis.
> If people care about singlethread perf, they would likely want to spread
> around and buy in the cross-node traffic.
> 
> If they care for power, then maybe they don't want to turn on the second
> socket yet.
> 
> * the optimal case is where memory follows threads and gets spread
> around such that LLC doesn't get thrashed and cross-node traffic gets
> avoided.
> 
> Now, you can think of all those other scenarios in between :-/

You are right. Thanks for the explanation! :)

Actually, what I meant to say is that the target of NUMA balancing is
page placement across the nodes' memory, not the LLC; but of course it
may improve LLC performance as well.
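
As a concrete illustration (not part of the patchset): the small userspace
sketch below asks the kernel which node each page of a buffer currently
lives on, via move_pages(2) with a NULL nodes array. It assumes libnuma's
numaif.h header is installed; it is just one way to watch whether memory
really does follow the packed tasks once NUMA balancing has run.

/* query_nodes.c - report the node each page of a buffer lives on.
 * build: gcc query_nodes.c -lnuma -o query_nodes
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        size_t i, npages = 16;
        char *buf = malloc(npages * page);
        void *pages[16];
        int status[16];

        for (i = 0; i < npages; i++) {
                buf[i * page] = 1;              /* fault the page in */
                pages[i] = buf + i * page;
        }

        /* nodes == NULL: only report the current node of each page */
        if (move_pages(0 /* self */, npages, pages, NULL, status, 0) < 0) {
                perror("move_pages");
                return 1;
        }
        for (i = 0; i < npages; i++)
                printf("page %2zu -> node %d\n", i, status[i]);
        return 0;
}
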
> 
> Thanks.
> 



Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-13 Thread Borislav Petkov
On Thu, Dec 13, 2012 at 11:07:43AM +0800, Alex Shi wrote:
> >> now, on the other hand, if you have two threads of a process that
> >> share a bunch of data structures, and you'd spread these over 2
> >> sockets, you end up bouncing data between the two sockets a lot,
> >> running inefficient --> bad for power.
> >
> > Yeah, that should be addressed by the NUMA patches people are
> > working on right now.
>
> Yes, as to balance/powersaving policy, we can tight pack tasks
> firstly, then NUMA balancing will make memory follow us.
>
> BTW, NUMA balancing is more related with page in memory. not LLC.

Sure, let's look at the worst and best cases:

* worst case: you have memory shared by multiple threads on one node
*and* working set doesn't fit in LLC.

Here, if you pack the threads tightly on only one node, you still suffer
from the working set kicking parts of itself out of the LLC.

If you spread threads around, you still cannot avoid the LLC thrashing
because the LLC of the node containing the shared memory needs to cache
all those transactions. *In* *addition*, you get the cross-node traffic
because the shared pages are on the first node.

Major suckage.

Does it matter? I don't know. It can be decided on a case-by-case basis.
If people care about singlethread perf, they would likely want to spread
around and buy in the cross-node traffic.

If they care for power, then maybe they don't want to turn on the second
socket yet.

* the optimal case is where memory follows threads and gets spread
around such that LLC doesn't get thrashed and cross-node traffic gets
avoided.

Now, you can think of all those other scenarios in between :-/
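
One way to see which of those scenarios a given workload is in is to count
last-level-cache misses directly. The sketch below is only an illustration
(mine, not from this thread): it counts LLC read misses for the calling
task with perf_event_open(2) around a toy streaming loop; the 64 MB buffer
size is an arbitrary assumption meant to exceed a typical LLC.

/* llc_misses.c - LLC read misses for the calling task, via perf_event_open.
 * Sketch only: error handling and counter multiplexing are ignored.
 * build: gcc -O2 llc_misses.c -o llc_misses
 */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        struct perf_event_attr attr;
        size_t i, sz = 64 * 1024 * 1024;        /* assumed larger than LLC */
        char *buf = malloc(sz);
        unsigned long sum = 0;
        uint64_t misses = 0;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HW_CACHE;
        attr.config = PERF_COUNT_HW_CACHE_LL |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* this task, any CPU; may need perf_event_paranoid relaxed */
        fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        memset(buf, 1, sz);                     /* populate the buffer */
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (i = 0; i < sz; i += 64)            /* stream it back in */
                sum += buf[i];
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
                printf("LLC read misses: %llu (sum %lu)\n",
                       (unsigned long long)misses, sum);
        close(fd);
        return 0;
}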

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-12 Thread Alex Shi

>> now, on the other hand, if you have two threads of a process that
>> share a bunch of data structures, and you'd spread these over 2
>> sockets, you end up bouncing data between the two sockets a lot,
>> running inefficient --> bad for power.
> 
> Yeah, that should be addressed by the NUMA patches people are working on
> right now.


Yes, as for the balance/powersaving policies, we can pack the tasks
tightly first, and then NUMA balancing will make the memory follow us.

BTW, NUMA balancing is more concerned with page placement in memory,
not with the LLC.
> 
>> having said all this, if you have to tasks that don't have such
>> cache effects, the most efficient way of running things will be on 2
>> hyperthreading halves... it's very hard to beat the power efficiency
>> of that. But this assumes the tasks don't compete with resources much
>> on the HT level, and achieve good scaling. and this still has to
>> compete with "race to halt", because if you're done quicker, you can
>> put the memory in self refresh quicker.
> 
> Right, how are we addressing the breakeven in that case? AFAIK, we
> do schedule them now on two different cores (not HT threads, i.e. no
> resource sharing besides L2) so that we get done faster, i.e. race to

that's what the balance policy is for. :)
> idle in the performance case. And in the powersavings' case we leave
> them as tightly packed as possible.
> 



Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-12 Thread Alex Shi
On 12/12/2012 10:21 PM, Vincent Guittot wrote:
>>> >> If Linux is to continue to work efficiently on heterogeneous
>>> >> multi-processing platforms, it needs to provide scheduling mechanisms
>>> >> that can be exploited as per the demands of the HW architecture.
>> >
>> > Linus definitely disagree such ideas. :) So, need to summaries the
>> > logical beyond all hardware.
>> >
>>> >> example is the "small task packing (and spreading)" for which Vincent
>>> >> Guittot has posted a patchset[1] earlier and so has Alex now.
>> >
>> > Sure. I just thought my patchset should handled the 'small task
>> > packing' scenario. Could you guy like to have a try?
> Hi Alex,
> 
> Yes, I will do a try with your patchset when i will have some spare time

Thanks, Vincent! The balance and powersaving policies should show an effect there.


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-12 Thread Borislav Petkov
On Tue, Dec 11, 2012 at 08:40:40AM -0800, Arjan van de Ven wrote:

> >Let me try to understand what this means: so "performance" above with
> >8 threads means that those threads are spread out across more than one
> >socket, no?
> >
> >If so, this would mean that you have a smaller amount of tasks on each
> >socket, thus the smaller wattage.
> >
> >The "powersaving" method OTOH fills up the one socket up to the brim,
> >thus the slightly higher consumption due to all threads being occupied.
> >
> >Is that it?
>
> not sure.
>
> by and large, power efficiency is the same as performance efficiency,
> with some twists. or to reword that to be more clear if you waste
> performance due to something that becomes inefficient, you're wasting
> power as well. now, you might have some hardware effects that can
> then save you power... but those effects then first need to overcome
> the waste from the performance inefficiency... and that almost never
> happens.
>
> for example, if you have two workloads that each fit barely inside
> the last level cache... it's much more efficient to spread these over
> two sockets... where each has its own full LLC to use. If you'd group
> these together, both would thrash the cache all the time and run
> inefficient --> bad for power.

Hmm, are you saying that powering up the second socket so that the
working set fully fits in the LLC is still less power used than the cost
of going up to memory and bringing those lines back in?

I'd say there's a breakeven point depending on the workload duration, no?

Which means that we need to be able to look into the future in order to
know what to do... ;-/

> now, on the other hand, if you have two threads of a process that
> share a bunch of data structures, and you'd spread these over 2
> sockets, you end up bouncing data between the two sockets a lot,
> running inefficient --> bad for power.

Yeah, that should be addressed by the NUMA patches people are working on
right now.

> having said all this, if you have to tasks that don't have such
> cache effects, the most efficient way of running things will be on 2
> hyperthreading halves... it's very hard to beat the power efficiency
> of that. But this assumes the tasks don't compete with resources much
> on the HT level, and achieve good scaling. and this still has to
> compete with "race to halt", because if you're done quicker, you can
> put the memory in self refresh quicker.

Right, how are we addressing the breakeven in that case? AFAIK, we
do schedule them now on two different cores (not HT threads, i.e. no
resource sharing besides L2) so that we get done faster, i.e. race to
idle in the performance case. And in the powersavings' case we leave
them as tightly packed as possible.
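
To make the packed-versus-spread comparison concrete, here is a standalone
sketch (not from the patchset; the CPU numbers are assumptions and have to
be adjusted to the test box's topology) that pins two busy threads either
onto two CPUs of the same socket or onto different sockets, so the two
placements can be compared under a watt meter:

/* pin_pair.c - run two spinning threads packed or spread.
 * build: gcc -O2 -pthread pin_pair.c -o pin_pair
 * usage: ./pin_pair packed|spread
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static void *spin(void *arg)
{
        volatile unsigned long x = 0;
        unsigned long i;

        (void)arg;
        for (i = 0; i < 2000000000UL; i++)
                x += i;
        return NULL;
}

static void pin(pthread_t t, int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(t, sizeof(set), &set);
}

int main(int argc, char **argv)
{
        /* assumption: CPUs 0 and 1 sit on one socket, CPU 8 on the other;
         * adjust to the actual machine. */
        int spread = argc > 1 && !strcmp(argv[1], "spread");
        int cpu_a = 0, cpu_b = spread ? 8 : 1;
        pthread_t t1, t2;

        pthread_create(&t1, NULL, spin, NULL);
        pthread_create(&t2, NULL, spin, NULL);
        pin(t1, cpu_a);
        pin(t2, cpu_b);
        printf("running %s: cpus %d and %d\n",
               spread ? "spread" : "packed", cpu_a, cpu_b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}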

> none of this stuff is easy for humans or computer programs to
> determine ahead of time... or sometimes even afterwards. heck, even
> for just performance it's really really hard already, never mind
> adding power.
>
> my personal gut feeling is that we should just optimize this scheduler
> stuff for performance, and that we're going to be doing quite well on
> power already if we achieve that.

Probably. I wonder if there is a way to measure power consumption of
different workloads in perf and then run those with different scheduling
policies.
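
There is no ready-made hook for that in the scheduler, but as a rough
illustration (an assumption on my side: it reads the RAPL energy counters
through sysfs rather than through perf itself, and only works on kernels
that expose the intel-rapl powercap driver, which is newer than this
thread), package energy around a workload can be sampled like this:

/* rapl_energy.c - package energy consumed while a child command runs.
 * Assumes /sys/class/powercap/intel-rapl:0/energy_uj exists (microjoules).
 * build: gcc rapl_energy.c -o rapl_energy
 * usage: ./rapl_energy <command> [args...]
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static long long read_uj(void)
{
        long long uj = -1;
        FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");

        if (!f || fscanf(f, "%lld", &uj) != 1)
                perror("energy_uj");
        if (f)
                fclose(f);
        return uj;
}

int main(int argc, char **argv)
{
        long long before;
        pid_t pid;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
                return 1;
        }
        before = read_uj();
        pid = fork();
        if (pid == 0) {
                execvp(argv[1], &argv[1]);
                perror("execvp");
                _exit(127);
        }
        waitpid(pid, NULL, 0);
        /* note: the counter wraps around; a real tool would handle that */
        printf("package energy: %.2f J\n", (read_uj() - before) / 1e6);
        return 0;
}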

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-12 Thread Vincent Guittot
On 12 December 2012 14:55, Alex Shi  wrote:
>
>
>>>>> well... it's not always beneficial to group or to spread out
>>>>> it depends on cache behavior mostly which is best


>>>> Let me try to understand what this means: so "performance" above with
>>>> 8 threads means that those threads are spread out across more than one
>>>> socket, no?

>>>> If so, this would mean that you have a smaller amount of tasks on each
>>>> socket, thus the smaller wattage.

>>>> The "powersaving" method OTOH fills up the one socket up to the brim,
>>>> thus the slightly higher consumption due to all threads being occupied.

>>>> Is that it?
>>>
>>>
>>> not sure.
>>>
>>> by and large, power efficiency is the same as performance efficiency, with
>>> some twists.
>>> or to reword that to be more clear
>>> if you waste performance due to something that becomes inefficient, you're
>>> wasting power as well.
>>> now, you might have some hardware effects that can then save you power...
>>> but those effects
>>> then first need to overcome the waste from the performance inefficiency...
>>> and that almost never happens.
>>>
>>> for example, if you have two workloads that each fit barely inside the last
>>> level cache...
>>> it's much more efficient to spread these over two sockets... where each has
>>> its own full LLC
>>> to use.
>>> If you'd group these together, both would thrash the cache all the time and
>>> run inefficient --> bad for power.
>>>
>>> now, on the other hand, if you have two threads of a process that share a
>>> bunch of data structures,
>>> and you'd spread these over 2 sockets, you end up bouncing data between the
>>> two sockets a lot,
>>> running inefficient --> bad for power.
>>>
>>
>> Agree with all of the above. However..
>>
>>> having said all this, if you have to tasks that don't have such cache
>>> effects, the most efficient way
>>> of running things will be on 2 hyperthreading halves... it's very hard to
>>> beat the power efficiency of that.
>>
>> .. there are alternatives to hyperthreading. On ARM's big.LITTLE
>> architecture you could simply schedule them on the LITTLE cores. The
>> big cores just can't beat the power efficiency of the LITTLE ones even
>> with 'race to halt' that you allude to below. And usecases like mp3
>> playback simply don't require the kind of performance that the big
>> cores can offer.
>>
>>> But this assumes the tasks don't compete with resources much on the HT
>>> level, and achieve good scaling.
>>> and this still has to compete with "race to halt", because if you're done
>>> quicker, you can put the memory
>>> in self refresh quicker.
>>>
>>> none of this stuff is easy for humans or computer programs to determine
>>> ahead of time... or sometimes even afterwards.
>>> heck, even for just performance it's really really hard already, never mind
>>> adding power.
>>>
>>> my personal gut feeling is that we should just optimize this scheduler stuff
>>> for performance, and that
>>> we're going to be doing quite well on power already if we achieve that.
>>
>> If Linux is to continue to work efficiently on heterogeneous
>> multi-processing platforms, it needs to provide scheduling mechanisms
>> that can be exploited as per the demands of the HW architecture.
>
> Linus definitely disagree such ideas. :) So, need to summaries the
> logical beyond all hardware.
>
>> example is the "small task packing (and spreading)" for which Vincent
>> Guittot has posted a patchset[1] earlier and so has Alex now.
>
> Sure. I just thought my patchset should handled the 'small task
> packing' scenario. Could you guy like to have a try?

Hi Alex,

Yes, I will give your patchset a try when I have some spare time

Vincent

>>
>> [1] http://lwn.net/Articles/518834/


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-12 Thread Alex Shi


>>>> well... it's not always beneficial to group or to spread out
>>>> it depends on cache behavior mostly which is best
>>>
>>>
>>> Let me try to understand what this means: so "performance" above with
>>> 8 threads means that those threads are spread out across more than one
>>> socket, no?
>>>
>>> If so, this would mean that you have a smaller amount of tasks on each
>>> socket, thus the smaller wattage.
>>>
>>> The "powersaving" method OTOH fills up the one socket up to the brim,
>>> thus the slightly higher consumption due to all threads being occupied.
>>>
>>> Is that it?
>>
>>
>> not sure.
>>
>> by and large, power efficiency is the same as performance efficiency, with
>> some twists.
>> or to reword that to be more clear
>> if you waste performance due to something that becomes inefficient, you're
>> wasting power as well.
>> now, you might have some hardware effects that can then save you power...
>> but those effects
>> then first need to overcome the waste from the performance inefficiency...
>> and that almost never happens.
>>
>> for example, if you have two workloads that each fit barely inside the last
>> level cache...
>> it's much more efficient to spread these over two sockets... where each has
>> its own full LLC
>> to use.
>> If you'd group these together, both would thrash the cache all the time and
>> run inefficient --> bad for power.
>>
>> now, on the other hand, if you have two threads of a process that share a
>> bunch of data structures,
>> and you'd spread these over 2 sockets, you end up bouncing data between the
>> two sockets a lot,
>> running inefficient --> bad for power.
>>
>
> Agree with all of the above. However..
>
>> having said all this, if you have to tasks that don't have such cache
>> effects, the most efficient way
>> of running things will be on 2 hyperthreading halves... it's very hard to
>> beat the power efficiency of that.
>
> .. there are alternatives to hyperthreading. On ARM's big.LITTLE
> architecture you could simply schedule them on the LITTLE cores. The
> big cores just can't beat the power efficiency of the LITTLE ones even
> with 'race to halt' that you allude to below. And usecases like mp3
> playback simply don't require the kind of performance that the big
> cores can offer.
>
>> But this assumes the tasks don't compete with resources much on the HT
>> level, and achieve good scaling.
>> and this still has to compete with "race to halt", because if you're done
>> quicker, you can put the memory
>> in self refresh quicker.
>>
>> none of this stuff is easy for humans or computer programs to determine
>> ahead of time... or sometimes even afterwards.
>> heck, even for just performance it's really really hard already, never mind
>> adding power.
>>
>> my personal gut feeling is that we should just optimize this scheduler stuff
>> for performance, and that
>> we're going to be doing quite well on power already if we achieve that.
>
> If Linux is to continue to work efficiently on heterogeneous
> multi-processing platforms, it needs to provide scheduling mechanisms
> that can be exploited as per the demands of the HW architecture.

Linus definitely disagrees with such ideas. :) So we need to summarise
the common logic beyond any particular hardware.

> example is the "small task packing (and spreading)" for which Vincent
> Guittot has posted a patchset[1] earlier and so has Alex now.

Sure. I just think my patchset should already handle the 'small task
packing' scenario. Would you guys like to give it a try?
>
> [1] http://lwn.net/Articles/518834/


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-12 Thread Amit Kucheria
On Tue, Dec 11, 2012 at 10:10 PM, Arjan van de Ven
 wrote:
> On 12/11/2012 8:13 AM, Borislav Petkov wrote:
>>
>> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>>>
>>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:

>>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>>>
>>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>>> results show we get much better performance/power with powersaving and
>>>>> balance policy:
>>>>>
>>>>> testing command:
>>>>> #pigz -k -c  -p$x -r linux* &> /dev/null
>>>>>
>>>>> On a NHM EP box
>>>>>          powersaving       balance           performance
>>>>> x = 4    166.516 /88 68    170.515 /82 71    165.283 /103 58
>>>>> x = 8    173.654 /61 94    177.693 /60 93    172.31 /76 76
>>>>
>>>>
>>>> This looks funny: so "performance" is eating less watts than
>>>> "powersaving" and "balance" on NHM. Could it be that the average watts
>>>> measurements on NHM are not correct/precise..? On SNB they look as
>>>> expected, according to your scheme.
>>>
>>>
>>> well... it's not always beneficial to group or to spread out
>>> it depends on cache behavior mostly which is best
>>
>>
>> Let me try to understand what this means: so "performance" above with
>> 8 threads means that those threads are spread out across more than one
>> socket, no?
>>
>> If so, this would mean that you have a smaller amount of tasks on each
>> socket, thus the smaller wattage.
>>
>> The "powersaving" method OTOH fills up the one socket up to the brim,
>> thus the slightly higher consumption due to all threads being occupied.
>>
>> Is that it?
>
>
> not sure.
>
> by and large, power efficiency is the same as performance efficiency, with
> some twists.
> or to reword that to be more clear
> if you waste performance due to something that becomes inefficient, you're
> wasting power as well.
> now, you might have some hardware effects that can then save you power...
> but those effects
> then first need to overcome the waste from the performance inefficiency...
> and that almost never happens.
>
> for example, if you have two workloads that each fit barely inside the last
> level cache...
> it's much more efficient to spread these over two sockets... where each has
> its own full LLC
> to use.
> If you'd group these together, both would thrash the cache all the time and
> run inefficient --> bad for power.
>
> now, on the other hand, if you have two threads of a process that share a
> bunch of data structures,
> and you'd spread these over 2 sockets, you end up bouncing data between the
> two sockets a lot,
> running inefficient --> bad for power.
>

Agree with all of the above. However..

> having said all this, if you have to tasks that don't have such cache
> effects, the most efficient way
> of running things will be on 2 hyperthreading halves... it's very hard to
> beat the power efficiency of that.

.. there are alternatives to hyperthreading. On ARM's big.LITTLE
architecture you could simply schedule them on the LITTLE cores. The
big cores just can't beat the power efficiency of the LITTLE ones even
with 'race to halt' that you allude to below. And usecases like mp3
playback simply don't require the kind of performance that the big
cores can offer.

> But this assumes the tasks don't compete with resources much on the HT
> level, and achieve good scaling.
> and this still has to compete with "race to halt", because if you're done
> quicker, you can put the memory
> in self refresh quicker.
>
> none of this stuff is easy for humans or computer programs to determine
> ahead of time... or sometimes even afterwards.
> heck, even for just performance it's really really hard already, never mind
> adding power.
>
> my personal gut feeling is that we should just optimize this scheduler stuff
> for performance, and that
> we're going to be doing quite well on power already if we achieve that.

If Linux is to continue to work efficiently on heterogeneous
multi-processing platforms, it needs to provide scheduling mechanisms
that can be exploited as per the demands of the HW architecture. An
example is the "small task packing (and spreading)" for which Vincent
Guittot has posted a patchset[1] earlier and so has Alex now.

[1] http://lwn.net/Articles/518834/


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-11 Thread Alex Shi
On 12/12/2012 12:13 AM, Borislav Petkov wrote:
> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>> results show we get much better performance/power with powersaving and
>>>> balance policy:
>>>>
>>>> testing command:
>>>> #pigz -k -c  -p$x -r linux* &> /dev/null
>>>>
>>>> On a NHM EP box
>>>>          powersaving       balance           performance
>>>> x = 4    166.516 /88 68    170.515 /82 71    165.283 /103 58
>>>> x = 8    173.654 /61 94    177.693 /60 93    172.31 /76 76
>>>
>>> This looks funny: so "performance" is eating less watts than
>>> "powersaving" and "balance" on NHM. Could it be that the average watts
>>> measurements on NHM are not correct/precise..? On SNB they look as
>>> expected, according to your scheme.
>>
>> well... it's not always beneficial to group or to spread out
>> it depends on cache behavior mostly which is best
> 
> Let me try to understand what this means: so "performance" above with
> 8 threads means that those threads are spread out across more than one
> socket, no?
> 
> If so, this would mean that you have a smaller amount of tasks on each
> socket, thus the smaller wattage.
> 
> The "powersaving" method OTOH fills up the one socket up to the brim,
> thus the slightly higher consumption due to all threads being occupied.
> 

As Arjan said, we know the performance increase should be due to the
cache sharing in the LLC.
As for the power consumption difference between powersaving and
performance: when we burn both sockets the CPU load is not 100%, so some
LCPUs still have time to go idle or to run at a low frequency, and that
can also save some power.
That is just the general situation; different hardware and different
CPUs may have different tuning in the CPU package, core, uncore parts
etc., so different benchmarks also give different results.
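
As a quick sanity check of that, idle time per logical CPU can be sampled
while the benchmark runs. The sketch below is an illustration only, not
part of the measurements above; the 5-second window and the 64-CPU cap are
arbitrary assumptions. It diffs the idle column of /proc/stat over the
window.

/* idle_share.c - idle jiffies per logical CPU over a short window.
 * /proc/stat lines look like "cpuN user nice system idle iowait ...".
 * build: gcc idle_share.c -o idle_share
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAXCPU 64

static int sample(long long idle[])
{
        char line[256];
        int n = 0;
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
                return 0;
        while (fgets(line, sizeof(line), f) && n < MAXCPU) {
                long long user, nice, sys, idl;

                /* skip the aggregate "cpu " line, take "cpu0", "cpu1", ... */
                if (strncmp(line, "cpu", 3) || line[3] < '0' || line[3] > '9')
                        continue;
                if (sscanf(line + 3, "%*d %lld %lld %lld %lld",
                           &user, &nice, &sys, &idl) == 4)
                        idle[n++] = idl;
        }
        fclose(f);
        return n;
}

int main(void)
{
        long long a[MAXCPU], b[MAXCPU];
        int i, n = sample(a);

        sleep(5);               /* run the benchmark meanwhile */
        sample(b);
        for (i = 0; i < n; i++)
                printf("cpu%-2d idle jiffies in window: %lld\n", i, b[i] - a[i]);
        return 0;
}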


> Is that it?
> 
> Thanks.
> 



Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-11 Thread Arjan van de Ven

On 12/11/2012 8:13 AM, Borislav Petkov wrote:

> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:

>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:

>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:

>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>> results show we get much better performance/power with powersaving and
>>>> balance policy:

>>>> testing command:
>>>> #pigz -k -c  -p$x -r linux* &> /dev/null

>>>> On a NHM EP box
>>>>          powersaving       balance           performance
>>>> x = 4    166.516 /88 68    170.515 /82 71    165.283 /103 58
>>>> x = 8    173.654 /61 94    177.693 /60 93    172.31 /76 76


>>> This looks funny: so "performance" is eating less watts than
>>> "powersaving" and "balance" on NHM. Could it be that the average watts
>>> measurements on NHM are not correct/precise..? On SNB they look as
>>> expected, according to your scheme.


>> well... it's not always beneficial to group or to spread out
>> it depends on cache behavior mostly which is best


> Let me try to understand what this means: so "performance" above with
> 8 threads means that those threads are spread out across more than one
> socket, no?

> If so, this would mean that you have a smaller amount of tasks on each
> socket, thus the smaller wattage.

> The "powersaving" method OTOH fills up the one socket up to the brim,
> thus the slightly higher consumption due to all threads being occupied.

> Is that it?


not sure.

by and large, power efficiency is the same as performance efficiency, with some 
twists.
or to reword that to be more clear
if you waste performance due to something that becomes inefficient, you're 
wasting power as well.
now, you might have some hardware effects that can then save you power... but 
those effects
then first need to overcome the waste from the performance inefficiency... and 
that almost never happens.

for example, if you have two workloads that each fit barely inside the last 
level cache...
it's much more efficient to spread these over two sockets... where each has its 
own full LLC
to use.
If you'd group these together, both would thrash the cache all the time and run 
inefficient --> bad for power.

now, on the other hand, if you have two threads of a process that share a bunch 
of data structures,
and you'd spread these over 2 sockets, you end up bouncing data between the two 
sockets a lot,
running inefficient --> bad for power.


having said all this, if you have two tasks that don't have such cache effects, 
the most efficient way
of running things will be on 2 hyperthreading halves... it's very hard to beat 
the power efficiency of that.
But this assumes the tasks don't compete with resources much on the HT level, 
and achieve good scaling.
and this still has to compete with "race to halt", because if you're done 
quicker, you can put the memory
in self refresh quicker.

none of this stuff is easy for humans or computer programs to determine ahead 
of time... or sometimes even afterwards.
heck, even for just performance it's really really hard already, never mind 
adding power.

my personal gut feeling is that we should just optimize this scheduler stuff 
for performance, and that
we're going to be doing quite well on power already if we achieve that.




Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-11 Thread Borislav Petkov
On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
> >On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
> >>Another testing of parallel compress with pigz on Linus' git tree.
> >>results show we get much better performance/power with powersaving and
> >>balance policy:
> >>
> >>testing command:
> >>#pigz -k -c  -p$x -r linux* &> /dev/null
> >>
> >>On a NHM EP box
> >>         powersaving       balance           performance
> >>x = 4    166.516 /88 68    170.515 /82 71    165.283 /103 58
> >>x = 8    173.654 /61 94    177.693 /60 93    172.31 /76 76
> >
> >This looks funny: so "performance" is eating less watts than
> >"powersaving" and "balance" on NHM. Could it be that the average watts
> >measurements on NHM are not correct/precise..? On SNB they look as
> >expected, according to your scheme.
> 
> well... it's not always beneficial to group or to spread out
> it depends on cache behavior mostly which is best

Let me try to understand what this means: so "performance" above with
8 threads means that those threads are spread out across more than one
socket, no?

If so, this would mean that you have a smaller amount of tasks on each
socket, thus the smaller wattage.

The "powersaving" method OTOH fills up the one socket up to the brim,
thus the slightly higher consumption due to all threads being occupied.

Is that it?

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-11 Thread Arjan van de Ven

On 12/11/2012 7:48 AM, Borislav Petkov wrote:

> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:

>> Another testing of parallel compress with pigz on Linus' git tree.
>> results show we get much better performance/power with powersaving and
>> balance policy:

>> testing command:
>> #pigz -k -c  -p$x -r linux* &> /dev/null

>> On a NHM EP box
>>          powersaving       balance           performance
>> x = 4    166.516 /88 68    170.515 /82 71    165.283 /103 58
>> x = 8    173.654 /61 94    177.693 /60 93    172.31 /76 76


> This looks funny: so "performance" is eating less watts than
> "powersaving" and "balance" on NHM. Could it be that the average watts
> measurements on NHM are not correct/precise..? On SNB they look as
> expected, according to your scheme.


well... it's not always beneficial to group or to spread out
it depends on cache behavior mostly which is best



Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-11 Thread Borislav Petkov
On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
> Another testing of parallel compress with pigz on Linus' git tree.
> results show we get much better performance/power with powersaving and
> balance policy:
> 
> testing command:
> #pigz -k -c  -p$x -r linux* &> /dev/null
> 
> On a NHM EP box
>  powersaving   balance performance
> x = 4166.516 /88 68   170.515 /82 71 165.283 /103 58
> x = 8173.654 /61 94   177.693 /60 93 172.31 /76 76

This looks funny: so "performance" is eating less watts than
"powersaving" and "balance" on NHM. Could it be that the average watts
measurements on NHM are not correct/precise..? On SNB they look as
expected, according to your scheme.

Also, shouldn't you have the shortest compress times with "performance"?

> 
> On a 2 sockets SNB EP box.
>  powersaving   balance performance
> x = 4190.995 /149 35  200.6 /129 38  208.561 /135 35
> x = 8197.969 /108 46  208.885 /103 46213.96 /108 43
> x = 16   205.163 /76 64   212.144 /91 51 229.287 /97 44

Ditto here, compress times with "performance" are not the shortest. Or
does "performance" mean something else? :-)

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-11 Thread Alex Shi
On 12/11/2012 08:51 AM, Alex Shi wrote:
> On Mon, Dec 10, 2012 at 4:22 PM, Alex Shi  wrote:
>> This patchset base on tip/sched/core tree temporary, since it is more
>> steady than tip/master. and it's easy to rebase on tip/master.
>>
>> It includes 3 parts changes.
>>
>> 1, simplified fork, patch 1~4, that simplified the fork/exec/wake log in
>> find_idlest_group and select_task_rq_fair. it can increase 10+%
>> hackbench process and thread performance on our 4 sockets SNB EP machine.
>>
>> 2, enable load average into LB, patch 5~9, that using load average in
>> load balancing, with a runnable load value industrialization bug fix and
>> new fork task load contrib enhancement.
>>
>> 3, power awareness scheduling, patch 10~18,
>> Defined 2 new power aware policy balance and
>> powersaving, and then try to spread or shrink tasks on CPU unit
>> according the different scheduler policy. That can save much power when
>> task number in system is no more then cpu number.
> 
> tried with sysbench fileio test rndrw mode, with half thread of LCPU number,
> performance is similar, power can save about 5~10 Watts on 2 sockets SNB EP
> and NHM EP boxes.

Another test: parallel compression with pigz on Linus' git tree. The
results show we get much better performance/power with the powersaving and
balance policies:

testing command:
#pigz -k -c  -p$x -r linux* &> /dev/null

On a NHM EP box
          powersaving      balance          performance
x = 4     166.516 /88 68   170.515 /82 71   165.283 /103 58
x = 8     173.654 /61 94   177.693 /60 93   172.31 /76 76

On a 2 sockets SNB EP box.
          powersaving      balance          performance
x = 4     190.995 /149 35  200.6 /129 38    208.561 /135 35
x = 8     197.969 /108 46  208.885 /103 46  213.96 /108 43
x = 16    205.163 /76 64   212.144 /91 51   229.287 /97 44

data format is: 166.516 /88 68
166.516: average Watts
88: seconds (compress time)
68: scaled performance/power = 100 / time / power
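
In other words the score column works out to 1000000 / (seconds * watts),
i.e. the 100 / time / power above with a constant scale factor applied (my
inference from the numbers, not something stated here). For the first NHM
powersaving entry, for example:

#awk 'BEGIN { printf "%.1f\n", 1000000 / (166.516 * 88) }'
68.2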


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-11 Thread Arjan van de Ven

On 12/11/2012 8:13 AM, Borislav Petkov wrote:

On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:

On 12/11/2012 7:48 AM, Borislav Petkov wrote:

On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:

Another testing of parallel compress with pigz on Linus' git tree.
results show we get much better performance/power with powersaving and
balance policy:

testing command:
#pigz -k -c  -p$x -r linux*  /dev/null

On a NHM EP box
          powersaving      balance          performance
x = 4     166.516 /88 68   170.515 /82 71   165.283 /103 58
x = 8     173.654 /61 94   177.693 /60 93   172.31 /76 76


This looks funny: so performance is eating less watts than
powersaving and balance on NHM. Could it be that the average watts
measurements on NHM are not correct/precise..? On SNB they look as
expected, according to your scheme.


well... it's not always beneficial to group or to spread out
it depends on cache behavior mostly which is best


Let me try to understand what this means: so performance above with
8 threads means that those threads are spread out across more than one
socket, no?

If so, this would mean that you have a smaller amount of tasks on each
socket, thus the smaller wattage.

The powersaving method OTOH fills up the one socket up to the brim,
thus the slightly higher consumption due to all threads being occupied.

Is that it?


not sure.

by and large, power efficiency is the same as performance efficiency, with
some twists. Or, to reword that to be more clear: if you waste performance
due to something that becomes inefficient, you're wasting power as well.
Now, you might have some hardware effects that can then save you power...
but those effects first need to overcome the waste from the performance
inefficiency... and that almost never happens.

for example, if you have two workloads that each barely fit inside the last
level cache, it's much more efficient to spread these over two sockets,
where each has its own full LLC to use. If you'd group them together, both
would thrash the cache all the time and run inefficiently -- bad for power.

now, on the other hand, if you have two threads of a process that share a
bunch of data structures, and you'd spread these over 2 sockets, you end up
bouncing data between the two sockets a lot, running inefficiently -- bad
for power.
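
(As an aside: the two placements can be forced by hand with taskset if you
want to measure this effect yourself; the core ranges and workload names
below are only placeholders, adjust them to the actual topology and test.)

# pack both workloads onto socket 0 (assuming cores 0-3 live there)
#taskset -c 0-3 ./workload_a & taskset -c 0-3 ./workload_b & wait
# spread them across the two sockets (assuming cores 4-7 are on socket 1)
#taskset -c 0-3 ./workload_a & taskset -c 4-7 ./workload_b & wait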


having said all this, if you have two tasks that don't have such cache
effects, the most efficient way of running things will be on 2
hyperthreading halves... it's very hard to beat the power efficiency of
that. But this assumes the tasks don't compete for resources much at the HT
level, and achieve good scaling. And this still has to compete with race to
halt, because if you're done quicker, you can put the memory into self
refresh quicker.

none of this stuff is easy for humans or computer programs to determine
ahead of time... or sometimes even afterwards. Heck, even for just
performance it's really really hard already, never mind adding power.

my personal gut feeling is that we should just optimize this scheduler
stuff for performance, and that we're going to be doing quite well on power
already if we achieve that.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-11 Thread Alex Shi
On 12/12/2012 12:13 AM, Borislav Petkov wrote:
 On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
 On 12/11/2012 7:48 AM, Borislav Petkov wrote:
 On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
 Another testing of parallel compress with pigz on Linus' git tree.
 results show we get much better performance/power with powersaving and
 balance policy:

 testing command:
 #pigz -k -c  -p$x -r linux*  /dev/null

 On a NHM EP box
           powersaving      balance          performance
 x = 4     166.516 /88 68   170.515 /82 71   165.283 /103 58
 x = 8     173.654 /61 94   177.693 /60 93   172.31 /76 76

 This looks funny: so performance is eating less watts than
 powersaving and balance on NHM. Could it be that the average watts
 measurements on NHM are not correct/precise..? On SNB they look as
 expected, according to your scheme.

 well... it's not always beneficial to group or to spread out
 it depends on cache behavior mostly which is best
 
 Let me try to understand what this means: so performance above with
 8 threads means that those threads are spread out across more than one
 socket, no?
 
 If so, this would mean that you have a smaller amount of tasks on each
 socket, thus the smaller wattage.
 
 The powersaving method OTOH fills up the one socket up to the brim,
 thus the slightly higher consumption due to all threads being occupied.
 

As Arjan said, we think the performance increase is due to the cache
sharing in the LLC.
As for the power consumption difference between powersaving and
performance: when we burn the CPUs of both sockets, the load is not 100%,
so some logical CPUs still have time to go idle or to run at low
frequency, and that also saves some power.
That is just the general picture; different hardware and different CPUs
have different tuning in the package, core and uncore parts, so different
benchmarks will also show different results.
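
One way to watch that idle/frequency behaviour directly during a run is
turbostat, which reports per-CPU busy time, frequency and C-state residency
(and package watts via RAPL on SNB and later; NHM has no RAPL, so an
external power meter is still needed there). For example, with a
placeholder wrapper script around the pigz command above:

#turbostat ./run_pigz_test.sh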


 Is that it?
 
 Thanks.
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-10 Thread Alex Shi
On Mon, Dec 10, 2012 at 4:22 PM, Alex Shi  wrote:
> This patchset base on tip/sched/core tree temporary, since it is more
> steady than tip/master. and it's easy to rebase on tip/master.
>
> It includes 3 parts changes.
>
> 1, simplified fork, patch 1~4, that simplified the fork/exec/wake log in
> find_idlest_group and select_task_rq_fair. it can increase 10+%
> hackbench process and thread performance on our 4 sockets SNB EP machine.
>
> 2, enable load average into LB, patch 5~9, that using load average in
> load balancing, with a runnable load value industrialization bug fix and
> new fork task load contrib enhancement.
>
> 3, power awareness scheduling, patch 10~18,
> Defined 2 new power aware policy balance and
> powersaving, and then try to spread or shrink tasks on CPU unit
> according the different scheduler policy. That can save much power when
> task number in system is no more then cpu number.

Tried the sysbench fileio test in rndrw mode, with the thread count set to
half the number of logical CPUs; performance is similar, and power drops by
about 5~10 Watts on 2-socket SNB EP and NHM EP boxes.
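
An invocation along these lines would match that description (the exact
flags were not posted, so this is only an illustration; the thread count
is half the logical CPU count):

#sysbench --test=fileio --file-total-size=2G prepare
#sysbench --test=fileio --file-total-size=2G --file-test-mode=rndrw \
        --num-threads=$(($(nproc) / 2)) run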

Any comments :)
>
> Any comments are appreciated!
>
> Best regards!
> Alex
>
> [PATCH 01/18] sched: select_task_rq_fair clean up
> [PATCH 02/18] sched: fix find_idlest_group mess logical
> [PATCH 03/18] sched: don't need go to smaller sched domain
> [PATCH 04/18] sched: remove domain iterations in fork/exec/wake
> [PATCH 05/18] sched: load tracking bug fix
> [PATCH 06/18] sched: set initial load avg of new forked task as its
> [PATCH 07/18] sched: compute runnable load avg in cpu_load and
> [PATCH 08/18] sched: consider runnable load average in move_tasks
> [PATCH 09/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
> [PATCH 10/18] sched: add sched_policy in kernel
> [PATCH 11/18] sched: add sched_policy and it's sysfs interface
> [PATCH 12/18] sched: log the cpu utilization at rq
> [PATCH 13/18] sched: add power aware scheduling in fork/exec/wake
> [PATCH 14/18] sched: add power/performance balance allowed flag
> [PATCH 15/18] sched: don't care if the local group has capacity
> [PATCH 16/18] sched: pull all tasks from source group
> [PATCH 17/18] sched: power aware load balance,
> [PATCH 18/18] sched: lazy powersaving balance
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-10 Thread Alex Shi
This patchset is based on the tip/sched/core tree for now, since it is more
stable than tip/master, and it is easy to rebase onto tip/master.

It includes 3 parts of changes.

1, simplified fork, patches 1~4, which simplify the fork/exec/wake logic in
find_idlest_group and select_task_rq_fair. It can increase hackbench
process and thread performance by 10+% on our 4-socket SNB EP machine.

2, enable load average in LB, patches 5~9, which use the load average in
load balancing, with a runnable load value initialization bug fix and a
new forked-task load contribution enhancement.

3, power awareness scheduling, patches 10~18,
which define 2 new power-aware policies, balance and powersaving, and then
try to spread or pack tasks on CPU units according to the selected
scheduler policy. That can save much power when the number of tasks in the
system is no more than the number of CPUs.
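
With the sysfs patch applied, switching the policy at run time would look
something like this (the sysfs path here is only a guess for illustration;
see patch 11 for the actual interface):

#echo powersaving > /sys/devices/system/cpu/sched_policy
#echo performance > /sys/devices/system/cpu/sched_policy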

Any comments are appreciated!

Best regards!
Alex

[PATCH 01/18] sched: select_task_rq_fair clean up
[PATCH 02/18] sched: fix find_idlest_group mess logical
[PATCH 03/18] sched: don't need go to smaller sched domain
[PATCH 04/18] sched: remove domain iterations in fork/exec/wake
[PATCH 05/18] sched: load tracking bug fix
[PATCH 06/18] sched: set initial load avg of new forked task as its
[PATCH 07/18] sched: compute runnable load avg in cpu_load and
[PATCH 08/18] sched: consider runnable load average in move_tasks
[PATCH 09/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
[PATCH 10/18] sched: add sched_policy in kernel
[PATCH 11/18] sched: add sched_policy and it's sysfs interface
[PATCH 12/18] sched: log the cpu utilization at rq
[PATCH 13/18] sched: add power aware scheduling in fork/exec/wake
[PATCH 14/18] sched: add power/performance balance allowed flag
[PATCH 15/18] sched: don't care if the local group has capacity
[PATCH 16/18] sched: pull all tasks from source group
[PATCH 17/18] sched: power aware load balance,
[PATCH 18/18] sched: lazy powersaving balance
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

