Re: [RFC] Comparison of power-efficient scheduling patch sets

2013-05-31 Thread Alex Shi
On 05/31/2013 09:17 AM, Alex Shi wrote:
>> > Kernel: 3.9
>> > 
>> > Patch sets:
>> > rlb-v4: sched: use runnable load based balance (Alex Shi)
>> > https://lkml.org/lkml/2013/4/27/13
> Thanks for the valuable comparison!
> 
> The target of the runnable load balance patch set is performance. It
> still tries to disperse tasks across as many CPUs as possible. :)
> The latest v7 version removes the 6th patch (the wake_affine change)
> from v4, fixes a slept time double counting issue, and removes
> blocked_load_avg from the tg load.
> http://comments.gmane.org/gmane.linux.kernel/1498988

Even though the rlb patch set targets performance, maybe the power
benefit is due to better balancing?

Anyway, I would appreciate it if you could test the latest v7 version. :)
https://github.com/alexshi/power-scheduling.git runnablelb

-- 
Thanks
Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Comparison of power-efficient scheduling patch sets

2013-05-30 Thread Alex Shi
On 05/30/2013 09:47 PM, Morten Rasmussen wrote:
> Hi,
> 
> A number of patch sets related to power-efficient scheduling have been
> posted over the last couple of months. Most of them do not have much
> data to back them up, so I decided to do some testing.
> 
> Common to all of the patch sets that I have tested, except one, is that
> they attempt to pack tasks on as few cpus as possible to allow the
> remaining cpus to enter deeper sleep states - a strategy that should
> make sense on most platforms that support per-cpu power gating and
> multi-socket machines.
> 
> Kernel: 3.9
> 
> Patch sets:
> rlb-v4: sched: use runnable load based balance (Alex Shi)
> https://lkml.org/lkml/2013/4/27/13

Thanks for the valuable comparison!

The target of the runnable load balance patch set is performance. It
still tries to disperse tasks across as many CPUs as possible. :)
The latest v7 version removes the 6th patch (the wake_affine change)
from v4, fixes a slept time double counting issue, and removes
blocked_load_avg from the tg load.
http://comments.gmane.org/gmane.linux.kernel/1498988
Enjoy!
> pas-v7: sched: power aware scheduling (Alex Shi)
> https://lkml.org/lkml/2013/4/3/732

We are still discussing this patch set internally before updating it.
Sorry for the late response on this patch set!

> pst-v3: sched: packing small tasks (Vincent Guittot)
> https://lkml.org/lkml/2013/3/22/183
> pst-v4: sched: packing small tasks (Vincent Guittot)
> https://lkml.org/lkml/2013/4/25/396
> 
> Configuration:
> pas-v7: Set to "powersaving" mode.
> pst-v4: Set to "Full" packing mode.
> 
> Platform:
> ARM TC2 (test-chip), 2xCortex-A15 + 3xCortex-A7. Cortex-A15s disabled.
> 
> Measurement technique:
> Time spent non-idle (not in idle state) for each cpu based on cpuidle
> ftrace events. TC2 does not have per-core power-gating, so packing
> inside the A7 cluster does not lead to any significant power savings.
> Note that any product grade hardware (TC2 is a test-chip) will very
> likely have per-core power-gating, so in those cases packing will have
> an appreciable effect on power savings.
> Measuring non-idle time rather than power should give a clearer idea
> about the effect of the patch sets given that the idle back-end is
> highly implementation specific.
> 
> Benchmarks:
> audio playback (Android): 30s mp3 file playback on Android.
> bbench+audio (Android): Web page rendering while doing mp3 playback.
> andebench_native (Android): Android benchmark running in native mode.
> cyclictest: Short periodic tasks.
> 
> Results:
> Two runs for each patch set.
> 
> audio playback (Android) SMP
> non-idle %  cpu 0  cpu 1  cpu 2
> 3.9_1       11.96   2.86   2.48
> 3.9_2       12.64   2.81   1.88
> rlb-v4_1    12.61   2.44   1.90
> rlb-v4_2    12.45   2.44   1.90
> pas-v7_1    16.17   0.03   0.24
> pas-v7_2    16.08   0.28   0.07
> pst-v3_1    15.18   2.76   1.70
> pst-v3_2    15.13   0.80   0.38
> pst-v4_1    16.14   0.05   0.00
> pst-v4_2    16.34   0.06   0.00
> 
> bbench+audio (Android) SMP
> non-idle %  cpu 0  cpu 1  cpu 2  render time
> 3.9_1       25.00  20.73  21.22  812
> 3.9_2       24.29  19.78  22.34  795
> rlb-v4_1    23.84  19.36  22.74  782
> rlb-v4_2    24.07  19.36  22.74  797
> pas-v7_1    28.29  17.86  16.01  869
> pas-v7_2    28.62  18.54  15.05  908
> pst-v3_1    29.14  20.59  21.72  830
> pst-v3_2    27.69  18.81  20.06  830
> pst-v4_1    42.20  13.63   2.29  880
> pst-v4_2    41.56  14.40   2.17  935
> 
> andebench_native (8 threads) (Android) SMP
> non-idle %  cpu 0  cpu 1  cpu 2  Score
> 3.9_1       99.22  98.88  99.61  4139
> 3.9_2       99.56  99.31  99.46  4148
> rlb-v4_1    99.49  99.61  99.53  4153
> rlb-v4_2    99.56  99.61  99.53  4149
> pas-v7_1    99.53  99.59  99.29  4149
> pas-v7_2    99.42  99.63  99.48  4150
> pst-v3_1    97.89  99.33  99.42  4097
> pst-v3_2    99.16  99.62  99.42  4097
> pst-v4_1    99.34  99.01  99.59  4146
> pst-v4_2    99.49  99.52  99.20  4146
> 
> cyclictest SMP
> non-idle %  cpu 0  cpu 1  cpu 2
> 3.9_1        9.13   8.88   8.41
> 3.9_2       10.27   8.02   6.30
> rlb-v4_1     8.88   8.09   8.11
> rlb-v4_2     8.49   8.09   8.11
> pas-v7_1    10.20   0.02  11.50
> pas-v7_2     7.86  14.31   0.02
> pst-v3_1    20.44   8.68   7.97
> pst-v3_2    20.41   0.78   1.00
> pst-v4_1    21.32   0.21   0.05
> pst-v4_2    21.56   0.21   0.04
> 
> Overall, pas-v7 seems to do a fairly good job at packing. The idle time
> distribution seems to be somewhere between pst-v3 and the more
> aggressive pst-v4 for all the benchmarks. pst-v4 manages to keep two
> cpus nearly idle (<0.25% non-idle) for both cyclictest and audio, which
> is better than both pst-v3 and pas-v7. pas-v7 fails to pack cyclictest.
> Packing does come at a cost, which can be seen for bbench+audio, where
> pst-v3 and rlb-v4 get better render times than pas-v7 and pst-v4, which
> do more aggressive packing. rlb-v4 does not pack; it is only included
> for reference.
> 
> From a packing 

[RFC] Comparison of power-efficient scheduling patch sets

2013-05-30 Thread Morten Rasmussen
Hi,

A number of patch sets related to power-efficient scheduling have been
posted over the last couple of months. Most of them do not have much
data to back them up, so I decided to do some testing.

Common to all of the patch sets that I have tested, except one, is that
they attempt to pack tasks on as few cpus as possible to allow the
remaining cpus to enter deeper sleep states - a strategy that should
make sense on most platforms that support per-cpu power gating and
multi-socket machines.

Kernel: 3.9

Patch sets:
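rlb-v4: sched: use runnable load based balance (Alex Shi)
https://lkml.org/lkml/2013/4/27/13
pas-v7: sched: power aware scheduling (Alex Shi)
https://lkml.org/lkml/2013/4/3/732
pst-v3: sched: packing small tasks (Vincent Guittot)
https://lkml.org/lkml/2013/3/22/183
pst-v4: sched: packing small tasks (Vincent Guittot)
https://lkml.org/lkml/2013/4/25/396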

Configuration:
pas-v7: Set to "powersaving" mode.
pst-v4: Set to "Full" packing mode.

Platform:
ARM TC2 (test-chip), 2xCortex-A15 + 3xCortex-A7. Cortex-A15s disabled.

Measurement technique:
Time spent non-idle (not in idle state) for each cpu based on cpuidle
ftrace events. TC2 does not have per-core power-gating, so packing
inside the A7 cluster does not lead to any significant power savings.
Note that any product grade hardware (TC2 is a test-chip) will very
likely have per-core power-gating, so in those cases packing will have
an appreciable effect on power savings.
Measuring non-idle time rather than power should give a clearer idea
about the effect of the patch sets given that the idle back-end is
highly implementation specific.
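
A minimal sketch of how a per-cpu non-idle percentage could be derived
from a text ftrace dump of cpuidle events. This is not the script used
for the numbers in this mail. The assumed line format, the idle-exit
marker value and the function name non_idle_percent are illustrative
assumptions, and all events are assumed to fall inside the measurement
window.

```python
import re
from collections import defaultdict

# Assumed trace line shape (text output of the power:cpu_idle event):
#   <idle>-0 [001] d..2 100.250000: cpu_idle: state=1 cpu_id=1
# Assumption: state=4294967295, i.e. (u32)-1, marks an idle exit.
EVENT_RE = re.compile(r"\s(\d+\.\d+): cpu_idle: state=(\d+) cpu_id=(\d+)")
IDLE_EXIT = 4294967295


def non_idle_percent(lines, t_start, t_end):
    """Return {cpu: percent of [t_start, t_end] spent outside idle states}."""
    idle_since = {}                   # cpu -> timestamp of pending idle entry
    idle_total = defaultdict(float)   # cpu -> accumulated idle seconds
    seen = set()                      # cpus that produced at least one event
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        ts, state, cpu = float(m.group(1)), int(m.group(2)), int(m.group(3))
        if state == IDLE_EXIT:
            if cpu in idle_since:
                idle_total[cpu] += ts - idle_since.pop(cpu)
            elif cpu not in seen:
                # First event is an exit: the cpu was idle when tracing began.
                idle_total[cpu] += ts - t_start
        else:
            idle_since.setdefault(cpu, ts)
        seen.add(cpu)
    # A cpu that is still idle at the end stays idle until t_end.
    for cpu, entered in idle_since.items():
        idle_total[cpu] += t_end - entered
    window = t_end - t_start
    return {cpu: 100.0 * (1.0 - idle_total[cpu] / window) for cpu in sorted(seen)}
```

A cpu that never leaves or enters idle during the window produces no
events and is therefore missing from the result; a real script would
need the cpu list from elsewhere.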

Benchmarks:
audio playback (Android): 30s mp3 file playback on Android.
bbench+audio (Android): Web page rendering while doing mp3 playback.
andebench_native (Android): Android benchmark running in native mode.
cyclictest: Short periodic tasks.

Results:
Two runs for each patch set.

audio playback (Android) SMP
non-idle %  cpu 0  cpu 1  cpu 2
3.9_1       11.96   2.86   2.48
3.9_2       12.64   2.81   1.88
rlb-v4_1    12.61   2.44   1.90
rlb-v4_2    12.45   2.44   1.90
pas-v7_1    16.17   0.03   0.24
pas-v7_2    16.08   0.28   0.07
pst-v3_1    15.18   2.76   1.70
pst-v3_2    15.13   0.80   0.38
pst-v4_1    16.14   0.05   0.00
pst-v4_2    16.34   0.06   0.00

bbench+audio (Android) SMP
non-idle %  cpu 0  cpu 1  cpu 2  render time
3.9_1       25.00  20.73  21.22  812
3.9_2       24.29  19.78  22.34  795
rlb-v4_1    23.84  19.36  22.74  782
rlb-v4_2    24.07  19.36  22.74  797
pas-v7_1    28.29  17.86  16.01  869
pas-v7_2    28.62  18.54  15.05  908
pst-v3_1    29.14  20.59  21.72  830
pst-v3_2    27.69  18.81  20.06  830
pst-v4_1    42.20  13.63   2.29  880
pst-v4_2    41.56  14.40   2.17  935

andebench_native (8 threads) (Android) SMP
non-idle %  cpu 0  cpu 1  cpu 2  Score
3.9_1       99.22  98.88  99.61  4139
3.9_2       99.56  99.31  99.46  4148
rlb-v4_1    99.49  99.61  99.53  4153
rlb-v4_2    99.56  99.61  99.53  4149
pas-v7_1    99.53  99.59  99.29  4149
pas-v7_2    99.42  99.63  99.48  4150
pst-v3_1    97.89  99.33  99.42  4097
pst-v3_2    99.16  99.62  99.42  4097
pst-v4_1    99.34  99.01  99.59  4146
pst-v4_2    99.49  99.52  99.20  4146

cyclictest SMP
non-idle %  cpu 0  cpu 1  cpu 2
3.9_1        9.13   8.88   8.41
3.9_2       10.27   8.02   6.30
rlb-v4_1     8.88   8.09   8.11
rlb-v4_2     8.49   8.09   8.11
pas-v7_1    10.20   0.02  11.50
pas-v7_2     7.86  14.31   0.02
pst-v3_1    20.44   8.68   7.97
pst-v3_2    20.41   0.78   1.00
pst-v4_1    21.32   0.21   0.05
pst-v4_2    21.56   0.21   0.04

Overall, pas-v7 seems to do a fairly good job at packing. The idle time
distribution seems to be somewhere between pst-v3 and the more
aggressive pst-v4 for all the benchmarks. pst-v4 manages to keep two
cpus nearly idle (<0.25% non-idle) for both cyclictest and audio, which
is better than both pst-v3 and pas-v7. pas-v7 fails to pack cyclictest.
Packing does come at a cost, which can be seen for bbench+audio, where
pst-v3 and rlb-v4 get better render times than pas-v7 and pst-v4, which
do more aggressive packing. rlb-v4 does not pack; it is only included
for reference.
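
To put a rough number on that cost, the sketch below averages the two
bbench+audio runs per patch set and compares each mean with the 3.9
baseline. The render times are copied from the table above; the helper
name render_time_summary is made up for illustration, and with only two
runs per configuration the result is indicative at best.

```python
from statistics import mean

# Render times copied from the bbench+audio table (two runs each,
# units as reported in the table).
RENDER_TIMES = {
    "3.9":    (812, 795),
    "rlb-v4": (782, 797),
    "pas-v7": (869, 908),
    "pst-v3": (830, 830),
    "pst-v4": (880, 935),
}


def render_time_summary(times, baseline="3.9"):
    """Return {name: (mean render time, percent change vs. baseline mean)}."""
    base = mean(times[baseline])
    return {
        name: (mean(runs), 100.0 * (mean(runs) - base) / base)
        for name, runs in times.items()
    }


if __name__ == "__main__":
    for name, (avg, delta) in render_time_summary(RENDER_TIMES).items():
        print(f"{name:8s} mean={avg:6.1f}  vs 3.9: {delta:+5.1f}%")
```

On these numbers the aggressive packers come out roughly 11% (pas-v7)
and 13% (pst-v4) slower than plain 3.9 on average, pst-v3 about 3%
slower, and rlb-v4 marginally faster.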

From a packing perspective pst-v4 seems to do the best job for the
workloads that I have tested on ARM TC2. The less aggressive packing in
pst-v3 may be a better choice in terms of performance.

I'm well aware that these tests are heavily focused on mobile workloads.
I would therefore encourage people to share your test results for your
workloads on your platforms to complete the picture. Comments are also
welcome.

Thanks,
Morten

