Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks

2017-12-15 Thread Mike Galbraith
On Fri, 2017-12-15 at 21:23 +0100, Mike Galbraith wrote:
> 
> Point: if you think it's OK to serialize these firefox threads, would
> you still think so if those were kernel threads instead?  Serializing
> your kernel is a clear fail, but unpinned kthreads can be stacked up
> just as effectively as those browser threads are, eat needless wakeup
> latency and pass it on.

FWIW, a somewhat cheesy example of that below.

(later, /me returns to [apparently endless] squabble w. PELT/SIS;)

bonnie in nfs mount of own box competing with 7 hogs:

 -------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum delay ms      | Maximum delay at    |
 -------------------------------------------------------------------------------------------------------------------------------
  kworker/3:0:29        |    630.078 ms |    89669 | avg:    0.011 ms | max:  102.340 ms | sum:   962.919 ms | max at:  310.501277 |
  kworker/3:1H:464      |   1179.868 ms |   101944 | avg:    0.005 ms | max:  102.232 ms | sum:   480.915 ms | max at:  310.501273 |
  kswapd0:78            |   2662.230 ms |     1661 | avg:    0.128 ms | max:   93.935 ms | sum:   213.258 ms | max at:  310.503419 |
  nfsd:2039             |   3257.143 ms |    78448 | avg:    0.112 ms | max:   86.039 ms | sum:  8795.767 ms | max at:  258.847140 |
  nfsd:2038             |   3185.730 ms |    76253 | avg:    0.113 ms | max:   78.348 ms | sum:  8580.676 ms | max at:  258.831370 |
  nfsd:2042             |   3256.554 ms |    81423 | avg:    0.110 ms | max:   74.941 ms | sum:  8929.015 ms | max at:  288.397203 |
  nfsd:2040             |   3314.826 ms |    80396 | avg:    0.105 ms | max:   51.039 ms | sum:  8471.816 ms | max at:  363.870078 |
  nfsd:2036             |   3058.867 ms |    70460 | avg:    0.115 ms | max:   44.629 ms | sum:  8092.319 ms | max at:  250.074253 |
  nfsd:2037             |   3113.592 ms |    74276 | avg:    0.115 ms | max:   43.294 ms | sum:  8556.110 ms | max at:  310.443722 |
  konsole:4013          |    402.509 ms |      894 | avg:    0.148 ms | max:   38.129 ms | sum:   132.050 ms | max at:  332.156495 |
  haveged:497           |     11.831 ms |     1224 | avg:    0.104 ms | max:   37.575 ms | sum:   127.706 ms | max at:  350.669645 |
  nfsd:2043             |   3316.033 ms |    78303 | avg:    0.115 ms | max:   36.511 ms | sum:  8995.138 ms | max at:  248.576108 |
  nfsd:2035             |   3064.108 ms |    67413 | avg:    0.115 ms | max:   28.221 ms | sum:  7746.306 ms | max at:  313.785682 |
  bash:7022             |      0.342 ms |        1 | avg:   22.959 ms | max:   22.959 ms | sum:    22.959 ms | max at:  262.258960 |
  kworker/u16:4:354     |   2073.383 ms |     1550 | avg:    0.050 ms | max:   21.203 ms | sum:    77.185 ms | max at:  332.220678 |
  kworker/4:3:6975      |   1189.868 ms |   115776 | avg:    0.018 ms | max:   20.856 ms | sum:  2071.894 ms | max at:  348.142757 |
  kworker/2:4:6981      |    335.895 ms |    26617 | avg:    0.023 ms | max:   20.726 ms | sum:   625.102 ms | max at:  248.522083 |
  bash:7021             |      0.517 ms |        2 | avg:   10.363 ms | max:   20.726 ms | sum:    20.727 ms | max at:  262.235708 |
  ksoftirqd/2:22        |     65.718 ms |      998 | avg:    0.138 ms | max:   19.072 ms | sum:   137.827 ms | max at:  332.221676 |
  kworker/7:3:6969      |    625.724 ms |    84153 | avg:    0.010 ms | max:   18.838 ms | sum:   876.603 ms | max at:  264.188983 |
  bonnie:6965           |  79637.998 ms |    35434 | avg:    0.007 ms | max:   18.719 ms | sum:   256.748 ms | max at:  331.299867 |


Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks

2017-12-15 Thread Mike Galbraith
On Fri, 2017-12-15 at 16:13 +, Patrick Bellasi wrote:
> Hi Mike,
> 
> On 13-Dec 18:56, Mike Galbraith wrote:
> > On Tue, 2017-12-05 at 17:10 +, Patrick Bellasi wrote:
> > > This is a respin of:
> > >https://lkml.org/lkml/2017/11/9/546
> > > which has been rebased on v4.15-rc2 to have util_est now working on top
> > > of the recent PeterZ's:
> > >[PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> > > 
> > > The aim of this series is to improve some PELT behaviors to make it a
> > > better fit for the scheduling of tasks common in embedded mobile
> > > use-cases, without affecting other classes of workloads.
> > 
> > I thought perhaps this patch set would improve the below behavior, but
> > alas it does not.  That's 3 instances of firefox playing youtube clips
> > being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
> > serializes the threaded desktop, making that threading kinda pointless,
> > and CFS not all that fair.
> 
> Perhaps I don't completely get your use-case.
> Are the cpuhog threads pinned to a CPU, or do they just happen to
> always be running on the same CPU?

Nothing is pinned.

> I guess you would expect the three Firefox instances to be spread on
> different CPUs. But whether this is possible depends also on the
> specific task composition generated by Firefox, doesn't it?

It depends on load balancing.  We're letting firefox threads stack up
to 6 deep while single hogs dominate the box.

> Being a video playback pipeline I would not be surprised to see that
> most of the time we actually have only 1 or 2 tasks RUNNABLE, while
> the others are sleeping... and if an HW decoder is involved, even if
> you have three instances running you likely get only one pipeline
> active at each time...
> 
> If that's the case, why should CFS move Firefox tasks around?

No, while they are indeed ~fairly synchronous, there is overlap.  If
there were not, there would be no wait time being accumulated. The load
wants to consume roughly one full core worth, but to achieve that, it
needs access to more than one runqueue, which we are not facilitating.

> Is this always happening... or do Firefox tasks sometimes get a
> chance to run on CPUs other than CPU2?

There is some escape going on, but not enough for the load to get its
fair share.  I have it sort of fixed up locally, but while patch keeps
changing, it's not getting any prettier, nor is it particularly
interested in letting me keep some performance gains I want, so...

> How do you get these stats?

perf sched record / perf sched latency.  I twiddled it to output
accumulated wait times as well for convenience; stock only shows max.
See below.  If you play with perf sched, you'll notice some...
oddities about it.
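For anyone who wants to reproduce the stats, the stock workflow is roughly the following (the accumulated "Sum delay" column in the tables here comes from the local builtin-sched.c tweak mentioned above, not stock perf):

```shell
# Record scheduling events system-wide for ~30 seconds, then print
# the per-task wakeup latency summary sorted by maximum delay.
perf sched record -a -- sleep 30
perf sched latency --sort max
```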

> It's definitely an interesting use-case; however, I think it's out of
> the scope of util_est.

Yeah.  If I had been less busy and read the whole thing, I wouldn't
have taken it out for a spin.

> Regarding the specific statement "CFS not all that fair", I would say
> that the fairness of CFS is defined and has to be evaluated within a
> single CPU and on a temporal (not clock cycles) base.

No, that doesn't really fly.  In fact, in the group scheduling code, we
actively pursue box wide fairness.  PELT is going a bit too far ATM.

Point: if you think it's OK to serialize these firefox threads, would
you still think so if those were kernel threads instead?  Serializing
your kernel is a clear fail, but unpinned kthreads can be stacked up
just as effectively as those browser threads are, eat needless wakeup
latency and pass it on.

> AFAIK, vruntime progresses based on elapsed time, thus you can have
> two tasks which get the same slice of time but consume it at different
> frequencies. In this case too we are not that fair, are we?

Time slices don't really exist as a concrete quantum in CFS.  There's
vruntime equalization, and that's it.

> Thus, at the end it all boils down to some (as much as possible)
> low-overhead heuristics, and a proper description of a
> reproducible use-case can help on improving them.

Nah, heuristics are fickle beasts, they WILL knife you in the back,
it's just a question of how often, and how deep.

> Can we model your use-case using a simple rt-app configuration?

No idea.

> This would likely help to have a simple and reproducible testing
> scenario to better understand where the issue eventually is...
> maybe by looking at an execution trace.

It should be reproducible by anyone, just fire up NR_CPUS-1 pure hogs,
point firefox at youtube, open three clips in tabs, watch tasks stack.
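Concretely, that recipe amounts to something like the below (hog count is whatever fits your box; nothing is pinned):

```shell
# NR_CPUS-1 pure spinners: 7 hogs on this 8-CPU box, all unpinned.
for i in $(seq 7); do
        sh -c 'while :; do :; done' &
done

# Now start firefox, open three youtube clips in tabs, and watch the
# browser threads stack on one CPU while each hog owns its own (the
# PSR column is the CPU a thread last ran on):
ps -Leo pid,psr,pcpu,comm --sort=-pcpu | head -20

# Clean up the spinners when done.
kill $(jobs -p)
```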

Root cause IMHO is PELT having grown too aggressive.  SIS was made more
aggressive to compensate, but when you slam that door you get the full
PELT impact, and it stings, as does too aggressive bouncing when you
leave the escape hatch open.  Sticky wicket that.  Both of those want a
gentle wrap upside the head, as they're both acting a bit nutty.

-Mike

---
 tools/perf/builtin-sched.c |   34

Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks

2017-12-15 Thread Patrick Bellasi
Hi Mike,

On 13-Dec 18:56, Mike Galbraith wrote:
> On Tue, 2017-12-05 at 17:10 +, Patrick Bellasi wrote:
> > This is a respin of:
> >https://lkml.org/lkml/2017/11/9/546
> > which has been rebased on v4.15-rc2 to have util_est now working on top
> > of the recent PeterZ's:
> >[PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> > 
> > The aim of this series is to improve some PELT behaviors to make it a
> > better fit for the scheduling of tasks common in embedded mobile
> > use-cases, without affecting other classes of workloads.
> 
> I thought perhaps this patch set would improve the below behavior, but
> alas it does not.  That's 3 instances of firefox playing youtube clips
> being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
> serializes the threaded desktop, making that threading kinda pointless,
> and CFS not all that fair.

Perhaps I don't completely get your use-case.
Are the cpuhog threads pinned to a CPU, or do they just happen to
always be running on the same CPU?

I guess you would expect the three Firefox instances to be spread on
different CPUs. But whether this is possible depends also on the
specific task composition generated by Firefox, doesn't it?

Being a video playback pipeline I would not be surprised to see that
most of the time we actually have only 1 or 2 tasks RUNNABLE, while
the others are sleeping... and if an HW decoder is involved, even if
you have three instances running you likely get only one pipeline
active at each time...

If that's the case, why should CFS move Firefox tasks around?


>  6569 root  20   0    4048    704    628 R 100.0 0.004   5:10.48 7 cpuhog
>  6573 root  20   0    4048    712    636 R 100.0 0.004   5:07.47 5 cpuhog
>  6581 root  20   0    4048    696    620 R 100.0 0.004   5:07.36 1 cpuhog
>  6585 root  20   0    4048    812    736 R 100.0 0.005   5:08.14 4 cpuhog
>  6589 root  20   0    4048    712    636 R 100.0 0.004   5:06.42 6 cpuhog
>  6577 root  20   0    4048    720    644 R 99.80 0.005   5:06.52 3 cpuhog
>  6593 root  20   0    4048    728    652 R 99.60 0.005   5:04.25 0 cpuhog
>  6755 mikeg 20   0 2714788 885324 179196 S 19.96 5.544   2:14.36 2 Web Content
>  6620 mikeg 20   0 2318348 312336 145044 S 8.383 1.956   0:51.51 2 firefox
>  3190 root  20   0  323944  71704  42368 S 3.194 0.449   0:11.90 2 Xorg
>  3718 root  20   0 3009580  67112  49256 S 0.599 0.420   0:02.89 2 kwin_x11
>  3761 root  20   0  769760  90740  62048 S 0.399 0.568   0:03.46 2 konsole
>  3845 root   9 -11  791224  20132  14236 S 0.399 0.126   0:03.00 2 pulseaudio
>  3722 root  20   0 3722308 172568  88088 S 0.200 1.081   0:04.35 2 plasmashel

Is this always happening... or do Firefox tasks sometimes get a
chance to run on CPUs other than CPU2?

Could it be that, looking at htop output, we just don't see these small
opportunities?


>  -------------------------------------------------------------------------------------------------------------------------------
>   Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum delay ms      | Maximum delay at    |
>  -------------------------------------------------------------------------------------------------------------------------------
>   Web Content:6755      |   2864.862 ms |     7314 | avg:    0.299 ms | max:   40.374 ms | sum:  2189.472 ms | max at:  375.769240 |
>   Compositor:6680       |   1889.847 ms |     4672 | avg:    0.531 ms | max:   29.092 ms | sum:  2478.559 ms | max at:  375.759405 |
>   MediaPl~back #3:(13)  |   3269.777 ms |     7853 | avg:    0.218 ms | max:   19.451 ms | sum:  1711.635 ms | max at:  391.123970 |
>   MediaPl~back #4:(10)  |   1472.986 ms |     8189 | avg:    0.236 ms | max:   18.653 ms | sum:  1933.886 ms | max at:  376.124211 |
>   MediaPl~back #1:(9)   |    601.788 ms |     6598 | avg:    0.247 ms | max:   17.823 ms | sum:  1627.852 ms | max at:  401.122567 |
>   firefox:6620          |    303.181 ms |     6232 | avg:    0.111 ms | max:   15.602 ms | sum:   691.865 ms | max at:  385.078558 |
>   Socket Thread:6639    |    667.537 ms |     4806 | avg:    0.069 ms | max:   12.638 ms | sum:   329.387 ms | max at:  380.827323 |
>   MediaPD~oder #1:6835  |    154.737 ms |     1592 | avg:    0.700 ms | max:   10.139 ms | sum:  1113.688 ms | max at:  392.575370 |
>   MediaTimer #1:6828    |     42.660 ms |     5250 | avg:    0.575 ms | max:    9.845 ms | sum:  3018.994 ms | max at:  380.823677 |
>   MediaPD~oder #2:6840  |    150.822 ms |     1583 | avg:    0.703 ms | max:    9.639 ms | sum:  1112.962 ms | max at:  380.823741 |

How do you get these stats?

It's definitely an interesting use-case; however, I think it's out of
the scope of util_est.

Regarding the specific statement "CFS not all that fair", I would say
that the fairness of CFS is defined and has to be evaluated within a
single CPU and on a temporal (not clock cycles) base.

Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks

2017-12-13 Thread Mike Galbraith
On Tue, 2017-12-05 at 17:10 +, Patrick Bellasi wrote:
> This is a respin of:
>https://lkml.org/lkml/2017/11/9/546
> which has been rebased on v4.15-rc2 to have util_est now working on top
> of the recent PeterZ's:
>[PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> 
> The aim of this series is to improve some PELT behaviors to make it a
> better fit for the scheduling of tasks common in embedded mobile
> use-cases, without affecting other classes of workloads.

I thought perhaps this patch set would improve the below behavior, but
alas it does not.  That's 3 instances of firefox playing youtube clips
being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
serializes the threaded desktop, making that threading kinda pointless,
and CFS not all that fair.

 6569 root  20   0    4048    704    628 R 100.0 0.004   5:10.48 7 cpuhog
 6573 root  20   0    4048    712    636 R 100.0 0.004   5:07.47 5 cpuhog
 6581 root  20   0    4048    696    620 R 100.0 0.004   5:07.36 1 cpuhog
 6585 root  20   0    4048    812    736 R 100.0 0.005   5:08.14 4 cpuhog
 6589 root  20   0    4048    712    636 R 100.0 0.004   5:06.42 6 cpuhog
 6577 root  20   0    4048    720    644 R 99.80 0.005   5:06.52 3 cpuhog
 6593 root  20   0    4048    728    652 R 99.60 0.005   5:04.25 0 cpuhog
 6755 mikeg 20   0 2714788 885324 179196 S 19.96 5.544   2:14.36 2 Web Content
 6620 mikeg 20   0 2318348 312336 145044 S 8.383 1.956   0:51.51 2 firefox
 3190 root  20   0  323944  71704  42368 S 3.194 0.449   0:11.90 2 Xorg
 3718 root  20   0 3009580  67112  49256 S 0.599 0.420   0:02.89 2 kwin_x11
 3761 root  20   0  769760  90740  62048 S 0.399 0.568   0:03.46 2 konsole
 3845 root   9 -11  791224  20132  14236 S 0.399 0.126   0:03.00 2 pulseaudio
 3722 root  20   0 3722308 172568  88088 S 0.200 1.081   0:04.35 2 plasmashel

 

 -------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum delay ms      | Maximum delay at    |
 -------------------------------------------------------------------------------------------------------------------------------
  Web Content:6755      |   2864.862 ms |     7314 | avg:    0.299 ms | max:   40.374 ms | sum:  2189.472 ms | max at:  375.769240 |
  Compositor:6680       |   1889.847 ms |     4672 | avg:    0.531 ms | max:   29.092 ms | sum:  2478.559 ms | max at:  375.759405 |
  MediaPl~back #3:(13)  |   3269.777 ms |     7853 | avg:    0.218 ms | max:   19.451 ms | sum:  1711.635 ms | max at:  391.123970 |
  MediaPl~back #4:(10)  |   1472.986 ms |     8189 | avg:    0.236 ms | max:   18.653 ms | sum:  1933.886 ms | max at:  376.124211 |
  MediaPl~back #1:(9)   |    601.788 ms |     6598 | avg:    0.247 ms | max:   17.823 ms | sum:  1627.852 ms | max at:  401.122567 |

Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks

2017-12-13 Thread Patrick Bellasi
On 13-Dec 17:03, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:14PM +, Patrick Bellasi wrote:
> > With this feature enabled, the measured overhead is in the range of ~1%
> > on the same HW/SW test configuration.
> 
> That's quite a lot; did you look where that comes from?

I've tracked it down to util_est_dequeue(), introduced by PATCH 2/4,
mainly due to the EWMA update.  Initially the running average was
implemented using the library function provided in:

   include/linux/average.h::DECLARE_EWMA

but that solution generated even more overhead.
That's why we switched to an "inline custom" implementation.

Hackbench is quite stressful for that path, and we also have a few
branches which can play a role.  One, for example, has been added to
skip the EWMA update when the rolling average is "close enough" to the
current PELT value.

All that considered, that's why the sched_feat is disabled by default,
in which case we have zero overhead... in a !SCHED_DEBUG kernel the
code is actually removed by the compiler.

In mobile systems (i.e. non-hackbench scenarios) the additional
benefits on tasks placement and OPP selection is likely still worth
the overhead.

Do you think the idea of a Kconfig option to enable this feature only
on systems which do not care about the possible overhead is a viable
solution?

-- 
#include 

Patrick Bellasi


Re: [PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks

2017-12-13 Thread Peter Zijlstra
On Tue, Dec 05, 2017 at 05:10:14PM +, Patrick Bellasi wrote:
> With this feature enabled, the measured overhead is in the range of ~1%
> on the same HW/SW test configuration.

That's quite a lot; did you look where that comes from?


[PATCH v2 0/4] Utilization estimation (util_est) for FAIR tasks

2017-12-05 Thread Patrick Bellasi
This is a respin of:
   https://lkml.org/lkml/2017/11/9/546
which has been rebased on v4.15-rc2 to have util_est now working on top
of the recent PeterZ's:
   [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul

The aim of this series is to improve some PELT behaviors to make it a
better fit for the scheduling of tasks common in embedded mobile
use-cases, without affecting other classes of workloads.

A complete description of these behaviors has been presented in the
previous RFC [1] and further discussed during the last OSPM Summit [2]
as well as during the last two LPCs.

This series presents an implementation which improves the initial RFC's
prototype. Specifically, this new implementation has been verified to
not impact in any noticeable way the performance of:

perf bench sched messaging --pipe --thread --group 8 --loop 5

when running 30 iterations on a dual socket, 10 cores (20 threads) per
socket Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, with the
sched_feat(SCHED_UTILEST) set to False.
With this feature enabled, the measured overhead is in the range of ~1%
on the same HW/SW test configuration.

That's the main reason why this sched feature is disabled by default.
A possible improvement can be the addition of a KConfig option to toggle
the sched_feat default value on systems where a 1% overhead on hackbench
is not a concern, e.g. mobile systems, especially considering the
benefits coming from estimated utilization on workloads of interest.

From a functional standpoint, this implementation shows a more stable
utilization signal, compared to mainline, when running synthetic
benchmarks describing a set of interesting target use-cases.
This allows for a better selection of the target CPU as well as a
faster selection of the most appropriate OPP.
A detailed description of the used functional tests has been already
covered in the previous RFC [1].

This series is based on v4.15-rc2 and is composed of four patches:
 1) a small refactoring preparing the ground
 2) introducing the required data structures to track util_est of both
TASKs and CPUs
 3) make use of util_est in the wakeup and load balance paths
 4) make use of util_est in schedutil for frequency selection

Cheers, Patrick

.:: References
==
[1] https://lkml.org/lkml/2017/8/25/195
[2] slides: 
http://retis.sssup.it/ospm-summit/Downloads/OSPM_PELT_DecayClampingVsUtilEst.pdf
 video: http://youtu.be/adnSHPBGS-w

Changes v1->v2:
 - rebase on top of v4.15-rc2
 - tested that the overhauled PELT code does not affect util_est

Patrick Bellasi (4):
  sched/fair: always used unsigned long for utilization
  sched/fair: add util_est on top of PELT
  sched/fair: use util_est in LB and WU paths
  sched/cpufreq_schedutil: use util_est for OPP selection

 include/linux/sched.h|  21 +
 kernel/sched/cpufreq_schedutil.c |   6 +-
 kernel/sched/debug.c |   4 +
 kernel/sched/fair.c  | 184 ---
 kernel/sched/features.h  |   5 ++
 kernel/sched/sched.h |   1 +
 6 files changed, 209 insertions(+), 12 deletions(-)

-- 
2.14.1