[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-07-23 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097798#comment-16097798
 ] 

Stephan Erb commented on AURORA-1939:
-

This is now on master. Thanks for the patch!

{code}
commit cdc5b8efd5bb86d38f73cca6d91903078b120333
Author: Reza Motamedi reza.motam...@gmail.com
Date:   Sat Jul 22 20:28:50 2017 +0200

Remove psutil's oneshot

After a lot of testing on busy machines, I realized that psutil's oneshot is
not threadsafe. I contacted the developer however, have not recevied a conceret
fix.

Please read https://issues.apache.org/jira/browse/AURORA-1939 and
https://github.com/giampaolo/psutil/issues/1110 for more information.

These inconsistencies disappear after removing oneshot.

Reviewed at https://reviews.apache.org/r/61016/

src/main/python/apache/thermos/monitoring/process_collector_psutil.py | 23 
+++
 1 file changed, 11 insertions(+), 12 deletions(-)
{code}

> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Assignee: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short lived children. This indicates that in time between 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates the CPU time the pid of the child is actually reused by another 
> much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-07-21 Thread Reza Motamedi (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096880#comment-16096880
 ] 

Reza Motamedi commented on AURORA-1939:
---

reviewable: https://reviews.apache.org/r/61016/

> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Assignee: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short lived children. This indicates that in time between 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates the CPU time the pid of the child is actually reused by another 
> much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-07-20 Thread Reza Motamedi (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095732#comment-16095732
 ] 

Reza Motamedi commented on AURORA-1939:
---

Following [~StephanErb]'s suggestion tried to guard the psutil's `oneshot` as a 
critical section. It does not seem to work however:

```
...
from threading import Lock
...

oneshot_lock = Lock()

def process_to_sample(process):
  """ Given a psutil.Process, return a current ProcessSample """
  try:
with oneshot_lock:
  with process.oneshot():
# the nonblocking get_cpu_percent call is stateful on a particular 
Process object, and hence
# >2 consecutive calls are required before it will return a non-zero 
value
rate = process.cpu_percent(0.0) / 100.0
cpu_times = process.cpu_times()
...
```

> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short lived children. This indicates that in time between 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates the CPU time the pid of the child is actually reused by another 
> much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-07-20 Thread Reza Motamedi (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095271#comment-16095271
 ] 

Reza Motamedi commented on AURORA-1939:
---

I just stopped using oneshot and the problem went away. Makes me believe that 
{{oneshot}} is not thread safe.

> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short lived children. This indicates that in time between 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates the CPU time the pid of the child is actually reused by another 
> much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-07-20 Thread Reza Motamedi (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095046#comment-16095046
 ] 

Reza Motamedi commented on AURORA-1939:
---

Added more logging to check the returned value for a process.cpu_times(). The 
numbers simply don't add up. What I see the same pid sampled by psutil showing 
a different name and CPU samples that fluctuate (decrease).

Note the in this example pid:14712 was created on July 16th, and the log is 
from July 20th. So there is no chance that what we see in the log is because 
pid reused.

{noformat}
~ $ grep 'pid=14712' thermos-observer.DEBUG
D0720 17:01:04.148628 61755 process_collector_psutil.py:35] 
process:psutil.Process(pid=14712, name='mesos-slave') cpu times 
pcputimes(user=603.88, system=1104.31, children_user=0.0, children_system=0.0)
D0720 17:06:02.358989 61755 process_collector_psutil.py:35] 
process:psutil.Process(pid=14712, name='mesos-slave') cpu times 
pcputimes(user=591.16, system=1097.5, children_user=0.0, children_system=0.0)
D0720 17:10:55.258080 61755 process_collector_psutil.py:35] 
process:psutil.Process(pid=14712, name='python2.7') cpu times 
pcputimes(user=44.81, system=7.29, children_user=0.0, children_system=0.0)
D0720 17:16:23.156296 61755 process_collector_psutil.py:35] 
process:psutil.Process(pid=14712, name='python2.7') cpu times 
pcputimes(user=596.01, system=1104.11, children_user=0.0, children_system=0.0)
D0720 17:21:21.552978 61755 process_collector_psutil.py:35] 
process:psutil.Process(pid=14712, name='python2.7') cpu times 
pcputimes(user=44.9, system=7.3, children_user=0.0, children_system=0.0)

~ $ ps -o lstart= -p 14712
Sun Jul 16 11:39:19 2017
{noformat}
 

> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short lived children. This indicates that in time between 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates the CPU time the pid of the child is actually reused by another 
> much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-07-19 Thread Reza Motamedi (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093883#comment-16093883
 ] 

Reza Motamedi commented on AURORA-1939:
---

I just started tailing my logs and as samples came in catted the 
proc/``/stat.

log snippets:
{noformat}
~ $ tail -n 1000 -f thermos-observer.DEBUG  | grep 62328
D0719 22:01:37.055901 24474 process_collector_psutil.py:105] Calculated rate 
for pid=62328 and children: -1.94176829079 (old: 1488.84, new: 920.87) 
{62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 
157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, 
rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=526.98, system=961.84, rss=35430400, 
vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: 
ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, 
nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, 
user=332.69, system=588.16, rss=35430400, vms=1600204800, nice=0, 
status='sleeping', threads=20)}]
D0719 22:06:22.156851 24474 process_collector_psutil.py:105] Calculated rate 
for pid=62328 and children: -0.00277094847146 (old: 920.87, new: 
920.08) {62328: 157014.22, 866: 157038.32} -> {62328: 
157014.22, 866: 157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, 
system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 
866: ProcessSample(rate=0.0, user=332.69, system=588.16, rss=35430400, 
vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: 
ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, 
nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, 
user=329.27, system=590.79, rss=34787328, vms=1600204800, nice=0, 
status='sleeping', threads=20)}]
D0719 22:11:07.052711 24474 process_collector_psutil.py:105] Calculated rate 
for pid=62328 and children: -0.0636723871193 (old: 920.08, new: 901.94) 
{62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 
157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, 
rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=329.27, system=590.79, rss=34787328, 
vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: 
ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, 
nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, 
user=326.72, system=575.2, rss=35430400, vms=1600204800, nice=0, 
status='sleeping', threads=20)}]
{noformat}



checking values directly. Also added date so that we can compare times.
{noformat}
~ $ date; cat /proc/866/stat | awk '{print $14, $15}'
Wed Jul 19 22:01:41 UTC 2017
52733 96241
~ $ date; cat /proc/866/stat | awk '{print $14, $15}'
Wed Jul 19 22:06:27 UTC 2017
52766 96296
{noformat}



> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short lived children. This indicates that in time between 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates the CPU time the pid of the child is actually reused by another 
> much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-07-19 Thread Reza Motamedi (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093862#comment-16093862
 ] 

Reza Motamedi commented on AURORA-1939:
---

This seems to be totally related to psutil. I contacted the developers there.

> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short lived children. This indicates that in time between 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates the CPU time the pid of the child is actually reused by another 
> much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-07-19 Thread Reza Motamedi (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093761#comment-16093761
 ] 

Reza Motamedi commented on AURORA-1939:
---

I see this problem when the host is super busy, and resource collection is 
backlogged. In this case I also see much more errors of this kind in the log:
{noformat}
D0719 20:18:28.064794 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62193)
D0719 20:18:35.351458 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=14711)
D0719 20:18:35.552953 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62331)
D0719 20:18:42.857400 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.01s
D0719 20:18:43.753732 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62338)
D0719 20:18:48.454077 24474 mesos_vars.py:384] Metrics collection took 6506.1ms
D0719 20:18:50.253031 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62345)
D0719 20:18:57.861535 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.00s
D0719 20:19:12.955235 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.01s
D0719 20:19:14.959180 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62361)
D0719 20:19:14.960768 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62232)
D0719 20:19:18.056128 24474 mesos_vars.py:384] Metrics collection took 6008.0ms
D0719 20:19:22.856868 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62366)
D0719 20:19:28.048165 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.09s
D0719 20:19:28.660691 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62374)
D0719 20:19:43.051047 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.00s
D0719 20:19:48.355678 24474 mesos_vars.py:384] Metrics collection took 6299.8ms
D0719 20:19:58.148663 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.10s
D0719 20:20:11.449485 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62271)
D0719 20:20:13.155102 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.01s
D0719 20:20:18.249528 24474 mesos_vars.py:384] Metrics collection took 6179.9ms
D0719 20:20:23.354832 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=11317)
D0719 20:20:27.060431 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62281)
D0719 20:20:28.160298 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.00s
D0719 20:20:35.452637 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62289)
D0719 20:20:43.252589 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.09s
D0719 20:20:48.151144 24474 mesos_vars.py:384] Metrics collection took 6058.3ms
D0719 20:20:55.254796 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62428)
D0719 20:20:58.257311 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.00s
D0719 20:21:13.352955 24474 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.10s
D0719 20:21:17.555244 24474 process_collector_psutil.py:84] Error during 
process sampling: psutil.NoSuchProcess process no longer exists (pid=62124)
{noformat}

> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short lived children. This indicates that in time between 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates the CPU 

[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-07-19 Thread Reza Motamedi (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093763#comment-16093763
 ] 

Reza Motamedi commented on AURORA-1939:
---

I added more debug logs to trace what is kept in the process tree:
{noformat}
D0719 20:23:05.758181 24474 process_collector_psutil.py:105] Calculated rate 
for pid=62328 and children: 0.000365496747142 (old: 885.57, new: 885.67)
{62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 
157038.32}
[{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, 
vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=315.39, system=570.16, rss=36704256, 
vms=1601019904, nice=0, status='sleeping', threads=20)}] 
[{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, 
vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=314.96, system=570.69, rss=36134912, 
vms=1601019904, nice=0, status='sleeping', threads=20)}]

D0719 20:27:44.454570 24474 process_collector_psutil.py:105] Calculated rate 
for pid=62328 and children: 2.10562464635 (old: 885.67, new: 1472.50)
{62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 
157038.32}
[{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, 
vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=314.96, system=570.69, rss=36134912, 
vms=1601019904, nice=0, status='sleeping', threads=20)}] 
[{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, 
vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=520.93, system=951.55, rss=35430400, 
vms=1600204800, nice=0, status='sleeping', threads=20)}]

D0719 20:32:29.451004 24474 process_collector_psutil.py:105] Calculated rate 
for pid=62328 and children: 0.00294740555899 (old: 1472.50, new: 
1473.34)
{62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 
157038.32}
[{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, 
vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=520.93, system=951.55, rss=35430400, 
vms=1600204800, nice=0, status='sleeping', threads=20)}] 
[{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, 
vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=521.23, system=952.09, rss=35430400, 
vms=1600204800, nice=0, status='sleeping', threads=20)}]

D0719 20:37:10.152525 24474 process_collector_psutil.py:105] Calculated rate 
for pid=62328 and children: 0.00299250250927 (old: 1473.34, new: 
1474.18)
{62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 
157038.32}
[{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, 
vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=521.23, system=952.09, rss=35430400, 
vms=1600204800, nice=0, status='sleeping', threads=20)}] 
[{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, 
vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=521.55, system=952.61, rss=35430400, 
vms=1600204800, nice=0, status='sleeping', threads=20)}]

D0719 20:42:58.449567 24474 process_collector_psutil.py:105] Calculated rate 
for pid=62328 and children: -1.64913818086 (old: 1474.18, new: 899.79)
{62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 
157038.32}
[{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, 
vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=521.55, system=952.61, rss=35430400, 
vms=1600204800, nice=0, status='sleeping', threads=20)}] 
[{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2732032, 
vms=11919360, nice=0, status='sleeping', threads=1), 866: 
ProcessSample(rate=0.0, user=322.65, system=577.12, rss=35430400, 
vms=1600204800, nice=0, status='sleeping', threads=20)}]
{noformat}

Note: {62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 
157038.32} what processes where in the old and new sample, as in old -> 
new. The line after that is the ProcessSamples in the old pstree, kept in the 
object as "pid => ProcessSample", and the line after is the new sample.


> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In 

[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-06-26 Thread Reza Motamedi (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064286#comment-16064286
 ] 

Reza Motamedi commented on AURORA-1939:
---

On second thought, the negative CPU values can simply be caused by a dead child 
process. Let me explain how. First, remember that CPU time reported by psutil, 
is the total CPU time spent to progress a process.

Supposes at {{t_0 = 10}}, we have the following processes forked inside a 
thermos process.

{noformat}
__ p0
   \_ p1
{noformat}

The total CPU time of the thermos process is calculated at the CPU time in all 
the processes, i.e., {{Process(p_0).cpu_time + Process(p_1).cpu_time}}, For the 
sake of argument, let's say 1 second in {{p_0}} and 5 seconds in {{p_1}}.
Now imagine that by the time to collect the next sample at {{t_1 = 20}}, 
another 5 seconds where spend in p_0, and p_0 finishes (dies) before the 
collection. Also, only an extra 1 second was spent by {{p_0}}. The current 
calculation leads to the following reported CPU values.

(sum(new_samples) - sum(old_samples)) / (time difference).
(2) - (1 + 5) / 5 = -3/10.

A perfect calculation would include the time spend in the dead processes at the 
time of their death in the new sample. What makes sense is to discard the old 
processes that have died during the last time interval.




> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short lived children. This indicates that in time between 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates the CPU time the pid of the child is actually reused by another 
> much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)