[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097798#comment-16097798 ] Stephan Erb commented on AURORA-1939: - This is now on master. Thanks for the patch! {code} commit cdc5b8efd5bb86d38f73cca6d91903078b120333 Author: Reza Motamedi reza.motam...@gmail.com Date: Sat Jul 22 20:28:50 2017 +0200 Remove psutil's oneshot After a lot of testing on busy machines, I realized that psutil's oneshot is not threadsafe. I contacted the developer however, have not recevied a conceret fix. Please read https://issues.apache.org/jira/browse/AURORA-1939 and https://github.com/giampaolo/psutil/issues/1110 for more information. These inconsistencies disappear after removing oneshot. Reviewed at https://reviews.apache.org/r/61016/ src/main/python/apache/thermos/monitoring/process_collector_psutil.py | 23 +++ 1 file changed, 11 insertions(+), 12 deletions(-) {code} > Thermos landing (host) page reports incorrect CPU rates when it is busy > --- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug >Reporter: Reza Motamedi >Assignee: Reza Motamedi >Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly > create short lived children. This indicates that in time between > `process_collector_psutil` looks up the Process children and the time it > calculates the CPU time the pid of the child is actually reused by another > much younger process, which leads to negative CPU times. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096880#comment-16096880 ] Reza Motamedi commented on AURORA-1939: --- reviewable: https://reviews.apache.org/r/61016/ > Thermos landing (host) page reports incorrect CPU rates when it is busy > --- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug >Reporter: Reza Motamedi >Assignee: Reza Motamedi >Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly > create short lived children. This indicates that in time between > `process_collector_psutil` looks up the Process children and the time it > calculates the CPU time the pid of the child is actually reused by another > much younger process, which leads to negative CPU times. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095732#comment-16095732 ] Reza Motamedi commented on AURORA-1939: --- Following [~StephanErb]'s suggestion tried to guard the psutil's `oneshot` as a critical section. It does not seem to work however: ``` ... from threading import Lock ... oneshot_lock = Lock() def process_to_sample(process): """ Given a psutil.Process, return a current ProcessSample """ try: with oneshot_lock: with process.oneshot(): # the nonblocking get_cpu_percent call is stateful on a particular Process object, and hence # >2 consecutive calls are required before it will return a non-zero value rate = process.cpu_percent(0.0) / 100.0 cpu_times = process.cpu_times() ... ``` > Thermos landing (host) page reports incorrect CPU rates when it is busy > --- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug >Reporter: Reza Motamedi >Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly > create short lived children. This indicates that in time between > `process_collector_psutil` looks up the Process children and the time it > calculates the CPU time the pid of the child is actually reused by another > much younger process, which leads to negative CPU times. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095271#comment-16095271 ] Reza Motamedi commented on AURORA-1939: --- I just stopped using oneshot and the problem went away. Makes me believe that {{oneshot}} is not thread safe. > Thermos landing (host) page reports incorrect CPU rates when it is busy > --- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug >Reporter: Reza Motamedi >Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly > create short lived children. This indicates that in time between > `process_collector_psutil` looks up the Process children and the time it > calculates the CPU time the pid of the child is actually reused by another > much younger process, which leads to negative CPU times. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095046#comment-16095046 ] Reza Motamedi commented on AURORA-1939: --- Added more logging to check the returned value for a process.cpu_times(). The numbers simply don't add up. What I see the same pid sampled by psutil showing a different name and CPU samples that fluctuate (decrease). Note the in this example pid:14712 was created on July 16th, and the log is from July 20th. So there is no chance that what we see in the log is because pid reused. {noformat} ~ $ grep 'pid=14712' thermos-observer.DEBUG D0720 17:01:04.148628 61755 process_collector_psutil.py:35] process:psutil.Process(pid=14712, name='mesos-slave') cpu times pcputimes(user=603.88, system=1104.31, children_user=0.0, children_system=0.0) D0720 17:06:02.358989 61755 process_collector_psutil.py:35] process:psutil.Process(pid=14712, name='mesos-slave') cpu times pcputimes(user=591.16, system=1097.5, children_user=0.0, children_system=0.0) D0720 17:10:55.258080 61755 process_collector_psutil.py:35] process:psutil.Process(pid=14712, name='python2.7') cpu times pcputimes(user=44.81, system=7.29, children_user=0.0, children_system=0.0) D0720 17:16:23.156296 61755 process_collector_psutil.py:35] process:psutil.Process(pid=14712, name='python2.7') cpu times pcputimes(user=596.01, system=1104.11, children_user=0.0, children_system=0.0) D0720 17:21:21.552978 61755 process_collector_psutil.py:35] process:psutil.Process(pid=14712, name='python2.7') cpu times pcputimes(user=44.9, system=7.3, children_user=0.0, children_system=0.0) ~ $ ps -o lstart= -p 14712 Sun Jul 16 11:39:19 2017 {noformat} > Thermos landing (host) page reports incorrect CPU rates when it is busy > --- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug >Reporter: Reza Motamedi >Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly > create short lived children. This indicates that in time between > `process_collector_psutil` looks up the Process children and the time it > calculates the CPU time the pid of the child is actually reused by another > much younger process, which leads to negative CPU times. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093883#comment-16093883 ] Reza Motamedi commented on AURORA-1939: --- I just started tailing my logs and as samples came in catted the proc/``/stat. log snippets: {noformat} ~ $ tail -n 1000 -f thermos-observer.DEBUG | grep 62328 D0719 22:01:37.055901 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: -1.94176829079 (old: 1488.84, new: 920.87) {62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=526.98, system=961.84, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=332.69, system=588.16, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 22:06:22.156851 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: -0.00277094847146 (old: 920.87, new: 920.08) {62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=332.69, system=588.16, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=329.27, system=590.79, rss=34787328, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 22:11:07.052711 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: -0.0636723871193 (old: 920.08, new: 901.94) {62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=329.27, system=590.79, rss=34787328, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=326.72, system=575.2, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] {noformat} checking values directly. Also added date so that we can compare times. {noformat} ~ $ date; cat /proc/866/stat | awk '{print $14, $15}' Wed Jul 19 22:01:41 UTC 2017 52733 96241 ~ $ date; cat /proc/866/stat | awk '{print $14, $15}' Wed Jul 19 22:06:27 UTC 2017 52766 96296 {noformat} > Thermos landing (host) page reports incorrect CPU rates when it is busy > --- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug >Reporter: Reza Motamedi >Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly > create short lived children. This indicates that in time between > `process_collector_psutil` looks up the Process children and the time it > calculates the CPU time the pid of the child is actually reused by another > much younger process, which leads to negative CPU times. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093862#comment-16093862 ] Reza Motamedi commented on AURORA-1939: --- This seems to be totally related to psutil. I contacted the developers there. > Thermos landing (host) page reports incorrect CPU rates when it is busy > --- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug >Reporter: Reza Motamedi >Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly > create short lived children. This indicates that in time between > `process_collector_psutil` looks up the Process children and the time it > calculates the CPU time the pid of the child is actually reused by another > much younger process, which leads to negative CPU times. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093761#comment-16093761 ] Reza Motamedi commented on AURORA-1939: --- I see this problem when the host is super busy, and resource collection is backlogged. In this case I also see much more errors of this kind in the log: {noformat} D0719 20:18:28.064794 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62193) D0719 20:18:35.351458 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=14711) D0719 20:18:35.552953 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62331) D0719 20:18:42.857400 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.01s D0719 20:18:43.753732 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62338) D0719 20:18:48.454077 24474 mesos_vars.py:384] Metrics collection took 6506.1ms D0719 20:18:50.253031 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62345) D0719 20:18:57.861535 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.00s D0719 20:19:12.955235 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.01s D0719 20:19:14.959180 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62361) D0719 20:19:14.960768 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62232) D0719 20:19:18.056128 24474 mesos_vars.py:384] Metrics collection took 6008.0ms D0719 20:19:22.856868 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62366) D0719 20:19:28.048165 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.09s D0719 20:19:28.660691 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62374) D0719 20:19:43.051047 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.00s D0719 20:19:48.355678 24474 mesos_vars.py:384] Metrics collection took 6299.8ms D0719 20:19:58.148663 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.10s D0719 20:20:11.449485 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62271) D0719 20:20:13.155102 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.01s D0719 20:20:18.249528 24474 mesos_vars.py:384] Metrics collection took 6179.9ms D0719 20:20:23.354832 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=11317) D0719 20:20:27.060431 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62281) D0719 20:20:28.160298 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.00s D0719 20:20:35.452637 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62289) D0719 20:20:43.252589 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.09s D0719 20:20:48.151144 24474 mesos_vars.py:384] Metrics collection took 6058.3ms D0719 20:20:55.254796 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62428) D0719 20:20:58.257311 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.00s D0719 20:21:13.352955 24474 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.10s D0719 20:21:17.555244 24474 process_collector_psutil.py:84] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=62124) {noformat} > Thermos landing (host) page reports incorrect CPU rates when it is busy > --- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug >Reporter: Reza Motamedi >Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly > create short lived children. This indicates that in time between > `process_collector_psutil` looks up the Process children and the time it > calculates the CPU
[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093763#comment-16093763 ] Reza Motamedi commented on AURORA-1939: --- I added more debug logs to trace what is kept in the process tree: {noformat} D0719 20:23:05.758181 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 0.000365496747142 (old: 885.57, new: 885.67) {62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=315.39, system=570.16, rss=36704256, vms=1601019904, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=314.96, system=570.69, rss=36134912, vms=1601019904, nice=0, status='sleeping', threads=20)}] D0719 20:27:44.454570 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 2.10562464635 (old: 885.67, new: 1472.50) {62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=314.96, system=570.69, rss=36134912, vms=1601019904, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=520.93, system=951.55, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 20:32:29.451004 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 0.00294740555899 (old: 1472.50, new: 1473.34) {62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=520.93, system=951.55, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.23, system=952.09, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 20:37:10.152525 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 0.00299250250927 (old: 1473.34, new: 1474.18) {62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.23, system=952.09, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.55, system=952.61, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 20:42:58.449567 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: -1.64913818086 (old: 1474.18, new: 899.79) {62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 157038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.55, system=952.61, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2732032, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=322.65, system=577.12, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] {noformat} Note: {62328: 157014.22, 866: 157038.32} -> {62328: 157014.22, 866: 157038.32} what processes where in the old and new sample, as in old -> new. The line after that is the ProcessSamples in the old pstree, kept in the object as "pid => ProcessSample", and the line after is the new sample. > Thermos landing (host) page reports incorrect CPU rates when it is busy > --- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug >Reporter: Reza Motamedi >Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In
[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy
[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064286#comment-16064286 ] Reza Motamedi commented on AURORA-1939: --- On second thought, the negative CPU values can simply be caused by a dead child process. Let me explain how. First, remember that CPU time reported by psutil, is the total CPU time spent to progress a process. Supposes at {{t_0 = 10}}, we have the following processes forked inside a thermos process. {noformat} __ p0 \_ p1 {noformat} The total CPU time of the thermos process is calculated at the CPU time in all the processes, i.e., {{Process(p_0).cpu_time + Process(p_1).cpu_time}}, For the sake of argument, let's say 1 second in {{p_0}} and 5 seconds in {{p_1}}. Now imagine that by the time to collect the next sample at {{t_1 = 20}}, another 5 seconds where spend in p_0, and p_0 finishes (dies) before the collection. Also, only an extra 1 second was spent by {{p_0}}. The current calculation leads to the following reported CPU values. (sum(new_samples) - sum(old_samples)) / (time difference). (2) - (1 + 5) / 5 = -3/10. A perfect calculation would include the time spend in the dead processes at the time of their death in the new sample. What makes sense is to discard the old processes that have died during the last time interval. > Thermos landing (host) page reports incorrect CPU rates when it is busy > --- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug >Reporter: Reza Motamedi >Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly > create short lived children. This indicates that in time between > `process_collector_psutil` looks up the Process children and the time it > calculates the CPU time the pid of the child is actually reused by another > much younger process, which leads to negative CPU times. -- This message was sent by Atlassian JIRA (v6.4.14#64029)