[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-06-26 Thread Reza Motamedi (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064286#comment-16064286
 ] 

Reza Motamedi commented on AURORA-1939:
---

On second thought, the negative CPU values can simply be caused by a dead child 
process. Let me explain how. First, remember that the CPU time reported by 
psutil is the total CPU time a process has consumed since it started.

Suppose that at {{t_0 = 10}} we have the following processes forked inside a 
Thermos process.

{noformat}
__ p0
   \_ p1
{noformat}

The total CPU time of the Thermos process is calculated as the sum of the CPU 
time of all its processes, i.e., 
{{Process(p_0).cpu_time + Process(p_1).cpu_time}}. For the sake of argument, 
let's say 1 second in {{p_0}} and 5 seconds in {{p_1}}.
Now imagine that by the time the next sample is collected at {{t_1 = 20}}, 
another 5 seconds were spent in {{p_1}}, and {{p_1}} finishes (dies) before the 
collection. Also, only an extra 1 second was spent by {{p_0}}, so the new 
sample contains only {{p_0}} with 2 seconds. The current calculation leads to 
the following reported CPU value:

(sum(new_samples) - sum(old_samples)) / (time difference)
((2) - (1 + 5)) / (20 - 10) = -4/10

A perfect calculation would include, in the new sample, the time spent in the 
dead processes up to the time of their death. What makes sense is to discard 
the old samples of processes that died during the last time interval.
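
To make the arithmetic concrete, here is a minimal sketch of both calculations. 
It assumes that {{p_1}} is the child that dies and that the interval is 
{{t_1 - t_0 = 10}} seconds. The function names and the dict-based samples are 
hypothetical and are not the actual Thermos collector code.

```python
def naive_cpu_rate(old, new, elapsed):
    """Difference of the summed CPU times, as in the current calculation."""
    return (sum(new.values()) - sum(old.values())) / elapsed


def surviving_cpu_rate(old, new, elapsed):
    """Ignore old samples of processes that are missing from the new sample."""
    delta = 0.0
    for name, cpu_time in new.items():
        # A process that first appears in the new sample starts from zero.
        delta += cpu_time - old.get(name, 0.0)
    return delta / elapsed


# Hypothetical samples: cumulative CPU seconds per process.
old_sample = {"p_0": 1.0, "p_1": 5.0}   # collected at t_0 = 10
new_sample = {"p_0": 2.0}               # collected at t_1 = 20, p_1 has died

print(naive_cpu_rate(old_sample, new_sample, 10.0))      # -0.4
print(surviving_cpu_rate(old_sample, new_sample, 10.0))  # 0.1
```

The second variant under-reports the 5 seconds that {{p_1}} consumed before it 
died, but it can no longer go negative because of a dead child.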




> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
> Issue Type: Bug
> Reporter: Reza Motamedi
> Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short-lived children. This suggests that, between the time 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates their CPU time, the pid of a child is reused by another, 
> much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (AURORA-1937) Add metrics for status updates before switching to V1 Mesos Driver implementation

2017-06-26 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063422#comment-16063422
 ] 

Kai Huang edited comment on AURORA-1937 at 6/26/17 5:38 PM:


Add counter for status_update and framework_message: 
https://reviews.apache.org/r/60350/



was (Author: kaih):
Add counter for status_update and framework_message:
https://reviews.apache.org/r/60350/


> Add metrics for status updates before switching to V1 Mesos Driver 
> implementation
> 
>
> Key: AURORA-1937
> URL: https://issues.apache.org/jira/browse/AURORA-1937
> Project: Aurora
> Issue Type: Task
> Components: Scheduler
> Reporter: Kai Huang
> Assignee: Kai Huang
>
> Zameer has created a new driver implementation around V1Mesos 
> (https://reviews.apache.org/r/57061). 
> The V1 Mesos code requires a Scheduler callback with a different API. To 
> maximize code reuse, event handling logic was extracted into a 
> [MesosCallbackHandler | 
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java#L286]
>  class. However, this class has no metrics for handling task status 
> updates.
> Metrics around task status updates are key performance indicators for the 
> scheduler. We need to add these metrics back in order to switch to the 
> V1Mesos driver in production.





[jira] [Commented] (AURORA-1937) Add metrics for status updates before switching to V1 Mesos Driver implementation

2017-06-26 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063483#comment-16063483
 ] 

Kai Huang commented on AURORA-1937:
---

Add timing metrics for status_update: https://reviews.apache.org/r/60437/






[jira] [Created] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-06-26 Thread Reza Motamedi (JIRA)
Reza Motamedi created AURORA-1939:
-

Summary: Thermos landing (host) page reports incorrect CPU rates when it is busy
Key: AURORA-1939
URL: https://issues.apache.org/jira/browse/AURORA-1939
Project: Aurora
Issue Type: Bug
Reporter: Reza Motamedi
Priority: Minor


Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
Processes. On a busy machine, I have noticed negative CPU values when visiting 
the Thermos landing page.

In my test I reproduced this by starting many processes that constantly create 
short-lived children. This suggests that, between the time 
`process_collector_psutil` looks up the Process children and the time it 
calculates their CPU time, the pid of a child is reused by another, much 
younger process, which leads to negative CPU times.
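
One way to illustrate the suspected pid reuse is to key each sample by the 
pair (pid, creation time) instead of the pid alone. The sketch below is only an 
illustration under that assumption: the function names and sample values are 
hypothetical, and in practice the creation time could come from psutil's 
`Process.create_time()`.

```python
def rate_by_pid(old, new, elapsed):
    """Match samples on pid only, so a recycled pid inherits the old CPU time."""
    old_by_pid = {pid: cpu for (pid, _created), cpu in old.items()}
    delta = sum(cpu - old_by_pid.get(pid, 0.0) for (pid, _created), cpu in new.items())
    return delta / elapsed


def rate_by_identity(old, new, elapsed):
    """Match samples on (pid, create_time), so a recycled pid starts from zero."""
    delta = sum(cpu - old.get(identity, 0.0) for identity, cpu in new.items())
    return delta / elapsed


# Hypothetical samples mapping (pid, create_time) -> cumulative CPU seconds.
old_sample = {(4242, 100.0): 6.0}   # long-running child
new_sample = {(4242, 118.5): 0.2}   # same pid, but a much younger process

print(rate_by_pid(old_sample, new_sample, 10.0))       # negative
print(rate_by_identity(old_sample, new_sample, 10.0))  # small positive value
```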





[jira] [Commented] (AURORA-1938) Aurora failed without log detail

2017-06-26 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063380#comment-16063380
 ] 

Stephan Erb commented on AURORA-1938:
-

The current snippet you posted does not tell us why Aurora thinks the storage 
is not ready. Normally those messages point to problems with the replicated 
log, or maybe connectivity issues between your Aurora schedulers. 

The log lines indicate that Aurora cannot even properly connect to the 
ZooKeeper ensemble. A working ZooKeeper connection is a prerequisite for a 
working cluster as well.
{code}
2017-06-20 17:38:58,527:1(0x7f13511fc700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.176.128.91:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
{code}
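
A "Connection refused" like the one above can be double-checked from the 
scheduler host with a plain TCP connection attempt. This is a minimal sketch, 
assuming Python is available on that host; the helper name `can_connect` is 
hypothetical and not part of Aurora or ZooKeeper.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the log line above this would be `can_connect("10.176.128.91", 2181)`. A 
`False` result only shows that nothing accepted the TCP connection, not why.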

How many Aurora schedulers do you have? 3 or 5? It would be great to have 
their full logs (if you feel comfortable sharing them). 


> Aurora failed without log detail
> 
>
> Key: AURORA-1938
> URL: https://issues.apache.org/jira/browse/AURORA-1938
> Project: Aurora
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 0.13.0
> Reporter: Luc Nguyen
> Fix For: 0.13.0
>
> Attachments: Error_1.txt, Error_2.txt
>
>
> Aurora failed without any detail in the logs.
> We also had a backup for Aurora. However, the Aurora backup also failed.
> It bothers us that no log shows the failure in detail.
> Has anyone else run into the same problem?





[jira] [Commented] (AURORA-1938) Aurora failed without log detail

2017-06-26 Thread Luc Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063341#comment-16063341
 ] 

Luc Nguyen commented on AURORA-1938:


Hi Stephan! Hope you have a chance to check the log and let us know today. 
Thanks.



