[ 
https://issues.apache.org/jira/browse/YARN-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013344#comment-15013344
 ] 

Lavkesh Lahngir commented on YARN-4314:
---------------------------------------

Initial thoughts:
An AM sends resource requests with each heartbeat; the RM tries to fulfil the 
requests and sends back the response. 
We can maintain a data structure called ContainerWaitTime in 
AppSchedulingInfo to keep track of the timestamp of the last heartbeat and the 
number of pending containers. Resource requests and resource allocations update 
the ContainerWaitTime object to increase or decrease the pending containers. 
With every heartbeat, the total wait time for this attempt is increased by
(pending_containers * (current_timestamp - last_timestamp)), and 
last_timestamp is then updated to the current timestamp.
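
To make the bookkeeping above concrete, here is a minimal sketch of such a 
tracker. It is an illustration only, not the code from the patch: the class 
layout, the method names (onHeartbeat, addPending, removePending) and the use 
of milliseconds are all assumptions.

```java
/**
 * Hypothetical sketch of the ContainerWaitTime idea. Names and units are
 * assumptions made for illustration; this is not the patch itself.
 */
final class ContainerWaitTime {
    private long lastTimestampMs;   // time of the previous heartbeat
    private int pendingContainers;  // containers requested but not yet allocated
    private long totalWaitMs;       // accumulated container wait time

    ContainerWaitTime(long nowMs) {
        this.lastTimestampMs = nowMs;
    }

    /** Called on every heartbeat: charge the elapsed interval to each pending container. */
    synchronized void onHeartbeat(long nowMs) {
        totalWaitMs += (long) pendingContainers * (nowMs - lastTimestampMs);
        lastTimestampMs = nowMs;
    }

    /** Called when new resource requests arrive. */
    synchronized void addPending(int count) {
        pendingContainers += count;
    }

    /** Called when containers are allocated; never drops below zero. */
    synchronized void removePending(int count) {
        pendingContainers = Math.max(0, pendingContainers - count);
    }

    synchronized long getTotalWaitMs() {
        return totalWaitMs;
    }
}
```

With three pending containers and heartbeats one second apart, one call to 
onHeartbeat adds 3000 ms to the total in this sketch.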

Every attempt will maintain this data structure, similar to memory-seconds and 
vcore-seconds.
In the AppImpl class, there is a method called getAppMetrics() where we will 
aggregate the wait time from all the attempts and return it. 

For AM container wait time, we need to add an additional parameter called 
scheduledTime. In the getAppMetrics() method, we can get the total AM container 
wait time by summing up (attempt_scheduledTime - attempt_startedTime) over all 
attempts. If an attempt is not yet scheduled, its scheduledTime is replaced 
by the current time. 
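
The summation above could look roughly like the following sketch. The Attempt 
holder class, the NOT_SCHEDULED sentinel of -1 and the method name 
totalAmWaitMs are hypothetical and only stand in for whatever the attempt 
objects actually expose.

```java
import java.util.List;

/** Hypothetical sketch of the AM wait time aggregation; not the patch itself. */
final class AmWaitTime {
    /** Assumed sentinel meaning "this attempt has not been scheduled yet". */
    static final long NOT_SCHEDULED = -1L;

    /** Minimal stand-in for an application attempt (assumption). */
    static final class Attempt {
        final long startedTime;
        final long scheduledTime;

        Attempt(long startedTime, long scheduledTime) {
            this.startedTime = startedTime;
            this.scheduledTime = scheduledTime;
        }
    }

    /** Sums (scheduledTime - startedTime) over all attempts, using nowMs for unscheduled ones. */
    static long totalAmWaitMs(List<Attempt> attempts, long nowMs) {
        long total = 0L;
        for (Attempt a : attempts) {
            long end = (a.scheduledTime == NOT_SCHEDULED) ? nowMs : a.scheduledTime;
            total += end - a.startedTime;
        }
        return total;
    }
}
```

Because unscheduled attempts are charged up to the current time, the value 
returned by this sketch keeps growing while an AM is still waiting.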

To add these new metrics to the queue, we just need to update the 
queue_metrics object; they will then be aggregated at the queue level. 

For RM recovery, we will need to save these metrics to the state store, similar 
to the other attempt metrics (memory-seconds and vcore-seconds).
A few more classes need to be touched to implement the above, but the core idea 
remains the same. Most of the code is independent of the scheduler, apart from 
a few lines added in each scheduler implementation. 

I have implemented an initial version. I will put out the patch once I have 
tested it completely.

Feedback?


> Adding container wait time as a metric at queue level and application level.
> ----------------------------------------------------------------------------
>
>                 Key: YARN-4314
>                 URL: https://issues.apache.org/jira/browse/YARN-4314
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>
> There is a need for adding the container wait-time which can be tracked at 
> the queue and application level. 
> An application can have two kinds of wait times. One is AM wait time after 
> submission and another is total container wait time between AM asking for 
> containers and getting them. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
