Ray Chiang commented on YARN-2868:

I would like to make this metrics discussion a bit clearer, for my own 
sanity.  The current situation:

A1) Before YARN-2802, ClusterMetrics held only NM metrics.  AM metrics were 
added in YARN-2802, partly because storing them on each node isn't useful for 
debugging.  Vinod's review moved the metric out of the RM (since it isn't 
really RM-related) and into ClusterMetrics.

A2) QueueMetrics (and derived classes) currently has metrics for App counts and 
MB/VCore/Container statistics.

This JIRA is the first of many aimed at putting metrics in place so that we 
have at least basic YARN profiling.

B1) If it's put into ClusterMetrics, it is, as Anubhav mentioned, a good global 
metric/warning system, but it won't help much with debugging below the 
cluster level.

B2) If it's put into QueueMetrics, we also gain the ability to tell queue 
issues apart from network/cluster issues with respect to container allocation.

My feedback on the discussion so far:

C1) I do believe container allocation can be queue-dependent.  Whether the 
metric is useful only for FairScheduler, as opposed to other schedulers, is 
debatable (which is why it was originally in FSQueueMetrics).

C2) QueueMetrics has the advantage that a customer can take a metrics snapshot 
and use it to debug application delays (at least for this first metric).  My 
near-term goal is to keep adding to this area so that we get a clear 
per-queue snapshot of any RM-related application runtime metrics.
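To make the B2/C2 option concrete, here is a minimal sketch, assuming a hypothetical per-queue tracker. FirstContainerLatencyTracker and its methods are illustrative names only, not the actual QueueMetrics or FSQueueMetrics API, and timestamps are passed in as plain milliseconds for simplicity. It records when allocation starts for an app, samples only the first allocated container, and aggregates the latency per queue.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical per-queue tracker for the latency between "starting container
 * allocation" and "first container actually allocated". Illustrative only;
 * this is not the Hadoop QueueMetrics API.
 */
public class FirstContainerLatencyTracker {

  /** Running aggregate for one queue: sample count and total latency. */
  static final class QueueStats {
    long count;
    long totalMillis;

    double averageMillis() {
      return count == 0 ? 0.0 : (double) totalMillis / count;
    }
  }

  // appId -> time (ms) at which allocation started for that app
  private final Map<String, Long> allocationStart = new HashMap<>();
  // appId -> queue name, so each sample is attributed to the right queue
  private final Map<String, String> appQueue = new HashMap<>();
  // queue name -> aggregated latency statistics
  private final Map<String, QueueStats> perQueue = new HashMap<>();

  /** Record that the scheduler started trying to allocate for an app. */
  public void onAllocationStarted(String appId, String queue, long nowMillis) {
    allocationStart.putIfAbsent(appId, nowMillis);
    appQueue.put(appId, queue);
  }

  /** Record a container allocation; only the first one per app is sampled. */
  public void onContainerAllocated(String appId, long nowMillis) {
    Long start = allocationStart.remove(appId);
    String queue = appQueue.remove(appId);
    if (start == null || queue == null) {
      return; // not the first container, or the start was never recorded
    }
    QueueStats stats = perQueue.computeIfAbsent(queue, q -> new QueueStats());
    stats.count++;
    stats.totalMillis += nowMillis - start;
  }

  /** Average first-container latency for one queue (0 if no samples). */
  public double averageLatencyMillis(String queue) {
    QueueStats stats = perQueue.get(queue);
    return stats == null ? 0.0 : stats.averageMillis();
  }

  /** Number of first-container samples recorded for one queue. */
  public long sampleCount(String queue) {
    QueueStats stats = perQueue.get(queue);
    return stats == null ? 0 : stats.count;
  }

  /** Cluster-wide average, derived by summing every queue's samples. */
  public double clusterAverageLatencyMillis() {
    long count = 0;
    long total = 0;
    for (QueueStats stats : perQueue.values()) {
      count += stats.count;
      total += stats.totalMillis;
    }
    return count == 0 ? 0.0 : (double) total / count;
  }
}
```

The design point this is meant to illustrate: a cluster-wide figure (the B1 view) can still be derived by summing the per-queue samples, whereas per-queue numbers cannot be recovered from a single cluster-wide aggregate.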

Any thoughts?

PS: I appreciate all the great feedback so far.  It's definitely pointing me 
to the right places in the code and giving me a better overall understanding.  Thanks.

> Add metric for initial container launch time to FairScheduler
> -------------------------------------------------------------
>                 Key: YARN-2868
>                 URL: https://issues.apache.org/jira/browse/YARN-2868
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Ray Chiang
>            Assignee: Ray Chiang
>              Labels: metrics, supportability
>         Attachments: YARN-2868-01.patch, YARN-2868.002.patch, 
> YARN-2868.003.patch, YARN-2868.004.patch, YARN-2868.005.patch, 
> YARN-2868.006.patch, YARN-2868.007.patch
> Add a metric to measure the latency between "starting container allocation" 
> and "first container actually allocated".

This message was sent by Atlassian JIRA