[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183611#comment-15183611
 ] 

Sangjin Lee commented on YARN-4712:
-----------------------------------

My main concern with {{cpuUsageTotalCoresPercentage}} is *aggregation*: I 
think it breaks down in a heterogeneous cluster. Here is an illustrative 
example.

Suppose you have a 2-node cluster, where the first node has 4 cores and the 
second node has 8 cores. Furthermore, suppose that the container on the 4-core 
node is utilizing all 4 cores and the container on the 8-core node is utilizing 
1 core. Since the entire cluster has 12 cores and the app is using 5 cores, the 
utilization of this app should be about 42% (5 of 12 cores).

However, if we use {{cpuUsageTotalCoresPercentage}}, we have a problem. The 
container on the 4-core node will report 100% utilization of that node, and the 
other container on the 8-core node will report 12.5% utilization. If we then 
aggregate the container metrics to the app, the app shows 112.5% utilization 
when the values are summed, or about 56% when they are averaged over the two 
nodes. IMO neither number is correct, or at best both are misleading.

If every node has the same number of cores, it makes no difference which of the 
two metrics we use. However, in a heterogeneous cluster, aggregating 
{{cpuUsageTotalCoresPercentage}} with *simple* averages essentially 
underweights the larger machines. That produces a misleading and confusing 
result on aggregation.

I do recognize that aggregating {{cpuUsagePercentPerCore}} requires the total 
number of cores in the cluster to arrive at a relative percentage. But overall 
I do feel that {{cpuUsagePercentPerCore}} would be a more accurate measure of 
the cluster utilization when aggregated.
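
To make the arithmetic concrete, here is a minimal sketch of the two-node 
example above. The class and method names are hypothetical and are not YARN 
types. It assumes one container per node and that {{cpuUsagePercentPerCore}} 
reports 100 for each fully busy core.

```java
import java.util.Locale;

/** Illustrative only: these types are hypothetical, not YARN classes. */
final class CpuAggregationSketch {

    /** One container's CPU sample on a node with the given core count. */
    static final class Sample {
        final int nodeCores;
        final double cpuUsagePercentPerCore; // assumption: 100.0 == one fully busy core

        Sample(int nodeCores, double cpuUsagePercentPerCore) {
            this.nodeCores = nodeCores;
            this.cpuUsagePercentPerCore = cpuUsagePercentPerCore;
        }

        /** Same division as in the issue description: per-core percent / node cores. */
        double cpuUsageTotalCoresPercentage() {
            return cpuUsagePercentPerCore / nodeCores;
        }
    }

    /** Simple average of node-relative percentages; assumes one sample per node. */
    static double simpleAverage(Sample[] samples) {
        double sum = 0.0;
        for (Sample s : samples) {
            sum += s.cpuUsageTotalCoresPercentage();
        }
        return sum / samples.length;
    }

    /** Sum of per-core percentages divided by the cluster's total core count. */
    static double perCoreAggregate(Sample[] samples, int clusterCores) {
        double sum = 0.0;
        for (Sample s : samples) {
            sum += s.cpuUsagePercentPerCore;
        }
        return sum / clusterCores;
    }

    public static void main(String[] args) {
        Sample[] samples = {
            new Sample(4, 400.0), // container using all 4 cores of the 4-core node
            new Sample(8, 100.0)  // container using 1 core of the 8-core node
        };
        System.out.println(String.format(Locale.ROOT,
            "simple average of total-cores percentage: %.2f%%", simpleAverage(samples)));
        System.out.println(String.format(Locale.ROOT,
            "per-core sum over 12 cluster cores: %.2f%%", perCoreAggregate(samples, 12)));
    }
}
```

With these inputs the simple average comes out at 56.25%, while the per-core 
sum divided by the 12 cluster cores comes out at about 41.67%, matching the 
5/12 figure above.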

I am OK with separating that discussion to another JIRA.

> CPU Usage Metric is not captured properly in YARN-2928
> ------------------------------------------------------
>
>                 Key: YARN-4712
>                 URL: https://issues.apache.org/jira/browse/YARN-4712
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>              Labels: yarn-2928-1st-milestone
>         Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch, 
> YARN-4712-YARN-2928.v1.004.patch
>
>
> There are 2 issues with CPU usage collection:
> * I observed that the CPU usage returned by 
> {{pTree.getCpuUsagePercent()}} is often 
> ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor 
> still does the calculation, i.e. {{cpuUsageTotalCoresPercentage = 
> cpuUsagePercentPerCore / resourceCalculatorPlugin.getNumProcessors()}}, so 
> the UNAVAILABLE check in 
> {{NMTimelinePublisher.reportContainerResourceUsage}} is never hit. Proper 
> checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but 
> ContainersMonitor is publishing decimal values for the CPU usage.
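
The first issue in the description can be illustrated with a small sketch. It 
is not the actual ContainersMonitor code: the class, constant, and method names 
below are hypothetical stand-ins, and the only facts taken from the description 
are that UNAVAILABLE is -1 and that the value is divided by the processor 
count.

```java
/** Illustrative only: a stand-in for the division described in the issue. */
final class CpuUsageGuardSketch {

    /** Stand-in for ResourceCalculatorProcessTree.UNAVAILABLE (-1 per the description). */
    static final int UNAVAILABLE = -1;

    /** Unchecked division: the -1 sentinel is scaled and no longer equals UNAVAILABLE. */
    static float uncheckedTotalCoresPercentage(float cpuUsagePercentPerCore, int numProcessors) {
        return cpuUsagePercentPerCore / numProcessors;
    }

    /** Guarded division: the sentinel is passed through unchanged. */
    static float guardedTotalCoresPercentage(float cpuUsagePercentPerCore, int numProcessors) {
        if (cpuUsagePercentPerCore == UNAVAILABLE) {
            return UNAVAILABLE;
        }
        return cpuUsagePercentPerCore / numProcessors;
    }

    public static void main(String[] args) {
        // On an 8-core node, -1 / 8 is -0.125, so a later "== UNAVAILABLE" check misses it.
        float unchecked = uncheckedTotalCoresPercentage(UNAVAILABLE, 8);
        System.out.println("unchecked is UNAVAILABLE: " + (unchecked == UNAVAILABLE));

        // With the guard, the downstream check still sees the sentinel.
        float guarded = guardedTotalCoresPercentage(UNAVAILABLE, 8);
        System.out.println("guarded is UNAVAILABLE: " + (guarded == UNAVAILABLE));
    }
}
```

Checking for the sentinel before dividing is one way to keep a downstream 
UNAVAILABLE check effective. Where the check belongs in the real code is a 
separate decision.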



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
