[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181424#comment-15181424 ]

Sangjin Lee commented on YARN-4712:
-----------------------------------

I agree that we should take care of the UNAVAILABLE metrics via YARN-4308. My 
position is that we should skip reporting the value rather than reporting 0.
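
A minimal sketch of the "skip rather than report 0" behavior, assuming the UNAVAILABLE sentinel is -1 as stated in the issue description. The class and method names ({{CpuMetricGuard}}, {{totalCoresPercent}}) are hypothetical and are not the actual NodeManager code. The only point is that the sentinel check happens before the division by the processor count, so the caller can skip publishing.

```java
import java.util.OptionalDouble;

class CpuMetricGuard {
    // Assumption: mirrors ResourceCalculatorProcessTree.UNAVAILABLE (-1).
    static final int UNAVAILABLE = -1;

    /**
     * Converts a per-core percentage into a whole-node percentage.
     * Returns an empty value when the input is the UNAVAILABLE sentinel
     * (or the core count is unusable), so that nothing gets published.
     */
    static OptionalDouble totalCoresPercent(float cpuUsagePercentPerCore,
                                            int numProcessors) {
        if (cpuUsagePercentPerCore == UNAVAILABLE || numProcessors <= 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(cpuUsagePercentPerCore / numProcessors);
    }

    public static void main(String[] args) {
        // 300% per core on a 6-core node: a value is present.
        System.out.println(totalCoresPercent(300f, 6));
        // Sentinel input: empty, i.e. the metric is skipped, not reported as 0.
        System.out.println(totalCoresPercent(UNAVAILABLE, 6));
    }
}
```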

I know I might be opening a can of worms, but I'd like to raise a couple of 
points as they are closely related to this.

First, what should we report via {{NMTimelinePublisher}}? There are 2 choices: 
{{cpuUsagePercentPerCore}} (300% in the example mentioned in the comment) and 
{{cpuUsageTotalCoresPercentage}} (50% in the same example). I see that we're 
storing {{cpuUsageTotalCoresPercentage}}. I wonder if that is the best choice 
here.

For example, consider a cluster with workers with substantially different 
capacity (number of cores). If we used the latter and tried to aggregate them 
later for the application or the flow, this would lead to a highly misleading 
sum. 50% of a 6-core node is very different from 50% of a 24-core node.

Most of YARN's CPU accounting is based on cores rather than nodes/machines. IMO 
{{cpuUsagePercentPerCore}} would be a better value to emit. Thoughts?
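
To make the aggregation concern concrete, here is a small sketch with two hypothetical containers, each using 50% of its node: one on a 6-core node and one on a 24-core node. The {{Sample}} class and the helper methods are illustrative assumptions and not part of the timeline service object model.

```java
class CpuAggregationDemo {
    /** Hypothetical sample: per-core CPU percent plus the node's core count. */
    static final class Sample {
        final double perCorePercent;
        final int nodeCores;

        Sample(double perCorePercent, int nodeCores) {
            this.perCorePercent = perCorePercent;
            this.nodeCores = nodeCores;
        }

        /** Node-relative percentage, i.e. per-core percent divided by cores. */
        double totalCoresPercent() {
            return perCorePercent / nodeCores;
        }
    }

    /** Sum of per-core percentages; 100 units correspond to one busy core. */
    static double sumPerCore(Sample... samples) {
        double sum = 0;
        for (Sample s : samples) {
            sum += s.perCorePercent;
        }
        return sum;
    }

    /** Sum of node-relative percentages; the unit differs per node. */
    static double sumTotalCores(Sample... samples) {
        double sum = 0;
        for (Sample s : samples) {
            sum += s.totalCoresPercent();
        }
        return sum;
    }

    public static void main(String[] args) {
        Sample small = new Sample(300.0, 6);    // 3 cores busy on a 6-core node
        Sample large = new Sample(1200.0, 24);  // 12 cores busy on a 24-core node

        System.out.println(sumTotalCores(small, large)); // 50 + 50
        System.out.println(sumPerCore(small, large));    // 300 + 1200
    }
}
```

The node-relative sum (100) says nothing about how many cores the application actually used, whereas the per-core sum (1500) corresponds to 15 busy cores regardless of node size.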

The second point is the following line in the existing code:
{code}
        cpuMetric.setId(ContainerMetric.CPU.toString() + pId);
{code}

I vaguely remember reading this line and being puzzled. Why are we appending 
the process id to the metric id? Doesn't this cause issues when we do the 
aggregation? For example, suppose we have a container #1 (process id = 1234) on 
some machine whose CPU usage is 10%, and container #2 (process id = 5678) on 
another machine whose CPU usage is 20%. The object model will be

{noformat}
(container #1) -> (metric) -> ("CPU1234" => 10)
(container #2) -> (metric) -> ("CPU5678" => 20)
{noformat}

But we want to add them for the parent application. It would be really awkward to 
add these metrics with different keys. Why is process id needed here in the 
first place?
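
The key mismatch can be sketched as follows. This is an illustration using plain maps, not the actual aggregation code; {{sumByMetricId}} is a hypothetical helper, and the ids and values are the ones from the example above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MetricIdAggregationDemo {
    /** Sums metric values across containers, grouping strictly by metric id. */
    static Map<String, Long> sumByMetricId(List<Map<String, Long>> containerMetrics) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> metrics : containerMetrics) {
            for (Map.Entry<String, Long> e : metrics.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Ids with the process id appended: the two values never meet.
        Map<String, Long> withPid = sumByMetricId(Arrays.asList(
            Collections.singletonMap("CPU1234", 10L),
            Collections.singletonMap("CPU5678", 20L)));
        System.out.println(withPid);

        // A shared id: the values add up for the parent application.
        Map<String, Long> sharedId = sumByMetricId(Arrays.asList(
            Collections.singletonMap("CPU", 10L),
            Collections.singletonMap("CPU", 20L)));
        System.out.println(sharedId);
    }
}
```

With the process id in the key, the result keeps two separate entries; with a shared key, the result is a single entry holding the sum of 30.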

> CPU Usage Metric is not captured properly in YARN-2928
> ------------------------------------------------------
>
>                 Key: YARN-4712
>                 URL: https://issues.apache.org/jira/browse/YARN-4712
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>              Labels: yarn-2928-1st-milestone
>         Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch
>
>
> There are 2 issues with CPU usage collection:
> * I observed that the CPU usage returned by 
> {{pTree.getCpuUsagePercent()}} is often 
> {{ResourceCalculatorProcessTree.UNAVAILABLE}} (i.e. -1), but ContainersMonitor 
> still does the calculation, i.e. {{cpuUsageTotalCoresPercentage = 
> cpuUsagePercentPerCore / resourceCalculatorPlugin.getNumProcessors()}}, so the 
> UNAVAILABLE check in {{NMTimelinePublisher.reportContainerResourceUsage}} is 
> never hit. Proper checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but 
> ContainersMonitor is publishing decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
