[
https://issues.apache.org/jira/browse/YARN-4821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267794#comment-15267794
]
Sangjin Lee commented on YARN-4821:
-----------------------------------
Thanks [~Naganarasimha] for the patch! I took a look at the patch, and I have
some feedback and questions. I suspect there is more subtlety than it appears
initially.
(1) config
Currently it is defined as "number of monitors to skip before publishing to
timeline service." However, this is not too user-friendly. For example, if this
value is 3, it really means we would publish every *4-th* time if I read this
correctly. This may not be too apparent to users. My preference would be to use
a true multiple. If we're going to emit every {{n}}-th time, we should let
users define {{n}} as the config. Can we redefine this config?
Also, I would suggest the default value of 5 as discussed previously. 5 gives
us a nice round number of 15 seconds (4 times a minute).
Also, can you please add it to {{yarn-default.xml}} too?
(2) keeping track of intermediate values
I am somewhat concerned about the amount of objects that get instantiated (and
discarded) during this process. For every live container and every monitoring
interval, it would create a new {{ResourceUtilization}} object, and also
another {{ResourceUtilization}} object when you average. It would add to the
garbage collection pressure.
Can we come up with a different way of keeping track of intermediate values?
How about creating a data class that lets you accumulate values (as opposed to
having a {{List}})? That way, a single live container needs only one data class
object. Also, we won't have to store things like vmem which we're not tracking
anyway.
(3) edge cases
What if we are still accumulating (without publishing) when the container is
finished? I think we simply don't publish? I suspect that might be OK. We might
want to make it explicit with a little comments. Any other edge case we need to
consider?
Onto more line-level comments:
(ContainersMonitorImpl.java)
- l. 85: It should be private. Also, let's instantiate this map in
{{serviceInit()}} only if NM timeline publisher is enabled (i.e. timeline
service v.2 is enabled).
- l.579: I am puzzled by this. Why are we re-defining this memory metric to be
in MB? Is this necessary as part of this patch? If there is no reason to
redefine the unit, we should go back to bytes...
> Have a separate NM timeline publishing-interval
> -----------------------------------------------
>
> Key: YARN-4821
> URL: https://issues.apache.org/jira/browse/YARN-4821
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: YARN-2928
> Reporter: Sangjin Lee
> Assignee: Naganarasimha G R
> Labels: yarn-2928-1st-milestone
> Attachments: YARN-4821-YARN-2928.v1.001.patch
>
>
> Currently the interval with which NM publishes container CPU and memory
> metrics is tied to {{yarn.nodemanager.resource-monitor.interval-ms}} whose
> default is 3 seconds. This is too aggressive.
> There should be a separate configuration that controls how often
> {{NMTimelinePublisher}} publishes container metrics.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]