[ https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345069#comment-15345069 ]
Jason Lowe commented on YARN-4862:
----------------------------------
bq. I also do not see this as a performance bottleneck, as we are operating on
a small set of running vs finished for a node per heartbeat.
The performance impact is basically zero because we're not doing that small-set
comparison most of the time. The only thing we do on every heartbeat is
increment an integer during a scan of the node report that was already being
done before, then compare that integer to the size of a hash set, which
is also super cheap. Only when those numbers differ do we do the diff
between the set and the report. For YARN-5197, those numbers will always be
the same _unless_ the NM failed to report a container completion, which should
be a rare event. The performance hit is going to be very hard to detect in
practice because of the cheap conditional check up front before doing the full
diff.
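To make the cheap check concrete, here is a minimal sketch of the idea, not the actual patch. The class, the Status type, and every method and field name below are hypothetical stand-ins; the real logic lives in RMNodeImpl and operates on YARN's own container status types.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: counter check first, full diff only on mismatch. */
public class LostContainerCheckSketch {

    /** Simplified stand-in for one container entry in a node heartbeat report. */
    public static final class Status {
        public final String containerId;
        public final boolean running;

        public Status(String containerId, boolean running) {
            this.containerId = containerId;
            this.running = running;
        }
    }

    // Containers the RM currently believes are running on this node.
    private final Set<String> launched = new HashSet<>();

    // Counts how often the expensive path ran (for illustration only).
    public int fullDiffs = 0;

    /** Processes one report and returns containers that vanished unreported. */
    public List<String> processHeartbeat(List<Status> report) {
        // This scan already exists; the only added work is the counter.
        int runningInReport = 0;
        for (Status s : report) {
            if (s.running) {
                runningInReport++;
                launched.add(s.containerId);
            } else {
                launched.remove(s.containerId);
            }
        }

        List<String> lost = new ArrayList<>();
        // Cheap conditional: an int compared to the size of a hash set.
        if (runningInReport != launched.size()) {
            fullDiffs++;
            Set<String> reported = new HashSet<>();
            for (Status s : report) {
                if (s.running) {
                    reported.add(s.containerId);
                }
            }
            // Anything tracked as running but absent from the report is lost.
            Iterator<String> it = launched.iterator();
            while (it.hasNext()) {
                String id = it.next();
                if (!reported.contains(id)) {
                    lost.add(id);
                    it.remove();
                }
            }
        }
        return lost;
    }
}
```

In this sketch the full diff runs only when the report omits a container that is still tracked as running (or lists a running container twice), which matches the rare-event argument above.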
bq. This will slowup the cleanup in case if we preempt AM container, but may be
more cleaner.
It won't slow how fast the container will be killed, if that's what you mean by
"cleanup." Only the NM can kill it anyway, and it won't know to do so
until it subsequently heartbeats. It will slow down how fast the RM
re-schedules the resources associated with the preempted container, since it will
wait until the NM confirms the container completion before releasing the
resources within the scheduler bookkeeping and re-allocating them. Today the RM
does not wait, which means it can, and does, accidentally overcommit nodes
because it considers the resources free before they actually are free. Filed
YARN-5290 as we've recently seen this in practice on some of our clusters.
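The difference between the two bookkeeping policies can be illustrated with a toy model. This is only a sketch under simplifying assumptions (a single node, memory as the only resource); none of the names below come from the YARN scheduler code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy model of per-node resource bookkeeping in the RM. */
public class NodeBookkeepingSketch {
    private final int totalMemoryMb;
    private final boolean waitForNodeConfirmation;
    private final Map<String, Integer> allocations = new HashMap<>();
    private int allocatedMemoryMb = 0;

    public NodeBookkeepingSketch(int totalMemoryMb, boolean waitForNodeConfirmation) {
        this.totalMemoryMb = totalMemoryMb;
        this.waitForNodeConfirmation = waitForNodeConfirmation;
    }

    /** Allocates only if the bookkeeping says the memory is free. */
    public boolean allocate(String containerId, int memoryMb) {
        if (allocatedMemoryMb + memoryMb > totalMemoryMb) {
            return false;
        }
        allocations.put(containerId, memoryMb);
        allocatedMemoryMb += memoryMb;
        return true;
    }

    /** RM decides to preempt; the container keeps running until the NM kills it. */
    public void preempt(String containerId) {
        if (!waitForNodeConfirmation) {
            // Eager policy: treat the resources as free right away.
            release(containerId);
        }
        // Waiting policy: keep the resources accounted until the NM confirms.
    }

    /** The NM reported in a heartbeat that the container has completed. */
    public void nodeReportedCompletion(String containerId) {
        release(containerId);
    }

    private void release(String containerId) {
        Integer memoryMb = allocations.remove(containerId);
        if (memoryMb != null) {
            allocatedMemoryMb -= memoryMb;
        }
    }
}
```

With the eager policy, a second full-node allocation succeeds immediately after preempt() even though the first container is still physically running, which is the overcommit described above. With the waiting policy, the same allocation is refused until nodeReportedCompletion() is called.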
> Handle duplicate completed containers in RMNodeImpl
> ---------------------------------------------------
>
> Key: YARN-4862
> URL: https://issues.apache.org/jira/browse/YARN-4862
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Attachments: 0001-YARN-4862.patch, 0002-YARN-4862.patch
>
>
> As per
> [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689]
> from [~sharadag], there should be a safeguard against duplicated container statuses
> in RMNodeImpl before creating UpdatedContainerInfo.
> Otherwise, in a heavily loaded cluster where event processing gradually slows,
> if any duplicated containers are sent to the RM (possibly due to a bug in the NM), there is
> a significant impact because RMNodeImpl always creates UpdatedContainerInfo for
> duplicated containers. This results in increased heap memory usage and causes
> problems like YARN-4852.
> This is an optimization for issues like YARN-4852.