[ 
https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345069#comment-15345069
 ] 

Jason Lowe commented on YARN-4862:
----------------------------------

bq. I also do not see this as a performance bottleneck, as we are operating on 
a small set of running vs finished for a node per heartbeat.

The performance impact is basically zero because we're not doing that small-set 
comparison most of the time.  The only thing we do all the time is increment an 
integer during a scan of the node report that was already being done before, 
then compare that integer to the size of a hash set, which is also very cheap.  
Only when those numbers differ do we do the diff between the set and the 
report.  For YARN-5197, those numbers will always be the same _unless_ the NM 
failed to report a container completion, which should be a rare event.  The 
performance hit is going to be very hard to detect in practice because of the 
cheap conditional check up front before doing the full diff.
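
To illustrate the count-then-compare idea described above, here is a minimal sketch. It is not the code from the patch: the class name, method name, and the use of plain strings as container IDs are all hypothetical, and it assumes the node report lists each container at most once.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the actual RMNodeImpl logic.
final class HeartbeatDiffSketch {

    /**
     * Returns the IDs that are tracked as running but absent from the report.
     * Assumes reportedIds contains no duplicate entries.
     */
    static Set<String> findUnreported(Set<String> trackedRunning,
                                      List<String> reportedIds) {
        // Cheap part: bump a counter during the scan of the report
        // (a scan that would be happening anyway).
        int matched = 0;
        for (String id : reportedIds) {
            if (trackedRunning.contains(id)) {
                matched++;
            }
        }

        // Cheap check: if the counter equals the size of the tracked set,
        // every tracked container was reported, so skip the diff entirely.
        if (matched == trackedRunning.size()) {
            return new HashSet<>();
        }

        // Rare path: compute the full diff between the set and the report.
        Set<String> missing = new HashSet<>(trackedRunning);
        missing.removeAll(reportedIds);
        return missing;
    }
}
```

In the common case only the counter increment and one integer comparison are added per heartbeat; the set copy and removal run only when something was actually dropped from the report.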

bq. This will slowup the cleanup in case if we preempt AM container, but may be 
more cleaner.

It won't slow down how quickly the container is killed, if that's what you mean 
by "cleanup."  Only the NM can kill it anyway, and it won't know to do so 
until it subsequently heartbeats.  It will slow down how quickly the RM 
re-schedules the resources associated with the preempted container, since it 
will wait until the NM confirms the container completion before releasing the 
resources within the scheduler bookkeeping and re-allocating them.  Put 
differently, today the RM can, and does, accidentally overcommit nodes because 
it considers the resources free before they actually are free.  Filed YARN-5290 
as we've recently seen this in practice on some of our clusters.
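
As a rough sketch of the bookkeeping difference, the toy ledger below keeps a preempted container charged against the node until its completion is reported. Everything here (class, method names, memory-only accounting in MB) is a hypothetical simplification and not how the YARN schedulers are actually structured.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of per-node scheduler bookkeeping.
final class NodeLedgerSketch {
    private final int capacityMb;
    // Containers still charged against the node, keyed by container ID.
    private final Map<String, Integer> held = new HashMap<>();
    // Containers asked to die but not yet confirmed complete by the node.
    private final Set<String> pendingKill = new HashSet<>();

    NodeLedgerSketch(int capacityMb) {
        this.capacityMb = capacityMb;
    }

    int availableMb() {
        int used = 0;
        for (int mb : held.values()) {
            used += mb;
        }
        return capacityMb - used;
    }

    /** Allocates only if the node has room according to the ledger. */
    boolean allocate(String id, int mb) {
        if (mb > availableMb()) {
            return false;
        }
        held.put(id, mb);
        return true;
    }

    /** Preemption only records intent; the memory stays charged. */
    void markPreempted(String id) {
        if (held.containsKey(id)) {
            pendingKill.add(id);
        }
    }

    /** The node confirmed completion, so the memory is really free now. */
    void onCompletionReported(String id) {
        pendingKill.remove(id);
        held.remove(id);
    }
}
```

With an 8192 MB node holding two 4096 MB containers, marking one as preempted does not let a third 4096 MB allocation through; it succeeds only after the completion is reported. Releasing the memory inside `markPreempted` instead would model the overcommit described above.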

> Handle duplicate completed containers in RMNodeImpl
> ---------------------------------------------------
>
>                 Key: YARN-4862
>                 URL: https://issues.apache.org/jira/browse/YARN-4862
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>         Attachments: 0001-YARN-4862.patch, 0002-YARN-4862.patch
>
>
> As per 
> [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689]
>  from [~sharadag], there should be a safeguard against duplicated container 
> statuses in RMNodeImpl before creating UpdatedContainerInfo. 
> Otherwise, in a heavily loaded cluster where event processing gradually slows 
> down, if any duplicated containers are sent to the RM (possibly due to a bug 
> in the NM), there is a significant impact: RMNodeImpl always creates 
> UpdatedContainerInfo for the duplicated containers. This increases heap memory 
> usage and causes problems like YARN-4852.
> This is an optimization for issues like YARN-4852.



