[ https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204712#comment-15204712 ]
Sunil G commented on YARN-4685:
-------------------------------
I agree with your point, [~rohithsharma].
We have one {{blacklistManager}} per {{RMAppAttempt}}, so to operate on a
{{blacklistManager}} we have to pass its reference to the scheduler. Suppose we
take your second approach. On each heartbeat call, we check for a pending AM
container resource request. For such a request, we recompute the blacklist
threshold in {{blacklistManager}} if needed (that is, if nodes were added or
removed recently). If the threshold has changed, we remove the blacklist for
this ResourceRequest.
However, this requires changing a lot of interface API signatures. A common
BlackListManager that keeps track of blacklist information for all apps would
have been cleaner.
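
To make the idea concrete, here is a minimal sketch of such a common manager
with the per-heartbeat check. It is not existing YARN code: every name in it
({{CommonBlacklistSketch}}, {{markAmPending}}, {{onHeartbeat}}, and so on) is
hypothetical, and the rule that the threshold is a fixed fraction of the node
count is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical shared manager: one instance tracks blacklists for all attempts.
class CommonBlacklistSketch {
    private final Map<String, Set<String>> blacklistByAttempt = new HashMap<>();
    private final Set<String> attemptsWithPendingAm = new HashSet<>();
    private final double disableThreshold; // assumed: fraction of the node count
    private int lastThreshold;

    CommonBlacklistSketch(int nodeCount, double disableThreshold) {
        this.disableThreshold = disableThreshold;
        this.lastThreshold = (int) (nodeCount * disableThreshold);
    }

    void blacklist(String attemptId, String host) {
        blacklistByAttempt.computeIfAbsent(attemptId, k -> new HashSet<>()).add(host);
    }

    // The attempt still waits for its AM container.
    void markAmPending(String attemptId) {
        attemptsWithPendingAm.add(attemptId);
    }

    void amContainerAllocated(String attemptId) {
        attemptsWithPendingAm.remove(attemptId);
    }

    // Called on each heartbeat with the current node count. If the threshold
    // changed since the last call, the blacklist of every attempt with a
    // pending AM request is dropped. Returns the attempts that were cleared.
    List<String> onHeartbeat(int currentNodeCount) {
        List<String> cleared = new ArrayList<>();
        int threshold = (int) (currentNodeCount * disableThreshold);
        if (threshold == lastThreshold) {
            return cleared; // no nodes added or removed that matter
        }
        lastThreshold = threshold;
        for (String attemptId : attemptsWithPendingAm) {
            Set<String> hosts = blacklistByAttempt.get(attemptId);
            if (hosts != null && !hosts.isEmpty()) {
                hosts.clear();
                cleared.add(attemptId);
            }
        }
        return cleared;
    }

    // Returns a copy so callers cannot mutate internal state.
    Set<String> blacklistFor(String attemptId) {
        Set<String> hosts = blacklistByAttempt.get(attemptId);
        return hosts == null ? new HashSet<String>() : new HashSet<String>(hosts);
    }
}

class CommonBlacklistDemo {
    public static void main(String[] args) {
        CommonBlacklistSketch mgr = new CommonBlacklistSketch(10, 0.8);
        mgr.markAmPending("attempt_1");
        mgr.blacklist("attempt_1", "host1");
        System.out.println(mgr.onHeartbeat(10)); // node count unchanged
        System.out.println(mgr.onHeartbeat(5));  // node count changed
    }
}
```

With a single shared object like this, the scheduler would need only one
reference instead of one per attempt, which is the interface churn I would like
to avoid.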
> AM blacklisting result in application to get hanged
> ---------------------------------------------------
>
> Key: YARN-4685
> URL: https://issues.apache.org/jira/browse/YARN-4685
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
>
> The AM blacklist is updated (additions or removals) only when the
> RMAppAttempt is scheduled, i.e. in
> {{RMAppAttemptImpl#ScheduleTransition#transition}}. Once the attempt is
> scheduled, any removeNode/addNode in the cluster is not propagated to
> {{BlackListManager#refreshNodeHostCount}}. This leaves BlackListManager
> operating on a stale NM count, so the application stays in the ACCEPTED state
> and waits forever, even if we add more nodes to the cluster.
> The solution is to update BlacklistManager on every
> {{RMAppAttemptImpl#AMContainerAllocatedTransition#transition}} call. This
> ensures that any addition or removal of nodes is reflected in
> BlacklistManager.
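
The effect of a stale node count described above can be illustrated with a
small, simplified model. This is only a sketch, not the actual YARN class:
{{SketchBlacklistManager}} and {{isBlacklistEnforced}} are hypothetical names,
and the rule that blacklisting is enforced only while the blacklist is smaller
than a fixed fraction of the node count is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of a per-attempt blacklist manager.
class SketchBlacklistManager {
    private final Set<String> blacklistedHosts = new HashSet<>();
    private final double disableThreshold; // assumed: fraction of the node count
    private int nodeHostCount;

    SketchBlacklistManager(int nodeHostCount, double disableThreshold) {
        this.nodeHostCount = nodeHostCount;
        this.disableThreshold = disableThreshold;
    }

    void addNode(String host) {
        blacklistedHosts.add(host);
    }

    // Plays the role of refreshNodeHostCount in the description.
    void refreshNodeHostCount(int currentCount) {
        nodeHostCount = currentCount;
    }

    // Assumed rule: the blacklist is enforced only while it is smaller than
    // nodeHostCount * disableThreshold (truncated to an int).
    boolean isBlacklistEnforced() {
        int limit = (int) (nodeHostCount * disableThreshold);
        return blacklistedHosts.size() < limit;
    }
}

class StaleNodeCountDemo {
    public static void main(String[] args) {
        SketchBlacklistManager m = new SketchBlacklistManager(10, 0.8);
        m.addNode("host1");
        m.addNode("host2");
        m.addNode("host3");

        // The cluster shrinks to 3 nodes, but the manager is not told:
        // it still compares 3 blacklisted hosts against a limit of 8.
        System.out.println(m.isBlacklistEnforced()); // true (stale count)

        // After a refresh the limit becomes (int) (3 * 0.8) = 2.
        m.refreshNodeHostCount(3);
        System.out.println(m.isBlacklistEnforced()); // false
    }
}
```

In this model, the decision changes only once {{refreshNodeHostCount}} is
called with the current count, which is why the refresh has to happen after the
attempt is scheduled as well.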
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)