[
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128533#comment-15128533
]
Junping Du commented on YARN-4635:
----------------------------------
bq. If we can launch an AM container at some later point of time after the
first failure, we can remove that node immediately from global blacklisting.
In most cases the AM container won't get a chance to launch on this node again,
because the blacklist mechanism already prevents it from being allocated there.
The corner case is two AM containers launched on the node at the same time, one
failing and the other succeeding. IMO, the successful one shouldn't purge the
node from the blacklist as if it were a normal node, because a failure marked as
globally affecting, such as DISK_FAILURE, could still hit upcoming AM
containers. In other words, launching an AM on this node is still riskier, and
that isn't changed by another AM container finishing successfully. We can
discuss how to purge nodes from the global list (time based, event based such as
NM reconnect, etc.) in YARN-4637, a dedicated JIRA I filed earlier.
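To make the distinction concrete, here is a minimal sketch, assuming
DISKS_FAILED is the representative node-level exit status (hypothetical helper
class, not part of the patch; the exact set of globally-affecting statuses is an
assumption):
{code:java}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;

public final class AMFailureClassifierSketch {

  /**
   * True if the AM exit status indicates a node-level problem, so a later AM
   * success on the same node should not purge it from the global blacklist.
   */
  public static boolean isNodeLevelFailure(int exitStatus) {
    switch (exitStatus) {
      case ContainerExitStatus.DISKS_FAILED:
        return true;                  // the node's disks are bad for everyone
      case ContainerExitStatus.ABORTED:
      case ContainerExitStatus.PREEMPTED:
      case ContainerExitStatus.KILLED_BY_RESOURCEMANAGER:
        return false;                 // app- or scheduler-level reasons
      default:
        return false;                 // treat unknown reasons as app-specific
    }
  }

  private AMFailureClassifierSketch() {
  }
}
{code}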
bq. I think SimpleBlacklistManager#refreshNodeHostCount can pre-compute the
failure threshold along with updating numberOfNodeManagerHosts, so whoever
invokes getBlacklistUpdates need not compute it every time. This is a minor
suggestion on the existing code.
Sounds good. Updated in v2 patch.
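For reference, the suggestion boils down to something like the sketch below
(simplified; the real SimpleBlacklistManager implements BlacklistManager and
returns a BlacklistUpdates object, and field names may differ from the actual
class):
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimpleBlacklistManagerSketch {

  private final double blacklistDisableFailureThreshold;
  private final Set<String> blacklistNodes = new HashSet<>();

  // Pre-computed whenever the host count changes, so callers of
  // getBlacklistAdditions() never repeat the multiplication.
  private double currentFailureThreshold;

  public SimpleBlacklistManagerSketch(int numberOfNodeManagerHosts,
      double blacklistDisableFailureThreshold) {
    this.blacklistDisableFailureThreshold = blacklistDisableFailureThreshold;
    refreshNodeHostCount(numberOfNodeManagerHosts);
  }

  public void refreshNodeHostCount(int nodeHostCount) {
    // Compute the threshold once here instead of on every read.
    this.currentFailureThreshold =
        blacklistDisableFailureThreshold * nodeHostCount;
  }

  public void addNode(String node) {
    blacklistNodes.add(node);
  }

  public List<String> getBlacklistAdditions() {
    // Back off (blacklist nothing) once too large a fraction of the cluster
    // would be blacklisted.
    if (blacklistNodes.size() < currentFailureThreshold) {
      return new ArrayList<>(blacklistNodes);
    }
    return Collections.emptyList();
  }
}
{code}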
bq. There are chances of duplicates between the global and per-app level
blacklists, correct? So could we use a Set here? One possibility: one AM
container fails due to ABORTED and is added to the per-app level blacklist,
then a second attempt fails due to DISK_FAILED and is added to the global one.
Now this would be a duplicate scenario. Thoughts?
Nice catch! The same app with different attempts won't cause this duplicate
issue. The possible duplicate scenario is: one app's AM fails on this node for
a reason like ABORTED, while at the same time another app's AM fails on the
same node for DISK_FAILURE; then the node could appear on both lists. Fixed
this issue in the v2 patch.
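The fix is essentially to merge through a Set, roughly like this hypothetical
helper (illustrative only, not the patch code):
{code:java}
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class BlacklistMergeSketch {

  /** Merges per-app and global additions; duplicates collapse in the Set. */
  public static List<String> mergeAdditions(List<String> perAppAdditions,
      List<String> globalAdditions) {
    Set<String> merged = new LinkedHashSet<>(perAppAdditions);
    merged.addAll(globalAdditions);
    return new ArrayList<>(merged);
  }

  private BlacklistMergeSketch() {
  }
}
{code}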
There is another issue: the threshold control in BlacklistManager is applied to
the two lists (global and per-app) separately, so it is possible that the two
lists together could unexpectedly blacklist all nodes. We need a thread-safe
merge operation across the two BlacklistManagers to address this. I marked a
TODO item in the patch and will file a separate JIRA to fix it.
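Roughly, the idea for that follow-up is something like the sketch below
(illustrative class and method names, not in the current patch): apply the
threshold to the union of the two lists under one lock.
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class CombinedBlacklistCheckSketch {

  /**
   * Applies the disable-failure threshold to the union of the global and
   * per-app blacklists so the two lists combined cannot cover the whole
   * cluster. Synchronized as a stand-in for whatever locking the real merge
   * between the two BlacklistManagers would need.
   */
  public static synchronized List<String> boundedAdditions(
      Set<String> globalBlacklist, Set<String> perAppBlacklist,
      int numberOfNodeManagerHosts, double disableFailureThreshold) {
    Set<String> union = new HashSet<>(globalBlacklist);
    union.addAll(perAppBlacklist);
    double threshold = disableFailureThreshold * numberOfNodeManagerHosts;
    if (union.size() < threshold) {
      return new ArrayList<>(union);
    }
    // Too much of the cluster would be blacklisted; back off entirely.
    return Collections.emptyList();
  }

  private CombinedBlacklistCheckSketch() {
  }
}
{code}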
> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
> Key: YARN-4635
> URL: https://issues.apache.org/jira/browse/YARN-4635
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Attachments: YARN-4635.patch
>
>
> We need a global blacklist, in addition to each app's blacklist, to track AM
> container failures that have global impact. That means we need to
> differentiate the non-success ContainerExitStatus reasons caused by the NM
> from those more related to the app.
> For more details, please refer to the document in YARN-4576.