[
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128389#comment-15128389
]
Sunil G commented on YARN-4635:
-------------------------------
Hi [~djp]
Thanks for sharing the patch so quickly. Overall it looks fine to me.
A few points:
1. The per-app blacklist manager does not need to handle removing a node from
its blacklist. But for the global blacklist manager, I think we need a
{{removeNode}} interface in {{BlacklistManager}}. If we can launch an AM
container successfully at some later point after the first failure, we can
immediately remove that node from the global blacklist. Maybe
{{RMAppAttemptImpl#checkStatusForNodeBlacklisting}} can check for success too.
(Or are we planning to handle this in the ticket that introduces a time-based
clearing mechanism?) Thoughts?
2. I think {{SimpleBlacklistManager#refreshNodeHostCount}} can also pre-compute
the failure threshold when it updates {{numberOfNodeManagerHosts}}, so that
each invocation of {{getBlacklistUpdates}} does not have to compute it again.
This is a minor suggestion for the existing code.
3.
{code}
+ // No thread safe problem as getBlacklistUpdates() in
+ // SimpleBlacklistManager do clone operation to blacklistNodes
+ List<String> amBlacklistAdditions = new ArrayList<String>();
{code}
There is a chance of duplicates between the global and per-app blacklists,
correct? So could we use a {{Set}} here? One possibility: the first AM
container fails with ABORTED and its node is added to the per-app blacklist,
then the second attempt fails on the same node with DISK_FAILED and the node
is added to the global blacklist. That would be a duplicate scenario.
Thoughts?
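
To make the three points concrete, here is a rough, self-contained sketch. It
is not code from the patch: {{GlobalBlacklistSketch}}, {{addNode}},
{{getBlacklistAdditions}} and {{mergeAdditions}} are hypothetical names, and the
rule "blacklisting stays active only while the blacklist size is below
threshold * host count" is an assumption made for illustration. It shows a
{{removeNode}} hook, a threshold pre-computed in {{refreshNodeHostCount}}, and
a Set-based merge of the per-app and global lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for a global blacklist manager.
// It is NOT the SimpleBlacklistManager from the patch.
class GlobalBlacklistSketch {
  private final Set<String> blacklistNodes = new HashSet<>();
  private final double disableFailureThreshold;
  // Point 2: computed once per host-count refresh, not on every query.
  private double failureThreshold;

  GlobalBlacklistSketch(int numberOfNodeManagerHosts,
      double disableFailureThreshold) {
    this.disableFailureThreshold = disableFailureThreshold;
    refreshNodeHostCount(numberOfNodeManagerHosts);
  }

  synchronized void refreshNodeHostCount(int numberOfNodeManagerHosts) {
    this.failureThreshold = disableFailureThreshold * numberOfNodeManagerHosts;
  }

  synchronized void addNode(String node) {
    blacklistNodes.add(node);
  }

  // Point 1: lets the caller clear a node once an AM container succeeds on it.
  synchronized void removeNode(String node) {
    blacklistNodes.remove(node);
  }

  // Returns a copy of the blacklist, or an empty list when too many hosts
  // are blacklisted (assumed disable rule, see the note above the code).
  synchronized List<String> getBlacklistAdditions() {
    if (blacklistNodes.size() < failureThreshold) {
      return new ArrayList<>(blacklistNodes);
    }
    return Collections.emptyList();
  }

  // Point 3: merge per-app and global additions through a Set so a node
  // present in both lists is reported only once.
  static List<String> mergeAdditions(List<String> perApp,
      List<String> global) {
    Set<String> merged = new LinkedHashSet<>(perApp);
    merged.addAll(global);
    return new ArrayList<>(merged);
  }
}
```

With a per-app list of host1 and host2 and a global list of host2 and host3,
{{mergeAdditions}} returns three entries instead of four. A
{{LinkedHashSet}} is used only to keep the per-app ordering stable; a plain
{{HashSet}} would work as well if ordering does not matter.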
> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
> Key: YARN-4635
> URL: https://issues.apache.org/jira/browse/YARN-4635
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Attachments: YARN-4635.patch
>
>
> We need a global blacklist in addition to each app’s blacklist to track AM
> container failures that have a global effect. That means we need to
> distinguish whether a non-successful ContainerExitStatus is caused by the
> NM or is more related to the app.
> For more details, please refer to the document in YARN-4576.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)