[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128389#comment-15128389
 ] 

Sunil G commented on YARN-4635:
-------------------------------

Hi [~djp]
Thanks for sharing the patch so quickly. Overall it looks fine to me.

A few points:
1. The per-app blacklist manager does not need to handle removing a node 
from its blacklist. But for the global blacklist manager, I think we need a 
{{removeNode}} interface in {{BlacklistManager}}. If an AM container launches 
successfully on a node at some later point after the first failure, we can 
remove that node from the global blacklist immediately. Maybe 
{{RMAppAttemptImpl#checkStatusForNodeBlacklisting}} can check for success too 
(or are we planning to handle this in the ticket where we come up with a 
time-based clearing mechanism?). Thoughts?
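
A rough sketch of the idea, using simplified stand-in types rather than the actual YARN classes. The names {{SketchBlacklistManager}}, {{SketchGlobalBlacklistManager}}, {{AmContainerStatusSketch}} and {{onAmContainerResult}} are hypothetical and only illustrate where a {{removeNode}} call could fit:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for BlacklistManager (not the real YARN interface).
interface SketchBlacklistManager {
  void addNode(String node);

  // The proposed addition: drop a node from the blacklist again.
  void removeNode(String node);

  List<String> getBlacklistedNodes();
}

// Assumed global manager that keeps blacklisted hosts in a set.
class SketchGlobalBlacklistManager implements SketchBlacklistManager {
  private final Set<String> blacklistNodes = new HashSet<>();

  @Override
  public synchronized void addNode(String node) {
    blacklistNodes.add(node);
  }

  @Override
  public synchronized void removeNode(String node) {
    blacklistNodes.remove(node);
  }

  @Override
  public synchronized List<String> getBlacklistedNodes() {
    // Return a copy so callers never see concurrent modifications.
    return new ArrayList<>(blacklistNodes);
  }
}

// Hypothetical hook standing in for a status check that also looks at success.
class AmContainerStatusSketch {
  static void onAmContainerResult(SketchBlacklistManager global, String host,
      boolean launchedSuccessfully, boolean nodeRelatedFailure) {
    if (launchedSuccessfully) {
      // An AM container ran on this host again, so clear it right away.
      global.removeNode(host);
    } else if (nodeRelatedFailure) {
      global.addNode(host);
    }
  }
}
```

Whether this belongs in this patch or in the time-based clearing ticket is still the open question above.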

2. I think {{SimpleBlacklistManager#refreshNodeHostCount}} can also pre-compute 
the failure threshold when it updates {{numberOfNodeManagerHosts}}, so that 
each call to {{getBlacklistUpdates}} does not have to recompute it. 
This is a minor suggestion about the existing code.
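
Something along these lines, as a simplified sketch and not the actual {{SimpleBlacklistManager}} code. The class name {{ThresholdCachingBlacklistSketch}}, the {{failureThreshold}} field and the {{isBlacklistActive}} method are assumptions made for illustration, as is the rule that the blacklist applies only while its size stays below the threshold:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: cache the failure threshold whenever the host count changes.
class ThresholdCachingBlacklistSketch {
  // Assumed to be a fraction of the cluster, e.g. 0.5 means half of the hosts.
  private final double blacklistDisableFailureThreshold;
  private final Set<String> blacklistNodes = new HashSet<>();
  private int numberOfNodeManagerHosts;
  // Pre-computed in refreshNodeHostCount, not on every lookup.
  private double failureThreshold;

  ThresholdCachingBlacklistSketch(int nodeHostCount, double disableFailureThreshold) {
    this.blacklistDisableFailureThreshold = disableFailureThreshold;
    this.numberOfNodeManagerHosts = nodeHostCount;
    this.failureThreshold = disableFailureThreshold * nodeHostCount;
  }

  synchronized void refreshNodeHostCount(int nodeHostCount) {
    this.numberOfNodeManagerHosts = nodeHostCount;
    // The threshold only depends on the host count, so update it here once.
    this.failureThreshold = blacklistDisableFailureThreshold * nodeHostCount;
  }

  synchronized void addNode(String node) {
    blacklistNodes.add(node);
  }

  // Callers just compare against the cached value.
  synchronized boolean isBlacklistActive() {
    return blacklistNodes.size() < failureThreshold;
  }
}
```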

3.
{code}
+        // No thread safe problem as getBlacklistUpdates() in
+        // SimpleBlacklistManager do clone operation to blacklistNodes
+        List<String> amBlacklistAdditions = new ArrayList<String>();
{code}

There is a chance of duplicates between the global and per-app blacklists, 
correct? If so, could we use a Set here? One possibility: one AM container 
fails with ABORTED and its node is added to the per-app blacklist, then a 
second attempt fails on the same node with DISK_FAILED and the node is added 
to the global blacklist. That produces a duplicate entry. 
Thoughts?
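
For illustration only, a minimal sketch of merging the two lists through a Set. {{BlacklistMergeSketch}} and {{mergeAdditions}} are hypothetical names, not code from the patch:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: combine per-app and global blacklist additions without duplicates.
class BlacklistMergeSketch {
  static List<String> mergeAdditions(List<String> perAppAdditions,
      List<String> globalAdditions) {
    // LinkedHashSet drops duplicates and keeps first-seen order.
    Set<String> merged = new LinkedHashSet<>(perAppAdditions);
    merged.addAll(globalAdditions);
    return new ArrayList<>(merged);
  }
}
```

With this, a host that appears in both blacklists shows up only once in the merged additions.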

 


> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
>                 Key: YARN-4635
>                 URL: https://issues.apache.org/jira/browse/YARN-4635
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4635.patch
>
>
> We need a global blacklist in addition to each app’s blacklist to track AM 
> container failures that have a global effect. That means we need to 
> differentiate whether a non-successful ContainerExitStatus is caused by the 
> NM or is more related to the app. 
> For more details, please refer to the document in YARN-4576.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
