[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128533#comment-15128533 ]

Junping Du commented on YARN-4635:
----------------------------------

bq. If we can launch an AM container at some later point in time after the 
first failure, we can remove that node immediately from global blacklisting.
In most cases an AM container won't get a chance to launch again on this node, 
because the blacklist mechanism already prevents it from being allocated there. 
However, there is a corner case: two AM containers get launched at the same 
time, one fails but the other succeeds. IMO, the successful completion 
shouldn't purge the node from the blacklist as if it were a normal node, 
because a failure marked as globally affecting, like DISK_FAILURE, could still 
happen to future AM containers. In other words, launching an AM on this node is 
still risky, and that risk is not changed by another AM container finishing 
successfully. We can discuss more about purging nodes from the global list 
(time based, event based such as NM reconnect, etc.) in the dedicated JIRA 
YARN-4637 that I filed before.

bq. I think SimpleBlacklistManager#refreshNodeHostCount can pre-compute the 
failure threshold along with updating numberOfNodeManagerHosts. That way, 
whoever invokes getBlacklistUpdates need not compute it every time. This is a 
minor suggestion on the existing code.
Sounds good. Updated in the v2 patch.
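
For illustration only, here is a minimal standalone sketch of the idea (this is 
not the actual Hadoop class; the field names, constructor, and threshold math 
are simplified assumptions): the threshold is pre-computed whenever the host 
count is refreshed, so getBlacklistUpdates only compares against it.

{code:java}
import java.util.HashSet;
import java.util.Set;

public class SimpleBlacklistManagerSketch {
  // Fraction of the cluster that may be blacklisted before blacklisting is
  // effectively disabled; illustrative value, not the real config key.
  private final double blacklistDisableFailureThreshold;

  private int numberOfNodeManagerHosts;
  private int currentFailureThreshold;   // pre-computed on every refresh
  private final Set<String> blacklistedNodes = new HashSet<>();

  public SimpleBlacklistManagerSketch(int nodeManagerHosts, double threshold) {
    this.blacklistDisableFailureThreshold = threshold;
    refreshNodeHostCount(nodeManagerHosts);
  }

  public synchronized void refreshNodeHostCount(int nodeManagerHosts) {
    this.numberOfNodeManagerHosts = nodeManagerHosts;
    // Pre-compute the threshold here, along with updating the host count,
    // so getBlacklistUpdates() does not recompute it on every invocation.
    this.currentFailureThreshold =
        (int) (blacklistDisableFailureThreshold * numberOfNodeManagerHosts);
  }

  public synchronized void addNode(String node) {
    blacklistedNodes.add(node);
  }

  public synchronized Set<String> getBlacklistUpdates() {
    // Callers just compare against the pre-computed threshold.
    if (blacklistedNodes.size() < currentFailureThreshold) {
      return new HashSet<>(blacklistedNodes);
    }
    // Too much of the cluster would be blacklisted; return nothing.
    return new HashSet<>();
  }
}
{code}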

bq. There are chances of duplicates between the global and per-app level 
blacklists, correct? So could we use a Set here? One possibility: one AM 
container fails due to ABORTED and is added to the per-app level blacklist, 
then the second attempt fails due to DISK_FAILED and is added to the global 
one. Now this will be a duplicate scenario. Thoughts?
Nice catch! The same app with different attempts won't cause this duplication. 
The possible duplicate scenario is: one app's AM fails on this node for a 
reason like ABORTED, while at the same time another app's AM fails on the same 
node due to DISK_FAILURE; the node would then appear on both lists. Fixed this 
issue in the v2 patch.
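
A minimal sketch of the de-duplication idea, assuming a hypothetical 
mergeAdditions helper rather than the actual RM code paths: merging the global 
and per-app additions through a Set keeps only one copy of a node that failed 
for different reasons on the two lists.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BlacklistMergeSketch {
  // Merge node additions from the global and per-app blacklists, dropping
  // duplicates (a node may appear on both lists for different exit statuses).
  public static List<String> mergeAdditions(List<String> globalAdditions,
                                            List<String> perAppAdditions) {
    Set<String> merged = new HashSet<>(globalAdditions);
    merged.addAll(perAppAdditions);
    return new ArrayList<>(merged);
  }

  public static void main(String[] args) {
    List<String> global = Arrays.asList("nodeA:45454", "nodeB:45454");
    List<String> perApp = Arrays.asList("nodeB:45454", "nodeC:45454");
    // Prints three unique nodes, not four entries.
    System.out.println(mergeAdditions(global, perApp));
  }
}
{code}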

There is another issue: the threshold control in BlacklistManager is applied to 
the two lists (global and per-app) separately, so it is possible that the two 
lists together could unexpectedly blacklist all nodes. We need a thread-safe 
merge operation across the two BlacklistManagers to address this problem. I 
marked a TODO item in the patch and will file a separate JIRA to fix it.
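
For that combined-threshold problem, a hedged sketch under assumed names 
(CombinedBlacklistSketch and its methods are hypothetical, not what the patch 
will contain) of applying one threshold to the union of both lists, with the 
size check and the union done under a single lock:

{code:java}
import java.util.HashSet;
import java.util.Set;

public class CombinedBlacklistSketch {
  private final Set<String> globalBlacklist = new HashSet<>();
  private final Set<String> perAppBlacklist = new HashSet<>();
  private final double disableThreshold;   // e.g. 0.8 of the cluster
  private int clusterNodeCount;

  public CombinedBlacklistSketch(int clusterNodeCount, double disableThreshold) {
    this.clusterNodeCount = clusterNodeCount;
    this.disableThreshold = disableThreshold;
  }

  public synchronized void addGlobal(String node) { globalBlacklist.add(node); }
  public synchronized void addPerApp(String node) { perAppBlacklist.add(node); }

  public synchronized void refreshClusterNodeCount(int count) {
    this.clusterNodeCount = count;
  }

  // The size check and the union happen under one lock, so the threshold is
  // applied to the two lists together rather than to each list separately.
  public synchronized Set<String> getEffectiveBlacklist() {
    Set<String> merged = new HashSet<>(globalBlacklist);
    merged.addAll(perAppBlacklist);
    int maxBlacklisted = (int) (disableThreshold * clusterNodeCount);
    if (merged.size() >= maxBlacklisted) {
      // Together the lists would cover too much of the cluster; back off and
      // return an empty blacklist instead of leaving no nodes to schedule on.
      return new HashSet<>();
    }
    return merged;
  }
}
{code}

Keeping both the union and the threshold comparison inside one synchronized 
method is what makes the check consistent when the two lists are updated 
concurrently.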

> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
>                 Key: YARN-4635
>                 URL: https://issues.apache.org/jira/browse/YARN-4635
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4635.patch
>
>
> We need a global blacklist in addition to each app's blacklist to track AM 
> container failures that have global effect. That means we need to 
> differentiate whether a non-succeeded ContainerExitStatus is caused by the 
> NM (node) or is more related to the app itself.
> For more details, please refer to the document in YARN-4576.
