[ 
https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217225#comment-15217225
 ] 

Rohith Sharma K S commented on YARN-4685:
-----------------------------------------

Some points raised in an offline discussion with [~sunilg] and [~vvasudev]:
# The default value of the maximum threshold is 0.8. This should be reduced 
to 0.1 (10%) or 0.2 (20%). As Vinod 
[commented|https://issues.apache.org/jira/browse/YARN-4685?focusedCommentId=15201117&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15201117]
 previously in this JIRA, in a real production cluster, blacklisting 80% of the 
nodes for one app is very likely to cause problems if 20% of the nodes are 
always busy.
# Once an attempt is scheduled, there is no way to notify the scheduler of 
blacklist additions or removals. The existing *allocate* API is already used to 
update the blacklisted nodes for the AM, but reusing it to push AM blacklist 
additions and removals from RMAppAttempt is risky: allocate returns an 
Allocation object, so many RMAppAttempt state machine transitions would need to 
handle it, and many race conditions would appear. Instead, the scheduler is 
updated by triggering an update event from RMAppAttempt for the AM blacklisted 
nodes. This keeps the YarnScheduler interface compatible.
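
To make the threshold argument in the first point concrete, here is a minimal sketch of the arithmetic. It is a simplified model, not the actual YARN code: the class {{BlacklistThresholdSketch}}, its method names, and the rule that the blacklist is honored only while the number of blacklisted hosts is below floor(nodeCount * threshold) are all assumptions made for illustration.

```java
// Simplified, hypothetical model of a threshold-based AM blacklist.
// This is NOT the YARN implementation; names and the exact rule are assumptions.
final class BlacklistThresholdSketch {

    // Assumed rule: once this many hosts are blacklisted, the blacklist is ignored.
    static int blacklistLimit(int nodeCount, double threshold) {
        return (int) (nodeCount * threshold);
    }

    // True while the blacklist is still honored for the application.
    static boolean blacklistEnforced(int blacklistedHosts, int nodeCount, double threshold) {
        return blacklistedHosts < blacklistLimit(nodeCount, threshold);
    }

    // Worst case: fewest hosts left for the AM while the blacklist is still honored.
    static int minUsableHosts(int nodeCount, double threshold) {
        int limit = blacklistLimit(nodeCount, threshold);
        int maxEnforcedBlacklist = Math.max(limit - 1, 0);
        return nodeCount - maxEnforcedBlacklist;
    }

    public static void main(String[] args) {
        int nodeCount = 10;
        for (double threshold : new double[] {0.8, 0.2, 0.1}) {
            System.out.println(String.format(
                "threshold=%.1f limit=%d minUsableHosts=%d",
                threshold,
                blacklistLimit(nodeCount, threshold),
                minUsableHosts(nodeCount, threshold)));
        }
    }
}
```

Under these assumptions, a high threshold leaves very few hosts for the AM before blacklisting is disabled, so a handful of permanently busy nodes is enough to starve the attempt. A lower threshold disables blacklisting much earlier.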

> AM blacklisting result in application to get hanged
> ---------------------------------------------------
>
>                 Key: YARN-4685
>                 URL: https://issues.apache.org/jira/browse/YARN-4685
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>
> AM blacklist additions and removals are applied only when the RMAppAttempt is 
> scheduled, i.e. in {{RMAppAttemptImpl#ScheduleTransition#transition}}. Once the 
> attempt is scheduled, any removeNode/addNode in the cluster is not 
> propagated to {{BlacklistManager#refreshNodeHostCount}}. As a result, 
> BlacklistManager operates on a stale NM count, and the application stays in 
> the ACCEPTED state and waits forever even if more nodes are added to the cluster.
> The solution is to update BlacklistManager on every 
> {{RMAppAttemptImpl#AMContainerAllocatedTransition#transition}} call. This 
> ensures that any addition or removal of nodes is reflected in 
> BlacklistManager.
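
The stale node count described above can be illustrated with a small model. This is a sketch under stated assumptions, not the YARN implementation: {{StaleNodeCountSketch}} is a hypothetical class, {{refreshNodeHostCount}} only mirrors the method name mentioned in the description, and the enforcement rule (blacklist honored while the blacklisted host count is below floor(nodeHostCount * threshold)) is assumed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a blacklist manager that caches the cluster node count.
// Not the YARN implementation; the enforcement rule below is an assumption.
final class StaleNodeCountSketch {
    private final double threshold;
    private final Set<String> blacklistedHosts = new HashSet<>();
    private int nodeHostCount;

    StaleNodeCountSketch(int nodeHostCount, double threshold) {
        this.nodeHostCount = nodeHostCount;
        this.threshold = threshold;
    }

    // Record a host on which the AM failed.
    void addNode(String host) {
        blacklistedHosts.add(host);
    }

    // Replace the cached node count with the current cluster size.
    void refreshNodeHostCount(int currentNodeHostCount) {
        this.nodeHostCount = currentNodeHostCount;
    }

    // Assumed rule: the blacklist is honored only below floor(count * threshold).
    boolean blacklistEnforced() {
        int limit = (int) (nodeHostCount * threshold);
        return !blacklistedHosts.isEmpty() && blacklistedHosts.size() < limit;
    }

    public static void main(String[] args) {
        // The count was captured when the attempt was scheduled: 10 nodes.
        StaleNodeCountSketch manager = new StaleNodeCountSketch(10, 0.8);

        // The cluster has since shrunk to 2 nodes, and the AM failed on both.
        manager.addNode("host1");
        manager.addNode("host2");
        System.out.println("stale count, enforced: " + manager.blacklistEnforced());

        // Refreshing the count lets the threshold check see the real cluster size.
        manager.refreshNodeHostCount(2);
        System.out.println("fresh count, enforced: " + manager.blacklistEnforced());
    }
}
```

In this model the stale count keeps every live host blacklisted, so the attempt has nowhere to run. Refreshing the count on each allocation attempt lets the threshold check disable the blacklist.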



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
