[jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged

Sunil G (JIRA) Mon, 21 Mar 2016 10:41:22 -0700

    [ https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204712#comment-15204712 ]


Sunil G commented on YARN-4685:
-------------------------------

I agree with your point, [~rohithsharma].

We have one {{blacklistManager}} per {{RMAppAttempt}}, so to operate on a 
{{blacklistManager}} we have to pass a reference to it to the scheduler. I am 
interested in your second approach. On each heartbeat call, we check for a 
pending AM container resource request. For such a request, we re-compute the 
blacklist threshold in {{blacklistManager}} if needed (that is, if nodes were 
recently added or removed). If the threshold has changed, we remove the 
blacklist for this ResourceRequest.
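
A rough sketch of how I read this heartbeat-time check. The class 
{{AmBlacklistSketch}} and its methods are hypothetical stand-ins, not the 
actual YARN classes or signatures, and the rule that the blacklist is dropped 
once the failed nodes reach a fraction of the cluster is an assumption made 
for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical per-attempt AM blacklist; illustrative only, not YARN code. */
class AmBlacklistSketch {
    // Assumed rule: the blacklist is ignored once the number of failed nodes
    // reaches this fraction of the current cluster size.
    private final double disableThreshold;
    private final Set<String> failedNodes = new LinkedHashSet<>();
    private int nodeHostCount;

    AmBlacklistSketch(double disableThreshold, int nodeHostCount) {
        this.disableThreshold = disableThreshold;
        this.nodeHostCount = nodeHostCount;
    }

    /** Record a node on which an AM container of this attempt failed. */
    void addNode(String host) {
        failedNodes.add(host);
    }

    /** Replace the cached cluster size with the current one. */
    void refreshNodeHostCount(int currentNodeCount) {
        this.nodeHostCount = currentNodeCount;
    }

    /** Nodes that should currently be excluded for the AM container. */
    Set<String> effectiveBlacklist() {
        double limit = disableThreshold * nodeHostCount;
        if (failedNodes.size() < limit) {
            return Collections.unmodifiableSet(new LinkedHashSet<>(failedNodes));
        }
        // Too large a share of the cluster is blacklisted: drop the blacklist
        // so the pending AM request can still be satisfied.
        return Collections.emptySet();
    }

    /**
     * Called on each node heartbeat. The cluster size is refreshed only while
     * an AM container request is still pending for this attempt.
     */
    Set<String> onHeartbeat(int currentNodeCount, boolean amRequestPending) {
        if (amRequestPending) {
            refreshNodeHostCount(currentNodeCount);
        }
        return effectiveBlacklist();
    }
}
```

With a threshold of 0.8 and one failed node, the blacklist stays active in a 
10-node cluster. If the cluster shrinks to that single node, the next 
heartbeat recomputes the limit and the blacklist is dropped, so the attempt 
does not wait on a stale node count.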

But this requires changing a lot of interface API signatures. A common 
BlackListManager that keeps track of the blacklist information for all apps 
would have been cleaner.

> AM blacklisting result in application to get hanged
> ---------------------------------------------------
>
>                 Key: YARN-4685
>                 URL: https://issues.apache.org/jira/browse/YARN-4685
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>
> AM blacklist additions and removals are updated only when the RMAppAttempt is 
> scheduled, i.e. in {{RMAppAttemptImpl#ScheduleTransition#transition}}. Once the 
> attempt is scheduled, any removeNode/addNode in the cluster is not 
> propagated to {{BlackListManager#refreshNodeHostCount}}. This leaves 
> BlackListManager operating on a stale NM count, so the application stays in 
> the ACCEPTED state and waits forever even if we add more nodes to the cluster.
> The solution is to update BlacklistManager on every 
> {{RMAppAttemptImpl#AMContainerAllocatedTransition#transition}} call. This 
> ensures that any addition or removal of nodes is propagated to 
> BlacklistManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
