[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157271#comment-14157271 ]

Craig Welch commented on YARN-1680:
-----------------------------------

[~john.jian.fang] I should probably not have referred to the cluster-level 
adjustments as "blacklisting".  What I see is a mechanism (state machine, 
events, including adding and removing nodes, and the "unhealthy" state / the 
health monitor) that, I think, ultimately results in the 
CapacityScheduler.addNode() and removeNode() calls, which modify the 
clusterResource value.  In any case, the blacklisting functionality we are 
addressing here definitely looks to be application specific and needs to be 
addressed at that level.  The issue isn't, so far as I know, related to any 
blacklisting/node-health handling outside the one in play here, as those paths 
should work properly for headroom because they adjust the cluster resource.  
The problem is that application-level blacklist activity does not adjust the 
cluster resource and was previously not involved in the headroom calculation.  
If cluster-level adjustments are not being made for problem nodes, then this 
blacklisting will result in duplicated effort among applications as they 
independently discover problems with nodes and blacklist them, but that is not 
a new characteristic of the way the system works.
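
To make the distinction concrete, here is a minimal Java sketch of the two 
calculations being contrasted.  It is illustrative only, not the actual patch: 
the Node and HeadroomSketch names are made up, memory is tracked in MB, and 
the numbers in main() mirror the scenario in the description below.

import java.util.List;
import java.util.Set;

// Illustrative sketch only, not YARN code.  Memory is tracked in MB.
class Node {
    final String id;
    final long capacityMB;   // total memory on the node
    final long allocatedMB;  // memory already allocated on the node

    Node(String id, long capacityMB, long allocatedMB) {
        this.id = id;
        this.capacityMB = capacityMB;
        this.allocatedMB = allocatedMB;
    }

    long freeMB() {
        return capacityMB - allocatedMB;
    }
}

public class HeadroomSketch {

    // Cluster-level view: addNode()/removeNode()-style adjustments already
    // keep the total in step with removed or unhealthy nodes, so summing the
    // free memory of the nodes the scheduler knows about is enough.
    static long headroom(List<Node> cluster) {
        return cluster.stream().mapToLong(Node::freeMB).sum();
    }

    // Application-level view: per-application blacklisting does not change
    // the cluster resource, so the free memory sitting on blacklisted nodes
    // has to be subtracted explicitly for this application's headroom.
    static long headroomExcludingBlacklist(List<Node> cluster,
                                           Set<String> blacklisted) {
        return cluster.stream()
                .filter(n -> !blacklisted.contains(n.id))
                .mapToLong(Node::freeMB)
                .sum();
    }

    public static void main(String[] args) {
        // 4 NodeManagers x 8 GB, reducers hold 29 GB, NM-4 is blacklisted.
        List<Node> cluster = List.of(
                new Node("nm1", 8192, 8192),
                new Node("nm2", 8192, 8192),
                new Node("nm3", 8192, 8192),
                new Node("nm4", 8192, 5120));
        Set<String> blacklisted = Set.of("nm4");

        System.out.println(headroom(cluster));                                // 3072
        System.out.println(headroomExcludingBlacklist(cluster, blacklisted)); // 0
    }
}

The first printed value is what the AM sees today (cluster free memory); the 
second is what it would see if the free memory on its own blacklisted nodes 
were excluded.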

> availableResources sent to applicationMaster in heartbeat should exclude 
> blacklistedNodes free memory.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1680
>                 URL: https://issues.apache.org/jira/browse/YARN-1680
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.2.0, 2.3.0
>         Environment: SuSE 11 SP2 + Hadoop-2.3 
>            Reporter: Rohith
>            Assignee: Chen He
>         Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
> YARN-1680-v2.patch, YARN-1680.patch
>
>
> There are 4 NodeManagers with 8 GB each, so the total cluster capacity is 
> 32 GB. Cluster slow start is set to 1.
> A job is running and its reducer tasks occupy 29 GB of the cluster. One 
> NodeManager (NM-4) became unstable (3 map tasks got killed), so the 
> MRAppMaster blacklisted that NodeManager (NM-4). All reducer tasks are now 
> running in the cluster.
> The MRAppMaster does not preempt the reducers because, for the reducer 
> preemption calculation, the headroom still includes the blacklisted node's 
> memory. This makes the job hang forever: the ResourceManager does not assign 
> any new containers on blacklisted nodes, but the availableResources it 
> returns still reflects the whole cluster's free memory.
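
For reference, the arithmetic behind the hang (numbers taken from the 
description above; this is only a restatement of the report, not new 
analysis):

  32 GB total capacity - 29 GB held by reducers   = 3 GB reported headroom
  3 GB reported - 3 GB free on blacklisted NM-4   = 0 GB actually usable

Because the reported headroom stays above zero, the MRAppMaster assumes the 
failed maps can still be scheduled and never preempts a reducer; but the only 
free memory is on blacklisted NM-4, where the ResourceManager will not place 
the containers, so the job hangs.  With the blacklisted node's free memory 
excluded, the headroom would be 0 and reducer preemption would kick in.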



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
