[ https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated YARN-1996:
--------------------------
    Attachment: YARN-1996-2.patch

[~jira.shegalov], [~maysamyabandeh] and I identified the root cause of 
https://issues.apache.org/jira/browse/MAPREDUCE-6043 and came up with the 
updated patch to address that scenario.

MRAppMaster's RMContainerAllocator depends on the RM's CompletedContainers 
messages to make allocation requests. In some corner cases, when a node becomes 
unhealthy, CompletedContainers messages might be lost. The new patch makes sure 
the RM delivers CompletedContainers messages to the AM in the following scenarios:

* The NM delivers unhealthy and completed-container notifications to the RM in 
the same heartbeat.
* The NM becomes unhealthy first, then it restarts.
* The NM becomes unhealthy first, then it becomes healthy again.
* The NM becomes unhealthy first, then the RM asks it to reboot.
* The NM becomes unhealthy first, then it is decommissioned.
* The NM becomes unhealthy first, then the RM loses it.
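
To illustrate the dependency above, here is a minimal, hypothetical AM-side 
sketch built on the public AMRMClientAsync API (it is not the actual 
MRAppMaster/RMContainerAllocator code, and the resource/priority values are 
arbitrary): replacement containers are only requested from 
onContainersCompleted, so if the RM never delivers the completion for a 
container that died on an UNHEALTHY node, the allocator never reacts.
{code}
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

// Hypothetical AM callback handler: it only learns about finished containers
// through onContainersCompleted(), so a dropped CompletedContainers message
// from the RM leaves failed work unreplaced.
public class CompletionDrivenHandler implements AMRMClientAsync.CallbackHandler {

  private AMRMClientAsync<ContainerRequest> amRMClient; // injected after creation

  public void setClient(AMRMClientAsync<ContainerRequest> client) {
    this.amRMClient = client;
  }

  @Override
  public void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      if (status.getExitStatus() != ContainerExitStatus.SUCCESS) {
        // Ask for a replacement container for every failed/killed attempt.
        amRMClient.addContainerRequest(new ContainerRequest(
            Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));
      }
    }
  }

  @Override
  public void onContainersAllocated(List<Container> containers) {
    // Launch tasks on the newly allocated containers (omitted).
  }

  @Override public void onShutdownRequest() { }
  @Override public void onNodesUpdated(List<NodeReport> updatedNodes) { }
  @Override public float getProgress() { return 0.0f; }
  @Override public void onError(Throwable e) { }
}
{code}
The point is only that the AM's bookkeeping is driven entirely by these 
completion callbacks, which is why the patch focuses on making the RM keep 
delivering them for UNHEALTHY nodes.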

For work-preserving RM restart, an unhealthy NM will first be transitioned to 
the RUNNING state after the RM restarts, and then to the UNHEALTHY state. So if 
the RM restarts while it is draining unhealthy nodes, it should be able to 
continue draining them after the restart.
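
For reference, a minimal sketch of the RM recovery settings that scenario 
relies on, assuming work-preserving recovery is available in the Hadoop 
version in use (the property names are the stock recovery settings, not 
something added by this patch):
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative RM recovery configuration; the exact availability of
// work-preserving recovery depends on the Hadoop release.
public class RmRecoveryConfig {
  public static YarnConfiguration withRecovery() {
    YarnConfiguration conf = new YarnConfiguration();
    // Persist RM state so applications survive an RM restart.
    conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);
    // Keep running containers alive across the restart instead of killing them.
    conf.setBoolean("yarn.resourcemanager.work-preserving-recovery.enabled", true);
    // Any RMStateStore implementation works; ZooKeeper is common for HA setups.
    conf.set("yarn.resourcemanager.store.class",
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
    return conf;
  }
}
{code}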

Appreciate any input on this.



> Provide alternative policies for UNHEALTHY nodes.
> -------------------------------------------------
>
>                 Key: YARN-1996
>                 URL: https://issues.apache.org/jira/browse/YARN-1996
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, scheduler
>    Affects Versions: 2.4.0
>            Reporter: Gera Shegalov
>            Assignee: Gera Shegalov
>         Attachments: YARN-1996-2.patch, YARN-1996.v01.patch
>
>
> Currently, UNHEALTHY nodes can significantly prolong the execution of large, 
> expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade the cluster 
> health even further due to [positive 
> feedback|http://en.wikipedia.org/wiki/Positive_feedback]. The set of containers 
> that may have made the node unhealthy in the first place starts spreading 
> across the cluster, because the node is declared unusable and all its 
> containers are killed and rescheduled on different nodes.
> To mitigate this, we are experimenting with a patch that allows containers 
> already running on a node turning UNHEALTHY to complete (drain), while no new 
> containers can be assigned to it until it turns healthy again.
> This mechanism can also be used for graceful decommissioning of an NM. To this 
> end, we have to write a health script that can deterministically report 
> UNHEALTHY. For example:
> {code}
> # Report UNHEALTHY when the sentinel file passed as $1 exists; the NM treats
> # any health script output line starting with ERROR as an unhealthy report.
> if [ -e "$1" ] ; then
>   echo "ERROR Node decommissioning via health script hack"
> fi
> {code}
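> As a purely illustrative sketch of how such a script could be wired into an 
> NM (the script path and sentinel file below are arbitrary examples, not part 
> of the patch):
> {code}
> import org.apache.hadoop.yarn.conf.YarnConfiguration;
> 
> // Hypothetical NM wiring for the script above; the sentinel file is passed
> // to the script as $1 through the script.opts property.
> public class DrainHealthScriptConfig {
>   public static YarnConfiguration withHealthScript() {
>     YarnConfiguration conf = new YarnConfiguration();
>     conf.set("yarn.nodemanager.health-checker.script.path",
>         "/usr/local/bin/drain-health-check.sh");
>     conf.set("yarn.nodemanager.health-checker.script.opts",
>         "/tmp/decommission.me");
>     // How often the NM runs the health script.
>     conf.setLong("yarn.nodemanager.health-checker.interval-ms", 60000L);
>     return conf;
>   }
> }
> {code}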
> In the current version of the patch, the behavior is controlled by a boolean 
> property {{yarn.nodemanager.unhealthy.drain.containers}}. More versatile 
> policies are possible in future work. Currently, the health state of a node is 
> determined in a binary fashion based on the disk checker and the health 
> script's ERROR output. However, we could also interpret health script output 
> similarly to Java logging levels (one of which is ERROR), such as WARN and 
> FATAL. Each level could then be treated differently, e.g.,
> - FATAL: unusable, like today
> - ERROR: drain
> - WARN: halve the node capacity
> complemented with some equivalence rules such as 3 WARN messages == ERROR, 
> 2*ERROR == FATAL, etc.
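> To make the graded idea concrete, a purely hypothetical sketch (none of these 
> types exist in YARN) that maps the levels to actions and applies the example 
> equivalence rules:
> {code}
> import java.util.EnumMap;
> import java.util.Map;
> 
> // Hypothetical graded health policy: WARN halves capacity, ERROR drains,
> // FATAL marks the node unusable; 3 WARNs escalate to an ERROR and
> // 2 ERRORs escalate to a FATAL, per the example rules above.
> public class GradedHealthPolicy {
>   enum Level { WARN, ERROR, FATAL }
>   enum Action { HALVE_CAPACITY, DRAIN, MARK_UNUSABLE }
> 
>   private final Map<Level, Integer> counts = new EnumMap<>(Level.class);
> 
>   // Record one health report and return the action to take for the node.
>   public Action onReport(Level level) {
>     counts.merge(level, 1, Integer::sum);
>     if (counts.getOrDefault(Level.WARN, 0) >= 3) {   // 3 WARN == ERROR
>       counts.put(Level.WARN, 0);
>       counts.merge(Level.ERROR, 1, Integer::sum);
>     }
>     if (counts.getOrDefault(Level.ERROR, 0) >= 2) {  // 2 ERROR == FATAL
>       counts.put(Level.ERROR, 0);
>       counts.merge(Level.FATAL, 1, Integer::sum);
>     }
>     if (counts.getOrDefault(Level.FATAL, 0) > 0) {
>       return Action.MARK_UNUSABLE; // FATAL: unusable, like today
>     }
>     if (counts.getOrDefault(Level.ERROR, 0) > 0) {
>       return Action.DRAIN;         // ERROR: drain running containers
>     }
>     return Action.HALVE_CAPACITY;  // only WARN-level reports reach here
>   }
> }
> {code}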



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
