[jira] [Commented] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.

Hadoop QA (JIRA) Mon, 09 Feb 2015 18:14:09 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313379#comment-14313379
 ]


Hadoop QA commented on YARN-1996:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12683446/YARN-1996-2.patch
  against trunk revision af08425.

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

      {color:red}-1 javac{color}.  The applied patch generated 1153 javac 
compiler warnings (more than the trunk's current 1149 warnings).

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

                  
                  org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore
                  org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6565//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6565//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6565//console

This message is automatically generated.

> Provide alternative policies for UNHEALTHY nodes.
> -------------------------------------------------
>
>                 Key: YARN-1996
>                 URL: https://issues.apache.org/jira/browse/YARN-1996
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, scheduler
>    Affects Versions: 2.4.0
>            Reporter: Gera Shegalov
>            Assignee: Gera Shegalov
>         Attachments: YARN-1996-2.patch, YARN-1996.v01.patch
>
>
> Currently, UNHEALTHY nodes can significantly prolong the execution of large, 
> expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade cluster 
> health even further due to [positive 
> feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set 
> that might have caused the node to be deemed unhealthy in the first place 
> starts spreading across the cluster, because the current node is declared 
> unusable and all its containers are killed and rescheduled on different nodes.
> To mitigate this, we are experimenting with a patch that allows containers 
> already running on a node that turns UNHEALTHY to complete (drain), while no 
> new containers can be assigned to it until it turns healthy again.
> This mechanism can also be used for graceful decommissioning of an NM. To this 
> end, we have to write a health script such that it can deterministically 
> report UNHEALTHY. For example:
> {code}
> if [ -e "$1" ]; then
>   # The marker file passed as $1 exists: report the node as unhealthy.
>   echo "ERROR Node decommissioning via health script hack"
> fi
> {code}
> In the current version of the patch, the behavior is controlled by a boolean 
> property {{yarn.nodemanager.unhealthy.drain.containers}}. More versatile 
> policies are possible in future work. Currently, the health state of a 
> node is a binary value determined by the disk checker and the health script 
> ERROR outputs. However, we could also interpret health script output in terms 
> of Java logging levels (one of which is ERROR), such as WARN and FATAL. Each 
> level can then be treated differently, e.g.:
> - FATAL: unusable, as today
> - ERROR: drain
> - WARN: halve the node capacity
> These could be complemented with equivalence rules such as 3 WARN messages == 
> ERROR, 2*ERROR == FATAL, etc.
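
The level-based policy proposed at the end of the description can be sketched in shell as follows. This is only an illustration under stated assumptions: the function names `effective_level` and `policy_action`, the `HEALTHY` fallback, and the action labels are hypothetical and are not part of the attached patch or of any YARN API. The sketch applies the two equivalence rules quoted above (3 WARN == ERROR, 2 ERROR == FATAL) to message counts and maps the resulting level to the proposed action.

```shell
#!/bin/sh
# Hypothetical sketch, not taken from the YARN-1996 patch.
# Arguments: number of WARN, ERROR and FATAL lines seen in health script output.
effective_level() {
  warn=$1; error=$2; fatal=$3
  # Equivalence rule from the description: 3 WARN messages == 1 ERROR.
  error=$(( error + warn / 3 ))
  warn=$(( warn % 3 ))
  # Equivalence rule from the description: 2 ERROR messages == 1 FATAL.
  fatal=$(( fatal + error / 2 ))
  error=$(( error % 2 ))
  # Report the most severe level that remains after promotion.
  if [ "$fatal" -gt 0 ]; then
    echo FATAL
  elif [ "$error" -gt 0 ]; then
    echo ERROR
  elif [ "$warn" -gt 0 ]; then
    echo WARN
  else
    echo HEALTHY   # assumed label for "no messages at all"
  fi
}

# Map a level to the action proposed in the description (labels are assumed).
policy_action() {
  case "$1" in
    FATAL) echo "unusable" ;;
    ERROR) echo "drain" ;;
    WARN)  echo "halve-capacity" ;;
    *)     echo "none" ;;
  esac
}
```

For instance, `policy_action "$(effective_level 3 0 0)"` treats three WARN messages as one ERROR and therefore selects the drain action. Whether promotion should cascade (six WARN messages becoming FATAL, as it does here) is a design choice the description leaves open.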



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
