[ 
https://issues.apache.org/jira/browse/YARN-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484820#comment-15484820
 ] 

Allen Wittenauer commented on YARN-5567:
----------------------------------------

bq. would you prefer this be a config setting to choose the behavior?

The history of the health check script is interesting but long.  The short 
version: not trusting the exit code was one of the key lessons the ops team 
learned from the HOD experience. The script fails a lot more often than people 
realize, mainly because users do crazy things, especially on insecure systems.

This is one of those times where it's going to be extremely difficult to 
convince me otherwise.  I can't think of a reason to ever trust the exit code 
enough to bring down the NodeManager.  In this particular environment, the 
script can fail for many reasons that may be temporary or pointless.

Now it could be argued that those temporary failures should cause the NM to 
come down, but then you get into a race condition between heartbeats and actual 
issues.  HDFS worked around it by basically saying "it has to fail for X long". 
Ignoring the exit code avoids that problem because one can be sure that "ERROR 
-" really did come from the script.
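
As a rough illustration of that rule, here is a minimal sketch. It is not the 
actual NodeHealthScriptRunner code: the class name, the method name, and the 
exact "ERROR" prefix match are all assumptions made for the example. The node 
is reported unhealthy only when the script itself printed an error line; the 
exit code is accepted but deliberately ignored.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class HealthOutputCheck {
  // Assumed marker: an output line that begins with "ERROR".
  static final String ERROR_PREFIX = "ERROR";

  private HealthOutputCheck() {}

  /**
   * Returns true only when the script output contains a line starting
   * with ERROR_PREFIX. The exit code is a parameter purely to show that
   * it plays no part in the decision.
   */
  static boolean isNodeUnhealthy(int exitCode, String output) {
    if (output == null) {
      return false; // no output means nothing was reported by the script
    }
    return Arrays.stream(output.split("\n"))
        .anyMatch(line -> line.startsWith(ERROR_PREFIX));
  }
}
```

With this sketch, a script that dies with a non-zero exit code but prints no 
error line leaves the node healthy, while a script that prints an "ERROR" 
line marks it unhealthy regardless of how it exited.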

bq. Alternatively, would you be okay with standardizing on a specific error 
code for "detected bad Node" vs "bad script"?

If by error code you specifically mean the value the NM reports back to the RM, 
then yes, that makes sense.  It just can't fail the node.
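
One possible reading of that, sketched under stated assumptions: the report 
carries a cause that separates "bad node" from "bad script", but only a 
script-printed error line makes the node unhealthy. The class, the enum, and 
its values below are hypothetical names for the example, not an existing YARN 
API.

```java
/** Hypothetical report type for illustration; not part of YARN. */
final class HealthReport {
  /** Assumed causes: what the NM could tell the RM about the last run. */
  enum Cause { NONE, BAD_NODE, BAD_SCRIPT }

  final boolean healthy;
  final Cause cause;
  final String message;

  private HealthReport(boolean healthy, Cause cause, String message) {
    this.healthy = healthy;
    this.cause = cause;
    this.message = message;
  }

  /** Builds a report from one script run. */
  static HealthReport from(int exitCode, String output) {
    String out = (output == null) ? "" : output;
    for (String line : out.split("\n")) {
      if (line.startsWith("ERROR")) {
        // The script itself reported a problem: the node is unhealthy.
        return new HealthReport(false, Cause.BAD_NODE, line);
      }
    }
    if (exitCode != 0) {
      // The script misbehaved: report it, but keep the node healthy.
      return new HealthReport(true, Cause.BAD_SCRIPT,
          "health script exited with code " + exitCode);
    }
    return new HealthReport(true, Cause.NONE, "");
  }
}
```

Here a non-zero exit code surfaces as BAD_SCRIPT in the report while the node 
stays healthy, so operators can see the broken script without losing the NM.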

> Fix script exit code checking in NodeHealthScriptRunner#reportHealthStatus
> --------------------------------------------------------------------------
>
>                 Key: YARN-5567
>                 URL: https://issues.apache.org/jira/browse/YARN-5567
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0, 3.0.0-alpha1
>            Reporter: Yufei Gu
>            Assignee: Yufei Gu
>             Fix For: 3.0.0-alpha1
>
>         Attachments: YARN-5567.001.patch
>
>
> In case of FAILED_WITH_EXIT_CODE, health status should be false.
> {code}
>       case FAILED_WITH_EXIT_CODE:
>         setHealthStatus(true, "", now);
>         break;
> {code}
> should be 
> {code}
>       case FAILED_WITH_EXIT_CODE:
>         setHealthStatus(false, "", now);
>         break;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
