[ 
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15487878#comment-15487878
 ] 

Naganarasimha G R commented on YARN-5635:
-----------------------------------------

[~rchiang], unfortunately I reopened this jira and reworded it at almost the 
same time. Sorry, I was not aware that a new jira had been raised a little 
earlier than this, and thanks for closing it. You can go ahead and make it a 
subtask of YARN-5078.

Well, for your new approach, defining a new exit code almost sounds like an 
incompatible change for existing node health scripts. But is it required? 
Existing code treats any exit code other than zero as unsuccessful and reports 
it as {{HealthCheckerExitStatus.FAILED_WITH_EXIT_CODE}}, whereas 
{{HealthCheckerExitStatus.FAILED}} is reported when the output of the script 
has the {{"ERROR"}} string in it.
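
As a rough illustration of the behaviour described above, here is a minimal sketch. It is not the actual NodeManager code: the class name {{HealthScriptSketch}}, the method {{classify}} and the constant {{ERROR_MARKER}} are assumptions made for the example, and only the two failure statuses named in this comment are modelled.

```java
/** Hypothetical, simplified model of the exit-status handling described above. */
final class HealthScriptSketch {

    // Only the statuses mentioned in this comment, plus SUCCESS; others are omitted.
    enum ExitStatus { SUCCESS, FAILED_WITH_EXIT_CODE, FAILED }

    // Assumption: the marker string looked for in the script output.
    static final String ERROR_MARKER = "ERROR";

    /** Classifies one run of a health script from its exit code and captured output. */
    static ExitStatus classify(int exitCode, String output) {
        if (exitCode != 0) {
            // Any non-zero exit code is treated as an unsuccessful run.
            return ExitStatus.FAILED_WITH_EXIT_CODE;
        }
        if (output != null && output.contains(ERROR_MARKER)) {
            // The script exited with zero but its output carries the ERROR string.
            return ExitStatus.FAILED;
        }
        return ExitStatus.SUCCESS;
    }

    public static void main(String[] args) {
        System.out.println(classify(0, "all good"));
        System.out.println(classify(3, ""));
        System.out.println(classify(0, "ERROR disk full"));
    }
}
```

Under this model an existing script that exits with any non-zero code already ends up in a distinct status, which is why a newly reserved exit code may not be needed.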

So what we would want to address here is how to handle it better when the 
script output has errors or the script times out. In this case it would *not* 
be good to gracefully drain the NM directly; instead we should report that the 
health status could not be obtained properly from the NM through the script. 
Any thoughts on my earlier comment?
{code}
NM can inform Healthy/UnHealthy/HealthValidationError, and this can be sent 
across in the heartbeat to RM, and RM can capture the state of this NM as 
something other than Running and UnHealthy (a new state). This can be displayed 
in the WebUI and can also be queried using ./yarn node -list -state
{code}
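
The proposal quoted above could be sketched roughly as follows. Everything here is hypothetical: the enums {{HealthReport}} and {{RmNodeState}}, the new state name {{HEALTH_CHECK_FAILED}}, and both methods are made up for illustration and do not correspond to existing YARN classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a three-way health report mapped to an RM-side node state. */
final class NodeHealthStateSketch {

    // What the NM would send in its heartbeat under the proposal.
    enum HealthReport { HEALTHY, UNHEALTHY, HEALTH_VALIDATION_ERROR }

    // RM-side states; HEALTH_CHECK_FAILED stands in for the proposed new state.
    enum RmNodeState { RUNNING, UNHEALTHY, HEALTH_CHECK_FAILED }

    /** Maps the report carried by a heartbeat to the state the RM would record. */
    static RmNodeState onHeartbeat(HealthReport report) {
        switch (report) {
            case HEALTHY:
                return RmNodeState.RUNNING;
            case UNHEALTHY:
                return RmNodeState.UNHEALTHY;
            case HEALTH_VALIDATION_ERROR:
                // The node is neither drained nor marked unhealthy; it is only flagged.
                return RmNodeState.HEALTH_CHECK_FAILED;
            default:
                throw new IllegalArgumentException("Unknown report: " + report);
        }
    }

    /** Returns the sorted names of nodes whose last report maps to the wanted state. */
    static List<String> nodesInState(Map<String, HealthReport> lastReports,
                                     RmNodeState wanted) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, HealthReport> entry : lastReports.entrySet()) {
            if (onHeartbeat(entry.getValue()) == wanted) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

A state filter in the WebUI or the node listing command would then amount to a call such as {{nodesInState(reports, RmNodeState.HEALTH_CHECK_FAILED)}}, so an operator can find nodes with a broken health script without those nodes being taken out of service.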

> Better handling when bad script is configured as Node's HealthScript
> --------------------------------------------------------------------
>
>                 Key: YARN-5635
>                 URL: https://issues.apache.org/jira/browse/YARN-5635
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Allen Wittenauer
>            Assignee: Yufei Gu
>
> Earlier fix to YARN-5567 is reverted because it's not ideal to get the whole 
> cluster down because of a bad script. At the same time it's important to 
> report that the script configured as the node health script is erroneous, as 
> it might fail to detect bad health of a node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
