[
https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15487878#comment-15487878
]
Naganarasimha G R commented on YARN-5635:
-----------------------------------------
[~rchiang], unfortunately I reopened this jira and reworded it at almost the
same time. Sorry, I was not aware that a new jira had been raised a little
earlier than this, and thanks for closing it. You can go ahead and make it a
subtask of YARN-5078.
As for your new approach, defining a new exit code sounds almost like an
incompatible change for existing node health scripts. But is it required?
Existing code treats any exit code other than zero as unsuccessful and reports
it as {{HealthCheckerExitStatus.FAILED_WITH_EXIT_CODE}}, whereas
{{HealthCheckerExitStatus.FAILED}} is set when the output of the script has the
{{"ERROR"}} string in it.
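To make the distinction above concrete, here is a minimal sketch of how the two statuses could be told apart. It is not the Hadoop source: the class name, the {{classify}} and {{hasErrors}} helpers, and the rule that a line beginning with "ERROR" counts as an error are assumptions for illustration, and the timeout handling is simplified.

```java
import java.util.Arrays;

// Hypothetical sketch; only the status names come from the discussion above.
final class HealthScriptStatusSketch {
    enum Status { SUCCESS, TIMED_OUT, FAILED_WITH_EXIT_CODE, FAILED }

    // Assumed rule: the output "has errors" if any line begins with "ERROR".
    static boolean hasErrors(String output) {
        return Arrays.stream(output.split("\n"))
                     .anyMatch(line -> line.startsWith("ERROR"));
    }

    // Classify one run of the health script from its exit code, output and timeout flag.
    static Status classify(int exitCode, String output, boolean timedOut) {
        if (timedOut) {
            return Status.TIMED_OUT;
        }
        if (exitCode != 0) {
            // Any non-zero exit code is unsuccessful, whatever the output says.
            return Status.FAILED_WITH_EXIT_CODE;
        }
        // Exit code was zero: the script ran, so its output decides the result.
        return hasErrors(output) ? Status.FAILED : Status.SUCCESS;
    }
}
```

In this sketch a script with a syntax error would typically land in FAILED_WITH_EXIT_CODE, while a script that runs and prints an ERROR line lands in FAILED.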
So what we would want to address here is how to better handle the case where
the script output has errors or the script times out. In this case it would
*not* be good to gracefully drain the NM directly; instead we should report
that the status could not be obtained properly from the NM through the script.
Any thoughts on my earlier comment?
{code}
NM can report Healthy/UnHealthy/HealthValidationError, and this can be sent
in the heartbeat to the RM. The RM can then record the state of this NM as
something other than Running and UnHealthy (a new state). This can be displayed
in the WebUI and can also be queried using ./yarn node -list -state
{code}
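A rough sketch of what the RM side of this proposal might look like is below. Everything in it is hypothetical: the {{NodeHealthState}} enum, the {{onHeartbeat}} and {{nodesInState}} methods, and the in-memory map are illustrative names, not existing YARN APIs, and it leaves open which script outcomes should map to HealthValidationError.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the proposed three-way health state; not YARN code.
final class NodeHealthStateSketch {
    enum NodeHealthState { HEALTHY, UNHEALTHY, HEALTH_VALIDATION_ERROR }

    // RM-side view: node id -> last state reported in a heartbeat.
    // A TreeMap keeps listings in a stable, sorted order.
    private final Map<String, NodeHealthState> reported = new TreeMap<>();

    // Record the state carried by a node's heartbeat, replacing any earlier one.
    void onHeartbeat(String nodeId, NodeHealthState state) {
        reported.put(nodeId, state);
    }

    // Return the ids of all nodes whose last reported state equals 'wanted',
    // similar in spirit to filtering a node listing by state.
    List<String> nodesInState(NodeHealthState wanted) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, NodeHealthState> e : reported.entrySet()) {
            if (e.getValue() == wanted) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

The point of the separate state is that a node whose script could not be validated stays visible to operators without being treated the same as a node that is known to be unhealthy.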
> Better handling when bad script is configured as Node's HealthScript
> --------------------------------------------------------------------
>
> Key: YARN-5635
> URL: https://issues.apache.org/jira/browse/YARN-5635
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Allen Wittenauer
> Assignee: Yufei Gu
>
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the
> whole cluster down because of a bad script. At the same time, it's important
> to report that the script configured as the node health script is erroneous,
> as it might fail to detect bad health of a node.