[
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110356#comment-17110356
]
Craig Condit commented on YARN-9809:
------------------------------------
Since health check scripts are by nature different for every deployment, it
seems that neither the current behavior nor the proposed change makes sense in
all cases. However, I believe there is some middle ground. I propose we keep
the existing behavior as the default to avoid causing pain for existing users,
but add a configuration to opt in to a single synchronous execution of the
health check on startup, before the node registers with the RM (controlled via
a new *{{yarn.nodemanager.health-checker.preflight.enabled}}* boolean
configuration). It may also be desirable to kill the NM upon repeated failures
of this preflight check. We could add a new config
*{{yarn.nodemanager.health-checker.preflight.retries}}* to control the number
of retries before aborting the NM (or -1 to retry indefinitely).
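
To make the proposal concrete, here is a rough sketch of how the preflight
decision could work. It is not taken from any patch: the class, enum, and
method names are hypothetical, the two parameters simply stand in for the two
proposed config values, and I am assuming "retries" means additional attempts
after the first failed check. A real implementation would presumably also wait
between attempts, which is left out here.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the proposed preflight logic; not actual NodeManager code.
final class PreflightHealthCheck {

    /** Outcome of the preflight phase. */
    enum Result { SKIPPED, HEALTHY, ABORT }

    /**
     * @param enabled stands in for the proposed
     *                yarn.nodemanager.health-checker.preflight.enabled
     * @param retries stands in for the proposed
     *                yarn.nodemanager.health-checker.preflight.retries;
     *                -1 means retry indefinitely
     * @param check   runs the health check once and returns true if healthy
     */
    static Result run(boolean enabled, int retries, BooleanSupplier check) {
        if (!enabled) {
            // Default: keep the existing behavior and register immediately.
            return Result.SKIPPED;
        }
        int failures = 0;
        while (true) {
            if (check.getAsBoolean()) {
                return Result.HEALTHY; // safe to register with the RM
            }
            failures++;
            // Assumption: 'retries' counts attempts after the first failure,
            // so retries = 2 allows up to three checks in total.
            if (retries >= 0 && failures > retries) {
                return Result.ABORT; // caller would shut the NM down
            }
        }
    }
}
```

With the option disabled the method returns SKIPPED without running the check
at all, which is what keeps existing deployments unaffected.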
> NMs should supply a health status when registering with RM
> ----------------------------------------------------------
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Eric Badger
> Assignee: Eric Badger
> Priority: Major
>
> Currently, if an unhealthy NM registers with the RM, many containers can be
> scheduled on it before the first heartbeat. After the first heartbeat, the
> RM will mark the NM as unhealthy and kill all of those containers.