Craig Condit commented on YARN-9809:

Since health check scripts are by nature different for every deployment, 
neither the current behavior nor the proposed change makes sense in all 
cases. However, I believe there is a middle ground. I propose we keep the 
existing behavior as the default to avoid causing pain for existing users, but 
allow users to opt in to a single synchronous execution of the health check 
on startup, before node check-in (controlled via a new 
*{{yarn.nodemanager.health-checker.preflight.enabled}}* boolean configuration). 
It may also be desirable to kill the NM upon repeated failures of this 
preflight check. We could add a new config, 
*{{yarn.nodemanager.health-checker.preflight.retries}}*, to control the number 
of retries before aborting the NM (or -1 for infinite).
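
To make the proposed semantics concrete, here is a rough sketch of how the 
preflight decision could work. It is illustrative only and is not NodeManager 
code: the class PreflightSketch, the method runPreflight, and the use of a 
BooleanSupplier to stand in for the health check script are all hypothetical. 
It also assumes that "retries" counts additional attempts after the first 
failed check, which the proposal above leaves open.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the proposed preflight check; not actual YARN code. */
final class PreflightSketch {

    // Proposed configuration keys from the comment above (they do not exist yet).
    static final String ENABLED_KEY =
        "yarn.nodemanager.health-checker.preflight.enabled";
    static final String RETRIES_KEY =
        "yarn.nodemanager.health-checker.preflight.retries";

    /**
     * Decides whether the NM may go on to register with the RM.
     *
     * @param enabled     value of the proposed "enabled" flag
     * @param retries     extra attempts allowed after the first failure
     *                    (assumption), or a negative value for infinite
     * @param healthCheck stand-in for one synchronous run of the health script
     * @return true to proceed with registration, false to abort the NM
     */
    static boolean runPreflight(boolean enabled, int retries,
                                BooleanSupplier healthCheck) {
        if (!enabled) {
            // Default: keep the existing behavior, no check before check-in.
            return true;
        }
        int failures = 0;
        while (true) {
            if (healthCheck.getAsBoolean()) {
                return true; // healthy, safe to register
            }
            failures++;
            // A real implementation would likely wait between attempts.
            if (retries >= 0 && failures > retries) {
                return false; // retry budget exhausted, abort the NM
            }
        }
    }
}
```

Under these assumptions, runPreflight(true, 2, check) runs the check at most 
three times, and a retries value of -1 keeps retrying until the check passes.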

> NMs should supply a health status when registering with RM
> ----------------------------------------------------------
>                 Key: YARN-9809
>                 URL: https://issues.apache.org/jira/browse/YARN-9809
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
> Currently, if the NM is unhealthy when it registers with the RM, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.

This message was sent by Atlassian Jira

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org