Eric Badger commented on YARN-9809:

I can see pros and cons to both approaches. On the one hand, if the health 
check script fails to execute properly, that's not good and could indicate a 
real problem on the node. On the other hand, health check scripts are pretty 
dangerous, since an improperly written one can take out an entire cluster. If 
someone updates the script and it suddenly starts erroring out, the whole 
cluster goes unhealthy. Or the script could rely on querying a service and that 
service times out: the node is healthy, but the health check script returned an 
error. Unless you parse for specific error codes, you can no longer 
differentiate between the health check script failing internally and the 
script running successfully and reporting that the node is unhealthy. 
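
To make that distinction concrete, here is a rough sketch of what parsing for 
specific error codes could look like. It is not how the NM handles the script 
today. The class name, the enum, and the convention that exit code 1 means 
"node is unhealthy" are all assumptions made up for illustration.

```java
// Hypothetical sketch: none of these names exist in YARN.
final class HealthCheckOutcome {

    /** Three-way result, so a broken script is not confused with a bad node. */
    enum Outcome { NODE_HEALTHY, NODE_UNHEALTHY, SCRIPT_FAILED }

    // Assumed convention: the script exits with 1 only when it ran to
    // completion and decided the node is unhealthy.
    static final int EXIT_NODE_UNHEALTHY = 1;

    static Outcome classify(int exitCode, boolean timedOut) {
        if (timedOut) {
            // e.g. a service the script queries did not answer in time;
            // this says nothing about the node itself.
            return Outcome.SCRIPT_FAILED;
        }
        if (exitCode == 0) {
            return Outcome.NODE_HEALTHY;
        }
        if (exitCode == EXIT_NODE_UNHEALTHY) {
            return Outcome.NODE_UNHEALTHY;
        }
        // Any other exit code (syntax error, missing binary, crash) is
        // treated as the script failing internally.
        return Outcome.SCRIPT_FAILED;
    }

    public static void main(String[] args) {
        System.out.println(classify(0, false));
        System.out.println(classify(1, false));
        System.out.println(classify(127, false));
        System.out.println(classify(0, true));
    }
}
```

With a split like this, a bad script rollout would show up as SCRIPT_FAILED 
on every node rather than as a cluster full of unhealthy nodes. What to do 
with SCRIPT_FAILED is still a policy decision.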

Regardless, this discussion is outside the scope of this JIRA. That's an issue 
with how the health check script is handled, while this JIRA is just about 
providing a health status at NM startup.

> NMs should supply a health status when registering with RM
> ----------------------------------------------------------
>                 Key: YARN-9809
>                 URL: https://issues.apache.org/jira/browse/YARN-9809
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>         Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
> Currently, if an NM is unhealthy when it registers with the RM, the RM can 
> schedule many containers on it before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.
