[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921754#comment-16921754
 ] 

Eric Badger commented on YARN-9809:
-----------------------------------

bq. It is unlikely to determine unhealthy status until at least one container 
tried to run on the given node manager.
This scenario can happen when none of the local dirs are available because of 
bad disks, or when the health check script reports the node unhealthy for any 
other reason. For example, we have an optional offline file that can be set on 
the node to mark it as unhealthy. 
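
As a rough illustration of the offline-file idea, here is a minimal sketch of 
a health check script. It relies on the NM convention that a script output 
line beginning with ERROR marks the node unhealthy. The marker path, the 
function name, and passing local dirs as arguments are all assumptions made 
for the example; they are not taken from our actual script, and in a real 
deployment the local dir checks are not necessarily done by the script.

```python
import os
import sys

# Hypothetical marker path; where the offline file lives is an assumption.
OFFLINE_FILE = "/etc/hadoop/node.offline"


def health_report(local_dirs, offline_file=OFFLINE_FILE):
    """Return report lines; a line starting with ERROR marks the node unhealthy."""
    # An operator can mark the node unhealthy by creating the offline file.
    if os.path.exists(offline_file):
        return ["ERROR node marked offline by " + offline_file]
    # A dir counts as usable only if it exists and is writable.
    usable = [d for d in local_dirs if os.path.isdir(d) and os.access(d, os.W_OK)]
    if not usable:
        return ["ERROR no usable local dirs"]
    return ["OK %d usable local dirs" % len(usable)]


if __name__ == "__main__":
    # Local dirs come from the command line purely for illustration.
    for line in health_report(sys.argv[1:]):
        print(line)
```

Either condition is known to the node before it ever talks to the RM, which 
is why it could be reported at registration time.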

bq. How does health status field in registration heartbeat help?
If the node can register as unhealthy, the RM won't assign containers to it 
in the first place. There is currently a window between registration and the 
first node heartbeat during which the node appears to be healthy.

bq. If containers are getting killed, they are supposed to schedule else where. 
Do you observe any problem in rescheduling containers?
Yes, the containers will get rescheduled, but it is still wasteful to schedule 
containers to a node if we are just going to kill them shortly after. If this 
happens on many nodes at once, there are a lot of unnecessary container kills, 
which we can avoid by sending the health status of the node with the initial 
RM registration.
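
To make the gap concrete, here is a toy model of the window between 
registration and the first heartbeat. It is only a sketch: ToyRM, its methods, 
the round-robin placement, and the node names are invented for illustration 
and do not correspond to the real RM classes or scheduler.

```python
class ToyRM:
    """Toy model of the RM's view of nodes; not the YARN ResourceManager API."""

    def __init__(self, health_in_registration):
        self.health_in_registration = health_in_registration
        self.believed_healthy = {}  # node name -> health as the RM sees it
        self.containers = {}        # node name -> running container count
        self.killed = 0

    def register(self, node, healthy):
        # Without a health field in registration, the RM assumes healthy.
        seen = healthy if self.health_in_registration else True
        self.believed_healthy[node] = seen
        self.containers[node] = 0

    def schedule(self, count):
        # Place containers round-robin on nodes the RM believes are healthy.
        targets = sorted(n for n, ok in self.believed_healthy.items() if ok)
        if not targets:
            return 0
        for i in range(count):
            self.containers[targets[i % len(targets)]] += 1
        return count

    def heartbeat(self, node, healthy):
        # The heartbeat carries the real health; unhealthy nodes lose containers.
        self.believed_healthy[node] = healthy
        if not healthy:
            self.killed += self.containers[node]
            self.containers[node] = 0


def wasted_kills(health_in_registration):
    rm = ToyRM(health_in_registration)
    rm.register("nm1", healthy=False)
    rm.register("nm2", healthy=True)
    rm.schedule(10)  # scheduling happens before the first heartbeat
    rm.heartbeat("nm1", healthy=False)
    rm.heartbeat("nm2", healthy=True)
    return rm.killed
```

In this model, wasted_kills(False) returns 5 because half of the containers 
land on the unhealthy node and are killed at its first heartbeat, while 
wasted_kills(True) returns 0 because that node is never a scheduling target.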

> NMs should supply a health status when registering with RM
> ----------------------------------------------------------
>
>                 Key: YARN-9809
>                 URL: https://issues.apache.org/jira/browse/YARN-9809
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>
> Currently, if an unhealthy NM registers with the RM, it can have many 
> containers scheduled on it before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

