[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921754#comment-16921754 ]
Eric Badger commented on YARN-9809:
-----------------------------------
bq. It is unlikely to determine unhealthy status until at least one container
tried to run on the given node manager.
This scenario can happen when none of the local dirs are available because of
bad disks, or when the health check script reports the node as unhealthy for
any other reason. For example, we have an optional offline file that can be
set on the node to mark it as unhealthy.
bq. How does health status field in registration heartbeat help?
If the node can register as unhealthy then it won't ever have containers
assigned to it. There is currently a period of time between registration and
the first node heartbeat where the node appears to be healthy.
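To illustrate the intended behavior, here is a minimal toy sketch. The names
`NodeRegistration`, `ToyResourceManager` and `schedulableNodes` are
hypothetical and exist only for this example; they are not the real YARN
classes or RPC records. The point is just that if the health status is part of
the registration payload, the RM can exclude an unhealthy node from scheduling
before its first heartbeat ever arrives.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model (hypothetical, NOT the real YARN API) of an RM that records the
 * health status a NodeManager reports when it registers.
 */
public class RegistrationHealthSketch {

    /** Hypothetical registration payload: node id plus its initial health. */
    record NodeRegistration(String nodeId, boolean healthy, String healthReport) {}

    /** Hypothetical RM-side view of the nodes it may schedule onto. */
    static class ToyResourceManager {
        private final List<NodeRegistration> nodes = new ArrayList<>();

        void register(NodeRegistration reg) {
            nodes.add(reg);
        }

        /** Only nodes that reported healthy at registration are schedulable. */
        List<String> schedulableNodes() {
            List<String> result = new ArrayList<>();
            for (NodeRegistration reg : nodes) {
                if (reg.healthy()) {
                    result.add(reg.nodeId());
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        ToyResourceManager rm = new ToyResourceManager();
        rm.register(new NodeRegistration("nm1", true, ""));
        // e.g. every local dir is on a bad disk, so the health check fails
        rm.register(new NodeRegistration("nm2", false, "no usable local dirs"));
        System.out.println(rm.schedulableNodes()); // prints [nm1]
    }
}
```

Without the health flag at registration, `nm2` would look schedulable until
its first heartbeat, which is exactly the window described above.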
bq. If containers are getting killed, they are supposed to schedule else where.
Do you observe any problem in rescheduling containers?
Yes, the containers will get rescheduled, but it is still wasteful to schedule
containers to a node if we are just going to kill them shortly afterwards. If
this happens on many nodes at once, it causes a large number of unnecessary
container kills, which we can avoid by sending the node's health status with
the initial RM registration.
> NMs should supply a health status when registering with RM
> ----------------------------------------------------------
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Eric Badger
> Assignee: Eric Badger
> Priority: Major
>
> Currently if the NM registers with the RM and it is unhealthy, it can be
> scheduled many containers before the first heartbeat. After the first
> heartbeat, the RM will mark the NM as unhealthy and kill all of the
> containers.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)