[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108576#comment-17108576 ]
Jim Brennan commented on YARN-9809:
-----------------------------------
I would like to revive this discussion. We have this implemented internally.
Our health-check script runs very quickly, so the impact on the time it takes
to register with the RM is minimal (not really noticeable in our case). Our
health-check script does a number of checks to validate the health of the host
the NM is running on. We don't have any checks directly related to
success/failure of containers launching since the last check, but even if we
did, that particular check just wouldn't find anything if no containers have
been launched yet.
The cases that we were trying to address with this change involve hardware or
OS issues with the node that may not prevent a container from launching, but
are serious enough to mark the node as unhealthy (memory/disk errors, etc.).
We have seen this during a rolling upgrade (RU). Nodes that had previously
been marked as unhealthy would be brought up as part of the RU, and those nodes
would start running containers only to be marked unhealthy 10 minutes later
when the health-check script ran. This caused a lot of killed task attempts.
With large clusters there can be hundreds of nodes that are unhealthy, so
there can be a lot of failed task attempts.
It seems the main question that [~eyang] is raising is whether we should allow
a synchronous call to run the health-check script during NodeManager
startup/registration. I agree that this can introduce a potential slowdown if
the health-check script is slow. In our case, the delay is not noticeable,
and we think it is worth it to prevent the false start. What do others think?
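To make the trade-off concrete, here is a minimal sketch of what a bounded
synchronous check at startup could look like. It is not the internal
implementation mentioned above and not existing NodeManager code: the class
StartupHealthCheck, the method runOnce, and the Result type are hypothetical
names, and treating a script that cannot be started as unhealthy is an
assumption of this sketch. It follows the documented convention that a line
beginning with ERROR on the script's standard output marks the node as
unhealthy, and it caps the registration delay with a timeout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of the NodeManager code base.
public class StartupHealthCheck {

    /** Outcome of a single synchronous health-check run. */
    public static final class Result {
        public final boolean healthy;
        public final String report;

        Result(boolean healthy, String report) {
            this.healthy = healthy;
            this.report = report;
        }
    }

    /**
     * Runs the health-check command once and waits at most timeoutMs for it.
     * A timeout, a failure to run the script, or an output line starting
     * with "ERROR" yields an unhealthy result.
     */
    public static Result runOnce(List<String> command, long timeoutMs) {
        Path output = null;
        Process process = null;
        try {
            // Send output to a temp file so a hung script cannot block us on a pipe read.
            output = Files.createTempFile("nm-health", ".out");
            ProcessBuilder builder = new ProcessBuilder(command);
            builder.redirectErrorStream(true);
            builder.redirectOutput(output.toFile());
            process = builder.start();

            if (!process.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
                return new Result(false, "health-check script timed out");
            }
            for (String line : Files.readAllLines(output)) {
                if (line.startsWith("ERROR")) {
                    return new Result(false, line);
                }
            }
            return new Result(true, "");
        } catch (IOException e) {
            // Assumption of this sketch: a script that cannot run counts as unhealthy.
            return new Result(false, "could not run health-check script: " + e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new Result(false, "interrupted while waiting for health-check script");
        } finally {
            if (process != null) {
                process.destroyForcibly(); // no effect if the script already exited
            }
            if (output != null) {
                try {
                    Files.deleteIfExists(output);
                } catch (IOException ignored) {
                    // best-effort cleanup of the temp file
                }
            }
        }
    }
}
```

With something along these lines, the NM could call runOnce before building
its registration request and include the result in it, so the worst-case
added startup delay is the timeout rather than however long a slow script
takes.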
cc: [~ebadger], [~eyang], [~ccondit-target], [[email protected]]
> NMs should supply a health status when registering with RM
> ----------------------------------------------------------
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Eric Badger
> Assignee: Eric Badger
> Priority: Major
>
> Currently if the NM registers with the RM and it is unhealthy, it can be
> scheduled many containers before the first heartbeat. After the first
> heartbeat, the RM will mark the NM as unhealthy and kill all of the
> containers.