Eric Badger created YARN-10616:
----------------------------------
Summary: Nodemanagers cannot detect GPU failures
Key: YARN-10616
URL: https://issues.apache.org/jira/browse/YARN-10616
Project: Hadoop YARN
Issue Type: Bug
Reporter: Eric Badger
Assignee: Eric Badger
As stated in the summary, the bug is that GPUs can fail, but the NM doesn't notice the
failure. The NM will continue to schedule containers onto the failed GPU, but since the
GPU doesn't actually work, the container will likely fail or fall back to running very
slowly on the CPU.
My initial thought on solving this is to add the NM's resource capabilities to the
NM-RM heartbeat and have the RM update its view of the NM's resource
capabilities on each heartbeat. This would be a fairly trivial change, but it
comes with the unfortunate side effect that it completely undermines {{yarn
rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}}, the
assumption is that the node will retain the new resource capabilities until
either the NM or the RM is restarted. But with the heartbeat constantly
updating those resource capabilities from the NM's perspective, the explicit
changes made via {{-updateNodeResource}} would be lost on the next heartbeat. We
could potentially add a flag that ignores the heartbeat updates for any node that
has had {{-updateNodeResource}} called on it (until it re-registers). But in
that case, the node would no longer get resource capability updates until the
NM or RM restarted, so if {{-updateNodeResource}} is used with any frequency,
nodes would behave inconsistently with respect to auto-detecting GPU failures.
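To make the trade-off concrete, below is a rough sketch of what the RM-side bookkeeping might look like if we went this route, with a per-node flag recording whether {{-updateNodeResource}} has been applied. All class and method names here are hypothetical and for illustration only; this is not the actual heartbeat/RMNode code.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: class, field, and method names are made up,
// not the actual YARN heartbeat/RMNode APIs.
public class HypotheticalNodeResourceTracker {

  static class NodeState {
    int totalGpus;                 // RM's current view of the node's GPU capacity
    boolean adminOverride = false; // set once -updateNodeResource is applied
  }

  private final Map<String, NodeState> nodes = new ConcurrentHashMap<>();

  /** Called for every NM->RM heartbeat that reports the NM's self-detected capacity. */
  public void onHeartbeat(String nodeId, int reportedGpus) {
    NodeState state = nodes.computeIfAbsent(nodeId, k -> new NodeState());
    // If an admin explicitly set this node's resources, keep that value until
    // the NM re-registers; otherwise trust what the NM reports.
    if (!state.adminOverride) {
      state.totalGpus = reportedGpus;
    }
  }

  /** Called when 'yarn rmadmin -updateNodeResource' is issued for this node. */
  public void onAdminUpdate(String nodeId, int newGpus) {
    NodeState state = nodes.computeIfAbsent(nodeId, k -> new NodeState());
    state.totalGpus = newGpus;
    state.adminOverride = true;
  }

  /** Called on NM re-registration; the override is dropped and heartbeats win again. */
  public void onReRegistration(String nodeId) {
    nodes.computeIfAbsent(nodeId, k -> new NodeState()).adminOverride = false;
  }
}
{code}

The awkward part is exactly the {{adminOverride}} flag: once it is set, the node stops auto-detecting GPU failures until a re-registration, which is the inconsistency described above.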
Another idea is to add a GPU monitor thread on the NM that periodically runs
{{nvidia-smi}} and detects changes in the number of healthy GPUs. If that number
decreased, the NM would hook into its health check status and mark the node
unhealthy. The downside of this approach is that a single failed GPU would take
out an entire node (e.g. one with 8 GPUs).
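For this second approach, a rough sketch of the monitor thread is below. The {{nvidia-smi --query-gpu=index --format=csv,noheader}} invocation is a standard way to print one line per visible GPU; the {{reportUnhealthy}} method is purely a placeholder for however this would tie into the NM health checker, and the expected GPU count would presumably come from the NM's configured/advertised resources.

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Rough sketch of an NM-side GPU monitor thread. The nvidia-smi invocation is
// real; the health-report hook is a stand-in for the NM health-check plumbing.
public class GpuMonitor implements Runnable {

  private final int expectedGpus;
  private final long intervalMs;

  public GpuMonitor(int expectedGpus, long intervalMs) {
    this.expectedGpus = expectedGpus;
    this.intervalMs = intervalMs;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        int visible = countVisibleGpus();
        if (visible < expectedGpus) {
          reportUnhealthy("Only " + visible + " of " + expectedGpus
              + " GPUs visible to nvidia-smi");
        }
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } catch (Exception e) {
        // Treat a failed nvidia-smi invocation itself as a health problem.
        reportUnhealthy("nvidia-smi failed: " + e.getMessage());
      }
    }
  }

  /** Counts GPUs by asking nvidia-smi to print one index per line. */
  private int countVisibleGpus() throws Exception {
    Process p = new ProcessBuilder(
        "nvidia-smi", "--query-gpu=index", "--format=csv,noheader")
        .redirectErrorStream(true)
        .start();
    int count = 0;
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
      while (reader.readLine() != null) {
        count++;
      }
    }
    int exit = p.waitFor();
    if (exit != 0) {
      throw new RuntimeException("nvidia-smi exited with code " + exit);
    }
    return count;
  }

  private void reportUnhealthy(String reason) {
    // Placeholder: the real NM would surface this through its health-check
    // status (and ideally only for the bad GPU, not the whole node).
    System.err.println("NODE UNHEALTHY: " + reason);
  }
}
{code}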
I would really like to go with the NM-RM heartbeat approach, but the
{{-updateNodeResource}} issue bothers me. The second approach is okay, I guess,
but I also don't like taking down a whole GPU node when only a single GPU is
bad. I'd like to hear others' thoughts on how best to approach this:
[~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan]