Eric Badger created YARN-10616:
----------------------------------

             Summary: Nodemanagers cannot detect GPU failures
                 Key: YARN-10616
                 URL: https://issues.apache.org/jira/browse/YARN-10616
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Eric Badger
            Assignee: Eric Badger


As stated in the summary, the bug is that GPUs can fail, but the NM doesn't 
notice the failure. The NM will continue to schedule containers onto the failed 
GPU, but the GPU won't actually work, so the container will likely fail or fall 
back to running very slowly on the CPU. 

My initial thought on solving this is to add the NM's resource capabilities to 
the NM-RM heartbeat and have the RM update its view of the NM's resource 
capabilities on each heartbeat. This would be a fairly trivial change, but it 
comes with the unfortunate side effect that it completely undermines {{yarn 
rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}}, the 
assumption is that the node will retain the new resource capabilities until 
either the NM or RM is restarted. But with the heartbeat constantly updating 
those resource capabilities from the NM's perspective, the explicit changes 
made via {{-updateNodeResource}} would be lost on the next heartbeat. We could 
potentially add a flag to ignore the heartbeat updates for any node that has 
had {{-updateNodeResource}} called on it (until a re-registration). But in that 
case, the node would no longer get resource capability updates until the NM or 
RM restarted. If {{-updateNodeResource}} is used a decent amount, that leads to 
inconsistent behavior: most nodes would auto-detect GPU failures, while the 
overridden ones silently would not.
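
To make the heartbeat idea a bit more concrete, here is a rough sketch of what 
the RM-side bookkeeping might look like. All of the class and method names 
below ({{HeartbeatResourceUpdater}}, {{onUpdateNodeResource}}, etc.) are 
hypothetical placeholders, not existing YARN APIs; the "overridden" set stands 
in for the idea of pinning a node after {{-updateNodeResource}} until it 
re-registers.

{code:java}
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of RM-side handling of resource capabilities
 * reported on the NM-RM heartbeat. These types do not exist in YARN;
 * they stand in for the real RMNode / NodeHeartbeatRequest plumbing.
 */
public class HeartbeatResourceUpdater {

  /** Simplified stand-in for a node's total resource capability. */
  public static final class Resource {
    final int memoryMb;
    final int vcores;
    final int gpus;

    Resource(int memoryMb, int vcores, int gpus) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
      this.gpus = gpus;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Resource)) {
        return false;
      }
      Resource r = (Resource) o;
      return memoryMb == r.memoryMb && vcores == r.vcores && gpus == r.gpus;
    }

    @Override
    public int hashCode() {
      return Objects.hash(memoryMb, vcores, gpus);
    }
  }

  // Current RM view of each node's total capability, keyed by node id.
  private final Map<String, Resource> rmView = new ConcurrentHashMap<>();

  // Nodes pinned by `yarn rmadmin -updateNodeResource`; heartbeat-reported
  // capabilities are ignored for these until the NM re-registers.
  private final Set<String> adminOverridden = ConcurrentHashMap.newKeySet();

  /** Called when the admin explicitly sets a node's resources. */
  public void onUpdateNodeResource(String nodeId, Resource adminResource) {
    rmView.put(nodeId, adminResource);
    adminOverridden.add(nodeId);
  }

  /** Called when an NM re-registers; the admin pin is cleared. */
  public void onNodeReregistration(String nodeId, Resource registered) {
    adminOverridden.remove(nodeId);
    rmView.put(nodeId, registered);
  }

  /**
   * Called on every NM-RM heartbeat with the capability the NM reports
   * (e.g. after losing a GPU). Returns true if the RM view changed.
   */
  public boolean onHeartbeat(String nodeId, Resource reportedCapability) {
    if (adminOverridden.contains(nodeId)) {
      return false; // respect the explicit -updateNodeResource setting
    }
    Resource previous = rmView.put(nodeId, reportedCapability);
    return previous == null || !previous.equals(reportedCapability);
  }
}
{code}

The sketch also makes the downside visible: once a node lands in the overridden 
set, {{onHeartbeat}} ignores everything the NM reports until re-registration.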

Another idea is to add a GPU monitor thread on the NM that periodically runs 
{{nvidia-smi}} and detects changes in the number of healthy GPUs. If that 
number decreases, the monitor would hook into the health check status and mark 
the node as unhealthy. The downside of this approach is that a single failed 
GPU would mean taking out an entire node (e.g. one with 8 GPUs).
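
As a rough illustration of the second idea, the sketch below polls 
{{nvidia-smi -L}} from a background thread and flags the node when the number 
of visible GPUs drops below the count seen on the first poll. The class and the 
health-reporting callback are hypothetical; wiring this into the NM's actual 
health check machinery is the real work.

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical GPU monitor sketch: periodically counts the GPUs reported
 * by `nvidia-smi -L` and reports the node unhealthy if the count drops.
 * HealthReporter is a placeholder for the NM health-check hook.
 */
public class GpuHealthMonitor {

  public interface HealthReporter {
    void reportUnhealthy(String reason);
  }

  private final HealthReporter reporter;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private volatile int expectedGpuCount = -1;

  public GpuHealthMonitor(HealthReporter reporter) {
    this.reporter = reporter;
  }

  /** Start polling every intervalSeconds; the first poll sets the baseline. */
  public void start(long intervalSeconds) {
    scheduler.scheduleWithFixedDelay(this::checkGpus, 0,
        intervalSeconds, TimeUnit.SECONDS);
  }

  public void stop() {
    scheduler.shutdownNow();
  }

  private void checkGpus() {
    int count = countVisibleGpus();
    if (count < 0) {
      return; // nvidia-smi itself failed; could also be treated as unhealthy
    }
    if (expectedGpuCount < 0) {
      expectedGpuCount = count; // baseline from the first successful poll
    } else if (count < expectedGpuCount) {
      reporter.reportUnhealthy("GPUs visible dropped from "
          + expectedGpuCount + " to " + count);
    }
  }

  /** Returns the number of lines `nvidia-smi -L` prints, or -1 on error. */
  private int countVisibleGpus() {
    try {
      Process p = new ProcessBuilder("nvidia-smi", "-L").start();
      int lines = 0;
      try (BufferedReader r = new BufferedReader(
          new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
        while (r.readLine() != null) {
          lines++;
        }
      }
      return p.waitFor() == 0 ? lines : -1;
    } catch (Exception e) {
      return -1;
    }
  }
}
{code}

Note that this is exactly the coarse-grained behavior described above: the 
whole node goes unhealthy even if only one of its GPUs has failed.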

I would really like to go with the NM-RM heartbeat approach, but the 
{{-updateNodeResource}} issue bothers me. The second approach is okay, I guess, 
but I also don't like taking down whole GPU nodes when only a single GPU is 
bad. I would like to hear others' thoughts on how best to approach this.

[~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan]


