[ https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qi Zhu updated YARN-10616:
--------------------------
    Parent: YARN-10690
    Issue Type: Sub-task  (was: Bug)

> Nodemanagers cannot detect GPU failures
> ---------------------------------------
>
>                 Key: YARN-10616
>                 URL: https://issues.apache.org/jira/browse/YARN-10616
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>
> As stated above, the bug is that GPUs can fail, but the NM doesn't notice
> the failure. The NM will continue to schedule tasks onto the failed GPU,
> but the GPU won't actually work, so the container will likely fail or run
> very slowly on the CPU.
>
> My initial thought on solving this is to add NM resource capabilities to
> the NM-RM heartbeat and have the RM update its view of the NM's resource
> capabilities on each heartbeat. This would be a fairly trivial change, but
> it comes with the unfortunate side effect that it completely undermines
> {{yarn rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}},
> the assumption is that the node will retain the new resource capabilities
> until either the NM or RM is restarted. But with the heartbeat constantly
> updating those resource capabilities from the NM's perspective, the
> explicit changes made via {{-updateNodeResource}} would be lost on the
> next heartbeat. We could potentially add a flag to ignore the heartbeat
> updates for any node that has had {{-updateNodeResource}} called on it
> (until a re-registration), as in the first sketch below. But in that case,
> the node would no longer get resource capability updates until the NM or
> RM restarted. If {{-updateNodeResource}} is used a decent amount, that
> would lead to potentially unexpected behavior, since those nodes would
> stop auto-detecting failures.
>
> Another idea is to add a GPU monitor thread on the NM to periodically run
> {{nvidia-smi}} and detect changes in the number of healthy GPUs (see the
> second sketch below). If that number decreased, the node would hook into
> the health check status and mark itself as unhealthy. The downside of this
> approach is that a single failed GPU would mean taking out an entire node
> (e.g. a node with 8 GPUs).
>
> I would really like to go with the NM-RM heartbeat approach, but the
> {{-updateNodeResource}} issue bothers me. The second approach is ok, I
> guess, but I also don't like taking down whole GPU nodes when only a
> single GPU is bad. I would like to hear the thoughts of others on how best
> to approach this. [~jhung], [~leftnoteasy], [~sunilg], [~epayne],
> [~Jim_Brennan]
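>
> A rough sketch of the flag idea from the heartbeat approach. All names
> here are hypothetical stand-ins, not actual RM classes; the real change
> would live in the RM's node-tracking code:
>
> {code:java}
> import java.util.concurrent.atomic.AtomicBoolean;
>
> /**
>  * Hypothetical RM-side view of one node's resource capability.
>  * Heartbeat-reported capability is accepted unless an admin has
>  * explicitly overridden it via -updateNodeResource; the override
>  * sticks until the NM re-registers.
>  */
> public class NodeCapabilityTracker {
>
>   /** Simplified stand-in for YARN's Resource (memory MB, vcores, GPUs). */
>   public record Capability(long memoryMb, int vcores, int gpus) {}
>
>   private volatile Capability capability;
>   private final AtomicBoolean adminOverride = new AtomicBoolean(false);
>
>   public NodeCapabilityTracker(Capability registered) {
>     this.capability = registered;
>   }
>
>   /** Called on each NM-RM heartbeat with the NM's self-reported capability. */
>   public void onHeartbeat(Capability reported) {
>     // Ignore heartbeat updates while an admin override is active, so
>     // -updateNodeResource is not silently undone on the next heartbeat.
>     if (!adminOverride.get()) {
>       capability = reported;
>     }
>   }
>
>   /** Called when an admin runs -updateNodeResource on this node. */
>   public void onAdminUpdate(Capability updated) {
>     adminOverride.set(true);
>     capability = updated;
>   }
>
>   /** Called when the NM re-registers (NM or RM restart); override clears. */
>   public void onReregistration(Capability registered) {
>     adminOverride.set(false);
>     capability = registered;
>   }
>
>   public Capability getCapability() {
>     return capability;
>   }
> }
> {code}
>
> The downside described above falls straight out of this: once
> {{adminOverride}} is set, {{onHeartbeat}} drops GPU-failure updates until
> re-registration.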
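>
> And a minimal sketch of the second approach. Again the names are
> hypothetical, and the hook into the NM health check is an assumption
> (modeled here as a plain callback); only the {{nvidia-smi --list-gpus}}
> invocation is real:
>
> {code:java}
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.nio.charset.StandardCharsets;
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
> import java.util.function.Consumer;
>
> /**
>  * Hypothetical NM-side monitor that periodically counts GPUs via
>  * "nvidia-smi --list-gpus" and reports the node unhealthy if the
>  * count drops below the count seen at startup.
>  */
> public class GpuHealthMonitor {
>
>   private final int expectedGpus;
>   private final Consumer<String> markUnhealthy; // stand-in for NM health-check hook
>   private final ScheduledExecutorService scheduler =
>       Executors.newSingleThreadScheduledExecutor();
>
>   public GpuHealthMonitor(Consumer<String> markUnhealthy) throws Exception {
>     this.expectedGpus = countGpus();
>     this.markUnhealthy = markUnhealthy;
>   }
>
>   public void start(long periodSeconds) {
>     scheduler.scheduleAtFixedRate(() -> {
>       try {
>         int current = countGpus();
>         if (current < expectedGpus) {
>           markUnhealthy.accept(
>               "GPU count dropped from " + expectedGpus + " to " + current);
>         }
>       } catch (Exception e) {
>         markUnhealthy.accept("nvidia-smi failed: " + e.getMessage());
>       }
>     }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
>   }
>
>   /** Runs nvidia-smi, which prints one output line per visible GPU. */
>   private static int countGpus() throws Exception {
>     Process p = new ProcessBuilder("nvidia-smi", "--list-gpus").start();
>     int lines = 0;
>     try (BufferedReader r = new BufferedReader(
>         new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
>       while (r.readLine() != null) {
>         lines++;
>       }
>     }
>     if (p.waitFor() != 0) {
>       throw new IllegalStateException("nvidia-smi exited non-zero");
>     }
>     return lines;
>   }
> }
> {code}
>
> As written, this inherits the downside called out above: one bad GPU marks
> the whole node unhealthy, even if the other 7 are fine.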