[
https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303044#comment-17303044
]
Qi Zhu commented on YARN-10616:
-------------------------------
[~ebadger] [~ztang]
Actually, we could use the graceful-decommission approach to achieve this:
"We will use {{updateNodeResource}} to set the node resources to 0, meaning
that nothing will get scheduled on the node. But the NM will still be running
so that we can jstack or grab a heap dump."
I think we can implement the NM-RM heartbeat approach first, and then handle
the {{updateNodeResource}} interaction.
What do you advise?
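For example, here is a minimal sketch of the reconciliation I have in mind for
the second step (class and field names such as {{NodeCapabilityTracker}} and
{{manuallyOverridden}} are illustrative only, not existing YARN APIs): a node
whose resources were pinned by {{yarn rmadmin -updateNodeResource}} (e.g. set
to 0 as quoted above) keeps them until it re-registers, while every other node
tracks the NM-reported capability on each heartbeat.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch only: reconciling NM-heartbeat capability updates with explicit
 * "yarn rmadmin -updateNodeResource" overrides. All names are illustrative
 * and do not exist in YARN.
 */
public class NodeCapabilityTracker {

  /** Simplified stand-in for YARN's Resource (memory MB, vcores, GPUs). */
  public static final class Capability {
    final long memoryMb;
    final int vcores;
    final int gpus;
    Capability(long memoryMb, int vcores, int gpus) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
      this.gpus = gpus;
    }
  }

  private static final class NodeState {
    volatile Capability capability;
    volatile boolean manuallyOverridden; // set by -updateNodeResource
  }

  private final Map<String, NodeState> nodes = new ConcurrentHashMap<>();

  /** Admin ran -updateNodeResource: pin the capability until re-registration. */
  public void adminOverride(String nodeId, Capability cap) {
    NodeState s = nodes.computeIfAbsent(nodeId, k -> new NodeState());
    s.capability = cap;
    s.manuallyOverridden = true;
  }

  /** NM->RM heartbeat carrying the NM's auto-detected capability. */
  public void onHeartbeat(String nodeId, Capability reported) {
    NodeState s = nodes.computeIfAbsent(nodeId, k -> new NodeState());
    if (s.manuallyOverridden) {
      return; // the explicit override wins until the NM re-registers
    }
    s.capability = reported; // e.g. one fewer GPU after a failure
  }

  /** NM (re-)registration clears any admin override. */
  public void onRegistration(String nodeId, Capability reported) {
    NodeState s = nodes.computeIfAbsent(nodeId, k -> new NodeState());
    s.manuallyOverridden = false;
    s.capability = reported;
  }
}
{code}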
> Nodemanagers cannot detect GPU failures
> ---------------------------------------
>
> Key: YARN-10616
> URL: https://issues.apache.org/jira/browse/YARN-10616
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Eric Badger
> Assignee: Eric Badger
> Priority: Major
>
> As stated above, the bug is that GPUs can fail, but the NM doesn't notice the
> failure. The NM will continue to schedule tasks onto the failed GPU, but the
> GPU won't actually work and so the container will likely fail or run very
> slowly on the CPU.
> My initial thought on solving this is to add NM resource capabilities to the
> NM-RM heartbeat and have the RM update its view of the NM's resource
> capabilities on each heartbeat. This would be a fairly trivial change, but
> comes with the unfortunate side effect that it completely undermines {{yarn
> rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}} the
> assumption is that the node will retain these new resource capabilities until
> either the NM or RM is restarted. But with a heartbeat interaction constantly
> updating those resource capabilities from the NM perspective, the explicit
> changes via {{-updateNodeResource}} would be lost on the next heartbeat. We
> could potentially add a flag to ignore the heartbeat updates for any node that
> has had {{-updateNodeResource}} called on it (until a re-registration). But
> in this case, the node would no longer get resource capability updates until
> the NM or RM restarted. If {{-updateNodeResource}} is used frequently, this
> could lead to unexpected behavior, because those nodes would silently stop
> auto-detecting resource failures.
> Another idea is to add a GPU monitor thread on the NM to periodically run
> {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that
> number decreased, the node would hook into the health check status and mark
> itself as unhealthy (a rough sketch of such a monitor follows the quoted
> description below). The downside of this approach is that a single failed
> GPU would mean taking out an entire node (e.g. a node with 8 GPUs).
> I would really like to go with the NM-RM heartbeat approach, but the
> {{-updateNodeResource}} issue bothers me. The second approach is okay, I
> guess, but I also don't like taking down a whole GPU node when only a single
> GPU is bad. I would like to hear others' thoughts on how best to approach
> this. [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan]
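For illustration, here is a rough sketch of the GPU monitor thread idea from
the description above. It assumes {{nvidia-smi -L}} prints one line per
visible GPU; {{GpuHealthMonitor}} and the {{reportUnhealthy}} hook are
hypothetical placeholders for however the NM health-check status would
actually be set.

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch only: periodically count healthy GPUs via "nvidia-smi -L" and
 * flip the node unhealthy when the count drops below the expected number.
 */
public class GpuHealthMonitor {
  private final int expectedGpus;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public GpuHealthMonitor(int expectedGpus) {
    this.expectedGpus = expectedGpus;
  }

  public void start() {
    scheduler.scheduleAtFixedRate(this::checkGpus, 0, 60, TimeUnit.SECONDS);
  }

  private void checkGpus() {
    try {
      // "nvidia-smi -L" prints one "GPU <n>: ..." line per visible GPU.
      Process p = new ProcessBuilder("nvidia-smi", "-L").start();
      long visible;
      try (BufferedReader r = new BufferedReader(
          new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
        visible = r.lines().filter(l -> l.startsWith("GPU ")).count();
      }
      p.waitFor();
      if (visible < expectedGpus) {
        reportUnhealthy(visible);
      }
    } catch (Exception e) {
      // Treat a failed probe as a GPU subsystem problem.
      reportUnhealthy(0);
    }
  }

  /** Placeholder: would hook into the NM health-check status. */
  private void reportUnhealthy(long visible) {
    System.err.printf("Node unhealthy: %d of %d GPUs visible%n",
        visible, expectedGpus);
  }
}
{code}

In practice a check like this could also be wired into the existing NM health
script mechanism ({{yarn.nodemanager.health-checker.script.path}}) rather than
run as a new in-process thread.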