[
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037718#comment-14037718
]
Jason Lowe commented on YARN-1341:
----------------------------------
As I was stating earlier, I personally would rather have the NM try to keep
going. Restarts should be rare, and I'd rather not force a loss of work by
taking the NM down instantly when the state store hiccups. Like you said, it's
"cheaper" to lose an NM than an RM, which means the ramifications of not being
able to recover the work on a restart is cheaper than it is for an RM. If the
state store is missing some things, we might not be able to recover a localized
resource, a token, a container, or possibly anything at all. If we aren't able
to recover anything then the containers will fail to be reported to the RM when
the NM registers, and the RM can act accordingly (i.e.: notify the AM that the
containers are lost). Or in the worst-case, the state store is so corrupted on
startup that we don't even survive the NM restart and the NM crashes, which
would have an end result just like if we took it down when the state store
failed.
Therefore I'd rather not guarantee that we'll lose work by crashing the NM on
any store error and instead try to preserve the work we have. The NM could
theoretically recover (e.g.: if the error is transient then the next RM key
store could succeed). If we take the NM down immediately then we're
guaranteeing the work is lost. Is that really better?
Maybe a better approach is to have errors like this trigger an unhealthy state
for the NM when we have the ability to do a graceful decommission. Then the RM
can stop allocating new containers on the node but we can try to finish up the
containers that are still running on the node. (Maybe have a configurable
timeout where at some point where we will take down the NM even if there are
still active containers.) Then we can declare the node is acting strangely,
don't put any more load on it, but we won't destroy work on it instantly when
the error occurs. That is outside the scope of this JIRA, but it would allow
us to react to the error while still trying to avoid destroying work in the
process.
> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
> Key: YARN-1341
> URL: https://issues.apache.org/jira/browse/YARN-1341
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch,
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)