[
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047910#comment-14047910
]
Devaraj K commented on YARN-1341:
---------------------------------
Sorry for joining late.
+1 for limiting the implementation/discussion to the scope of this Jira's
title and handling the other cases in their respective Jiras.
In addition to option 1), I'd suggest bringing the NM down if it fails to
store the RM keys a configurable number of times in a row. We could also make
the teardown itself configurable, so that users can choose whether RM key
state store failures should bring the NM down at all.
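
To make the suggestion concrete, here is a minimal sketch of such a policy.
The class name, the method names and the two settings are hypothetical and
are not taken from the patch or from any existing YARN configuration:

```java
/**
 * Hypothetical sketch: counts consecutive RM key store failures and decides
 * whether the NM should be brought down. Not part of any YARN patch.
 */
final class StoreFailurePolicy {
    // Assumed config switch: whether store failures may bring the NM down.
    private final boolean shutdownEnabled;
    // Assumed config value: consecutive failures tolerated before shutdown.
    private final int maxConsecutiveFailures;
    private int consecutiveFailures = 0;

    StoreFailurePolicy(boolean shutdownEnabled, int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.shutdownEnabled = shutdownEnabled;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** A successful store resets the consecutive-failure counter. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    /** Records a failed store; returns true if the NM should shut down. */
    synchronized boolean recordFailure() {
        consecutiveFailures++;
        return shutdownEnabled
            && consecutiveFailures >= maxConsecutiveFailures;
    }
}
```

The caller would invoke recordSuccess() after each successful store and
recordFailure() from the IOException handler, starting the NM shutdown only
when the latter returns true.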
Similarly, for Container/Application state store failures, the NM can mark
that Container/Application as failed and report it to the RM. These can be
discussed in more detail in the corresponding Jiras, YARN-1337 and YARN-1354.
However, for all of these NM state store operations, we could consider
retrying before throwing the IOException.
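
As a rough illustration of the retry idea, assuming a hypothetical helper
(StoreRetry, StoreOp and maxAttempts are made-up names, not existing NM state
store APIs), something like:

```java
import java.io.IOException;

/** Hypothetical sketch: retries a store operation before giving up. */
final class StoreRetry {
    /** Stand-in for any NM state store write that may throw IOException. */
    @FunctionalInterface
    interface StoreOp {
        void run() throws IOException;
    }

    /**
     * Runs op up to maxAttempts times and rethrows the last IOException
     * only if every attempt fails.
     */
    static void runWithRetries(StoreOp op, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.run();
                return; // stored successfully, no further attempts needed
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

A real implementation would probably also want a delay between attempts; it
is left out here to keep the sketch short.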
Thoughts?
> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
> Key: YARN-1341
> URL: https://issues.apache.org/jira/browse/YARN-1341
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch,
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>