[ 
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047910#comment-14047910
 ] 

Devaraj K commented on YARN-1341:
---------------------------------

Sorry for coming late here. 

+1 for limiting the implementation/discussion to the scope of the Jira title 
and handling the other cases in their respective Jiras.

In addition to option 1), I'd suggest bringing the NM down if it fails to store 
RM keys a certain (configurable) number of times consecutively. We could also 
make this behavior itself configurable (i.e. whether or not to tear down the 
NM), so users can choose whether RM key state store failures should bring the 
NM down.
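
A minimal sketch of the consecutive-failure idea, not actual YARN code. The 
class name StoreFailureTracker and its two constructor parameters (an on/off 
switch and a failure threshold) are hypothetical stand-ins for whatever 
configuration keys would eventually be chosen:

```java
import java.io.IOException;

/**
 * Hypothetical helper (not part of YARN): counts consecutive state store
 * failures and tells the caller when the NM should be brought down.
 */
class StoreFailureTracker {
  // Assumed config: whether store failures may bring the NM down at all.
  private final boolean shutdownEnabled;
  // Assumed config: consecutive failures tolerated before shutdown.
  private final int maxConsecutiveFailures;
  private int consecutiveFailures = 0;

  StoreFailureTracker(boolean shutdownEnabled, int maxConsecutiveFailures) {
    if (maxConsecutiveFailures < 1) {
      throw new IllegalArgumentException("threshold must be >= 1");
    }
    this.shutdownEnabled = shutdownEnabled;
    this.maxConsecutiveFailures = maxConsecutiveFailures;
  }

  /** A successful store resets the consecutive failure count. */
  synchronized void recordSuccess() {
    consecutiveFailures = 0;
  }

  /**
   * Records a failed store.
   * @return true if the caller should tear down the NM
   */
  synchronized boolean recordFailure(IOException cause) {
    consecutiveFailures++;
    return shutdownEnabled && consecutiveFailures >= maxConsecutiveFailures;
  }

  synchronized int getConsecutiveFailures() {
    return consecutiveFailures;
  }
}
```

The caller would invoke recordSuccess() after each successful key store and 
recordFailure() from the IOException handler, starting the shutdown only when 
the latter returns true.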

Similarly, for Container/Application state store failures, the NM can mark that 
Container/Application as failed and report it to the RM. These can be 
discussed in more detail in the corresponding Jiras, YARN-1337 and YARN-1354. 
However, for all of these NM state store operations, we could think of having 
retries before throwing the IOException. 
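
The retry idea could be sketched roughly as follows. This is an illustration 
only: StoreRetry, StoreOp, and the maxAttempts/sleepMs parameters are 
hypothetical names, not existing NM state store APIs:

```java
import java.io.IOException;

/** Hypothetical retry wrapper for state store operations (not YARN code). */
class StoreRetry {

  /** A store operation that may fail with an IOException. */
  @FunctionalInterface
  interface StoreOp {
    void run() throws IOException;
  }

  /**
   * Runs op up to maxAttempts times, sleeping sleepMs between attempts.
   * Rethrows the last IOException only after every attempt has failed.
   */
  static void runWithRetries(StoreOp op, int maxAttempts, long sleepMs)
      throws IOException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        op.run();
        return; // stored successfully, no further attempts needed
      } catch (IOException e) {
        last = e;
        if (attempt < maxAttempts && sleepMs > 0) {
          try {
            Thread.sleep(sleepMs);
          } catch (InterruptedException ie) {
            // Preserve the interrupt status and give up on retrying.
            Thread.currentThread().interrupt();
            IOException ioe =
                new IOException("Interrupted while retrying store operation", ie);
            ioe.addSuppressed(e);
            throw ioe;
          }
        }
      }
    }
    throw last;
  }
}
```

With a wrapper like this, callers would still see an IOException on 
persistent failure, so the handling discussed above (failing the 
Container/Application or bringing the NM down) would apply only after the 
retries are exhausted.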

Thoughts?


> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)
