[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039235#comment-14039235 ]
Jason Lowe commented on YARN-1341:
----------------------------------

bq. Application state - If we failed to store the application update, i.e. from init to finish, then we get wrong state on application after recovery.

Yes, applications should be like containers. If we fail to store an application start in the state store then we should fail the container launch that triggered the application to be added. This already happens in the current patch for YARN-1354. If we fail to store the completion of an application then worst-case we will report an application to the RM on restart that isn't active, and the RM will correct the NM when it re-registers.

bq. NodeManagerMetrics - The metrics of NM will get mess up if partial updated.

I wasn't planning on persisting metrics during restart, as there are quite a few (e.g.: RPC metrics, etc.), and I'm not sure it's critical that they be preserved across a restart. Does RM restart do this or are there plans to do so?

bq. About stale tag on NMStateStore - I don't mean to put on NMStateStore, but haven't think clearly on where to do - may be we can persistent on local disk directly or send to RM and retrieval it in NM registration?

I think in most cases the attempt to update the stale tag, even if it's separate from the NMStateStore, will often fail in a similar way when the state store fails (e.g.: full local disk, read-only filesystem, etc.). Therefore I don't believe the effort to maintain a stale tag is going to be worth it. Also if we refuse to load a state store that's stale then we are going to leak containers because we won't try to recover anything from a stale state store.

Instead I think we should decide in the various store failure cases whether the error should be fatal to the operation (which may lead to it being fatal to the NM overall) or if we feel the recovery with stale information is a better outcome than taking the NM down. In the latter case we should just log the error and move on.
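
The per-failure policy described above (a failed store of an application start fails the triggering container launch, while a failed store of an application finish is only logged) might look roughly like the following. This is a minimal illustrative sketch, not code from any YARN patch: the StateStore interface, FlakyStore, NodeManagerSketch and all method names are hypothetical stand-ins and do not reflect the real NMStateStoreService API.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class StoreFailurePolicySketch {

    /** Hypothetical stand-in for the NM state store (assumed, not the real API). */
    interface StateStore {
        void storeApplicationStart(String appId) throws IOException;
        void storeApplicationFinish(String appId) throws IOException;
    }

    /** Test double that can simulate store failures such as a full local disk. */
    static class FlakyStore implements StateStore {
        private final boolean failStart;
        private final boolean failFinish;

        FlakyStore(boolean failStart, boolean failFinish) {
            this.failStart = failStart;
            this.failFinish = failFinish;
        }

        @Override
        public void storeApplicationStart(String appId) throws IOException {
            if (failStart) {
                throw new IOException("simulated store failure on start");
            }
        }

        @Override
        public void storeApplicationFinish(String appId) throws IOException {
            if (failFinish) {
                throw new IOException("simulated store failure on finish");
            }
        }
    }

    /** Hypothetical, heavily simplified NM showing the two failure policies. */
    static class NodeManagerSketch {
        private final StateStore store;
        final List<String> activeApps = new ArrayList<>();
        final List<String> loggedErrors = new ArrayList<>();

        NodeManagerSketch(StateStore store) {
            this.store = store;
        }

        /** Store failure is fatal to the operation: the launch is rejected. */
        boolean launchContainer(String appId) {
            try {
                store.storeApplicationStart(appId);
            } catch (IOException e) {
                loggedErrors.add("launch rejected for " + appId + ": " + e.getMessage());
                return false;
            }
            activeApps.add(appId);
            return true;
        }

        /**
         * Store failure is not fatal: log it and move on. Worst case, the
         * application is reported as active after a restart and the RM
         * corrects the NM when it re-registers.
         */
        void finishApplication(String appId) {
            activeApps.remove(appId);
            try {
                store.storeApplicationFinish(appId);
            } catch (IOException e) {
                loggedErrors.add("could not persist finish of " + appId + ": " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        NodeManagerSketch nm = new NodeManagerSketch(new FlakyStore(false, true));
        System.out.println("launched: " + nm.launchContainer("app_1"));
        nm.finishApplication("app_1");
        System.out.println("active apps: " + nm.activeApps);
        System.out.println("logged errors: " + nm.loggedErrors);
    }
}
```

The design choice here is that each store call site decides its own severity, rather than relying on a separate stale marker that would likely fail under the same conditions as the store itself.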

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.2#6252)