[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062861#comment-14062861 ]
Jason Lowe commented on YARN-1341:
----------------------------------

Thanks for commenting, Devaraj! My apologies for the late reply, as I was on vacation and am still catching up.

bq. In addition to option 1), I'd think of making the NM down if NM fails to store RM keys for certain number of times(configurable) consecutively.

As for retries, I mentioned earlier that if retries are likely to help then the state store implementation should perform them rather than the common code. For the leveldb implementation, a retry is very unlikely to do anything other than make the operation take longer to ultimately fail. The firmware of the drive already implements a large number of retries to attempt to recover from hardware errors, and non-hardware local filesystem errors are highly unlikely to be fixed by simply retrying immediately. If that were the case, I'd expect retries to be implemented in many other places where Hadoop code uses the local filesystem.

bq. And also we can make it(i.e. tear down NM or not) as configurable

I'd like to avoid adding yet more config options unless we think we really need them, but if people agree this needs to be configurable then we can do so. Also I assume in that scenario you would want the NM to shut down while also tearing down containers, cleaning up, etc. as if it didn't support recovery. Tearing down the NM on a state store error just to have it start up again and try to recover with stale state seems pointless -- we might as well have just kept running, which is a better outcome. Or am I missing a use case for that?

And thanks, Junping, for the recent comments!

bq. If you are also agree on this, we can separate this document effort to other JIRA (Umbrella or a dedicated one, whatever you like) and continue the discussion on this particular case.

Sure, we can discuss general error handling or an overall document for it either on YARN-1336 or a new JIRA.

bq. a.
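The point that retries belong inside a particular state store implementation rather than in the common recovery code could be sketched as follows. This is a hypothetical illustration only, not code from the YARN codebase -- the class, interface, and constant names (StateStoreRetrySketch, StoreOp, MAX_ATTEMPTS, storeWithRetry) are all invented:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: if a specific state store implementation believed
// retries could help, a bounded retry loop would live inside that
// implementation, invisible to the common NM recovery code.
public class StateStoreRetrySketch {
    static final int MAX_ATTEMPTS = 3; // illustrative bound, not a YARN config

    interface StoreOp {
        void run() throws IOException;
    }

    // Retry confined to the store implementation; the last failure is
    // rethrown once the attempt budget is exhausted.
    static void storeWithRetry(StoreOp op) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                op.run();
                return;
            } catch (IOException e) {
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        AtomicInteger calls = new AtomicInteger();
        // Simulated store operation that fails twice, then succeeds.
        storeWithRetry(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IOException("simulated disk error");
            }
        });
        System.out.println("succeeded after " + calls.get() + " attempts");
    }
}
```

For a leveldb-backed store, as noted above, such a loop would mostly just delay the eventual failure, which is why the common code does not impose one.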
if currentMasterKey is stale, it can be updated and override soon with registering to RM later. Nothing is affected.

Correct, the NM should receive the current master key upon re-registration with the RM after it restarts.

bq. b. if previousMasterKey is stale, then the real previous master key is lost, so the affection is: AMs with real master key cannot connect to NM to launch containers.

AMs that have the current master key will still be able to connect because the NM just got the current master key as described in a). AMs that have the previous master key will not be able to connect to the NM unless that particular master key also happened to be successfully associated with the attempt in the state store (related to case c).

bq. c. if applicationMasterKeys are stale, then previous old keys get tracked in applicationMasterKeys get lost after restart. The affection is: AMs with old keys cannot connect to NM to launch containers.

AMs that use an old key (i.e. not the current or previous master key) would be unable to connect to the NM.

bq. Anything I am missing here?

I don't believe so. The bottom line is that an AM may not be able to successfully connect to an NM after a restart with stale NM token state.

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>

--
This message was sent by Atlassian JIRA
(v6.2#6252)
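The three staleness cases (a), (b), and (c) discussed above follow from the order in which the NM checks an incoming NMToken's master key. A minimal sketch of that lookup order, with invented class and field names (the real logic lives in the NM's NMToken secret manager, and keys here are reduced to integer ids for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the NMToken key lookup order described above.
// An incoming token is accepted if its key id matches the current key,
// the previous key, or a key recorded for that application attempt.
public class NMTokenKeyLookupSketch {
    int currentKeyId;   // case (a): refreshed on re-registration with the RM
    int previousKeyId;  // case (b): lost if the stored copy is stale
    // appAttemptId -> master key id persisted for that attempt, case (c)
    final Map<String, Integer> appMasterKeys = new HashMap<>();

    boolean canConnect(String appAttemptId, int tokenKeyId) {
        if (tokenKeyId == currentKeyId) {
            return true;  // current key always works after re-registration
        }
        if (tokenKeyId == previousKeyId) {
            return true;  // previous key works only if it was recovered
        }
        Integer appKey = appMasterKeys.get(appAttemptId);
        return appKey != null && appKey == tokenKeyId; // older per-attempt key
    }
}
```

Under this ordering, a stale currentMasterKey is harmless (it is overwritten on re-registration), while a stale previousMasterKey or stale applicationMasterKeys removes one of the later fallbacks, which is exactly why an AM holding an older key may fail to connect after restart.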