[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042559#comment-14042559 ]
Jason Lowe commented on YARN-1341:
----------------------------------

bq. So far as I know, RM restart didn't track this because these metrics will be recovered during event replay in RM restart. In the current NM restart, some metrics could be lost, i.e. allocatedContainers, etc. I think we should either count them back as part of events during recovery or persist them. Thoughts?

Not all of the RM metrics will be recovered, correct? RPC metrics will be zeroed since those aren't persisted (nor should they be, IMHO). Aggregate containers allocated/released in the queue metrics will be wrong since the RM restart work, by design, doesn't store per-container state. If the cluster stays up long enough then apps submitted/completed/failed/killed will not be correct, as I believe it will only count the applications that haven't been reaped due to retention policies. Anyway, this is outside the scope of this JIRA, and I'll file a separate JIRA underneath the YARN-1336 umbrella to discuss what we should do about NM metrics and restart.

bq. If so, how about we don't apply these changes until they can be persisted? That way we keep the state store consistent with the NM's current state. Even if we choose to fail the NM, we can still load the state and recover the work.

Again, I think this is a case-by-case thing. For the RM master key, I'd rather keep going with the current master key and hope the next key update is able to persist (e.g.: a full disk where the state is stored that is later cleaned up) rather than ditch the new key update and risk bringing down the NM because it can no longer keep talking to the RM or AMs. As I mentioned earlier, the consequence of failing to persist the RM master key or the master key used by an AM is that _if_ the NM happens to restart then some AMs _might_ not be able to authenticate with the NM until they get updated to the new master key. If we take down the NM, or keep going but fail to update the master key in memory, then this seems purely worse.
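A rough sketch of the behavior being argued for here (the class, interface, and method names below are hypothetical illustrations, not the actual YARN code): the in-memory key update always happens, and a persistence failure is logged rather than propagated.

```java
import java.io.IOException;

// Hypothetical persistence interface; may fail, e.g. on a full disk.
interface MasterKeyStore {
    void storeMasterKey(String key) throws IOException;
}

// Hypothetical holder: persistence is best-effort, the in-memory
// update is unconditional so the NM can keep talking to the RM and AMs.
class MasterKeyHolder {
    private String currentKey;
    private final MasterKeyStore store;

    MasterKeyHolder(MasterKeyStore store) {
        this.store = store;
    }

    void setMasterKey(String newKey) {
        try {
            store.storeMasterKey(newKey);
        } catch (IOException e) {
            // Worst case if the NM restarts before a later update persists:
            // AMs still on the old key can't authenticate until refreshed.
            System.err.println("Failed to persist master key, continuing: " + e);
        }
        currentKey = newKey; // never ditch the new key update
    }

    String getMasterKey() {
        return currentKey;
    }
}
```

With a store that always throws (simulating a full disk), the holder still ends up on the new key, which is the "keep going and hope the next update persists" behavior described above.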
The opportunity for error has widened, but I don't see any advantage gained by doing so.

bq. Do we expect some operations to fail while other operations succeed? If this means the persistence effort is unavailable short-term, we can just handle it by adding retries. If not, we should expect the fatal operations to fail soon enough, and in that case logging an error and moving on for the non-fatal operations doesn't make much difference. No?

I don't expect immediate retry to help, and if the state store implementation is such that immediate retry is likely to help, then the state store implementation should do that directly before throwing the error rather than relying on the upper-layer code to do so. However, I do expect there to be common failure modes where the error state is temporary but not in the immediate sense (e.g.: the full-disk scenario). And although an NM can't launch containers without a working state store, there's still a lot of useful work an NM can do with a broken state store -- report status of active containers, serve up shuffle data, etc. So far I don't think any of the state store updates should result in a teardown of the NM if there is a failure, although please let me know if you have a scenario where we should.

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>

-- This message was sent by Atlassian JIRA (v6.2#6252)
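To illustrate the earlier point that retry belongs inside the state store implementation rather than in upper-layer callers, a rough sketch (the interface and class names are hypothetical, not the actual YARN state store API): the wrapper retries internally and only surfaces an error once attempts are exhausted, at which point the caller can log and move on instead of tearing down the NM.

```java
import java.io.IOException;

// Hypothetical minimal store interface for illustration.
interface KeyValueStore {
    void put(String key, byte[] value) throws IOException;
}

// Hypothetical wrapper: transient blips are absorbed here, so upper-layer
// code never has to implement its own retry loop.
class RetryingStore implements KeyValueStore {
    private final KeyValueStore delegate;
    private final int maxAttempts;

    RetryingStore(KeyValueStore delegate, int maxAttempts) {
        this.delegate = delegate;
        this.maxAttempts = maxAttempts;
    }

    @Override
    public void put(String key, byte[] value) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                delegate.put(key, value);
                return; // success: the caller never sees the transient failures
            } catch (IOException e) {
                last = e; // possibly transient, try again
            }
        }
        // Persistent failure: only now does the caller see the error,
        // and per the discussion above it should log and keep running.
        throw last;
    }
}
```

Note this only covers immediate retry; the full-disk scenario above is temporary on a much longer timescale, and no amount of in-line retrying helps there, which is why the caller's log-and-continue behavior still matters.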