[
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042559#comment-14042559
]
Jason Lowe commented on YARN-1341:
----------------------------------
bq. As far as I know, RM restart didn't track this because these metrics will
be recovered during event recovery in RM restart. In the current NM restart,
some metrics could be lost, e.g. allocatedContainers. I think we should either
count them back as part of events during recovery or persist them. Thoughts?
Not all of the RM metrics will be recovered, correct? RPC metrics will be
zeroed since those aren't persisted (nor should they be, IMHO). Aggregate
containers allocated/released in the queue metrics will be wrong since the RM
restart work, by design, doesn't store per-container state. If the cluster
stays up too long then apps submitted/completed/failed/killed will not be
correct, as I believe it will only count the applications that haven't been
reaped due to retention policies. Anyway, this is outside the scope of this
JIRA, and I'll file a separate JIRA underneath the YARN-1336 umbrella to
discuss what we should do about NM metrics and restart.
bq. If so, how about we don't apply these changes until they can be
persisted? That way we still keep the state store and the NM's current state
consistent. Even if we choose to fail the NM, we can still load the state and
recover the work.
Again I think this is a case-by-case thing. For the RM master key, I'd rather
keep going with the current master key and hope the next key update is able to
persist (e.g.: a full disk where the state is stored that is later cleared up)
rather than ditch the new key update and risk bringing down the NM because it
can no longer keep talking to the RM or AMs. As I mentioned earlier, the
consequence of failing to persist the RM master key or the master key used by
an AM is that _if_ the NM happens to restart then some AMs _might_ not be able
to authenticate with the NM until they get updated to the new master key. If
we take down the NM, or keep going but fail to update the master key in memory,
then this seems strictly worse. The opportunity for error has widened, but I
don't see any advantage gained by doing so.
bq. Do we expect that some operations can fail while other operations
succeed? If this means the store is only unavailable for a short time, we can
just handle it by adding retries. If not, we should expect the other, fatal
operations to fail soon enough, and in that case logging the error and moving
on in the non-fatal operations doesn't make much difference. No?
I don't expect immediate retry to help, and if the state store implementation
is such that immediate retry is likely to help then the state store
implementation should do that directly before throwing the error rather than
relying on the upper-layer code to do so. However, I do expect there to be
common failure modes where the error state is temporary but not in the
immediate sense (e.g.: the full disk scenario). And although an NM can't
launch containers without a working state store, there's still a lot of useful
stuff an NM can do with a broken state store -- report status of active
containers, serve up shuffle data, etc. So far I don't think any of the state
store updates should result in a teardown of the NM if there is a failure,
although please let me know if you have a scenario where we should.
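The split of responsibilities described above could be sketched roughly as
follows: any immediate retry lives inside the store, and the upper layer only
sees the final failure, logs it, and keeps running. StoreOp, RetryingStore and
NodeStateTracker are hypothetical names invented for this sketch, and the
storeHealthy flag is an assumption about how a caller might remember that the
store is currently broken; none of this is the real NM state store API.

```java
import java.io.IOException;

/** Hypothetical write operation against a state store backend. */
interface StoreOp {
  void run() throws IOException;
}

/** Sketch: the store retries internally, so callers see only the final error. */
class RetryingStore {
  private final int maxAttempts;

  RetryingStore(int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    this.maxAttempts = maxAttempts;
  }

  void execute(StoreOp op) throws IOException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        op.run();
        return; // success, no error surfaces to the caller
      } catch (IOException e) {
        last = e; // remember the failure and try again if attempts remain
      }
    }
    throw last; // non-null because maxAttempts >= 1
  }
}

/** Sketch of an upper layer that logs a store failure and keeps going. */
class NodeStateTracker {
  private final RetryingStore store;
  private boolean storeHealthy = true;

  NodeStateTracker(RetryingStore store) {
    this.store = store;
  }

  void recordUpdate(StoreOp op) {
    try {
      store.execute(op);
      storeHealthy = true; // e.g. the full disk was cleared up later
    } catch (IOException e) {
      // No teardown: status reporting and shuffle serving can continue.
      storeHealthy = false;
      System.err.println("State store update failed: " + e.getMessage());
    }
  }

  boolean isStoreHealthy() {
    return storeHealthy;
  }
}
```

A caller could consult isStoreHealthy() before accepting new container
launches while still reporting status for the containers it already runs.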
> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
> Key: YARN-1341
> URL: https://issues.apache.org/jira/browse/YARN-1341
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch,
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>