Jason Lowe commented on YARN-1341:

bq. So far as I know, RM restart didn't track this because these metrics will 
be recovered during event replay in RM restart. In the current NM restart, some 
metrics could be lost, i.e. allocatedContainers, etc. I think we should either 
count them back as part of events during recovery or persist them. Thoughts?

Not all of the RM metrics will be recovered, correct?  RPC metrics will be 
zeroed since those aren't persisted (nor should they be, IMHO).  Aggregate 
containers allocated/released in the queue metrics will be wrong since the RM 
restart work, by design, doesn't store per-container state.  If the cluster 
stays up too long then apps submitted/completed/failed/killed will not be 
correct, as I believe it will only count the applications that haven't been 
reaped due to retention policies.  Anyway this is outside the scope of this 
JIRA, and I'll file a separate JIRA underneath the YARN-1336 umbrella to 
discuss what we should do about NM metrics and restart.

bq. If so, how about we don't apply these changes until they can be persisted? 
That way we keep the state store consistent with the NM's current state. Even 
if we choose to fail the NM, we can still load the state and recover the work.

Again I think this is a case-by-case thing.  For the RM master key, I'd rather 
keep going with the current master key and hope the next key update is able to 
persist (e.g.: a full disk where the state is stored that is later cleared up) 
rather than ditch the new key update and risk bringing down the NM because it 
can no longer keep talking to the RM or AMs.  As I mentioned earlier, the 
consequence of failing to persist the RM master key or the master key used by 
an AM is that _if_ the NM happens to restart then some AMs _might_ not be able 
to authenticate with the NM until they get updated to the new master key.  If we 
take down the NM or keep going but fail to update the master key in memory then 
this seems purely worse.  The opportunity for error has widened, but I don't 
see any advantage gained by doing so.
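To illustrate the trade-off above, here is a minimal sketch of the log-and-continue approach for a master key update. This is hypothetical code, not the actual NodeManager implementation; `MasterKeyRoller`, `KeyStore`, and `persist` are invented names for illustration. The point is that the in-memory key is always rolled, even if persisting it fails, so the NM keeps authenticating AMs and only a subsequent restart could leave some AMs on a stale key.

```java
import java.io.IOException;

class MasterKeyRoller {
    /** Hypothetical persistence layer standing in for the NM state store. */
    interface KeyStore {
        void persist(byte[] key) throws IOException;
    }

    private byte[] currentKey;     // key used to validate incoming NMTokens
    private final KeyStore store;

    MasterKeyRoller(KeyStore store, byte[] initialKey) {
        this.store = store;
        this.currentKey = initialKey;
    }

    /** Roll to the new key even if persisting it fails. */
    void updateMasterKey(byte[] newKey) {
        try {
            store.persist(newKey);
        } catch (IOException e) {
            // Log and continue: refusing the key update (or tearing down the
            // NM) would only widen the window for error without any gain.
            System.err.println("Failed to persist master key, continuing: " + e);
        }
        currentKey = newKey;       // always update the in-memory key
    }

    byte[] getCurrentKey() {
        return currentKey;
    }
}
```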

bq. Do we expect some operations to fail while other operations succeed? If 
this means the persistence layer is only briefly unavailable, we can just 
handle it by adding retries. If not, we should expect the fatal operations to 
fail soon enough as well, and in that case logging an error and moving on for 
non-fatal operations doesn't make much difference. No?

I don't expect immediate retry to help, and if the state store implementation 
is such that immediate retry is likely to help then the state store 
implementation should do that directly before throwing the error rather than 
relying on the upper-layer code to do so.  However, I do expect there to be 
common failure modes where the error state is temporary but not in the 
immediate sense (e.g.: the full disk scenario).  And although an NM can't 
launch containers without a working state store, there's still a lot of useful 
stuff an NM can do with a broken state store -- report status of active 
containers, serve up shuffle data, etc.  So far I don't think any of the state 
store updates should result in a teardown of the NM if there is a failure, 
although please let me know if you have a scenario where we should.
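To make the retry-inside-the-store point concrete, here is a hypothetical sketch (names like `RetryingStore` and `withRetries` are invented, not Hadoop APIs): if immediate retry is likely to help for a given backend, the store implementation retries internally before surfacing the error, so upper-layer callers only see failures that retry could not fix and can simply log and continue for non-fatal operations.

```java
import java.io.IOException;

class RetryingStore {
    /** A single state-store operation that may fail transiently. */
    interface Op {
        void run() throws IOException;
    }

    /**
     * Run the operation, retrying up to maxAttempts times. Only after
     * exhausting retries is the error surfaced to the caller, which then
     * decides whether the failure is fatal for the NM.
     */
    static void withRetries(Op op, int maxAttempts) {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.run();
                return;                 // success: no error escapes
            } catch (IOException e) {
                last = e;               // transient failure: retry
            }
        }
        throw new RuntimeException(
            "state store operation failed after " + maxAttempts + " attempts",
            last);
    }
}
```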

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch

This message was sent by Atlassian JIRA
