Jason Lowe commented on YARN-1341:
bq. Application state - If we failed to store the application update, i.e. from
init to finish, then we get wrong state on application after recovery.
Yes, applications should be like containers. If we fail to store an
application start in the state store then we should fail the container launch
that triggered the application to be added. This already happens in the
current patch for YARN-1354. If we fail to store the completion of an
application then worst-case we will report an application to the RM on restart
that isn't active, and the RM will correct the NM when it re-registers.
bq. NodeManagerMetrics - The metrics of NM will get mess up if partial updated.
I wasn't planning on persisting metrics during restart, as there are quite a
few (e.g.: RPC metrics, etc.), and I'm not sure it's critical that they be
preserved across a restart. Does RM restart do this or are there plans to do
so?
bq. About stale tag on NMStateStore - I don't mean to put on NMStateStore, but
haven't think clearly on where to do - may be we can persistent on local disk
directly or send to RM and retrieval it in NM registration?
I think the attempt to update the stale tag, even if it's separate from the
NMStateStore, will often fail in the same way the state store itself fails
(e.g., full local disk, read-only filesystem, etc.). Therefore I
don't believe the effort to maintain a stale tag is going to be worth it. Also
if we refuse to load a state store that's stale then we are going to leak
containers because we won't try to recover anything from a stale state store.
Instead I think we should decide in the various store failure cases whether the
error should be fatal to the operation (which may lead to it being fatal to the
NM overall) or if we feel the recovery with stale information is a better
outcome than taking the NM down. In the latter case we should just log the
error and move on.
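To make the distinction concrete, here is a rough sketch of the two policies
for the application case discussed above. It is not code from the patch: the
StateStore interface, InMemoryStore, and AppTracker are hypothetical stand-ins
and do not reflect the real NMStateStoreService API. A failed store on
application start propagates so the caller can fail the container launch,
while a failed store on application completion is only logged.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StoreFailurePolicySketch {

    /** Hypothetical stand-in for the NM state store (assumption, not the real API). */
    public interface StateStore {
        void storeApplication(String appId) throws IOException;
        void removeApplication(String appId) throws IOException;
    }

    /** In-memory store that can simulate a full disk or read-only filesystem. */
    public static class InMemoryStore implements StateStore {
        public boolean failWrites = false;
        public final Set<String> apps = new HashSet<>();

        @Override
        public void storeApplication(String appId) throws IOException {
            if (failWrites) {
                throw new IOException("simulated store failure");
            }
            apps.add(appId);
        }

        @Override
        public void removeApplication(String appId) throws IOException {
            if (failWrites) {
                throw new IOException("simulated store failure");
            }
            apps.remove(appId);
        }
    }

    /** Hypothetical tracker that applies a per-operation failure policy. */
    public static class AppTracker {
        private final StateStore store;
        public final Set<String> activeApps = new HashSet<>();
        public final List<String> loggedErrors = new ArrayList<>();

        public AppTracker(StateStore store) {
            this.store = store;
        }

        /** Fatal to the operation: the exception propagates so the launch can be failed. */
        public void startApplication(String appId) throws IOException {
            store.storeApplication(appId); // persist first; do not track the app on failure
            activeApps.add(appId);
        }

        /** Non-fatal: a stale entry is tolerated, so log the error and move on. */
        public void finishApplication(String appId) {
            activeApps.remove(appId);
            try {
                store.removeApplication(appId);
            } catch (IOException e) {
                loggedErrors.add("Unable to remove " + appId + ": " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        InMemoryStore store = new InMemoryStore();
        AppTracker tracker = new AppTracker(store);
        tracker.startApplication("app_1");
        store.failWrites = true;            // store starts failing
        tracker.finishApplication("app_1"); // logged, not thrown
        System.out.println("logged errors: " + tracker.loggedErrors.size());
        System.out.println("stale entries left in store: " + store.apps);
    }
}
```

In this sketch the entry left behind by the failed removal is the stale
information that would be reported to the RM on restart, which the RM can then
correct when the NM re-registers.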
> Recover NMTokens upon nodemanager restart
> Key: YARN-1341
> URL: https://issues.apache.org/jira/browse/YARN-1341
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch,
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch