Junping Du commented on YARN-1341:
bq. Yes, applications should be like containers. If we fail to store an
application start in the state store then we should fail the container launch
that triggered the application to be added. This already happens in the current
patch for YARN-1354. If we fail to store the completion of an application then
worst-case we will report an application to the RM on restart that isn't
active, and the RM will correct the NM when it re-registers.
That makes sense. I guess we should do some additional work to verify that the
behavior is what we expect.
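To make the expected behavior concrete, here is a minimal sketch of the rule
quoted above: if storing the application start fails, the container launch that
triggered it fails too, and the application is not tracked as active. The names
AppStateStore and AppTracker are hypothetical stand-ins for illustration, not
the actual classes in the YARN-1354 patch.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the NM state store; not the real YARN interface.
interface AppStateStore {
    void storeApplication(String appId) throws IOException;
}

// Hypothetical tracker of active applications on the NM.
class AppTracker {
    private final AppStateStore store;
    private final Set<String> activeApps = new HashSet<>();

    AppTracker(AppStateStore store) {
        this.store = store;
    }

    /** Returns true if the container launch may proceed. */
    boolean startContainer(String appId, String containerId) {
        if (!activeApps.contains(appId)) {
            try {
                // Persist first, so memory never gets ahead of the store.
                store.storeApplication(appId);
            } catch (IOException e) {
                // Store failure fails the launch that triggered the add.
                return false;
            }
            activeApps.add(appId);
        }
        return true;
    }

    boolean isActive(String appId) {
        return activeApps.contains(appId);
    }
}
```

A check along these lines (a store that always throws, then asserting the
launch is rejected and the application is not active) is the kind of test I
have in mind.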
bq. I wasn't planning on persisting metrics during restart, as there are quite
a few (e.g.: RPC metrics, etc.), and I'm not sure it's critical that they be
preserved across a restart. Does RM restart do this or are there plans to do so?
I think these metrics are important, especially for users' monitoring tools,
and we should keep this information consistent across a restart. As far as I
know, RM restart doesn't track them because the metrics are recovered as events
are replayed during RM recovery. In the current NM restart, some metrics could
be lost, e.g. allocatedContainers. I think we should either count them back as
part of the events during recovery or persist them. Thoughts?
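As a rough illustration of the "count them back" option, the sketch below
rebuilds allocation metrics from the containers recovered out of the state
store instead of persisting the metrics themselves. RecoveredContainer,
NodeMetricsSketch and MetricsRecovery are hypothetical names for illustration;
the real NM metrics class and recovery path may look different.

```java
import java.util.List;

// Hypothetical view of a container loaded from the state store.
class RecoveredContainer {
    final String id;
    final int memoryMb;
    final boolean running;

    RecoveredContainer(String id, int memoryMb, boolean running) {
        this.id = id;
        this.memoryMb = memoryMb;
        this.running = running;
    }
}

// Hypothetical subset of the NM metrics.
class NodeMetricsSketch {
    int allocatedContainers;
    int allocatedMb;

    void allocateContainer(int memoryMb) {
        allocatedContainers++;
        allocatedMb += memoryMb;
    }
}

class MetricsRecovery {
    /** Re-counts allocations for every container still running after restart. */
    static NodeMetricsSketch rebuild(List<RecoveredContainer> recovered) {
        NodeMetricsSketch metrics = new NodeMetricsSketch();
        for (RecoveredContainer c : recovered) {
            if (c.running) {
                metrics.allocateContainer(c.memoryMb);
            }
        }
        return metrics;
    }
}
```

This only covers metrics that can be derived from recovered state; counters
such as RPC metrics would still reset unless they are persisted.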
bq. Therefore I don't believe the effort to maintain a stale tag is going to be
worth it. Also if we refuse to load a state store that's stale then we are
going to leak containers because we won't try to recover anything from a stale
If so, how about we don't apply these changes until they can be persisted?
That way we keep the state store and the NM's current state consistent. Even
if we choose to fail the NM, we can still load the state and recover the work.
bq. Instead I think we should decide in the various store failure cases whether
the error should be fatal to the operation (which may lead to it being fatal to
the NM overall) or if we feel the recovery with stale information is a better
outcome than taking the NM down. In the latter case we should just log the
error and move on.
Do we expect that some operations can fail while others succeed? If this means
the store is only briefly unavailable, we can handle it by adding retries. If
not, we should expect the fatal operations to fail soon enough, in which case
logging the error and moving on in the non-fatal operations doesn't make much
difference. No?
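A minimal sketch of the retry idea, assuming a transient store outage: the
store operation is attempted a bounded number of times, and the caller decides
whether a final failure is fatal or just logged. StoreOp and StoreRetry are
hypothetical names for illustration, not existing YARN classes.

```java
import java.io.IOException;

// Hypothetical wrapper for a single state store write.
interface StoreOp {
    void run() throws IOException;
}

class StoreRetry {
    /**
     * Runs the operation up to maxAttempts times.
     * Returns true on the first success, false if every attempt failed.
     */
    static boolean runWithRetry(StoreOp op, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.run();
                return true;
            } catch (IOException e) {
                // A real implementation would log the failure and probably
                // back off before the next attempt.
            }
        }
        return false;
    }
}
```

With something like this, a false return for a fatal operation would fail the
operation (or the NM), while a non-fatal operation would just log and move on.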
> Recover NMTokens upon nodemanager restart
> Key: YARN-1341
> URL: https://issues.apache.org/jira/browse/YARN-1341
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch,
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch