Junping Du commented on YARN-1341:

bq. Yes, applications should be like containers. If we fail to store an 
application start in the state store then we should fail the container launch 
that triggered the application to be added. This already happens in the current 
patch for YARN-1354. If we fail to store the completion of an application then 
worst-case we will report an application to the RM on restart that isn't 
active, and the RM will correct the NM when it re-registers.
That makes sense. I guess we should do some additional work to check whether 
the behavior is as expected.

bq. I wasn't planning on persisting metrics during restart, as there are quite 
a few (e.g.: RPC metrics, etc.), and I'm not sure it's critical that they be 
preserved across a restart. Does RM restart do this or are there plans to do so?
I think these metrics are important, especially for users' monitoring tools, 
and we should keep this information consistent across a restart. As far as I 
know, RM restart doesn't track them because these metrics are recovered while 
events are replayed during RM restart. In the current NM restart, some metrics 
could be lost, e.g. allocatedContainers. I think we should either count them 
back as part of the events during recovery or persist them. Thoughts?
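
A minimal sketch of the "count them back during recovery" option, just to make 
the idea concrete. RecoveredContainer, State and recountAllocated are 
hypothetical names for illustration, not the actual NM recovery classes; the 
assumption is that recovery yields one record per container with its last 
known state.

```java
import java.util.Arrays;
import java.util.List;

final class RecoverMetrics {
    // Hypothetical container states as read back from the state store.
    enum State { RUNNING, COMPLETED }

    // Hypothetical record for a container found in the state store.
    static final class RecoveredContainer {
        final String id;
        final State state;

        RecoveredContainer(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    // Rebuild an allocatedContainers-style gauge by counting the
    // containers that were still running when the NM went down.
    static int recountAllocated(List<RecoveredContainer> recovered) {
        int allocated = 0;
        for (RecoveredContainer c : recovered) {
            if (c.state == State.RUNNING) {
                allocated++;
            }
        }
        return allocated;
    }

    public static void main(String[] args) {
        List<RecoveredContainer> recovered = Arrays.asList(
            new RecoveredContainer("container_1", State.RUNNING),
            new RecoveredContainer("container_2", State.COMPLETED),
            new RecoveredContainer("container_3", State.RUNNING));
        System.out.println("allocatedContainers=" + recountAllocated(recovered));
    }
}
```

With this approach nothing extra is written to the state store; the gauge is 
derived from what is already recovered.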

bq. Therefore I don't believe the effort to maintain a stale tag is going to be 
worth it. Also if we refuse to load a state store that's stale then we are 
going to leak containers because we won't try to recover anything from a stale 
state store.
If so, how about we don't apply these changes until they can be persisted? 
That way we still keep the state store consistent with the NM's current state. 
Even if we choose to fail the NM, we can still load the state and recover the 
work.
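
To illustrate the ordering I mean (persist first, apply to memory only on 
success), here is a rough sketch. StateStore, TokenState and storeToken are 
hypothetical names made up for this example; they are not the real NM state 
store interface.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

final class PersistThenApply {
    // Hypothetical stand-in for the NM state store.
    interface StateStore {
        void storeToken(String appAttemptId, String token) throws IOException;
    }

    // In-memory token view that is only updated after a successful store,
    // so memory never gets ahead of what a restart could recover.
    static final class TokenState {
        private final StateStore store;
        private final Map<String, String> tokens = new HashMap<>();

        TokenState(StateStore store) {
            this.store = store;
        }

        void update(String appAttemptId, String token) throws IOException {
            store.storeToken(appAttemptId, token); // may throw; memory untouched
            tokens.put(appAttemptId, token);       // reached only if persisted
        }

        String get(String appAttemptId) {
            return tokens.get(appAttemptId);
        }
    }

    public static void main(String[] args) {
        TokenState state = new TokenState((id, token) -> {
            throw new IOException("store unavailable");
        });
        try {
            state.update("appattempt_1", "token");
        } catch (IOException e) {
            System.out.println("update rejected: " + e.getMessage());
        }
        System.out.println("in memory: " + state.get("appattempt_1"));
    }
}
```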

bq. Instead I think we should decide in the various store failure cases whether 
the error should be fatal to the operation (which may lead to it being fatal to 
the NM overall) or if we feel the recovery with stale information is a better 
outcome than taking the NM down. In the latter case we should just log the 
error and move on.
Do we expect some operations to fail while others succeed? If this means the 
store is unavailable only for a short time, we can handle it by adding 
retries. If not, we should expect the other, fatal operations to fail soon 
enough, and in that case logging the error and moving on for the non-fatal 
operations doesn't make much difference. No?
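
For reference, the retry plus fatal/non-fatal handling being discussed could 
look roughly like the sketch below. StoreOp and runWithRetry are hypothetical 
names for illustration only, and the retry count is an arbitrary assumption; 
this is not code from the patch.

```java
import java.io.IOException;

final class StoreFailurePolicy {
    // Hypothetical wrapper for a single state store write.
    interface StoreOp {
        void run() throws IOException;
    }

    // Try the operation up to maxAttempts times. Returns true on success.
    // If every attempt fails: rethrow when the operation is fatal,
    // otherwise log the error and return false so the caller moves on.
    static boolean runWithRetry(StoreOp op, int maxAttempts, boolean fatal)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.run();
                return true;
            } catch (IOException e) {
                last = e; // possibly transient; try again
            }
        }
        if (fatal) {
            throw last;
        }
        System.err.println("state store op failed, continuing: "
            + last.getMessage());
        return false;
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        boolean stored = runWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("transient failure");
            }
        }, 3, true);
        System.out.println("stored=" + stored + " after " + calls[0] + " calls");
    }
}
```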

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
