Jason Lowe commented on YARN-1341:

bq. Application state - If we failed to store the application update, i.e. from 
init to finish, then we get wrong state on application after recovery.

Yes, applications should be like containers.  If we fail to store an 
application start in the state store then we should fail the container launch 
that triggered the application to be added.  This already happens in the 
current patch for YARN-1354.  If we fail to store the completion of an 
application then worst-case we will report an application to the RM on restart 
that isn't active, and the RM will correct the NM when it re-registers.

bq. NodeManagerMetrics - The metrics of NM will get mess up if partial updated.

I wasn't planning on persisting metrics during restart, as there are quite a 
few (e.g.: RPC metrics, etc.), and I'm not sure it's critical that they be 
preserved across a restart.  Does RM restart do this or are there plans to do 
so?

bq. About stale tag on NMStateStore - I don't mean to put on NMStateStore, but 
haven't think clearly on where to do - may be we can persistent on local disk 
directly or send to RM and retrieval it in NM registration?

I think the attempt to update the stale tag, even if it's separate from the 
NMStateStore, will often fail in a similar way when the state store fails 
(e.g.: full local disk, read-only filesystem, etc.).  Therefore I 
don't believe the effort to maintain a stale tag is going to be worth it.  Also 
if we refuse to load a state store that's stale then we are going to leak 
containers because we won't try to recover anything from a stale state store.

Instead I think we should decide in the various store failure cases whether the 
error should be fatal to the operation (which may lead to it being fatal to the 
NM overall) or if we feel the recovery with stale information is a better 
outcome than taking the NM down.  In the latter case we should just log the 
error and move on.
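
As a rough illustration of that per-operation decision, here is a minimal Java 
sketch.  It is not code from the YARN-1341 or YARN-1354 patches: the 
StateStore interface, FailingStore, and the startApplication and 
finishApplication methods are hypothetical names invented for this example.  
It assumes a failed store of an application start should fail the triggering 
operation, while a failed store of an application completion is only logged.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the actual NM code.
class StoreFailurePolicySketch {

    /** Minimal stand-in for a state store whose writes can fail. */
    interface StateStore {
        void store(String key, String value) throws IOException;
    }

    /** Simulates a broken store, e.g. full local disk or read-only filesystem. */
    static final class FailingStore implements StateStore {
        @Override
        public void store(String key, String value) throws IOException {
            throw new IOException("simulated store failure for " + key);
        }
    }

    final StateStore store;
    // Stand-in for a real logger so the behavior is easy to inspect.
    final List<String> loggedErrors = new ArrayList<>();

    StoreFailurePolicySketch(StateStore store) {
        this.store = store;
    }

    /** Fatal case: the error propagates so the caller can fail the launch. */
    void startApplication(String appId) throws IOException {
        store.store("app/" + appId, "STARTED");
    }

    /**
     * Tolerated case: recovering with stale information is assumed to be
     * acceptable, so the error is logged and the operation continues.
     * Returns true only if the completion was actually persisted.
     */
    boolean finishApplication(String appId) {
        try {
            store.store("app/" + appId, "FINISHED");
            return true;
        } catch (IOException e) {
            loggedErrors.add(e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        StoreFailurePolicySketch nm =
            new StoreFailurePolicySketch(new FailingStore());
        try {
            nm.startApplication("app_1");
        } catch (IOException e) {
            System.out.println("start rejected: " + e.getMessage());
        }
        boolean persisted = nm.finishApplication("app_1");
        System.out.println("finish persisted: " + persisted
            + ", logged errors: " + nm.loggedErrors.size());
    }
}
```

Which operations belong in which bucket is the actual design question; the 
split shown here only mirrors the application start and completion cases 
discussed earlier in this comment.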

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch

This message was sent by Atlassian JIRA
