[ 
https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980360#comment-13980360
 ] 

Jason Lowe commented on YARN-1354:
----------------------------------

Yes, we can't rely on active containers to tell us which apps are active.  
I stumbled across YARN-1421, and I think that's the best way to solve the lost 
FINISH_APPS event.  We can already lose that event in the RM restart scenario, 
and the fix proposed in that JIRA (having the NM heartbeat the active 
applications along with active containers) would solve it for the NM restart 
case as well.
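
To illustrate the heartbeat idea, here is a minimal sketch, not the YARN-1421 
patch itself.  It assumes the NM reports the set of application IDs it 
believes are active and the RM answers with the ones it no longer considers 
running.  The class and method names (HeartbeatSketch, appsToCleanup) are 
hypothetical and do not correspond to actual YARN APIs.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch only: these are not real YARN classes or protocol records.
public class HeartbeatSketch {

    // RM side: given the applications a node reports as active, return the
    // ones the RM no longer considers running, so the node can clean them up
    // even if the original FINISH_APPS event was lost.
    public static Set<String> appsToCleanup(Set<String> reportedByNode,
                                            Set<String> runningOnRm) {
        Set<String> stale = new TreeSet<>(reportedByNode);
        stale.removeAll(runningOnRm);
        return stale;
    }

    public static void main(String[] args) {
        // The node still tracks three apps; the RM only knows about one.
        Set<String> reported = new TreeSet<>(Arrays.asList("app_1", "app_2", "app_3"));
        Set<String> running = new TreeSet<>(Arrays.asList("app_2"));
        System.out.println(appsToCleanup(reported, running));
    }
}
```

Because the node reports its applications on every heartbeat, a missed 
notification is corrected on the next exchange rather than leaving the 
application active on the node indefinitely.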

As for nmStore.start() being called during serviceInit, that's because we 
recover the secret manager state during init, and the store needs to be 
started in order to do that.  We might be able to postpone the recovery until 
start, but I thought it was safer to recover during init.  That avoids races 
where one component starts up and touches another component before that 
component has finished recovering.
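
The ordering argument can be sketched with a toy two-phase lifecycle.  This is 
an illustration under assumptions, not the NM code: the Component interface 
and the RecoveringComponent class are hypothetical stand-ins for services that 
are all initialized before any of them is started.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a two-phase lifecycle; names are illustrative only.
public class LifecycleSketch {

    interface Component {
        void init(List<String> log);
        void start(List<String> log);
    }

    // Stands in for a component that owns recoverable state.
    static class RecoveringComponent implements Component {
        private final String name;

        RecoveringComponent(String name) {
            this.name = name;
        }

        @Override
        public void init(List<String> log) {
            // State is restored here, before any component has been started.
            log.add(name + ":recovered");
        }

        @Override
        public void start(List<String> log) {
            log.add(name + ":started");
        }
    }

    // Runs every init before any start, so a started component can never
    // observe a peer that has not yet recovered.
    static List<String> run(List<? extends Component> components) {
        List<String> log = new ArrayList<>();
        for (Component c : components) {
            c.init(log);
        }
        for (Component c : components) {
            c.start(log);
        }
        return log;
    }

    // Convenience entry point: two components, A and B.
    public static List<String> demo() {
        List<RecoveringComponent> components = new ArrayList<>();
        components.add(new RecoveringComponent("A"));
        components.add(new RecoveringComponent("B"));
        return run(components);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If recovery were moved into start() instead, A could be started and call into 
B before B had recovered, which is the kind of race that recovering during 
init is meant to avoid.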

I need to update the patch to handle the runtime DBException issue that was 
pointed out in the review for MAPREDUCE-5652.  I hope to have that updated 
patch posted shortly.

> Recover applications upon nodemanager restart
> ---------------------------------------------
>
>                 Key: YARN-1354
>                 URL: https://issues.apache.org/jira/browse/YARN-1354
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1354-v1.patch
>
>
> The set of active applications in the nodemanager context needs to be 
> recovered for work-preserving nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
