[ 
https://issues.apache.org/jira/browse/YARN-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992405#comment-14992405
 ] 

Jason Lowe commented on YARN-4334:
----------------------------------

This would probably involve some sort of "heartbeat" to the state store to keep 
track of the approximate time the ResourceManager was last up.  We would not 
want to update the state store very often, probably only about once a minute or 
so.
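
As a rough illustration of the heartbeat idea, here is a minimal Java sketch. 
It assumes a hypothetical TimestampStore interface standing in for the state 
store, and none of the class or method names below are existing YARN APIs.  A 
single daemon thread records the current wall-clock time at a fixed interval, 
so the stored value approximates the last time the RM was known to be up.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: these names are illustrative, not YARN APIs.
class UptimeHeartbeat {

    // Stand-in for the RM state store; a real store would persist the value.
    interface TimestampStore {
        void writeLastUptime(long epochMillis);
        long readLastUptime();
    }

    // In-memory implementation used only for illustration; -1 means "never written".
    static final class InMemoryStore implements TimestampStore {
        private final AtomicLong last = new AtomicLong(-1L);

        @Override
        public void writeLastUptime(long epochMillis) {
            last.set(epochMillis);
        }

        @Override
        public long readLastUptime() {
            return last.get();
        }
    }

    // Starts a daemon thread that writes the current time once per interval.
    static ScheduledExecutorService start(TimestampStore store, long interval, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "uptime-heartbeat");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(
            () -> store.writeLastUptime(System.currentTimeMillis()),
            0, interval, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        InMemoryStore store = new InMemoryStore();
        // One write per minute, matching the "order of a minute" suggestion.
        ScheduledExecutorService scheduler = start(store, 1, TimeUnit.MINUTES);
        Thread.sleep(200); // give the first write a moment to happen
        System.out.println("last uptime recorded: " + store.readLastUptime());
        scheduler.shutdownNow();
    }
}
```

A coarse interval keeps the write load on the state store low, at the cost of 
the recorded uptime being stale by up to one interval.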

One key use-case for this is Oozie.  Oozie launchers have a known problem where 
they re-launch applications when they restart.  If the launcher AM gives up and 
the sub-job's AM gives up, then when the RM recovers and re-launches AM attempts 
for both jobs, the launcher will re-submit the sub-job.  There will then be two 
instances of the sub-job running, which is undesirable.  I suspect there are 
other job-launches-job situations besides Oozie where this would also be 
problematic.

> Ability to avoid ResourceManager recovery if state store is "too old"
> ---------------------------------------------------------------------
>
>                 Key: YARN-4334
>                 URL: https://issues.apache.org/jira/browse/YARN-4334
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Jason Lowe
>            Assignee: Chang Li
>
> There are times when a ResourceManager has been down long enough that 
> ApplicationMasters and potentially external client-side monitoring mechanisms 
> have given up completely.  If the ResourceManager starts back up and tries to 
> recover we can get into situations where the RM launches new application 
> attempts for the AMs that gave up, but then the client _also_ launches 
> another instance of the app because it assumed everything was dead.
> It would be nice if the RM could be optionally configured to avoid trying to 
> recover if the state store was "too old."  The RM would come up without any 
> applications recovered, but we would avoid a double-submission situation.
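
The "too old" check described above could look roughly like the following Java 
sketch.  RecoveryPolicy, shouldRecover, and the maxAge parameter are 
hypothetical names, not existing YARN classes or configuration keys.  The 
sketch also assumes that a zero or negative maxAge disables the feature, and 
that a store with no recorded timestamp is recovered as usual; both are 
illustrative choices rather than decided behaviour.

```java
import java.time.Duration;

// Hypothetical sketch: these names are illustrative, not YARN APIs.
class RecoveryPolicy {

    // Returns true if the RM should recover applications from the state store.
    // lastUptimeMillis: last recorded uptime, or a negative value if none exists.
    // nowMillis: current time, passed in so the decision is easy to test.
    // maxAge: configured threshold; zero or negative is assumed to mean "disabled".
    static boolean shouldRecover(long lastUptimeMillis, long nowMillis, Duration maxAge) {
        if (maxAge.isZero() || maxAge.isNegative()) {
            return true; // feature disabled: always recover
        }
        if (lastUptimeMillis < 0) {
            return true; // assumption: no timestamp recorded, keep default behaviour
        }
        long ageMillis = nowMillis - lastUptimeMillis;
        return ageMillis <= maxAge.toMillis();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        Duration maxAge = Duration.ofHours(1);
        long twoHoursAgo = now - Duration.ofHours(2).toMillis();
        // State last touched two hours ago with a one-hour limit: skip recovery.
        System.out.println("recover? " + shouldRecover(twoHoursAgo, now, maxAge));
    }
}
```

When shouldRecover returns false, the RM would start with no applications 
recovered, trading lost state for avoiding a double submission.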



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
