[ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817941#comment-13817941
 ] 

Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------

Attaching rebased patch.
I slightly modified the logic of the RMRestart app recovery code.
* If the application doesn't have any attempts, a new attempt will be started 
when we do submitApplication as part of recovery.
* If the application has one or more attempts, attempt recovery will take 
place in 2 steps.
** All the application attempts except the last one will be recovered first.
** When we do submitApplication as part of application recovery, we will 
replay the last attempt.
*** If the last attempt doesn't have a finalRecoveredState stored, it will be 
treated as an attempt whose AM may or may not have been started/finished. 
So we will move this application attempt into the LAUNCHED state, add it to 
AMLivenessMonitor, and move the application to the RUNNING state.
*** If the last attempt was in the FAILED, KILLED, or FINISHED state, we will 
replay that attempt's BaseFinalTransition by recovering the attempt 
synchronously here.
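
To make the branching above easier to follow, here is a minimal sketch of the 
recovery plan as a pure function. It is not code from the patch: the class 
AttemptRecoverySketch, the enums StoredFinalState and Action, and the method 
plan are hypothetical names made up for illustration, and NONE stands in for 
"no finalRecoveredState stored".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the recovery steps described above.
// None of these names come from the YARN-1210 patch.
final class AttemptRecoverySketch {

    // NONE models an attempt with no final state stored before the RM went down.
    enum StoredFinalState { NONE, FAILED, KILLED, FINISHED }

    enum Action {
        START_NEW_ATTEMPT,            // no stored attempts at all
        RECOVER_EARLIER_ATTEMPT,      // step 1: every attempt except the last
        MOVE_TO_LAUNCHED_AND_MONITOR, // last attempt, final state unknown
        REPLAY_FINAL_TRANSITION       // last attempt, final state known
    }

    // Returns the ordered list of actions for the stored attempts of one app.
    static List<Action> plan(List<StoredFinalState> storedAttempts) {
        List<Action> actions = new ArrayList<>();
        if (storedAttempts.isEmpty()) {
            actions.add(Action.START_NEW_ATTEMPT);
            return actions;
        }
        int last = storedAttempts.size() - 1;
        // Step 1: recover all attempts except the last one.
        for (int i = 0; i < last; i++) {
            actions.add(Action.RECOVER_EARLIER_ATTEMPT);
        }
        // Step 2: replay the last attempt during submitApplication.
        if (storedAttempts.get(last) == StoredFinalState.NONE) {
            // The AM may or may not be alive, so wait for it under the liveness monitor.
            actions.add(Action.MOVE_TO_LAUNCHED_AND_MONITOR);
        } else {
            actions.add(Action.REPLAY_FINAL_TRANSITION);
        }
        return actions;
    }

    public static void main(String[] args) {
        List<StoredFinalState> stored = new ArrayList<>();
        stored.add(StoredFinalState.FAILED);
        stored.add(StoredFinalState.NONE);
        System.out.println(plan(stored));
    }
}
```

The sketch only models which step is taken, not the state machine 
transitions themselves.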

Adding tests to cover the scenarios below:
* A new application attempt is not started until the previous AM container's 
finish event is reported back to the RM as part of NM registration.
* If the previous AM container's finish event is never reported back (i.e. the 
node manager on which this AM container was running also went down), 
AMLivenessMonitor should time out the previous attempt and start a new attempt.
* If all the stored attempts had finished, a new attempt should be started 
immediately.
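
The three scenarios above amount to one decision: when may a new attempt 
start? The sketch below is a simplified model of that decision, not code from 
the patch or its tests. NewAttemptGateSketch, Decision, decide, and all 
parameter names are hypothetical, and the sketch deliberately ignores the 
max-attempts limit and the case where the application completed successfully.

```java
// Hypothetical model of the three test scenarios listed above.
// None of these names come from the YARN-1210 patch.
final class NewAttemptGateSketch {

    enum Decision { WAIT, START_NEW_ATTEMPT }

    static Decision decide(boolean lastAttemptFinalStateStored,
                           boolean amContainerFinishReported,
                           long msSinceRecovery,
                           long expiryIntervalMs) {
        // Scenario 3: every stored attempt already finished, so nothing to wait for.
        if (lastAttemptFinalStateStored) {
            return Decision.START_NEW_ATTEMPT;
        }
        // Scenario 1: the NM reported the previous AM container as finished.
        if (amContainerFinishReported) {
            return Decision.START_NEW_ATTEMPT;
        }
        // Scenario 2: the finish event never arrived, so the liveness timeout decides.
        if (msSinceRecovery >= expiryIntervalMs) {
            return Decision.START_NEW_ATTEMPT;
        }
        return Decision.WAIT;
    }

    public static void main(String[] args) {
        // Previous AM state unknown, no report yet, timeout not reached.
        System.out.println(decide(false, false, 1_000L, 600_000L));
    }
}
```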

> During RM restart, RM should start a new attempt only when previous attempt 
> exits for real
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-1210
>                 URL: https://issues.apache.org/jira/browse/YARN-1210
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Omkar Vinit Joshi
>         Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch
>
>
> When RM recovers, it can wait for existing AMs to contact RM back and then 
> kill them forcefully before even starting a new AM. Worst case, RM will start 
> a new AppAttempt after waiting for 10 mins (the expiry interval). This way 
> we'll minimize multiple AMs racing with each other. This can help with issues 
> in downstream components like Pig, Hive and Oozie during RM restart.
> In the meantime, new apps will proceed as usual while existing apps wait for 
> recovery.
> This can continue to be useful after work-preserving restart, so that AMs 
> which can properly sync back up with RM can continue to run and those that 
> don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
