[ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862436#comment-13862436
 ] 

Bikas Saha commented on YARN-1490:
----------------------------------

Having more than one active attempt object will introduce even more race 
conditions, problems, and general debugging nightmares. Will we stop at 2 
attempts, or will we have n attempts, say when attempts fail rapidly one after 
another?
Is there some way we can fix the race conditions mentioned in the scheduler?

Can we have the old attempt (now in a terminal state) just hold onto the 
events and do nothing with them? When the new attempt becomes fully functional 
(such that the routing has fully transferred to it), it can pull the 
saved events from all previous attempts, process them, and move ahead as 
normal.
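
To make the idea above concrete, here is a minimal sketch, assuming events can 
be modelled as plain strings. The class and method names (AttemptSketch, 
markTerminal, adoptSavedEventsFrom) are hypothetical and are not the actual 
RMAppAttempt API; the sketch only shows a terminal attempt parking events and 
a later attempt draining them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for an app attempt; not the real RMAppAttemptImpl.
final class AttemptSketch {
    private final Queue<String> savedEvents = new ArrayDeque<>();
    private final List<String> processed = new ArrayList<>();
    private boolean terminal = false;

    // The attempt has reached a terminal state and must not act on events.
    void markTerminal() {
        terminal = true;
    }

    // A terminal attempt only parks the event; an active one processes it.
    void handle(String event) {
        if (terminal) {
            savedEvents.add(event);
        } else {
            processed.add(event);
        }
    }

    // Called on the new attempt once routing has fully transferred to it:
    // drain the saved events of every previous attempt, oldest attempt first.
    void adoptSavedEventsFrom(List<AttemptSketch> previousAttempts) {
        for (AttemptSketch old : previousAttempts) {
            String event;
            while ((event = old.savedEvents.poll()) != null) {
                handle(event);
            }
        }
    }

    List<String> processedEvents() {
        return processed;
    }

    int savedEventCount() {
        return savedEvents.size();
    }
}
```

With this shape only one attempt ever acts on events, so the number of failed 
attempts before it does not matter.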

Can the data that we want to share across app attempts in this patch be moved 
to the app itself? That is, no matter which attempt receives the event, it will 
save it in the app. The current active attempt will pull it from the app. This 
is similar to some of the failure info that we store in the RMApp, because an 
app persists across attempts.
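
A rough sketch of this app-level alternative, again with events modelled as 
strings. AppSketch and AppAttemptSketch are hypothetical names chosen for 
illustration, not the real RMApp or RMAppAttempt classes; the point is only 
that every attempt writes to the same app-owned store and the active attempt 
reads from it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical app-level store; it outlives any individual attempt.
final class AppSketch {
    private final List<String> sharedEvents = new ArrayList<>();
    private int nextUnread = 0;

    // Any attempt, old or new, records the event here.
    synchronized void save(String event) {
        sharedEvents.add(event);
    }

    // Returns the events saved since the previous pull, in arrival order.
    synchronized List<String> pullNew() {
        List<String> batch =
            new ArrayList<>(sharedEvents.subList(nextUnread, sharedEvents.size()));
        nextUnread = sharedEvents.size();
        return batch;
    }
}

// Hypothetical attempt that keeps no cross-attempt state of its own.
final class AppAttemptSketch {
    private final AppSketch app;

    AppAttemptSketch(AppSketch app) {
        this.app = app;
    }

    // Whichever attempt receives the event just forwards it to the app.
    void onEvent(String event) {
        app.save(event);
    }

    // Only the current active attempt is expected to call this.
    List<String> pullFromApp() {
        return app.pullNew();
    }
}
```

Because the store lives in the app, the new attempt never needs a reference to 
the old attempt objects.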
 

> RM should optionally not kill all containers when an ApplicationMaster exits
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1490
>                 URL: https://issues.apache.org/jira/browse/YARN-1490
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Jian He
>         Attachments: YARN-1490.1.patch, YARN-1490.2.patch, YARN-1490.3.patch
>
>
> This is needed to enable work-preserving AM restart. Some apps can choose to 
> reconnect with old running containers, some may not want to. This should be 
> an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
