[
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862436#comment-13862436
]
Bikas Saha commented on YARN-1490:
----------------------------------
Having more than one active attempt object will create even more race
conditions, problems, and a general debugging nightmare. Will we stop at 2
attempts, or will we have n attempts, say when attempts fail rapidly one after
another?
Is there some way we can fix the race conditions mentioned in the scheduler?
Can we have the old attempt (now in a terminal state) just hold onto the
events and do nothing with them? When the new attempt becomes fully functional
(such that the routing has fully transferred to it), it can pull the saved
events from all previous attempts, process them, and move ahead as normal.
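
A minimal sketch of the hold-and-hand-over idea above, not taken from the
patch. The class AttemptSketch, its methods, and the string events are all
hypothetical stand-ins for the real attempt and event types, and the sketch
ignores threading and the RM's event dispatcher entirely.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch; class and method names are illustrative, not the RM's API.
final class AttemptSketch {
    private final Queue<String> held = new ArrayDeque<>();    // events parked while terminal
    private final List<String> processed = new ArrayList<>(); // events actually acted on
    private boolean terminal = false;

    void markTerminal() { terminal = true; }

    // A terminal attempt only stores the event; a live attempt processes it.
    void handle(String event) {
        if (terminal) {
            held.add(event);
        } else {
            processed.add(event);
        }
    }

    // Called once routing has fully moved to this attempt: drain what every
    // previous attempt was holding, in arrival order, and process it here.
    void takeOverFrom(List<AttemptSketch> previousAttempts) {
        for (AttemptSketch old : previousAttempts) {
            String event;
            while ((event = old.held.poll()) != null) {
                handle(event);
            }
        }
    }

    List<String> processed() { return processed; }
    int heldCount() { return held.size(); }
}
```

With this shape only one attempt ever acts on events at a time; older attempts
are reduced to passive buffers until the new one drains them.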
Can the data that we want to share across app attempts in this patch be moved
to the app itself? That is, no matter which attempt receives the event, it will
save it in the app. The current active attempt will pull it from the app. This
is similar to some of the failure info that we store in the RMApp, because an
app persists across attempts.
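
A rough sketch of the app-level alternative, under the same caveats.
SharedApp and AppLevelAttempt are hypothetical names, not the real RMApp or
RMAppAttempt classes, and the events are plain strings for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these are not the real RMApp / RMAppAttempt classes.
final class SharedApp {
    private final List<String> pending = new ArrayList<>(); // saved, not yet pulled
    private AppLevelAttempt current;                         // the single active attempt

    synchronized AppLevelAttempt newAttempt() {
        current = new AppLevelAttempt(this);
        return current;
    }

    // Any attempt, active or not, may save an event in the app.
    synchronized void save(String event) { pending.add(event); }

    // Only the current attempt gets the saved events; others get nothing.
    synchronized List<String> pullPending(AppLevelAttempt caller) {
        List<String> out = new ArrayList<>();
        if (caller == current) {
            out.addAll(pending);
            pending.clear();
        }
        return out;
    }
}

final class AppLevelAttempt {
    private final SharedApp app;
    private final List<String> processed = new ArrayList<>();

    AppLevelAttempt(SharedApp app) { this.app = app; }

    // Whichever attempt receives the event stores it in the app first,
    // then tries to pull; the pull is a no-op unless this attempt is current.
    void receive(String event) {
        app.save(event);
        pull();
    }

    void pull() { processed.addAll(app.pullPending(this)); }

    List<String> processed() { return processed; }
}
```

Here the attempts hold no shared state of their own, so there is nothing to
transfer between them when one is replaced.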
> RM should optionally not kill all containers when an ApplicationMaster exits
> ----------------------------------------------------------------------------
>
> Key: YARN-1490
> URL: https://issues.apache.org/jira/browse/YARN-1490
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Jian He
> Attachments: YARN-1490.1.patch, YARN-1490.2.patch, YARN-1490.3.patch
>
>
> This is needed to enable work-preserving AM restart. Some apps can choose to
> reconnect with old running containers, some may not want to. This should be
> an option.