[
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818618#comment-13818618
]
Jian He commented on YARN-1210:
-------------------------------
- Please revert all files with import-only changes, CotainerLauncher etc.
- revert NodeManager, TestNodeManagerResync changes.
- rename RegisterNodeManagerRequest.addAllContainerStatuses to
setContainerStatuses to conform with convention namings.
- Bug in RegisterNodeManagerRequestPBImpl.addAllContainerStatuses(), you are
appending the containers to the existing container list instead of set. also no
need to call initFinishedContainers() inside. Can you see what
NodeStatusPBImpl.(get/set)ContainerStatuses() is doing?
- RMAppAttemptImpl.getRecoveredFinalState not used, removed.
- wrong comment, am2 attempt should be at Launched state. There's one more such
same wrong comment at line 537
{code}
// am1 attempt should be in FAILED state where as am2 attempt should be in
// RUNNING state
{code}
- waiting for 20 secs is too long for a unit test, can you pick a value as
small as possible ?
{code} // Setting AMLivelinessMonitor interval to be 20 Secs.
{code}
- no need to create new method waitForAppAttemptToExpire(), we can just use
MockRM.waitForState()
{code}
app.currentAttempt.handle(new RMAppAttemptEvent(app.currentAttempt
.getAppAttemptId(), RMAppAttemptEventType.RECOVER));
{code}
- similarly for newly created methods, waitForRMToProcessAllEvents,
waitForRMAppAttempts can be achieved by just waiting for the specific app or
attempt state. That achieves the same result as waiting for all events get
processed.
- For the 3rd case, shouldn't we test that as you said in the comment "all the
stored attempts had finished then new attempt should be started immediately"
{code}
// The 4th attempt has started but is not yet saved into RMStateStore
// It will be saved only when we launch AM.
eventQueue = new LinkedBlockingQueue<Event>();
final AsyncDispatcher dispatcher3 = new AsyncDispatcher(eventQueue);
{code}
> During RM restart, RM should start a new attempt only when previous attempt
> exits for real
> ------------------------------------------------------------------------------------------
>
> Key: YARN-1210
> URL: https://issues.apache.org/jira/browse/YARN-1210
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Omkar Vinit Joshi
> Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch,
> YARN-1210.4.patch, YARN-1210.4.patch
>
>
> When RM recovers, it can wait for existing AMs to contact RM back and then
> kill them forcefully before even starting a new AM. Worst case, RM will start
> a new AppAttempt after waiting for 10 mins ( the expiry interval). This way
> we'll minimize multiple AMs racing with each other. This can help issues with
> downstream components like Pig, Hive and Oozie during RM restart.
> In the mean while, new apps will proceed as usual as existing apps wait for
> recovery.
> This can continue to be useful after work-preserving restart, so that AMs
> which can properly sync back up with RM can continue to run and those that
> don't are guaranteed to be killed before starting a new attempt.
--
This message was sent by Atlassian JIRA
(v6.1#6144)