[
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813500#comment-13813500
]
Jian He commented on YARN-1210:
-------------------------------
- Instead of passing the running containers as a parameter in
RegisterNodeManagerRequest, is it possible to just call heartBeat immediately
after registerCall and then unBlockNewContainerRequests? That way we can take
advantage of the existing heartbeat logic and cover other things, like keeping
the app alive for log aggregation after the AM container completes.
-- Or at least, can we send the list of ContainerStatus (including diagnostics)
instead of just container IDs, and also the list of keep-alive apps (separate
jira)?
- Unnecessary import changes in DefaultContainerExecutor.java,
LinuxContainerExecutor, ContainerLaunch, and ContainersLauncher.
- Finished containers may not necessarily have been killed. Containers can also
finish normally and remain in the NM cache before the NM resyncs.
{code}
RMAppAttemptContainerFinishedEvent evt =
    new RMAppAttemptContainerFinishedEvent(appAttemptId,
        ContainerStatus.newInstance(cId, ContainerState.COMPLETE,
            "Killed due to RM restart",
            ExitCode.FORCE_KILLED.getExitCode()));
{code}
- Wrong class name in the LOG declaration.
{code}
private static final Log LOG = LogFactory.getLog(RMAppImpl.class);
{code}
- Isn't it always the case that, after this patch, only the last attempt can be
running? A new attempt will not be launched until the previous attempt reports
back that it has really exited. If this is the case, it can be a bug.
We may only need to check whether the last attempt has finished.
{code}
// check if any application attempt was running
// if yes then don't start new application attempt.
for (Entry<ApplicationAttemptId, RMAppAttempt> attempt : app.attempts
    .entrySet()) {
  boolean appAttemptInFinalState =
      RMAppAttemptImpl.isAttemptInFinalState(attempt.getValue());
  LOG.info("attempt :" + attempt.getKey().toString()
      + " in final state :" + appAttemptInFinalState);
  if (!appAttemptInFinalState) {
    // One of the application attempt is not in final state.
    // Not starting new application attempt.
    return RMAppState.RUNNING;
  }
}
{code}
- Should we return RUNNING or ACCEPTED for apps that are not in a final state?
It's OK to return RUNNING in the scope of this patch because we are launching
a new attempt anyway. Later on, with work-preserving restart, the RM can crash
before an attempt registers, and the attempt can register with the RM after
the RM comes back, in which case we can then move the app from ACCEPTED to
RUNNING?
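
To illustrate the "only check the last attempt" suggestion above, here is a minimal sketch using simplified stand-in types. AttemptState, the integer attempt ids, and canStartNewAttempt are hypothetical names for illustration only, not the actual RMAppImpl or RMAppAttemptImpl API. It assumes attempts are created with increasing ids, so the highest id is the most recent attempt.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class LastAttemptCheck {
    // Hypothetical, simplified states; not the real RMAppAttemptState enum.
    enum AttemptState { LAUNCHED, RUNNING, FINISHED, FAILED, KILLED }

    // An attempt is final once it can no longer be running anywhere.
    static boolean isFinal(AttemptState state) {
        return state == AttemptState.FINISHED
            || state == AttemptState.FAILED
            || state == AttemptState.KILLED;
    }

    // Looks only at the most recent attempt (highest id) instead of
    // iterating over every attempt of the application.
    static boolean canStartNewAttempt(NavigableMap<Integer, AttemptState> attempts) {
        Map.Entry<Integer, AttemptState> last = attempts.lastEntry();
        // No attempts recorded yet: nothing to wait for.
        return last == null || isFinal(last.getValue());
    }
}
```

With attempts {1: FAILED, 2: RUNNING} this returns false, so no new attempt would be started until attempt 2 reaches a final state. If an earlier attempt could still be non-final while the last one is final, that would point to the bug discussed above, which this sketch does not try to detect.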
> During RM restart, RM should start a new attempt only when previous attempt
> exits for real
> ------------------------------------------------------------------------------------------
>
> Key: YARN-1210
> URL: https://issues.apache.org/jira/browse/YARN-1210
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Omkar Vinit Joshi
> Attachments: YARN-1210.1.patch, YARN-1210.2.patch
>
>
> When RM recovers, it can wait for existing AMs to contact RM back and then
> kill them forcefully before even starting a new AM. Worst case, RM will start
> a new AppAttempt after waiting for 10 mins ( the expiry interval). This way
> we'll minimize multiple AMs racing with each other. This can help issues with
> downstream components like Pig, Hive and Oozie during RM restart.
> In the mean while, new apps will proceed as usual as existing apps wait for
> recovery.
> This can continue to be useful after work-preserving restart, so that AMs
> which can properly sync back up with RM can continue to run and those that
> don't are guaranteed to be killed before starting a new attempt.