[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814496#comment-13814496 ]
Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------
Thanks [~jianhe] for reviewing it.
{code}
Instead of passing running containers as parameter in
RegisterNodeManagerRequest, is it possible to just call heartBeat immediately
after registerCall and then unBlockNewContainerRequests ? That way we can take
advantage of the existing heartbeat logic, cover other things like keep app
alive for log aggregation after AM container completes.
Or at least we can send the list of ContainerStatus(including diagnostics)
instead of just container Ids and also the list of keep-alive apps (separate
jira)?
{code}
It makes sense to replace finishedContainers with containerStatuses.
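Roughly the shape of the change I have in mind (a sketch only; the method names below are illustrative, not the exact API):
{code}
// Sketch only: RegisterNodeManagerRequest carrying full ContainerStatus
// objects (with diagnostics and exit codes) instead of bare ContainerIds.
// Method names are illustrative, not the final API.
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public abstract class RegisterNodeManagerRequest {
  // previously: List<ContainerId> getFinishedContainers();
  public abstract List<ContainerStatus> getContainerStatuses();
  public abstract void setContainerStatuses(List<ContainerStatus> statuses);
}
{code}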
bq. Unnecessary import changes in DefaultContainerExecutor.java and
LinuxContainerExecutor, ContainerLaunch, ContainersLauncher
Actually, I wanted those earlier because I had created a new ExitCode.java and wanted to access it from ResourceTrackerService. Now that we are sending container statuses from the node manager itself, that is no longer needed. Fixed it.
bq. Finished containers may not necessary be killed. The containers can also
normal finish and remain in the NM cache before NM resync.
Updated the logic for cleanupContainers on the node manager side. Now we keep all the finished container statuses as they are, whether the container was killed or finished normally.
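Something along these lines (a sketch against the NM Context, not the exact patched code; the helper name is mine):
{code}
// Sketch only: on resync, report every completed container's status as-is
// (normal exit or kill), keeping its diagnostics and exit code untouched.
// Assumes the NM Context and the usual YARN record classes.
private List<ContainerStatus> getFinishedContainerStatuses() {
  List<ContainerStatus> finished = new ArrayList<ContainerStatus>();
  for (Container container : this.context.getContainers().values()) {
    ContainerStatus status = container.cloneAndGetContainerStatus();
    if (status.getState() == ContainerState.COMPLETE) {
      finished.add(status);
    }
  }
  return finished;
}
{code}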
bq. wrong LOG class name.
:) fixed it.
bq. LogFactory.getLog(RMAppImpl.class);
removed.
bq. Isn't always the case that after this patch only the last attempt can be
running ? a new attempt will not be launched until the previous attempt reports
back it really exits. If this is case, it can be a bug.
We may only need to check that if the last attempt is finished or not.
It is actually checking whether any attempt is in a non-running state. Do you want me to check only the last attempt (by comparing application attempt ids)?
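To make the two options concrete (a rough sketch; the helper names are mine, not the patch):
{code}
// Sketch only: the two checks being discussed. Assumes RMApp/RMAppAttempt
// from the resourcemanager rmapp packages; helper names are illustrative.
boolean anyAttemptNotRunning(RMApp app) {
  for (RMAppAttempt attempt : app.getAppAttempts().values()) {
    if (attempt.getAppAttemptState() != RMAppAttemptState.RUNNING) {
      return true;
    }
  }
  return false;
}

boolean lastAttemptNotRunning(RMApp app) {
  RMAppAttempt last = app.getCurrentAppAttempt();
  return last != null
      && last.getAppAttemptState() != RMAppAttemptState.RUNNING;
}
{code}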
bq. should we return RUNNING or ACCEPTED for apps that are not in final state ? It's ok to return RUNNING in the scope of this patch because anyways we are launching a new attempt. Later on in work preserving restart, RM can crash before attempt register, attempt can register with RM after RM comes back in which case we can then move app from ACCEPTED to RUNNING?
Yes, right now I will keep it as RUNNING only. Today we don't have any information about whether the previous application master started and registered or not. Once we have that information, we can probably do this.
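For illustration, the decision boils down to something like this (a sketch with a made-up helper name, not the patch itself):
{code}
// Sketch only: report RUNNING for any recovered app that is not in a final
// state, since we cannot yet tell whether the previous AM ever registered.
YarnApplicationState reportRecoveredState(RMApp app) {
  RMAppState state = app.getState();
  if (state == RMAppState.FINISHED || state == RMAppState.FAILED
      || state == RMAppState.KILLED) {
    return app.createApplicationState();  // keep the real final state
  }
  // Refine to ACCEPTED vs RUNNING once AM-registration info survives restart.
  return YarnApplicationState.RUNNING;
}
{code}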
> During RM restart, RM should start a new attempt only when previous attempt
> exits for real
> ------------------------------------------------------------------------------------------
>
> Key: YARN-1210
> URL: https://issues.apache.org/jira/browse/YARN-1210
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Omkar Vinit Joshi
> Attachments: YARN-1210.1.patch, YARN-1210.2.patch
>
>
> When RM recovers, it can wait for existing AMs to contact RM back and then
> kill them forcefully before even starting a new AM. Worst case, RM will start
> a new AppAttempt after waiting for 10 mins ( the expiry interval). This way
> we'll minimize multiple AMs racing with each other. This can help issues with
> downstream components like Pig, Hive and Oozie during RM restart.
> In the mean while, new apps will proceed as usual as existing apps wait for
> recovery.
> This can continue to be useful after work-preserving restart, so that AMs
> which can properly sync back up with RM can continue to run and those that
> don't are guaranteed to be killed before starting a new attempt.