[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814496#comment-13814496 ]
Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------
Thanks [~jianhe] for reviewing it.
{code}
Instead of passing running containers as parameter in
RegisterNodeManagerRequest, is it possible to just call heartBeat immediately
after registerCall and then unBlockNewContainerRequests ? That way we can take
advantage of the existing heartbeat logic, cover other things like keep app
alive for log aggregation after AM container completes.
Or at least we can send the list of ContainerStatus(including diagnostics)
instead of just container Ids and also the list of keep-alive apps (separate
jira)?
{code}
It makes sense to replace finishedContainers with containerStatuses.
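Roughly the shape of the change I have in mind (a sketch only; the method names below are illustrative, not the exact API):
{code}
// Sketch only: RegisterNodeManagerRequest carrying full ContainerStatus
// objects (with diagnostics and exit codes) instead of bare ContainerIds.
// Method names are illustrative, not the final API.
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public abstract class RegisterNodeManagerRequest {
  // previously: List<ContainerId> getFinishedContainers();
  public abstract List<ContainerStatus> getContainerStatuses();
  public abstract void setContainerStatuses(List<ContainerStatus> statuses);
}
{code}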
bq. Unnecessary import changes in DefaultContainerExecutor.java and
LinuxContainerExecutor, ContainerLaunch, ContainersLauncher
Actually, I wanted those earlier because I had created a new ExitCode.java and wanted to access it from ResourceTrackerService. Now that we are sending container statuses from the node manager itself, that is no longer needed. Fixed it.
bq. Finished containers may not necessary be killed. The containers can also
normal finish and remain in the NM cache before NM resync.
Updated the logic for cleanupContainers on the node manager side. Now we keep all the finished container statuses as they are, whether the container was killed or finished normally.
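Something along these lines (a sketch against the NM Context, not the exact patched code; the helper name is mine):
{code}
// Sketch only: on resync, report every completed container's status as-is
// (normal exit or kill), keeping its diagnostics and exit code untouched.
// Assumes the NM Context and the usual YARN record classes.
private List<ContainerStatus> getFinishedContainerStatuses() {
  List<ContainerStatus> finished = new ArrayList<ContainerStatus>();
  for (Container container : this.context.getContainers().values()) {
    ContainerStatus status = container.cloneAndGetContainerStatus();
    if (status.getState() == ContainerState.COMPLETE) {
      finished.add(status);
    }
  }
  return finished;
}
{code}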
bq. wrong LOG class name.
:) fixed it.
bq. LogFactory.getLog(RMAppImpl.class);
removed.
bq. Isn't always the case that after this patch only the last attempt can be
running ? a new attempt will not be launched until the previous attempt reports
back it really exits. If this is case, it can be a bug.
We may only need to check that if the last attempt is finished or not.
It is actually checking whether any attempt is in a non-running state. Do you want me to check only the last attempt (by comparing application attempt ids)?
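To make the two options concrete (a rough sketch; the helper names are mine, not the patch):
{code}
// Sketch only: the two checks being discussed. Assumes RMApp/RMAppAttempt
// from the resourcemanager rmapp packages; helper names are illustrative.
boolean anyAttemptNotRunning(RMApp app) {
  for (RMAppAttempt attempt : app.getAppAttempts().values()) {
    if (attempt.getAppAttemptState() != RMAppAttemptState.RUNNING) {
      return true;
    }
  }
  return false;
}

boolean lastAttemptNotRunning(RMApp app) {
  RMAppAttempt last = app.getCurrentAppAttempt();
  return last != null
      && last.getAppAttemptState() != RMAppAttemptState.RUNNING;
}
{code}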
bq. should we return RUNNING or ACCEPTED for apps that are not in final state ? It's ok to return RUNNING in the scope of this patch because anyways we are launching a new attempt. Later on in work preserving restart, RM can crash before attempt register, attempt can register with RM after RM comes back in which case we can then move app from ACCEPTED to RUNNING?
Yes, right now I will keep it as RUNNING only. Today we don't have any information about whether the previous application master started and registered or not. Once we have that information, we can probably do this.
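For illustration, the decision boils down to something like this (a sketch with a made-up helper name, not the patch itself):
{code}
// Sketch only: report RUNNING for any recovered app that is not in a final
// state, since we cannot yet tell whether the previous AM ever registered.
YarnApplicationState reportRecoveredState(RMApp app) {
  RMAppState state = app.getState();
  if (state == RMAppState.FINISHED || state == RMAppState.FAILED
      || state == RMAppState.KILLED) {
    return app.createApplicationState();  // keep the real final state
  }
  // Refine to ACCEPTED vs RUNNING once AM-registration info survives restart.
  return YarnApplicationState.RUNNING;
}
{code}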
> During RM restart, RM should start a new attempt only when previous attempt
> exits for real
> ------------------------------------------------------------------------------------------
>
> Key: YARN-1210
> URL: https://issues.apache.org/jira/browse/YARN-1210
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Omkar Vinit Joshi
> Attachments: YARN-1210.1.patch, YARN-1210.2.patch
>
>
> When RM recovers, it can wait for existing AMs to contact RM back and then
> kill them forcefully before even starting a new AM. Worst case, RM will start
> a new AppAttempt after waiting for 10 mins ( the expiry interval). This way
> we'll minimize multiple AMs racing with each other. This can help issues with
> downstream components like Pig, Hive and Oozie during RM restart.
> In the mean while, new apps will proceed as usual as existing apps wait for
> recovery.
> This can continue to be useful after work-preserving restart, so that AMs
> which can properly sync back up with RM can continue to run and those that
> don't are guaranteed to be killed before starting a new attempt.