[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

Jian He (JIRA) Mon, 04 Nov 2013 17:04:38 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813500#comment-13813500
 ]


Jian He commented on YARN-1210:
-------------------------------

- Instead of passing the running containers as a parameter in 
RegisterNodeManagerRequest, is it possible to just call heartBeat immediately 
after the register call and then unBlockNewContainerRequests? That way we can 
reuse the existing heartbeat logic and cover other cases, such as keeping an 
app alive for log aggregation after the AM container completes.
 -- Or at least, can we send the list of ContainerStatus (including 
diagnostics) instead of just container IDs, and also the list of keep-alive 
apps (separate JIRA)?
- Unnecessary import changes in DefaultContainerExecutor.java, 
LinuxContainerExecutor, ContainerLaunch, and ContainersLauncher.
- Finished containers may not necessarily have been killed. Containers can 
also finish normally and remain in the NM cache before the NM resyncs.
{code}
RMAppAttemptContainerFinishedEvent evt =
    new RMAppAttemptContainerFinishedEvent(appAttemptId,
        ContainerStatus.newInstance(cId, ContainerState.COMPLETE,
            "Killed due to RM restart",
            ExitCode.FORCE_KILLED.getExitCode()));
{code}
- Wrong LOG class name:
{code}
private static final Log LOG = LogFactory.getLog(RMAppImpl.class);
{code}

- Isn't it always the case that, after this patch, only the last attempt can 
be running? A new attempt will not be launched until the previous attempt 
reports back that it has really exited. If this is the case, it can be a bug.
We may only need to check whether the last attempt is finished.
{code}
        // check if any application attempt was running
        // if yes then don't start new application attempt.
        for (Entry<ApplicationAttemptId, RMAppAttempt> attempt : app.attempts
            .entrySet()) {
          boolean appAttemptInFinalState =
              RMAppAttemptImpl.isAttemptInFinalState(attempt.getValue());
          LOG.info("attempt :" + attempt.getKey().toString()
              + " in final state :" + appAttemptInFinalState);
          if (!appAttemptInFinalState) {
            // One of the application attempt is not in final state.
            // Not starting new application attempt.
            return RMAppState.RUNNING;
          }
        }
{code}
- Should we return RUNNING or ACCEPTED for apps that are not in a final state? 
It's OK to return RUNNING in the scope of this patch because we are launching 
a new attempt anyway. Later on, with work-preserving restart, the RM can crash 
before the attempt registers, and the attempt can register with the RM after 
the RM comes back, in which case we can then move the app from ACCEPTED to 
RUNNING?
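
To illustrate the "check only the last attempt" suggestion above, here is a minimal, self-contained sketch. Everything in it (LastAttemptCheck, AttemptState, isFinal, canStartNewAttempt) is a hypothetical stand-in and not the actual RMAppImpl / RMAppAttemptImpl code. It also assumes attempts are kept in creation order, so the last map entry is the most recent attempt.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; these types are NOT the real YARN classes.
final class LastAttemptCheck {

    // Simplified stand-in for the attempt state machine.
    enum AttemptState { RUNNING, FINISHED, FAILED, KILLED }

    // An attempt is final once it can no longer make progress.
    static boolean isFinal(AttemptState state) {
        return state == AttemptState.FINISHED
            || state == AttemptState.FAILED
            || state == AttemptState.KILLED;
    }

    // Assumption: the map iterates in creation order, so the last
    // value seen belongs to the most recent attempt.
    static boolean canStartNewAttempt(Map<Integer, AttemptState> attempts) {
        AttemptState last = null;
        for (AttemptState state : attempts.values()) {
            last = state;
        }
        // No attempts yet, or the latest one has really exited.
        return last == null || isFinal(last);
    }

    public static void main(String[] args) {
        Map<Integer, AttemptState> attempts = new LinkedHashMap<>();
        attempts.put(1, AttemptState.FAILED);
        attempts.put(2, AttemptState.RUNNING);
        System.out.println("can start new attempt: "
            + canStartNewAttempt(attempts));
    }
}
```

Unlike the loop quoted earlier, this looks at a single attempt, which is only equivalent if earlier attempts can never still be running.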

> During RM restart, RM should start a new attempt only when previous attempt 
> exits for real
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-1210
>                 URL: https://issues.apache.org/jira/browse/YARN-1210
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Omkar Vinit Joshi
>         Attachments: YARN-1210.1.patch, YARN-1210.2.patch
>
>
> When RM recovers, it can wait for existing AMs to contact RM back and then 
> kill them forcefully before even starting a new AM. Worst case, RM will start 
> a new AppAttempt after waiting for 10 mins ( the expiry interval). This way 
> we'll minimize multiple AMs racing with each other. This can help issues with 
> downstream components like Pig, Hive and Oozie during RM restart.
> In the mean while, new apps will proceed as usual as existing apps wait for 
> recovery.
> This can continue to be useful after work-preserving restart, so that AMs 
> which can properly sync back up with RM can continue to run and those that 
> don't are guaranteed to be killed before starting a new attempt.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
