[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133367#comment-15133367
 ] 

Vinod Kumar Vavilapalli commented on YARN-3998:
-----------------------------------------------

Thanks for working on this [~hex108]!

I've been meaning to file this JIRA with a slightly larger scope, but glad to 
see this moving.

The following is a list of additional things that I wanted in this feature. 
But if you / [~vvasudev] want to make progress on the current patch as it 
stands, I can file additional tickets as needed.

h4. Unification with AM restart policies
 - We should try to unify the retry-policies you are adding with the ones for 
the AM container / app-attempt itself: see 
{{ApplicationSubmissionContext.getMaxAppAttempts()}} / 
{{getAttemptFailuresValidityInterval()}}. I like your policy framework, so 
maybe (in a followup?) we can use the same framework for AMs. /cc [~xgong]
 - To avoid containers crashing and restarting in no time, we should have a 
global min-retry-interval. See YARN-3669 which does the same for AMs.
 - Also, similar to AM restarts, we should support a sliding window of restarts 
so as to forget about very old failures instead of accumulating them forever.
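
A minimal sketch of how a sliding window and a minimum retry interval could combine in one policy object. The class name, fields, and method names here are hypothetical and are not taken from the patch or from any existing YARN API; timestamps are passed in explicitly so the logic is easy to test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch only; not part of the YARN-3998 patch or any Hadoop API. */
final class SlidingWindowRetryPolicy {
  private final int maxRetries;           // failures tolerated inside the window
  private final long validityIntervalMs;  // window length; <= 0 disables the window
  private final long minRetryIntervalMs;  // floor on the gap between two launches
  private final Deque<Long> failureTimes = new ArrayDeque<>();

  SlidingWindowRetryPolicy(int maxRetries, long validityIntervalMs,
      long minRetryIntervalMs) {
    this.maxRetries = maxRetries;
    this.validityIntervalMs = validityIntervalMs;
    this.minRetryIntervalMs = minRetryIntervalMs;
  }

  /** Records a failure at nowMs and reports whether a relaunch is still allowed. */
  boolean shouldRetry(long nowMs) {
    if (validityIntervalMs > 0) {
      // Forget failures that have aged out of the window.
      while (!failureTimes.isEmpty()
          && nowMs - failureTimes.peekFirst() > validityIntervalMs) {
        failureTimes.removeFirst();
      }
    }
    failureTimes.addLast(nowMs);
    return failureTimes.size() <= maxRetries;
  }

  /** Time to wait so that two launches are at least minRetryIntervalMs apart. */
  long delayBeforeRelaunch(long lastLaunchMs, long nowMs) {
    long elapsed = nowMs - lastLaunchMs;
    return Math.max(0L, minRetryIntervalMs - elapsed);
  }
}
```

With a window in place, a container that fails once a day never exhausts its retries, while one that fails repeatedly within the window still does.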

h4. Treat relaunch in a first-class manner
 - It's surprising we don't recognize the relaunches in a first-class manner.
 - I really don't like lots of if-else conditions everywhere. So, instead of 
using Container.isRelaunch(), I think we should have
    -- an explicit RELAUNCHING state instead of simply going back to the 
LOCALIZED state,
    -- an additional ContainersLauncherEventType.RELAUNCH_CONTAINER, and
    -- a new {{ContainerRelaunch}} callable which overrides the behavior of 
ContainerLaunch.
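
To illustrate the shape of this suggestion, here is a rough sketch using stand-in types. None of these classes are the real NodeManager classes, and what a relaunch actually skips or reuses is an assumption made only for the example. The point is that the difference between launch and relaunch lives in an overriding subclass selected by event type, rather than in isRelaunch() checks scattered through the launch path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical stand-ins; these are not the NodeManager's own types. */
enum SketchContainerState {
  LOCALIZED, RUNNING, RELAUNCHING, EXITED_WITH_SUCCESS, EXITED_WITH_FAILURE
}

enum SketchLauncherEventType { LAUNCH_CONTAINER, RELAUNCH_CONTAINER }

class SketchContainerLaunch implements Callable<Integer> {
  protected final List<String> steps = new ArrayList<>();

  /** First launch: assumed to create directories and write the launch script. */
  protected void prepare() {
    steps.add("create-dirs");
    steps.add("write-launch-script");
  }

  @Override
  public Integer call() {
    prepare();
    steps.add("exec");
    return 0; // placeholder exit code
  }

  List<String> steps() {
    return steps;
  }
}

class SketchContainerRelaunch extends SketchContainerLaunch {
  /** Relaunch: assumed to reuse what the first launch already set up. */
  @Override
  protected void prepare() {
    steps.add("reuse-dirs-and-script");
  }
}

final class SketchLauncher {
  /** Picks the callable from the event type, so callers need no relaunch flag. */
  static SketchContainerLaunch forEvent(SketchLauncherEventType type) {
    switch (type) {
      case RELAUNCH_CONTAINER:
        return new SketchContainerRelaunch();
      default:
        return new SketchContainerLaunch();
    }
  }

  /** A failed exit moves to RELAUNCHING, not back to LOCALIZED. */
  static SketchContainerState onExit(int exitCode, boolean retryAllowed) {
    if (exitCode == 0) {
      return SketchContainerState.EXITED_WITH_SUCCESS;
    }
    return retryAllowed
        ? SketchContainerState.RELAUNCHING
        : SketchContainerState.EXITED_WITH_FAILURE;
  }
}
```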

h4. Additional things needed during relaunch
 - The relaunch feature needs to work across NM restarts, so we should save the 
retry-context and policy per container into the state-store and reload them to 
continue relaunching after an NM restart.
 - We should also handle restarting of any containers that may have crashed 
during the NM reboot.
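
As a sketch of the persistence idea, the example below stores a per-container retry context on every change and reads it back during recovery. The context fields, the string encoding, and the in-memory map standing in for the NM state-store are all assumptions made for illustration; the real state-store interface is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical retry context; the field names are illustrative only. */
final class SketchRetryContext {
  final int maxRetries;
  final int remainingRetries;
  final long retryIntervalMs;

  SketchRetryContext(int maxRetries, int remainingRetries, long retryIntervalMs) {
    this.maxRetries = maxRetries;
    this.remainingRetries = remainingRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  /** Returns a copy with one fewer retry left, never going below zero. */
  SketchRetryContext afterFailure() {
    return new SketchRetryContext(
        maxRetries, Math.max(0, remainingRetries - 1), retryIntervalMs);
  }

  String encode() {
    return maxRetries + "," + remainingRetries + "," + retryIntervalMs;
  }

  static SketchRetryContext decode(String s) {
    String[] parts = s.split(",");
    return new SketchRetryContext(Integer.parseInt(parts[0]),
        Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
  }
}

/** In-memory stand-in for a persistent store keyed by container ID. */
final class SketchStateStore {
  private final Map<String, String> entries = new HashMap<>();

  void storeRetryContext(String containerId, SketchRetryContext ctx) {
    entries.put(containerId, ctx.encode());
  }

  /** Returns null when nothing was stored for this container. */
  SketchRetryContext loadRetryContext(String containerId) {
    String value = entries.get(containerId);
    return value == null ? null : SketchRetryContext.decode(value);
  }
}
```

Writing the updated context before each relaunch means a restarted NM resumes with the correct number of retries remaining instead of starting the count over.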

h4. Comments on the current approach and patch.
 - The following isn't foolproof and won't work for all apps. Can we just 
persist the selected log-dir in the state-store and read it back from there?
{code}
+   * We apply a simple heuristic to find its previous working directory:
+   * if a good work dir with file pattern '*out' (e.g. stdout) already exists,
{code}
 - The same can be done for the work-dir.
 - In fact, if we end up changing the work-dir during relaunch due to a 
bad-dir, that may break the app. Apps may be reading from / writing into the 
work-dir, and changing it during relaunch may invalidate the application's 
assumptions. Should we just fail the container completely and let the AM deal 
with it?
 - ContainerLaunch.handleContainerExitWithFailure() needs to be handled 
differently during container-relaunches.
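
One way to read the work-dir question above, sketched under assumptions: keep the directory that was persisted at first launch, and if it no longer sits under a healthy local dir, report that so the caller can fail the container instead of silently picking a new one. The class and method names are hypothetical, and how the list of good dirs is obtained is left out.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper; not part of the patch. */
final class SketchRelaunchDirs {
  /**
   * Returns the persisted work dir if it still lies under one of the good
   * local dirs. An empty result means the caller should fail the container
   * rather than move it to a different directory.
   */
  static Optional<String> workDirForRelaunch(String persistedWorkDir,
      List<String> goodLocalDirs) {
    if (persistedWorkDir == null) {
      return Optional.empty();
    }
    for (String good : goodLocalDirs) {
      // Match the dir itself or a path below it, not a sibling like "/data1x".
      if (persistedWorkDir.equals(good)
          || persistedWorkDir.startsWith(good + "/")) {
        return Optional.of(persistedWorkDir);
      }
    }
    return Optional.empty();
  }
}
```

The same lookup would apply to the persisted log-dir, which avoids guessing the previous directory from file patterns.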

I saw other minor things too, but they can wait until these are addressed.

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
