[
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133367#comment-15133367
]
Vinod Kumar Vavilapalli commented on YARN-3998:
-----------------------------------------------
Thanks for working on this [~hex108]!
I've been meaning to file this JIRA with a slightly larger scope, but I'm glad
to see this moving.
Below is a list of additional things that I wanted in this feature.
But if you / [~vvasudev] want to make progress on the current patch as it
stands, I can go file additional tickets as needed.
h4. Unification with AM restart policies
- We should try to unify the retry-policies you are adding with the ones for
the AM container / app-attempt itself: see
{{ApplicationSubmissionContext.getMaxAppAttempts()}} /
{{getAttemptFailuresValidityInterval()}}. I like your policy framework, so
maybe (in a follow-up?) we can use the same framework for AMs.
/cc [~xgong]
- To avoid containers crashing and restarting in no time, we should have a
global min-retry-interval. See YARN-3669 which does the same for AMs.
- Also, similar to AM restarts, we should support a sliding window of restarts
so that very old failures are forgotten instead of accumulating forever.
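To make the last two points concrete, here is a rough sketch of a policy that combines a minimum retry interval with a sliding window of failures. Everything in it ({{SlidingWindowRetryPolicy}}, its fields and method names) is hypothetical and is not taken from the patch or from existing YARN classes; it only illustrates the bookkeeping.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch (not the YARN-3998 patch): tracks container failures
 * inside a sliding window and enforces a floor on the relaunch interval.
 */
final class SlidingWindowRetryPolicy {
  private final int maxRetries;           // failures tolerated inside the window
  private final long validityIntervalMs;  // window length; <= 0 means never forget
  private final long minRetryIntervalMs;  // minimum time between two launches
  private final Deque<Long> failureTimes = new ArrayDeque<>();

  SlidingWindowRetryPolicy(int maxRetries, long validityIntervalMs,
      long minRetryIntervalMs) {
    this.maxRetries = maxRetries;
    this.validityIntervalMs = validityIntervalMs;
    this.minRetryIntervalMs = minRetryIntervalMs;
  }

  /** Records a failure at nowMs and returns true if a relaunch is allowed. */
  boolean recordFailureAndShouldRetry(long nowMs) {
    if (validityIntervalMs > 0) {
      // Forget failures that have slid out of the window.
      while (!failureTimes.isEmpty()
          && nowMs - failureTimes.peekFirst() > validityIntervalMs) {
        failureTimes.pollFirst();
      }
    }
    failureTimes.addLast(nowMs);
    return failureTimes.size() <= maxRetries;
  }

  /** How long to wait before relaunching, given when the last launch began. */
  long delayBeforeRelaunchMs(long lastLaunchMs, long nowMs) {
    long elapsed = nowMs - lastLaunchMs;
    return Math.max(0L, minRetryIntervalMs - elapsed);
  }
}
```

With {{maxRetries = 2}} and a one-second window, a third failure inside the same second stops the relaunches, while a failure arriving after the window has passed is counted as the first one again.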
h4. Treat relaunch in a first-class manner
- It's surprising we don't recognize the relaunches in a first-class manner.
- I really don't like lots of if-else conditions everywhere. So, instead of
using {{Container.isRelaunch()}}, I think we should have
-- An explicit RELAUNCHING state instead of simply going back to the LOCALIZED
state.
-- An additional {{ContainersLauncherEventType.RELAUNCH_CONTAINER}}.
-- A new {{ContainerRelaunch}} callable which overrides the behavior of
{{ContainerLaunch}}.
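The shape of that suggestion could look roughly like the sketch below. It is a toy model only: the enums and the nested {{Launch}} / {{Relaunch}} classes are stand-ins invented for illustration, and the real NM container state machine, {{ContainersLauncherEventType}} and {{ContainerLaunch}} have many more states, events and responsibilities.

```java
import java.util.concurrent.Callable;

/** Hypothetical toy model of a first-class relaunch path; not NM code. */
final class RelaunchSketch {
  enum State { LOCALIZED, RUNNING, RELAUNCHING, EXITED_WITH_FAILURE }
  enum LauncherEvent { LAUNCH_CONTAINER, RELAUNCH_CONTAINER }

  /** Transition taken when a running container exits with a failure. */
  static State onFailure(State current, boolean retryAllowed) {
    if (current != State.RUNNING) {
      throw new IllegalStateException("unexpected state: " + current);
    }
    return retryAllowed ? State.RELAUNCHING : State.EXITED_WITH_FAILURE;
  }

  /** The state alone picks the launcher event; no isRelaunch() checks. */
  static LauncherEvent eventFor(State state) {
    switch (state) {
      case LOCALIZED:
        return LauncherEvent.LAUNCH_CONTAINER;
      case RELAUNCHING:
        return LauncherEvent.RELAUNCH_CONTAINER;
      default:
        throw new IllegalArgumentException("no launch from " + state);
    }
  }

  /** Stand-in for the first launch: does the full setup. */
  static class Launch implements Callable<String> {
    @Override
    public String call() {
      return setUp() + "exec";
    }

    protected String setUp() {
      return "create-dirs;write-launch-script;";
    }
  }

  /** Stand-in for a relaunch: overrides only the setup step. */
  static class Relaunch extends Launch {
    @Override
    protected String setUp() {
      return "reuse-dirs;";
    }
  }
}
```

The point of the subclass is that the differences between launch and relaunch live in one overridden method rather than in branches scattered through the launch path.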
h4. Additional things needed during relaunch
- The relaunch feature needs to work across NM restarts, so we should save the
retry-context and policy per container into the state-store and reload them to
continue relaunching after an NM restart.
- We should also handle restarting of any containers that may have crashed
during the NM reboot.
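As an illustration of what would need to be persisted, the sketch below keeps a small per-container record and round-trips it through a key-value store. The record layout, the key prefix, and both class names are assumptions made for this example; the in-memory map merely stands in for the NM state-store and does not use its actual API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-container record that must survive an NM restart. */
final class RetryContextRecord {
  final int remainingRetries;
  final String workDir;
  final String logDir;

  RetryContextRecord(int remainingRetries, String workDir, String logDir) {
    this.remainingRetries = remainingRetries;
    this.workDir = workDir;
    this.logDir = logDir;
  }

  /** Newline-separated encoding; assumes the paths contain no newlines. */
  String encode() {
    return remainingRetries + "\n" + workDir + "\n" + logDir;
  }

  static RetryContextRecord decode(String encoded) {
    String[] parts = encoded.split("\n", 3);
    return new RetryContextRecord(
        Integer.parseInt(parts[0]), parts[1], parts[2]);
  }
}

/** In-memory stand-in for a persistent key-value state-store. */
final class SketchStateStore {
  private static final String PREFIX = "retry-context/"; // hypothetical key
  private final Map<String, String> entries = new HashMap<>();

  void storeRetryContext(String containerId, RetryContextRecord record) {
    entries.put(PREFIX + containerId, record.encode());
  }

  /** Returns the saved record, or null if the container has none. */
  RetryContextRecord loadRetryContext(String containerId) {
    String encoded = entries.get(PREFIX + containerId);
    return encoded == null ? null : RetryContextRecord.decode(encoded);
  }
}
```

On recovery, the NM would load the record for each container it finds and resume from the saved retry count rather than starting from the original value.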
h4. Comments on the current approach and patch.
- The following isn’t foolproof and won’t work for all apps. Can we just
persist the selected log-dir and read it back from the state-store?
{code}
+ * We apply a simple heuristic to find its previous working directory:
+ * if a good work dir with file pattern '*out' (e.g. stdout) already exists,
{code}
- The same can be done for the work-dir.
- In fact, if we end up changing the work-dir during relaunch due to a
bad-dir, that may break the app. Apps may be reading from / writing into the
work-dir, and changing it during relaunch may invalidate the application's
assumptions. Should we just fail the container completely and let the AM deal
with it?
- {{ContainerLaunch.handleContainerExitWithFailure()}} needs to be handled
differently during container relaunches.
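One possible reading of the "fail the container" option for the work-dir, sketched under assumptions: the relaunch reuses the persisted work-dir only if it still sits under a healthy local-dir root, and otherwise reports that the container should be failed. {{RelaunchDirCheck}} and its method are hypothetical names, and the prefix comparison is a simplification of whatever disk-health check the NM would really use.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: decide whether a relaunch may reuse its work-dir. */
final class RelaunchDirCheck {
  /**
   * Returns the persisted work-dir if it lies under one of the currently
   * healthy local-dir roots. An empty result means the caller should fail
   * the container instead of silently moving it to another directory.
   */
  static Optional<String> workDirForRelaunch(
      String persistedWorkDir, List<String> goodLocalDirs) {
    for (String root : goodLocalDirs) {
      if (persistedWorkDir.equals(root)
          || persistedWorkDir.startsWith(root + "/")) {
        return Optional.of(persistedWorkDir);
      }
    }
    return Optional.empty();
  }
}
```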
I saw other minor things, but they can wait until these are addressed.
> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch,
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM
> launches containers, it can specify the value. The NM will then re-launch the
> container 'retry-times' times when it fails to run (e.g. exit code is not 0).
> This saves a lot of time: it avoids container localization, and the RM does
> not need to re-schedule the container. Local files in the container's working
> directory are left for re-use. (If the container has downloaded some big
> files, it does not need to re-download them when running again.)
> We find it is useful in systems like Storm.