[
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149654#comment-15149654
]
Jun Gong commented on YARN-3998:
--------------------------------
Sorry for the late reply; I was on holiday.
Thanks [~vinodkv] and [~vvasudev] for the suggestions and review!
Some additional thoughts beyond [~vvasudev]'s comments:
{quote}
Unification with AM restart policies
{quote}
I agree with [~vvasudev]. The current AM restart policy retries across
different nodes, whereas this feature retries on the local node. When the RM
launches an AM, it could specify a local retry policy for it.
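
To make the distinction concrete, here is a rough sketch of what a local retry decision might look like on the NM side. It is not taken from the patch: the class name LocalRetryPolicy, the field maxRetries, and the method shouldRelaunch are made up for illustration, and it assumes that any non-zero exit code counts as a failure.

```java
// Hypothetical sketch only; these names do not come from the YARN-3998 patch.
final class LocalRetryPolicy {
    // Maximum number of relaunches allowed on the same node (assumed semantics
    // of the proposed 'retry-times' value).
    private final int maxRetries;
    private int attempts = 0;

    LocalRetryPolicy(int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        this.maxRetries = maxRetries;
    }

    /**
     * Decides whether the container should be relaunched locally.
     * A zero exit code is treated as success and never retried.
     */
    boolean shouldRelaunch(int exitCode) {
        if (exitCode == 0) {
            return false;
        }
        if (attempts >= maxRetries) {
            // Local retries exhausted; the failure would be reported upward,
            // where the cross-node AM restart policy can take over.
            return false;
        }
        attempts++;
        return true;
    }

    int attemptsUsed() {
        return attempts;
    }
}
```

With maxRetries set to 2, two failed runs are relaunched in place and the third failure is reported as a container failure.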
{quote}
Treat relaunch in a first-class manner
{quote}
Glad to see relaunch treated in a first-class manner; I will update the patch.
{quote}
The following isn’t fool-proof and won’t work for all apps, can we just persist
and read the selected log-dir from the state-store?
ContainerLaunch.handleContainerExitWithFailure() needs to be handled differently
during container-relaunches.
The same can be done for the work-dir.
All of these are related. If we store the log dir and work dir in the state
store, we can address all 3 of these.
{quote}
Yes, it would be better to store the log dir and work dir if we aim to make
this more accurate. I was trying to keep the changes for this feature minimal.
{quote}
In fact, if we end up changing the work-dir during relaunch due to a bad-dir,
that may result in a breakage for the app. Apps may be reading from / writing
into the work-dir and changing it during relaunch may invalidate application's
assumptions. Should we just fail the container completely and let the AM deal
with it?
{quote}
My thought is that if the user specifies a retry policy on a container, the
user should make sure that the container can deal with this situation.
{quote}
Instead of removing a line and setting the limit to 10*1000, take the last 'n'
characters in the string where 'n' is a config setting.
{quote}
Removing the last n characters might make the diagnostics inconsistent. Suppose
the diagnostics are “The exception is XXXX” and XXXX is n characters long; the
diagnostics then become “The exception is”. Removing the first or last n lines
has a similar problem. How about removing previous attempts' error information
and keeping only the latest attempt's information?
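
A rough sketch of that last idea follows. It is only an illustration: the class AttemptDiagnostics and its methods are hypothetical and do not exist in the NM code, and it assumes diagnostics are accumulated as plain strings per attempt.

```java
// Hypothetical sketch only; not part of the NodeManager code base.
final class AttemptDiagnostics {
    private int attempt = 0;
    private final StringBuilder latest = new StringBuilder();

    /** Called on each (re)launch: drops whatever the previous attempt logged. */
    void startNewAttempt() {
        attempt++;
        latest.setLength(0);
    }

    /** Appends one complete message, so no message is ever cut in the middle. */
    void append(String message) {
        latest.append(message).append('\n');
    }

    /** Returns the latest attempt's diagnostics, tagged with its attempt number. */
    String render() {
        return "[attempt " + attempt + "] " + latest;
    }
}
```

Since whole attempts are discarded rather than characters or lines, a message like “The exception is XXXX” is either kept in full or dropped in full.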
Glad to see more discussion about the feature.
> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
> Key: YARN-3998
> URL: https://issues.apache.org/jira/browse/YARN-3998
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3998.01.patch, YARN-3998.02.patch,
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM
> launches containers, it could specify the value. The NM will then re-launch the
> container 'retry-times' times when it fails to run (e.g. exit code is not 0).
> This saves a lot of time: it avoids container localization, and the RM does
> not need to re-schedule the container. Local files in the container's working
> directory are also left for re-use. (If the container has downloaded some big
> files, it does not need to re-download them when running again.)
> We find it useful in systems like Storm.