[
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498825#comment-13498825
]
Tom White commented on YARN-128:
--------------------------------
Bikas, this looks good so far. Thanks for working on it. A few comments:
* Is there a race condition in ResourceManager#recover where RMAppImpl#recover
is called after the StartAppAttemptTransition from resubmitting the app? The
problem would be that the earlier app attempts (from before the restart) would
not be the first ones since the new attempt would get in first.
* I think we need the concept of a 'killed' app attempt (when the system is at
fault, not the app) as well as a 'failed' attempt, like we have in MR task
attempts. Without the distinction, a restart will count against the user's app
attempts (default 1 retry), which is undesirable.
* Rather than change the ResourceManager constructor, you could read the
recoveryEnabled flag from the configuration.
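
To make the last two points concrete, here is a rough sketch, not code from the patch. It assumes a hypothetical AttemptState enum with FAILED and KILLED values, a made-up configuration key, and java.util.Properties standing in for the Hadoop Configuration object. The idea is that only FAILED attempts count against the retry limit, and that the recovery flag comes from configuration instead of a constructor argument.

```java
import java.util.List;
import java.util.Properties;

public class AttemptAccounting {

    /** Hypothetical final states: FAILED is the app's fault, KILLED is the system's. */
    enum AttemptState { FAILED, KILLED }

    /** Hypothetical key name, used only for illustration; the real key is not given here. */
    static final String RECOVERY_ENABLED_KEY = "example.rm.recovery.enabled";

    /** Reads the recovery flag from configuration, defaulting to disabled. */
    static boolean isRecoveryEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(RECOVERY_ENABLED_KEY, "false"));
    }

    /** Counts only the attempts that should be charged to the user. */
    static int chargedAttempts(List<AttemptState> attempts) {
        int charged = 0;
        for (AttemptState state : attempts) {
            if (state == AttemptState.FAILED) {
                charged++;
            }
        }
        return charged;
    }

    /** A new attempt is allowed while charged attempts stay below the limit. */
    static boolean canStartNewAttempt(List<AttemptState> attempts, int maxChargedAttempts) {
        return chargedAttempts(attempts) < maxChargedAttempts;
    }
}
```

With this accounting, an attempt lost to an RM restart would be recorded as KILLED and would leave the user's retry budget untouched.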
> Resurrect RM Restart
> ---------------------
>
> Key: YARN-128
> URL: https://issues.apache.org/jira/browse/YARN-128
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.0.0-alpha
> Reporter: Arun C Murthy
> Assignee: Bikas Saha
> Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt,
> RMRestartPhase1.pdf, YARN-128.full-code.3.patch, YARN-128.full-code-4.patch,
> YARN-128.new-code-added.3.patch, YARN-128.new-code-added-4.patch,
> YARN-128.old-code-removed.3.patch, YARN-128.old-code-removed.4.patch,
> YARN-128.patch
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM
> refactor.