[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498825#comment-13498825
 ] 

Tom White commented on YARN-128:
--------------------------------

Bikas, this looks good so far. Thanks for working on it. A few comments:

* Is there a race condition in ResourceManager#recover where RMAppImpl#recover 
is called after the StartAppAttemptTransition from resubmitting the app? The 
problem would be that the earlier app attempts (from before the restart) would 
not be the first ones since the new attempt would get in first.
* I think we need the concept of a 'killed' app attempt (when the system is at 
fault, not the app) as well as a 'failed' attempt, like we have in MR task 
attempts. Without the distinction, a restart will count against the user's app 
attempts (default 1 retry) which is undesirable.
* Rather than change the ResourceManager constructor, you could read the 
recoveryEnabled flag from the configuration.
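
To illustrate the last two points, here is a minimal sketch in plain Java. It 
does not use the actual YARN classes: RecoverySketch, AttemptOutcome, the 
property name "sketch.rm.recovery.enabled", and the helper methods are all 
hypothetical names made up for this example, and java.util.Properties stands 
in for the real configuration object. It only shows the idea that a 'killed' 
attempt is not counted against the retry limit, and that the flag can be read 
from configuration instead of being passed to a constructor.

```java
import java.util.Properties;

public class RecoverySketch {
    /** Hypothetical attempt outcomes; KILLED means the system, not the app, ended the attempt. */
    public enum AttemptOutcome { SUCCEEDED, FAILED, KILLED }

    /** Hypothetical property name, not an actual YARN configuration key. */
    public static final String RECOVERY_ENABLED_KEY = "sketch.rm.recovery.enabled";

    /** Reads the recovery flag from configuration; recovery is off when the key is absent. */
    public static boolean isRecoveryEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(RECOVERY_ENABLED_KEY, "false"));
    }

    /** Counts only FAILED attempts; KILLED and SUCCEEDED attempts are ignored. */
    public static int countedFailures(AttemptOutcome... outcomes) {
        int failures = 0;
        for (AttemptOutcome outcome : outcomes) {
            if (outcome == AttemptOutcome.FAILED) {
                failures++;
            }
        }
        return failures;
    }

    /** Another attempt is allowed while the counted failures do not exceed maxRetries. */
    public static boolean mayRetry(int maxRetries, AttemptOutcome... outcomes) {
        return countedFailures(outcomes) <= maxRetries;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(RECOVERY_ENABLED_KEY, "true");
        System.out.println("recovery enabled: " + isRecoveryEnabled(conf));

        // One attempt killed by a restart, then one genuine failure, with 1 retry allowed.
        System.out.println("may retry: "
            + mayRetry(1, AttemptOutcome.KILLED, AttemptOutcome.FAILED));
    }
}
```

With one retry allowed, an attempt killed by the restart followed by one 
genuine failure still leaves the app a retry in this sketch. If the killed 
attempt were counted as a failure, the app would already be out of attempts.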
                
> Resurrect RM Restart 
> ---------------------
>
>                 Key: YARN-128
>                 URL: https://issues.apache.org/jira/browse/YARN-128
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Bikas Saha
>         Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt, 
> RMRestartPhase1.pdf, YARN-128.full-code.3.patch, YARN-128.full-code-4.patch, 
> YARN-128.new-code-added.3.patch, YARN-128.new-code-added-4.patch, 
> YARN-128.old-code-removed.3.patch, YARN-128.old-code-removed.4.patch, 
> YARN-128.patch
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM 
> refactor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
