[ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035254#comment-15035254
 ] 

Rohith Sharma K S commented on YARN-4401:
-----------------------------------------

bq. if a job is stored with a resource allocation that is higher than the 
configured maximum at the time of recovery, the recovery will throw an 
exception which will prevent the RM from starting.
Which version of Hadoop are you using? This issue is fixed in YARN-3493.

And regarding the patch, the app should never be removed from RMContext at 
any point during recovery; doing so causes an ApplicationNotFoundException 
for the client, which is incorrect. In any case, to continue any flow, the 
patch needs to trigger an appropriate event that completes the state 
transition.
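
To illustrate the point, here is a minimal sketch of a recovery loop that 
keeps a failed app registered and drives it to a terminal state with an 
event, instead of removing it. All names here (RecoverySketch, Context, App, 
AppEventType, RECOVERY_FAILED, recoverAll) are hypothetical stand-ins and not 
the real YARN classes or events; the actual RM routes events through its own 
dispatcher and app state machine, which this does not reproduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these types are real YARN classes.
class RecoverySketch {
    enum AppState { NEW, RUNNING, FAILED }
    enum AppEventType { RECOVER, RECOVERY_FAILED }

    // Stand-in for an application with a tiny state machine.
    static class App {
        final String id;
        final int requestedMemoryMb;
        AppState state = AppState.NEW;
        String diagnostics = "";

        App(String id, int requestedMemoryMb) {
            this.id = id;
            this.requestedMemoryMb = requestedMemoryMb;
        }

        // Each event completes a transition; the app never stays in NEW.
        void handle(AppEventType event, String message) {
            switch (event) {
                case RECOVER:
                    state = AppState.RUNNING;
                    break;
                case RECOVERY_FAILED:
                    state = AppState.FAILED;
                    diagnostics = message;
                    break;
            }
        }
    }

    // Stand-in for the context that clients query for applications.
    static class Context {
        final Map<String, App> apps = new LinkedHashMap<>();
    }

    // Example failure cause: stored request exceeds the configured maximum.
    static void validate(App app, int maxMemoryMb) {
        if (app.requestedMemoryMb > maxMemoryMb) {
            throw new IllegalArgumentException("requested " + app.requestedMemoryMb
                + " MB exceeds maximum " + maxMemoryMb + " MB");
        }
    }

    // Recovers every stored app; returns the ids of apps that failed recovery.
    static List<String> recoverAll(Context ctx, List<App> stored, int maxMemoryMb) {
        List<String> failed = new ArrayList<>();
        for (App app : stored) {
            // Register first, so a client lookup never misses the app.
            ctx.apps.put(app.id, app);
            try {
                validate(app, maxMemoryMb);
                app.handle(AppEventType.RECOVER, "");
            } catch (RuntimeException e) {
                // Log and move the app to FAILED instead of removing it
                // or aborting the whole startup.
                System.err.println("Recovery failed for " + app.id + ": " + e.getMessage());
                app.handle(AppEventType.RECOVERY_FAILED, e.getMessage());
                failed.add(app.id);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        List<App> stored = Arrays.asList(new App("app_1", 2048), new App("app_2", 16384));
        List<String> failed = recoverAll(ctx, stored, 8192);
        System.out.println("failed=" + failed + ", registered=" + ctx.apps.keySet());
    }
}
```

With this shape, a client asking about the failed app would get a FAILED 
report with diagnostics rather than a not-found error, and the loop still 
finishes for the remaining apps.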

> A failed app recovery should not prevent the RM from starting
> -------------------------------------------------------------
>
>                 Key: YARN-4401
>                 URL: https://issues.apache.org/jira/browse/YARN-4401
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-4401.001.patch
>
>
> There are many different reasons why an app recovery could fail with an 
> exception, and any such failure currently aborts the RM start.  Presumably, 
> the reason the RM is trying to do a recovery is that it's the standby trying 
> to fill in for the active.  Failing to come up defeats the purpose of the HA 
> configuration.  Instead of preventing the RM from starting, a failed app 
> recovery should log an error and skip the application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
