[ https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Templeton updated YARN-4401:
-----------------------------------
    Attachment: YARN-4401.001.patch

Here's the basic idea of what I'm proposing.

> A failed app recovery should not prevent the RM from starting
> -------------------------------------------------------------
>
>                 Key: YARN-4401
>                 URL: https://issues.apache.org/jira/browse/YARN-4401
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-4401.001.patch
>
>
> There are many different reasons why an app recovery could fail with an
> exception, causing the RM start to be aborted.  If that happens the RM will
> fail to start.  Presumably, the reason the RM is trying to do a recovery is
> that it's the standby trying to fill in for the active.  Failing to come up
> defeats the purpose of the HA configuration.  Instead of preventing the RM
> from starting, a failed app recovery should log an error and skip the
> application.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
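
For illustration only, the "log an error and skip the application" behavior described in the issue could be sketched roughly as below. This is not the content of YARN-4401.001.patch and it does not use any real YARN classes: `RecoverySketch`, `AppRecoverer`, and `recoverAll` are hypothetical names invented for this example, and application IDs are plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class RecoverySketch {
    private static final Logger LOG =
        Logger.getLogger(RecoverySketch.class.getName());

    /** Hypothetical stand-in for the per-application recovery step. */
    interface AppRecoverer {
        void recover(String appId) throws Exception;
    }

    /**
     * Tries to recover every application. A failure is logged and that
     * application is skipped, so one bad app cannot abort the whole startup.
     * Returns the IDs of the applications that could not be recovered.
     */
    static List<String> recoverAll(List<String> appIds, AppRecoverer recoverer) {
        List<String> skipped = new ArrayList<>();
        for (String appId : appIds) {
            try {
                recoverer.recover(appId);
            } catch (Exception e) {
                // Log the error and move on instead of rethrowing.
                LOG.log(Level.SEVERE,
                    "Failed to recover application " + appId + "; skipping it", e);
                skipped.add(appId);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<String> apps = Arrays.asList("app_1", "app_2", "app_3");
        // Simulate a recovery step that fails for one application only.
        List<String> skipped = recoverAll(apps, appId -> {
            if (appId.equals("app_2")) {
                throw new IllegalStateException("corrupt stored state");
            }
        });
        System.out.println("Skipped: " + skipped);
    }
}
```

In this sketch the loop keeps going after the failure on `app_2`, so `app_3` is still recovered and the caller receives the list of skipped IDs. Whether the real RM should also collect skipped applications for later reporting is an assumption made here, not something the issue states.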