Daniel Templeton commented on YARN-4401:

I suppose I posed my proposal a little naively.  Let's try again.

The reason for configuring HA is to prevent an outage.  It should be possible 
to tell the standby to come up regardless of recovery failures, in effect 
automatically performing the operation that [~sunilg] described, failing the 
bad app(s), or taking some similar action.

The app resource issue I offered was just the first example I (thought I) found 
while skimming the code.  Rather than having to hunt down every possible way to 
throw an exception (checked or unchecked) during recovery, it would be 
convenient to have recovery catch any exception, log it, and do something 
sensible so that the RM can come up for cases where RM availability is a 

> A failed app recovery should not prevent the RM from starting
> -------------------------------------------------------------
>                 Key: YARN-4401
>                 URL: https://issues.apache.org/jira/browse/YARN-4401
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-4401.001.patch
> An app recovery can fail with an exception for many different reasons, and 
> any such failure aborts the RM start.  Presumably, the reason the RM is 
> trying to do a recovery is that it's the standby trying to fill in for the 
> active.  Failing to come up defeats the purpose of the HA configuration.  
> Instead of preventing the RM from starting, a failed app recovery should 
> log an error and skip the application.
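
To make the proposed behavior concrete, here is a minimal sketch of a 
recovery loop that catches any exception per application, logs it, and 
skips that application.  It is not the actual RM code or the attached 
patch: the class RecoverySketch, the AppRecoverer interface, and the 
skipFailedApps flag are all hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class RecoverySketch {
    private static final Logger LOG =
        Logger.getLogger(RecoverySketch.class.getName());

    /** Hypothetical stand-in for whatever recovers a single application. */
    public interface AppRecoverer {
        void recover(String appId) throws Exception;
    }

    /**
     * Tries to recover every application in order.
     *
     * If skipFailedApps is true, any Exception (checked or unchecked) thrown
     * while recovering one app is logged and that app is skipped, so the
     * remaining apps are still recovered. If it is false, the first failure
     * is rethrown and recovery stops.
     *
     * Returns the IDs of the applications that were skipped.
     */
    public static List<String> recoverAll(List<String> appIds,
                                          AppRecoverer recoverer,
                                          boolean skipFailedApps) {
        List<String> skipped = new ArrayList<>();
        for (String appId : appIds) {
            try {
                recoverer.recover(appId);
            } catch (Exception e) {
                if (!skipFailedApps) {
                    // Fail-fast mode: abort startup on the first bad app.
                    throw new IllegalStateException(
                        "Recovery failed for " + appId, e);
                }
                LOG.log(Level.SEVERE,
                    "Failed to recover application " + appId + "; skipping it", e);
                skipped.add(appId);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<String> apps = Arrays.asList("app_1", "app_2", "app_3");
        // Simulate one app whose recovery throws an unchecked exception.
        List<String> skipped = recoverAll(apps, id -> {
            if (id.equals("app_2")) {
                throw new IllegalArgumentException("simulated recovery failure");
            }
        }, true);
        System.out.println("Skipped: " + skipped);
    }
}
```

The flag reflects the idea that skipping should be something an operator 
can opt into when RM availability matters more than recovering every app; 
with the flag off, the sketch keeps the abort-on-failure behavior.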
