[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting

2015-12-02 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035892#comment-15035892
 ] 

Daniel Templeton commented on YARN-4401:


I suppose I posed my proposal a little naively.  Let's try again.

The reason for configuring HA is to prevent an outage.  It should be possible 
to tell the standby to come up regardless of recovery failures, in effect 
automatically performing the operation that [~sunilg] described, failing the 
bad app(s), or taking some other sensible action.

The app resource issue I offered was just the first example I (thought I) found 
while skimming the code.  Rather than having to hunt down every possible way to 
throw an exception (checked or unchecked) during recovery, it would be 
convenient to have recovery catch any exception, log it, and do something 
sensible so that the RM can come up for cases where RM availability is a 
priority.
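
As a rough illustration of the "catch any exception, log it, and carry on" 
idea, here is a minimal sketch.  It is not the attached patch and it does not 
use any real ResourceManager classes: RecoverySketch, AppRecoverer, and 
recoverAll are hypothetical names, and the failFast flag simply stands in for 
whatever setting would say that RM availability is not the priority.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in the YARN code base.
class RecoverySketch {

    // Stand-in for per-application recovery; may throw checked or unchecked exceptions.
    interface AppRecoverer {
        void recover(String appId) throws Exception;
    }

    // Attempts to recover every stored app and returns the IDs that failed.
    // With failFast set, the first failure is rethrown and aborts startup.
    static List<String> recoverAll(List<String> appIds,
                                   AppRecoverer recoverer,
                                   boolean failFast) throws Exception {
        List<String> failed = new ArrayList<>();
        for (String appId : appIds) {
            try {
                recoverer.recover(appId);
            } catch (Exception e) {
                if (failFast) {
                    throw e;
                }
                // Availability takes priority: log the failure and keep going.
                System.err.println("Recovery failed for " + appId + ": " + e);
                failed.add(appId);
            }
        }
        return failed;
    }

    public static void main(String[] args) throws Exception {
        List<String> stored = List.of("app_1", "app_2", "app_3");
        AppRecoverer recoverer = appId -> {
            if (appId.equals("app_2")) {
                throw new IllegalStateException("simulated recovery failure");
            }
        };
        List<String> failed = recoverAll(stored, recoverer, false);
        System.out.println("Startup continued; apps that failed recovery: " + failed);
    }
}
```

What "something sensible" means for the IDs in the returned list is left 
open here; that is the part still under discussion in this issue.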

> A failed app recovery should not prevent the RM from starting
> -
>
> Key: YARN-4401
> URL: https://issues.apache.org/jira/browse/YARN-4401
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-4401.001.patch
>
>
> There are many different reasons why an app recovery could fail with an 
> exception, causing the RM start to be aborted.  If that happens the RM will 
> fail to start.  Presumably, the reason the RM is trying to do a recovery is 
> that it's the standby trying to fill in for the active.  Failing to come up 
> defeats the purpose of the HA configuration.  Instead of preventing the RM 
> from starting, a failed app recovery should log an error and skip the 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting

2015-12-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035492#comment-15035492
 ] 

Sunil G commented on YARN-4401:
---

Hi [~templedf]
I am not very sure about the use case here. However, I feel that if such a 
case occurs, we will have enough information in the logs to get the app-id.
Then we can use the command below to clear such apps if necessary, rather 
than forcefully clearing them from the RMContext.
{noformat}
Usage: yarn resourcemanager [-format-state-store]
[-remove-application-from-state-store <appId>]
{noformat}



[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting

2015-12-01 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035254#comment-15035254
 ] 

Rohith Sharma K S commented on YARN-4401:
-

bq. if a job is stored with a resource allocation that is higher than the 
configured maximum at the time of recovery, the recovery will throw an 
exception which will prevent the RM from starting.
Which version of Hadoop are you using? This issue is fixed in YARN-3493.

Regarding the patch: an app should never be removed from the RMContext at any 
point during recovery, because that causes an ApplicationNotFoundException on 
the client, which is incorrect. In any case, to continue any flow, we need to 
trigger an appropriate event that completes the state transition.
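
To make the point concrete, here is a simplified sketch of the difference 
between removing an app and driving it to a terminal state with an event. 
This is not the real RMAppImpl state machine: AppStateSketch, its two enums, 
and the NEW -> RUNNING / NEW -> FAILED transitions are assumptions chosen only 
for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a toy app registry, not the real RMContext or RMAppImpl.
class AppStateSketch {
    enum State { NEW, RUNNING, FAILED }
    enum Event { RECOVER_OK, RECOVER_FAILED }

    private final Map<String, State> apps = new LinkedHashMap<>();

    void register(String appId) {
        apps.put(appId, State.NEW);
    }

    // A recovery failure moves the app to FAILED; the entry is never removed.
    void handle(String appId, Event event) {
        State current = apps.get(appId);
        if (current != State.NEW) {
            throw new IllegalStateException(
                "Unexpected event " + event + " for " + appId + " in state " + current);
        }
        apps.put(appId, event == Event.RECOVER_OK ? State.RUNNING : State.FAILED);
    }

    // What a client would see when asking about the app.
    State report(String appId) {
        State state = apps.get(appId);
        if (state == null) {
            throw new IllegalArgumentException("Application not found: " + appId);
        }
        return state;
    }

    public static void main(String[] args) {
        AppStateSketch context = new AppStateSketch();
        context.register("app_1");
        context.handle("app_1", Event.RECOVER_FAILED);
        System.out.println("app_1 is " + context.report("app_1"));
    }
}
```

Because the entry stays in the map, a client asking about the app gets a 
FAILED report instead of a not-found error.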



[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting

2015-12-01 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034399#comment-15034399
 ] 

Daniel Templeton commented on YARN-4401:


There are lots of reasons a recovery could fail.  For example, if a job is 
stored with a resource allocation that is higher than the configured maximum at 
the time of recovery, the recovery will throw an exception which will prevent 
the RM from starting.

In a single RM configuration, it makes some sense to allow the RM restart to be 
interrupted by a recovery failure, but in an HA scenario, the standby is 
becoming active to prevent an outage.  Causing an outage over a bad 
application undermines the point of HA.  It becomes a question of trading an 
application failure for a service outage.  I think most sites would choose the 
former.

There's already yarn.fail-fast and yarn.resourcemanager.fail-fast that control 
this behavior for some of the recovery failure scenarios, such as bad queue 
assignments.  I would propose we extend the meaning of those properties to 
cover the full range of what could go wrong during recovery.
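
For illustration, this is roughly how I picture the two properties being 
consulted.  The sketch uses a plain Map rather than the Hadoop Configuration 
class, FailFastSketch and shouldFailFast are hypothetical names, and the 
fallback order (the RM-specific property wins, otherwise the general one, 
otherwise false) is my assumption about the intended semantics.

```java
import java.util.Map;

// Hypothetical sketch: a plain Map stands in for the Hadoop Configuration object.
class FailFastSketch {
    static final String YARN_FAIL_FAST = "yarn.fail-fast";
    static final String RM_FAIL_FAST = "yarn.resourcemanager.fail-fast";

    // Assumed precedence: RM-specific value, then the general value, then false.
    static boolean shouldFailFast(Map<String, String> conf) {
        String general = conf.getOrDefault(YARN_FAIL_FAST, "false");
        return Boolean.parseBoolean(conf.getOrDefault(RM_FAIL_FAST, general));
    }

    public static void main(String[] args) {
        Map<String, String> haSite = Map.of(YARN_FAIL_FAST, "true", RM_FAIL_FAST, "false");
        System.out.println("Abort RM start on recovery failure? " + shouldFailFast(haSite));
    }
}
```

With that in place, a recovery failure would abort the RM start only when 
shouldFailFast returns true.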



[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting

2015-11-30 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033040#comment-15033040
 ] 

Rohith Sharma K S commented on YARN-4401:
-

In an ideal case, app recovery should not fail. If it does fail, then the fix 
should address the cause of the failure. Do you have any specific scenario in 
mind that causes recovery to fail? I am open to being convinced :-)
