[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194810#comment-14194810
 ] 

Karthik Kambatla commented on YARN-2010:
----------------------------------------

bq. Given that RM is starting and renewing the token synchronously, I 
don't quite understand why we have to catch the queue exception and stop RM 
asynchronously via events. I think it's fine to just let the exception 
propagate and let RM stop.
This does not happen only on startup. Transitions to Active also go through 
this path. In HA cases, we would want to transition to standby rather than 
stop, no? 
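
To make the HA point concrete, here is a minimal, self-contained sketch of the 
behavior I have in mind. It is not code from the patch and does not use the 
actual RM classes: TransitionToActiveSketch, HaState, and the recoverApps 
callback are all hypothetical names made up for illustration.

```java
// Hypothetical sketch, not the ResourceManager's actual code.
final class TransitionToActiveSketch {
    // Simplified stand-in for the RM's HA service states.
    enum HaState { STANDBY, ACTIVE, STOPPED }

    private final boolean haEnabled;
    private HaState state = HaState.STANDBY;

    TransitionToActiveSketch(boolean haEnabled) {
        this.haEnabled = haEnabled;
    }

    // recoverApps stands in for app recovery (token renewal, queue placement, ...).
    HaState transitionToActive(Runnable recoverApps) {
        try {
            recoverApps.run();
            state = HaState.ACTIVE;
        } catch (RuntimeException e) {
            // With HA, fall back to standby so a later transition can be retried;
            // without HA there is nothing to fall back to, so stop.
            state = haEnabled ? HaState.STANDBY : HaState.STOPPED;
        }
        return state;
    }

    public static void main(String[] args) {
        Runnable failingRecovery = () -> {
            throw new IllegalStateException("simulated recovery failure");
        };
        System.out.println(new TransitionToActiveSketch(true).transitionToActive(failingRecovery));
        System.out.println(new TransitionToActiveSketch(false).transitionToActive(failingRecovery));
    }
}
```

The only point of the sketch is the catch block: the same failure leaves an HA 
RM in standby, where it can be asked to become Active again, instead of in 
STOPPED.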

bq. After a closer look, a RUNNING app on recovery will move to ACCEPTED state, 
but ACCEPTED state does not actually handle RMAppRejectedEvent.
Good point. What do you think of handling rejection in ACCEPTED as well? 
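
Roughly, the idea would look like the sketch below. This is a simplified, 
hypothetical transition table, not RMAppImpl's real state machine: the State 
and Event enums and the next() helper are assumptions for illustration, and 
the real app lifecycle has more states than shown here.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of an app state machine where ACCEPTED handles rejection.
final class AcceptedRejectionSketch {
    enum State { NEW, ACCEPTED, RUNNING, FAILED }
    enum Event { RECOVER, ATTEMPT_REGISTERED, APP_REJECTED }

    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<State, Map<Event, State>>(State.class);

    static {
        add(State.NEW, Event.RECOVER, State.ACCEPTED);
        add(State.NEW, Event.APP_REJECTED, State.FAILED);
        add(State.ACCEPTED, Event.ATTEMPT_REGISTERED, State.RUNNING);
        // The proposed addition: a recovered app sitting in ACCEPTED can be rejected.
        add(State.ACCEPTED, Event.APP_REJECTED, State.FAILED);
    }

    private static void add(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class))
                   .put(on, to);
    }

    // Returns the next state, or throws if the event is not handled in this state.
    static State next(State current, Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        if (row == null || !row.containsKey(event)) {
            throw new IllegalStateException(
                "Invalid event " + event + " at state " + current);
        }
        return row.get(event);
    }

    public static void main(String[] args) {
        State s = next(State.NEW, Event.RECOVER);   // recovered app lands in ACCEPTED
        s = next(s, Event.APP_REJECTED);            // rejection is now handled
        System.out.println(s);
    }
}
```

Without the ACCEPTED/APP_REJECTED entry, the second call in main would throw 
instead of moving the app to a terminal state, which is the gap pointed out 
above.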

bq. We may still need to move addApplicationSync into RMAppRecoveredTransition.
I am not sure if this is necessarily related to the rest of the patch. It is 
definitely a code improvement. 

> Handle app-recovery failures gracefully
> ---------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, 
> issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch, 
> yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch, 
> yarn-2010-7.patch, yarn-2010-8.patch, yarn-2010-9.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of 
> turning security on, token expiry, or issues connecting to HDFS etc. The 
> causes could be classified into (1) transient, (2) specific to one 
> application, and (3) permanent and applicable to multiple (or all) applications. 
> Today, the RM fails to transition to Active and ends up in STOPPED state and 
> can never be transitioned to Active again.
> The initial stacktrace reported is at 
> https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
