[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195847#comment-14195847
 ] 

Karthik Kambatla commented on YARN-2010:
----------------------------------------

Jian - thanks for taking the time to look at this closely. The patch looks 
mostly good. However, this patch only catches DT renewal issues during app 
recovery. Any other errors that could be encountered in the remaining code 
paths are not handled gracefully. For instance, any errors during 
app.recoverAppAttempts can affect the health of the RM.



> Handle app-recovery failures gracefully
> ---------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, 
> issue-stacktrace.rtf, yarn-2010-10.patch, yarn-2010-11.patch, 
> yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, 
> yarn-2010-5.patch, yarn-2010-6.patch, yarn-2010-7.patch, yarn-2010-8.patch, 
> yarn-2010-9.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of 
> turning security on, token expiry, or issues connecting to HDFS etc. The 
> causes could be classified into (1) transient, (2) specific to one 
> application, and (3) permanent and apply to multiple (all) applications. 
> Today, the RM fails to transition to Active and ends up in STOPPED state and 
> can never be transitioned to Active again.
> The initial stacktrace reported is at 
> https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to