[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195847#comment-14195847 ]
Karthik Kambatla commented on YARN-2010: ---------------------------------------- Jian - thanks for taking the time to look at this closely. The patch looks mostly good. However, this patch only catches DT renewal issues during app recovery. Any other errors that could be encountered in the remaining code paths are not handled gracefully. For instance, any errors during app.recoverAppAttempts can affect the health of the RM. > Handle app-recovery failures gracefully > --------------------------------------- > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Affects Versions: 2.3.0 > Reporter: bc Wong > Assignee: Karthik Kambatla > Priority: Blocker > Attachments: YARN-2010.1.patch, YARN-2010.patch, > issue-stacktrace.rtf, yarn-2010-10.patch, yarn-2010-11.patch, > yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, > yarn-2010-5.patch, yarn-2010-6.patch, yarn-2010-7.patch, yarn-2010-8.patch, > yarn-2010-9.patch > > > Sometimes, the RM fails to recover an application. It could be because of > turning security on, token expiry, or issues connecting to HDFS etc. The > causes could be classified into (1) transient, (2) specific to one > application, and (3) permanent and apply to multiple (all) applications. > Today, the RM fails to transition to Active and ends up in STOPPED state and > can never be transitioned to Active again. > The initial stacktrace reported is at > https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf -- This message was sent by Atlassian JIRA (v6.3.4#6332)