[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193678#comment-14193678 ]

Karthik Kambatla commented on YARN-2010:
----------------------------------------

Updated patch doesn't handle ConnectException, and preserves the behavior introduced in YARN-2308 through a new QueueNotFoundException.

> Handle app-recovery failures gracefully
> ---------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch,
> issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch,
> yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch,
> yarn-2010-7.patch, yarn-2010-8.patch, yarn-2010-9.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of
> turning security on, token expiry, or issues connecting to HDFS etc. The
> causes could be classified into (1) transient, (2) specific to one
> application, and (3) permanent and apply to multiple (all) applications.
> Today, the RM fails to transition to Active and ends up in STOPPED state and
> can never be transitioned to Active again.
> The initial stacktrace reported is at
> https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf

--
This message was sent by Atlassian JIRA (v6.3.4#6332)