[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193311#comment-14193311
 ] 

Karthik Kambatla commented on YARN-2010:
----------------------------------------

The latest patch is a step back, closer to the v6 patch. Fixing the test 
failures on v7 of the patch was more involved than I thought and was taking 
too long. So, in the interest of time, I would like to move credential 
parsing to RMAppRecoveredTransition as part of a follow-up JIRA. 

bq. Inside the catch, we may just return FAILED?
This doesn't apply anymore. Will take a closer look in the follow-up JIRA.

bq. I don’t think we can get ConnectException here. Could you explain under 
what scenario we get ConnectException?
The comments elaborate on potential reasons for ConnectException. The stack 
trace for one instance is here: 
https://issues.apache.org/jira/browse/YARN-2010?focusedCommentId=14164516&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14164516



> Handle app-recovery failures gracefully
> ---------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, 
> issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch, 
> yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch, 
> yarn-2010-7.patch, yarn-2010-8.patch
>
>
> Sometimes, the RM fails to recover an application. This could be because of 
> turning security on, token expiry, issues connecting to HDFS, etc. The 
> causes could be classified as (1) transient, (2) specific to one 
> application, or (3) permanent and applicable to multiple (all) applications. 
> Today, the RM fails to transition to Active, ends up in the STOPPED state, 
> and can never be transitioned to Active again.
> The initial stacktrace reported is at 
> https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
