[
https://issues.apache.org/jira/browse/YARN-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213289#comment-14213289
]
Rohith commented on YARN-2865:
------------------------------
bq. why does the rmContext still contain the application? If the RM were at
standby mode, the transitionToStandby should have cleaned the rmContext up?
I agree for the positive flow. But what if transitionToActive throws an exception
after recovery has succeeded? The recovery process adds applications back to
RMContext in RMAppManager. If any service start failure occurs after recovery has
completed, RMContext is left with stale applications.
Consider the following scenario:
# RM is in STANDBY state (the initial state).
# STANDBY to ACTIVE :
## Recovery : All applications are recovered successfully. RMContext now holds
the recovered applications.
## An active service fails to start and throws an exception back.
The RM state remains STANDBY. The exception handling here resets only the
dispatcher, not the RMContext or the metrics system. Currently, that cleanup is
done in {{stopActiveService()}}.
# STANDBY to ACTIVE : recovery fails with the "Application with id already
present" exception. The RM never moves to ACTIVE on further transitionToActive
commands from the elector unless it first gets a STANDBY to STANDBY command and
then a STANDBY to ACTIVE command.
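To make the sequence above concrete, here is a minimal toy model of it. This is only a sketch and not the real ResourceManager code: {{ToyRM}}, its fields, the {{clearContextOnFailure}} flag and the exception text are all hypothetical names made up for illustration. The map stands in for the applications held in RMContext, and {{transitionToStandby()}} stands in for the cleanup that currently happens only when the active services are stopped.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the scenario; NOT the real ResourceManager code. */
public class StaleContextDemo {

    /** Hypothetical stand-in for the RM: a state plus an application map. */
    static class ToyRM {
        // Stand-in for the applications kept in RMContext.
        final Map<String, String> apps = new HashMap<>();
        String state = "STANDBY";
        // Simulates an active service that fails to start after recovery.
        boolean failServiceStart;
        // Assumed fix: also clear the context when transitionToActive fails.
        final boolean clearContextOnFailure;

        ToyRM(boolean clearContextOnFailure) {
            this.clearContextOnFailure = clearContextOnFailure;
        }

        /** Adds stored applications back; rejects ids that are already present. */
        void recover(String... storedAppIds) {
            for (String id : storedAppIds) {
                if (apps.containsKey(id)) {
                    throw new IllegalStateException(
                        "Application with id " + id + " already present");
                }
                apps.put(id, "RECOVERED");
            }
        }

        /** Returns true only if the RM actually reached ACTIVE. */
        boolean transitionToActive(String... storedAppIds) {
            try {
                recover(storedAppIds);
                if (failServiceStart) {
                    throw new IllegalStateException("active service failed to start");
                }
                state = "ACTIVE";
                return true;
            } catch (IllegalStateException e) {
                // Only part of the state is reset here unless the flag is set.
                if (clearContextOnFailure) {
                    apps.clear();
                }
                state = "STANDBY";
                return false;
            }
        }

        /** STANDBY to STANDBY (or ACTIVE to STANDBY): full cleanup of the context. */
        void transitionToStandby() {
            apps.clear();
            state = "STANDBY";
        }
    }

    public static void main(String[] args) {
        ToyRM rm = new ToyRM(false);
        rm.failServiceStart = true;
        rm.transitionToActive("app_1");   // recovery succeeds, service start fails
        rm.failServiceStart = false;
        System.out.println("retry reached ACTIVE: " + rm.transitionToActive("app_1"));
        rm.transitionToStandby();         // explicit cleanup
        System.out.println("after standby, reached ACTIVE: " + rm.transitionToActive("app_1"));
    }
}
```

With {{clearContextOnFailure}} set to false, the retry keeps failing on the duplicate check until {{transitionToStandby()}} is called. With it set to true, the first retry succeeds, which is the behaviour I would expect from the failure path.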
> Application recovery continuously fails with "Application with id already
> present. Cannot duplicate"
> ----------------------------------------------------------------------------------------------------
>
> Key: YARN-2865
> URL: https://issues.apache.org/jira/browse/YARN-2865
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Rohith
> Assignee: Rohith
> Priority: Critical
> Attachments: YARN-2865.patch
>
>
> YARN-2588 handles exceptions thrown during transitionToActive and resets
> activeServices. But it misses clearing the RMContext apps/nodes details,
> ClusterMetrics and QueueMetrics. This causes application recovery to fail.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)