[ 
https://issues.apache.org/jira/browse/YARN-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213289#comment-14213289
 ] 

Rohith commented on YARN-2865:
------------------------------

bq. why does the rmContext still contain the application? If the RM were in 
standby mode, transitionToStandby should have cleaned the rmContext up?
I agree for the positive flow. But what if transitionToActive throws an 
exception after recovery has succeeded? The recovery process adds applications 
back to the RMContext in RMAppManager. If any service fails to start after 
recovery has completed, the RMContext is left holding stale applications.
Consider the following scenario:
# The RM is in the STANDBY state, which is its initial state.
# STANDBY to ACTIVE:
## Recovery: all applications are recovered successfully. The RMContext now 
holds the recovered applications.
## An active service fails to start and throws an exception back.
   The RM state remains STANDBY. The exception handling here resets only the 
dispatcher, not the rmContext or the metrics system. Currently, that cleanup is 
done only in {{stopActiveService()}}.
# STANDBY to ACTIVE: recovery fails with the above exception. The RM never moves 
to ACTIVE on any further transitionToActive command from the elector unless it 
first receives a STANDBY to STANDBY command followed by another STANDBY to ACTIVE.
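
The scenario above can be reproduced with a small self-contained model. This is 
only a sketch: {{ToyContext}}, {{ToyRM}}, {{tryActivate}} and {{runScenario}} are 
hypothetical names, not the real ResourceManager classes, and the 
{{clearContextOnFailure}} flag is an assumption about what a cleanup on failure 
could look like, not the content of the attached patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the failure; not actual YARN code.
public class StaleContextSketch {

    /** Toy stand-in for RMContext: recovered application ids and a state label. */
    static class ToyContext {
        final Map<String, String> apps = new HashMap<>();

        void addApp(String appId) {
            // Mirrors the duplicate check that makes the second recovery fail.
            if (apps.putIfAbsent(appId, "RECOVERED") != null) {
                throw new IllegalStateException(
                        "Application with id " + appId + " already present. Cannot duplicate");
            }
        }
    }

    /** Toy stand-in for the RM's STANDBY -> ACTIVE transition. */
    static class ToyRM {
        final ToyContext context = new ToyContext();
        final boolean clearContextOnFailure; // assumed fix: also reset the context
        boolean failServiceStart;            // simulates an active service failing to start
        boolean active = false;

        ToyRM(boolean clearContextOnFailure, boolean failServiceStart) {
            this.clearContextOnFailure = clearContextOnFailure;
            this.failServiceStart = failServiceStart;
        }

        void transitionToActive(List<String> storedAppIds) {
            try {
                // Step 1: recovery adds every stored application to the context.
                for (String id : storedAppIds) {
                    context.addApp(id);
                }
                // Step 2: a service start failure after recovery has completed.
                if (failServiceStart) {
                    throw new IllegalStateException("active service failed to start");
                }
                active = true;
            } catch (RuntimeException e) {
                // Without this cleanup the recovered applications stay behind.
                if (clearContextOnFailure) {
                    context.apps.clear();
                }
                throw e;
            }
        }
    }

    /** Attempts the transition and reports the resulting state as text. */
    static String tryActivate(ToyRM rm, List<String> storedAppIds) {
        try {
            rm.transitionToActive(storedAppIds);
            return "ACTIVE";
        } catch (RuntimeException e) {
            return "STANDBY: " + e.getMessage();
        }
    }

    /** Runs two STANDBY -> ACTIVE attempts; only the first has a failing service. */
    static String[] runScenario(boolean clearContextOnFailure) {
        List<String> stored = Arrays.asList("app_1", "app_2");
        ToyRM rm = new ToyRM(clearContextOnFailure, true);
        String first = tryActivate(rm, stored);
        rm.failServiceStart = false; // the service problem is gone on the retry
        String second = tryActivate(rm, stored);
        return new String[] {first, second};
    }

    public static void main(String[] args) {
        for (boolean clear : new boolean[] {false, true}) {
            String[] result = runScenario(clear);
            System.out.println("clearContextOnFailure=" + clear);
            System.out.println("  attempt 1 -> " + result[0]);
            System.out.println("  attempt 2 -> " + result[1]);
        }
    }
}
```

With {{clearContextOnFailure}} set to false, the second attempt fails on the 
duplicate check even though the service would now start. With it set to true, 
the second attempt reaches ACTIVE.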

      

> Application recovery continuously fails with "Application with id already 
> present. Cannot duplicate"
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2865
>                 URL: https://issues.apache.org/jira/browse/YARN-2865
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Critical
>         Attachments: YARN-2865.patch
>
>
> YARN-2588 handles exceptions thrown while transitioning to active and resets 
> activeServices. But it misses clearing the RMContext apps/nodes details, 
> ClusterMetrics and QueueMetrics. This causes application recovery to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
