[jira] [Commented] (YARN-4892) Job will be hung and can not be finished after resource manager restarting and enabling recovery

Fang Xie (JIRA) Wed, 30 Mar 2016 07:32:45 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218039#comment-15218039
 ]


Fang Xie commented on YARN-4892:
--------------------------------

Sorry, just a correction to step #5: the job cannot be finished, as seen from 
both the CLI and the YARN GUI.
The root cause is that the value of numCompletedContainers is incorrect: it 
counts all containers (those from before the RM was killed and those from 
after the RM restart), so it ends up larger than the number of tasks.
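
To illustrate why an over-counted numCompletedContainers could keep a job from 
finishing, here is a minimal sketch. It assumes the completion check compares 
the counter to the task total with strict equality; that assumption, the 
Tracker class, and its method names are hypothetical and are not taken from 
the distributedshell source. The ">=" variant is shown only for contrast, not 
as the actual fix.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CompletionCounterSketch {

    // Hypothetical stand-in for the application master's bookkeeping;
    // not the actual distributedshell code.
    static final class Tracker {
        private final int numTotalContainers;
        private final AtomicInteger numCompletedContainers = new AtomicInteger();

        Tracker(int numTotalContainers) {
            this.numTotalContainers = numTotalContainers;
        }

        // Record a batch of completed containers reported in one callback.
        void onContainersCompleted(int batchSize) {
            numCompletedContainers.addAndGet(batchSize);
        }

        int completed() {
            return numCompletedContainers.get();
        }

        // Assumed equality test: stays false once the counter passes the total.
        boolean isDoneStrict() {
            return numCompletedContainers.get() == numTotalContainers;
        }

        // Assumed alternative, for contrast only: tolerates an overshoot.
        boolean isDoneTolerant() {
            return numCompletedContainers.get() >= numTotalContainers;
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker(2);   // a job with 2 tasks
        tracker.onContainersCompleted(1);   // counted before the RM was killed
        tracker.onContainersCompleted(2);   // counted after the RM restart
        System.out.println("completed=" + tracker.completed());
        System.out.println("strict done=" + tracker.isDoneStrict());
        System.out.println("tolerant done=" + tracker.isDoneTolerant());
    }
}
```

In this sketch the counter reaches 3 for a 2-task job, so the equality check 
never succeeds, which would match the hang described above if the real check 
behaves the same way.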



> Job will be hung and can not be finished after resource manager restarting 
> and enabling recovery
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4892
>                 URL: https://issues.apache.org/jira/browse/YARN-4892
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Fang Xie
>            Priority: Critical
>
> Enable resourcemanager recovery by setting the following properties:
> <property>
>     <description>Enable RM to recover state after starting. If true, then
>     yarn.resourcemanager.store.class must be specified. </description>
>     <name>yarn.resourcemanager.recovery.enabled</name>
>     <value>true</value>
> </property>
> <property>
>     <description> </description>
>     <name>yarn.resourcemanager.store.class</name>
>     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
> </property>
> <property>
>     <description> </description>
>     <name>yarn.resourcemanager.fs.state-store.uri</name>
>     <value>hdfs://apple02:9000/rmstore</value>
> </property>
> Run a distributedshell job. While the job is running, kill the 
> resourcemanager and then restart it. The job hangs and cannot be finished.
> Both the fair scheduler and the capacity scheduler have this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
