[ https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723970#comment-13723970 ]
Jason Lowe commented on YARN-993:
---------------------------------

This looks more like a MAPREDUCE issue to me. The MR AM is removing the staging directory when it shouldn't. As [~jianhe] noted, this is probably fixed by YARN-513 / MAPREDUCE-5398 or it could be a duplicate of YARN-917.

> job can not recovery after restart resourcemanager
> --------------------------------------------------
>
>                 Key: YARN-993
>                 URL: https://issues.apache.org/jira/browse/YARN-993
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.5-alpha
>         Environment: CentOS5.3 JDK1.7.0_11
>            Reporter: prophy Yan
>            Priority: Critical
>
> Recently, I tested the job recovery function in the YARN framework, but it failed.
> First, I ran the wordcount example program and ran kill -9 on the resourcemanager process on the server when the wordcount job reached map 100%. The job exited with an error within minutes.
> Second, I restarted the resourcemanager on the server using the 'start-yarn.sh' command, but the failed job (wordcount) could not continue. The YARN log says "file not exist!"
> Here is the YARN log:
> 013-07-23 16:05:21,472 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1374564764970_0001_02_000001, NodeId: mv8.mzhen.cn:52117, NodeHttpAddress: mv8.mzhen.cn:8042, Resource: <memory:2048, vCores:1>, Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId: 2, }, id: 1, }, state: C_NEW, ] for AM appattempt_1374564764970_0001_000002
> 2013-07-23 16:05:21,473 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1374564764970_0001_000002 State change from ALLOCATED to LAUNCHED
> 2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1374564764970_0001_000002 State change from LAUNCHED to FAILED
> 2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1374564764970_0001 failed 1 times due to AM Container for appattempt_1374564764970_0001_000002 exited with exitCode: -1000 due to: RemoteTrace:
> java.io.FileNotFoundException: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815)
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
>         at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
>         at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
>         at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
>  at LocalTrace:
>         org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
>         at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
>         at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:491)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:218)
>         at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
>         at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
> .Failing this attempt.. Failing the application.
> 2013-07-23 16:05:21,935 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1374564764970_0001 State change from ACCEPTED to FAILED
> 2013-07-23 16:05:21,937 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=supertool OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1374564764970_0001 failed 1 times due to AM Container for appattempt_1374564764970_0001_000002 exited with exitCode: -1000 due to: RemoteTrace:
> java.io.FileNotFoundException: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
> This is the log from the YARN log file after I restarted the resourcemanager.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira