[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

Bikas Saha (JIRA) Mon, 12 Aug 2013 09:47:38 -0700

    [ https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737021#comment-13737021 ]


Bikas Saha commented on YARN-1058:
----------------------------------

The first error is expected because the RM does not currently preserve 
AMRMTokens across a restart. The second may occur because the job client 
deletes the staging directory, believing the job has failed when the first 
attempt fails. Can you retry, terminating the sleep job client after it has 
launched the job so that it cannot take further action?
                
> Recovery issues on RM Restart with FileSystemRMStateStore
> ---------------------------------------------------------
>
>                 Key: YARN-1058
>                 URL: https://issues.apache.org/jira/browse/YARN-1058
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> App recovery doesn't work as expected using FileSystemRMStateStore.
> Steps to reproduce:
> - Ran a sleep job with a single map task and a sleep time of 2 minutes
> - Restarted the RM while the map task was still running
> - The first attempt fails with the following error:
> {noformat}
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Password not found for ApplicationAttempt appattempt_1376294441253_0001_000001
>       at org.apache.hadoop.ipc.Client.call(Client.java:1404)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1357)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>       at $Proxy28.finishApplicationMaster(Unknown Source)
>       at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
> {noformat}
> - The second attempt fails with a different error:
> {noformat}
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hadoop-yarn/staging/kasha/.staging/job_1376294441253_0001/job_1376294441253_0001_2.jhist: File does not exist. Holder DFSClient_NONMAPREDUCE_389533538_1 does not have any open files.
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2454)
>       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:534)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48073)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
> {noformat}

