[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737100#comment-13737100
 ] 

Jian He commented on YARN-1058:
-------------------------------

As Bikas said, the first exception is expected: although AMRMTokens are 
currently stored along with AppAttemptState, they are not yet populated back 
into AMRMTokenSecretManager when the RM comes back. The MR AM currently handles 
this exception by simply ignoring it (MAPREDUCE-5436), so the AM process hangs 
and waits to be killed by the NM instead of rebooting itself.
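
To make the first failure concrete, here is a minimal toy sketch of the situation, not the actual RM code. ToyStateStore, ToySecretManager and tokenValidAfterRestart are hypothetical names invented for illustration, and IllegalStateException stands in for the InvalidToken error in the trace below. It only assumes what is described above: the token is persisted with the attempt state, but the in-memory secret manager starts empty after a restart unless recovery repopulates it.

```java
import java.util.HashMap;
import java.util.Map;

class TokenRecoverySketch {

    /** Toy persistent store: survives a restart (stands in for the RM state store). */
    static class ToyStateStore {
        final Map<String, byte[]> savedTokens = new HashMap<>();
    }

    /** Toy secret manager: purely in-memory, so a restarted RM begins with an empty one. */
    static class ToySecretManager {
        private final Map<String, byte[]> passwords = new HashMap<>();

        void register(String attemptId, byte[] password) {
            passwords.put(attemptId, password);
        }

        byte[] retrievePassword(String attemptId) {
            byte[] password = passwords.get(attemptId);
            if (password == null) {
                // Hypothetical stand-in for the InvalidToken error seen by the AM.
                throw new IllegalStateException(
                    "Password not found for ApplicationAttempt " + attemptId);
            }
            return password;
        }
    }

    /** Simulates an RM restart; returns true if the AM's token still validates afterwards. */
    static boolean tokenValidAfterRestart(boolean repopulateOnRecovery) {
        String attemptId = "appattempt_1376294441253_0001_000001";
        ToyStateStore store = new ToyStateStore();

        // Before the restart: the token is registered in memory and saved with the attempt state.
        ToySecretManager before = new ToySecretManager();
        byte[] password = {1, 2, 3};
        before.register(attemptId, password);
        store.savedTokens.put(attemptId, password);

        // After the restart: a fresh secret manager, while the store still holds the token.
        ToySecretManager after = new ToySecretManager();
        if (repopulateOnRecovery) {
            store.savedTokens.forEach(after::register);
        }
        try {
            after.retrievePassword(attemptId);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("without repopulation: " + tokenValidAfterRestart(false));
        System.out.println("with repopulation:    " + tokenValidAfterRestart(true));
    }
}
```

In this toy model the lookup fails only when the recovery step skips repopulating the secret manager from the store, which mirrors the gap described above.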
                
> Recovery issues on RM Restart with FileSystemRMStateStore
> ---------------------------------------------------------
>
>                 Key: YARN-1058
>                 URL: https://issues.apache.org/jira/browse/YARN-1058
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>
> App recovery doesn't work as expected using FileSystemRMStateStore.
> Steps to reproduce:
> - Ran sleep job with a single map and sleep time of 2 mins
> - Restarted RM while the map task is still running
> - The first attempt fails with the following error
> {noformat}
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Password not found for ApplicationAttempt appattempt_1376294441253_0001_000001
>       at org.apache.hadoop.ipc.Client.call(Client.java:1404)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1357)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>       at $Proxy28.finishApplicationMaster(Unknown Source)
>       at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
> {noformat}
> - The second attempt fails with a different error:
> {noformat}
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hadoop-yarn/staging/kasha/.staging/job_1376294441253_0001/job_1376294441253_0001_2.jhist: File does not exist. Holder DFSClient_NONMAPREDUCE_389533538_1 does not have any open files.
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2454)
>       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:534)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48073)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
> {noformat}
