[ https://issues.apache.org/jira/browse/YARN-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15287561#comment-15287561 ]
Jian He commented on YARN-5098:
-------------------------------
bq. the app needs to be submitted without an HDFS token so the RM will acquire
and manage it directly on the app's behalf
Btw, this is not necessary: the RM will try to get a new token on the app's behalf
when the token is about to expire, regardless of whether the app provided the
token in the first place.
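Here is a minimal sketch of that rule. It assumes a hypothetical `TrackedToken` model and a configurable threshold, and neither is the actual RM code. The decision looks only at time-to-expiry, so an app-supplied token and an RM-obtained one are treated the same way.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical model of an HDFS delegation token as tracked by the RM.
 * Only the fields needed for this illustration are included.
 */
final class TrackedToken {
    final Instant expiry;          // time after which the NameNode rejects the token
    final boolean providedByApp;   // recorded, but deliberately ignored by the policy below

    TrackedToken(Instant expiry, boolean providedByApp) {
        this.expiry = expiry;
        this.providedByApp = providedByApp;
    }
}

public class RenewalPolicy {
    /**
     * Returns true when the RM should fetch a fresh token on the app's behalf.
     * The decision depends only on whether expiry falls within the threshold,
     * not on whether the app supplied the token at submission time.
     */
    static boolean shouldFetchNewToken(TrackedToken t, Instant now, Duration threshold) {
        return !now.plus(threshold).isBefore(t.expiry);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2016-05-16T18:00:00Z");
        Duration threshold = Duration.ofHours(1);
        TrackedToken fromApp = new TrackedToken(now.plus(Duration.ofMinutes(30)), true);
        TrackedToken fromRm = new TrackedToken(now.plus(Duration.ofMinutes(30)), false);
        TrackedToken fresh = new TrackedToken(now.plus(Duration.ofDays(1)), true);
        System.out.println(shouldFetchNewToken(fromApp, now, threshold)); // true
        System.out.println(shouldFetchNewToken(fromRm, now, threshold));  // true
        System.out.println(shouldFetchNewToken(fresh, now, threshold));   // false
    }
}
```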
I debugged this. With YARN-2704, in the normal case the RM should get a new token
and distribute it to the NM when the old token is about to expire. The problem
here is that the RM was shut down for a long time, and the token expired during
that window. After the RM restarts, it tries to recover the app and renew the
token. The renewal inevitably fails because the token has already expired, so log
aggregation fails when the app completes.
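The following toy model makes the failure sequence concrete. It assumes hypothetical `renew()` semantics in which renewal succeeds only while the token is still valid, which simplifies the real NameNode behavior. An outage that ends before expiry is harmless, and one that ends after expiry is not.

```java
import java.time.Duration;
import java.time.Instant;

public class RecoveryTimeline {
    /** Thrown by the hypothetical renew() when the token has already expired. */
    static final class TokenExpiredException extends Exception {
        TokenExpiredException(String msg) { super(msg); }
    }

    /**
     * Hypothetical renewal: succeeds only while the token is still valid,
     * returning a new expiry of now + renewInterval. An expired token cannot be renewed.
     */
    static Instant renew(Instant expiry, Instant now, Duration renewInterval)
            throws TokenExpiredException {
        if (now.isAfter(expiry)) {
            throw new TokenExpiredException("token expired at " + expiry + ", now " + now);
        }
        return now.plus(renewInterval);
    }

    /** Returns true if an RM restarting at restartTime can still renew the token. */
    static boolean canRenewAfterRestart(Instant expiry, Instant restartTime, Duration renewInterval) {
        try {
            renew(expiry, restartTime, renewInterval);
            return true;
        } catch (TokenExpiredException e) {
            return false;  // the real RM currently logs this and continues recovery
        }
    }

    public static void main(String[] args) {
        Instant expiry = Instant.parse("2016-05-15T12:00:00Z");
        Duration interval = Duration.ofDays(1);
        // Short outage: the RM comes back before expiry, so renewal succeeds.
        System.out.println(canRenewAfterRestart(expiry, Instant.parse("2016-05-15T11:00:00Z"), interval)); // true
        // Long outage: the RM comes back after expiry, so renewal fails.
        System.out.println(canRenewAfterRestart(expiry, Instant.parse("2016-05-16T18:00:00Z"), interval)); // false
    }
}
```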
One solution I have in mind is to have the RM request a new token and distribute
it to the NM if token renewal fails during app recovery. Right now the failure is
simply ignored and recovery continues.
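Here is a rough sketch of the proposed recovery step. It uses a hypothetical `TokenService` interface in place of the real RM/NameNode plumbing and a plain list in place of the NM distribution path. The step renews the token if it can, and otherwise fetches and pushes a fresh one instead of ignoring the failure.

```java
import java.util.ArrayList;
import java.util.List;

public class RecoverWithFallback {
    /** Hypothetical stand-in for the RM's connection to the NameNode. */
    interface TokenService {
        void renew(String token) throws Exception;   // throws if the token is expired
        String fetchNewToken(String user);           // RM obtains a fresh token for the user
    }

    /**
     * Proposed recovery step: try to renew the recovered token; if renewal
     * fails, fetch a new token and push it to the NodeManagers.
     * Returns the token the NMs should use.
     */
    static String recoverToken(String user, String recovered, TokenService svc, List<String> pushedToNMs) {
        try {
            svc.renew(recovered);
            return recovered;
        } catch (Exception renewFailure) {
            String fresh = svc.fetchNewToken(user);
            pushedToNMs.add(fresh);  // stand-in for distributing the token to the NMs
            return fresh;
        }
    }

    public static void main(String[] args) {
        TokenService svc = new TokenService() {
            @Override public void renew(String token) throws Exception {
                if (token.startsWith("expired")) {
                    throw new Exception("token can't be found in cache");
                }
            }
            @Override public String fetchNewToken(String user) {
                return "new-token-for-" + user;
            }
        };
        List<String> pushed = new ArrayList<>();
        System.out.println(recoverToken("hrt_qa", "expired-171", svc, pushed)); // new-token-for-hrt_qa
        System.out.println(pushed);                                             // [new-token-for-hrt_qa]
    }
}
```

The real change would also need to decide which NMs receive the token (for example, only those running the app's containers). The list here deliberately leaves that open.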
> Yarn Application log Aggregation fails due to NM can not get correct HDFS
> delegation token
> -------------------------------------------------------------------------------------------
>
> Key: YARN-5098
> URL: https://issues.apache.org/jira/browse/YARN-5098
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Reporter: Yesha Vora
>
> Environment: HA cluster
> YARN application logs for a long-running application could not be gathered
> because the NodeManager failed to talk to HDFS, with the error below.
> {code}
> 2016-05-16 18:18:28,533 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:finishLogAggregation(555)) - Application just finished : application_1463170334122_0002
> 2016-05-16 18:18:28,545 WARN ipc.Client (Client.java:run(705)) - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 171 for hrt_qa) can't be found in cache
> at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:375)
> at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:583)
> at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:398)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:752)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:748)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1719)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:747)
> at org.apache.hadoop.ipc.Client$Connection.access$3100(Client.java:398)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1597)
> at org.apache.hadoop.ipc.Client.call(Client.java:1439)
> at org.apache.hadoop.ipc.Client.call(Client.java:1386)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:240)
> at com.sun.proxy.$Proxy83.getServerDefaults(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:282)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> at com.sun.proxy.$Proxy84.getServerDefaults(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:1018)
> at org.apache.hadoop.fs.Hdfs.getServerDefaults(Hdfs.java:156)
> at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:550)
> at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:687)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)