Alternatively you can schedule a cron job to do kinit every 20 hours or so. Just to renew token before it expires.
— Sent from Mailbox for iPad On Mon, Dec 2, 2013 at 9:12 AM, Rainer Toebbicke <[email protected]> wrote: > Hello, > I am trying to understand why my long-running mapreduce jobs stop after 24 > hours (approx) on a secure cluster. > This is on Cloudera CDH 4.3.0, hence hadoop 2.0.0, using mrv1 (not yarn), > authentication specified as "kerberos". Trying with a short-lived Kerberos > ticket (1h) I see that it gets renewed regularly. Still, the job crashes > after 24 hours because the delegation token expires. > On a test cluster with increased logging and shortened > dfs.namenode.delegation.token.renew-interval (for quicker debugging) I see > that an immediate renew of the delegation token fails, and then after the > expiry period the Namenode log starts getting clobbered. > Detail: > 2013-12-02 15:57:08,461 INFO SecurityLogger.org.apache.hadoop.ipc.Server: > Auth successful for [email protected] (auth:TOKEN) > 2013-12-02 15:57:08,462 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for [email protected] (auth:TOKEN) for > protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol > 2013-12-02 15:57:08,500 INFO SecurityLogger.org.apache.hadoop.ipc.Server: > Auth successful for mapred/[email protected] (auth:SIMPLE) > 2013-12-02 15:57:08,540 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for mapred/[email protected] (auth:KERBEROS) for > protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol > 2013-12-02 15:57:08,541 INFO > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > Token renewal requested for identifier: HDFS_DELEGATION_TOKEN token 12 for > tobbicke > 2013-12-02 15:57:08,541 ERROR > org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException > as:mapred/[email protected] (auth:KERBEROS) > cause:org.apache.hadoop.security.AccessControlException: Client mapred tries > to renew a token with renewer specified as nobody > 2013-12-02 15:57:08,541 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 9 on 9000, call > org.apache.hadoop.hdfs.protocol.ClientProtocol.renewDelegationToken from > 188.184.xxx.xxx:42031: error: > org.apache.hadoop.security.AccessControlException: Client mapred tries to > renew a token with renewer specified as nobody > org.apache.hadoop.security.AccessControlException: Client mapred tries to > renew a token with renewer specified as nobody > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:274) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:5319) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:377) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:814) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:45024) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1701) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1697) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1695) > Is this as unhealthy as it looks? If the first (immediate) renewal fails I > assume others will share the same fate. Would that explain the > 24-hour-lifetime on the "real" cluster and what could be the reason? How does > "nobody" come into the game here? > In any case, linked to this or not, after > dfs.namenode.delegation.token.renew-interval ms the following is logged a > zillion times: > 2013-12-02 16:58:09,718 WARN SecurityLogger.org.apache.hadoop.ipc.Server: > Auth failed for 188.184.xxx.xxx:44979:null (DIGEST-MD5: IO error acquiring > password) > 2013-12-02 16:58:09,719 INFO org.apache.hadoop.ipc.Server: IPC Server > listener on 9000: readAndProcess threw exception > javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password > [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: token > (HDFS_DELEGATION_TOKEN token 12 for tobbicke) is expired] from client > 188.184.xxx.xxx. Count of bytes read: 0 > javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password > [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: token > (HDFS_DELEGATION_TOKEN token 12 for tobbicke) is expired] > at > com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:577) > at > com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:226) > at > org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1210) > at > org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1405) > at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:719) > at > org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:518) > at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:493) > Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: token > (HDFS_DELEGATION_TOKEN token 12 for tobbicke) is expired > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.retrievePassword(AbstractDelegationTokenSecretManager.java:227) > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.retrievePassword(AbstractDelegationTokenSecretManager.java:46) > at > org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:194) > at > org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:220) > at > com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:568) > ... 6 more > Any ideas? > Rainer
