Dear Hadoop community,

We found that DFSClient failed to read a block because its block token had expired. Here are the details:

1. DFSClient tried to connect to a datanode, but doSaslHandshake failed with an IOException, so DFSClient added the datanode to the deadNodes list:
2019-12-06 14:25:52,911 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /192.168.5.183:50010 for block, add to deadNodes and continue. java.io.IOException: DIGEST-MD5: IO error acquiring password
java.io.IOException: DIGEST-MD5: IO error acquiring password
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:419)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:474)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
    at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:90)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3453)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:789)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:660)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:886)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:938)
    at java.io.DataInputStream.read(DataInputStream.java:149)

2. From the datanode's log, we can see it is caused by InvalidToken:

2019-12-06 14:25:52,908 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: tdh185:50010:DataXceiver error processing unknown operation src: /192.168.5.181:52500 dst: /192.168.5.183:50010
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1575548483208, keyId=-1555008939, userId=hbase, blockPoolId=BP-1371500706-192.168.5.181-1560479016519, blockId=1109964016, access modes=[READ]) is expired.]
    at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:594)
    at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:373)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:298)
    at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:125)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:193)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1575548483208, keyId=-1555008939, userId=hbase, blockPoolId=BP-1371500706-192.168.5.181-1560479016519, blockId=1109964016, access modes=[READ]) is expired.
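As a side note, the expiryDate in the token identifier is epoch milliseconds. A quick conversion (my own snippet, not from the cluster) confirms the token had indeed expired before the read attempt, regardless of the log timezone:

import java.time.Instant;

public class TokenExpiryCheck {
  public static void main(String[] args) {
    // expiryDate from the InvalidToken message above, in epoch milliseconds
    long expiryDate = 1575548483208L;
    // Prints 2019-12-05T12:21:23.208Z, which is before the failed read
    // logged at 2019-12-06 14:25:52
    System.out.println(Instant.ofEpochMilli(expiryDate));
  }
}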
3. Since there are 3 replicas, DFSClient also tried the other 2 datanodes and got the same error:

2019-12-06 14:25:53,819 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain BP-1371500706-192.168.5.181-1560479016519:blk_1109964016_36223686 from any node: java.io.IOException: No live nodes contain block BP-1371500706-192.168.5.181-1560479016519:blk_1109964016_36223686 after checking nodes = [DatanodeInfoWithStorage[192.168.5.183:50010,DS-53265352-2c9c-4c36-a923-c764028a32bf,DISK], DatanodeInfoWithStorage[192.168.5.185:50010,DS-5496b545-e3c6-4414-9a05-adb6930ae29e,DISK], DatanodeInfoWithStorage[192.168.5.186:50010,DS-dfbbfa07-7d2c-48fa-997a-d35cb8d700ac,DISK]], ignoredNodes = null
No live nodes contain current block
Block locations:
  DatanodeInfoWithStorage[192.168.5.183:50010,DS-53265352-2c9c-4c36-a923-c764028a32bf,DISK]
  DatanodeInfoWithStorage[192.168.5.185:50010,DS-5496b545-e3c6-4414-9a05-adb6930ae29e,DISK]
  DatanodeInfoWithStorage[192.168.5.186:50010,DS-dfbbfa07-7d2c-48fa-997a-d35cb8d700ac,DISK]
Dead nodes:
  DatanodeInfoWithStorage[192.168.5.183:50010,DS-53265352-2c9c-4c36-a923-c764028a32bf,DISK]
  DatanodeInfoWithStorage[192.168.5.186:50010,DS-dfbbfa07-7d2c-48fa-997a-d35cb8d700ac,DISK]
  DatanodeInfoWithStorage[192.168.5.185:50010,DS-5496b545-e3c6-4414-9a05-adb6930ae29e,DISK]
Will get new block locations from namenode and retry...

------

As you can see, the key problem is that DFSClient cannot learn the root cause, InvalidToken, because the datanode checks only for InvalidEncryptionKeyException; for InvalidToken it just returns a generic SASL error. We can see that in SaslDataTransferServer.doSaslHandshake:

try {
  ...
  localResponse = sasl.evaluateChallengeOrResponse(remoteResponse);
  ...
} catch (IOException ioe) {
  if (ioe instanceof SaslException &&
      ioe.getCause() != null &&
      ioe.getCause() instanceof InvalidEncryptionKeyException) {
    // This could just be because the client is long-lived and hasn't gotten
    // a new encryption key from the NN in a while. Upon receiving this
    // error, the client will get a new encryption key from the NN and retry
    // connecting to this DN.
    sendInvalidKeySaslErrorMessage(out, ioe.getCause().getMessage());
  } else {
    sendGenericSaslErrorMessage(out, ioe.getMessage());
  }
  throw ioe;
}

So could you help explain why InvalidToken is not checked here? After all, if the client knew the root cause, it could simply re-fetch the token from the namenode and retry immediately. (I put a rough sketch of what I mean in the P.S. below.)

Many thanks!

--
Best Regards!
Wenqi
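P.S. For illustration only, here is the kind of extra branch I have in mind, modeled on the existing InvalidEncryptionKeyException check. Note that sendInvalidTokenSaslErrorMessage is a hypothetical helper that does not exist in the current code base, and the client side would also need a matching SASL status it could translate back into an exception, so please read this as a sketch of the idea, not a tested patch:

// Sketch only: a modified catch block in SaslDataTransferServer.doSaslHandshake.
// InvalidToken here is org.apache.hadoop.security.token.SecretManager.InvalidToken.
} catch (IOException ioe) {
  if (ioe instanceof SaslException && ioe.getCause() != null &&
      ioe.getCause() instanceof InvalidEncryptionKeyException) {
    sendInvalidKeySaslErrorMessage(out, ioe.getCause().getMessage());
  } else if (ioe instanceof SaslException && ioe.getCause() != null &&
      ioe.getCause() instanceof SecretManager.InvalidToken) {
    // Hypothetical: tell the client its block token is expired/invalid so it
    // can re-fetch the block locations (and a fresh token) from the NN and
    // retry, instead of adding this DN to deadNodes on a generic SASL error.
    sendInvalidTokenSaslErrorMessage(out, ioe.getCause().getMessage());
  } else {
    sendGenericSaslErrorMessage(out, ioe.getMessage());
  }
  throw ioe;
}

If the client could map such an error back to InvalidBlockTokenException, then, as far as I can tell, the existing refetch-and-retry logic in DFSInputStream should handle the rest.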