Dear Hadoop community,

We found that DFSClient failed to read a block because its block token had expired.
Here are the details:
1. DFSClient tried to connect to a datanode, but doSaslHandshake failed with an
IOException, so DFSClient added the datanode to its deadNodes list:

    2019-12-06 14:25:52,911 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /192.168.5.183:50010 for block, add to deadNodes and continue. java.io.IOException: DIGEST-MD5: IO error acquiring password
    java.io.IOException: DIGEST-MD5: IO error acquiring password
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:419)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:474)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
        at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:90)
        at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3453)
        at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:789)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:660)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:886)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:938)
        at java.io.DataInputStream.read(DataInputStream.java:149)

2. From the datanode's log, we can see that the handshake failure was caused by an InvalidToken:

    2019-12-06 14:25:52,908 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: tdh185:50010:DataXceiver error processing unknown operation  src: /192.168.5.181:52500 dst: /192.168.5.183:50010
    javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1575548483208, keyId=-1555008939, userId=hbase, blockPoolId=BP-1371500706-192.168.5.181-1560479016519, blockId=1109964016, access modes=[READ]) is expired.]
        at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:594)
        at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:373)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:298)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:125)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:193)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1575548483208, keyId=-1555008939, userId=hbase, blockPoolId=BP-1371500706-192.168.5.181-1560479016519, blockId=1109964016, access modes=[READ]) is expired.
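
As a sanity check, the expiryDate in the InvalidToken message is epoch milliseconds, and decoding it confirms the token really had expired well before the read attempt. A throwaway snippet:

    import java.time.Instant;

    public class ExpiryCheck {
      public static void main(String[] args) {
        // expiryDate copied from the InvalidToken message above
        long expiryMillis = 1575548483208L;
        // Prints 2019-12-05T12:21:23.208Z (UTC), well before the failed
        // read logged at 2019-12-06 14:25:52 local time.
        System.out.println(Instant.ofEpochMilli(expiryMillis));
      }
    }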

3. Since there are 3 replicas, DFSClient also tried the other 2 datanodes and hit
the same error, so all 3 replicas ended up in deadNodes before it fell back to
re-fetching block locations:

    2019-12-06 14:25:53,819 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain BP-1371500706-192.168.5.181-1560479016519:blk_1109964016_36223686 from any node: java.io.IOException: No live nodes contain block BP-1371500706-192.168.5.181-1560479016519:blk_1109964016_36223686 after checking nodes = [DatanodeInfoWithStorage[192.168.5.183:50010,DS-53265352-2c9c-4c36-a923-c764028a32bf,DISK], DatanodeInfoWithStorage[192.168.5.185:50010,DS-5496b545-e3c6-4414-9a05-adb6930ae29e,DISK], DatanodeInfoWithStorage[192.168.5.186:50010,DS-dfbbfa07-7d2c-48fa-997a-d35cb8d700ac,DISK]], ignoredNodes = null
    No live nodes contain current block
    Block locations:
        DatanodeInfoWithStorage[192.168.5.183:50010,DS-53265352-2c9c-4c36-a923-c764028a32bf,DISK]
        DatanodeInfoWithStorage[192.168.5.185:50010,DS-5496b545-e3c6-4414-9a05-adb6930ae29e,DISK]
        DatanodeInfoWithStorage[192.168.5.186:50010,DS-dfbbfa07-7d2c-48fa-997a-d35cb8d700ac,DISK]
    Dead nodes:
        DatanodeInfoWithStorage[192.168.5.183:50010,DS-53265352-2c9c-4c36-a923-c764028a32bf,DISK]
        DatanodeInfoWithStorage[192.168.5.186:50010,DS-dfbbfa07-7d2c-48fa-997a-d35cb8d700ac,DISK]
        DatanodeInfoWithStorage[192.168.5.185:50010,DS-5496b545-e3c6-4414-9a05-adb6930ae29e,DISK].
    Will get new block locations from namenode and retry...

------

As you can see, the key problem is that DFSClient cannot learn the root cause,
InvalidToken, because the datanode checks for InvalidEncryptionKeyException only;
for InvalidToken it just returns a generic SASL error. We can see this in
SaslDataTransferServer.doSaslHandshake:

    try {
      ...
      localResponse = sasl.evaluateChallengeOrResponse(remoteResponse);
      ...
    } catch (IOException ioe) {
      if (ioe instanceof SaslException &&
          ioe.getCause() != null &&
          ioe.getCause() instanceof InvalidEncryptionKeyException) {
        // This could just be because the client is long-lived and hasn't gotten
        // a new encryption key from the NN in a while. Upon receiving this
        // error, the client will get a new encryption key from the NN and retry
        // connecting to this DN.
        sendInvalidKeySaslErrorMessage(out, ioe.getCause().getMessage());
      } else {
        sendGenericSaslErrorMessage(out, ioe.getMessage());
      }
      throw ioe;
    }
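
For comparison, here is a rough sketch of the extra branch we had in mind,
mirroring the excerpt above. sendInvalidTokenSaslErrorMessage is a hypothetical
helper named by analogy with the existing one, and InvalidToken refers to
org.apache.hadoop.security.token.SecretManager.InvalidToken; this is not actual
Hadoop code:

    } catch (IOException ioe) {
      if (ioe instanceof SaslException &&
          ioe.getCause() instanceof InvalidEncryptionKeyException) {
        sendInvalidKeySaslErrorMessage(out, ioe.getCause().getMessage());
      } else if (ioe instanceof SaslException &&
          ioe.getCause() instanceof InvalidToken) {
        // Hypothetical: report the expired block token explicitly, so the
        // client can re-fetch block locations (which carry fresh tokens)
        // from the NameNode and retry, instead of marking the datanode dead.
        sendInvalidTokenSaslErrorMessage(out, ioe.getCause().getMessage());
      } else {
        sendGenericSaslErrorMessage(out, ioe.getMessage());
      }
      throw ioe;
    }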

So could you help explain why InvalidToken is not checked? After all, if the
client knew the root cause, it could simply re-fetch the token from the
namenode immediately.
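
To make the benefit concrete, here is an illustrative sketch of the retry
decision the client could make if it saw the specific cause. The names
(getBlockReader, refreshBlockLocations) are ours, not the actual
DFSInputStream code:

    try {
      blockReader = getBlockReader(datanode, block, accessToken);
    } catch (InvalidBlockTokenException e) {
      // If the generic SASL error were mapped back to a token-specific
      // exception, the client could refresh its token by re-fetching block
      // locations from the NameNode and retry the same (healthy) datanode.
      refreshBlockLocations(block);  // hypothetical helper
    } catch (IOException e) {
      // Today the client only sees the generic error, so a healthy datanode
      // is added to deadNodes and all replicas are exhausted first.
      addToDeadNodes(datanode);
    }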

Many thanks!
-- 
Best Regards!
Wenqi
