[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped

2018-04-18 Thread yunjiong zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yunjiong zhao updated HDFS-13441:
-
Attachment: HDFS-13441.003.patch

> DataNode missed BlockKey update from NameNode due to HeartbeatResponse was 
> dropped
> --
>
> Key: HDFS-13441
> URL: https://issues.apache.org/jira/browse/HDFS-13441
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Affects Versions: 2.7.1
> Reporter: yunjiong zhao
> Assignee: yunjiong zhao
>Priority: Major
> Attachments: HDFS-13441.002.patch, HDFS-13441.003.patch, 
> HDFS-13441.patch
>
>
> After a NameNode failover, many applications failed because some DataNodes 
> could not re-compute the password from the block token.
> {code:java}
> 2018-04-11 20:10:52,448 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error 
> processing unknown operation  src: /10.142.74.116:57404 dst: 
> /10.142.77.45:50010
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.]
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382)
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>         ... 7 more
> {code}
>  
> The DataNode log shows no block key update around 2018-04-11 09:55:00 or 
> around 2018-04-11 19:55:00:
> {code:java}
> 2018-04-10 14:51:36,424 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-10 23:55:38,420 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 00:51:34,792 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 10:51:39,403 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 20:51:44,422 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-12 02:54:47,855 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-12 05:55:44,456 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> {code}
> The cause is a SocketTimeoutException when sending the heartbeat 

[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped

2018-04-16 Thread yunjiong zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yunjiong zhao updated HDFS-13441:
-
Attachment: HDFS-13441.002.patch

> DataNode missed BlockKey update from NameNode due to HeartbeatResponse was 
> dropped
> --
>
> Key: HDFS-13441
> URL: https://issues.apache.org/jira/browse/HDFS-13441
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Affects Versions: 2.7.1
> Reporter: yunjiong zhao
> Assignee: yunjiong zhao
>Priority: Major
> Attachments: HDFS-13441.002.patch, HDFS-13441.patch
>
>
> After a NameNode failover, many applications failed because some DataNodes 
> could not re-compute the password from the block token.
> {code:java}
> 2018-04-11 20:10:52,448 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error 
> processing unknown operation  src: /10.142.74.116:57404 dst: 
> /10.142.77.45:50010
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.]
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382)
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>         ... 7 more
> {code}
>  
> The DataNode log shows no block key update around 2018-04-11 09:55:00 or 
> around 2018-04-11 19:55:00:
> {code:java}
> 2018-04-10 14:51:36,424 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-10 23:55:38,420 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 00:51:34,792 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 10:51:39,403 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 20:51:44,422 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-12 02:54:47,855 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-12 05:55:44,456 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> {code}
> The cause is a SocketTimeoutException when sending the heartbeat to the 
> standby NameNode.
> 

[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped

2018-04-13 Thread yunjiong zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yunjiong zhao updated HDFS-13441:
-
Description: 
After a NameNode failover, many applications failed because some DataNodes 
could not re-compute the password from the block token.
{code:java}
2018-04-11 20:10:52,448 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error processing 
unknown operation  src: /10.142.74.116:57404 dst: /10.142.77.45:50010
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
[Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
re-compute password for block_token_identifier (expiryDate=1523538652448, 
keyId=1762737944, userId=hadoop, 
blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
access modes=[WRITE]), since the required block key (keyID=1762737944) doesn't 
exist.]
        at 
com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
        at 
com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
re-compute password for block_token_identifier (expiryDate=1523538652448, 
keyId=1762737944, userId=hadoop, 
blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
access modes=[WRITE]), since the required block key (keyID=1762737944) doesn't 
exist.
        at 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382)
        at 
org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
        at 
com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
        ... 7 more

{code}
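The `expiryDate` in the log above can be decoded to a readable time when lining it up with other log entries. The following is a minimal sketch, assuming `expiryDate` is a count of milliseconds since the Unix epoch; the class and method names are hypothetical and not part of Hadoop. The log does not state the server's time zone, so the offset between this UTC instant and the log timestamp is not derived here.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for reading the expiryDate field of a logged block token identifier. */
final class BlockTokenExpiry {
    private BlockTokenExpiry() {}

    /** Converts an expiryDate value (assumed: milliseconds since the Unix epoch) to a UTC instant. */
    static Instant expiryInstant(long expiryDateMillis) {
        return Instant.ofEpochMilli(expiryDateMillis);
    }

    /** Time left until expiry at the given moment; negative if the token has already expired. */
    static Duration remaining(long expiryDateMillis, Instant now) {
        return Duration.between(now, expiryInstant(expiryDateMillis));
    }

    public static void main(String[] args) {
        long expiryDate = 1523538652448L; // value copied from the log above
        System.out.println(expiryInstant(expiryDate)); // prints 2018-04-12T13:10:52.448Z
    }
}
```

Since the token had not yet expired when the error was logged on 2018-04-11, the failure comes from the missing block key rather than from an expired token, which matches the exception message.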
 

The DataNode log shows no block key update around 2018-04-11 09:55:00 or 
around 2018-04-11 19:55:00:
{code:java}
2018-04-10 14:51:36,424 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-10 23:55:38,420 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-11 00:51:34,792 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-11 10:51:39,403 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-11 20:51:44,422 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-12 02:54:47,855 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-12 05:55:44,456 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
{code}
The cause is a SocketTimeoutException when sending the heartbeat to the 
standby NameNode:
{code:java}
2018-04-11 09:55:34,699 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in offerService
java.net.SocketTimeoutException: Call From 
hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com/10.142.77.45 to 
ares-nn.vip.ebay.com:8030 failed on socket timeout exception: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.142.77.45:48803 remote=ares-nn.vip.ebay.com/10.103.108.200:8030]; For 
more details see:  http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.GeneratedConstructorAccessor32.newInstance(Unknown 
Source)
        at 

[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped

2018-04-12 Thread yunjiong zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yunjiong zhao updated HDFS-13441:
-
Status: Patch Available  (was: Open)

There are two ways to fix this bug. One is to make sure the NameNodes deliver 
new BlockKeys to the DataNodes successfully. The other is to have a DataNode 
that cannot find a BlockKey re-register with the NameNodes, so that it gets 
the newest BlockKeys.

The first way is much more complex and needs more code changes than the second.

The attached patch re-registers the DataNode with the NameNodes.

It is not tested and is only meant to illustrate the idea.
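
The second approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the attached patch: `KeySource` and `BlockKeyCache` are hypothetical stand-ins for the NameNode registration call and the DataNode's key store, and the real Hadoop classes have different APIs.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the NameNode: registration returns the current block keys. */
interface KeySource {
    Map<Integer, byte[]> register();
}

/** Hypothetical stand-in for the DataNode's block key store (not Hadoop's BlockTokenSecretManager). */
final class BlockKeyCache {
    private final Map<Integer, byte[]> keys = new ConcurrentHashMap<>();
    private final KeySource nameNode;

    BlockKeyCache(KeySource nameNode) {
        this.nameNode = nameNode;
    }

    /** Normal path: keys delivered with a heartbeat response are stored here. */
    void setKeys(Map<Integer, byte[]> newKeys) {
        keys.putAll(newKeys);
    }

    /** Looks up a key; on a miss, re-registers once to pull the newest keys, then retries. */
    byte[] retrieveKey(int keyId) {
        byte[] key = keys.get(keyId);
        if (key == null) {
            // A key update may have been lost with a dropped heartbeat response.
            setKeys(nameNode.register());
            key = keys.get(keyId);
        }
        if (key == null) {
            throw new IllegalStateException("required block key (keyID=" + keyId + ") doesn't exist");
        }
        return key;
    }

    public static void main(String[] args) {
        KeySource nn = () -> Collections.singletonMap(1762737944, new byte[] {42});
        BlockKeyCache cache = new BlockKeyCache(nn);
        System.out.println(cache.retrieveKey(1762737944).length);
    }
}
```

A production version would probably need to limit how often a miss triggers re-registration, since a client presenting tokens with unknown key IDs could otherwise cause repeated registrations.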

> DataNode missed BlockKey update from NameNode due to HeartbeatResponse was 
> dropped
> --
>
> Key: HDFS-13441
> URL: https://issues.apache.org/jira/browse/HDFS-13441
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Affects Versions: 2.7.1
> Reporter: yunjiong zhao
> Assignee: yunjiong zhao
>Priority: Major
> Attachments: HDFS-13441.patch
>
>
> After a NameNode failover, many applications failed because some DataNodes 
> could not re-compute the password from the block token.
> {code:java}
> 2018-04-11 20:10:52,448 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error 
> processing unknown operation  src: /10.142.74.116:57404 dst: 
> /10.142.77.45:50010
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.]
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382)
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>         ... 7 more
> {code}
>  
> The DataNode log shows no block key update around 
> 2018-04-11 09:55:
> {code:java}
> 2018-04-10 14:51:36,424 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-10 23:55:38,420 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 00:51:34,792 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 10:51:39,403 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 20:51:44,422 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> 

[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped

2018-04-12 Thread yunjiong zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yunjiong zhao updated HDFS-13441:
-
Attachment: HDFS-13441.patch

> DataNode missed BlockKey update from NameNode due to HeartbeatResponse was 
> dropped
> --
>
> Key: HDFS-13441
> URL: https://issues.apache.org/jira/browse/HDFS-13441
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Affects Versions: 2.7.1
> Reporter: yunjiong zhao
> Assignee: yunjiong zhao
>Priority: Major
> Attachments: HDFS-13441.patch
>
>
> After a NameNode failover, many applications failed because some DataNodes 
> could not re-compute the password from the block token.
> {code:java}
> 2018-04-11 20:10:52,448 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error 
> processing unknown operation  src: /10.142.74.116:57404 dst: 
> /10.142.77.45:50010
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.]
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382)
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>         ... 7 more
> {code}
>  
> The DataNode log shows no block key update around 
> 2018-04-11 09:55:
> {code:java}
> 2018-04-10 14:51:36,424 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-10 23:55:38,420 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 00:51:34,792 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 10:51:39,403 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 20:51:44,422 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-12 02:54:47,855 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-12 05:55:44,456 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> {code}
> The cause is a SocketTimeoutException when sending the heartbeat to the 
> standby NameNode:
> {code:java}
> 2018-04-11 09:55:34,699 WARN 

[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped

2018-04-12 Thread yunjiong zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yunjiong zhao updated HDFS-13441:
-
Description: 
After a NameNode failover, many applications failed because some DataNodes 
could not re-compute the password from the block token.
{code:java}
2018-04-11 20:10:52,448 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error processing 
unknown operation  src: /10.142.74.116:57404 dst: /10.142.77.45:50010
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
[Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
re-compute password for block_token_identifier (expiryDate=1523538652448, 
keyId=1762737944, userId=hadoop, 
blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
access modes=[WRITE]), since the required block key (keyID=1762737944) doesn't 
exist.]
        at 
com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
        at 
com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
re-compute password for block_token_identifier (expiryDate=1523538652448, 
keyId=1762737944, userId=hadoop, 
blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
access modes=[WRITE]), since the required block key (keyID=1762737944) doesn't 
exist.
        at 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382)
        at 
org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
        at 
com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
        ... 7 more

{code}
 

The DataNode log shows no block key update around 2018-04-11 
09:55:
{code:java}
2018-04-10 14:51:36,424 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-10 23:55:38,420 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-11 00:51:34,792 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-11 10:51:39,403 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-11 20:51:44,422 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-12 02:54:47,855 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
2018-04-12 05:55:44,456 INFO 
org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
block keys
{code}
The cause is a SocketTimeoutException when sending the heartbeat to the 
standby NameNode:
{code:java}
2018-04-11 09:55:34,699 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in offerService
java.net.SocketTimeoutException: Call From 
hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com/10.142.77.45 to 
ares-nn.vip.ebay.com:8030 failed on socket timeout exception: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.142.77.45:48803 remote=ares-nn.vip.ebay.com/10.103.108.200:8030]; For 
more details see:  http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.GeneratedConstructorAccessor32.newInstance(Unknown 
Source)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at