[jira] [Comment Edited] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped

2018-04-17 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440798#comment-16440798
 ] 

He Xiaoqiao edited comment on HDFS-13441 at 4/17/18 12:17 PM:
--

{quote}Letting the DataNode pull Block Keys from the NameNode is one choice, but since it needs a protocol change, if I understand correctly it would have to go into Hadoop 4.{quote}
[~zhaoyunjiong] 
The interface for fetching BlockKeys already exists; see
NamenodeProtocol#getBlockKeys.
If the DataNode fetched Block Keys from the NameNode through this interface, it
would be more convenient to handle this kind of exception. If it sounds
reasonable, should we open a new issue to push this feature forward?
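
As a rough illustration of the idea (not a proposed patch; the class and method
names below are hypothetical), a DataNode-side helper could pull the keys over
the existing NamenodeProtocol#getBlockKeys RPC and register them through the
same path a KeyUpdateCommand uses:
{code:java}
// Hypothetical sketch only; names are illustrative, not actual HDFS code.
import java.io.IOException;

import org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager;
import org.apache.hadoop.hdfs.security.token.block.ExportedBlockKeys;
import org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol;

class BlockKeyPuller {
  private final NamenodeProtocol namenode;                 // RPC proxy to the NameNode
  private final BlockPoolTokenSecretManager secretManager; // DataNode-side secret manager
  private final String blockPoolId;

  BlockKeyPuller(NamenodeProtocol namenode,
      BlockPoolTokenSecretManager secretManager, String blockPoolId) {
    this.namenode = namenode;
    this.secretManager = secretManager;
    this.blockPoolId = blockPoolId;
  }

  /** Fetch the current block keys from the NameNode and register them locally. */
  void refreshBlockKeys() throws IOException {
    ExportedBlockKeys keys = namenode.getBlockKeys();
    secretManager.addKeys(blockPoolId, keys);
  }
}
{code}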


was (Author: hexiaoqiao):
{quote}Letting the DataNode pull Block Keys from the NameNode is one choice, but since it needs a protocol change, if I understand correctly it would have to go into Hadoop 4.{quote}
[~zhaoyunjiong] 
The interface for fetching BlockKeys already exists; see
NamenodeProtocol#getBlockKeys.
If the DataNode fetched Block Keys from the NameNode through this interface, it
would be more convenient to handle this kind of exception.

> DataNode missed BlockKey update from NameNode due to HeartbeatResponse was 
> dropped
> --
>
> Key: HDFS-13441
> URL: https://issues.apache.org/jira/browse/HDFS-13441
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.7.1
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
>Priority: Major
> Attachments: HDFS-13441.002.patch, HDFS-13441.patch
>
>
> After NameNode failover, lots of application failed due to some DataNodes 
> can't re-compute password from block token.
> {code:java}
> 2018-04-11 20:10:52,448 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error 
> processing unknown operation  src: /10.142.74.116:57404 dst: 
> /10.142.77.45:50010
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.]
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382)
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>         ... 7 more
> {code}
>  
> In the DataNode log, we didn't see DataNode update block keys around 
> 2018-04-11 09:55:00 and around 2018-04-11 19:55:00.
> {code:java}
> 2018-04-10 14:51:36,424 INFO 
> 

[jira] [Comment Edited] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped

2018-04-14 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438448#comment-16438448
 ] 

He Xiaoqiao edited comment on HDFS-13441 at 4/14/18 6:09 PM:
-

[~zhaoyunjiong], if I understand correctly, this issue was caused by a Standby
NameNode restart.
The NameNode's load is very high when it restarts in a large cluster,
especially once it starts processing block reports. If some DataNodes need to
reregister during this period, their registration calls are very likely to time
out, because BPServiceActor#register cannot catch an IOException that wraps a
SocketTimeoutException. In short, the NameNode processed the registration
correctly, but the DataNode timed out before receiving the response, so
#updateBlockKeysWhenStartup was never invoked.
{code:java}
  void register(NamespaceInfo nsInfo) throws IOException {
    // The handshake() phase loaded the block pool storage
    // off disk - so update the bpRegistration object from that info
    DatanodeRegistration newBpRegistration = bpos.createRegistration();

    LOG.info(this + " beginning handshake with NN");

    while (shouldRun()) {
      try {
        // Use returned registration from namenode with updated fields
        newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
        newBpRegistration.setNamespaceInfo(nsInfo);
        bpRegistration = newBpRegistration;
        break;
      } catch(EOFException e) {  // namenode might have just restarted
        LOG.info("Problem connecting to server: " + nnAddr + " :"
            + e.getLocalizedMessage());
        sleepAndLogInterrupts(1000, "connecting to server");
      } catch(SocketTimeoutException e) {  // namenode is busy
        LOG.info("Problem connecting to server: " + nnAddr);
        sleepAndLogInterrupts(1000, "connecting to server");
      }
    }

    LOG.info("Block pool " + this + " successfully registered with NN");
    bpos.registrationSucceeded(this, bpRegistration);

    // random short delay - helps scatter the BR from all DNs
    scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
    updateBlockKeysWhenStartup();
  }
{code}

In this case, any subsequent Read/Write from a client to this DataNode is
certain to throw a {{SaslException}}. HDFS-12749 tries to resolve this matter
once and for all. FYI.
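
Purely to illustrate the unwrapping idea (this is not the attached patch, and
the helper below is hypothetical), the retry loop could treat an IOException
whose cause is a SocketTimeoutException the same way it treats a bare
SocketTimeoutException, so registration is retried and
updateBlockKeysWhenStartup() still runs once the NameNode catches up:
{code:java}
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical helper, for illustration only: retry a NameNode call on
// timeouts, including timeouts wrapped inside an IOException by the RPC
// layer, instead of letting the wrapped exception escape the retry loop.
final class RetryOnTimeout {
  interface NamenodeCall<T> {
    T run() throws IOException;
  }

  static <T> T callWithRetry(NamenodeCall<T> call, long sleepMs)
      throws IOException, InterruptedException {
    while (true) {
      try {
        return call.run();
      } catch (SocketTimeoutException e) {
        // NameNode is busy; wait and try again.
        Thread.sleep(sleepMs);
      } catch (IOException e) {
        // The RPC layer may wrap the timeout; unwrap before giving up.
        if (e.getCause() instanceof SocketTimeoutException) {
          Thread.sleep(sleepMs);
        } else {
          throw e;
        }
      }
    }
  }
}
{code}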


was (Author: hexiaoqiao):
[~zhaoyunjiong], if I understand correctly, this issue was caused by a Standby
NameNode restart.
The NameNode's load is very high when it restarts in a large cluster,
especially once it starts processing block reports. If some DataNodes need to
reregister during this period, their registration calls are very likely to time
out, because BPServiceActor#register cannot catch a SocketTimeoutException. In
short, the NameNode processed the registration correctly, but the DataNode
timed out before receiving the response, so #updateBlockKeysWhenStartup was
never invoked.
{code:java}
  void register(NamespaceInfo nsInfo) throws IOException {
    // The handshake() phase loaded the block pool storage
    // off disk - so update the bpRegistration object from that info
    DatanodeRegistration newBpRegistration = bpos.createRegistration();

    LOG.info(this + " beginning handshake with NN");

    while (shouldRun()) {
      try {
        // Use returned registration from namenode with updated fields
        newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
        newBpRegistration.setNamespaceInfo(nsInfo);
        bpRegistration = newBpRegistration;
        break;
      } catch(EOFException e) {  // namenode might have just restarted
        LOG.info("Problem connecting to server: " + nnAddr + " :"
            + e.getLocalizedMessage());
        sleepAndLogInterrupts(1000, "connecting to server");
      } catch(SocketTimeoutException e) {  // namenode is busy
        LOG.info("Problem connecting to server: " + nnAddr);
        sleepAndLogInterrupts(1000, "connecting to server");
      }
    }

    LOG.info("Block pool " + this + " successfully registered with NN");
    bpos.registrationSucceeded(this, bpRegistration);

    // random short delay - helps scatter the BR from all DNs
    scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
    updateBlockKeysWhenStartup();
  }
{code}

In this case, any subsequent Read/Write from a client to this DataNode is
certain to throw a {{SaslException}}. HDFS-12749 tries to resolve this matter
once and for all. FYI.

> DataNode missed BlockKey update from NameNode due to HeartbeatResponse was 
> dropped
> --
>
> Key: HDFS-13441
> URL: https://issues.apache.org/jira/browse/HDFS-13441
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.7.1
>Reporter: yunjiong zhao
>Assignee: 

[jira] [Comment Edited] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped

2018-04-13 Thread yunjiong zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437740#comment-16437740
 ] 

yunjiong zhao edited comment on HDFS-13441 at 4/13/18 9:50 PM:
---

{quote} 

BlockKey is usually synchronized aggressively – IIRC every 2.5 hours, and 
BlockKey's life time is much longer than that (can't recall right away) so it's 
surprising to me a single missing heartbeat would cause the error you 
mentioned. There's probably something deeper you need to dig into.
{quote}
The DataNode gets block keys from the NameNode in two places.

One is when the DataNode registers with the NameNode.

The other is via heartbeat, which by default happens every 600 minutes.

By default,
{quote}
<property>
  <name>dfs.block.access.key.update.interval</name>
  <value>600</value>
  <description>
    Interval in minutes at which namenode updates its access keys.
  </description>
</property>
{quote}
{quote}
<property>
  <name>dfs.block.access.token.lifetime</name>
  <value>600</value>
  <description>The lifetime of access tokens in minutes.</description>
</property>
{quote}
[~jojochuang], I double-checked the code and the logs: the DataNode must have
missed *two* heartbeats from the Standby NameNode that carried the new block
key, just before the Standby NameNode became active.

We have more than 2000 DataNodes in this cluster; if more than three DataNodes
are missing the current block key, in the worst case users will not be able to
read some blocks for up to 10 hours (600 minutes).
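
For context, the only routine path that delivers rolled keys to a DataNode is
the KeyUpdateCommand carried in a heartbeat response; a simplified, paraphrased
sketch (not a verbatim copy of BPOfferService#processCommandFromActive) of that
handling is below. If the response carrying the command is dropped, addKeys()
is never called and the DataNode keeps serving with stale keys until the next
key roll.
{code:java}
// Simplified, paraphrased sketch of the DataNode-side handling of a
// DNA_ACCESSKEYUPDATE command; field and class names here are illustrative.
import java.io.IOException;

import org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager;
import org.apache.hadoop.hdfs.server.protocol.KeyUpdateCommand;

class HeartbeatKeyUpdateSketch {
  private final BlockPoolTokenSecretManager secretManager;
  private final String blockPoolId;
  private final boolean blockTokenEnabled;

  HeartbeatKeyUpdateSketch(BlockPoolTokenSecretManager secretManager,
      String blockPoolId, boolean blockTokenEnabled) {
    this.secretManager = secretManager;
    this.blockPoolId = blockPoolId;
    this.blockTokenEnabled = blockTokenEnabled;
  }

  /** Roughly what the DataNode does when a heartbeat carries new block keys. */
  void onKeyUpdateCommand(KeyUpdateCommand cmd) throws IOException {
    if (blockTokenEnabled) {
      secretManager.addKeys(blockPoolId, cmd.getExportedKeys());
    }
  }
}
{code}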


was (Author: zhaoyunjiong):
{quote} 

BlockKey is usually synchronized aggressively – IIRC every 2.5 hours, and 
BlockKey's life time is much longer than that (can't recall right away) so it's 
surprising to me a single missing heartbeat would cause the error you 
mentioned. There's probably something deeper you need to dig into.
{quote}
The DataNode gets block keys from the NameNode in two places.

One is when the DataNode registers with the NameNode.

The other is via heartbeat, which by default happens every 600 minutes.

By default,
{quote}
<property>
  <name>dfs.block.access.key.update.interval</name>
  <value>600</value>
  <description>
    Interval in minutes at which namenode updates its access keys.
  </description>
</property>
{quote}
{quote}
<property>
  <name>dfs.block.access.token.lifetime</name>
  <value>600</value>
  <description>The lifetime of access tokens in minutes.</description>
</property>
{quote}
 

Not any single heartbeat, of course.

The missed heartbeat must be one from the Standby NameNode that carried the new
block key, and the Standby NameNode must have transitioned to Active after that
missed heartbeat and before the next heartbeat carrying the new block keys.

We have more than 2000 DataNodes in this cluster; if more than three DataNodes
are missing the current block key, in the worst case users will not be able to
read some blocks for 10 hours.

> DataNode missed BlockKey update from NameNode due to HeartbeatResponse was 
> dropped
> --
>
> Key: HDFS-13441
> URL: https://issues.apache.org/jira/browse/HDFS-13441
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.7.1
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
>Priority: Major
> Attachments: HDFS-13441.patch
>
>
> After NameNode failover, lots of application failed due to some DataNodes 
> can't re-compute password from block token.
> {code:java}
> 2018-04-11 20:10:52,448 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error 
> processing unknown operation  src: /10.142.74.116:57404 dst: 
> /10.142.77.45:50010
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.]
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password 

[jira] [Comment Edited] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped

2018-04-12 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436649#comment-16436649
 ] 

Wei-Chiu Chuang edited comment on HDFS-13441 at 4/13/18 1:31 AM:
-

Hi [~zhaoyunjiong], thanks for reporting the issue.

BlockKeys are usually synchronized aggressively – IIRC every 2.5 hours – and a
BlockKey's lifetime is much longer than that (can't recall the exact value
right away), so it's surprising to me that a single missed heartbeat would
cause the error you mentioned. There's probably something deeper you need to
dig into.

That said, with regard to your patch:

# Please attach a test case that demonstrates the fix works in this scenario.
# It is not a good idea to match the exception's message string (see the sketch
below). As you can see in BlockTokenSecretManager, there are two places where
the exception message starts with "Can't re-compute", and the message can
change.

P.S. Take a look at HDFS-11741 and HDFS-10609 for reference when crafting your
test case.
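
A minimal illustration of the point in item 2 (the handler methods here are
hypothetical, not part of the attached patch): reacting to the exception type
is at least stable across wording changes, whereas matching the message prefix
breaks as soon as the text is edited. Note that the type check alone is
coarser, since InvalidToken also covers other failures such as expired tokens.
{code:java}
import org.apache.hadoop.security.token.SecretManager.InvalidToken;

// Hypothetical comparison for illustration only.
class TokenErrorHandling {
  // Brittle: depends on the exact wording of the message, which has two
  // sources in BlockTokenSecretManager and may change.
  static boolean isStaleKeyByMessage(Exception e) {
    return e.getMessage() != null
        && e.getMessage().startsWith("Can't re-compute");
  }

  // Stabler: depends only on the exception type, though it is coarser.
  static boolean isStaleKeyByType(Exception e) {
    return e instanceof InvalidToken;
  }
}
{code}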


was (Author: jojochuang):
Hi [~zhaoyunjiong], thanks for reporting the issue.

BlockKeys are usually synchronized aggressively – IIRC every 2.5 hours – and a
BlockKey's lifetime is much longer than that (can't recall the exact value
right away), so it's surprising to me that a single missed heartbeat would
cause the error you mentioned. There's probably something deeper you need to
dig into.

That said, with regard to your patch:

# Please attach a test case that demonstrates the fix works in this scenario.
# It is not a good idea to match the exception's message string. As you can see
in BlockTokenSecretManager, there are two places where the exception message
starts with "Can't re-compute", and the message can change.

 

> DataNode missed BlockKey update from NameNode due to HeartbeatResponse was 
> dropped
> --
>
> Key: HDFS-13441
> URL: https://issues.apache.org/jira/browse/HDFS-13441
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.7.1
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
>Priority: Major
> Attachments: HDFS-13441.patch
>
>
> After NameNode failover, lots of application failed due to some DataNodes 
> can't re-compute password from block token.
> {code:java}
> 2018-04-11 20:10:52,448 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error 
> processing unknown operation  src: /10.142.74.116:57404 dst: 
> /10.142.77.45:50010
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.]
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382)
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>         at 
>