[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2019-10-21 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956471#comment-16956471
 ] 

Hudson commented on HDFS-12749:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17559 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17559/])
HDFS-12749. DN may not send block report to NN after NN restart. (kihwal: rev 
c4e27ef7735acd6f91b73d2ecb0227f8dd75a2e4)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java


> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749-trunk.006.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2019-10-21 Thread Kihwal Lee (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956462#comment-16956462
 ] 

Kihwal Lee commented on HDFS-12749:
---

I am very sorry this has fell through a crack.  
+1 on the version 6. Committing it soon.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749-trunk.006.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2019-08-22 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913143#comment-16913143
 ] 

Hadoop QA commented on HDFS-12749:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
33s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 59s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 47s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 95m 40s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}151m 41s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy 
|
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.TestMultipleNNPortQOP |
|   | hadoop.hdfs.server.namenode.TestReconstructStripedBlocks |
|   | hadoop.hdfs.TestDFSInotifyEventInputStreamKerberized |
|   | hadoop.hdfs.TestSafeModeWithStripedFileWithRandomECPolicy |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:bdbca0e |
| JIRA Issue | HDFS-12749 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12978261/HDFS-12749-trunk.006.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 2dc1bb129298 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 5e156b9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| unit | 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2019-08-22 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913002#comment-16913002
 ] 

He Xiaoqiao commented on HDFS-12749:


attach [^HDFS-12749-trunk.006.patch] based on trunk without unit test.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749-trunk.006.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2019-08-21 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912960#comment-16912960
 ] 

He Xiaoqiao commented on HDFS-12749:


[~jojochuang] Thanks for your feedback, and sorry for late response. 005 patch 
may have some conflict to trunk, I would like to supply new one same as 005.
{quote}i am not sure i understand the test case. Additionally, if the fix is 
removed, the test doesn't fail, so i am not sure if the test case is 
useful.{quote}
somethings changes? will try fix it. Thanks again. 

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2019-08-20 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911596#comment-16911596
 ] 

Wei-Chiu Chuang commented on HDFS-12749:


Not much progress -- I'll commit 005 patch later today.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2019-08-09 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904090#comment-16904090
 ] 

Wei-Chiu Chuang commented on HDFS-12749:


I am late in the game. Reviewing the patch  [^HDFS-12749-trunk.005.patch] , i 
am not sure i understand the test case. Additionally, if the fix is removed, 
the test doesn't fail, so i am not sure if the test case is useful.

Since [~kihwal] stated the fix itself it good, how about we commit the patch 
without test? Not ideal without a test, but I don't want to block a critical 
bug fix (IMO, this is critical) only because there's no test.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2019-07-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889276#comment-16889276
 ] 

Hadoop QA commented on HDFS-12749:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  8m 
16s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 45m 
 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
20m 31s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 47s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 2 new + 47 unchanged - 0 fixed = 49 total (was 47) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 30s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}116m 55s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}222m  2s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner |
|   | hadoop.hdfs.server.namenode.TestPersistentStoragePolicySatisfier |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailureWithRandomECPolicy |
|   | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=18.09.7 Server=18.09.7 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | HDFS-12749 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12925261/HDFS-12749-trunk.005.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 4e8139876ee9 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 
22:49:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7f1b76c |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HDFS-Build/27266/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| unit | 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-11-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691789#comment-16691789
 ] 

Hadoop QA commented on HDFS-12749:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  5m  
3s{color} | {color:red} root in trunk failed. {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 41s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 43s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 2 new + 47 unchanged - 0 fixed = 49 total (was 47) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 48s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}214m 29s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}256m 30s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.protocol.datatransfer.sasl.TestSaslDataTransfer |
|   | hadoop.hdfs.server.namenode.TestAuditLoggerWithCommands |
|   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
|   | hadoop.hdfs.server.namenode.TestReconstructStripedBlocks |
|   | hadoop.hdfs.server.namenode.TestFileContextAcl |
|   | hadoop.hdfs.TestClientReportBadBlock |
|   | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
|   | hadoop.hdfs.TestDFSStripedOutputStream |
|   | hadoop.hdfs.TestReservedRawPaths |
|   | hadoop.hdfs.server.mover.TestMover |
|   | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
|   | hadoop.hdfs.server.namenode.TestFsck |
|   | 
hadoop.hdfs.server.namenode.snapshot.TestINodeFileUnderConstructionWithSnapshot 
|
|   | hadoop.hdfs.TestUnsetAndChangeDirectoryEcPolicy |
|   | hadoop.hdfs.server.namenode.TestDiskspaceQuotaUpdate |
|   | hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics |
|   | hadoop.hdfs.server.namenode.TestBackupNode |
|   | hadoop.hdfs.client.impl.TestClientBlockVerification |
|   | hadoop.hdfs.tools.TestViewFSStoragePolicyCommands |
|   | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS |
|   | 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-11-19 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691520#comment-16691520
 ] 

He Xiaoqiao commented on HDFS-12749:


[~kihwal],[~xkrogen] do you mind anymore review to push this issue forward.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-05-26 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491844#comment-16491844
 ] 

genericqa commented on HDFS-12749:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
37s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 10s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 47s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 2 new + 48 unchanged - 0 fixed = 50 total (was 48) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 29s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}102m 21s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}165m 19s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.TestSafeModeWithStripedFileWithRandomECPolicy |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | HDFS-12749 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12925261/HDFS-12749-trunk.005.patch
 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux fb7ad56ec141 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 
10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0cf6e87 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_162 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HDFS-Build/24308/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/24308/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-05-26 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491812#comment-16491812
 ] 

He Xiaoqiao commented on HDFS-12749:


thanks [~kihwal], v005 patch is with test case following your suggestions, FYI. 
I looks forward to your feedback.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, 
> HDFS-12749-trunk.005.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-05-22 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484503#comment-16484503
 ] 

Kihwal Lee commented on HDFS-12749:
---

The patch looks good, but a test case would still be good. We can test whether 
the exception handling in non-startup registration is working. That is not 
directly testing the reported issue, but will check whether the new code is 
working.

Here is what I think can be done.
 - Setup up a mini dfs cluster with an include host file. Refresh/reload if 
necessary
 - Remove a datanode from the include file.
 - Force re-registration of the datanode.
 - In detailedRpc metrics, {{DisallowedDatanodeExceptionNumOps}} should not go 
up. I.e. the BP actor thread should shutdown.
 The existing behavior is that the BP actor thread retries forever despite the 
exception. Your patch also fixes this issue by handling exceptions properly in 
DNA_REGISTER command processing.

See various existing test cases such as 
{{TestDatanodeRegistration#testForcedRegistration}} for hints. See test cases 
using {{HostsFileWriter}} for examples of include host file and refreshing it. 
Decommissioning tests use it.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-04-18 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442942#comment-16442942
 ] 

He Xiaoqiao commented on HDFS-12749:


ping [~kihwal],[~daryn],[~arpitagarwal],[~ajayydv] do you mind having a look?

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-04-03 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424402#comment-16424402
 ] 

genericqa commented on HDFS-12749:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 16m 
31s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green}  
9m 49s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green}  
9m 16s{color} | {color:green} patch has no errors when building and testing our 
client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}105m 22s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
19s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}174m  3s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
|   | hadoop.hdfs.TestSafeModeWithStripedFileWithRandomECPolicy |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8620d2b |
| JIRA Issue | HDFS-12749 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12917385/HDFS-12749-trunk.004.patch
 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 5cca398ec0d7 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 
11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 5a174f8 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_162 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/23763/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/23763/testReport/ |
| Max. process+thread count | 3466 (vs. ulimit of 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-04-03 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424184#comment-16424184
 ] 

He Xiaoqiao commented on HDFS-12749:


update patch v4 and fix bug based comments and without testcase.
[~kihwal], I don't find a grace way to test the scenario that NN correctly 
processed the registration, but the DN timed out before receiving the response, 
do you have a good idea?
Thanks again.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749-trunk.004.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-04-02 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422668#comment-16422668
 ] 

He Xiaoqiao commented on HDFS-12749:


[~kihwal] Thanks for your feedback. I misunderstood your idea above. It is 
indeed to fix in the catch block of {{processCommand()}},  I will solve this 
issue in couple of days.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-03-30 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420758#comment-16420758
 ] 

Kihwal Lee commented on HDFS-12749:
---

The {{register()}} method is called from two different contexts. BP service 
actor loop and command processing (through {{reRegister()}}). The quoted code 
above is from the actor loop.  In the {{catch}} block of {{processCommand()}}, 
you can add an equivalent logic. 

{code}
  boolean processCommand(DatanodeCommand[] cmds) {
if (cmds != null) {
  for (DatanodeCommand cmd : cmds) {
try {
  if (bpos.processCommandFromActor(cmd, this) == false) {
return false;
  }
} catch(RemoteException re) {
  String reClass = re.getClassName();
  if (UnregisteredNodeException.class.getName().equals(reClass) ||
  DisallowedDatanodeException.class.getName().equals(reClass) ||
  IncorrectVersionException.class.getName().equals(reClass)) {
LOG.warn(this + " is shutting down", re);
shouldServiceRun = false;
return false;
  }
} catch (IOException ioe) {
  LOG.warn("Error processing datanode Command", ioe);
}
  }
}
return true;
  }
{code}

Your version of {{register()}} will retry on {{IOException}}, which includes 
IPC timeout. If the NN explicitly throws a {{RemoteException}}, this will be 
caught and rethrown in your updated {{register()}}.  This will bubble up to 
{{processCommand()}} above, which will stop the service actor if the condition 
is terminal.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-03-28 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417802#comment-16417802
 ] 

He Xiaoqiao commented on HDFS-12749:


Thanks [~kihwal] for your detailed comment.
{quote}You can keep your change in register() and simply add the same logic to 
the processCommand()'s catch block. I.e. crack open the RemoteException and 
stop the actor thread if it is one of the terminal exceptions.{quote}
I think it may not be able to resolve this issue when catch and crack 
{{RemoteException}} for #processCommand, since #processCommand throws 
{{IOException}} which wrap {{SocketTimeoutException}} as [~tanyuxin] mentioned 
in description
{quote}java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel{quote}
According to your suggestions, is it better to create new issue to push that 
stop the actor thread if it meet some fatal or terminal exceptions? 

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-03-27 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416048#comment-16416048
 ] 

Kihwal Lee commented on HDFS-12749:
---

The latest patch is a step in the right direction. Some of the exceptions 
thrown by namenode to datanode are fatal and terminal. I.e. retry will never 
work.  The catch block of {{offerService()}} takes this into account. Simply 
throwing one in {{register()}} will get ignored in 
{{BPServiceActor#processCommand()}}. {{shouldServiceRun}} needs to be set false 
in order to stop the actor thread.

{code:java}
 } catch(RemoteException re) {
String reClass = re.getClassName();
if (UnregisteredNodeException.class.getName().equals(reClass) ||
DisallowedDatanodeException.class.getName().equals(reClass) ||
IncorrectVersionException.class.getName().equals(reClass)) {
  LOG.warn(this + " is shutting down", re);
  shouldServiceRun = false;
  return;
}
LOG.warn("RemoteException in offerService", re);
sleepAfterException();
  } catch (IOException e) {
LOG.warn("IOException in offerService", e);
sleepAfterException();
  }
 {code}

You can keep your change in {{register()}} and simply add the same logic to the 
{{processCommand()}}'s catch block. I.e. crack open the {{RemoteException}} and 
stop the actor thread if it is one of the terminal exceptions.

I know it's hard, but it will be nice if you can add a test case.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-03-23 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411417#comment-16411417
 ] 

Kihwal Lee commented on HDFS-12749:
---

Sorry for the delay. I will review it in a couple of days.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But NameNode has processed registerDatanode successfully, so it won't ask DN 
> to 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-03-05 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386430#comment-16386430
 ] 

He Xiaoqiao commented on HDFS-12749:


[~kihwal] could you help to review or give some suggestions?

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But NameNode has processed registerDatanode successfully, so it won't ask DN 
> to 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-03-01 Thread TanYuxin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383110#comment-16383110
 ] 

TanYuxin commented on HDFS-12749:
-

Thanks [~hexiaoqiao]  [~xkrogen] very much for reviewing and resolving the 
issue. I think v003 patch committed by [~hexiaoqiao] is more effective and 
simpler. 
Anyone mind having a review?  In our production cluster, the problem has been 
fixed for months by the proposed patch.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-03-01 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382040#comment-16382040
 ] 

He Xiaoqiao commented on HDFS-12749:


Thanks [~xkrogen] for your review all the same.
{quote}I am not satisfied that we fully understand the over-wrapped 
exception{quote}
[~xkrogen] do you have any different opinions? look forwards to your share. 
thanks again.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-28 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380482#comment-16380482
 ] 

Erik Krogen commented on HDFS-12749:


I think the v003 patch is fine to resolve this issue. I am not a committer so 
cannot help you commit based on my review.

I am not satisfied that we fully understand the over-wrapped exception, but 
that is not critical for this bug fix. Maybe let's open a new JIRA to discuss 
resolution.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-28 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380179#comment-16380179
 ] 

He Xiaoqiao commented on HDFS-12749:


ping [~xkrogen],[~kihwal] any suggestions or feedback for this issue?

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But NameNode has processed registerDatanode successfully, so it won't ask DN 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-23 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375299#comment-16375299
 ] 

He Xiaoqiao commented on HDFS-12749:


[~xkrogen] Thanks for your review and comments.
{quote}Are you running without Kerberos? Do you see a relevant WARN statement 
from the log for ipc.Client?
{quote}
a. This is security cluster with Kerberos,
 b. All relevant WARN or EXCEPTION depict as [~tanyuxin] mentioned above 
(description & second comment.)

Based on the exception logs that [~tanyuxin] provided, I think 
{{SocketTimeoutException}} was over-wrapped in extra {{IOException}} by 
{{Client#cleanupCalls}}, the following notes based branch-2.7 (maybe I am 
wrong, if that please correct me.)
 a. Client#call (line:1448) throws {{IOException}} which wrapped 
{{SocketTimeoutException}} when {{call.error}} is not null and it is not 
instance of {{RemoteException}}, thus this exception is wrapped by 
{{NetUtils#wrapException}}:
{code:java}
  public Writable call(RPC.RpcKind rpcKind, Writable rpcRequest,
  ConnectionId remoteId, int serviceClass,
  AtomicBoolean fallbackToSimpleAuth) throws IOException {
final Call call = createCall(rpcKind, rpcRequest);
Connection connection = getConnection(remoteId, call, serviceClass,
  fallbackToSimpleAuth);
try {
  connection.sendRpcRequest(call); // send the rpc request
} catch (RejectedExecutionException e) {
  throw new IOException("connection has been closed", e);
} catch (InterruptedException e) {
  Thread.currentThread().interrupt();
  LOG.warn("interrupted waiting to send rpc request to server", e);
  throw new IOException(e);
}

synchronized (call) {
  while (!call.done) {
try {
  call.wait();   // wait for the result
} catch (InterruptedException ie) {
  Thread.currentThread().interrupt();
  throw new InterruptedIOException("Call interrupted");
}
  }

  if (call.error != null) {
if (call.error instanceof RemoteException) {
  call.error.fillInStackTrace();
  throw call.error;
} else { // local exception
  InetSocketAddress address = connection.getRemoteAddress();
  throw NetUtils.wrapException(address.getHostName(),
  address.getPort(),
  NetUtils.getHostname(),
  0,
  call.error);
}
  } else {
return call.getRpcResponse();
  }
}
  }
{code}
b. {{NetUtils#wrapException}} can distinguish {{SocketTimeoutException}} if 
{{call.error}} is instance of, but not actually so logs `Failed on local 
exception: java.io.IOException: ...`
 c. {{call.error}} is set only by client#setException which invoked by 
{{Client#receiveRpcResponse}} and {{Client#cleanupCalls}}, however 
{{call.error}} is set RemoteException always in {{Client#receiveRpcResponse}}. 
Evidently, the only possibility is that {{SocketTimeoutException}} was 
over-wrapped in {{IOException}} by {{Client#cleanupCalls}}.
 d.The key point in {{Client#cleanupCalls}} is {{#closeException}} which is set 
by {{Client#markClosed}} invoked by {{Client#sendRpcRequest}} and it catch all 
{{IOException}} then set {{#closeException}} equal it.
{code:java}
public void sendRpcRequest(final Call call)
throws InterruptedException, IOException {
  ..
  synchronized (sendRpcRequestLock) {
Future senderFuture = sendParamsExecutor.submit(new Runnable() {
  @Override
  public void run() {
try {
  ..
} catch (IOException e) {
  // exception at this point would leave the connection in an
  // unrecoverable state (eg half a call left on the wire).
  // So, close the connection, killing any outstanding calls
  markClosed(e);
} finally {
  //the buffer is just an in-memory buffer, but it is still polite 
to
  // close early
  IOUtils.closeStream(d);
}
  }
});
.
  }
}
{code}
[~xkrogen],[~kihwal] any suggestions? 

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-23 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374858#comment-16374858
 ] 

Erik Krogen commented on HDFS-12749:


v003 patch LGTM! Thanks for pushing this [~hexiaoqiao] and sorry for the slow 
response.

I wanted to dig deeper to understand why the {{SocketTimeoutException}} was 
over-wrapped in an extra {{IOException}}, which was also a cause of this 
problem. As far as I can tell, it comes from this code block within 
{{Client.Connection#setupIOStreams}}:
{code}
try {
  authMethod = ticket
  .doAs(new PrivilegedExceptionAction() {
@Override
public AuthMethod run()
throws IOException, InterruptedException {
  return setupSaslConnection(ipcStreams);
}
  });
} catch (IOException ex) {
  if (saslRpcClient == null) {
// whatever happened -it can't be handled, so rethrow
throw ex;
  }
  // otherwise, assume a connection problem
  authMethod = saslRpcClient.getAuthMethod();
  if (rand == null) {
rand = new Random();
  }
  handleSaslConnectionFailure(numRetries++, maxRetriesOnSasl, ex,
  rand, ticket);
  continue;
}
{code}
The signature of {{handleSaslConnectionFailure}} accepts an {{Exception}} 
rather than {{IOException}} (even though this is its only usage), so it 
additionally wraps this in a new {{IOException}}:
{code}
private synchronized void handleSaslConnectionFailure(
final int currRetries, final int maxRetries, final Exception ex,
final Random rand, final UserGroupInformation ugi) throws IOException,
InterruptedException {
  ...
  if (shouldAuthenticateOverKrb()) {
if (currRetries < maxRetries) {
  ...
  return null;
} else {
  ...
  throw (IOException) new IOException(msg).initCause(ex);
}
  } else {
LOG.warn("Exception encountered while connecting to "
+ "the server : " + ex);
  }
  if (ex instanceof RemoteException)
throw (RemoteException) ex;
  throw new IOException(ex);
}
  });
}
{code}
It looks like the case where it wraps with no additional message would only 
occur in an unkerberized cluster. Are you running without Kerberos? In any case 
I think we should modify {{handleSaslConnectionFailure}} to accept 
{{IOException}} and avoid the additional wrap.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-23 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374515#comment-16374515
 ] 

genericqa commented on HDFS-12749:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 32s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 54s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}120m 34s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
23s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}172m 39s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
|   | hadoop.hdfs.server.namenode.TestReencryptionWithKMS |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
| JIRA Issue | HDFS-12749 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12911717/HDFS-12749-trunk.003.patch
 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux bd626db8f0c1 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 
11:55:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / d1cd573 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/23169/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/23169/testReport/ |
| Max. process+thread count | 3002 (vs. ulimit 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-23 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374295#comment-16374295
 ] 

He Xiaoqiao commented on HDFS-12749:


Thanks to [~kihwal] for review and comments.
{quote}It seems okay to retry there, but it should not unconditionally retry on 
all IOException{quote}
V2 does not make sense when catch all {{IOException}} and retry, especially 
{{DataNode}} register at the first time when restart. I just submit V3 based 
trunk and separate to catch {{RemoteException}} and throw up. Thanks again 
[~kihwal]. Please correct me If there is something wrong.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-22 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373667#comment-16373667
 ] 

Kihwal Lee commented on HDFS-12749:
---

bq. About catching IOException: if the registration fails with a 
RemoteException that is not RetriableException, the actor may need to stop 
instead of retrying.
It seems okay to retry there, but it should not unconditionally retry on all 
{{IOException}}.   Take a look at the catch block in {{offerService()}}.   Also 
the initial patch should be submitted against trunk.


> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-22 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372749#comment-16372749
 ] 

He Xiaoqiao commented on HDFS-12749:


check failed UT(#TestFileCreationEmpty) and test locally, it seems to work fine 
and might not relate to this patch.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But NameNode has processed registerDatanode successfully, so 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-22 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372717#comment-16372717
 ] 

genericqa commented on HDFS-12749:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 16m 
39s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-2.7 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
 0s{color} | {color:green} branch-2.7 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} branch-2.7 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} branch-2.7 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
0s{color} | {color:green} branch-2.7 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
57s{color} | {color:green} branch-2.7 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
43s{color} | {color:green} branch-2.7 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 60 line(s) that end in whitespace. Use 
git apply --whitespace=fix <>. Refer 
https://git-scm.com/docs/git-apply {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
42s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 94m 48s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  1m 
12s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}135m 28s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Unreaped Processes | hadoop-hdfs:24 |
| Failed junit tests | hadoop.hdfs.TestFileCreationEmpty |
| Timed out junit tests | org.apache.hadoop.hdfs.TestDatanodeRegistration |
|   | org.apache.hadoop.hdfs.TestDFSClientFailover |
|   | org.apache.hadoop.hdfs.TestSetrepIncreasing |
|   | org.apache.hadoop.hdfs.TestDatanodeDeath |
|   | org.apache.hadoop.hdfs.TestDFSClientRetries |
|   | org.apache.hadoop.hdfs.TestDFSFinalize |
|   | org.apache.hadoop.hdfs.TestHDFSFileSystemContract |
|   | org.apache.hadoop.hdfs.TestDatanodeStartupFixesLegacyStorageIDs |
|   | org.apache.hadoop.hdfs.TestDecommission |
|   | org.apache.hadoop.hdfs.TestSeekBug |
|   | org.apache.hadoop.hdfs.TestDatanodeReport |
|   | org.apache.hadoop.hdfs.web.TestWebHDFS |
|   | org.apache.hadoop.hdfs.web.TestWebHDFSXAttr |
|   | org.apache.hadoop.hdfs.TestDFSRollback |
|   | org.apache.hadoop.hdfs.TestMiniDFSCluster |
|   | org.apache.hadoop.hdfs.TestDistributedFileSystem |
|   | org.apache.hadoop.hdfs.web.TestWebHDFSForHA |
|   | org.apache.hadoop.hdfs.TestDFSClientExcludedNodes |
|   | org.apache.hadoop.hdfs.TestBalancerBandwidth |
|   | org.apache.hadoop.hdfs.TestSetTimes |
|   | org.apache.hadoop.hdfs.TestDFSShell |
|   | org.apache.hadoop.hdfs.TestDataTransferProtocol |
|   | org.apache.hadoop.hdfs.web.TestWebHDFSAcl |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ea57d10 |
| JIRA Issue | HDFS-12749 |
| 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-22 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372595#comment-16372595
 ] 

He Xiaoqiao commented on HDFS-12749:


ping [~kihwal], [~xkrogen]
I upload new patch rebase branch-2.7, FYI. 

The fundamental reason of this issue is {{BPServiceActor#register}} not catch 
{{IOException}}. 
trigger conditions:
a. Restart {{NameNode}}, rather than restart {{DataNode}},
b. high load of NameNode when restart, (it is easily reappeared in a large 
scale cluster)

For [~tanyuxin] reported case,
when NameNode restart, all datanodes in cluster need to re-retrieve namespace 
info (#versionRequest) and re-register (#registerDatanode), but all RPC request 
from DataNode to NameNode may meet SocketTimeoutException or IOException due to 
high load of NameNode. while {{BPServiceActor#register}} meet IOException, it 
will continue to throw up until catch by {{BPServiceActor#processCommand}} and 
return true, however block report not be scheduled since meet exception before, 
so DataNode will wait for most {{blockReportInterval}} to send block report to 
{{NameNode}}.

NOTE: {{BPServiceActor#retrieveNamespaceInfo}} catch IOException when invoke 
{{bpNamenode.versionRequest()}} for reference:
{code:java}
while (shouldRun()) {
  try {
nsInfo = bpNamenode.versionRequest();
LOG.debug(this + " received versionRequest response: " + nsInfo);
break;
  } catch(SocketTimeoutException e) {  // namenode is busy
LOG.warn("Problem connecting to server: " + nnAddr);
  } catch(IOException e ) {  // namenode is not available
LOG.warn("Problem connecting to server: " + nnAddr);
  }
  
  // try again in a second
  sleepAndLogInterrupts(5000, "requesting version info from NN");
}
{code}

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-14 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365105#comment-16365105
 ] 

He Xiaoqiao commented on HDFS-12749:


[~kihwal],[~xkrogen] Thanks for your comments.
{quote}the problem is that the NN correctly processed the registration, but the 
DN timed out before receiving the response. Since from NN point of view the 
registration was complete, it did not send another DNA_REGISTER command.{quote}
thanks for your additional comments [~xkrogen].
Since NN considers DN register successfully but DN timeout before receiving 
response, and DN would not get DNA_REGISTER command from NN at next heartbeat 
intervel. If as except, DN should catch exception in 
{{BPServiceActor#register}} and retry until register successfully as catching 
{{SocketTimeoutException}} or {{EOFException}} currently. 

One question is why underlying RPC not throw {{SocketTimeoutException}} but 
{{IOException}}, After trace invoke stack, I find {{NetUtils#wrapException}} 
who invoke by {{Client#call}} line 1474 may be the main reason.
{code:java}
  if (call.error != null) {
if (call.error instanceof RemoteException) {
  call.error.fillInStackTrace();
  throw call.error;
} else { // local exception
  InetSocketAddress address = connection.getRemoteAddress();
  throw NetUtils.wrapException(address.getHostName(),
  address.getPort(),
  NetUtils.getHostname(),
  0,
  call.error);
}
{code}
but I am confused why {{call.error}} set IOException and throw upper lead to 
{{BPServiceActor#register}} not catch it. 
After reviewed HDFS-8995 I think it could not fix this issue.

Anyway, I agree to fix ti with catching IOException in 
{{BPServiceActor#register}} and retry until register successfully.
[~kihwal],[~xkrogen] do you mind having a look?

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-14 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364731#comment-16364731
 ] 

Kihwal Lee commented on HDFS-12749:
---

I see. The old registration looks identical to the new one, so NN still accepts 
it.  About catching IOException: if the registration fails with a 
RemoteException that is not RetriableException, the actor may need to stop 
instead of retrying. Also, if we choose to blank something out before trying to 
re-register to re-trigger registration, we should avoid hitting something like 
HDFS-8995.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-14 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364520#comment-16364520
 ] 

Erik Krogen commented on HDFS-12749:


[~kihwal], IIUC, the problem is that the NN correctly processed the 
registration, but the DN got an IOException during processing. Since from NN 
point of view the registration was complete, it did not send another 
DNA_REGISTER command. 

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But NameNode has processed 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-14 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364508#comment-16364508
 ] 

Kihwal Lee commented on HDFS-12749:
---

What was the command that failed when you saw "Error processing datanode 
Command"?  If the processing of DNA_REGISTER blew up with an IOException, the 
DN would have gotten the command again in the next hearbeat and retried. The 
block token secret is updated (DNA_ACCESSKEYUPDATE)  in the first heartbeat 
after registration. There can be other commands along with it, but failure of 
processing a command does not abort the whole processing, so that shouldn't 
matter.  We need to understand where exactly it went wrong.  More detailed 
exception/stack traces, thread name (so that we can tell which actor thread it 
came from), timing, etc. will be helpful.



 

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-14 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364356#comment-16364356
 ] 

Erik Krogen commented on HDFS-12749:


Hey [~hexiaoqiao], I have to admit I'm not familiar with this portion of the 
codebase. But it seems that the underlying issue is still a 
{{SocketTimeoutException}}, which we are trying to catch. Is this just an issue 
of an exception being over-wrapped? If it is over-wrapped in this case, is 
there any case where it won't be, i.e. is that catch statement actually useful 
as-is?

It does seem that probably we want to catch all {{IOException}} here, but like 
I said I'm not familiar with this area and don't know if there is a good reason 
not to. Maybe re-ping [~kihwal] - I definitely agree that with proper tuning 
this issue shouldn't happen, but it does seem that the original complaint is 
valid and that this could be more robust. 

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-14 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364305#comment-16364305
 ] 

genericqa commented on HDFS-12749:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  7s{color} 
| {color:red} HDFS-12749 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-12749 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12895403/HDFS-12749.001.patch |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/23067/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
>   

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-02-14 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364280#comment-16364280
 ] 

He Xiaoqiao commented on HDFS-12749:


ping [~xkrogen]
I also had seen this problems in our big production cluster, and I think this 
is a blocked issue.
The issue was original reported by my colleague [~tanyuxin] and we applied this 
patch to our production cluster based branch-2.7, the problems as mentioned 
above disappears.

I think the description of this ticket may lead to little ambiguity. I would 
like to offer more information,

a) {{BPServiceActor#register}} only catch exception {{EOFException}} and 
{{SocketTimeoutException}} (both are subclass of IOException) when register to 
NameNode.
b) when the request pass to under {{RPC}} layer, {{Client#call}} (line 1448) 
may throw many type exceptions, such as {{InterruptedIOException}}, 
{{IOException}}, etc. of course for different exception reasons.
c) Load of NameNode may very high during restarting process, especially in big 
cluster. when datanode reregister to namenode at this time, {{RPC#Client}} may 
throw IOException (not subclass of IOException) to {{BPServiceActor#register}} 
as [~tanyuxin] describe above.
d) The IOException will be catch by {{BPServiceActor#processCommand}} and print 
warn log {{`WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error 
processing datanode Command`}} but not reinit context of {{BlockPool}} when 
reregister and continue run, so {{BlockReport}} will not be scheduled 
immediately (when DataNode send {{BlockReport}} to restarting NameNode based on 
the last {{BlockReport}} time.)
e) Actually, There are other subsequent problems besides NOT schedule 
{{BlockReport}} immediately. Such as block token secret keys not update 
correctly and client can not read {{Blocks}} from this DataNode even if hold 
the correct BlockToken.

I have reviewed patch and one minor suggestion,
a) please rebase an active branch (such as branch-2.7) and offer new patch;
b) it will be better if add new UnitTest for this fix.

If I am wrong please correct me. [~tanyuxin]

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2017-11-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236348#comment-16236348
 ] 

Hadoop QA commented on HDFS-12749:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 16m 
35s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 30s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 41s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 10 new + 469 unchanged - 0 fixed = 479 total (was 469) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m  4s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}131m 46s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
46s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}200m 53s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDFSStripedOutputStreamWithRandomECPolicy 
|
|   | hadoop.fs.contract.hdfs.TestHDFSContractRename |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure060 |
|   | hadoop.hdfs.TestDataTransferKeepalive |
|   | hadoop.cli.TestXAttrCLI |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure020 |
|   | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure130 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure070 |
|   | hadoop.hdfs.TestFileChecksum |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
| JIRA Issue | HDFS-12749 |
| GITHUB PR | https://github.com/apache/hadoop/pull/288 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux f6bf160befe5 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 
12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2017-11-02 Thread TanYuxin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235320#comment-16235320
 ] 

TanYuxin commented on HDFS-12749:
-

Thanks @kihwal very much for the comments. 
DN will retry RPC calls when timeout, but DN won't send BlockReport in 
BPserviceActor#register if it got an IOException.
Here is the scene occurred in our cluseter.
1. DN BlockReport Interval is set to 10 hours.
2. Restart NN at 1 hours after DN sent last BR.
3. DN re-register to NN, but DN get an IOException and won't send BR 
immediately. 
4. NN will receive DN's BR after 9 more hours later, which is too long.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>Priority: Major
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2017-11-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235302#comment-16235302
 ] 

ASF GitHub Bot commented on HDFS-12749:
---

GitHub user yuxintan opened a pull request:

https://github.com/apache/hadoop/pull/288

HDFS-12749. Catch IOException to schedule BlockReport correctly when …

…DN re-register. (Contributed by TanYuxin)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yuxintan/hadoop HDFS-12749

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hadoop/pull/288.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #288


commit 1a7a53d70c3f0e68fb37e3c953e0dea6d4662b70
Author: tanyuxin 
Date:   2017-11-02T05:54:51Z

HDFS-12749. Catch IOException to schedule BlockReport correctly when DN 
re-register. (Contributed by TanYuxin)




> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>Priority: Major
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227402#comment-16227402
 ] 

Kihwal Lee commented on HDFS-12749:
---

The rpc calls are timing out because the NN is not able to serve them fast 
enough. Influx of full block reports can slow things down.  First of all, they 
can timeout but will be retried. I.e. datanodes will retransmit.  Also, make 
sure your NN is configured correctly. E.g. tcp listen queue size, # of 
handlers, etc.  are enough to absorb surges in requests.  If there are too many 
large block reports, it may not be possible to completely avoid 
timeout-retransmission. This also increases the amount/size of objects sitting 
in the heap and potentially promoted to the old gen prematurely, increasing 
chance of a full GC.   To lessen the memory pressure during the block report 
surges, config your cluster to break down full block report down to individual 
storage level. That way, each RPC will be smaller.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The un-catched IOException breaks BPServiceActor#register, and the Block 
> Report can not be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226405#comment-16226405
 ] 

TanYuxin commented on HDFS-12749:
-


{code}
// DN and NN log and timeline about DN re-register
// DN start re-register:
2017-10-26 12:59:35,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeCommand action : DNA_REGISTER from Namenode_Host/IP:Port with standby 
state

// DN SocketTimeoutException and retry to register
2017-10-26 13:02:08,497 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Problem connecting to server: Namenode_Host/IP:Port
2017-10-26 13:06:34,204 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Problem connecting to server: Namenode_Host/IP:Port

// NN Successfully register the DN
2017-10-26 13:07:24,325 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(DataNode_IP:Port, 
datanodeUuid=DataNodeUuid, infoPort=DataNode_port, infoSecurePort=0, 
ipcPort=IPCPort, storageInfo=lv=-57;cid=CID-;nsid=NSID;c=0) storage Storage_ID

// DN get IOExecption:
2017-10-26 13:07:35,265 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Error processing datanode Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/DataNode_IP:Port remote=Namenode_Host/IP:Port]; Host Details : local 
host is: "DataNode_Host/IP"; destination host is: "NameNode_Host":Port;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(
...
{code}


> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After NN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread He Xiaoqiao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226339#comment-16226339
 ] 

He Xiaoqiao commented on HDFS-12749:


[~tanyuxin] could you attach some more troubleshoot or exception logs?

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> Now our cluster have thousands of DN, millions of files and blocks. When NN 
> restart, NN's load is very high.
> After SNN restart,DN will call BPServiceActor#reRegister method to register. 
> But register RPC will get a IOException since NN is busy dealing with Block 
> Report.  The exception is caught at BPServiceActor#processCommand.
> Next is the caught IOException:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/IP:Port remote=NameNode/IP:Port]; Host Details : local host is: 
> "DataNode/Datanode_ip"; destination host is: "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> If encountering a IOException in BPServiceActor#register,  
> scheduler.scheduleBlockReport method can't be run, and the Block Report will 
> not be sent immediately. 
> But NN has get the register RPC, and successfully register the DN. So NN will 
> not make DN register again at next HeartBeat, which makes DN Block Report  is 
> not sent correctly after register. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org