[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-10-02 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943165#comment-16943165
 ] 

Wei-Chiu Chuang commented on HDFS-14527:


The patch applies cleanly to branch-3.2 as well, but it doesn't compile on branch-3.1. I'll provide a patch shortly.

> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch, 
> HDFS-14527.003.patch, HDFS-14527.004.patch, HDFS-14527.005.patch
>
>
> If we stop all DataNodes in the cluster, BlockPlacementPolicyDefault#chooseTarget 
> may hit an ArithmeticException when calling #getMaxNodesPerRack, which throws 
> the runtime exception out to BlockManager's ReplicationMonitor thread and 
> then terminates the NN.
> The root cause is that BlockPlacementPolicyDefault#chooseTarget does not hold 
> the global lock, so if all DataNodes die between the 
> {{clusterMap.getNumOfLeaves()}} check and the call to {{getMaxNodesPerRack}}, 
> then {{getMaxNodesPerRack}} divides by zero and throws {{ArithmeticException}}.
> {code:java}
>   private DatanodeStorageInfo[] chooseTarget(int numOfReplicas,
>       Node writer,
>       List<DatanodeStorageInfo> chosenStorage,
>       boolean returnChosenNodes,
>       Set<Node> excludedNodes,
>       long blocksize,
>       final BlockStoragePolicy storagePolicy,
>       EnumSet<AddBlockFlag> addBlockFlags,
>       EnumMap<StorageType, Integer> sTypes) {
>     if (numOfReplicas == 0 || clusterMap.getNumOfLeaves() == 0) {
>       return DatanodeStorageInfo.EMPTY_ARRAY;
>     }
>     ..
>     int[] result = getMaxNodesPerRack(chosenStorage.size(), numOfReplicas);
>     ..
>   }
> {code}
> The detailed log is shown below.
> {code:java}
> 2019-05-31 12:29:21,803 ERROR 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception. 
> java.lang.ArithmeticException: / by zero
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getMaxNodesPerRack(BlockPlacementPolicyDefault.java:282)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:228)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:132)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:4533)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$1800(BlockManager.java:4493)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1954)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1830)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4453)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4388)
> at java.lang.Thread.run(Thread.java:745)
> 2019-05-31 12:29:21,805 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1
> {code}
> To be honest, this is not a serious bug and is not easy to reproduce: if we 
> stop all DataNodes and keep only the NameNode alive, HDFS cannot offer 
> service normally anyway, and only the directory tree can be retrieved. It is 
> a corner case.
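For illustration, the race described in the quoted report can be sketched in a few lines. This is a minimal, hypothetical stand-in, not the actual HDFS-14527 patch: the real {{getMaxNodesPerRack}} lives in BlockPlacementPolicyDefault and reads the rack count from the cluster topology, while here {{numRacks}} is passed in directly to make the zero-rack case easy to exercise.

```java
// Minimal sketch (not the actual HDFS patch) of the failure mode: the rack
// count, like clusterMap.getNumOfLeaves(), can drop to zero between the
// caller's emptiness check and this computation.
public class MaxNodesPerRackSketch {

    // Simplified stand-in for BlockPlacementPolicyDefault#getMaxNodesPerRack.
    // Without the guard, numRacks == 0 makes the division throw
    // ArithmeticException ("/ by zero"), which would escape to the
    // ReplicationMonitor thread and terminate the NameNode.
    static int[] getMaxNodesPerRack(int totalNumOfReplicas, int numRacks) {
        if (numRacks == 0) {
            // Defensive guard in the spirit of the fix: return an empty
            // result instead of letting the runtime exception escape.
            return new int[] {0, 0};
        }
        int maxNodesPerRack = (totalNumOfReplicas - 1) / numRacks + 2;
        return new int[] {totalNumOfReplicas, maxNodesPerRack};
    }

    public static void main(String[] args) {
        // Normal case: 3 replicas across 2 racks.
        int[] ok = getMaxNodesPerRack(3, 2);
        System.out.println(ok[0] + " " + ok[1]); // 3 3
        // All DataNodes stopped: numRacks == 0 no longer throws.
        int[] empty = getMaxNodesPerRack(3, 0);
        System.out.println(empty[0] + " " + empty[1]); // 0 0
    }
}
```

The guard mirrors the early-return already present in {{chooseTarget}} ({{numOfReplicas == 0 || clusterMap.getNumOfLeaves() == 0}}), but placed where the division actually happens, so the window between the check and the computation can no longer crash the thread.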



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-06 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857931#comment-16857931
 ] 

He Xiaoqiao commented on HDFS-14527:


Thanks [~elgoiri], [~ayushtkn] for the reviews and commit.




[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-06 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857922#comment-16857922
 ] 

Hudson commented on HDFS-14527:
---

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16694 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/16694/])
HDFS-14527. Stop all DataNodes may result in NN terminate. Contributed 
(inigoiri: rev 944adc61b1830388d520d4052fc7eb6c7ba2790d)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* (add) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestRedundancyMonitor.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyRackFaultTolerant.java





[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-06 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857914#comment-16857914
 ] 

Íñigo Goiri commented on HDFS-14527:


Thanks [~hexiaoqiao] for the patch and [~ayushtkn] for the review.
Committed to trunk.




[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-06 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857910#comment-16857910
 ] 

Íñigo Goiri commented on HDFS-14527:


Committing to trunk soon.




[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-06 Thread Ayush Saxena (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857905#comment-16857905
 ] 

Ayush Saxena commented on HDFS-14527:
-

No more comments from my side.




[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-06 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857890#comment-16857890
 ] 

Íñigo Goiri commented on HDFS-14527:


+1 on  [^HDFS-14527.005.patch].
[~ayushtkn], any further comments?




[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-05 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857262#comment-16857262
 ] 

He Xiaoqiao commented on HDFS-14527:


Thanks [~elgoiri]. I checked the failed unit tests and ran 
{{TestDataNodeHotSwapVolumes}} and {{TestDFSZKFailoverController}} locally; both 
passed. {{TestWebHdfsTimeouts}} and {{TestBalancer}} hit timeout exceptions, 
which I do not think are related to this patch.




[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-05 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856852#comment-16856852
 ] 

Íñigo Goiri commented on HDFS-14527:


[^HDFS-14527.005.patch] looks good.
I think the unit test failures are unrelated; do you mind doing a quick check on them?

> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch, 
> HDFS-14527.003.patch, HDFS-14527.004.patch, HDFS-14527.005.patch
>
>
> If we stop all datanodes of cluster, BlockPlacementPolicyDefault#chooseTarget 
> may get ArithmeticException when calling #getMaxNodesPerRack, which throws 
> the runtime exception out to BlockManager's ReplicationMonitor thread and 
> then terminate the NN.
> The root cause is that BlockPlacementPolicyDefault#chooseTarget not hold the 
> global lock, and if all DataNodes are dead between 
> {{clusterMap.getNumberOfLeaves()}} and {{getMaxNodesPerRack}} then it meet 
> {{ArithmeticException}} while invoke {{getMaxNodesPerRack}}.
> {code:java}
>   private DatanodeStorageInfo[] chooseTarget(int numOfReplicas,
> Node writer,
> List<DatanodeStorageInfo> chosenStorage,
> boolean returnChosenNodes,
> Set<Node> excludedNodes,
> long blocksize,
> final BlockStoragePolicy storagePolicy,
> EnumSet<AddBlockFlag> addBlockFlags,
> EnumMap<StorageType, Integer> sTypes) {
> if (numOfReplicas == 0 || clusterMap.getNumOfLeaves()==0) {
>   return DatanodeStorageInfo.EMPTY_ARRAY;
> }
> ..
> int[] result = getMaxNodesPerRack(chosenStorage.size(), numOfReplicas);
> ..
> }
> {code}
> Some detailed log show as following.
> {code:java}
> 2019-05-31 12:29:21,803 ERROR 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception. 
> java.lang.ArithmeticException: / by zero
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getMaxNodesPerRack(BlockPlacementPolicyDefault.java:282)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:228)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:132)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:4533)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$1800(BlockManager.java:4493)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1954)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1830)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4453)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4388)
> at java.lang.Thread.run(Thread.java:745)
> 2019-05-31 12:29:21,805 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1
> {code}
> To be honest, this is not a serious bug and is not easily reproduced: if we 
> stop all DataNodes and keep only the NameNode alive, HDFS cannot serve data 
> normally and we can only browse the namespace. It is a corner case.
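The time-of-check/time-of-use race described above can be sketched outside Hadoop. The following is a hypothetical, simplified model (not the actual BlockPlacementPolicyDefault code): a guard checks the live-node count, the count drops to zero in the unprotected window, and the later division throws.

```java
// Simplified stand-in for the HDFS-14527 race (hypothetical model, not
// Hadoop code): the zero-leaves guard and the later division read the
// cluster size at different times, so a concurrent drop to zero slips
// between them and the division throws ArithmeticException.
import java.util.concurrent.atomic.AtomicInteger;

public class ChooseTargetRace {
    // Stand-in for clusterMap.getNumOfLeaves(): the live DataNode count.
    static final AtomicInteger liveNodes = new AtomicInteger();

    // Hypothetical analogue of chooseTarget -> getMaxNodesPerRack.
    // 'initialLive' seeds the node count; the method then simulates all
    // DataNodes dying inside the unprotected window.
    static int maxNodesPerRack(int totalReplicas, int initialLive) {
        liveNodes.set(initialLive);
        if (liveNodes.get() == 0) {        // time-of-check guard
            return 0;                      // nothing to place
        }
        liveNodes.set(0);                  // all DataNodes die in the window
        return totalReplicas / liveNodes.get(); // time-of-use: / by zero
    }

    public static void main(String[] args) {
        System.out.println(maxNodesPerRack(3, 0));  // guard fires, prints 0
        try {
            maxNodesPerRack(3, 3);
        } catch (ArithmeticException ae) {
            System.out.println("caught: " + ae.getMessage()); // caught: / by zero
        }
    }
}
```

In the real code the window is the span between the `clusterMap.getNumOfLeaves()` check and the division inside `getMaxNodesPerRack`, which is not covered by the global lock.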






[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-05 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856437#comment-16856437
 ] 

Hadoop QA commented on HDFS-14527:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
29s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 45s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 35s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}112m 52s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
46s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}184m 30s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.tools.TestDFSZKFailoverController |
|   | hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes |
|   | hadoop.hdfs.web.TestWebHdfsTimeouts |
|   | hadoop.hdfs.server.balancer.TestBalancer |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | HDFS-14527 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12970906/HDFS-14527.005.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux a2b211898a13 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / cd17cc2 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26900/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26900/testReport/ |
| Max. process+thread count | 3855 (vs. ulimit of 1) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| 

[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-04 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856309#comment-16856309
 ] 

He Xiaoqiao commented on HDFS-14527:


[^HDFS-14527.005.patch] fixes checkstyle and renames {{chooseTargetFuturn}} to 
{{chooseTargetFuture}}.
I checked the failed unit tests; they do not seem related to this issue.







[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-04 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856219#comment-16856219
 ] 

Íñigo Goiri commented on HDFS-14527:


Just one minor comment, it looks like {{chooseTargetFuturn}} should be 
{{chooseTargetFuture}}, right?







[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-04 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856181#comment-16856181
 ] 

Hadoop QA commented on HDFS-14527:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m  
3s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 29m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 37s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
13s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 44s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 2 new + 40 unchanged - 0 fixed = 42 total (was 40) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 12s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}128m 51s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
37s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}205m 45s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.TestReencryption |
|   | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.server.namenode.TestReencryptionWithKMS |
|   | hadoop.hdfs.server.namenode.TestPersistentStoragePolicySatisfier |
|   | hadoop.hdfs.TestDFSInotifyEventInputStreamKerberized |
|   | hadoop.hdfs.server.namenode.TestReconstructStripedBlocks |
|   | hadoop.hdfs.server.namenode.TestListCorruptFileBlocks |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=18.09.5 Server=18.09.5 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | HDFS-14527 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12970870/HDFS-14527.004.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux cb7207dc3cc6 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 
08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 580b639 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 

[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-04 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856027#comment-16856027
 ] 

He Xiaoqiao commented on HDFS-14527:


[~elgoiri], [~ayushtkn] Thanks for your quick reviews.
{quote}though the scenario seems quite difficult to happen at production{quote}
Yes, you are right; as mentioned in the description, it is not easily 
reproduced, but I did encounter this problem.
I just submitted another patch, [^HDFS-14527.004.patch], following all the 
suggestions above. Thanks again.







[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-04 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855975#comment-16855975
 ] 

Hadoop QA commented on HDFS-14527:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
33s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 17s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 53s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}107m 
45s{color} | {color:green} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
35s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}176m 10s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | HDFS-14527 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12970848/HDFS-14527.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 5a7d38f32f30 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 827a847 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26894/testReport/ |
| Max. process+thread count | 2937 (vs. ulimit of 1) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26894/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-04 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855923#comment-16855923
 ] 

Íñigo Goiri commented on HDFS-14527:


Just to clarify what [~ayushtkn] is proposing for the exception, instead of:
{code}
try {
  chooseTargetFuturn.get();
} catch (ArithmeticException ae) {
  fail("It meets RuntimeException since there are no DataNodes!");
}
{code}

It should be:
{code}
try {
  chooseTargetFuturn.get();
} catch (ExecutionException ee) {
  throw ee.getCause();
}
{code}
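The behavior being clarified here can be reproduced with a plain ExecutorService. This is a generic illustration (not the patch's test code): a RuntimeException thrown inside a Future task never reaches the caller directly; `get()` rethrows it wrapped in ExecutionException, which is why a catch block for ArithmeticException around `get()` can never fire.

```java
// Generic illustration: an exception thrown inside a Future task
// surfaces from get() only as ExecutionException, with the original
// exception attached as the cause.
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FutureExceptionDemo {
    // Runs the task on a worker thread and returns the wrapped cause
    // (or null if the task completed normally).
    static Throwable causeOf(Callable<Integer> task) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(task).get();
            return null;
        } catch (ExecutionException ee) {
            return ee.getCause();   // the original task exception
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // The division throws ArithmeticException inside the task...
        Throwable cause = causeOf(() -> 1 / 0);
        // ...and surfaces only as the ExecutionException's cause.
        System.out.println(cause); // java.lang.ArithmeticException: / by zero
    }
}
```

Rethrowing `ee.getCause()` therefore lets the test fail with the real exception and stack trace instead of swallowing it.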


> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch, 
> HDFS-14527.003.patch
>
>
> If we stop all datanodes of cluster, BlockPlacementPolicyDefault#chooseTarget 
> may get ArithmeticException when calling #getMaxNodesPerRack, which throws 
> the runtime exception out to BlockManager's ReplicationMonitor thread and 
> then terminate the NN.
> The root cause is that BlockPlacementPolicyDefault#chooseTarget not hold the 
> global lock, and if all DataNodes are dead between 
> {{clusterMap.getNumberOfLeaves()}} and {{getMaxNodesPerRack}} then it meet 
> {{ArithmeticException}} while invoke {{getMaxNodesPerRack}}.
> {code:java}
>   private DatanodeStorageInfo[] chooseTarget(int numOfReplicas,
> Node writer,
> List chosenStorage,
> boolean returnChosenNodes,
> Set excludedNodes,
> long blocksize,
> final BlockStoragePolicy storagePolicy,
> EnumSet addBlockFlags,
> EnumMap sTypes) {
> if (numOfReplicas == 0 || clusterMap.getNumOfLeaves()==0) {
>   return DatanodeStorageInfo.EMPTY_ARRAY;
> }
> ..
> int[] result = getMaxNodesPerRack(chosenStorage.size(), numOfReplicas);
> ..
> }
> {code}
> Some detailed log show as following.
> {code:java}
> 2019-05-31 12:29:21,803 ERROR 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception. 
> java.lang.ArithmeticException: / by zero
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getMaxNodesPerRack(BlockPlacementPolicyDefault.java:282)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:228)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:132)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:4533)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$1800(BlockManager.java:4493)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1954)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1830)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4453)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4388)
> at java.lang.Thread.run(Thread.java:745)
> 2019-05-31 12:29:21,805 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1
> {code}
> To be honest, this is not a serious bug and is not easy to reproduce: if we 
> stop all DataNodes and keep only the NameNode alive, HDFS cannot offer service 
> normally anyway and we can only list directories. It is a corner case.
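For illustration, the time-of-check/time-of-use gap described above can be modeled outside Hadoop. This is a simplified sketch, not the actual `BlockPlacementPolicyDefault` code; the class, parameter names, and clamping logic here are illustrative stand-ins for the real `getMaxNodesPerRack` arithmetic:

```java
// Simplified model of the race: the caller checks getNumOfLeaves() == 0
// first, but the cluster can empty out before the division below runs,
// so the divisor must be re-checked at the point of use.
public class MaxNodesPerRackSketch {

    // clusterSize and numRacks stand in for clusterMap.getNumOfLeaves()
    // and clusterMap.getNumOfRacks() at the (later) time of this call.
    public static int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas,
                                           int clusterSize, int numRacks) {
        int totalNumOfReplicas = numOfChosen + numOfReplicas;
        if (totalNumOfReplicas > clusterSize) {
            // Clamp the request to what the cluster can actually hold.
            numOfReplicas -= (totalNumOfReplicas - clusterSize);
            totalNumOfReplicas = clusterSize;
        }
        if (clusterSize == 0 || numRacks == 0) {
            // Guard added here; without it the next line divides by zero
            // once every DataNode (and hence every rack) has disappeared.
            return new int[]{0, 0};
        }
        int maxNodesPerRack = (totalNumOfReplicas - 1) / numRacks + 2;
        return new int[]{numOfReplicas, maxNodesPerRack};
    }
}
```

Without the zero guard, calling this with `clusterSize == 0` reproduces the `/ by zero` from the log above.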



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-04 Thread Ayush Saxena (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855903#comment-16855903
 ] 

Ayush Saxena commented on HDFS-14527:
-

Thanx [~hexiaoqiao] for the patch. The scenario seems quite unlikely to happen in 
production, but technically it still can.
 * If we pull the {{numRacks}} calculation above the {{clusterSize}} 
calculation, would the same logic work? The present approach is correct too; 
this is only if we want to keep the original logic.

{code:java}
  try {
chooseTargetFuturn.get();
  } catch (ArithmeticException ae) {
fail("It meets RuntimeException since there are no DataNodes!");
  }
{code}
 * This isn't working: it throws {{java.util.concurrent.ExecutionException}}, 
whose cause is {{ArithmeticException}}. Maybe you can let the exception surface 
in case of failure rather than catching it; that would be more helpful for 
anyone who ends up with this test failing.
 * For the test, do you need 6 DNs? I guess it can work with 2 too, which might 
save some time.
 * For the MiniDFSCluster, instead of having a finally block to close it at the 
end, you may use try-with-resources. Somewhat like this:

{code:java}
   try (MiniDFSCluster miniCluster = new MiniDFSCluster.Builder(conf)
       .racks(racks).hosts(hosts).numDataNodes(hosts.length).build()) {
{code}
 * The test class name could be a little more generic than the existing 
{{TestRedundancyMonitorChooseTarget}}, maybe something related to BPP, and add 
a javadoc on the test explaining the scenario, so that the test class can be 
reused by someone else in the future.

> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch, 
> HDFS-14527.003.patch
>
>

[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-04 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855822#comment-16855822
 ] 

He Xiaoqiao commented on HDFS-14527:


Thanks [~elgoiri] for your detailed reviews. Uploaded [^HDFS-14527.003.patch] to 
address the comments. Pending Jenkins. Further reviews are welcome. Thanks again.

> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch, 
> HDFS-14527.003.patch
>
>



[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-03 Thread JIRA


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854844#comment-16854844
 ] 

Íñigo Goiri commented on HDFS-14527:


Thanks [~hexiaoqiao] for  [^HDFS-14527.002.patch]:
* Can we get rid of the whitebox deprecation? It may need some refactoring.
* Can we fix the checkstyle?
* Can we use lambdas for the submit operations?
* Sleeping 500 ms is kind of arbitrary; is there any condition we can wait for?
* Not a big fan of catching an exception to do a fail. Why not let the full 
exception go through and fail the test?
* Do a direct import of GenericTestUtils.DelayAnswer.
* Add an overview of what the test is doing.
* Make the capitalization of the comments consistent.
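The "wait for a condition instead of a fixed sleep" idea can be sketched with a tiny helper. This is illustrative only; in Hadoop tests one would use the real GenericTestUtils.waitFor rather than this hypothetical class:

```java
import java.util.function.BooleanSupplier;

// Minimal poll-until-true helper in the spirit of Hadoop's
// GenericTestUtils.waitFor; names and exact behavior are illustrative,
// not the actual Hadoop test API.
public class WaitFor {

    // Polls `condition` every `intervalMillis` until it returns true or
    // `timeoutMillis` elapses; returns whether the condition was met.
    public static boolean waitFor(BooleanSupplier condition,
                                  long intervalMillis,
                                  long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;      // timed out without the condition holding
            }
            Thread.sleep(intervalMillis);
        }
        return true;               // stop as soon as the condition holds
    }
}
```

Compared with a fixed 500 ms sleep, this returns as soon as the condition is observed and fails deterministically at the timeout.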

> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch
>
>



[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-03 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854807#comment-16854807
 ] 

Hadoop QA commented on HDFS-14527:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
16s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 46s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
7s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  1m  7s{color} 
| {color:red} hadoop-hdfs-project_hadoop-hdfs generated 2 new + 475 unchanged - 
0 fixed = 477 total (was 475) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 46s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 3 new + 40 unchanged - 0 fixed = 43 total (was 40) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 44s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 85m 22s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}152m 23s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts |
|   | hadoop.hdfs.server.datanode.TestDataNodeLifeline |
|   | hadoop.hdfs.TestEncryptionZonesWithKMS |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | HDFS-14527 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12970688/HDFS-14527.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 1e1d58c81c9c 4.4.0-143-generic #169~14.04.2-Ubuntu SMP Wed Feb 
13 15:00:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 59719dc |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| javac | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26888/artifact/out/diff-compile-javac-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| checkstyle | 

[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-06-03 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854629#comment-16854629
 ] 

He Xiaoqiao commented on HDFS-14527:


[~elgoiri], [^HDFS-14527.002.patch] adds a unit test and fixes this issue. 
Please take a look.

> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch
>
>



[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-05-31 Thread JIRA


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853190#comment-16853190
 ] 

Íñigo Goiri commented on HDFS-14527:


Can we add a unit test reproducing the issue?

> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Attachments: HDFS-14527.001.patch
>
>



[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-05-31 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852911#comment-16852911
 ] 

Hadoop QA commented on HDFS-14527:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
24s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 50s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m  9s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 77m 56s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}131m 32s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | HDFS-14527 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12970452/HDFS-14527.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 8f2831b763a9 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 52128e3 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26877/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/26877/testReport/ |
| Max. process+thread count | 4872 (vs. ulimit of 1) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output |