[jira] [Commented] (HDFS-16456) EC: Decommission a rack with only one dn will fail when the rack number is equal to the replication

2022-02-15 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493011#comment-17493011
 ] 

Wei-Chiu Chuang commented on HDFS-16456:


[~tasanuma] [~ferhui] would you be interested in reviewing this?

> EC: Decommission a rack with only one dn will fail when the rack number is 
> equal to the replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Priority: Critical
> Attachments: HDFS-16456.001.patch
>
>
> In the scenario below, decommission will fail with the TOO_MANY_NODES_ON_RACK reason:
>  # Enable an EC policy, such as RS-6-3-1024k.
>  # The number of racks in the cluster is equal to the replication number (9).
>  # One rack has only one DN, and we decommission this DN.
> The root cause is in the 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, which 
> computes the limit maxNodesPerRack used when choosing targets. In this 
> scenario, maxNodesPerRack is 1, which means only one datanode can be chosen 
> per rack.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>     return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> Here, int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; is 
> evaluated with totalNumOfReplicas=9 and numOfRacks=9, so maxNodesPerRack = (9 - 1) / 9 + 1 = 1.
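> A quick runnable check of that formula with the numbers above (a standalone 
> illustration, not HDFS code):
> {code:java}
> public class MaxNodesPerRackExample {
>   public static void main(String[] args) {
>     int totalNumOfReplicas = 9; // RS-6-3-1024k writes 9 blocks per block group
>     int numOfRacks = 9;
>     // The same rounding-up formula as getMaxNodesPerRack():
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>     System.out.println(maxNodesPerRack); // prints 1
>   }
> }
> {code}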
> When we decommission a DN that is the only node in its rack, chooseOnce() in 
> BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws 
> NotEnoughReplicasException, but the exception is not caught, so it fails to 
> fall back to the chooseEvenlyFromRemainingRacks() function.
> During decommission, after targets are chosen, verifyBlockPlacement() returns 
> a total rack count that still includes the invalid rack, so 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, 
> which also causes the decommission to fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
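> Plugging the numbers from this scenario into that condition shows why the 
> check can never pass (a standalone illustration; it assumes only 8 racks 
> remain usable once the decommissioning rack's single DN is excluded):
> {code:java}
> public class PlacementCheckExample {
>   public static void main(String[] args) {
>     int currentRacks = 8;  // racks holding chosen replicas, excluding the invalid rack
>     int requiredRacks = 9; // numberOfReplicas for RS-6-3
>     int totalRacks = 9;    // clusterMap.getNumOfRacks() still counts the invalid rack
>     boolean satisfied = requiredRacks <= currentRacks || currentRacks >= totalRacks;
>     System.out.println(satisfied); // prints false, so decommission fails
>   }
> }
> {code}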
> According to the above description, we should make the following changes to 
> fix it (a sketch of change 2 follows the list):
>  # In startDecommission() or stopDecommission(), we should also update the 
> numOfRacks in class NetworkTopology. Otherwise choosing targets may fail 
> because maxNodesPerRack is too small, and even if choosing targets succeeds, 
> isPlacementPolicySatisfied() will still return false and the decommission will fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first 
> chooseOnce() call should also be put in try...catch, otherwise it will not 
> fall back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
>  # In chooseEvenlyFromRemainingRacks(), the statement numResultsOflastChoose = 
> results.size(); should be moved to after chooseOnce(), otherwise it will throw 
> lastException and make choosing targets fail.
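> A minimal sketch of change 2 (control flow only; the real method signatures 
> and parameters in BlockPlacementPolicyRackFaultTolerant are elided here, so 
> this is not the actual patch):
> {code:java}
> try {
>   // First attempt, limited to maxNodesPerRack nodes per rack.
>   chooseOnce(/* ... elided parameters ... */);
> } catch (NotEnoughReplicasException e) {
>   // Fall back instead of propagating, so target selection can still pick
>   // nodes from the remaining racks.
>   chooseEvenlyFromRemainingRacks(/* ... elided parameters ... */);
> }
> {code}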
>  
>  






[jira] [Created] (HDFS-16457) Make fs.getspaceused.classname reconfigurable

2022-02-15 Thread yanbin.zhang (Jira)
yanbin.zhang created HDFS-16457:
---

 Summary: Make fs.getspaceused.classname reconfigurable
 Key: HDFS-16457
 URL: https://issues.apache.org/jira/browse/HDFS-16457
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.3.0
Reporter: yanbin.zhang
Assignee: yanbin.zhang


Currently, if we want to switch fs.getspaceused.classname, we need to restart the 
NameNode. It would be convenient if we could switch it at runtime.
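For context, the property is currently a one-shot, startup-time setting; a minimal 
sketch of how it is chosen today (the DU implementation shown here is just an 
illustrative choice, not a recommendation):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.DU;
import org.apache.hadoop.fs.GetSpaceUsed;

public class SpaceUsedConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Read once when the daemon starts; changing it later currently has no
    // effect until a restart, which is what this issue wants to avoid.
    conf.setClass("fs.getspaceused.classname", DU.class, GetSpaceUsed.class);
    System.out.println(conf.get("fs.getspaceused.classname"));
  }
}
{code}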






[jira] [Work logged] (HDFS-16316) Improve DirectoryScanner: add regular file check related block

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16316?focusedWorklogId=728013&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-728013
 ]

ASF GitHub Bot logged work on HDFS-16316:
-

Author: ASF GitHub Bot
Created on: 16/Feb/22 02:50
Start Date: 16/Feb/22 02:50
Worklog Time Spent: 10m 
  Work Description: jianghuazhu commented on pull request #3861:
URL: https://github.com/apache/hadoop/pull/3861#issuecomment-1041040826


   Thanks for the suggestion, @jojochuang .
   I updated the unit tests again and also ran some tests.
   
   When I remove the fix, the newly added unit test fails, which is expected, 
and it does not affect the execution of the other unit tests.
   Here is an example of the test output with the fix removed:
   
![image](https://user-images.githubusercontent.com/6416939/154185727-620eacac-5b4e-4b49-b6f2-1e612017cc35.png)
   
   Here is an example during normal testing:
   
![image](https://user-images.githubusercontent.com/6416939/154186788-f53338e6-2a40-46b1-95c9-59282fa7616b.png)
   
   




Issue Time Tracking
---

Worklog Id: (was: 728013)
Time Spent: 3h 50m  (was: 3h 40m)

> Improve DirectoryScanner: add regular file check related block
> --
>
> Key: HDFS-16316
> URL: https://issues.apache.org/jira/browse/HDFS-16316
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Something unusual happened in our online environment.
> A DataNode is configured with 11 disks (${dfs.datanode.data.dir}). The used 
> capacity is calculated correctly for 10 of the disks, but the value calculated 
> for the remaining disk is much larger, which is very strange.
> This is the live view on the NameNode:
>  !screenshot-1.png! 
> This is the live view on the DataNode:
>  !screenshot-2.png! 
> This is the view on Linux:
>  !screenshot-3.png! 
> There is a big gap here for '/mnt/dfs/11/data'. This situation should not be 
> allowed to happen.
> I found that there are some abnormal block files.
> There are wrong blk_.meta files in some subdir directories, which cause the 
> abnormal space calculation.
> Here are some of the abnormal block files:
>  !screenshot-4.png! 
> Such files should not be treated as normal blocks. They should be actively 
> identified and filtered out, which is good for cluster stability.






[jira] [Updated] (HDFS-16396) Reconfig slow peer parameters for datanode

2022-02-15 Thread Takanobu Asanuma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takanobu Asanuma updated HDFS-16396:

Fix Version/s: 3.3.3

> Reconfig slow peer parameters for datanode
> --
>
> Key: HDFS-16396
> URL: https://issues.apache.org/jira/browse/HDFS-16396
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> In large clusters, rolling-restarting datanodes takes a long time. We can make 
> the slow peer and slow disk parameters in the datanode reconfigurable to 
> facilitate cluster operation and maintenance.






[jira] [Updated] (HDFS-15745) Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES configurable

2022-02-15 Thread Takanobu Asanuma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takanobu Asanuma updated HDFS-15745:

Fix Version/s: 3.3.3

> Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES 
> configurable
> --
>
> Key: HDFS-15745
> URL: https://issues.apache.org/jira/browse/HDFS-15745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haibin Huang
>Assignee: Haibin Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
> Attachments: HDFS-15745-001.patch, HDFS-15745-002.patch, 
> HDFS-15745-003.patch, HDFS-15745-branch-3.1.001.patch, 
> HDFS-15745-branch-3.2.001.patch, HDFS-15745-branch-3.3.001.patch, 
> image-2020-12-22-17-00-50-796.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When I enabled DataNodePeerMetrics to find slow peers in the cluster, I found 
> many reported slow peers even though ReportingNodes' averageDelay was very low, 
> and those slow peer nodes were actually normal. I think the reason so many slow 
> peers are generated is that the value of DataNodePeerMetrics#LOW_THRESHOLD_MS is 
> too small (only 5ms) and it is not configurable. The default slow IO warning log 
> threshold is 300ms, i.e. 
> DFSConfigKeys.DFS_DATANODE_SLOW_IO_WARNING_THRESHOLD_DEFAULT = 300, so 
> DataNodePeerMetrics#LOW_THRESHOLD_MS should not be less than 300ms; otherwise 
> the NameNode will get a lot of invalid slow peer information.
> !image-2020-12-22-17-00-50-796.png!
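> As a small illustration of the constraint being argued for (not the actual 
> patch; the variable names are only for this example):
> {code:java}
> public class SlowPeerThresholdExample {
>   public static void main(String[] args) {
>     long slowIoWarningThresholdMs = 300L; // DFS_DATANODE_SLOW_IO_WARNING_THRESHOLD_DEFAULT
>     long lowThresholdMs = 5L;             // current hard-coded DataNodePeerMetrics#LOW_THRESHOLD_MS
>     // The argument above: the effective threshold should not fall below the
>     // slow-IO warning threshold, or the NameNode is flooded with false positives.
>     long effectiveThresholdMs = Math.max(lowThresholdMs, slowIoWarningThresholdMs);
>     System.out.println(effectiveThresholdMs); // 300
>   }
> }
> {code}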






[jira] [Updated] (HDFS-10650) DFSClient#mkdirs and DFSClient#primitiveMkdir should use default directory permission

2022-02-15 Thread Konstantin Shvachko (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-10650:
---
Fix Version/s: 2.10.2

Merged this into branch-2.10.
Updated fix version.

> DFSClient#mkdirs and DFSClient#primitiveMkdir should use default directory 
> permission
> -
>
> Key: HDFS-10650
> URL: https://issues.apache.org/jira/browse/HDFS-10650
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Minor
> Fix For: 3.0.0-alpha1, 2.10.2
>
> Attachments: HDFS-10650.001.patch, HDFS-10650.002.patch
>
>
> These two DFSClient methods should use the default directory permission when 
> creating a directory.
> {code:java}
>   public boolean mkdirs(String src, FsPermission permission,
>       boolean createParent) throws IOException {
>     if (permission == null) {
>       permission = FsPermission.getDefault();
>     }
> {code}
> {code:java}
>   public boolean primitiveMkdir(String src, FsPermission absPermission,
>       boolean createParent) throws IOException {
>     checkOpen();
>     if (absPermission == null) {
>       absPermission =
>           FsPermission.getDefault().applyUMask(dfsClientConf.uMask);
>     }
> {code}
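> Presumably the fix, per the title, is to fall back to the directory default 
> instead, along these lines (a sketch, not necessarily the committed patch):
> {code:java}
> if (permission == null) {
>   permission = FsPermission.getDirDefault();
> }
> {code}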






[jira] [Work logged] (HDFS-15745) Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES configurable

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15745?focusedWorklogId=727965&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727965
 ]

ASF GitHub Bot logged work on HDFS-15745:
-

Author: ASF GitHub Bot
Created on: 16/Feb/22 00:42
Start Date: 16/Feb/22 00:42
Worklog Time Spent: 10m 
  Work Description: tasanuma merged pull request #3992:
URL: https://github.com/apache/hadoop/pull/3992


   




Issue Time Tracking
---

Worklog Id: (was: 727965)
Time Spent: 50m  (was: 40m)

> Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES 
> configurable
> --
>
> Key: HDFS-15745
> URL: https://issues.apache.org/jira/browse/HDFS-15745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haibin Huang
>Assignee: Haibin Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-15745-001.patch, HDFS-15745-002.patch, 
> HDFS-15745-003.patch, HDFS-15745-branch-3.1.001.patch, 
> HDFS-15745-branch-3.2.001.patch, HDFS-15745-branch-3.3.001.patch, 
> image-2020-12-22-17-00-50-796.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When I enabled DataNodePeerMetrics to find slow peers in the cluster, I found 
> many reported slow peers even though ReportingNodes' averageDelay was very low, 
> and those slow peer nodes were actually normal. I think the reason so many slow 
> peers are generated is that the value of DataNodePeerMetrics#LOW_THRESHOLD_MS is 
> too small (only 5ms) and it is not configurable. The default slow IO warning log 
> threshold is 300ms, i.e. 
> DFSConfigKeys.DFS_DATANODE_SLOW_IO_WARNING_THRESHOLD_DEFAULT = 300, so 
> DataNodePeerMetrics#LOW_THRESHOLD_MS should not be less than 300ms; otherwise 
> the NameNode will get a lot of invalid slow peer information.
> !image-2020-12-22-17-00-50-796.png!






[jira] [Work logged] (HDFS-15745) Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES configurable

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15745?focusedWorklogId=727964&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727964
 ]

ASF GitHub Bot logged work on HDFS-15745:
-

Author: ASF GitHub Bot
Created on: 16/Feb/22 00:41
Start Date: 16/Feb/22 00:41
Worklog Time Spent: 10m 
  Work Description: tasanuma commented on pull request #3992:
URL: https://github.com/apache/hadoop/pull/3992#issuecomment-1040943937


   The failed tests seem not to be related. I'm merging it.




Issue Time Tracking
---

Worklog Id: (was: 727964)
Time Spent: 40m  (was: 0.5h)

> Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES 
> configurable
> --
>
> Key: HDFS-15745
> URL: https://issues.apache.org/jira/browse/HDFS-15745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haibin Huang
>Assignee: Haibin Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-15745-001.patch, HDFS-15745-002.patch, 
> HDFS-15745-003.patch, HDFS-15745-branch-3.1.001.patch, 
> HDFS-15745-branch-3.2.001.patch, HDFS-15745-branch-3.3.001.patch, 
> image-2020-12-22-17-00-50-796.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When I enabled DataNodePeerMetrics to find slow peers in the cluster, I found 
> many reported slow peers even though ReportingNodes' averageDelay was very low, 
> and those slow peer nodes were actually normal. I think the reason so many slow 
> peers are generated is that the value of DataNodePeerMetrics#LOW_THRESHOLD_MS is 
> too small (only 5ms) and it is not configurable. The default slow IO warning log 
> threshold is 300ms, i.e. 
> DFSConfigKeys.DFS_DATANODE_SLOW_IO_WARNING_THRESHOLD_DEFAULT = 300, so 
> DataNodePeerMetrics#LOW_THRESHOLD_MS should not be less than 300ms; otherwise 
> the NameNode will get a lot of invalid slow peer information.
> !image-2020-12-22-17-00-50-796.png!






[jira] [Work logged] (HDFS-16397) Reconfig slow disk parameters for datanode

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16397?focusedWorklogId=727951&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727951
 ]

ASF GitHub Bot logged work on HDFS-16397:
-

Author: ASF GitHub Bot
Created on: 16/Feb/22 00:16
Start Date: 16/Feb/22 00:16
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3828:
URL: https://github.com/apache/hadoop/pull/3828#issuecomment-1040929250


   Hi @tasanuma @ayushtkn @Hexiaoqiao , could you please review this PR? Thanks.




Issue Time Tracking
---

Worklog Id: (was: 727951)
Time Spent: 1h  (was: 50m)

> Reconfig slow disk parameters for datanode
> --
>
> Key: HDFS-16397
> URL: https://issues.apache.org/jira/browse/HDFS-16397
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In large clusters, rolling-restarting datanodes takes a long time. We can make 
> the slow peer and slow disk parameters in the datanode reconfigurable to 
> facilitate cluster operation and maintenance.






[jira] [Work logged] (HDFS-16316) Improve DirectoryScanner: add regular file check related block

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16316?focusedWorklogId=727862&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727862
 ]

ASF GitHub Bot logged work on HDFS-16316:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 21:56
Start Date: 15/Feb/22 21:56
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3861:
URL: https://github.com/apache/hadoop/pull/3861#issuecomment-1040835489


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 43s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 48s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  25m 22s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  26m 32s |  |  trunk passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |  23m  0s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   4m  8s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 21s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 30s |  |  trunk passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 36s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   5m 57s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  24m  8s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 29s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 23s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  23m 11s |  |  the patch passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |  23m 11s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  21m 53s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |  21m 53s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   3m 33s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 22s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 18s |  |  the patch passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 31s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   6m 34s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  24m  1s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  17m 55s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  unit  | 230m 38s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  7s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 471m 50s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3861/7/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3861 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux dbbf79d9ac98 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / f5e27e408d9aa8f1e563d139d35a001375e19f7f |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3861/7/testReport/ |
   | Max. process+thread count | 3544 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common 
hadoop-hdfs-project/hadoop-hdfs U: . 

[jira] [Work logged] (HDFS-16397) Reconfig slow disk parameters for datanode

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16397?focusedWorklogId=727763&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727763
 ]

ASF GitHub Bot logged work on HDFS-16397:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 19:35
Start Date: 15/Feb/22 19:35
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3828:
URL: https://github.com/apache/hadoop/pull/3828#issuecomment-1040706117


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 41s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m 18s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 28s |  |  trunk passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   1m 20s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   1m  4s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 30s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  4s |  |  trunk passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 34s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 19s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  22m 32s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 16s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 21s |  |  the patch passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 11s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 51s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3828/5/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 3 new + 136 unchanged 
- 2 fixed = 139 total (was 138)  |
   | +1 :green_heart: |  mvnsite  |   1m 17s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 50s |  |  the patch passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 13s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m  6s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  | 226m  8s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 47s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 325m 11s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3828/5/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3828 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux dcca2c281b61 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 
23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 10536590e771f077d4ffdbfb9fe92112fc40254e |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3828/5/testReport/ |
   | Max. process+thread count | 3263 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 

[jira] [Work logged] (HDFS-15745) Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES configurable

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15745?focusedWorklogId=727628&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727628
 ]

ASF GitHub Bot logged work on HDFS-15745:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 19:06
Start Date: 15/Feb/22 19:06
Worklog Time Spent: 10m 
  Work Description: tasanuma opened a new pull request #3992:
URL: https://github.com/apache/hadoop/pull/3992


   
   
   ### Description of PR
   
   HDFS-15745. Make DataNodePeerMetrics#LOW_THRESHOLD_MS and 
MIN_OUTLIER_DETECTION_NODES configurable.
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [x] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




Issue Time Tracking
---

Worklog Id: (was: 727628)
Time Spent: 0.5h  (was: 20m)

> Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES 
> configurable
> --
>
> Key: HDFS-15745
> URL: https://issues.apache.org/jira/browse/HDFS-15745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haibin Huang
>Assignee: Haibin Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-15745-001.patch, HDFS-15745-002.patch, 
> HDFS-15745-003.patch, HDFS-15745-branch-3.1.001.patch, 
> HDFS-15745-branch-3.2.001.patch, HDFS-15745-branch-3.3.001.patch, 
> image-2020-12-22-17-00-50-796.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When I enabled DataNodePeerMetrics to find slow peers in the cluster, I found 
> many reported slow peers even though ReportingNodes' averageDelay was very low, 
> and those slow peer nodes were actually normal. I think the reason so many slow 
> peers are generated is that the value of DataNodePeerMetrics#LOW_THRESHOLD_MS is 
> too small (only 5ms) and it is not configurable. The default slow IO warning log 
> threshold is 300ms, i.e. 
> DFSConfigKeys.DFS_DATANODE_SLOW_IO_WARNING_THRESHOLD_DEFAULT = 300, so 
> DataNodePeerMetrics#LOW_THRESHOLD_MS should not be less than 300ms; otherwise 
> the NameNode will get a lot of invalid slow peer information.
> !image-2020-12-22-17-00-50-796.png!






[jira] [Work logged] (HDFS-16440) RBF: Support router get HAServiceStatus with Lifeline RPC address

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16440?focusedWorklogId=727614&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727614
 ]

ASF GitHub Bot logged work on HDFS-16440:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 19:05
Start Date: 15/Feb/22 19:05
Worklog Time Spent: 10m 
  Work Description: goiri merged pull request #3971:
URL: https://github.com/apache/hadoop/pull/3971


   




Issue Time Tracking
---

Worklog Id: (was: 727614)
Time Spent: 2h 10m  (was: 2h)

> RBF: Support router get HAServiceStatus with Lifeline RPC address
> -
>
> Key: HDFS-16440
> URL: https://issues.apache.org/jira/browse/HDFS-16440
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: YulongZ
>Assignee: YulongZ
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16440.001.patch, HDFS-16440.003.patch, 
> HDFS-16440.004.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> NamenodeHeartbeatService gets the HAServiceStatus using 
> NNHAServiceTarget.getProxy. When we set a dedicated 
> dfs.namenode.lifeline.rpc-address, NamenodeHeartbeatService may get the 
> HAServiceStatus using NNHAServiceTarget.getHealthMonitorProxy instead.






[jira] [Work logged] (HDFS-15745) Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES configurable

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15745?focusedWorklogId=727611&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727611
 ]

ASF GitHub Bot logged work on HDFS-15745:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 19:05
Start Date: 15/Feb/22 19:05
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3992:
URL: https://github.com/apache/hadoop/pull/3992#issuecomment-1040638017


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   9m 48s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 37s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   1m 13s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 49s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 19s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   1m 25s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   3m 14s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  27m  9s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  6s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   1m  6s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 41s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  xml  |   0m  1s |  |  The patch has no ill-formed XML 
file.  |
   | +1 :green_heart: |  javadoc  |   1m 16s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   3m 16s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  26m 59s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 210m 34s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3992/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 37s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 322m 44s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewer |
   |   | hadoop.hdfs.TestRollingUpgrade |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3992/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3992 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell xml |
   | uname | Linux 84d43b1571d0 4.15.0-163-generic #171-Ubuntu SMP Fri Nov 5 
11:55:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 81aa7a942b0af7d854b94431e34dd731bdb343c7 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3992/1/testReport/ |
   | Max. process+thread count | 2163 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3992/1/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   



[jira] [Work logged] (HDFS-16396) Reconfig slow peer parameters for datanode

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16396?focusedWorklogId=727563&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727563
 ]

ASF GitHub Bot logged work on HDFS-16396:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 19:01
Start Date: 15/Feb/22 19:01
Worklog Time Spent: 10m 
  Work Description: tasanuma merged pull request #3827:
URL: https://github.com/apache/hadoop/pull/3827


   




Issue Time Tracking
---

Worklog Id: (was: 727563)
Time Spent: 5h 10m  (was: 5h)

> Reconfig slow peer parameters for datanode
> --
>
> Key: HDFS-16396
> URL: https://issues.apache.org/jira/browse/HDFS-16396
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> In large clusters, rolling-restarting datanodes takes a long time. We can make 
> the slow peer and slow disk parameters in the datanode reconfigurable to 
> facilitate cluster operation and maintenance.






[jira] [Work logged] (HDFS-16316) Improve DirectoryScanner: add regular file check related block

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16316?focusedWorklogId=727536&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727536
 ]

ASF GitHub Bot logged work on HDFS-16316:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:59
Start Date: 15/Feb/22 18:59
Worklog Time Spent: 10m 
  Work Description: jianghuazhu commented on a change in pull request #3861:
URL: https://github.com/apache/hadoop/pull/3861#discussion_r806465949



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDirectoryScanner.java
##
@@ -507,6 +509,71 @@ public void testDeleteBlockOnTransientStorage() throws 
Exception {
 }
   }
 
+  @Test(timeout = 60)
+  public void testRegularBlock() throws Exception {
+// add a logger stream to check what has printed to log
+ByteArrayOutputStream loggerStream = new ByteArrayOutputStream();

Review comment:
   Yes, it was my mistake.






Issue Time Tracking
---

Worklog Id: (was: 727536)
Time Spent: 3.5h  (was: 3h 20m)

> Improve DirectoryScanner: add regular file check related block
> --
>
> Key: HDFS-16316
> URL: https://issues.apache.org/jira/browse/HDFS-16316
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Something unusual happened in our online environment.
> A DataNode is configured with 11 disks (${dfs.datanode.data.dir}). The used 
> capacity is calculated correctly for 10 of the disks, but the value calculated 
> for the remaining disk is much larger, which is very strange.
> This is the live view on the NameNode:
>  !screenshot-1.png! 
> This is the live view on the DataNode:
>  !screenshot-2.png! 
> This is the view on Linux:
>  !screenshot-3.png! 
> There is a big gap here for '/mnt/dfs/11/data'. This situation should not be 
> allowed to happen.
> I found that there are some abnormal block files.
> There are wrong blk_.meta files in some subdir directories, which cause the 
> abnormal space calculation.
> Here are some of the abnormal block files:
>  !screenshot-4.png! 
> Such files should not be treated as normal blocks. They should be actively 
> identified and filtered out, which is good for cluster stability.






[jira] [Work logged] (HDFS-16316) Improve DirectoryScanner: add regular file check related block

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16316?focusedWorklogId=727516&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727516
 ]

ASF GitHub Bot logged work on HDFS-16316:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:57
Start Date: 15/Feb/22 18:57
Worklog Time Spent: 10m 
  Work Description: jianghuazhu commented on pull request #3861:
URL: https://github.com/apache/hadoop/pull/3861#issuecomment-1039826045








Issue Time Tracking
---

Worklog Id: (was: 727516)
Time Spent: 3h 20m  (was: 3h 10m)

> Improve DirectoryScanner: add regular file check related block
> --
>
> Key: HDFS-16316
> URL: https://issues.apache.org/jira/browse/HDFS-16316
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Something unusual happened in our online environment.
> A DataNode is configured with 11 disks (${dfs.datanode.data.dir}). The used 
> capacity is calculated correctly for 10 of the disks, but the value calculated 
> for the remaining disk is much larger, which is very strange.
> This is the live view on the NameNode:
>  !screenshot-1.png! 
> This is the live view on the DataNode:
>  !screenshot-2.png! 
> This is the view on Linux:
>  !screenshot-3.png! 
> There is a big gap here for '/mnt/dfs/11/data'. This situation should not be 
> allowed to happen.
> I found that there are some abnormal block files.
> There are wrong blk_.meta files in some subdir directories, which cause the 
> abnormal space calculation.
> Here are some of the abnormal block files:
>  !screenshot-4.png! 
> Such files should not be treated as normal blocks. They should be actively 
> identified and filtered out, which is good for cluster stability.






[jira] [Work logged] (HDFS-16455) RBF: Router should explicitly specify the value of `jute.maxbuffer` in hadoop configuration files like core-site.xml

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16455?focusedWorklogId=727472&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727472
 ]

ASF GitHub Bot logged work on HDFS-16455:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:54
Start Date: 15/Feb/22 18:54
Worklog Time Spent: 10m 
  Work Description: Neilxzn commented on a change in pull request #3983:
URL: https://github.com/apache/hadoop/pull/3983#discussion_r806389802



##
File path: 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/token/delegation/ZKDelegationTokenSecretManager.java
##
@@ -199,6 +202,10 @@ public ZKDelegationTokenSecretManager(Configuration conf) {
 ZK_DTSM_ZK_SESSION_TIMEOUT_DEFAULT);
 int numRetries =
 conf.getInt(ZK_DTSM_ZK_NUM_RETRIES, 
ZK_DTSM_ZK_NUM_RETRIES_DEFAULT);
+String juteMaxBuffer =
+conf.get(ZK_DTSM_ZK_JUTE_MAXBUFFER, 
ZK_DTSM_ZK_JUTE_MAXBUFFER_DEFAULT);

Review comment:
   Thank you for your review. I will fix it.






Issue Time Tracking
---

Worklog Id: (was: 727472)
Time Spent: 1h 50m  (was: 1h 40m)

> RBF: Router should explicitly specify the value of `jute.maxbuffer` in hadoop 
> configuration files like core-site.xml
> 
>
> Key: HDFS-16455
> URL: https://issues.apache.org/jira/browse/HDFS-16455
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Based on the current design for delegation tokens in a secure Router, all 
> tokens are stored and updated in ZooKeeper via ZKDelegationTokenSecretManager.
> But the default value of the system property `jute.maxbuffer` is just 4MB; if 
> the Router stores too many tokens in ZooKeeper, it will throw an IOException 
> (`Packet lenxx is out of range`) and all Routers will crash.
>  
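> For reference, the ZooKeeper client reads this limit from a JVM system 
> property, so one way to raise it looks roughly like this (illustrative only; 
> per the title, the proposal is to drive it from Hadoop configuration such as 
> core-site.xml):
> {code:java}
> // jute.maxbuffer is a ZooKeeper client system property, in bytes.
> // Raising it to 8 MB before the ZooKeeper client is created:
> System.setProperty("jute.maxbuffer", String.valueOf(8 * 1024 * 1024));
> {code}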
> In our cluster,  Routers crashed because of it. The crash logs are below 
> {code:java}
> 2022-02-09 02:15:51,607 INFO 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  Token renewal for identifier: (token for xxx: HDFS_DELEGATION_TOKEN 
> owner=xxx/scheduler, renewer=hadoop, realUser=, issueDate=1644344146305, 
> maxDate=1644948946305, sequenceNumbe
> r=27136070, masterKeyId=1107); total currentTokens 279548  2022-02-09 
> 02:16:07,632 WARN org.apache.zookeeper.ClientCnxn: Session 0x1000172775a0012 
> for server zkurl:2181, unexpected e
> rror, closing socket connection and attempting reconnect
> java.io.IOException: Packet len4194553 is out of range!
> at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
> 2022-02-09 02:16:07,733 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 1254 on default port 9001, call Call#144 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getDelegationToken from 
> ip:46534
> java.lang.RuntimeException: Could not increment shared counter !!
> at 
> org.apache.hadoop.security.token.delegation.ZKDelegationTokenSecretManager.incrementDelegationTokenSeqNum(ZKDelegationTokenSecretManager.java:582)
>  {code}
> When we restart a Router, it crashed again
> {code:java}
> 2022-02-09 03:14:17,308 INFO 
> org.apache.hadoop.security.token.delegation.ZKDelegationTokenSecretManager: 
> Starting to load key cache.
> 2022-02-09 03:14:17,310 INFO 
> org.apache.hadoop.security.token.delegation.ZKDelegationTokenSecretManager: 
> Loaded key cache.
> 2022-02-09 03:14:32,930 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x205584be35b0001 for server zkurl:2181, unexpected
> error, closing socket connection and attempting reconnect
> java.io.IOException: Packet len4194478 is out of range!
> at 
> org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at 
> 

[jira] [Work logged] (HDFS-16440) RBF: Support router get HAServiceStatus with Lifeline RPC address

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16440?focusedWorklogId=727461&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727461
 ]

ASF GitHub Bot logged work on HDFS-16440:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:53
Start Date: 15/Feb/22 18:53
Worklog Time Spent: 10m 
  Work Description: goiri commented on pull request #3971:
URL: https://github.com/apache/hadoop/pull/3971#issuecomment-1039354689


   It would be nice to have a full Yetus run, not sure what happened with the 
previous one.




Issue Time Tracking
---

Worklog Id: (was: 727461)
Time Spent: 2h  (was: 1h 50m)

> RBF: Support router get HAServiceStatus with Lifeline RPC address
> -
>
> Key: HDFS-16440
> URL: https://issues.apache.org/jira/browse/HDFS-16440
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: YulongZ
>Assignee: YulongZ
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16440.001.patch, HDFS-16440.003.patch, 
> HDFS-16440.004.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> NamenodeHeartbeatService gets the HAServiceStatus using 
> NNHAServiceTarget.getProxy. When we set a dedicated 
> dfs.namenode.lifeline.rpc-address, NamenodeHeartbeatService may get the 
> HAServiceStatus using NNHAServiceTarget.getHealthMonitorProxy instead.






[jira] [Work logged] (HDFS-16316) Improve DirectoryScanner: add regular file check related block

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16316?focusedWorklogId=727458&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727458
 ]

ASF GitHub Bot logged work on HDFS-16316:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:52
Start Date: 15/Feb/22 18:52
Worklog Time Spent: 10m 
  Work Description: jianghuazhu removed a comment on pull request #3861:
URL: https://github.com/apache/hadoop/pull/3861#issuecomment-1039826045


   Here are some examples from an online cluster.
   We constructed a block device file, for example:
   
![image](https://user-images.githubusercontent.com/6416939/153989107-901a87e4-4b1c-44f4-a654-225ce495ede1.png)
   
   This file is non-standard.
   This kind of file is found when DirectoryScanner is working.
   log:
   `
   2022-02-15 11:24:10,286 WARN 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
Block:1073741828 is not a regular file.
   `
   `
   2022-02-15 11:24:10,286 WARN 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Reporting 
the block blk_1073741828_0 as corrupt due to length mismatch
   `
   
   Then the DataNode tells the NameNode that there are some unqualified blocks 
through NameNodeRpcServer#reportBadBlocks(). After the NameNode gets the data, 
it processes it further.
   After a period of time, the DataNode will automatically clean up these 
unqualified replicas.
   
![image](https://user-images.githubusercontent.com/6416939/153989296-65e0230c-031c-4fd0-ace1-d247f15791b3.png)
   
   Can you help review this pr again, @jojochuang .
   Thank you so much.
   
   




Issue Time Tracking
---

Worklog Id: (was: 727458)
Time Spent: 3h 10m  (was: 3h)

> Improve DirectoryScanner: add regular file check related block
> --
>
> Key: HDFS-16316
> URL: https://issues.apache.org/jira/browse/HDFS-16316
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Something unusual happened in our online environment.
> A DataNode is configured with 11 disks (${dfs.datanode.data.dir}). The used 
> capacity is calculated correctly for 10 of the disks, but the value calculated 
> for the remaining disk is much larger, which is very strange.
> This is the live view on the NameNode:
>  !screenshot-1.png! 
> This is the live view on the DataNode:
>  !screenshot-2.png! 
> This is the view on Linux:
>  !screenshot-3.png! 
> There is a big gap here for '/mnt/dfs/11/data'. This situation should not be 
> allowed to happen.
> I found that there are some abnormal block files.
> There are wrong blk_.meta files in some subdir directories, which cause the 
> abnormal space calculation.
> Here are some of the abnormal block files:
>  !screenshot-4.png! 
> Such files should not be treated as normal blocks. They should be actively 
> identified and filtered out, which is good for cluster stability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16396) Reconfig slow peer parameters for datanode

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16396?focusedWorklogId=727436&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727436
 ]

ASF GitHub Bot logged work on HDFS-16396:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:50
Start Date: 15/Feb/22 18:50
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3827:
URL: https://github.com/apache/hadoop/pull/3827#issuecomment-1039826206


   Thanks @tasanuma and @ayushtkn for the review and confirming this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 727436)
Time Spent: 5h  (was: 4h 50m)

> Reconfig slow peer parameters for datanode
> --
>
> Key: HDFS-16396
> URL: https://issues.apache.org/jira/browse/HDFS-16396
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> In large clusters, a rolling restart of datanodes takes a long time. We can 
> make the slow peer and slow disk parameters in the datanode reconfigurable to 
> facilitate cluster operation and maintenance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16316) Improve DirectoryScanner: add regular file check related block

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16316?focusedWorklogId=727372&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727372
 ]

ASF GitHub Bot logged work on HDFS-16316:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:44
Start Date: 15/Feb/22 18:44
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on a change in pull request #3861:
URL: https://github.com/apache/hadoop/pull/3861#discussion_r806448723



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDirectoryScanner.java
##
@@ -507,6 +509,71 @@ public void testDeleteBlockOnTransientStorage() throws 
Exception {
 }
   }
 
+  @Test(timeout = 60)
+  public void testRegularBlock() throws Exception {
+// add a logger stream to check what has printed to log
+ByteArrayOutputStream loggerStream = new ByteArrayOutputStream();

Review comment:
   Can you use the Hadoop utility class LogCapturer 
https://github.com/apache/hadoop/blob/6342d5e523941622a140fd877f06e9b59f48c48b/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java#L533
 for this purpose?
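
   For reference, a rough sketch of the LogCapturer pattern suggested above 
(the logger name and assertion are illustrative; method names follow the 
GenericTestUtils class linked above):

```java
// Rough sketch of the suggested LogCapturer usage (illustrative only).
import static org.junit.Assert.assertTrue;

import org.apache.hadoop.test.GenericTestUtils;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogCapturerSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger("org.apache.hadoop.hdfs.server.datanode");

  @Test
  public void testWarnIsCaptured() {
    GenericTestUtils.LogCapturer logs =
        GenericTestUtils.LogCapturer.captureLogs(LOG);
    try {
      // Stand-in for the code path under test that emits the warning.
      LOG.warn("Block:1073741828 is not a regular file.");
      assertTrue(logs.getOutput().contains("is not a regular file"));
    } finally {
      logs.stopCapturing();
    }
  }
}
```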




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 727372)
Time Spent: 3h  (was: 2h 50m)

> Improve DirectoryScanner: add regular file check related block
> --
>
> Key: HDFS-16316
> URL: https://issues.apache.org/jira/browse/HDFS-16316
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Something unusual happened in the online environment.
> The DataNode is configured with 11 disks (${dfs.datanode.data.dir}). The used 
> capacity calculated for 10 of the disks is normal, but the value calculated 
> for the remaining disk is much larger, which is very strange.
> This is the live view on the NameNode:
>  !screenshot-1.png! 
> This is the live view on the DataNode:
>  !screenshot-2.png! 
> We can look at the view on Linux:
>  !screenshot-3.png! 
> There is a big gap here regarding '/mnt/dfs/11/data'. This situation should 
> be prohibited from happening.
> I found that there are some abnormal block files.
> There are wrong blk_.meta files in some subdir directories, causing the 
> abnormal space calculation.
> Here are some abnormal block files:
>  !screenshot-4.png! 
> Such files should not be used as normal blocks. They should be actively 
> identified and filtered, which is good for cluster stability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16455) RBF: Router should explicitly specify the value of `jute.maxbuffer` in hadoop configuration files like core-site.xml

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16455?focusedWorklogId=727325&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727325
 ]

ASF GitHub Bot logged work on HDFS-16455:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:40
Start Date: 15/Feb/22 18:40
Worklog Time Spent: 10m 
  Work Description: goiri commented on a change in pull request #3983:
URL: https://github.com/apache/hadoop/pull/3983#discussion_r806077863



##
File path: 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/token/delegation/ZKDelegationTokenSecretManager.java
##
@@ -98,6 +98,8 @@
   + "kerberos.keytab";
   public static final String ZK_DTSM_ZK_KERBEROS_PRINCIPAL = ZK_CONF_PREFIX
   + "kerberos.principal";
+  public static final String ZK_DTSM_ZK_JUTE_MAXBUFFER = ZK_CONF_PREFIX
+  + "jute.maxbuffer";

Review comment:
   The indentation is not correct. Check the checkstyle.

##
File path: 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/token/delegation/ZKDelegationTokenSecretManager.java
##
@@ -199,6 +202,10 @@ public ZKDelegationTokenSecretManager(Configuration conf) {
 ZK_DTSM_ZK_SESSION_TIMEOUT_DEFAULT);
 int numRetries =
 conf.getInt(ZK_DTSM_ZK_NUM_RETRIES, 
ZK_DTSM_ZK_NUM_RETRIES_DEFAULT);
+String juteMaxBuffer =
+conf.get(ZK_DTSM_ZK_JUTE_MAXBUFFER, 
ZK_DTSM_ZK_JUTE_MAXBUFFER_DEFAULT);

Review comment:
   Indentation fix.

##
File path: 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/token/delegation/ZKDelegationTokenSecretManager.java
##
@@ -199,6 +202,10 @@ public ZKDelegationTokenSecretManager(Configuration conf) {
 ZK_DTSM_ZK_SESSION_TIMEOUT_DEFAULT);
 int numRetries =
 conf.getInt(ZK_DTSM_ZK_NUM_RETRIES, 
ZK_DTSM_ZK_NUM_RETRIES_DEFAULT);
+String juteMaxBuffer =
+conf.get(ZK_DTSM_ZK_JUTE_MAXBUFFER, 
ZK_DTSM_ZK_JUTE_MAXBUFFER_DEFAULT);
+System.setProperty(ZKClientConfig.JUTE_MAXBUFFER,
+ juteMaxBuffer);

Review comment:
   This could go to the previous line.
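
   For context, the intent of the diff above, collapsed into a compact sketch 
(constant names follow the diff; the key string and default shown here are 
assumptions, and the surrounding class is omitted):

```java
// Sketch of the change discussed above (illustrative, not the merged patch):
// read a configurable limit and push it into ZooKeeper's client-side
// jute.maxbuffer system property before any ZooKeeper/Curator client is built.
import org.apache.hadoop.conf.Configuration;

public class JuteMaxBufferSketch {
  // Assumed key/default; the patch builds the key from ZK_CONF_PREFIX and uses
  // ZKClientConfig.JUTE_MAXBUFFER ("jute.maxbuffer") as the property name.
  static final String ZK_DTSM_ZK_JUTE_MAXBUFFER = "zk-dt-secret-manager.jute.maxbuffer";
  static final String ZK_DTSM_ZK_JUTE_MAXBUFFER_DEFAULT =
      String.valueOf(4 * 1024 * 1024); // mirror ZooKeeper's 4 MB default

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    String juteMaxBuffer =
        conf.get(ZK_DTSM_ZK_JUTE_MAXBUFFER, ZK_DTSM_ZK_JUTE_MAXBUFFER_DEFAULT);
    // Must be set before the ZooKeeper client classes are initialized.
    System.setProperty("jute.maxbuffer", juteMaxBuffer);
    System.out.println("jute.maxbuffer=" + System.getProperty("jute.maxbuffer"));
  }
}
```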




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 727325)
Time Spent: 1h 40m  (was: 1.5h)

> RBF: Router should explicitly specify the value of `jute.maxbuffer` in hadoop 
> configuration files like core-site.xml
> 
>
> Key: HDFS-16455
> URL: https://issues.apache.org/jira/browse/HDFS-16455
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Based on the current design for delegation tokens in a secure Router, all 
> tokens are stored and updated in ZooKeeper using ZKDelegationTokenManager.
> But the default value of the system property `jute.maxbuffer` is just 4MB. If 
> the Router stores too many tokens in ZK, it will throw an IOException 
> `{{Packet lenxx is out of range}}` and all Routers will crash. 
>  
> In our cluster,  Routers crashed because of it. The crash logs are below 
> {code:java}
> 2022-02-09 02:15:51,607 INFO 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  Token renewal for identifier: (token for xxx: HDFS_DELEGATION_TOKEN 
> owner=xxx/scheduler, renewer=hadoop, realUser=, issueDate=1644344146305, 
> maxDate=1644948946305, sequenceNumber=27136070, masterKeyId=1107); total 
> currentTokens 279548
> 2022-02-09 02:16:07,632 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x1000172775a0012 for server zkurl:2181, unexpected error, closing socket 
> connection and attempting reconnect
> java.io.IOException: Packet len4194553 is out of range!
> at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
> 2022-02-09 02:16:07,733 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 1254 on default port 9001, call Call#144 Retry#0 
> 

[jira] [Work logged] (HDFS-16396) Reconfig slow peer parameters for datanode

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16396?focusedWorklogId=727302&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727302
 ]

ASF GitHub Bot logged work on HDFS-16396:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:38
Start Date: 15/Feb/22 18:38
Worklog Time Spent: 10m 
  Work Description: tasanuma commented on pull request #3827:
URL: https://github.com/apache/hadoop/pull/3827#issuecomment-1039851634


   Merged it. Thanks for your contribution, @tomscut, and thanks for your 
review, @ayushtkn!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 727302)
Time Spent: 4h 50m  (was: 4h 40m)

> Reconfig slow peer parameters for datanode
> --
>
> Key: HDFS-16396
> URL: https://issues.apache.org/jira/browse/HDFS-16396
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> In large clusters, a rolling restart of datanodes takes a long time. We can 
> make the slow peer and slow disk parameters in the datanode reconfigurable to 
> facilitate cluster operation and maintenance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16455) RBF: Router should explicitly specify the value of `jute.maxbuffer` in hadoop configuration files like core-site.xml

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16455?focusedWorklogId=727261&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727261
 ]

ASF GitHub Bot logged work on HDFS-16455:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:35
Start Date: 15/Feb/22 18:35
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3983:
URL: https://github.com/apache/hadoop/pull/3983#issuecomment-1039875071






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 727261)
Time Spent: 1.5h  (was: 1h 20m)

> RBF: Router should explicitly specify the value of `jute.maxbuffer` in hadoop 
> configuration files like core-site.xml
> 
>
> Key: HDFS-16455
> URL: https://issues.apache.org/jira/browse/HDFS-16455
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Based on the current design for delegation tokens in a secure Router, all 
> tokens are stored and updated in ZooKeeper using ZKDelegationTokenManager.
> But the default value of the system property `jute.maxbuffer` is just 4MB. If 
> the Router stores too many tokens in ZK, it will throw an IOException 
> `{{Packet lenxx is out of range}}` and all Routers will crash. 
>  
> In our cluster,  Routers crashed because of it. The crash logs are below 
> {code:java}
> 2022-02-09 02:15:51,607 INFO 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  Token renewal for identifier: (token for xxx: HDFS_DELEGATION_TOKEN 
> owner=xxx/scheduler, renewer=hadoop, realUser=, issueDate=1644344146305, 
> maxDate=1644948946305, sequenceNumber=27136070, masterKeyId=1107); total 
> currentTokens 279548
> 2022-02-09 02:16:07,632 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x1000172775a0012 for server zkurl:2181, unexpected error, closing socket 
> connection and attempting reconnect
> java.io.IOException: Packet len4194553 is out of range!
> at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
> 2022-02-09 02:16:07,733 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 1254 on default port 9001, call Call#144 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getDelegationToken from 
> ip:46534
> java.lang.RuntimeException: Could not increment shared counter !!
> at 
> org.apache.hadoop.security.token.delegation.ZKDelegationTokenSecretManager.incrementDelegationTokenSeqNum(ZKDelegationTokenSecretManager.java:582)
>  {code}
> When we restart a Router, it crashed again
> {code:java}
> 2022-02-09 03:14:17,308 INFO 
> org.apache.hadoop.security.token.delegation.ZKDelegationTokenSecretManager: 
> Starting to load key cache.
> 2022-02-09 03:14:17,310 INFO 
> org.apache.hadoop.security.token.delegation.ZKDelegationTokenSecretManager: 
> Loaded key cache.
> 2022-02-09 03:14:32,930 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x205584be35b0001 for server zkurl:2181, unexpected
> error, closing socket connection and attempting reconnect
> java.io.IOException: Packet len4194478 is out of range!
> at 
> org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at 
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
> 2022-02-09 03:14:33,030 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.security.token.ZKDelegationTokenSecretManagerImpl:
>  Error starting threads for zkDelegationTokens
> java.io.IOException: Could not start PathChildrenCache for tokens {code}
> Finally, we configured `-Djute.maxbuffer=1000` in hadoop-env.sh to fix this 
> issue.
> After digging into it, we found that the znode `/ZKDTSMRoot/ZKDTSMTokensRoot` 
> had more than 25 children, whose total data size was over 4MB.
>  
> Maybe we should  

[jira] [Work logged] (HDFS-16440) RBF: Support router get HAServiceStatus with Lifeline RPC address

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16440?focusedWorklogId=727260&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727260
 ]

ASF GitHub Bot logged work on HDFS-16440:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:35
Start Date: 15/Feb/22 18:35
Worklog Time Spent: 10m 
  Work Description: yulongz commented on pull request #3971:
URL: https://github.com/apache/hadoop/pull/3971#issuecomment-1039887623


   @goiri This failed unit test is unrelated to my change. All tests work fine 
locally. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 727260)
Time Spent: 1h 50m  (was: 1h 40m)

> RBF: Support router get HAServiceStatus with Lifeline RPC address
> -
>
> Key: HDFS-16440
> URL: https://issues.apache.org/jira/browse/HDFS-16440
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: YulongZ
>Assignee: YulongZ
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16440.001.patch, HDFS-16440.003.patch, 
> HDFS-16440.004.patch
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> NamenodeHeartbeatService gets HAServiceStatus using 
> NNHAServiceTarget.getProxy. When we set a special 
> dfs.namenode.lifeline.rpc-address , NamenodeHeartbeatService may get 
> HAServiceStatus using NNHAServiceTarget.getHealthMonitorProxy.
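
As a rough illustration of the behaviour described above (method signatures 
here are assumptions, not the Router's actual code):

{code:java}
// Sketch only: when a lifeline RPC address is configured, fetch the
// HAServiceStatus through the health-monitor proxy instead of the regular
// client RPC proxy.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ha.HAServiceProtocol;
import org.apache.hadoop.ha.HAServiceStatus;
import org.apache.hadoop.hdfs.tools.NNHAServiceTarget;

public class LifelineStatusSketch {
  static HAServiceStatus fetchStatus(Configuration conf, String nsId, String nnId,
      boolean lifelineConfigured, int rpcTimeoutMs) throws IOException {
    NNHAServiceTarget target = new NNHAServiceTarget(conf, nsId, nnId);
    HAServiceProtocol proxy = lifelineConfigured
        ? target.getHealthMonitorProxy(conf, rpcTimeoutMs) // lifeline address
        : target.getProxy(conf, rpcTimeoutMs);             // client RPC address
    return proxy.getServiceStatus();
  }
}
{code}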



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16316) Improve DirectoryScanner: add regular file check related block

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16316?focusedWorklogId=727252&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727252
 ]

ASF GitHub Bot logged work on HDFS-16316:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 18:34
Start Date: 15/Feb/22 18:34
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3861:
URL: https://github.com/apache/hadoop/pull/3861#issuecomment-1039728159






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 727252)
Time Spent: 2h 50m  (was: 2h 40m)

> Improve DirectoryScanner: add regular file check related block
> --
>
> Key: HDFS-16316
> URL: https://issues.apache.org/jira/browse/HDFS-16316
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Something unusual happened in the online environment.
> The DataNode is configured with 11 disks (${dfs.datanode.data.dir}). The used 
> capacity calculated for 10 of the disks is normal, but the value calculated 
> for the remaining disk is much larger, which is very strange.
> This is the live view on the NameNode:
>  !screenshot-1.png! 
> This is the live view on the DataNode:
>  !screenshot-2.png! 
> We can look at the view on Linux:
>  !screenshot-3.png! 
> There is a big gap here regarding '/mnt/dfs/11/data'. This situation should 
> be prohibited from happening.
> I found that there are some abnormal block files.
> There are wrong blk_.meta files in some subdir directories, causing the 
> abnormal space calculation.
> Here are some abnormal block files:
>  !screenshot-4.png! 
> Such files should not be used as normal blocks. They should be actively 
> identified and filtered, which is good for cluster stability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16440) RBF: Support router get HAServiceStatus with Lifeline RPC address

2022-02-15 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492725#comment-17492725
 ] 

Íñigo Goiri commented on HDFS-16440:


Thanks [~YulongZ] for the PR.
Merged PR 3971 to trunk.

> RBF: Support router get HAServiceStatus with Lifeline RPC address
> -
>
> Key: HDFS-16440
> URL: https://issues.apache.org/jira/browse/HDFS-16440
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: YulongZ
>Assignee: YulongZ
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16440.001.patch, HDFS-16440.003.patch, 
> HDFS-16440.004.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> NamenodeHeartbeatService gets HAServiceStatus using 
> NNHAServiceTarget.getProxy. When we set a special 
> dfs.namenode.lifeline.rpc-address , NamenodeHeartbeatService may get 
> HAServiceStatus using NNHAServiceTarget.getHealthMonitorProxy.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16440) RBF: Support router get HAServiceStatus with Lifeline RPC address

2022-02-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri resolved HDFS-16440.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RBF: Support router get HAServiceStatus with Lifeline RPC address
> -
>
> Key: HDFS-16440
> URL: https://issues.apache.org/jira/browse/HDFS-16440
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: YulongZ
>Assignee: YulongZ
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16440.001.patch, HDFS-16440.003.patch, 
> HDFS-16440.004.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> NamenodeHeartbeatService gets HAServiceStatus using 
> NNHAServiceTarget.getProxy. When we set a special 
> dfs.namenode.lifeline.rpc-address , NamenodeHeartbeatService may get 
> HAServiceStatus using NNHAServiceTarget.getHealthMonitorProxy.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16440) RBF: Support router get HAServiceStatus with Lifeline RPC address

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16440?focusedWorklogId=727143&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727143
 ]

ASF GitHub Bot logged work on HDFS-16440:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 16:44
Start Date: 15/Feb/22 16:44
Worklog Time Spent: 10m 
  Work Description: goiri merged pull request #3971:
URL: https://github.com/apache/hadoop/pull/3971


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 727143)
Time Spent: 1h 40m  (was: 1.5h)

> RBF: Support router get HAServiceStatus with Lifeline RPC address
> -
>
> Key: HDFS-16440
> URL: https://issues.apache.org/jira/browse/HDFS-16440
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: YulongZ
>Assignee: YulongZ
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-16440.001.patch, HDFS-16440.003.patch, 
> HDFS-16440.004.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> NamenodeHeartbeatService gets HAServiceStatus using 
> NNHAServiceTarget.getProxy. When we set a special 
> dfs.namenode.lifeline.rpc-address , NamenodeHeartbeatService may get 
> HAServiceStatus using NNHAServiceTarget.getHealthMonitorProxy.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-02-15 Thread caozhiqiang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492423#comment-17492423
 ] 

caozhiqiang edited comment on HDFS-16456 at 2/15/22, 2:29 PM:
--

[~ebadger], should you help to review this issue?


was (Author: caozhiqiang):
[~ebadger], please help to review this issue.

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Priority: Critical
> Attachments: HDFS-16456.001.patch
>
>
> In below scenario, decommission will fail by TOO_MANY_NODES_ON_RACK reason:
>  # Enable EC policy, such as RS-6-3-1024k.
>  # The rack number in this cluster is equal with the replication number(9)
>  # A rack only has one DN, and decommission this DN.
> The root cause is in 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, it will 
> give a limit parameter maxNodesPerRack for choose targets. In this scenario, 
> the maxNodesPerRack is 1, which means each rack can only be chosen one 
> datanode.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> here will be called, where totalNumOfReplicas=9 and  numOfRacks=9  
> When we decommission one dn which is only one node in its rack, the 
> chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() 
> will throw NotEnoughReplicasException, but the exception will not be caught 
> and fail to fallback to chooseEvenlyFromRemainingRacks() function.
> When decommission, after choose targets, verifyBlockPlacement() function will 
> return the total rack number contains the invalid rack, and 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() will return false 
> and it will also cause decommission fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> According to the above description, we should make the below modify to fix it:
>  # In startDecommission() or stopDecommission(), we should also change the 
> numOfRacks in class NetworkTopology. Or choose targets may fail for the 
> maxNodesPerRack is too small. And even choose targets success, 
> isPlacementPolicySatisfied will also return false cause decommission fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first 
> chooseOnce() call should also be wrapped in try...catch, or it will not fall 
> back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
>  # In chooseEvenlyFromRemainingRacks(), the numResultsOflastChoose = 
> results.size(); statement should be moved to after chooseOnce(), or it will 
> throw lastException and make choosing targets fail.
>  
>  
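
To make points 2 and 3 of the quoted list concrete, a small toy sketch of the 
intended control flow (deliberately simplified; this is not the 
BlockPlacementPolicyRackFaultTolerant code):

{code:java}
// Toy model: the strict first attempt must be wrapped in try/catch so the
// "remaining racks" fallback can run, and the result-count snapshot must be
// taken after the fallback, not before it.
import java.util.ArrayList;
import java.util.List;

public class PlacementFallbackSketch {
  static class NotEnoughReplicasException extends Exception {}

  // Stand-in for the strict one-node-per-rack first attempt.
  static void chooseStrictly(List<String> results) throws NotEnoughReplicasException {
    throw new NotEnoughReplicasException(); // simulate the rack being decommissioned
  }

  // Stand-in for the relaxed retry over the remaining racks.
  static void chooseFromRemainingRacks(List<String> results) {
    results.add("dn-on-another-rack");
  }

  public static void main(String[] args) {
    List<String> results = new ArrayList<>();
    try {
      chooseStrictly(results);
    } catch (NotEnoughReplicasException e) {
      chooseFromRemainingRacks(results);
    }
    // Snapshot only after the fallback, so a successful retry is not mistaken
    // for "no progress" (which would surface the saved exception).
    int numResultsOfLastChoose = results.size();
    System.out.println("chosen=" + results + ", count=" + numResultsOfLastChoose);
  }
}
{code}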



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-02-15 Thread caozhiqiang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492423#comment-17492423
 ] 

caozhiqiang edited comment on HDFS-16456 at 2/15/22, 2:29 PM:
--

Hi, [~ebadger], should you help to review this issue?


was (Author: caozhiqiang):
[~ebadger], should you help to review this issue?

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Priority: Critical
> Attachments: HDFS-16456.001.patch
>
>
> In below scenario, decommission will fail by TOO_MANY_NODES_ON_RACK reason:
>  # Enable EC policy, such as RS-6-3-1024k.
>  # The rack number in this cluster is equal with the replication number(9)
>  # A rack only has one DN, and decommission this DN.
> The root cause is in 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, it will 
> give a limit parameter maxNodesPerRack for choose targets. In this scenario, 
> the maxNodesPerRack is 1, which means each rack can only be chosen one 
> datanode.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> here will be called, where totalNumOfReplicas=9 and  numOfRacks=9  
> When we decommission one dn which is only one node in its rack, the 
> chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() 
> will throw NotEnoughReplicasException, but the exception will not be caught 
> and fail to fallback to chooseEvenlyFromRemainingRacks() function.
> When decommission, after choose targets, verifyBlockPlacement() function will 
> return the total rack number contains the invalid rack, and 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() will return false 
> and it will also cause decommission fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> According to the above description, we should make the below modify to fix it:
>  # In startDecommission() or stopDecommission(), we should also change the 
> numOfRacks in class NetworkTopology. Or choose targets may fail for the 
> maxNodesPerRack is too small. And even choose targets success, 
> isPlacementPolicySatisfied will also return false cause decommission fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first 
> chooseOnce() call should also be wrapped in try...catch, or it will not fall 
> back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
>  # In chooseEvenlyFromRemainingRacks(), the numResultsOflastChoose = 
> results.size(); statement should be moved to after chooseOnce(), or it will 
> throw lastException and make choosing targets fail.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16316) Improve DirectoryScanner: add regular file check related block

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16316?focusedWorklogId=727026&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727026
 ]

ASF GitHub Bot logged work on HDFS-16316:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 13:58
Start Date: 15/Feb/22 13:58
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3861:
URL: https://github.com/apache/hadoop/pull/3861#issuecomment-1040305852


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 45s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 36s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  22m 44s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  24m 35s |  |  trunk passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |  21m 11s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   3m 45s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 13s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 24s |  |  trunk passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 12s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   6m  0s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m 31s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 24s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  24m 48s |  |  the patch passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |  24m 48s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  22m 16s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |  22m 16s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   3m 51s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 17s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 21s |  |  the patch passed with JDK 
Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 28s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   6m 36s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  24m 22s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  17m 45s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  unit  | 238m 14s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  2s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 473m 27s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3861/6/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3861 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 94871cc29013 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / ca61b8bee32722ede0c39562b39edeee90521ce0 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3861/6/testReport/ |
   | Max. process+thread count | 3234 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common 
hadoop-hdfs-project/hadoop-hdfs U: . 

[jira] [Commented] (HDFS-15745) Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES configurable

2022-02-15 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492607#comment-17492607
 ] 

Takanobu Asanuma commented on HDFS-15745:
-

I'd like to cherry-pick it into branch-3.3 based on [~huanghaibin]'s work. 
Created the PR: https://github.com/apache/hadoop/pull/3992

> Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES 
> configurable
> --
>
> Key: HDFS-15745
> URL: https://issues.apache.org/jira/browse/HDFS-15745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haibin Huang
>Assignee: Haibin Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-15745-001.patch, HDFS-15745-002.patch, 
> HDFS-15745-003.patch, HDFS-15745-branch-3.1.001.patch, 
> HDFS-15745-branch-3.2.001.patch, HDFS-15745-branch-3.3.001.patch, 
> image-2020-12-22-17-00-50-796.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I enabled DataNodePeerMetrics to find slow peers in the cluster, I found 
> a lot of slow peers even though ReportingNodes's averageDelay was very low 
> and these slow peer nodes were normal. I think the reason so many slow peers 
> are generated is that the value of DataNodePeerMetrics#LOW_THRESHOLD_MS is 
> too small (only 5ms) and it is not configurable. The default value of the 
> slow IO warning log threshold is 300ms, i.e. 
> DFSConfigKeys.DFS_DATANODE_SLOW_IO_WARNING_THRESHOLD_DEFAULT = 300, so 
> DataNodePeerMetrics#LOW_THRESHOLD_MS should not be less than 300ms; otherwise 
> the namenode will get a lot of invalid slow peer information.
> !image-2020-12-22-17-00-50-796.png!
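
For readers unfamiliar with the pattern, a minimal sketch of the configuration 
plumbing such a change usually involves (the key names below are placeholders, 
not confirmed against the patch; check hdfs-default.xml of the release that 
carries the change for the real keys):

{code:java}
// Illustrative sketch only: the usual Hadoop pattern for turning hard-coded
// constants into configurable values backed by Configuration.
import org.apache.hadoop.conf.Configuration;

public class PeerMetricsThresholdSketch {
  // 5 ms is the old hard-coded LOW_THRESHOLD_MS mentioned in the description.
  private static final long DEFAULT_LOW_THRESHOLD_MS = 5;
  private static final int DEFAULT_MIN_OUTLIER_DETECTION_NODES = 10; // placeholder

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    long lowThresholdMs = conf.getLong(
        "dfs.datanode.slowpeer.low.threshold.ms",   // placeholder key name
        DEFAULT_LOW_THRESHOLD_MS);
    int minOutlierNodes = conf.getInt(
        "dfs.datanode.min.outlier.detection.nodes", // placeholder key name
        DEFAULT_MIN_OUTLIER_DETECTION_NODES);
    System.out.println("lowThresholdMs=" + lowThresholdMs
        + ", minOutlierDetectionNodes=" + minOutlierNodes);
  }
}
{code}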



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15745) Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES configurable

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15745?focusedWorklogId=727009&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-727009
 ]

ASF GitHub Bot logged work on HDFS-15745:
-

Author: ASF GitHub Bot
Created on: 15/Feb/22 13:16
Start Date: 15/Feb/22 13:16
Worklog Time Spent: 10m 
  Work Description: tasanuma opened a new pull request #3992:
URL: https://github.com/apache/hadoop/pull/3992


   
   
   ### Description of PR
   
   HDFS-15745. Make DataNodePeerMetrics#LOW_THRESHOLD_MS and 
MIN_OUTLIER_DETECTION_NODES configurable.
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [x] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 727009)
Remaining Estimate: 0h
Time Spent: 10m

> Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES 
> configurable
> --
>
> Key: HDFS-15745
> URL: https://issues.apache.org/jira/browse/HDFS-15745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haibin Huang
>Assignee: Haibin Huang
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15745-001.patch, HDFS-15745-002.patch, 
> HDFS-15745-003.patch, HDFS-15745-branch-3.1.001.patch, 
> HDFS-15745-branch-3.2.001.patch, HDFS-15745-branch-3.3.001.patch, 
> image-2020-12-22-17-00-50-796.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I enabled DataNodePeerMetrics to find slow peers in the cluster, I found 
> a lot of slow peers even though ReportingNodes's averageDelay was very low 
> and these slow peer nodes were normal. I think the reason so many slow peers 
> are generated is that the value of DataNodePeerMetrics#LOW_THRESHOLD_MS is 
> too small (only 5ms) and it is not configurable. The default value of the 
> slow IO warning log threshold is 300ms, i.e. 
> DFSConfigKeys.DFS_DATANODE_SLOW_IO_WARNING_THRESHOLD_DEFAULT = 300, so 
> DataNodePeerMetrics#LOW_THRESHOLD_MS should not be less than 300ms; otherwise 
> the namenode will get a lot of invalid slow peer information.
> !image-2020-12-22-17-00-50-796.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15745) Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES configurable

2022-02-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-15745:
--
Labels: pull-request-available  (was: )

> Make DataNodePeerMetrics#LOW_THRESHOLD_MS and MIN_OUTLIER_DETECTION_NODES 
> configurable
> --
>
> Key: HDFS-15745
> URL: https://issues.apache.org/jira/browse/HDFS-15745
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haibin Huang
>Assignee: Haibin Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-15745-001.patch, HDFS-15745-002.patch, 
> HDFS-15745-003.patch, HDFS-15745-branch-3.1.001.patch, 
> HDFS-15745-branch-3.2.001.patch, HDFS-15745-branch-3.3.001.patch, 
> image-2020-12-22-17-00-50-796.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I enabled DataNodePeerMetrics to find slow peers in the cluster, I found 
> a lot of slow peers even though ReportingNodes's averageDelay was very low 
> and these slow peer nodes were normal. I think the reason so many slow peers 
> are generated is that the value of DataNodePeerMetrics#LOW_THRESHOLD_MS is 
> too small (only 5ms) and it is not configurable. The default value of the 
> slow IO warning log threshold is 300ms, i.e. 
> DFSConfigKeys.DFS_DATANODE_SLOW_IO_WARNING_THRESHOLD_DEFAULT = 300, so 
> DataNodePeerMetrics#LOW_THRESHOLD_MS should not be less than 300ms; otherwise 
> the namenode will get a lot of invalid slow peer information.
> !image-2020-12-22-17-00-50-796.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org