[jira] [Updated] (HDFS-17478) FSPermissionChecker to avoid obtaining a new AccessControlEnforcer instance before each authz call

2024-04-18 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-17478:
---
Fix Version/s: 3.5.0
   (was: 3.4.1)

> FSPermissionChecker to avoid obtaining a new AccessControlEnforcer instance 
> before each authz call
> --
>
> Key: HDFS-17478
> URL: https://issues.apache.org/jira/browse/HDFS-17478
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Madhan Neethiraj
>Assignee: Madhan Neethiraj
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: HDFS-17478.patch
>
>
> An instance of AccessControlEnforcer is obtained from the registered 
> INodeAttributeProvider before every call made to the authorizer. This can be 
> avoided by initializing the AccessControlEnforcer instance during 
> construction of FSPermissionChecker and reusing it in every subsequent call to 
> the authorizer. This will eliminate the unnecessary overhead in the highly 
> performance-sensitive authz code path.
>  
> CC: [~abhay], [~arp], [~swagle]
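
A minimal illustrative sketch of the caching pattern described above, using simplified stand-in types rather than the real FSPermissionChecker/INodeAttributeProvider classes (the actual change is in the attached patch):

{code:java}
// Simplified stand-in types; not the real HDFS classes. The point is only to show
// the pattern: resolve the enforcer once at construction, reuse it on every check.
interface Enforcer {
  void checkPermission(String path, String user);
}

interface AttributeProvider {
  Enforcer getExternalAccessControlEnforcer(Enforcer defaultEnforcer);
}

class PermissionChecker {
  private final Enforcer enforcer;   // resolved once, not per authz call

  PermissionChecker(AttributeProvider provider, Enforcer defaultEnforcer) {
    this.enforcer = (provider != null)
        ? provider.getExternalAccessControlEnforcer(defaultEnforcer)
        : defaultEnforcer;
  }

  void check(String path, String user) {
    enforcer.checkPermission(path, user);  // no provider lookup on the hot path
  }
}
{code}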



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17478) FSPermissionChecker to avoid obtaining a new AccessControlEnforcer instance before each authz call

2024-04-18 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-17478:
---
Fix Version/s: 3.4.1
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> FSPermissionChecker to avoid obtaining a new AccessControlEnforcer instance 
> before each authz call
> --
>
> Key: HDFS-17478
> URL: https://issues.apache.org/jira/browse/HDFS-17478
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Madhan Neethiraj
>Assignee: Madhan Neethiraj
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.1
>
> Attachments: HDFS-17478.patch
>
>
> An instance of AccessControlEnforcer is obtained from the registered 
> INodeAttributeProvider before every call made to the authorizer. This can be 
> avoided by initializing the AccessControlEnforcer instance during 
> construction of FSPermissionChecker and reusing it in every subsequent call to 
> the authorizer. This will eliminate the unnecessary overhead in the highly 
> performance-sensitive authz code path.
>  
> CC: [~abhay], [~arp], [~swagle]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17478) FSPermissionChecker to avoid obtaining a new AccessControlEnforcer instance before each authz call

2024-04-17 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-17478:
--

Assignee: Madhan Neethiraj

> FSPermissionChecker to avoid obtaining a new AccessControlEnforcer instance 
> before each authz call
> --
>
> Key: HDFS-17478
> URL: https://issues.apache.org/jira/browse/HDFS-17478
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Madhan Neethiraj
>Assignee: Madhan Neethiraj
>Priority: Major
> Attachments: HDFS-17478.patch
>
>
> An instance of AccessControlEnforcer is obtained from the registered 
> INodeAttributeProvider before every call made to the authorizer. This can be 
> avoided by initializing the AccessControlEnforcer instance during 
> construction of FSPermissionChecker and reusing it in every subsequent call to 
> the authorizer. This will eliminate the unnecessary overhead in the highly 
> performance-sensitive authz code path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17181) WebHDFS not considering whether a DN is good when called from outside the cluster

2024-02-20 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-17181:
--

Assignee: Lars Francke

> WebHDFS not considering whether a DN is good when called from outside the 
> cluster
> -
>
> Key: HDFS-17181
> URL: https://issues.apache.org/jira/browse/HDFS-17181
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, webhdfs
>Affects Versions: 3.3.6
>Reporter: Lars Francke
>Assignee: Lars Francke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: Test_fix_for_HDFS-171811.patch
>
>
> When calling WebHDFS to create a file (I'm sure the same problem occurs for 
> other actions e.g. OPEN but I haven't checked all of them yet) it will 
> happily redirect to nodes that are in maintenance.
> The reason is in the 
> [{{chooseDatanode}}|https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/web/resources/NamenodeWebHdfsMethods.java#L307C9-L315]
>  method in {{NamenodeWebHdfsMethods}} where it will only call the 
> {{BlockPlacementPolicy}} (which considers all these edge cases) in case the 
> {{remoteAddr}} (i.e. the address making the request to WebHDFS) is also 
> running a DataNode.
>  
> In all other cases it just refers to 
> [{{NetworkTopology#chooseRandom}}|https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/web/resources/NamenodeWebHdfsMethods.java#L342-L343]
>  which does not consider any of these circumstances (e.g. load, maintenance).
> I don't understand the reason to not just always refer to the placement 
> policy and we're currently testing a patch to do just that.
> I have attached a draft patch for now.
>  
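
A self-contained sketch of the decision described above, with placeholder types rather than the real NamenodeWebHdfsMethods/BlockPlacementPolicy code; the draft patch's idea is to always apply the "good node" filtering, regardless of where the caller is:

{code:java}
import java.util.List;
import java.util.Map;
import java.util.Random;

// Placeholder types only; illustrates the described decision, not the real code.
interface Node {
  boolean isInMaintenance();
  boolean isOverloaded();
}

class DatanodeChooser {
  private final Map<String, Node> datanodeByHost;  // hosts that also run a DataNode
  private final List<Node> allNodes;
  private final Random random = new Random();

  DatanodeChooser(Map<String, Node> byHost, List<Node> all) {
    this.datanodeByHost = byHost;
    this.allNodes = all;
  }

  // Current behaviour as described: callers co-located with a DataNode get
  // placement-policy-style filtering; external callers get a purely random node.
  Node choose(String remoteAddr) {
    if (datanodeByHost.containsKey(remoteAddr)) {
      return chooseGoodNode();
    }
    return allNodes.get(random.nextInt(allNodes.size()));  // ignores load/maintenance
  }

  // The draft patch's idea: always go through this filtering, whoever the caller is.
  Node chooseGoodNode() {
    return allNodes.stream()
        .filter(n -> !n.isInMaintenance() && !n.isOverloaded())
        .findAny()
        .orElseThrow(() -> new IllegalStateException("no suitable datanode"));
  }
}
{code}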



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17181) WebHDFS not considering whether a DN is good when called from outside the cluster

2024-02-20 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-17181.

Fix Version/s: 3.5.0
   Resolution: Fixed

> WebHDFS not considering whether a DN is good when called from outside the 
> cluster
> -
>
> Key: HDFS-17181
> URL: https://issues.apache.org/jira/browse/HDFS-17181
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, webhdfs
>Affects Versions: 3.3.6
>Reporter: Lars Francke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: Test_fix_for_HDFS-171811.patch
>
>
> When calling WebHDFS to create a file (I'm sure the same problem occurs for 
> other actions e.g. OPEN but I haven't checked all of them yet) it will 
> happily redirect to nodes that are in maintenance.
> The reason is in the 
> [{{chooseDatanode}}|https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/web/resources/NamenodeWebHdfsMethods.java#L307C9-L315]
>  method in {{NamenodeWebHdfsMethods}} where it will only call the 
> {{BlockPlacementPolicy}} (which considers all these edge cases) in case the 
> {{remoteAddr}} (i.e. the address making the request to WebHDFS) is also 
> running a DataNode.
>  
> In all other cases it just refers to 
> [{{NetworkTopology#chooseRandom}}|https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/web/resources/NamenodeWebHdfsMethods.java#L342-L343]
>  which does not consider any of these circumstances (e.g. load, maintenance).
> I don't understand the reason to not just always refer to the placement 
> policy and we're currently testing a patch to do just that.
> I have attached a draft patch for now.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17024) Potential data race introduced by HDFS-15865

2023-10-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-17024:
---
Fix Version/s: 3.3.9

> Potential data race introduced by HDFS-15865
> 
>
> Key: HDFS-17024
> URL: https://issues.apache.org/jira/browse/HDFS-17024
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 3.3.1
>Reporter: Wei-Chiu Chuang
>Assignee: Segawa Hiroaki
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> After HDFS-15865, we found client aborted due to an NPE.
> {noformat}
> 2023-04-10 16:07:43,409 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: * ABORTING region 
> server kqhdp36,16020,1678077077562: Replay of WAL required. Forcing server 
> shutdown *
> org.apache.hadoop.hbase.DroppedSnapshotException: region: WAFER_ALL,16|CM 
> RIE.MA1|CP1114561.18|PROC|,1625899466315.0fbdf0f1810efa9e68af831247e6555f.
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2870)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2539)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2511)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2401)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:613)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:582)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:362)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:880)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:781)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:898)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:850)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:76)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishClose(HFileWriterImpl.java:859)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:687)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:393)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:78)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1047)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2806)
> {noformat}
> This is only possible if a data race occurred. Filing this jira to 
> investigate and eliminate the data race.
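
A generic illustration of the kind of check-then-use race that would produce an NPE like the one above (a placeholder class, not the actual DataStreamer code or fix):

{code:java}
// Placeholder class; only illustrates the failure mode, not the actual fix.
class RacySender {
  private volatile Thread responder;   // cleared by one thread while another uses it

  void close() {
    responder = null;                  // thread A tears the field down
  }

  void waitForAck() {
    // Buggy check-then-use: the field can become null between the test and the call.
    if (responder != null) {
      responder.interrupt();           // thread B can hit an NPE here
    }

    // Safer: read the field once into a local and act only on the local copy.
    Thread r = responder;
    if (r != null) {
      r.interrupt();
    }
  }
}
{code}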



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17024) Potential data race introduced by HDFS-15865

2023-10-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-17024:
--

Assignee: Segawa Hiroaki

> Potential data race introduced by HDFS-15865
> 
>
> Key: HDFS-17024
> URL: https://issues.apache.org/jira/browse/HDFS-17024
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 3.3.1
>Reporter: Wei-Chiu Chuang
>Assignee: Segawa Hiroaki
>Priority: Major
>  Labels: pull-request-available
>
> After HDFS-15865, we found client aborted due to an NPE.
> {noformat}
> 2023-04-10 16:07:43,409 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: * ABORTING region 
> server kqhdp36,16020,1678077077562: Replay of WAL required. Forcing server 
> shutdown *
> org.apache.hadoop.hbase.DroppedSnapshotException: region: WAFER_ALL,16|CM 
> RIE.MA1|CP1114561.18|PROC|,1625899466315.0fbdf0f1810efa9e68af831247e6555f.
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2870)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2539)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2511)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2401)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:613)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:582)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:362)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:880)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:781)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:898)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:850)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:76)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishClose(HFileWriterImpl.java:859)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:687)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:393)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:78)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1047)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2806)
> {noformat}
> This is only possible if a data race occurred. Filing this jira to 
> investigate and eliminate the data race.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17024) Potential data race introduced by HDFS-15865

2023-10-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-17024.

Fix Version/s: 3.4.0
   Resolution: Fixed

> Potential data race introduced by HDFS-15865
> 
>
> Key: HDFS-17024
> URL: https://issues.apache.org/jira/browse/HDFS-17024
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 3.3.1
>Reporter: Wei-Chiu Chuang
>Assignee: Segawa Hiroaki
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> After HDFS-15865, we found client aborted due to an NPE.
> {noformat}
> 2023-04-10 16:07:43,409 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: * ABORTING region 
> server kqhdp36,16020,1678077077562: Replay of WAL required. Forcing server 
> shutdown *
> org.apache.hadoop.hbase.DroppedSnapshotException: region: WAFER_ALL,16|CM 
> RIE.MA1|CP1114561.18|PROC|,1625899466315.0fbdf0f1810efa9e68af831247e6555f.
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2870)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2539)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2511)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2401)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:613)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:582)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:362)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:880)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:781)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:898)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:850)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:76)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishClose(HFileWriterImpl.java:859)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:687)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:393)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:78)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1047)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2806)
> {noformat}
> This is only possible if a data race occurred. Filing this jira to 
> investigate and eliminate the data race.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17024) Potential data race introduced by HDFS-15865

2023-10-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-17024:
--

Assignee: (was: Wei-Chiu Chuang)

> Potential data race introduced by HDFS-15865
> 
>
> Key: HDFS-17024
> URL: https://issues.apache.org/jira/browse/HDFS-17024
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 3.3.1
>Reporter: Wei-Chiu Chuang
>Priority: Major
>  Labels: pull-request-available
>
> After HDFS-15865, we found client aborted due to an NPE.
> {noformat}
> 2023-04-10 16:07:43,409 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: * ABORTING region 
> server kqhdp36,16020,1678077077562: Replay of WAL required. Forcing server 
> shutdown *
> org.apache.hadoop.hbase.DroppedSnapshotException: region: WAFER_ALL,16|CM 
> RIE.MA1|CP1114561.18|PROC|,1625899466315.0fbdf0f1810efa9e68af831247e6555f.
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2870)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2539)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2511)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2401)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:613)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:582)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:362)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:880)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:781)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:898)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:850)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:76)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishClose(HFileWriterImpl.java:859)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:687)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:393)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:78)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1047)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2806)
> {noformat}
> This is only possible if a data race occurred. Filing this jira to 
> investigate and eliminate the data race.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15273) CacheReplicationMonitor hold lock for long time and lead to NN out of service

2023-10-26 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780041#comment-17780041
 ] 

Wei-Chiu Chuang commented on HDFS-15273:


+1 sorry for the very late review.

IMO the second sleep isn't really needed.

> CacheReplicationMonitor hold lock for long time and lead to NN out of service
> -
>
> Key: HDFS-15273
> URL: https://issues.apache.org/jira/browse/HDFS-15273
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching, namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15273.001.patch, HDFS-15273.002.patch, 
> HDFS-15273.003.patch
>
>
> CacheReplicationMonitor scans the cache directives and the cached block map 
> periodically. As more and more cache directives are added, 
> CacheReplicationMonitor takes a very long time to rescan all of the cache 
> directives and cache blocks. Meanwhile, the scan operation holds the global write 
> lock, so during the scan period the NameNode cannot process other requests.
> So I think we should warn end users who turn on the CacheManager feature 
> about this risk until the implementation is improved.
> {code:java}
>   private void rescan() throws InterruptedException {
> scannedDirectives = 0;
> scannedBlocks = 0;
> try {
>   namesystem.writeLock();
>   try {
> lock.lock();
> if (shutdown) {
>   throw new InterruptedException("CacheReplicationMonitor was " +
>   "shut down.");
> }
> curScanCount = completedScanCount + 1;
>   } finally {
> lock.unlock();
>   }
>   resetStatistics();
>   rescanCacheDirectives();
>   rescanCachedBlockMap();
>   blockManager.getDatanodeManager().resetLastCachingDirectiveSentTime();
> } finally {
>   namesystem.writeUnlock();
> }
>   }
> {code}
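
The usual mitigation for a long scan under a global lock is to work in bounded batches and drop the lock between batches; below is a minimal sketch with placeholder types, not the actual CacheReplicationMonitor change:

{code:java}
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Placeholder batch-scan sketch; not the actual CacheReplicationMonitor change.
class BatchedRescan {
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

  void rescanInBatches(List<String> directives, int batchSize) {
    int i = 0;
    while (i < directives.size()) {
      fsLock.writeLock().lock();
      try {
        int end = Math.min(i + batchSize, directives.size());
        for (; i < end; i++) {
          process(directives.get(i));   // bounded work while holding the lock
        }
      } finally {
        fsLock.writeLock().unlock();    // let waiting RPC handlers make progress
      }
    }
  }

  private void process(String directive) {
    // placeholder for the per-directive scan work
  }
}
{code}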



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15273) CacheReplicationMonitor hold lock for long time and lead to NN out of service

2023-10-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-15273:
---
Fix Version/s: 3.4.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> CacheReplicationMonitor hold lock for long time and lead to NN out of service
> -
>
> Key: HDFS-15273
> URL: https://issues.apache.org/jira/browse/HDFS-15273
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching, namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15273.001.patch, HDFS-15273.002.patch, 
> HDFS-15273.003.patch
>
>
> CacheReplicationMonitor scans the cache directives and the cached block map 
> periodically. As more and more cache directives are added, 
> CacheReplicationMonitor takes a very long time to rescan all of the cache 
> directives and cache blocks. Meanwhile, the scan operation holds the global write 
> lock, so during the scan period the NameNode cannot process other requests.
> So I think we should warn end users who turn on the CacheManager feature 
> about this risk until the implementation is improved.
> {code:java}
>   private void rescan() throws InterruptedException {
> scannedDirectives = 0;
> scannedBlocks = 0;
> try {
>   namesystem.writeLock();
>   try {
> lock.lock();
> if (shutdown) {
>   throw new InterruptedException("CacheReplicationMonitor was " +
>   "shut down.");
> }
> curScanCount = completedScanCount + 1;
>   } finally {
> lock.unlock();
>   }
>   resetStatistics();
>   rescanCacheDirectives();
>   rescanCachedBlockMap();
>   blockManager.getDatanodeManager().resetLastCachingDirectiveSentTime();
> } finally {
>   namesystem.writeUnlock();
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16950) Gap in edits after -initializeSharedEdits

2023-10-26 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780001#comment-17780001
 ] 

Wei-Chiu Chuang commented on HDFS-16950:


Karthik said the missing edit logs caused data loss, and it's reproducible.

A workaround would be to put the NN into safe mode and take a checkpoint before 
proceeding with the migration.
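
A hedged sketch of that workaround using standard dfsadmin commands (the exact sequence may differ per deployment):

{noformat}
hdfs dfsadmin -safemode enter    # block namespace modifications
hdfs dfsadmin -saveNamespace     # force a checkpoint: fsimage now covers all edits
# ... proceed with the JN migration / initializeSharedEdits steps ...
hdfs dfsadmin -safemode leave
{noformat}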


> Gap in edits after -initializeSharedEdits
> -
>
> Key: HDFS-16950
> URL: https://issues.apache.org/jira/browse/HDFS-16950
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node, namenode
>Reporter: Karthik Palanisamy
>Priority: Critical
>
> The NameNode failed in a production cluster when the JN role was migrated. 
> {code:java}
> ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start 
> namenode.
> java.io.IOException: There appears to be a gap in the edit log.  We expected 
> txid xx, but got txid xx. {code}
> InitializeSharedEdits was issued as part of the role migration step. Note that no 
> checkpoint had been performed in the past few hours. 
> InitializeSharedEdits created a new log segment from the edits_inprogress 
> transaction and deleted all old transactions. 
> My ask here is to delete only edit transactions older than the fsimage 
> transaction. But currently it deletes all transactions, and no check is 
> enforced in JNStorage#format(). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16950) Gap in edits after -initializeSharedEdits

2023-10-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16950:
---
Priority: Critical  (was: Major)

> Gap in edits after -initializeSharedEdits
> -
>
> Key: HDFS-16950
> URL: https://issues.apache.org/jira/browse/HDFS-16950
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: journal-node, namenode
>Reporter: Karthik Palanisamy
>Priority: Critical
>
> The NameNode failed in a production cluster when the JN role was migrated. 
> {code:java}
> ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start 
> namenode.
> java.io.IOException: There appears to be a gap in the edit log.  We expected 
> txid xx, but got txid xx. {code}
> InitializeSharedEdits was issued as part of the role migration step. Note that no 
> checkpoint had been performed in the past few hours. 
> InitializeSharedEdits created a new log segment from the edits_inprogress 
> transaction and deleted all old transactions. 
> My ask here is to delete only edit transactions older than the fsimage 
> transaction. But currently it deletes all transactions, and no check is 
> enforced in JNStorage#format(). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16950) Gap in edits after -initializeSharedEdits

2023-10-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16950:
---
Issue Type: Bug  (was: Improvement)

> Gap in edits after -initializeSharedEdits
> -
>
> Key: HDFS-16950
> URL: https://issues.apache.org/jira/browse/HDFS-16950
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node, namenode
>Reporter: Karthik Palanisamy
>Priority: Critical
>
> The NameNode failed in a production cluster when the JN role was migrated. 
> {code:java}
> ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start 
> namenode.
> java.io.IOException: There appears to be a gap in the edit log.  We expected 
> txid xx, but got txid xx. {code}
> InitializeSharedEdits was issued as part of the role migration step. Note that no 
> checkpoint had been performed in the past few hours. 
> InitializeSharedEdits created a new log segment from the edits_inprogress 
> transaction and deleted all old transactions. 
> My ask here is to delete only edit transactions older than the fsimage 
> transaction. But currently it deletes all transactions, and no check is 
> enforced in JNStorage#format(). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17175) setOwner may set any user due to HDFS-16798

2023-08-31 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760992#comment-17760992
 ] 

Wei-Chiu Chuang commented on HDFS-17175:


[~Linwood] can you elaborate?

[~xuzq_zander] [~hexiaoqiao] any ideas?

> setOwner may set any user due to HDFS-16798
> -
>
> Key: HDFS-17175
> URL: https://issues.apache.org/jira/browse/HDFS-17175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: TangLin
>Priority: Major
>
> This MR may cause the mapping between t2i and i2t to become 
> inconsistent. We need to revert it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17080) Fix ec connection leak (GitHub PR#5807)

2023-07-11 Thread Wei-Chiu Chuang (Jira)
Wei-Chiu Chuang created HDFS-17080:
--

 Summary: Fix ec connection leak (GitHub PR#5807)
 Key: HDFS-17080
 URL: https://issues.apache.org/jira/browse/HDFS-17080
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Wei-Chiu Chuang


Creating this jira to track GitHub PR #5807.

{quote}Description of PR
This PR fixes an EC connection leak that occurs if an exception is thrown while 
constructing the reader.

How was this patch tested?
Cluster: Presto
Data: EC
Query: select col from table(EC data) limit 10

Presto is a long-running process that handles queries.
In this case, once it has fetched 10 records, it interrupts the other threads.
Those threads may be constructing a reader or fetching the next record.
If fetching the next record is interrupted, the exception is caught and Presto closes 
the reader.
But if constructing the reader is interrupted, Presto cannot close it because 
the reader has not yet been created on the Presto side.
So we can observe whether the EC connections are closed when running an EC limit query, 
using the netstat command.{quote}
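
A generic sketch of the leak-on-interrupt pattern the PR text describes, with placeholder types (not the actual striped-reader code): if construction fails after some connections are opened, the factory must close them itself, because the caller never receives an object to close.

{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.ArrayList;
import java.util.List;

// Placeholder types; illustrates close-on-failed-construction, not the real EC reader.
class StripedReader implements Closeable {
  private final List<Closeable> connections;

  private StripedReader(List<Closeable> connections) {
    this.connections = connections;
  }

  static StripedReader open(List<ConnectionFactory> factories) throws IOException {
    List<Closeable> opened = new ArrayList<>();
    try {
      for (ConnectionFactory f : factories) {
        if (Thread.currentThread().isInterrupted()) {
          throw new InterruptedIOException("interrupted while constructing reader");
        }
        opened.add(f.connect());          // e.g. one connection per block in the group
      }
      return new StripedReader(opened);
    } catch (IOException e) {
      for (Closeable c : opened) {        // caller has no reader yet, so clean up here
        try { c.close(); } catch (IOException ignored) { }
      }
      throw e;
    }
  }

  @Override
  public void close() throws IOException {
    for (Closeable c : connections) {
      c.close();
    }
  }

  interface ConnectionFactory {
    Closeable connect() throws IOException;
  }
}
{code}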



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16866) Fix a typo in Dispatcher.

2023-06-14 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16866:
---
Fix Version/s: 3.4.0
   (was: 3.3.6)

> Fix a typo in Dispatcher.
> -
>
> Key: HDFS-16866
> URL: https://issues.apache.org/jira/browse/HDFS-16866
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: ZhiWei Shi
>Assignee: ZhiWei Shi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Fix a typo in Dispatcher.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16861) RBF. Truncate API always fails when dirs use AllResolver order on Router

2023-06-14 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16861:
---
Fix Version/s: 3.4.0
   (was: 3.3.6)

> RBF. Truncate API always fails when dirs use AllResolver order on Router  
> -
>
> Key: HDFS-16861
> URL: https://issues.apache.org/jira/browse/HDFS-16861
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2022-12-05-17-35-19-841.png
>
>
> # Prepare a directory in a HASH_ALL/SPACE/RANDOM mount point.
>  # Put a 1024-byte test file into this directory.
>  # Truncate the file to a new length of 100; this op fails and throws an 
> exception that the file does not exist.
>  
> After digging into it, we should ignore the result of the Truncate API in 
> RouterClientProtocol, because
> truncate can return true/false, so we should not expect a result.
> After the fix, the code is 
> !image-2022-12-05-17-35-19-841.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16858) Dynamically adjust max slow disks to exclude

2023-06-14 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16858:
---
Fix Version/s: 3.4.0
   (was: 3.3.6)

> Dynamically adjust max slow disks to exclude
> 
>
> Key: HDFS-16858
> URL: https://issues.apache.org/jira/browse/HDFS-16858
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Reporter: dingshun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15368) TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally

2023-06-12 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-15368:
---
Fix Version/s: 3.4.0

> TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally
> 
>
> Key: HDFS-15368
> URL: https://issues.apache.org/jira/browse/HDFS-15368
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
>  Labels: balancer, test
> Fix For: 3.4.0, 3.3.6
>
> Attachments: HDFS-15368.001.patch, HDFS-15368.002.patch, 
> TestBalancerWithHANameNodes.testBalancerObserver.log, 
> TestBalancerWithHANameNodes.testBalancerObserver.log
>
>
> While working on HDFS-13183, I found that 
> TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally 
> because of the following code segment. Consider 1 ANN + 1 SBN + 2 ONNs: 
> when getBlocks is invoked with the Observer Read feature enabled, it can hit 
> either of the two ObserverNNs, based on my observation. So verifying only the first 
> ObserverNN and checking its number of #getBlocks invocations is not what is expected.
> {code:java}
>   for (int i = 0; i < cluster.getNumNameNodes(); i++) {
> // First observer node is at idx 2, or 3 if 2 has been shut down
> // It should get both getBlocks calls, all other NNs should see 0 
> calls
> int expectedObserverIdx = withObserverFailure ? 3 : 2;
> int expectedCount = (i == expectedObserverIdx) ? 2 : 0;
> verify(namesystemSpies.get(i), times(expectedCount))
> .getBlocks(any(), anyLong(), anyLong());
>   }
> {code}
> cc [~xkrogen], [~weichiu]. I am not very familiar with the Observer Read feature; 
> could you give some suggestions? 
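
One possible way to make the assertion robust is to count getBlocks() invocations across all NameNodes and assert the total rather than pinning the calls to one particular observer. A hedged sketch (it assumes the cluster and namesystemSpies variables from the test above and uses standard Mockito/JUnit APIs):

{code:java}
// Hedged sketch; assumes the cluster and namesystemSpies fields from the test above.
long totalGetBlocksCalls = 0;
for (int i = 0; i < cluster.getNumNameNodes(); i++) {
  totalGetBlocksCalls += org.mockito.Mockito.mockingDetails(namesystemSpies.get(i))
      .getInvocations().stream()
      .filter(inv -> "getBlocks".equals(inv.getMethod().getName()))
      .count();
}
// Both balancer getBlocks calls should land on observers, regardless of which one.
org.junit.Assert.assertEquals(2, totalGetBlocksCalls);
{code}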



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15383) RBF: Disable watch in ZKDelegationSecretManager for performance

2023-06-12 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-15383:
---
Fix Version/s: 3.3.6

> RBF: Disable watch in ZKDelegationSecretManager for performance
> ---
>
> Key: HDFS-15383
> URL: https://issues.apache.org/jira/browse/HDFS-15383
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
>
> Based on the current design for delegation tokens in a secure Router, the total 
> number of watches for tokens is the product of the number of routers and the number 
> of tokens. This is because ZKDelegationTokenManager uses PathChildrenCache 
> from Curator, which automatically sets watches, and ZK pushes the sync 
> information to each router. Evaluations have shown that a large number of 
> watches in ZooKeeper has a negative performance impact on the ZooKeeper server.
> In our practice, when the number of watches exceeds 1.2 million in a single ZK 
> server there is significant ZK performance degradation. Thus this ticket 
> is to rewrite ZKDelegationTokenManagerImpl.java to explicitly disable the 
> PathChildrenCache and have Routers sync periodically from ZooKeeper. This has 
> been working fine at the scale of 10 Routers with 2 million tokens. 
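
A minimal sketch of the "poll instead of watch" idea (a hypothetical helper, not the actual ZKDelegationTokenSecretManager rewrite): each Router refreshes its token view from ZooKeeper on a fixed schedule instead of relying on per-znode watches.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; not the actual ZKDelegationTokenSecretManager code.
class PeriodicTokenSync {
  private final ScheduledExecutorService syncer =
      Executors.newSingleThreadScheduledExecutor();

  /**
   * @param loadTokensFromZk placeholder for "read all token znodes and refresh the
   *                         local token cache" -- no watches are registered.
   */
  void start(Runnable loadTokensFromZk, long intervalSeconds) {
    syncer.scheduleWithFixedDelay(loadTokensFromZk, 0, intervalSeconds, TimeUnit.SECONDS);
  }

  void stop() {
    syncer.shutdownNow();
  }
}
{code}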



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15368) TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally

2023-06-12 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-15368:
---
Fix Version/s: 3.3.6
   (was: 3.4.0)

> TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally
> 
>
> Key: HDFS-15368
> URL: https://issues.apache.org/jira/browse/HDFS-15368
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
>  Labels: balancer, test
> Fix For: 3.3.6
>
> Attachments: HDFS-15368.001.patch, HDFS-15368.002.patch, 
> TestBalancerWithHANameNodes.testBalancerObserver.log, 
> TestBalancerWithHANameNodes.testBalancerObserver.log
>
>
> While working on HDFS-13183, I found that 
> TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally 
> because of the following code segment. Consider 1 ANN + 1 SBN + 2 ONNs: 
> when getBlocks is invoked with the Observer Read feature enabled, it can hit 
> either of the two ObserverNNs, based on my observation. So verifying only the first 
> ObserverNN and checking its number of #getBlocks invocations is not what is expected.
> {code:java}
>   for (int i = 0; i < cluster.getNumNameNodes(); i++) {
> // First observer node is at idx 2, or 3 if 2 has been shut down
> // It should get both getBlocks calls, all other NNs should see 0 
> calls
> int expectedObserverIdx = withObserverFailure ? 3 : 2;
> int expectedCount = (i == expectedObserverIdx) ? 2 : 0;
> verify(namesystemSpies.get(i), times(expectedCount))
> .getBlocks(any(), anyLong(), anyLong());
>   }
> {code}
> cc [~xkrogen], [~weichiu]. I am not very familiar with the Observer Read feature; 
> could you give some suggestions? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15383) RBF: Disable watch in ZKDelegationSecretManager for performance

2023-06-12 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-15383:
---
Fix Version/s: (was: 3.3.6)

> RBF: Disable watch in ZKDelegationSecretManager for performance
> ---
>
> Key: HDFS-15383
> URL: https://issues.apache.org/jira/browse/HDFS-15383
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Based on the current design for delegation tokens in a secure Router, the total 
> number of watches for tokens is the product of the number of routers and the number 
> of tokens. This is because ZKDelegationTokenManager uses PathChildrenCache 
> from Curator, which automatically sets watches, and ZK pushes the sync 
> information to each router. Evaluations have shown that a large number of 
> watches in ZooKeeper has a negative performance impact on the ZooKeeper server.
> In our practice, when the number of watches exceeds 1.2 million in a single ZK 
> server there is significant ZK performance degradation. Thus this ticket 
> is to rewrite ZKDelegationTokenManagerImpl.java to explicitly disable the 
> PathChildrenCache and have Routers sync periodically from ZooKeeper. This has 
> been working fine at the scale of 10 Routers with 2 million tokens. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15368) TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally

2023-06-12 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-15368:
---
Fix Version/s: (was: 3.3.6)

> TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally
> 
>
> Key: HDFS-15368
> URL: https://issues.apache.org/jira/browse/HDFS-15368
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
>  Labels: balancer, test
> Fix For: 3.4.0
>
> Attachments: HDFS-15368.001.patch, HDFS-15368.002.patch, 
> TestBalancerWithHANameNodes.testBalancerObserver.log, 
> TestBalancerWithHANameNodes.testBalancerObserver.log
>
>
> While working on HDFS-13183, I found that 
> TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally 
> because of the following code segment. Consider 1 ANN + 1 SBN + 2 ONNs: 
> when getBlocks is invoked with the Observer Read feature enabled, it can hit 
> either of the two ObserverNNs, based on my observation. So verifying only the first 
> ObserverNN and checking its number of #getBlocks invocations is not what is expected.
> {code:java}
>   for (int i = 0; i < cluster.getNumNameNodes(); i++) {
> // First observer node is at idx 2, or 3 if 2 has been shut down
> // It should get both getBlocks calls, all other NNs should see 0 
> calls
> int expectedObserverIdx = withObserverFailure ? 3 : 2;
> int expectedCount = (i == expectedObserverIdx) ? 2 : 0;
> verify(namesystemSpies.get(i), times(expectedCount))
> .getBlocks(any(), anyLong(), anyLong());
>   }
> {code}
> cc [~xkrogen], [~weichiu]. I am not very familiar with the Observer Read feature; 
> could you give some suggestions? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17024) Potential data race introduced by HDFS-15865

2023-06-08 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-17024:
---
Target Version/s: 3.3.9  (was: 3.3.6)

> Potential data race introduced by HDFS-15865
> 
>
> Key: HDFS-17024
> URL: https://issues.apache.org/jira/browse/HDFS-17024
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 3.3.1
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: pull-request-available
>
> After HDFS-15865, we found client aborted due to an NPE.
> {noformat}
> 2023-04-10 16:07:43,409 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: * ABORTING region 
> server kqhdp36,16020,1678077077562: Replay of WAL required. Forcing server 
> shutdown *
> org.apache.hadoop.hbase.DroppedSnapshotException: region: WAFER_ALL,16|CM 
> RIE.MA1|CP1114561.18|PROC|,1625899466315.0fbdf0f1810efa9e68af831247e6555f.
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2870)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2539)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2511)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2401)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:613)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:582)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:362)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:880)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:781)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:898)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:850)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:76)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishClose(HFileWriterImpl.java:859)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:687)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:393)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:78)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1047)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2806)
> {noformat}
> This is only possible if a data race occurred. Filing this jira to 
> investigate and eliminate the data race.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16839) It should consider EC reconstruction work when we determine if a node is busy

2023-06-08 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16839:
---
Component/s: ec
 erasure-coding

> It should consider EC reconstruction work when we determine if a node is busy
> -
>
> Key: HDFS-16839
> URL: https://issues.apache.org/jira/browse/HDFS-16839
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ec, erasure-coding
>Reporter: Kidd5368
>Assignee: Kidd5368
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> In chooseSourceDatanodes(), I think it's more reasonable to take EC 
> reconstruction work into consideration when we determine whether a node is busy or 
> not.
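
A tiny sketch of the proposed busyness check (a placeholder method, not the actual BlockManager#chooseSourceDatanodes code):

{code:java}
// Placeholder sketch of the proposed check; not the actual BlockManager code.
static boolean isTooBusyToBeSource(int replicationStreams,
                                   int ecReconstructionTasks,
                                   int maxReplicationStreams) {
  // Previously only replicationStreams would be compared against the limit; the
  // proposal is to count EC reconstruction work as well.
  return replicationStreams + ecReconstructionTasks >= maxReplicationStreams;
}
{code}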



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17011) Fix the metric of "HttpPort" at DataNodeInfo

2023-06-07 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-17011:
---
Fix Version/s: 3.4.0
   3.3.9

> Fix the metric of  "HttpPort" at DataNodeInfo 
> --
>
> Key: HDFS-17011
> URL: https://issues.apache.org/jira/browse/HDFS-17011
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Zhaohui Wang
>Assignee: Zhaohui Wang
>Priority: Minor
> Fix For: 3.4.0, 3.3.9
>
> Attachments: after.png, before.png
>
>
> Currently, the "HttpPort" metric is read from the conf `dfs.datanode.info.port`,
> but that conf seems to be useless: httpPort is already named infoPort and is 
> assigned (#Line1373).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17003) Erasure coding: invalidate wrong block after reporting bad blocks from datanode

2023-06-06 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-17003:
---
Target Version/s: 3.3.6

> Erasure coding: invalidate wrong block after reporting bad blocks from 
> datanode
> ---
>
> Key: HDFS-17003
> URL: https://issues.apache.org/jira/browse/HDFS-17003
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Critical
>  Labels: pull-request-available
>
> After receiving a reportBadBlocks RPC from a datanode, the NameNode computes the wrong 
> block to invalidate. This is dangerous behaviour and may cause data loss. 
> Some logs from our production cluster are below:
>  
> NameNode log:
> {code:java}
> 2023-05-08 21:23:49,112 INFO org.apache.hadoop.hdfs.StateChange: *DIR* 
> reportBadBlocks for block: 
> BP-932824627--1680179358678:blk_-9223372036848404320_1471186 on datanode: 
> datanode1:50010
> 2023-05-08 21:23:49,183 INFO org.apache.hadoop.hdfs.StateChange: *DIR* 
> reportBadBlocks for block: 
> BP-932824627--1680179358678:blk_-9223372036848404319_1471186 on datanode: 
> datanode2:50010{code}
> datanode1 log:
> {code:java}
> 2023-05-08 21:23:49,088 WARN 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
> BP-932824627--1680179358678:blk_-9223372036848404320_1471186 on 
> /data7/hadoop/hdfs/datanode
> 2023-05-08 21:24:00,509 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Failed 
> to delete replica blk_-9223372036848404319_1471186: ReplicaInfo not 
> found.{code}
>  
> This phenomenon can be reproduced.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17003) Erasure coding: invalidate wrong block after reporting bad blocks from datanode

2023-06-06 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17729619#comment-17729619
 ] 

Wei-Chiu Chuang commented on HDFS-17003:


I've not reviewed the PR but I intend to include this in the 3.3.6 release.

> Erasure coding: invalidate wrong block after reporting bad blocks from 
> datanode
> ---
>
> Key: HDFS-17003
> URL: https://issues.apache.org/jira/browse/HDFS-17003
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Critical
>  Labels: pull-request-available
>
> After receiving a reportBadBlocks RPC from a datanode, the NameNode computes the wrong 
> block to invalidate. This is dangerous behaviour and may cause data loss. 
> Some logs from our production cluster are below:
>  
> NameNode log:
> {code:java}
> 2023-05-08 21:23:49,112 INFO org.apache.hadoop.hdfs.StateChange: *DIR* 
> reportBadBlocks for block: 
> BP-932824627--1680179358678:blk_-9223372036848404320_1471186 on datanode: 
> datanode1:50010
> 2023-05-08 21:23:49,183 INFO org.apache.hadoop.hdfs.StateChange: *DIR* 
> reportBadBlocks for block: 
> BP-932824627--1680179358678:blk_-9223372036848404319_1471186 on datanode: 
> datanode2:50010{code}
> datanode1 log:
> {code:java}
> 2023-05-08 21:23:49,088 WARN 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
> BP-932824627--1680179358678:blk_-9223372036848404320_1471186 on 
> /data7/hadoop/hdfs/datanode
> 2023-05-08 21:24:00,509 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Failed 
> to delete replica blk_-9223372036848404319_1471186: ReplicaInfo not 
> found.{code}
>  
> This phenomenon can be reproduced.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17024) Potential data race introduced by HDFS-15865

2023-05-24 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-17024:
---
Component/s: dfsclient

> Potential data race introduced by HDFS-15865
> 
>
> Key: HDFS-17024
> URL: https://issues.apache.org/jira/browse/HDFS-17024
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsclient
>Affects Versions: 3.3.1
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: pull-request-available
>
> After HDFS-15865, we found client aborted due to an NPE.
> {noformat}
> 2023-04-10 16:07:43,409 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: * ABORTING region 
> server kqhdp36,16020,1678077077562: Replay of WAL required. Forcing server 
> shutdown *
> org.apache.hadoop.hbase.DroppedSnapshotException: region: WAFER_ALL,16|CM 
> RIE.MA1|CP1114561.18|PROC|,1625899466315.0fbdf0f1810efa9e68af831247e6555f.
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2870)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2539)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2511)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2401)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:613)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:582)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:362)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:880)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:781)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:898)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:850)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:76)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishClose(HFileWriterImpl.java:859)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:687)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:393)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:78)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1047)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2806)
> {noformat}
> This is only possible if a data race happened. File this jira to improve the 
> data and eliminate the data race.
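
For readers unfamiliar with this failure mode, the following is a generic sketch of the suspected race, not the actual DataStreamer code: one thread tears down shared state while another is still waiting on it, and the usual hardening is to guard both sides with the same lock.

{code:java}
// Generic illustration only. A shared field is nulled by close() while another
// thread dereferences it, producing an NPE; the safe variant reads and waits
// under one lock.
import java.util.LinkedList;
import java.util.Queue;

class AckWaiter {
  private final Object lock = new Object();
  private Queue<Long> ackQueue = new LinkedList<>();   // shared mutable state

  // Racy: 'ackQueue' may become null between the null check and contains().
  void waitForAckedRacy(long seqno) throws InterruptedException {
    while (ackQueue != null && !ackQueue.contains(seqno)) {
      Thread.sleep(10);                                 // NPE possible on a later loop
    }
  }

  // Safer: touch the shared state only while holding the lock.
  void waitForAckedSafe(long seqno) throws InterruptedException {
    synchronized (lock) {
      Queue<Long> q = ackQueue;                         // single read of the field
      while (q != null && !q.contains(seqno)) {
        lock.wait(10);
        q = ackQueue;
      }
    }
  }

  void close() {
    synchronized (lock) {
      ackQueue = null;                                  // torn down on close
      lock.notifyAll();
    }
  }
}
{code}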



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17024) Potential data race introduced by HDFS-15865

2023-05-24 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-17024:
---
Target Version/s: 3.3.6

> Potential data race introduced by HDFS-15865
> 
>
> Key: HDFS-17024
> URL: https://issues.apache.org/jira/browse/HDFS-17024
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.1
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: pull-request-available
>
> After HDFS-15865, we found client aborted due to an NPE.
> {noformat}
> 2023-04-10 16:07:43,409 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: * ABORTING region 
> server kqhdp36,16020,1678077077562: Replay of WAL required. Forcing server 
> shutdown *
> org.apache.hadoop.hbase.DroppedSnapshotException: region: WAFER_ALL,16|CM 
> RIE.MA1|CP1114561.18|PROC|,1625899466315.0fbdf0f1810efa9e68af831247e6555f.
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2870)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2539)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2511)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2401)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:613)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:582)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:362)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:880)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:781)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:898)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:850)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:76)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishClose(HFileWriterImpl.java:859)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:687)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:393)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:78)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1047)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2806)
> {noformat}
> This is only possible if a data race happened. File this jira to improve the 
> data and eliminate the data race.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17024) Potential data race introduced by HDFS-15865

2023-05-24 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-17024:
---
Affects Version/s: 3.3.1

> Potential data race introduced by HDFS-15865
> 
>
> Key: HDFS-17024
> URL: https://issues.apache.org/jira/browse/HDFS-17024
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.1
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: pull-request-available
>
> After HDFS-15865, we found client aborted due to an NPE.
> {noformat}
> 2023-04-10 16:07:43,409 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: * ABORTING region 
> server kqhdp36,16020,1678077077562: Replay of WAL required. Forcing server 
> shutdown *
> org.apache.hadoop.hbase.DroppedSnapshotException: region: WAFER_ALL,16|CM 
> RIE.MA1|CP1114561.18|PROC|,1625899466315.0fbdf0f1810efa9e68af831247e6555f.
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2870)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2539)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2511)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2401)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:613)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:582)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:362)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:880)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:781)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:898)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:850)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:76)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishClose(HFileWriterImpl.java:859)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:687)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:393)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:78)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1047)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2806)
> {noformat}
> This is only possible if a data race happened. File this jira to improve the 
> data and eliminate the data race.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17024) Potential data race introduced by HDFS-15865

2023-05-23 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-17024:
--

Assignee: Wei-Chiu Chuang

> Potential data race introduced by HDFS-15865
> 
>
> Key: HDFS-17024
> URL: https://issues.apache.org/jira/browse/HDFS-17024
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>
> After HDFS-15865, we found client aborted due to an NPE.
> {noformat}
> 2023-04-10 16:07:43,409 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: * ABORTING region 
> server kqhdp36,16020,1678077077562: Replay of WAL required. Forcing server 
> shutdown *
> org.apache.hadoop.hbase.DroppedSnapshotException: region: WAFER_ALL,16|CM 
> RIE.MA1|CP1114561.18|PROC|,1625899466315.0fbdf0f1810efa9e68af831247e6555f.
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2870)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2539)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2511)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2401)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:613)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:582)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:362)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:880)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:781)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:898)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:850)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:76)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishClose(HFileWriterImpl.java:859)
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:687)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:393)
> at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:69)
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:78)
> at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1047)
> at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2806)
> {noformat}
> This is only possible if a data race happened. File this jira to improve the 
> data and eliminate the data race.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17024) Potential data race introduced by HDFS-15865

2023-05-23 Thread Wei-Chiu Chuang (Jira)
Wei-Chiu Chuang created HDFS-17024:
--

 Summary: Potential data race introduced by HDFS-15865
 Key: HDFS-17024
 URL: https://issues.apache.org/jira/browse/HDFS-17024
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Wei-Chiu Chuang


After HDFS-15865, we found client aborted due to an NPE.
{noformat}
2023-04-10 16:07:43,409 ERROR 
org.apache.hadoop.hbase.regionserver.HRegionServer: * ABORTING region 
server kqhdp36,16020,1678077077562: Replay of WAL required. Forcing server 
shutdown *
org.apache.hadoop.hbase.DroppedSnapshotException: region: WAFER_ALL,16|CM 
RIE.MA1|CP1114561.18|PROC|,1625899466315.0fbdf0f1810efa9e68af831247e6555f.
at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2870)
at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2539)
at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2511)
at 
org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2401)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:613)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:582)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:69)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:362)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:880)
at 
org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:781)
at 
org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:898)
at 
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:850)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:76)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
at 
org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.finishClose(HFileWriterImpl.java:859)
at 
org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:687)
at 
org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:393)
at 
org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:69)
at 
org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:78)
at 
org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1047)
at 
org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2349)
at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2806)
{noformat}

This is only possible if a data race happened. File this jira to improve the 
data and eliminate the data race.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15865) Interrupt DataStreamer thread

2023-05-23 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-15865:
--

Assignee: Karthik Palanisamy

> Interrupt DataStreamer thread
> -
>
> Key: HDFS-15865
> URL: https://issues.apache.org/jira/browse/HDFS-15865
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Karthik Palanisamy
>Assignee: Karthik Palanisamy
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> We have noticed HiveServer2 halting in DataStreamer#waitForAckedSeqno. 
> I think we have to interrupt the DataStreamer if no packet ack arrives from the datanodes. 
> This likely happens with an infra/network issue.
> {code:java}
> "HiveServer2-Background-Pool: Thread-35977576" #35977576 prio=5 os_prio=0 
> cpu=797.65ms elapsed=3406.28s tid=0x7fc0c6c29800 nid=0x4198 in 
> Object.wait()  [0x7fc1079f3000]
>     java.lang.Thread.State: TIMED_WAITING (on object monitor)
>  at java.lang.Object.wait(java.base(at)11.0.5/Native Method)
>  - waiting on 
>  at 
> org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:886)
>  - waiting to re-lock in wait() <0x7fe6eda86ca0> (a 
> java.util.LinkedList){code}
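
A rough sketch of the mitigation idea, with hypothetical names rather than the actual patch: a watchdog interrupts the streamer thread once the ack wait has stalled past a timeout, so Object.wait() unblocks with an InterruptedException instead of hanging forever.

{code:java}
// Hypothetical watchdog; method and parameter names are illustrative.
class AckWatchdog {
  static void interruptIfStuck(Thread streamer, long lastAckMillis, long timeoutMillis) {
    long stalled = System.currentTimeMillis() - lastAckMillis;
    if (stalled > timeoutMillis && streamer.isAlive()) {
      // Object.wait() inside the streamer throws InterruptedException and unblocks.
      streamer.interrupt();
    }
  }
}
{code}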



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16985) delete local block file when FileNotFoundException occurred may lead to missing block.

2023-04-17 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713351#comment-17713351
 ] 

Wei-Chiu Chuang commented on HDFS-16985:


More details, please?

> delete local block file when FileNotFoundException occurred may lead to 
> missing block.
> --
>
> Key: HDFS-16985
> URL: https://issues.apache.org/jira/browse/HDFS-16985
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Chengwei Wang
>Assignee: Chengwei Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16947) RBF NamenodeHeartbeatService to report error for not being able to register namenode in state store

2023-03-15 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16947.

Resolution: Fixed

> RBF NamenodeHeartbeatService to report error for not being able to register 
> namenode in state store
> ---
>
> Key: HDFS-16947
> URL: https://issues.apache.org/jira/browse/HDFS-16947
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The namenode heartbeat service should log an error with the full stacktrace if it 
> cannot register the namenode in the state store. As of today, we only log an info 
> msg.
> For the ZooKeeper based impl, this might mean either a) the curator manager is not 
> initialized or b) writing to the znode fails after exhausting retries. In 
> either case, an INFO-level log might not be good enough and 
> we might have to look for errors elsewhere.
>  
> Sample example:
> {code:java}
> 2023-02-20 23:10:33,714 DEBUG [NamenodeHeartbeatService {ns} nn0-0] 
> router.NamenodeHeartbeatService - Received service state: ACTIVE from HA 
> namenode: {ns}-nn0:nn-0-{ns}.{cluster}:9000
> 2023-02-20 23:10:33,731 INFO  [NamenodeHeartbeatService {ns} nn0-0] 
> impl.MembershipStoreImpl - Inserting new NN registration: 
> nn-0.namenode.{cluster}:->{ns}:nn0:nn-0-{ns}.{cluster}:9000-ACTIVE
> 2023-02-20 23:10:33,731 INFO  [NamenodeHeartbeatService {ns} nn0-0] 
> router.NamenodeHeartbeatService - Cannot register namenode in the State Store
>  {code}
> If we could log full stacktrace:
> {code:java}
> 2023-02-21 00:20:24,691 ERROR [NamenodeHeartbeatService {ns} nn0-0] 
> router.NamenodeHeartbeatService - Cannot register namenode in the State Store
> org.apache.hadoop.hdfs.server.federation.store.StateStoreUnavailableException:
>  State Store driver StateStoreZooKeeperImpl in nn-0.namenode.{cluster} is not 
> ready.
>         at 
> org.apache.hadoop.hdfs.server.federation.store.driver.StateStoreDriver.verifyDriverReady(StateStoreDriver.java:158)
>         at 
> org.apache.hadoop.hdfs.server.federation.store.driver.impl.StateStoreZooKeeperImpl.putAll(StateStoreZooKeeperImpl.java:235)
>         at 
> org.apache.hadoop.hdfs.server.federation.store.driver.impl.StateStoreBaseImpl.put(StateStoreBaseImpl.java:74)
>         at 
> org.apache.hadoop.hdfs.server.federation.store.impl.MembershipStoreImpl.namenodeHeartbeat(MembershipStoreImpl.java:179)
>         at 
> org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.registerNamenode(MembershipNamenodeResolver.java:381)
>         at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:317)
>         at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.lambda$periodicInvoke$0(NamenodeHeartbeatService.java:244)
> ...
> ... {code}
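
The requested change boils down to passing the caught exception to the logger. A minimal sketch, with illustrative class and method names rather than the actual NamenodeHeartbeatService code:

{code:java}
// Illustrative only: pass the throwable so the full stacktrace is emitted at ERROR.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class HeartbeatRegistration {
  private static final Logger LOG =
      LoggerFactory.getLogger(HeartbeatRegistration.class);

  void register(Runnable registerCall) {
    try {
      registerCall.run();
    } catch (RuntimeException e) {
      // Before: LOG.info("Cannot register namenode in the State Store");
      // After: include the exception so the stacktrace reaches the log.
      LOG.error("Cannot register namenode in the State Store", e);
    }
  }
}
{code}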



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16947) RBF NamenodeHeartbeatService to report error for not being able to register namenode in state store

2023-03-15 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16947:
---
Fix Version/s: 3.4.0

> RBF NamenodeHeartbeatService to report error for not being able to register 
> namenode in state store
> ---
>
> Key: HDFS-16947
> URL: https://issues.apache.org/jira/browse/HDFS-16947
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The namenode heartbeat service should log an error with the full stacktrace if it 
> cannot register the namenode in the state store. As of today, we only log an info 
> msg.
> For the ZooKeeper based impl, this might mean either a) the curator manager is not 
> initialized or b) writing to the znode fails after exhausting retries. In 
> either case, an INFO-level log might not be good enough and 
> we might have to look for errors elsewhere.
>  
> Sample example:
> {code:java}
> 2023-02-20 23:10:33,714 DEBUG [NamenodeHeartbeatService {ns} nn0-0] 
> router.NamenodeHeartbeatService - Received service state: ACTIVE from HA 
> namenode: {ns}-nn0:nn-0-{ns}.{cluster}:9000
> 2023-02-20 23:10:33,731 INFO  [NamenodeHeartbeatService {ns} nn0-0] 
> impl.MembershipStoreImpl - Inserting new NN registration: 
> nn-0.namenode.{cluster}:->{ns}:nn0:nn-0-{ns}.{cluster}:9000-ACTIVE
> 2023-02-20 23:10:33,731 INFO  [NamenodeHeartbeatService {ns} nn0-0] 
> router.NamenodeHeartbeatService - Cannot register namenode in the State Store
>  {code}
> If we could log full stacktrace:
> {code:java}
> 2023-02-21 00:20:24,691 ERROR [NamenodeHeartbeatService {ns} nn0-0] 
> router.NamenodeHeartbeatService - Cannot register namenode in the State Store
> org.apache.hadoop.hdfs.server.federation.store.StateStoreUnavailableException:
>  State Store driver StateStoreZooKeeperImpl in nn-0.namenode.{cluster} is not 
> ready.
>         at 
> org.apache.hadoop.hdfs.server.federation.store.driver.StateStoreDriver.verifyDriverReady(StateStoreDriver.java:158)
>         at 
> org.apache.hadoop.hdfs.server.federation.store.driver.impl.StateStoreZooKeeperImpl.putAll(StateStoreZooKeeperImpl.java:235)
>         at 
> org.apache.hadoop.hdfs.server.federation.store.driver.impl.StateStoreBaseImpl.put(StateStoreBaseImpl.java:74)
>         at 
> org.apache.hadoop.hdfs.server.federation.store.impl.MembershipStoreImpl.namenodeHeartbeat(MembershipStoreImpl.java:179)
>         at 
> org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.registerNamenode(MembershipNamenodeResolver.java:381)
>         at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:317)
>         at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.lambda$periodicInvoke$0(NamenodeHeartbeatService.java:244)
> ...
> ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15383) RBF: Disable watch in ZKDelegationSecretManager for performance

2023-01-23 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17679883#comment-17679883
 ] 

Wei-Chiu Chuang commented on HDFS-15383:


HADOOP-18519 backported HADOOP-17835 to branch-3.3. Update the fix version 
accordingly.

> RBF: Disable watch in ZKDelegationSecretManager for performance
> ---
>
> Key: HDFS-15383
> URL: https://issues.apache.org/jira/browse/HDFS-15383
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
>
> Based on the current design for delegation tokens in the secure Router, the total 
> number of watches for tokens is the product of the number of routers and the number 
> of tokens. This is because ZKDelegationTokenManager uses PathChildrenCache 
> from Curator, which automatically sets the watch, and ZK pushes the sync 
> information to each router. Evaluations show that a large number of 
> watches in ZooKeeper has a negative performance impact on the ZooKeeper server.
> In our practice, when the number of watches exceeds 1.2 million in a single ZK 
> server there is significant ZK performance degradation. Thus this ticket 
> is to rewrite ZKDelegationTokenManagerImpl.java to explicitly disable the 
> PathChildrenCache and have Routers sync periodically from ZooKeeper. This has 
> been working fine at the scale of 10 Routers with 2 million tokens. 
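
A rough sketch of the watch-free approach described above, assuming a started Curator client and an illustrative znode path (not the actual ZKDelegationTokenSecretManager layout): list the children on a schedule instead of registering a PathChildrenCache watch per token.

{code:java}
// Illustrative periodic polling in place of watch-based PathChildrenCache.
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;

class TokenPoller {
  private final CuratorFramework zk;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  TokenPoller(CuratorFramework zk) {
    this.zk = zk;
  }

  // Instead of one ZK watch per child, list the children on a fixed interval
  // and refresh the local token cache from the result.
  void start(String tokensPath, long intervalSeconds) {
    scheduler.scheduleWithFixedDelay(() -> {
      try {
        List<String> children = zk.getChildren().forPath(tokensPath);
        // refresh the local cache from 'children' here
      } catch (Exception e) {
        // log and retry on the next tick
      }
    }, 0, intervalSeconds, TimeUnit.SECONDS);
  }
}
{code}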



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15383) RBF: Disable watch in ZKDelegationSecretManager for performance

2023-01-23 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-15383:
---
Fix Version/s: 3.3.6

> RBF: Disable watch in ZKDelegationSecretManager for performance
> ---
>
> Key: HDFS-15383
> URL: https://issues.apache.org/jira/browse/HDFS-15383
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.6
>
>
> Based on the current design for delegation tokens in the secure Router, the total 
> number of watches for tokens is the product of the number of routers and the number 
> of tokens. This is because ZKDelegationTokenManager uses PathChildrenCache 
> from Curator, which automatically sets the watch, and ZK pushes the sync 
> information to each router. Evaluations show that a large number of 
> watches in ZooKeeper has a negative performance impact on the ZooKeeper server.
> In our practice, when the number of watches exceeds 1.2 million in a single ZK 
> server there is significant ZK performance degradation. Thus this ticket 
> is to rewrite ZKDelegationTokenManagerImpl.java to explicitly disable the 
> PathChildrenCache and have Routers sync periodically from ZooKeeper. This has 
> been working fine at the scale of 10 Routers with 2 million tokens. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16873) FileStatus compareTo does not specify ordering

2022-12-20 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16873.

Fix Version/s: 3.4.0
   Resolution: Fixed

> FileStatus compareTo does not specify ordering
> --
>
> Key: HDFS-16873
> URL: https://issues.apache.org/jira/browse/HDFS-16873
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: DDillon
>Assignee: DDillon
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The Javadoc of FileStatus does not specify the field and manner in which 
> objects are ordered. This is critical to understand in order to use the 
> Comparable interface without guessing. A quick inspection of the code 
> showed that ordering is by path name, but we shouldn't have to go 
> into the code to confirm such assumptions.
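
For illustration, since FileStatus implements Comparable and (per the inspection above) orders by path, sorting a listing sorts it by path name. A minimal sketch assuming a reachable default filesystem and an existing /tmp directory:

{code:java}
// Illustrative only: sort a directory listing via FileStatus's natural ordering.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class SortByPath {
  static FileStatus[] listSorted(FileSystem fs, Path dir) throws java.io.IOException {
    FileStatus[] statuses = fs.listStatus(dir);
    Arrays.sort(statuses);          // relies on FileStatus implementing Comparable
    return statuses;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus s : listSorted(fs, new Path("/tmp"))) {
      System.out.println(s.getPath());
    }
  }
}
{code}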



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16873) FileStatus compareTo does not specify ordering

2022-12-20 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-16873:
--

Assignee: DDillon

> FileStatus compareTo does not specify ordering
> --
>
> Key: HDFS-16873
> URL: https://issues.apache.org/jira/browse/HDFS-16873
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: DDillon
>Assignee: DDillon
>Priority: Trivial
>
> The Javadoc of FileStatus does not specify the field and manner in which 
> objects are ordered. This is critical to understand in order to use the 
> Comparable interface without guessing. A quick inspection of the code 
> showed that ordering is by path name, but we shouldn't have to go 
> into the code to confirm such assumptions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16871) DiskBalancer process may throws IllegalArgumentException when the target DataNode has capital letter in hostname

2022-12-20 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16871.

Fix Version/s: 3.4.0
   Resolution: Fixed

> DiskBalancer process may throws IllegalArgumentException when the target 
> DataNode has capital letter in hostname
> 
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Daniel Ma
>Assignee: Daniel Ma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> The DiskBalancer process reads the DataNode hostname in lowercase letters,
>  !screenshot-1.png! 
>  but there is no letter-case transform in getNodeByName.
>  !screenshot-2.png! 
> For a DataNode with a lowercase hostname, everything is ok.
> But for a DataNode with an uppercase hostname, when the Balancer process tries to 
> migrate onto it, an IllegalArgumentException is thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: 
> java.lang.IllegalArgumentException: Unable to find the specified node. 
> node-group-1YlRf0002
> {code}
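
A minimal sketch of the kind of fix implied here, with hypothetical map and method names rather than the actual DiskBalancer code: normalize hostnames to lower case on both insert and lookup so mixed-case hostnames still resolve.

{code:java}
// Illustrative case-insensitive lookup; names are hypothetical.
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class NodeLookup {
  private final Map<String, String> nodesByHost = new HashMap<>();

  void addNode(String hostname, String datanodeUuid) {
    nodesByHost.put(hostname.toLowerCase(Locale.ROOT), datanodeUuid);
  }

  String getNodeByName(String hostname) {
    // Without this normalization, "node-group-1YlRf0002" misses the entry
    // stored under its lower-cased form and the caller fails the lookup.
    return nodesByHost.get(hostname.toLowerCase(Locale.ROOT));
  }
}
{code}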



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16854) TestDFSIO to support non-default file system

2022-12-02 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642641#comment-17642641
 ] 

Wei-Chiu Chuang commented on HDFS-16854:


[~liuml07] thanks for the tip.
I still find it occasionally confusing because not all Hadoop commands use 
ToolRunner to print the generic command help.

MAPREDUCE-7398 looks hacky to me but I'll give it a try.

> TestDFSIO to support non-default file system
> 
>
> Key: HDFS-16854
> URL: https://issues.apache.org/jira/browse/HDFS-16854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>
> TestDFSIO expects a parameter {{-Dtest.build.data=}} which is where the data 
> is located. Only paths on the default file system are supported. Running it 
> against other file systems, such as Ozone, throws an exception.
> It can be worked around by specifying {{-Dfs.defaultFS=}} but it would be 
> even nicer to support non-default file systems out of the box, because no one 
> would know this trick unless she looks at the code.
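
The workaround can also be expressed in code. A sketch assuming TestDFSIO from the jobclient tests jar is on the classpath; the Ozone URI, data path, and benchmark flags are placeholders written from memory, so check them against your Hadoop version.

{code:java}
// Illustrative only: override fs.defaultFS before handing control to TestDFSIO.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.TestDFSIO;
import org.apache.hadoop.util.ToolRunner;

public class RunDfsioOnOzone {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "ofs://ozone-service/");        // the workaround
    conf.set("test.build.data", "/benchmarks/TestDFSIO");    // where the data lives
    int rc = ToolRunner.run(conf, new TestDFSIO(),
        new String[] {"-write", "-nrFiles", "4", "-size", "128MB"});
    System.exit(rc);
  }
}
{code}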



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16854) TestDFSIO to support non-default file system

2022-12-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16854.

Resolution: Duplicate

> TestDFSIO to support non-default file system
> 
>
> Key: HDFS-16854
> URL: https://issues.apache.org/jira/browse/HDFS-16854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>
> TestDFSIO expects a parameter {{-Dtest.build.data=}} which is where the data 
> is located. Only paths on the default file system are supported. Running it 
> against other file systems, such as Ozone, throws an exception.
> It can be worked around by specifying {{-Dfs.defaultFS=}} but it would be 
> even nicer to support non-default file systems out of the box, because no one 
> would know this trick unless she looks at the code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16839) It should consider EC reconstruction work when we determine if a node is busy

2022-11-29 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16839:
---
Fix Version/s: 3.3.9

> It should consider EC reconstruction work when we determine if a node is busy
> -
>
> Key: HDFS-16839
> URL: https://issues.apache.org/jira/browse/HDFS-16839
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kidd5368
>Assignee: Kidd5368
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> In chooseSourceDatanodes(), I think it's more reasonable to take EC 
> reconstruction work into consideration when we determine whether a node is busy or 
> not.
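
A toy illustration of the proposed busy check, with hypothetical counter names standing in for the DatanodeDescriptor fields (the actual patch may differ): count EC reconstruction tasks alongside replication tasks before comparing against the limit.

{code:java}
// Illustrative only; parameter names are hypothetical stand-ins.
class BusyCheck {
  static boolean isBusy(int pendingReplication, int pendingEcReconstruction,
                        int replicationStreamsHardLimit) {
    // Previously only pendingReplication was compared against the limit;
    // EC reconstruction tasks also consume the node's bandwidth and threads.
    return pendingReplication + pendingEcReconstruction > replicationStreamsHardLimit;
  }
}
{code}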



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16839) It should consider EC reconstruction work when we determine if a node is busy

2022-11-29 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16839.

Resolution: Fixed

> It should consider EC reconstruction work when we determine if a node is busy
> -
>
> Key: HDFS-16839
> URL: https://issues.apache.org/jira/browse/HDFS-16839
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kidd5368
>Assignee: Kidd5368
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> In chooseSourceDatanodes(), I think it's more reasonable to take EC 
> reconstruction work into consideration when we determine whether a node is busy or 
> not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16839) It should consider EC reconstruction work when we determine if a node is busy

2022-11-29 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16839:
---
Fix Version/s: 3.4.0

> It should consider EC reconstruction work when we determine if a node is busy
> -
>
> Key: HDFS-16839
> URL: https://issues.apache.org/jira/browse/HDFS-16839
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kidd5368
>Assignee: Kidd5368
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In chooseSourceDatanodes(), I think it's more reasonable to take EC 
> reconstruction work into consideration when we determine whether a node is busy or 
> not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16854) TestDFSIO to support non-default file system

2022-11-23 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16854:
---
Description: 
TestDFSIO expects a parameter {{-Dtest.build.data=}} which is where the data is 
located. Only paths on the default file system is supported. Running t against 
other file systems, such as Ozone, throws an exception.

It can be worked around by specifying {{-Dfs.defaultFS=}} but it would be even 
nicer to support non-default file systems out of box, because no one would know 
this trick unless she looks at the code.

  was:
TestDFSIO expects a parameter {{-Dtest.build.data=}} which is where the data is 
located. Only paths on the default file system is supported. Running t against 
other file systems, such as Ozone, throws an exception.

It can be worked around by specifying {{-Dfs.defaultFS=}} but it would be even 
nice to support non-default file systems out of box, because no one would know 
this trick unless she looks at the code.


> TestDFSIO to support non-default file system
> 
>
> Key: HDFS-16854
> URL: https://issues.apache.org/jira/browse/HDFS-16854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>
> TestDFSIO expects a parameter {{-Dtest.build.data=}} which is where the data 
> is located. Only paths on the default file system is supported. Running t 
> against other file systems, such as Ozone, throws an exception.
> It can be worked around by specifying {{-Dfs.defaultFS=}} but it would be 
> even nicer to support non-default file systems out of box, because no one 
> would know this trick unless she looks at the code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16854) TestDFSIO to support non-default file system

2022-11-23 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16854:
---
Description: 
TestDFSIO expects a parameter {{-Dtest.build.data=}} which is where the data is 
located. Only paths on the default file system is supported. Running t against 
other file systems, such as Ozone, throws an exception.

It can be worked around by specifying {{-Dfs.defaultFS=}} but it would be even 
nice to support non-default file systems out of box, because no one would know 
this trick unless she looks at the code.

  was:
TestDFSIO expects a parameter {{-Dtest.build.data=}} which is where the data is 
located. Only paths on the default file system is supported. Trying to run it 
against other file systems, such as Ozone, throws an exception.

It can be worked around by specifying {{-Dfs.defaultFS=}} but it would be even 
nice to support non-default file systems out of box, because no one would know 
this trick unless she looks at the code.


> TestDFSIO to support non-default file system
> 
>
> Key: HDFS-16854
> URL: https://issues.apache.org/jira/browse/HDFS-16854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>
> TestDFSIO expects a parameter {{-Dtest.build.data=}} which is where the data 
> is located. Only paths on the default file system is supported. Running t 
> against other file systems, such as Ozone, throws an exception.
> It can be worked around by specifying {{-Dfs.defaultFS=}} but it would be 
> even nice to support non-default file systems out of box, because no one 
> would know this trick unless she looks at the code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16854) TestDFSIO to support non-default file system

2022-11-23 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-16854:
--

Assignee: Wei-Chiu Chuang

> TestDFSIO to support non-default file system
> 
>
> Key: HDFS-16854
> URL: https://issues.apache.org/jira/browse/HDFS-16854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>
> TestDFSIO expects a parameter {{-Dtest.build.data=}} which is where the data 
> is located. Only paths on the default file system are supported. Trying to run 
> it against other file systems, such as Ozone, throws an exception.
> It can be worked around by specifying {{-Dfs.defaultFS=}} but it would be 
> even nicer to support non-default file systems out of the box, because no one 
> would know this trick unless she looks at the code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16854) TestDFSIO to support non-default file system

2022-11-23 Thread Wei-Chiu Chuang (Jira)
Wei-Chiu Chuang created HDFS-16854:
--

 Summary: TestDFSIO to support non-default file system
 Key: HDFS-16854
 URL: https://issues.apache.org/jira/browse/HDFS-16854
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Wei-Chiu Chuang


TestDFSIO expects a parameter {{-Dtest.build.data=}} which is where the data is 
located. Only paths on the default file system is supported. Trying to run it 
against other file systems, such as Ozone, throws an exception.

It can be worked around by specifying {{-Dfs.defaultFS=}} but it would be even 
nice to support non-default file systems out of box, because no one would know 
this trick unless she looks at the code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-9536) OOM errors during parallel upgrade to Block-ID based layout

2022-10-29 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-9536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-9536.
---
Resolution: Duplicate

I believe this is no longer an issue after HDFS-15937 and HDFS-15610.

> OOM errors during parallel upgrade to Block-ID based layout
> ---
>
> Key: HDFS-9536
> URL: https://issues.apache.org/jira/browse/HDFS-9536
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
>Priority: Major
>
> This is a follow-up jira for the OOM errors observed during parallel upgrade 
> to Block-ID based datanode layout using HDFS-8578 fix.
> more clue 
> [here|https://issues.apache.org/jira/browse/HDFS-8578?focusedCommentId=15042012=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15042012]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16815) Error occurred in processing CacheManagerSection for xml parsing fsimage

2022-10-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-16815:
--

Assignee: MeiJing

> Error occurred in processing CacheManagerSection for xml parsing fsimage
> 
>
> Key: HDFS-16815
> URL: https://issues.apache.org/jira/browse/HDFS-16815
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: MeiJing
>Assignee: MeiJing
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2022-10-24-10-34-07-288.png, 
> image-2022-10-24-10-43-57-652.png, image-2022-10-24-10-44-51-234.png
>
>
> To test the OfflineImageReconstructor, we run a test method. There are some 
> bugs in the processing done by CacheManagerSectionProcessor. 
>   !image-2022-10-24-10-34-07-288.png|width=536,height=582!
>                                                    Fig1 Test Method
> First, we write the test method as in Fig1 to test CacheManagerSectionProcessor, 
> but the method throws an exception that shows " found without 
> ". As Fig2 shows.
> !image-2022-10-24-10-43-57-652.png|width=528,height=53!
>                                                    Fig2  result of test 
> method 
> Second, we run the test method again after solving the first bug. There is 
> another bug that shows "Expected tag end event for CacheManagerSection, but 
> got tag end event for pool". As Fig3 shows.
> !image-2022-10-24-10-44-51-234.png!
>                                                       Fig3 second result of 
> test method



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16809) EC striped block is not sufficient when doing in maintenance

2022-10-25 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16809:
---
Component/s: erasure-coding
 (was: hdfs)

> EC striped block is not sufficient when doing in maintenance
> 
>
> Key: HDFS-16809
> URL: https://issues.apache.org/jira/browse/HDFS-16809
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding
>Reporter: dingshun
>Assignee: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When doing maintenance, ec striped block is not sufficient, which will lead 
> to miss block



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16809) EC striped block is not sufficient when doing in maintenance

2022-10-25 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-16809:
--

Assignee: dingshun

> EC striped block is not sufficient when doing in maintenance
> 
>
> Key: HDFS-16809
> URL: https://issues.apache.org/jira/browse/HDFS-16809
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Reporter: dingshun
>Assignee: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When doing maintenance, ec striped block is not sufficient, which will lead 
> to miss block



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16791) New API for enclosing root path for a file

2022-10-10 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17615237#comment-17615237
 ] 

Wei-Chiu Chuang commented on HDFS-16791:


Should probably move this to HADOOP. But I added you as an HDFS contributor 
regardless.

It would be great to have a specification/design doc attached to the jira, so 
that other FS implementors know what to expect.



> New API for enclosing root path for a file
> --
>
> Key: HDFS-16791
> URL: https://issues.apache.org/jira/browse/HDFS-16791
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Tom McCormick
>Assignee: Tom McCormick
>Priority: Major
>  Labels: pull-request-available
>
> At LinkedIn we run many HDFS volumes that are federated by either 
> ViewFilesystem or Router Based Federation. As our number of HDFS volumes 
> grows, we have a growing need to migrate data seamlessly across volumes.
> Many frameworks have a notion of staging or temp directories, but those 
> directories often live in random locations. We want an API getEnclosingRoot, 
> which provides the root path of a file or dataset. 
> In ViewFilesystem / Router Based Federation, the enclosingRoot will be the 
> mount point.
> We will also take into account other restrictions for renames, like 
> encryption zones.
> If there are several candidate paths (a mount point and an encryption zone), we will 
> return the longer path.
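
A sketch of the API shape being proposed, where the signature is an assumption based on this description rather than a committed interface:

{code:java}
// Illustrative interface; the real method may land on FileSystem directly.
import org.apache.hadoop.fs.Path;

interface EnclosingRootProvider {
  /**
   * Return the deepest enclosing root (a ViewFs/RBF mount point or an
   * encryption zone root) that contains the given path; when both apply,
   * the longer (deeper) path wins.
   */
  Path getEnclosingRoot(Path path) throws java.io.IOException;
}
{code}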



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16791) New API for enclosing root path for a file

2022-10-10 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-16791:
--

Assignee: Tom McCormick

> New API for enclosing root path for a file
> --
>
> Key: HDFS-16791
> URL: https://issues.apache.org/jira/browse/HDFS-16791
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Tom McCormick
>Assignee: Tom McCormick
>Priority: Major
>  Labels: pull-request-available
>
> At LinkedIn we run many HDFS volumes that are federated by either 
> ViewFilesystem or Router Based Federation. As our number of HDFS volumes 
> grows, we have a growing need to migrate data seamlessly across volumes.
> Many frameworks have a notion of staging or temp directories, but those 
> directories often live in random locations. We want an API getEnclosingRoot, 
> which provides the root path of a file or dataset. 
> In ViewFilesystem / Router Based Federation, the enclosingRoot will be the 
> mount point.
> We will also take into account other restrictions for renames, like 
> encryption zones.
> If there are several candidate paths (a mount point and an encryption zone), we will 
> return the longer path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16789) Update FileSystem class to read disk usage from actual usage value instead of file's length

2022-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-16789:
--

Assignee: Dave Teng

> Update FileSystem class to read disk usage from actual usage value instead of 
> file's length
> ---
>
> Key: HDFS-16789
> URL: https://issues.apache.org/jira/browse/HDFS-16789
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfs
>Reporter: Dave Teng
>Assignee: Dave Teng
>Priority: Major
>
> Currently the FileSystem class retrieves the disk usage value directly from the 
> [file's 
> length|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1895]
>  instead of the actual usage value. 
> We need to update FileSystem and related classes to read disk usage from the actual 
> usage value.
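
To make the distinction concrete, a small sketch (not the proposed patch) comparing a file's logical length with the space it actually consumes, which accounts for replication or EC overhead:

{code:java}
// Illustrative only: pass a path as the first argument.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UsageVsLength {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path(args[0]);
    long logicalLength = fs.getFileStatus(p).getLen();
    ContentSummary cs = fs.getContentSummary(p);
    System.out.println("length (bytes):         " + logicalLength);
    System.out.println("space consumed (bytes): " + cs.getSpaceConsumed());
  }
}
{code}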



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-4043) Namenode Kerberos Login does not use proper hostname for host qualified hdfs principal name.

2022-08-22 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-4043.
---
Resolution: Fixed

> Namenode Kerberos Login does not use proper hostname for host qualified hdfs 
> principal name.
> 
>
> Key: HDFS-4043
> URL: https://issues.apache.org/jira/browse/HDFS-4043
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 2.0.3-alpha, 
> 3.4.0, 3.3.9
> Environment: CDH4U1 on Ubuntu 12.04
>Reporter: Ahad Rana
>Assignee: Steve Vaughan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>   Original Estimate: 24h
>  Time Spent: 50m
>  Remaining Estimate: 23h 10m
>
> The Namenode uses the loginAsNameNodeUser method in NameNode.java to login 
> using the hdfs principal. This method in turn invokes SecurityUtil.login with 
> a hostname (last parameter) obtained via a call to InetAddress.getHostName. 
> This call does not always return the fully qualified host name, and thus 
> causes the namenode login to fail due to Kerberos's inability to find a 
> matching hdfs principal in the hdfs.keytab file. Instead it should use 
> InetAddress.getCanonicalHostName. This is consistent with what is used 
> internally by SecurityUtil.java to login in other services, such as the 
> DataNode. 
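
A small demo of the difference driving this bug; the output depends on local DNS and /etc/hosts, so treat it as illustrative only.

{code:java}
// getHostName() may return a short name, while getCanonicalHostName() resolves
// the fully qualified name that a host-qualified principal such as
// hdfs/host.example.com@REALM requires.
import java.net.InetAddress;

public class HostnameCheck {
  public static void main(String[] args) throws Exception {
    InetAddress addr = InetAddress.getLocalHost();
    System.out.println("getHostName:          " + addr.getHostName());
    System.out.println("getCanonicalHostName: " + addr.getCanonicalHostName());
  }
}
{code}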



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16676) DatanodeAdminManager$Monitor reports a node as invalid continuously

2022-08-17 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16676:
---
Fix Version/s: 3.4.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> DatanodeAdminManager$Monitor reports a node as invalid continuously
> ---
>
> Key: HDFS-16676
> URL: https://issues.apache.org/jira/browse/HDFS-16676
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.1
>Reporter: Prabhu Joseph
>Assignee: groot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> DatanodeAdminManager$Monitor reports a node as invalid continuously
> {code}
> 2022-07-21 06:54:38,562 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
> (DatanodeAdminMonitor-0): DatanodeAdminMonitor caught exception when 
> processing node 1.2.3.4:9866.
> java.lang.IllegalStateException: Node 1.2.3.4:9866 is in an invalid state! 
> Invalid state: In Service 0 blocks are on this dn.
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:172)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:601)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:504)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> A node goes into an invalid state when stopDecommission sets the node to 
> In Service but fails to remove it from the pendingNodes queue (HDFS-16675). This 
> is corrected only when the user triggers startDecommission. Until then, we 
> need not keep the invalid-state node in the queue, as startDecommission will 
> add it back anyway.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16730) Update the doc that append to EC files is supported

2022-08-16 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16730:
---
Fix Version/s: 3.4.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Update the doc that append to EC files is supported
> ---
>
> Key: HDFS-16730
> URL: https://issues.apache.org/jira/browse/HDFS-16730
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Wei-Chiu Chuang
>Assignee: groot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Our doc has a statement regarding EC limitations:
> https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations
> {noformat}
> append() and truncate() on an erasure coded file will throw IOException.
> {noformat}
> In fact, HDFS-7663 added this support in Hadoop 3.3.0. The caveat is that it 
> only supports "Append to a closed striped file, with NEW_BLOCK flag enabled".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16730) Update the doc that append to EC files is supported

2022-08-16 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580442#comment-17580442
 ] 

Wei-Chiu Chuang commented on HDFS-16730:


Perfect. Thanks.

> Update the doc that append to EC files is supported
> ---
>
> Key: HDFS-16730
> URL: https://issues.apache.org/jira/browse/HDFS-16730
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Wei-Chiu Chuang
>Assignee: groot
>Priority: Major
>
> Our doc has a statement regarding EC limitations:
> https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations
> {noformat}
> append() and truncate() on an erasure coded file will throw IOException.
> {noformat}
> In fact, HDFS-7663 added this support in Hadoop 3.3.0. The caveat is that it 
> only supports "Append to a closed striped file, with NEW_BLOCK flag enabled".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16730) Update the doc that append to EC files is supported

2022-08-16 Thread Wei-Chiu Chuang (Jira)
Wei-Chiu Chuang created HDFS-16730:
--

 Summary: Update the doc that append to EC files is supported
 Key: HDFS-16730
 URL: https://issues.apache.org/jira/browse/HDFS-16730
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Wei-Chiu Chuang


Our doc has a statement regarding EC limitations:
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations

{noformat}
append() and truncate() on an erasure coded file will throw IOException.

{noformat}

In fact, HDFS-7663 added this support in Hadoop 3.3.0. The caveat is that it 
only supports "Append to a closed striped file, with NEW_BLOCK flag enabled".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-6874) Add GETFILEBLOCKLOCATIONS operation to HttpFS

2022-08-16 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580417#comment-17580417
 ] 

Wei-Chiu Chuang commented on HDFS-6874:
---

Feel free to take this over.

> Add GETFILEBLOCKLOCATIONS operation to HttpFS
> -
>
> Key: HDFS-6874
> URL: https://issues.apache.org/jira/browse/HDFS-6874
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: httpfs
>Affects Versions: 2.4.1, 2.7.3
>Reporter: Gao Zhong Liang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: BB2015-05-TBR, pull-request-available
> Attachments: HDFS-6874-1.patch, HDFS-6874-branch-2.6.0.patch, 
> HDFS-6874.011.patch, HDFS-6874.02.patch, HDFS-6874.03.patch, 
> HDFS-6874.04.patch, HDFS-6874.05.patch, HDFS-6874.06.patch, 
> HDFS-6874.07.patch, HDFS-6874.08.patch, HDFS-6874.09.patch, 
> HDFS-6874.10.patch, HDFS-6874.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The GETFILEBLOCKLOCATIONS operation, which WebHDFS already supports, is missing 
> in HttpFS. For a GETFILEBLOCKLOCATIONS request, 
> org.apache.hadoop.fs.http.server.HttpFSServer currently returns BAD_REQUEST:
> ...
>  case GETFILEBLOCKLOCATIONS: {
> response = Response.status(Response.Status.BAD_REQUEST).build();
> break;
>   }
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16727) Consider reading chunk files using MappedByteBuffer

2022-08-10 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16727:
---
Attachment: Screen Shot 2022-08-04 at 7.12.55 AM.png

> Consider reading chunk files using MappedByteBuffer
> ---
>
> Key: HDFS-16727
> URL: https://issues.apache.org/jira/browse/HDFS-16727
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Priority: Major
> Attachments: Screen Shot 2022-08-04 at 7.12.55 AM.png, 
> ozone_dn-rhel03.ozone.cisco.local.html
>
>
> We are running Impala TPC-DS, which stresses the Ozone DN read path.
> BufferUtils#assignByteBuffers stands out as one of the offenders.
> We can experiment with MappedByteBuffer and see whether it improves 
> performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16727) Consider reading chunk files using MappedByteBuffer

2022-08-10 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16727:
---
Attachment: ozone_dn-rhel03.ozone.cisco.local.html

> Consider reading chunk files using MappedByteBuffer
> ---
>
> Key: HDFS-16727
> URL: https://issues.apache.org/jira/browse/HDFS-16727
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Priority: Major
> Attachments: Screen Shot 2022-08-04 at 7.12.55 AM.png, 
> ozone_dn-rhel03.ozone.cisco.local.html
>
>
> We are running Impala TPC-DS, which stresses the Ozone DN read path.
> BufferUtils#assignByteBuffers stands out as one of the offenders.
> We can experiment with MappedByteBuffer and see whether it improves 
> performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16727) Consider reading chunk files using MappedByteBuffer

2022-08-10 Thread Wei-Chiu Chuang (Jira)
Wei-Chiu Chuang created HDFS-16727:
--

 Summary: Consider reading chunk files using MappedByteBuffer
 Key: HDFS-16727
 URL: https://issues.apache.org/jira/browse/HDFS-16727
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Wei-Chiu Chuang
 Attachments: Screen Shot 2022-08-04 at 7.12.55 AM.png, 
ozone_dn-rhel03.ozone.cisco.local.html

We are running Impala TPC-DS, which stresses the Ozone DN read path.

BufferUtils#assignByteBuffers stands out as one of the offenders.

We can experiment with MappedByteBuffer and see whether it improves performance.
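A minimal sketch of the idea (the chunk file path is an assumption; this is not the Ozone read path itself):

{code:java}
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedChunkRead {
  public static void main(String[] args) throws Exception {
    try (RandomAccessFile raf = new RandomAccessFile("/data/hdds/chunk_1", "r");
         FileChannel ch = raf.getChannel()) {
      // Map (part of) the chunk file read-only: the OS pages data in on demand, so no
      // explicit read() copy into a freshly allocated byte[] is needed per request.
      long mapLen = Math.min(ch.size(), 1L << 20);
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, mapLen);
      byte[] head = new byte[(int) Math.min(1024, mapLen)];
      buf.get(head);
      System.out.println("read " + head.length + " bytes via mmap");
    }
  }
}
{code}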




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-2139) Fast copy for HDFS.

2022-08-10 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578063#comment-17578063
 ] 

Wei-Chiu Chuang commented on HDFS-2139:
---

Sounds great!
Some questions I have, as I wasn't involved in this from the beginning: how is 
this different from other similar features, e.g. HDFS-3370 and HDFS-15294 
(federation rename/balance)?

A small design doc describing your enhancements is greatly appreciated.

> Fast copy for HDFS.
> ---
>
> Key: HDFS-2139
> URL: https://issues.apache.org/jira/browse/HDFS-2139
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Pritam Damania
>Assignee: Rituraj
>Priority: Major
> Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, 
> HDFS-2139.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> There is a need to perform fast file copy on HDFS. The fast copy mechanism 
> for a file works as follows:
> 1) Query metadata for all blocks of the source file.
> 2) For each block 'b' of the file, find out its datanode locations.
> 3) For each block of the file, add an empty block to the namesystem for
> the destination file.
> 4) For each location of the block, instruct the datanode to make a local
> copy of that block.
> 5) Once each datanode has copied over its respective blocks, they
> report to the namenode about it.
> 6) Wait for all blocks to be copied and exit.
> This would speed up the copying process considerably by removing top-of-rack 
> data transfers.
> Note: a further improvement would be to instruct the datanode to create a 
> hardlink of the block file if we are copying a block on the same datanode; 
> a hedged sketch of this flow follows below.
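> A hedged pseudo-flow of the steps above; every helper here is a hypothetical 
> placeholder, not a real HDFS API, so this only mirrors the control flow:
> {code:java}
> import java.util.Collections;
> import java.util.List;
> 
> public class FastCopySketch {
>   // Placeholder type for the result of steps 1 and 2 (hypothetical).
>   static class BlockInfo {
>     long blockId;
>     List<String> datanodes;
>   }
> 
>   // Hypothetical stand-ins for the NameNode/DataNode calls the steps describe.
>   List<BlockInfo> queryBlocksAndLocations(String src) { return Collections.emptyList(); } // steps 1-2
>   void addEmptyBlock(String dst, long blockId) {}                                         // step 3
>   void copyBlockLocally(String datanode, long blockId, String dst) {}                     // step 4 (or hardlink)
>   void waitForDatanodeReports(String dst) {}                                              // steps 5-6
> 
>   void fastCopy(String src, String dst) {
>     for (BlockInfo b : queryBlocksAndLocations(src)) {
>       addEmptyBlock(dst, b.blockId);
>       for (String dn : b.datanodes) {
>         copyBlockLocally(dn, b.blockId, dst);  // stays on the node, no top-of-rack transfer
>       }
>     }
>     waitForDatanodeReports(dst);
>   }
> }
> {code}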



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16645) Multi inProgress segments caused "Invalid log manifest"

2022-07-27 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572211#comment-17572211
 ] 

Wei-Chiu Chuang commented on HDFS-16645:


Yeah... we don't have JNS in place, so that's probably why there's a gap. 

> Multi inProgress segments caused "Invalid log manifest"
> ---
>
> Key: HDFS-16645
> URL: https://issues.apache.org/jira/browse/HDFS-16645
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> {code:java}
> java.lang.IllegalStateException: Invalid log manifest (log [1-? 
> (in-progress)] overlaps [6-? (in-progress)])[[6-? (in-progress)], [1-? 
> (in-progress)]] CommittedTxId: 0 
> at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.checkState(RemoteEditLogManifest.java:62)
>   at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.<init>(RemoteEditLogManifest.java:46)
>   at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:740)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16619) Fix HttpHeaders.Values And HttpHeaders.Names Deprecated Import.

2022-07-27 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16619.

Resolution: Fixed

> Fix HttpHeaders.Values And HttpHeaders.Names Deprecated Import.
> ---
>
> Key: HDFS-16619
> URL: https://issues.apache.org/jira/browse/HDFS-16619
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: Fix HttpHeaders.Values And HttpHeaders.Names 
> Deprecated.png
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> HttpHeaders.Values and HttpHeaders.Names are deprecated; use HttpHeaderValues 
> and HttpHeaderNames instead.
> HttpHeaders.Names
> Deprecated. 
> Use HttpHeaderNames instead. Standard HTTP header names.
> {code:java}
> /** @deprecated */
> @Deprecated
> public static final class Names {
>   public static final String ACCEPT = "Accept";
>   public static final String ACCEPT_CHARSET = "Accept-Charset";
>   public static final String ACCEPT_ENCODING = "Accept-Encoding";
>   public static final String ACCEPT_LANGUAGE = "Accept-Language";
>   public static final String ACCEPT_RANGES = "Accept-Ranges";
>   public static final String ACCEPT_PATCH = "Accept-Patch";
>   public static final String ACCESS_CONTROL_ALLOW_CREDENTIALS = 
> "Access-Control-Allow-Credentials";
>   public static final String ACCESS_CONTROL_ALLOW_HEADERS = 
> "Access-Control-Allow-Headers"; {code}
> HttpHeaders.Values
> Deprecated. 
> Use HttpHeaderValues instead. Standard HTTP header values.
> {code:java}
> /** @deprecated */
> @Deprecated
> public static final class Values {
>   public static final String APPLICATION_JSON = "application/json";
>   public static final String APPLICATION_X_WWW_FORM_URLENCODED = 
> "application/x-www-form-urlencoded";
>   public static final String BASE64 = "base64";
>   public static final String BINARY = "binary";
>   public static final String BOUNDARY = "boundary";
>   public static final String BYTES = "bytes";
>   public static final String CHARSET = "charset";
>   public static final String CHUNKED = "chunked";
>   public static final String CLOSE = "close"; {code}
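> A small sketch of the replacement usage, assuming Netty 4.1 on the classpath 
> (illustrative only, not taken from the patch):
> {code:java}
> import io.netty.handler.codec.http.DefaultFullHttpResponse;
> import io.netty.handler.codec.http.HttpHeaderNames;
> import io.netty.handler.codec.http.HttpHeaderValues;
> import io.netty.handler.codec.http.HttpResponseStatus;
> import io.netty.handler.codec.http.HttpVersion;
> 
> public class HeaderConstants {
>   public static void main(String[] args) {
>     DefaultFullHttpResponse resp =
>         new DefaultFullHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.OK);
>     // Before: resp.headers().set(HttpHeaders.Names.CONTENT_TYPE, HttpHeaders.Values.APPLICATION_JSON);
>     resp.headers().set(HttpHeaderNames.CONTENT_TYPE, HttpHeaderValues.APPLICATION_JSON);
>     System.out.println(resp.headers());
>   }
> }
> {code}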



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16645) Multi inProgress segments caused "Invalid log manifest"

2022-07-27 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572129#comment-17572129
 ] 

Wei-Chiu Chuang commented on HDFS-16645:


For the scenario we observed, the older in-progress edit log files were valid 
edit log files, except that they were not finalized. So if the JN simply 
deleted or ignored the older in-progress files, the JN would abort due to the 
gap in transaction IDs.

Example:
edits_1000_1005
edits_inprogress_1006
edits_1010_1015
edits_inprogress_1016

> Multi inProgress segments caused "Invalid log manifest"
> ---
>
> Key: HDFS-16645
> URL: https://issues.apache.org/jira/browse/HDFS-16645
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> {code:java}
> java.lang.IllegalStateException: Invalid log manifest (log [1-? 
> (in-progress)] overlaps [6-? (in-progress)])[[6-? (in-progress)], [1-? 
> (in-progress)]] CommittedTxId: 0 
> at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.checkState(RemoteEditLogManifest.java:62)
>   at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.<init>(RemoteEditLogManifest.java:46)
>   at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:740)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16645) Multi inProgress segments caused "Invalid log manifest"

2022-07-26 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571592#comment-17571592
 ] 

Wei-Chiu Chuang commented on HDFS-16645:


Rather than ignoring the older in-progress edit log files, my question is: how 
did we end up with multiple of them in the first place? The in-progress segment 
was supposed to be finalized properly, so apparently there's a bug somewhere.

> Multi inProgress segments caused "Invalid log manifest"
> ---
>
> Key: HDFS-16645
> URL: https://issues.apache.org/jira/browse/HDFS-16645
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> {code:java}
> java.lang.IllegalStateException: Invalid log manifest (log [1-? 
> (in-progress)] overlaps [6-? (in-progress)])[[6-? (in-progress)], [1-? 
> (in-progress)]] CommittedTxId: 0 
> at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.checkState(RemoteEditLogManifest.java:62)
>   at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.<init>(RemoteEditLogManifest.java:46)
>   at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:740)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16645) Multi inProgress segments caused "Invalid log manifest"

2022-07-26 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571587#comment-17571587
 ] 

Wei-Chiu Chuang edited comment on HDFS-16645 at 7/26/22 7:38 PM:
-

[~smeng] and I encountered a similar problem.

For the context: we started to notice that JN directory can have multiple 
inprogress files after enabling multiple standby namenodes in a Hadoop 3.1 
based cluster. But we suspect it is an old bug not really caused by multiple 
standby NN.


was (Author: jojochuang):
For the context: we started to notice that JN directory can have multiple 
inprogress files after enabling multiple standby namenodes in a Hadoop 3.1 
based cluster. But we suspect it is an old bug not really caused by multiple 
standby NN.

> Multi inProgress segments caused "Invalid log manifest"
> ---
>
> Key: HDFS-16645
> URL: https://issues.apache.org/jira/browse/HDFS-16645
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> {code:java}
> java.lang.IllegalStateException: Invalid log manifest (log [1-? 
> (in-progress)] overlaps [6-? (in-progress)])[[6-? (in-progress)], [1-? 
> (in-progress)]] CommittedTxId: 0 
> at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.checkState(RemoteEditLogManifest.java:62)
>   at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.<init>(RemoteEditLogManifest.java:46)
>   at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:740)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16645) Multi inProgress segments caused "Invalid log manifest"

2022-07-26 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571587#comment-17571587
 ] 

Wei-Chiu Chuang edited comment on HDFS-16645 at 7/26/22 7:38 PM:
-

[~smeng] and I encountered a similar problem.

For the context: we started to notice that JN directory can have multiple 
inprogress files after enabling multiple standby namenodes in a Hadoop 3.1 
based cluster. And it crashed the new SbNN. But we suspect it is an old bug not 
really caused by multiple standby NN.


was (Author: jojochuang):
[~smeng] and I encountered a similar problem.

For the context: we started to notice that JN directory can have multiple 
inprogress files after enabling multiple standby namenodes in a Hadoop 3.1 
based cluster. But we suspect it is an old bug not really caused by multiple 
standby NN.

> Multi inProgress segments caused "Invalid log manifest"
> ---
>
> Key: HDFS-16645
> URL: https://issues.apache.org/jira/browse/HDFS-16645
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> {code:java}
> java.lang.IllegalStateException: Invalid log manifest (log [1-? 
> (in-progress)] overlaps [6-? (in-progress)])[[6-? (in-progress)], [1-? 
> (in-progress)]] CommittedTxId: 0 
> at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.checkState(RemoteEditLogManifest.java:62)
>   at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.<init>(RemoteEditLogManifest.java:46)
>   at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:740)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16645) Multi inProgress segments caused "Invalid log manifest"

2022-07-26 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571587#comment-17571587
 ] 

Wei-Chiu Chuang commented on HDFS-16645:


For the context: we started to notice that JN directory can have multiple 
inprogress files after enabling multiple standby namenodes in a Hadoop 3.1 
based cluster. But we suspect it is an old bug not really caused by multiple 
standby NN.

> Multi inProgress segments caused "Invalid log manifest"
> ---
>
> Key: HDFS-16645
> URL: https://issues.apache.org/jira/browse/HDFS-16645
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> {code:java}
> java.lang.IllegalStateException: Invalid log manifest (log [1-? 
> (in-progress)] overlaps [6-? (in-progress)])[[6-? (in-progress)], [1-? 
> (in-progress)]] CommittedTxId: 0 
> at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.checkState(RemoteEditLogManifest.java:62)
>   at 
> org.apache.hadoop.hdfs.server.protocol.RemoteEditLogManifest.<init>(RemoteEditLogManifest.java:46)
>   at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:740)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16675) DatanodeAdminManager.stopDecommission fails with ConcurrentModificationException

2022-07-25 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571193#comment-17571193
 ] 

Wei-Chiu Chuang commented on HDFS-16675:


[~sodonnell]
Is this fixed by your HDFS-16583? They look similar but different.

> DatanodeAdminManager.stopDecommission fails with 
> ConcurrentModificationException
> 
>
> Key: HDFS-16675
> URL: https://issues.apache.org/jira/browse/HDFS-16675
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namanode
>Affects Versions: 3.2.1
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> DatanodeAdminManager.stopDecommission intermittently fails with 
> ConcurrentModificationException. 
> {code}
> java.util.ConcurrentModificationException: Tree has been modified outside of 
> iterator
> at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet$TreeSetIterator.checkForModification(FoldedTreeSet.java:311)
> at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet$TreeSetIterator.hasNext(FoldedTreeSet.java:256)
> at 
> org.apache.hadoop.hdfs.util.FoldedTreeSet$TreeSetIterator.next(FoldedTreeSet.java:262)
> at 
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1044)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processExtraRedundancyBlocksOnInService(BlockManager.java:4300)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager.stopDecommission(DatanodeAdminManager.java:241)
> {code}
> This is intermittently seen on a busy cluster with AutoScale enabled. This 
> leaves a node with state "In Service" in the pendingNodes queue:
> {code}
> 2022-07-21 06:54:38,562 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager 
> (DatanodeAdminMonitor-0): DatanodeAdminMonitor caught exception when 
> processing node 1.2.3.4:9866.
> java.lang.IllegalStateException: Node 1.2.3.4:9866 is in an invalid state! 
> Invalid state: In Service 0 blocks are on this dn.
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:172)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:601)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:504)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16533) COMPOSITE_CRC failed between replicated file and striped file due to invalid requested length

2022-07-25 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16533:
---
Fix Version/s: 3.4.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> COMPOSITE_CRC failed between replicated file and striped file due to invalid 
> requested length
> -
>
> Key: HDFS-16533
> URL: https://issues.apache.org/jira/browse/HDFS-16533
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, hdfs-client
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> After testing COMPOSITE_CRC with some random lengths between a replicated 
> file and a striped file that contains the same data, it failed. 
> Steps to reproduce:
> {code:java}
> @Test(timeout = 9)
> public void testStripedAndReplicatedFileChecksum2() throws Exception {
>   int abnormalSize = (dataBlocks * 2 - 2) * blockSize +
>   (int) (blockSize * 0.5);
>   prepareTestFiles(abnormalSize, new String[] {stripedFile1, replicatedFile});
>   int loopNumber = 100;
>   while (loopNumber-- > 0) {
> int verifyLength = ThreadLocalRandom.current()
> .nextInt(10, abnormalSize);
> FileChecksum stripedFileChecksum1 = getFileChecksum(stripedFile1,
> verifyLength, false);
> FileChecksum replicatedFileChecksum = getFileChecksum(replicatedFile,
> verifyLength, false);
> if (checksumCombineMode.equals(ChecksumCombineMode.COMPOSITE_CRC.name())) 
> {
>   Assert.assertEquals(stripedFileChecksum1, replicatedFileChecksum);
> } else {
>   Assert.assertNotEquals(stripedFileChecksum1, replicatedFileChecksum);
> }
>   }
> } {code}
> After tracing the root cause: `FileChecksumHelper#makeCompositeCrcResult` 
> may compute an incorrect `consumedLastBlockLength` when updating the checksum 
> for the last block of the requested length, which may not be the last block in the file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16533) COMPOSITE_CRC failed between replicated file and striped file due to invalid requested length

2022-07-25 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16533:
---
Summary: COMPOSITE_CRC failed between replicated file and striped file due 
to invalid requested length  (was: COMPOSITE_CRC failed between replicated file 
and striped file.)

> COMPOSITE_CRC failed between replicated file and striped file due to invalid 
> requested length
> -
>
> Key: HDFS-16533
> URL: https://issues.apache.org/jira/browse/HDFS-16533
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, hdfs-client
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> After testing COMPOSITE_CRC with some random lengths between a replicated 
> file and a striped file that contains the same data, it failed. 
> Steps to reproduce:
> {code:java}
> @Test(timeout = 9)
> public void testStripedAndReplicatedFileChecksum2() throws Exception {
>   int abnormalSize = (dataBlocks * 2 - 2) * blockSize +
>   (int) (blockSize * 0.5);
>   prepareTestFiles(abnormalSize, new String[] {stripedFile1, replicatedFile});
>   int loopNumber = 100;
>   while (loopNumber-- > 0) {
> int verifyLength = ThreadLocalRandom.current()
> .nextInt(10, abnormalSize);
> FileChecksum stripedFileChecksum1 = getFileChecksum(stripedFile1,
> verifyLength, false);
> FileChecksum replicatedFileChecksum = getFileChecksum(replicatedFile,
> verifyLength, false);
> if (checksumCombineMode.equals(ChecksumCombineMode.COMPOSITE_CRC.name())) 
> {
>   Assert.assertEquals(stripedFileChecksum1, replicatedFileChecksum);
> } else {
>   Assert.assertNotEquals(stripedFileChecksum1, replicatedFileChecksum);
> }
>   }
> } {code}
> After tracing the root cause: `FileChecksumHelper#makeCompositeCrcResult` 
> may compute an incorrect `consumedLastBlockLength` when updating the checksum 
> for the last block of the requested length, which may not be the last block in the file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16566) Erasure Coding: Recovery may causes excess replicas when busy DN exsits

2022-07-15 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDFS-16566:
--

Assignee: Ruinan Gu

> Erasure Coding: Recovery may causes excess replicas when busy DN exsits
> ---
>
> Key: HDFS-16566
> URL: https://issues.apache.org/jira/browse/HDFS-16566
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding
>Affects Versions: 3.3.2
>Reporter: Ruinan Gu
>Assignee: Ruinan Gu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Simple case:
> RS-3-2, internal blocks [0 (busy), 2, 3, 4] (1 missing); 0 is busy.
> We can get liveblockIndice=[2,3,4] and additionalRepl=1, so the DN will get the 
> LiveBitSet=[2,3,4] and targets.length=1.
> According to StripedWriter.initTargetIndices(), 0 will get recovered instead 
> of 1, so the internal blocks will become [0 (busy), 2, 3, 4, 0' (excess)]. Although 
> the NN will detect and delete the excess replica and recover the missing block (1) 
> correctly after the wrong recovery of 0', this process is not expected; the 
> recovery of 0' is obviously wrong and not necessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16566) Erasure Coding: Recovery may causes excess replicas when busy DN exsits

2022-07-15 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16566:
---
Component/s: ec
 erasure-coding

> Erasure Coding: Recovery may causes excess replicas when busy DN exsits
> ---
>
> Key: HDFS-16566
> URL: https://issues.apache.org/jira/browse/HDFS-16566
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding
>Affects Versions: 3.3.2
>Reporter: Ruinan Gu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Simple case:
> RS-3-2, internal blocks [0 (busy), 2, 3, 4] (1 missing); 0 is busy.
> We can get liveblockIndice=[2,3,4] and additionalRepl=1, so the DN will get the 
> LiveBitSet=[2,3,4] and targets.length=1.
> According to StripedWriter.initTargetIndices(), 0 will get recovered instead 
> of 1, so the internal blocks will become [0 (busy), 2, 3, 4, 0' (excess)]. Although 
> the NN will detect and delete the excess replica and recover the missing block (1) 
> correctly after the wrong recovery of 0', this process is not expected; the 
> recovery of 0' is obviously wrong and not necessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16566) Erasure Coding: Recovery may causes excess replicas when busy DN exsits

2022-07-15 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16566:
---
Fix Version/s: 3.4.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Erasure Coding: Recovery may causes excess replicas when busy DN exsits
> ---
>
> Key: HDFS-16566
> URL: https://issues.apache.org/jira/browse/HDFS-16566
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.2
>Reporter: Ruinan Gu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Simple case:
> RS-3-2, internal blocks [0 (busy), 2, 3, 4] (1 missing); 0 is busy.
> We can get liveblockIndice=[2,3,4] and additionalRepl=1, so the DN will get the 
> LiveBitSet=[2,3,4] and targets.length=1.
> According to StripedWriter.initTargetIndices(), 0 will get recovered instead 
> of 1, so the internal blocks will become [0 (busy), 2, 3, 4, 0' (excess)]. Although 
> the NN will detect and delete the excess replica and recover the missing block (1) 
> correctly after the wrong recovery of 0', this process is not expected; the 
> recovery of 0' is obviously wrong and not necessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-07-11 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565256#comment-17565256
 ] 

Wei-Chiu Chuang commented on HDFS-16565:


For future reference, I highly recommend using this tool to detect file 
descriptor/socket leaks: https://file-leak-detector.kohsuke.org/
Or, more generally, use Byteman or BTrace.

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ec
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The connections in CLOSE_WAIT state have reached 17k and are still growing. 
> Viewing these CLOSE_WAITs with the lsof command gives the following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the reason for this phenomenon is that Socket#close() is 
> not called correctly when the DataNode interacts with other nodes as a client.
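> For illustration only (not the DataNode code), the try-with-resources shape that 
> guarantees a client-side socket is closed on every path:
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
> import java.net.InetSocketAddress;
> import java.net.Socket;
> 
> public class SafeSocketRead {
>   // Hypothetical helper: reads a few bytes from a peer and always closes the socket,
>   // so the peer's FIN never leaves this side parked in CLOSE_WAIT.
>   static int readSome(InetSocketAddress peer) throws IOException {
>     try (Socket s = new Socket()) {            // closed even if connect/read throws
>       s.connect(peer, 5000);
>       try (InputStream in = s.getInputStream()) {
>         byte[] buf = new byte[1024];
>         return in.read(buf);
>       }
>     }
>   }
> }
> {code}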



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16642) [SBN read] Moving selecting inputstream from JN in EditlogTailer out of FSNLock

2022-06-29 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16642:
---
Summary: [SBN read] Moving selecting inputstream from JN in EditlogTailer 
out of FSNLock  (was: [HDFS] Moving selecting inputstream from JN in 
EditlogTailer out of FSNLock)

> [SBN read] Moving selecting inputstream from JN in EditlogTailer out of 
> FSNLock
> ---
>
> Key: HDFS-16642
> URL: https://issues.apache.org/jira/browse/HDFS-16642
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The EditLogTailer spends a long time selecting input streams from the 
> JournalNodes while holding the FSN write lock.
> During this period, the RPC (port 8020) handlers of the Observer NameNode are 
> blocked by the FSN lock.
> In theory, selecting input streams from the JournalNodes does not change any 
> in-memory state of the NameNode, so we can move the selection out of the FSN 
> lock, which can improve the throughput of the Observer NameNode.
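> A generic sketch of the pattern being proposed (the names are illustrative 
> placeholders, not the actual EditLogTailer code): do the slow JournalNode call 
> outside the lock, then take the write lock only to apply the result:
> {code:java}
> import java.util.concurrent.locks.ReentrantReadWriteLock;
> 
> public class TailOutsideLock {
>   private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();
> 
>   // Hypothetical stand-ins for selecting input streams and applying edits.
>   private Object selectStreamsFromJournalNodes() { return new Object(); }  // slow RPC, no NN state touched
>   private void applyEdits(Object streams) {}                               // mutates NN state
> 
>   public void tailEdits() {
>     Object streams = selectStreamsFromJournalNodes();  // outside the lock: handlers keep running
>     fsnLock.writeLock().lock();                        // lock held only while applying
>     try {
>       applyEdits(streams);
>     } finally {
>       fsnLock.writeLock().unlock();
>     }
>   }
> }
> {code}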



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks

2022-06-16 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16613:
---
Fix Version/s: 3.4.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> EC: Improve performance of decommissioning dn with many ec blocks
> -
>
> Key: HDFS-16613
> URL: https://issues.apache.org/jira/browse/HDFS-16613
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ec, erasure-coding, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2022-06-07-11-46-42-389.png, 
> image-2022-06-07-17-42-16-075.png, image-2022-06-07-17-45-45-316.png, 
> image-2022-06-07-17-51-04-876.png, image-2022-06-07-17-55-40-203.png, 
> image-2022-06-08-11-38-29-664.png, image-2022-06-08-11-41-11-127.png
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> In an HDFS cluster with a lot of EC blocks, decommissioning a DN is very slow. 
> The reason is that, unlike replicated blocks, which can be copied from any DN 
> holding the same replica, the EC blocks have to be copied from the 
> decommissioning DN itself.
> The configurations dfs.namenode.replication.max-streams and 
> dfs.namenode.replication.max-streams-hard-limit limit the replication 
> speed, but increasing them creates risk for the whole cluster's network. So we 
> should add a new configuration limiting the decommissioning DN, distinguished 
> from the cluster-wide max-streams limit.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Moved] (HDFS-16632) java.io.IOException: Version Mismatch (Expected: 28, Received: 520 )

2022-06-15 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang moved HADOOP-18204 to HDFS-16632:
-

  Component/s: hdfs++
   native
   (was: hdfs-client)
  Key: HDFS-16632  (was: HADOOP-18204)
 Target Version/s:   (was: 3.3.2)
Affects Version/s: 3.3.2
   3.2.3
   3.3.1
   3.3.0
   (was: 3.3.0)
   (was: 3.3.1)
   (was: 3.2.3)
   (was: 3.3.2)
  Project: Hadoop HDFS  (was: Hadoop Common)

> java.io.IOException: Version Mismatch (Expected: 28, Received: 520 )
> 
>
> Key: HDFS-16632
> URL: https://issues.apache.org/jira/browse/HDFS-16632
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs++, native
>Affects Versions: 3.3.2, 3.2.3, 3.3.1, 3.3.0
> Environment: NAME="openEuler"
> VERSION="20.03 (LTS-SP1)"
> ID="openEuler"
> VERSION_ID="20.03"
>Reporter: cnnc
>Priority: Major
>  Labels: libhdfs, libhdfscpp, pull-request-available
> Attachments: test1M.cpp
>
>   Original Estimate: 504h
>  Time Spent: 10m
>  Remaining Estimate: 503h 50m
>
> Reading 1 MB from HDFS with libhdfspp causes the DataNode server to report an 
> error like:
>  
> 2022-04-12 20:10:21,872 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: server1:9866:DataXceiver 
> error processing READ_BLOCK operation  src: /90.90.43.114:47956 dst: 
> /90.90.43.114:9866
> java.io.IOException: Version Mismatch (Expected: 28, Received: 520 )
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:74)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:269)
>         at java.lang.Thread.run(Thread.java:748)
> 2022-04-12 20:13:27,615 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: server1:9866:DataXceiver 
> error processing READ_BLOCK operation  src: /90.90.43.114:48142 dst: 
> /90.90.43.114:9866
> java.io.IOException: Version Mismatch (Expected: 28, Received: 520 )
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:74)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:269)
>         at java.lang.Thread.run(Thread.java:748)
>  
>  
> (1) Compile test1M.cpp using the command:
> g++ test1M.cpp -lprotobuf -lhdfspp_static -lhdfs -lpthread -lsasl2 -lcrypto 
> -llz4 -I./ -L./ -o test1M
>  
> (2) Execute test1M with:
> ./test1M  REAL_HDFS_FILE_PATH
>  
> (3) Check the Hadoop logs and you will find error info like:
> java.io.IOException: Version Mismatch (Expected: 28, Received: 520 )
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:74)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:269)
>         at java.lang.Thread.run(Thread.java:748)
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16595) Slow peer metrics - add median, mad and upper latency limits

2022-06-06 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16595:
---
Fix Version/s: 3.3.4

> Slow peer metrics - add median, mad and upper latency limits
> 
>
> Key: HDFS-16595
> URL: https://issues.apache.org/jira/browse/HDFS-16595
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Slow datanode metrics include the slow node and its reporting node details. With 
> HDFS-16582, we added the aggregate latency that is perceived by the reporting 
> nodes.
> To get more insight into how the outlier slow node's latencies 
> differ from the rest of the nodes, we should also expose the median, the median 
> absolute deviation, and the calculated upper latency limit.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-06-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16456:
---
Fix Version/s: 3.2.4

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> In the scenario below, decommission will fail with the TOO_MANY_NODES_ON_RACK reason:
>  # Enable an EC policy, such as RS-6-3-1024k.
>  # The number of racks in the cluster is equal to or less than the replication 
> number (9).
>  # A rack has only one DN; decommission this DN.
> The root cause is in the 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, which 
> gives a limit parameter maxNodesPerRack for choosing targets. In this scenario, 
> maxNodesPerRack is 1, which means only one datanode can be chosen 
> per rack.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> The line int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> is evaluated with totalNumOfReplicas=9 and numOfRacks=9, so with integer 
> division maxNodesPerRack = (9 - 1) / 9 + 1 = 1.
> When we decommission a DN that is the only node in its rack, the 
> chooseOnce() call in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() 
> will throw NotEnoughReplicasException, but the exception is not caught, so it 
> fails to fall back to the chooseEvenlyFromRemainingRacks() function.
> During decommission, after choosing targets, the verifyBlockPlacement() function 
> returns a total rack count that includes the invalid rack, so 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, 
> which also causes the decommission to fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> According to the above description, we should make the following modifications to fix it:
>  # In startDecommission() or stopDecommission(), we should also update 
> numOfRacks in the NetworkTopology class. Otherwise choosing targets may fail because 
> maxNodesPerRack is too small, and even if choosing targets succeeds, 
> isPlacementPolicySatisfied() will still return false and cause the decommission to fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first 
> chooseOnce() call should also be wrapped in try..catch, or it will not 
> fall back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
>  # In verifyBlockPlacement(), we need to remove invalid racks from the total 
> numOfRacks, or isPlacementPolicySatisfied() will return false and cause 
> data reconstruction to fail.
>  
>  
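> To make the integer rounding explicit, a tiny stand-alone illustration of the 
> getMaxNodesPerRack() calculation above (values taken from this scenario):
> {code:java}
> public class MaxNodesPerRackDemo {
>   public static void main(String[] args) {
>     int totalNumOfReplicas = 9;  // RS-6-3: 6 data + 3 parity internal blocks
>     int numOfRacks = 9;          // as many racks as replicas
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;  // (9 - 1) / 9 + 1 = 1
>     System.out.println(maxNodesPerRack);  // prints 1: each rack may hold only one chosen target
>   }
> }
> {code}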



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16595) Slow peer metrics - add median, mad and upper latency limits

2022-06-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16595.

Fix Version/s: 3.4.0
   Resolution: Fixed

> Slow peer metrics - add median, mad and upper latency limits
> 
>
> Key: HDFS-16595
> URL: https://issues.apache.org/jira/browse/HDFS-16595
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Slow datanode metrics include the slow node and its reporting node details. With 
> HDFS-16582, we added the aggregate latency that is perceived by the reporting 
> nodes.
> To get more insight into how the outlier slow node's latencies 
> differ from the rest of the nodes, we should also expose the median, the median 
> absolute deviation, and the calculated upper latency limit.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop

2022-05-31 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16583.

Resolution: Fixed

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -
>
> Key: HDFS-16583
> URL: https://issues.apache.org/jira/browse/HDFS-16583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got 
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process 
> Thread Dump: jsp requested
> 
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock 
> held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>   Number of suppressed write-lock reports: 0
>   Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>   it = new CyclicIteration<>(outOfServiceNodeBlocks,
>   iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>   .isRunning()) {
> numNodesChecked++;
> final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
> entry = it.next();
> final DatanodeDescriptor dn = entry.getKey();
> AbstractList blocks = entry.getValue();
> boolean fullScan = false;
> if (dn.isMaintenance() && 

[jira] [Updated] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop

2022-05-31 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16583:
---
Fix Version/s: 3.2.4

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -
>
> Key: HDFS-16583
> URL: https://issues.apache.org/jira/browse/HDFS-16583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got 
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process 
> Thread Dump: jsp requested
> 
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock 
> held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>   Number of suppressed write-lock reports: 0
>   Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>       it = new CyclicIteration<>(outOfServiceNodeBlocks,
>           iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>       .isRunning()) {
>     numNodesChecked++;
>     final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
>         entry = it.next();
>     final DatanodeDescriptor dn = entry.getKey();
>     AbstractList<BlockInfo> blocks = entry.getValue();
>     boolean fullScan = false;
>     if (dn.isMaintenance() && 

[jira] [Resolved] (HDFS-16603) Improve DatanodeHttpServer With Netty recommended method

2022-05-31 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16603.

Resolution: Fixed

> Improve DatanodeHttpServer With Netty recommended method
> 
>
> Key: HDFS-16603
> URL: https://issues.apache.org/jira/browse/HDFS-16603
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When reading the code, I found that some of the methods in use are deprecated 
> due to the upgrade of the Netty components.
> {color:#172b4d}*1.DatanodeHttpServer#Constructor*{color}
> {code:java}
> @Deprecated
> public static final ChannelOption<Integer> WRITE_BUFFER_HIGH_WATER_MARK = 
> valueOf("WRITE_BUFFER_HIGH_WATER_MARK"); 
> Deprecated. Use WRITE_BUFFER_WATER_MARK
> @Deprecated
> public static final ChannelOption<Integer> WRITE_BUFFER_LOW_WATER_MARK = 
> valueOf("WRITE_BUFFER_LOW_WATER_MARK");
> Deprecated. Use WRITE_BUFFER_WATER_MARK
> -
> this.httpServer.childOption(
>           ChannelOption.WRITE_BUFFER_HIGH_WATER_MARK,
>           conf.getInt(
>               DFSConfigKeys.DFS_WEBHDFS_NETTY_HIGH_WATERMARK,
>               DFSConfigKeys.DFS_WEBHDFS_NETTY_HIGH_WATERMARK_DEFAULT));
> this.httpServer.childOption(
>           ChannelOption.WRITE_BUFFER_LOW_WATER_MARK,
>           conf.getInt(
>               DFSConfigKeys.DFS_WEBHDFS_NETTY_LOW_WATERMARK,
>               DFSConfigKeys.DFS_WEBHDFS_NETTY_LOW_WATERMARK_DEFAULT));
> {code}
> *2.Duplicate code* 
> {code:java}
> ChannelFuture f = httpServer.bind(infoAddr);
> try {
>  f.syncUninterruptibly();
> } catch (Throwable e) {
>   if (e instanceof BindException) {
>    throw NetUtils.wrapException(null, 0, infoAddr.getHostName(),
>    infoAddr.getPort(), (SocketException) e);
>  } else {
>    throw e;
>  }
> }
> httpAddress = (InetSocketAddress) f.channel().localAddress();
> LOG.info("Listening HTTP traffic on " + httpAddress);{code}
> *3.io.netty.bootstrap.ChannelFactory Deprecated*
> *use io.netty.channel.ChannelFactory instead.*
> {code:java}
> /** @deprecated */
> @Deprecated
> public interface ChannelFactory<T extends Channel> {
>     T newChannel();
> }{code}
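
For reference, the modern replacement for the deprecated per-mark options in item 1 is the single WRITE_BUFFER_WATER_MARK child option. A minimal, self-contained sketch of that usage (illustrative values and class name only, not the actual DatanodeHttpServer patch):

{code:java}
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class WaterMarkSketch {
  public static void main(String[] args) {
    // Illustrative values standing in for the DFS_WEBHDFS_NETTY_*_WATERMARK settings.
    int low = 32 * 1024;
    int high = 64 * 1024;
    ServerBootstrap httpServer = new ServerBootstrap()
        .group(new NioEventLoopGroup(1), new NioEventLoopGroup())
        .channel(NioServerSocketChannel.class)
        // One WriteBufferWaterMark replaces the two deprecated per-mark child options.
        .childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
            new WriteBufferWaterMark(low, high));
    System.out.println(httpServer);
  }
}
{code}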



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16585) Add @VisibleForTesting in Dispatcher.java after HDFS-16268

2022-05-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16585.

Fix Version/s: 3.4.0
   3.2.4
   3.3.4
   Resolution: Fixed

> Add @VisibleForTesting in Dispatcher.java after HDFS-16268
> --
>
> Key: HDFS-16585
> URL: https://issues.apache.org/jira/browse/HDFS-16585
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wei-Chiu Chuang
>Assignee: groot
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The scope of a few methods was opened up by HDFS-16268 to facilitate unit 
> testing. We should annotate them with {{@VisibleForTesting}} so that they 
> don't get used by production code.
> The affected methods include:
> PendingMove
> markMovedIfGoodBlock
> isGoodBlockCandidate



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop

2022-05-26 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542619#comment-17542619
 ] 

Wei-Chiu Chuang commented on HDFS-16583:


The PR was merged in trunk and cherry-picked into branch-3.3.

We refactored the code in 3.3.x, so the commit does not apply to 3.2 and below. 
If needed, please open a new PR to backport.

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -
>
> Key: HDFS-16583
> URL: https://issues.apache.org/jira/browse/HDFS-16583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got 
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process 
> Thread Dump: jsp requested
> 
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock 
> held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>   Number of suppressed write-lock reports: 0
>   Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>       it = new CyclicIteration<>(outOfServiceNodeBlocks,
>           iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>       .isRunning()) {
>     numNodesChecked++;
>     final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
>         entry = 

[jira] [Updated] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop

2022-05-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16583:
---
Fix Version/s: 3.3.4

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -
>
> Key: HDFS-16583
> URL: https://issues.apache.org/jira/browse/HDFS-16583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got 
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process 
> Thread Dump: jsp requested
> 
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock 
> held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>   Number of suppressed write-lock reports: 0
>   Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>       it = new CyclicIteration<>(outOfServiceNodeBlocks,
>           iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>       .isRunning()) {
>     numNodesChecked++;
>     final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
>         entry = it.next();
>     final DatanodeDescriptor dn = entry.getKey();
>     AbstractList<BlockInfo> blocks = entry.getValue();
>     boolean fullScan = false;
>     if (dn.isMaintenance() && 

[jira] [Updated] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop

2022-05-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16583:
---
Fix Version/s: 3.4.0

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -
>
> Key: HDFS-16583
> URL: https://issues.apache.org/jira/browse/HDFS-16583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got 
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process 
> Thread Dump: jsp requested
> 
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock 
> held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>   Number of suppressed write-lock reports: 0
>   Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>       it = new CyclicIteration<>(outOfServiceNodeBlocks,
>           iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>       .isRunning()) {
>     numNodesChecked++;
>     final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
>         entry = it.next();
>     final DatanodeDescriptor dn = entry.getKey();
>     AbstractList<BlockInfo> blocks = entry.getValue();
>     boolean fullScan = false;
>     if (dn.isMaintenance() && 

[jira] [Resolved] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop

2022-05-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-16583.

Resolution: Fixed

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -
>
> Key: HDFS-16583
> URL: https://issues.apache.org/jira/browse/HDFS-16583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got 
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process 
> Thread Dump: jsp requested
> 
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock 
> held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>   Number of suppressed write-lock reports: 0
>   Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>       it = new CyclicIteration<>(outOfServiceNodeBlocks,
>           iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>       .isRunning()) {
>     numNodesChecked++;
>     final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
>         entry = it.next();
>     final DatanodeDescriptor dn = entry.getKey();
>     AbstractList<BlockInfo> blocks = entry.getValue();
>     boolean fullScan = false;
>     if (dn.isMaintenance() && 

[jira] [Updated] (HDFS-16456) EC: Decommission a rack with only one dn will fail when the rack number is equal with replication

2022-05-26 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-16456:
---
Fix Version/s: 3.3.4

> EC: Decommission a rack with only one dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> In the scenario below, decommission will fail with the TOO_MANY_NODES_ON_RACK reason:
>  # Enable an EC policy, such as RS-6-3-1024k.
>  # The number of racks in this cluster is equal to or less than the replication 
> number (9).
>  # A rack has only one DN, and we decommission this DN.
> The root cause is in the 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function: it 
> computes a limit, maxNodesPerRack, for choosing targets. In this scenario, 
> maxNodesPerRack is 1, which means at most one datanode can be chosen per 
> rack.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> Here, int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1 is 
> evaluated with totalNumOfReplicas = 9 and numOfRacks = 9, as illustrated below.
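
A quick worked example of that formula with the numbers from this scenario (illustration only, not code from the patch):

{code:java}
public class MaxNodesPerRackExample {
  public static void main(String[] args) {
    // RS-6-3 => 6 data + 3 parity = 9 internal blocks, spread over 9 racks.
    int totalNumOfReplicas = 9;
    int numOfRacks = 9;
    // Same rounding-up formula as getMaxNodesPerRack() above: (9 - 1) / 9 + 1 = 0 + 1 = 1.
    int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
    System.out.println("maxNodesPerRack = " + maxNodesPerRack);  // prints 1
    // With only one node allowed per rack, decommissioning the sole datanode of a
    // rack leaves no rack that is still allowed to take the displaced internal block.
  }
}
{code}
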
> When we decommission a dn that is the only node in its rack, the 
> chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() 
> will throw NotEnoughReplicasException, but the exception is not caught, so it 
> fails to fall back to the chooseEvenlyFromRemainingRacks() function.
> During decommission, after choosing targets, the verifyBlockPlacement() function 
> returns a total rack count that includes the invalid rack, so 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, 
> which also causes the decommission to fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
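
Plugging plausible numbers for this scenario into that predicate (assumed values, not taken from a real run) shows why it fails while the topology still reports the stale rack count:

{code:java}
public class PlacementStatusExample {
  public static void main(String[] args) {
    // Assumed values: the chosen targets can span at most 8 racks once the
    // decommissioning rack is unusable, but NetworkTopology still counts 9 racks.
    int requiredRacks = 9;
    int currentRacks = 8;
    int totalRacks = 9;
    boolean satisfied = requiredRacks <= currentRacks || currentRacks >= totalRacks;
    System.out.println("isPlacementPolicySatisfied = " + satisfied);  // prints false
  }
}
{code}
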
> According to the above description, we should make the changes below to fix it:
>  # In startDecommission() or stopDecommission(), we should also update the 
> numOfRacks in class NetworkTopology. Otherwise choosing targets may fail because 
> maxNodesPerRack is too small, and even if choosing targets succeeds, 
> isPlacementPolicySatisfied will still return false and cause the decommission to fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first 
> chooseOnce() call should also be put in a try..catch.., or it will not 
> fall back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
>  # In verifyBlockPlacement, we need to remove invalid racks from the total 
> numOfRacks, or isPlacementPolicySatisfied() will return false and cause data 
> reconstruction to fail.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542319#comment-17542319
 ] 

Wei-Chiu Chuang commented on HDFS-16594:


Stephen is very experienced in this area of the code, so I'd take heed of his 
suggestions :) 

There were a few changes made to improve this area. 2.9.2 is old and I am 
not sure if it has all the updates we made over the last 2-3 years.



> Many RpcCalls are blocked for a while while Decommission works
> --
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When there are some DataNodes that need to go offline, Decommission starts to 
> work, and periodically checks the number of blocks remaining to be processed. 
> By default, when it has checked more than 
> 500,000 (${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
> DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
> If the number of blocks to be checked is very large, for example, the number 
> of replicas managed by the DataNode reaches 900,000 or even 1,000,000, then during 
> this period the DatanodeAdminDefaultMonitor will continue to hold the 
> FSNamesystemLock, which will block a lot of RpcCalls. Here are some logs:
>  !image-2022-05-26-02-05-38-878.png! 
> It can be seen that in the last inspection process, there were more than 
> 1,000,000 blocks.
> When the check is over, FSNamesystemLock is released and RpcCall starts 
> working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
> from client Call#5571549 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
> from client Call#36795561 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
> from client Call#5497586 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
> from client Call#6043903 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#274471 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#73375524 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:34241
> '
> Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
> be longer than usual. A very large number of RpcCalls were affected during 
> this time.
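
One operational mitigation (a hedged sketch, not a recommendation from this ticket) is to lower the per-interval block budget so each monitor pass is shorter and the FSNamesystemLock is released more frequently. The property name is the one referenced above; the value here is illustrative only:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class DecommissionTuningSketch {
  public static void main(String[] args) {
    // Typically set in hdfs-site.xml; shown here via the Configuration API.
    Configuration conf = new Configuration();
    conf.setInt("dfs.namenode.decommission.blocks.per.interval", 100000);
    System.out.println(conf.get("dfs.namenode.decommission.blocks.per.interval"));
  }
}
{code}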



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16585) Add @VisibleForTesting in Dispatcher.java after HDFS-16268

2022-05-20 Thread Wei-Chiu Chuang (Jira)
Wei-Chiu Chuang created HDFS-16585:
--

 Summary: Add @VisibleForTesting in Dispatcher.java after HDFS-16268
 Key: HDFS-16585
 URL: https://issues.apache.org/jira/browse/HDFS-16585
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Wei-Chiu Chuang


The scope of a few methods was opened up by HDFS-16268 to facilitate unit 
testing. We should annotate them with {{@VisibleForTesting}} so that they don't 
get used by production code.

The affected methods include:
PendingMove
markMovedIfGoodBlock
isGoodBlockCandidate
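
A minimal sketch of the intended annotation, using Guava's @VisibleForTesting for illustration (the class and method below are simplified stand-ins, not the actual Dispatcher code):

{code:java}
import com.google.common.annotations.VisibleForTesting;

class DispatcherSketch {

  // Widened from private so unit tests can call it directly; the annotation
  // documents that production code should not start depending on it.
  @VisibleForTesting
  boolean isGoodBlockCandidate(long blockId) {
    return blockId > 0;  // placeholder logic for the sketch
  }
}
{code}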




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


