[jira] [Commented] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop

2023-06-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730260#comment-17730260
 ] 

ASF GitHub Bot commented on HDFS-16583:
---

jojochuang commented on PR #4332:
URL: https://github.com/apache/hadoop/pull/4332#issuecomment-1581395454

   @Kidd53685368 Not sure I understand... could you elaborate a bit more? Does 
the PR not solve the issue, or does it cause regressions?




> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -
>
> Key: HDFS-16583
> URL: https://issues.apache.org/jira/browse/HDFS-16583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.5
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got 
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process 
> Thread Dump: jsp requested
> 
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock 
> held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>   Number of suppressed write-lock reports: 0
>   Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>       it = new CyclicIteration<>(outOfServiceNodeBlocks,
>           iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>   .isRunning()) {
> 
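
For reference, the quoted check() loop ends only when one of its three conditions flips: the iterator runs out, the per-check block budget is exceeded, or the namesystem stops running. The sketch below is a minimal, hypothetical Java model of that exit-condition structure (the names blocksPerCheck and running, and the Iterator<List<String>> type, are invented for illustration; this is not the Hadoop implementation). It only shows that the block budget can terminate a pass if the counter actually advances each iteration.

{code}
import java.util.Iterator;
import java.util.List;

// Minimal model of the loop shape quoted above. Hypothetical names throughout.
class MonitorLoopSketch {
  private final int blocksPerCheck = 500_000; // assumed per-check budget
  private int numBlocksChecked = 0;

  private boolean exceededNumBlocksPerCheck() {
    return numBlocksChecked >= blocksPerCheck;
  }

  void check(Iterator<List<String>> it, boolean running) {
    // The pass ends only when one of the three conditions flips. If the
    // iterator keeps yielding entries whose block lists are empty, the
    // budget never trips and the loop never exits on that condition.
    while (it.hasNext() && !exceededNumBlocksPerCheck() && running) {
      List<String> blocks = it.next();
      numBlocksChecked += blocks.size(); // an empty entry advances nothing
    }
  }
}
{code}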

[jira] [Commented] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop

2023-06-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730011#comment-17730011
 ] 

ASF GitHub Bot commented on HDFS-16583:
---

Kidd53685368 commented on PR #4332:
URL: https://github.com/apache/hadoop/pull/4332#issuecomment-1580240331

   It seems the monitor shouldn't keep holding the writeLock, because of the 
exceededNumBlocksPerCheck() check?
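
On the writeLock point: the stack trace in the description shows FSNamesystem.writeUnlock being invoked from DatanodeAdminManager$Monitor.run (line 496), which suggests the write lock is released only after check() returns. In that shape, exceededNumBlocksPerCheck() can bound the lock hold time only if check() itself terminates; if the loop inside check() never exits, the lock stays held, which would match the 17492614 ms hold interval in the log. A minimal, hypothetical sketch of that locking pattern (the class name, the use of ReentrantReadWriteLock, and the field names are illustrative assumptions, not the Hadoop code):

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the run()/check() locking pattern implied by the
// stack trace (writeUnlock called from Monitor.run).
class MonitorRunSketch implements Runnable {
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

  @Override
  public void run() {
    fsLock.writeLock().lock();
    try {
      check(); // the thread dump shows the monitor stuck inside here
    } finally {
      fsLock.writeLock().unlock(); // reached only once check() returns
    }
  }

  private void check() {
    // the per-check work budget lives in here; it bounds the lock hold time
    // only if this method actually returns
  }
}
{code}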




> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -
>
> Key: HDFS-16583
> URL: https://issues.apache.org/jira/browse/HDFS-16583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.5
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got 
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process 
> Thread Dump: jsp requested
> 
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock 
> held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>   Number of suppressed write-lock reports: 0
>   Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>       it = new CyclicIteration<>(outOfServiceNodeBlocks,
>           iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>   .isRunning()) {
> numNodesChecked++;
> final 

[jira] [Commented] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop

2022-05-26 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542619#comment-17542619
 ] 

Wei-Chiu Chuang commented on HDFS-16583:


The PR was merged into trunk and cherry-picked into branch-3.3.

We refactored the code in 3.3.x, and the commit does not apply to 3.2 and below. 
If needed, please open a new PR to backport.

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -
>
> Key: HDFS-16583
> URL: https://issues.apache.org/jira/browse/HDFS-16583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got 
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process 
> Thread Dump: jsp requested
> 
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock 
> held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>   Number of suppressed write-lock reports: 0
>   Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>       it = new CyclicIteration<>(outOfServiceNodeBlocks,
>           iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>   .isRunning()) {
> numNodesChecked++;
> final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
> entry =