[jira] [Commented] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop
[ https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730260#comment-17730260 ]

ASF GitHub Bot commented on HDFS-16583:
---------------------------------------

jojochuang commented on PR #4332:
URL: https://github.com/apache/hadoop/pull/4332#issuecomment-1581395454

   @Kidd53685368 not sure I understand... could you elaborate a bit more? Does the PR not solve the issue, or does it cause regressions?

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -------------------------------------------------------------
>
>                 Key: HDFS-16583
>                 URL: https://issues.apache.org/jira/browse/HDFS-16583
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.2.4, 3.3.5
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> Number of suppressed write-lock reports: 0
> Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
>     org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
>     org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
>     java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>       it = new CyclicIteration<>(outOfServiceNodeBlocks,
>           iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>       .isRunning()) {
>     numNodesChecked++;
>     final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
>         entry = it.next();
> {code}
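The check() loop quoted in the description can spin indefinitely because CyclicIteration wraps around the map instead of exhausting it, so hasNext() stays true while the map is non-empty; if the per-check block budget never advances and the namesystem keeps running, no exit condition ever fires. Below is a minimal, self-contained Java sketch of that failure shape. All names here (CyclicCheckDemo, wouldSpin, blocksChecked, SAFETY_CAP) are illustrative stand-ins, not Hadoop's code; the safety cap exists only so the demo itself terminates.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative model of the failure mode; names do not come from Hadoop.
public class CyclicCheckDemo {

    // A cyclic iterator: wraps around instead of exhausting, so
    // hasNext() never returns false while the collection is non-empty.
    static <T> Iterator<T> cyclic(List<T> items) {
        return new Iterator<T>() {
            private int i = 0;
            @Override public boolean hasNext() { return !items.isEmpty(); }
            @Override public T next() {
                T t = items.get(i);
                i = (i + 1) % items.size();
                return t;
            }
        };
    }

    // Returns true if the check-style loop had to be cut off by the
    // safety cap, i.e. it would otherwise have spun forever.
    static boolean wouldSpin() {
        List<String> nodes = new ArrayList<>(List.of("dn1", "dn2"));
        Iterator<String> it = cyclic(nodes);

        int blocksChecked = 0;              // per-check work budget
        final int numBlocksPerCheck = 1000; // budget limit
        boolean running = true;             // stands in for namesystem.isRunning()

        long iterations = 0;
        final long SAFETY_CAP = 1_000_000;  // demo-only guard against a real hang

        // Same three exit conditions as the quoted loop: iterator exhaustion,
        // block budget exceeded, or shutdown. If the nodes contribute zero
        // blocks, blocksChecked never advances and none of them fire.
        while (it.hasNext() && blocksChecked < numBlocksPerCheck && running) {
            it.next();
            // Imagine each node reports no blocks: nothing is ever added to
            // blocksChecked, so the budget check can never trip.
            if (++iterations >= SAFETY_CAP) {
                return true;                // would spin forever without the cap
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(wouldSpin() ? "stuck" : "finished");
    }
}
```

In the real monitor this loop runs while holding the FSNamesystem write lock, which is why a spin surfaces as the multi-hour write-lock hold in the log above rather than just a busy thread.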
[jira] [Commented] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop
[ https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730011#comment-17730011 ]

ASF GitHub Bot commented on HDFS-16583:
---------------------------------------

Kidd53685368 commented on PR #4332:
URL: https://github.com/apache/hadoop/pull/4332#issuecomment-1580240331

   It seems it won't keep holding the writeLock, because of the exceededNumBlocksPerCheck()?
[jira] [Commented] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop
[ https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542619#comment-17542619 ]

Wei-Chiu Chuang commented on HDFS-16583:
----------------------------------------

PR was merged in trunk and cherry-picked into branch-3.3. We refactored the code in 3.3.x and the commit does not apply in 3.2 and below. If needed, please open a new PR to backport.