[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044944#comment-17044944 ]

Hudson commented on HDFS-14861:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17993 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17993/])
HDFS-14861. Reset LowRedundancyBlocks Iterator periodically. Contributed (weichiu: rev 900430b9907b590ed2d73a0d68f079c7f4d754b1)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/LowRedundancyBlocks.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestLowRedundancyBlockQueues.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java

> Reset LowRedundancyBlocks Iterator periodically
> -----------------------------------------------
>
>                 Key: HDFS-14861
>                 URL: https://issues.apache.org/jira/browse/HDFS-14861
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: decommission
>             Fix For: 3.3.0, 3.1.4, 3.2.2
>
>         Attachments: HDFS-14861.001.patch, HDFS-14861.002.patch
>
>
> When the namenode needs to schedule blocks for reconstruction, the blocks are placed into the neededReconstruction object in the BlockManager. This is an instance of LowRedundancyBlocks, which maintains a list of priority queues where the blocks are held until they are scheduled for reconstruction / replication.
>
> Every 3 seconds, by default, a number of blocks are retrieved from LowRedundancyBlocks. The method LowRedundancyBlocks.chooseLowRedundancyBlocks() is used to retrieve the next set of blocks using a bookmarked iterator. Each call to this method moves the iterator forward. The number of blocks retrieved is governed by the formula:
>
> number_of_live_nodes * dfs.namenode.replication.work.multiplier.per.iteration (default 2)
>
> The namenode then attempts to schedule those blocks on datanodes, but each datanode has a limit on how many blocks can be queued against it (controlled by dfs.namenode.replication.max-streams), so not all of the retrieved blocks may be scheduled. There may also be other block-availability reasons why some blocks are not scheduled.
>
> As the iterator in chooseLowRedundancyBlocks() always moves forward, the blocks which were not scheduled are not retried until the end of the queue is reached and the iterator is reset.
>
> If the replication queue is very large (e.g. several nodes are being decommissioned) or if blocks are being continuously added to the replication queue (e.g. nodes decommissioning using the proposal in HDFS-14854), it may take a very long time for the iterator to be reset to the start.
>
> As a result, a few blocks belonging to a decommissioning or entering-maintenance node can get left behind, and it may take many hours or even days for them to be retried, which can stop decommission from completing.
>
> With this Jira, I would like to suggest we reset the iterator after a configurable number of calls to chooseLowRedundancyBlocks() so any left-behind blocks are retried.
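To put rough numbers on the description above, the following back-of-the-envelope sketch (plain Java, using hypothetical cluster figures that are not from the issue) shows how many redundancy-monitor iterations one full pass of the bookmarked iterator can take over a large queue:

{code:java}
// Back-of-the-envelope estimate of one full pass of the bookmarked iterator.
// The cluster size and queue length below are hypothetical; the 3-second
// interval and the work multiplier default come from the description above.
public class IteratorPassEstimate {
  public static void main(String[] args) {
    int liveNodes = 100;            // hypothetical number of live datanodes
    int workMultiplier = 2;         // dfs.namenode.replication.work.multiplier.per.iteration (default)
    int monitorIntervalSec = 3;     // redundancy monitor period (default, per the description)
    long queuedBlocks = 5_000_000L; // hypothetical replication backlog

    long blocksPerCall = (long) liveNodes * workMultiplier;            // 200 blocks per call
    long calls = (queuedBlocks + blocksPerCall - 1) / blocksPerCall;   // 25,000 calls
    long hours = calls * monitorIntervalSec / 3600;                    // ~20 hours before the bookmark wraps

    System.out.println(calls + " calls, roughly " + hours
        + " hours until skipped blocks are retried");
  }
}
{code}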
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039407#comment-17039407 ]

Wei-Chiu Chuang commented on HDFS-14861:
----------------------------------------

+1 The patch looks good and it has sat for quite a while. If there are no objections I'd like to merge the patch soon.
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992724#comment-16992724 ]

Stephen O'Donnell commented on HDFS-14861:
-------------------------------------------

I discussed this issue with Wei-Chiu offline and we had a couple of concerns:

1. If we are resetting the iterator periodically, and there are a lot of missing blocks, how will that affect things?
2. Is there a better way to detect that the iterator needs to be reset?

For (1) - missing / corrupt blocks go into the lowest priority queue, which is not accessed by the iterators in question here. The iterators are used in chooseLowRedundancyBlocks:

{code}
  synchronized List<List<BlockInfo>> chooseLowRedundancyBlocks(
      int blocksToProcess) {
    final List<List<BlockInfo>> blocksToReconstruct = new ArrayList<>(LEVEL);

    int count = 0;
    int priority = 0;
    for (; count < blocksToProcess && priority < LEVEL; priority++) {
      if (priority == QUEUE_WITH_CORRUPT_BLOCKS) {
        // do not choose corrupted blocks.
        continue;
      }

      // Go through all blocks that need reconstructions with current priority.
      // Set the iterator to the first unprocessed block at this priority level
      final Iterator<BlockInfo> i = priorityQueues.get(priority).getBookmark();
      ...
{code}

The corrupt / missing blocks will all be in QUEUE_WITH_CORRUPT_BLOCKS and hence are not processed by this method, so we don't need to worry about them with this change.

For (2) - it is difficult to come up with anything other than a time-based metric. The reason is that each queue is effectively a doubly linked list and the iterator bookmark just points to the next element. Given that element, we have no knowledge of how many blocks are behind that point, or ahead of it. Ideally we would reset the iterator once some threshold of blocks is sitting behind the pointer, as those are the blocks which got skipped for some reason. The only way to see how many blocks are behind is to read the list from the start until you encounter the same element the iterator returns, which would not be very efficient. The easy solution is to simply reset the iterator to the start after some amount of time, but it's hard to know what the best period of time would be.

It may be useful, in a separate Jira, to create a command to dump the contents of lowRedundancyBlocks, which could give some further insight into the queues, especially if decommission is stuck for seemingly no reason, and would also let us see how often this problem occurs on a real cluster.
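As a concrete illustration of the point above about reading the list from the start, here is a minimal sketch that uses a plain java.util.LinkedList as a stand-in for the real LightWeightLinkedSet (it is not the Hadoop class); it only shows why counting the skipped blocks behind the bookmark amounts to a full walk of the queue:

{code:java}
import java.util.Iterator;
import java.util.LinkedList;

// Illustration only: a stand-in for one priority queue. In the real structure
// the bookmark is kept inside LightWeightLinkedSet; here it is represented by
// the next element the bookmarked iterator would return.
final class SkippedBlockCount {
  static <T> int countBehindBookmark(LinkedList<T> queue, T nextBookmarkedElement) {
    int behind = 0;
    // There is no back-pointer from the bookmark to the skipped entries, so the
    // only way to count them is to walk from the head until we meet the
    // bookmarked element - O(n) in the worst case.
    for (Iterator<T> it = queue.iterator(); it.hasNext(); ) {
      if (it.next().equals(nextBookmarkedElement)) {
        break;
      }
      behind++;
    }
    return behind;
  }
}
{code}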
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978212#comment-16978212 ]

Stephen O'Donnell commented on HDFS-14861:
-------------------------------------------

There is no real logic behind "2 hours" except that I didn't want it to be too long or too short. It would be great if there was an efficient way to notice that some number of items had been left behind in the queue, but it's tricky. The queue itself is a linked list, and the iterator points to the "next element". New items are added to the head of the list, which the iterator works toward. The problem is the skipped items left behind in the tail. Therefore all we can really do is look backwards in the queue to see if there are many items behind the iterator pointer, but that involves walking the list (and I have not checked if the API is exposed to do it), which is why I went for the simple time-based approach. I am definitely open to other suggestions here, as using a fixed time window is not ideal.
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977899#comment-16977899 ]

Wei-Chiu Chuang commented on HDFS-14861:
----------------------------------------

Sorry for the long delay. The patch makes sense to me. I'm curious: why 2 hours? I know it can be hard to decide on a proper number, but I am hoping to find a way to optimize it.
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949622#comment-16949622 ]

Stephen O'Donnell commented on HDFS-14861:
-------------------------------------------

The failing test in hadoop.hdfs.server.blockmanagement.TestReplicationPolicy seems to pass locally. The checkstyle warning is highlighting a method in BlockManager which is too long - the method was already close to the limit. I could move my new lines into another method and call it, but I don't think that really adds any value if you look at what else is happening in that method already; it's mostly reading config and setting instance variables.
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949518#comment-16949518 ]

Hadoop QA commented on HDFS-14861:
-----------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 58s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 19m 31s | trunk passed |
| +1 | compile | 0m 59s | trunk passed |
| +1 | checkstyle | 0m 49s | trunk passed |
| +1 | mvnsite | 1m 5s | trunk passed |
| +1 | shadedclient | 14m 12s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 46s | trunk passed |
| +1 | javadoc | 1m 27s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 10s | the patch passed |
| +1 | compile | 1m 5s | the patch passed |
| +1 | javac | 1m 5s | the patch passed |
| -0 | checkstyle | 0m 53s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 590 unchanged - 0 fixed = 591 total (was 590) |
| +1 | mvnsite | 1m 9s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 2s | The patch has no ill-formed XML file. |
| +1 | shadedclient | 14m 28s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 34s | the patch passed |
| +1 | javadoc | 1m 21s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 107m 42s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 35s | The patch does not generate ASF License warnings. |
| | | 173m 9s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy |
| | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.3 Server=19.03.3 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14861 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12982769/HDFS-14861.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux eebd086e123a 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 62b5cef |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28064/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/jo
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949370#comment-16949370 ]

Stephen O'Donnell commented on HDFS-14861:
-------------------------------------------

Uploaded a new patch changing how this is done, as I did not like passing the reset threshold into the LowRedundancyBlocks iterator. Instead, this patch manages the iteration count in the BlockManager and then calls chooseLowRedundancyBlocks with a flag indicating whether it should reset the iterator bookmarks or not. I think this is a slightly cleaner solution than the last one, as shown in the sketch below.

The parameter "dfs.namenode.redundancy.queue.restart.iterations" indicates how many calls must be made before a position reset is issued, and setting it to zero disables the change.
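A minimal, self-contained sketch of the caller-side counting described here. The class and field names are illustrative only (the real change lives in BlockManager); what it shows is the semantics from the comment: count calls, request a reset on every N-th call, and zero disables the behaviour.

{code:java}
// Sketch of the counting logic: the caller counts calls to
// chooseLowRedundancyBlocks() and asks the queues to reset their bookmarks
// every N-th call. Names are illustrative, not from the patch.
final class ResetEveryNCalls {
  private final int resetIterations;   // dfs.namenode.redundancy.queue.restart.iterations
  private int callsSinceReset = 0;

  ResetEveryNCalls(int resetIterations) {
    this.resetIterations = resetIterations;
  }

  /** Returns true when the bookmarks should be reset on this call. */
  boolean shouldResetOnThisCall() {
    if (resetIterations <= 0) {
      return false;                    // feature disabled: behave as before
    }
    if (++callsSinceReset >= resetIterations) {
      callsSinceReset = 0;
      return true;
    }
    return false;
  }
}
{code}

Per the description above, the boolean produced this way is what gets passed into chooseLowRedundancyBlocks as the reset flag, so the queue class itself stays free of configuration.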
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947990#comment-16947990 ]

Hadoop QA commented on HDFS-14861:
-----------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 39s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 18m 55s | trunk passed |
| +1 | compile | 1m 5s | trunk passed |
| +1 | checkstyle | 0m 57s | trunk passed |
| +1 | mvnsite | 1m 5s | trunk passed |
| +1 | shadedclient | 13m 47s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 10s | trunk passed |
| +1 | javadoc | 1m 15s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 2s | the patch passed |
| +1 | compile | 0m 55s | the patch passed |
| +1 | javac | 0m 55s | the patch passed |
| -0 | checkstyle | 0m 44s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 625 unchanged - 1 fixed = 627 total (was 626) |
| +1 | mvnsite | 0m 58s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 2s | The patch has no ill-formed XML file. |
| +1 | shadedclient | 12m 33s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 18s | the patch passed |
| +1 | javadoc | 1m 16s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 94m 11s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 41s | The patch does not generate ASF License warnings. |
| | | 154m 28s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestMultipleNNPortQOP |
| | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
| | hadoop.hdfs.server.namenode.ha.TestBootstrapAliasmap |
| | hadoop.hdfs.TestDFSClientRetries |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.3 Server=19.03.3 Image:yetus/hadoop:1dde3efb91e |
| JIRA Issue | HDFS-14861 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12982612/HDFS-14861.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux a0da8f46a274 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / d76e265 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28049/artifact/out/diff-checkstyle
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947887#comment-16947887 ]

Stephen O'Donnell commented on HDFS-14861:
-------------------------------------------

I have uploaded a first go at this change. This adds a new config key "DFSConfigKeys.DFS_NAMENODE_REDUNDANCY_QUEUE_RESET_ITERATIONS_DEFAULT" which defaults to 2400. As the redundancy monitor runs every 3 seconds, this means it will take about 2400 * 3 seconds = 2 hours to reset the iterators. Setting the key to zero disables this change and it will work as before, only resetting the iterators after the queues have all reached their end.

LowRedundancyBlocks does not currently get a conf object passed in, so I opted to pass a value into its constructor and grab the setting from config in BlockManager, which creates the LowRedundancyBlocks object. It's not an ideal way to do this, and the value is only used in chooseLowRedundancyBlocks and nowhere else in the class, so it's a little confusing.

I still need to add some tests, but I wanted to share what I had for now.
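For reference, the arithmetic behind the 2400 default, wrapped in a small runnable snippet. Only the 3-second monitor period and the 2400 default come from the comment; the 30-minute target at the end is just an example of picking a different value.

{code:java}
// Maps the iteration-count threshold to wall-clock time between iterator resets.
public class ResetIntervalMath {
  public static void main(String[] args) {
    int monitorIntervalSec = 3;      // redundancy monitor period (default, per the description)
    int resetIterations = 2400;      // proposed default

    int secondsBetweenResets = resetIterations * monitorIntervalSec;
    System.out.println(secondsBetweenResets / 3600.0 + " hours between resets"); // 2.0 hours

    // Picking a value for a different target, e.g. 30 minutes between resets:
    int targetSeconds = 30 * 60;
    System.out.println(targetSeconds / monitorIntervalSec
        + " iterations for a 30 minute reset"); // 600
  }
}
{code}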
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934607#comment-16934607 ]

Stephen O'Donnell commented on HDFS-14861:
-------------------------------------------

An alternative to the above would be for the decommission monitor to somehow track how long it has been since a block was added to the replication queue, and remove it and re-add it (giving the decommission logic more control over the blocks it is tracking). However, that change would be much more complicated than the solution I suggested above.
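Purely to illustrate the alternative idea above (this is not an existing Hadoop API, and every name below is hypothetical), the tracking could look roughly like this:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the decommission monitor remembers when it queued each
// block and, if a block has waited too long, flags it so the caller can remove
// and re-add it to the replication queue.
final class StaleReplicationTracker<B> {
  private final long maxWaitMs;
  private final Map<B, Long> queuedAtMs = new HashMap<>();

  StaleReplicationTracker(long maxWaitMs) {
    this.maxWaitMs = maxWaitMs;
  }

  void recordQueued(B block, long nowMs) {
    queuedAtMs.putIfAbsent(block, nowMs);
  }

  /** True if the block has been pending longer than the threshold and should be re-queued. */
  boolean shouldRequeue(B block, long nowMs) {
    Long queued = queuedAtMs.get(block);
    return queued != null && nowMs - queued > maxWaitMs;
  }

  void clear(B block) {
    queuedAtMs.remove(block);
  }
}
{code}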
[jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
[ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934602#comment-16934602 ]

Stephen O'Donnell commented on HDFS-14861:
-------------------------------------------

The simplest way to fix this problem may be to increment a counter each time chooseLowRedundancyBlocks() is called, and reset it each time the iterators are reset. If they have not been reset for X calls, force a reset:

{code:java}
  synchronized List<List<BlockInfo>> chooseLowRedundancyBlocks(
      int blocksToProcess) {
    final List<List<BlockInfo>> blocksToReconstruct = new ArrayList<>(LEVEL);

    int count = 0;
    int priority = 0;
    for (; count < blocksToProcess && priority < LEVEL; priority++) {
      if (priority == QUEUE_WITH_CORRUPT_BLOCKS) {
        // do not choose corrupted blocks.
        continue;
      }

      // Go through all blocks that need reconstructions with current priority.
      // Set the iterator to the first unprocessed block at this priority level
      final Iterator<BlockInfo> i = priorityQueues.get(priority).getBookmark();
      final List<BlockInfo> blocks = new LinkedList<>();
      blocksToReconstruct.add(blocks);
      // Loop through all remaining blocks in the list.
      for (; count < blocksToProcess && i.hasNext(); count++) {
        blocks.add(i.next());
      }
    }

    callCount++; // New counter
    if (priority == LEVEL || callCount > threshold) { // Check counter against some threshold here
      callCount = 0;
      // Reset all bookmarks because there were no recently added blocks.
      for (LightWeightLinkedSet<BlockInfo> q : priorityQueues) {
        q.resetBookmark();
      }
    }
    return blocksToReconstruct;
  }
{code}

If things are working well, then most or all blocks returned by this method should be scheduled on datanodes, and hence the iterator bookmark should be close to the head of the list. Resetting it would only cause a few blocks to be retried.

If things are not working well, then resetting the iterator back to the head of the list would cause a lot of blocks to be retried, and hence it would take longer to reach the tail of the list. However, that would probably indicate there are problems on the cluster (e.g. unable to place new replicas, or out-of-service replicas). Provided the time between resets is not too small (e.g. 30 - 60 minutes) this would probably be OK.

If blocks are under-replicated (e.g. from a node failure), skipped blocks are not a problem - all blocks have to be processed eventually anyway, and it does not really matter what order that happens in, or what is skipped. However, with decommissioning and maintenance mode, a skipped block can prevent the node from completing the process. Consider decommissioning a few nodes, where one has relatively few blocks. A skipped block on the smaller node would cause it to wait with only a few blocks pending until the other nodes are fully processed and the iterator is reset.
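As a follow-up illustration of why the forced reset helps, here is a tiny toy model (not Hadoop code) of the bookmark behaviour: blocks handed out but not scheduled are only seen again after a reset, which is exactly what the counter above forces periodically.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Toy model of a bookmarked queue: chooseNext() hands out the next n items and
// only revisits earlier (possibly skipped) items once the bookmark is reset.
final class BookmarkDemo {
  private final List<String> queue = new ArrayList<>();
  private int bookmark = 0;   // index of the next unprocessed item

  void add(String block) { queue.add(block); }

  List<String> chooseNext(int n, boolean reset) {
    if (reset) {
      bookmark = 0;           // periodic reset: skipped items become visible again
    }
    List<String> out = new ArrayList<>();
    while (out.size() < n && bookmark < queue.size()) {
      out.add(queue.get(bookmark++));
    }
    return out;
  }

  public static void main(String[] args) {
    BookmarkDemo q = new BookmarkDemo();
    for (int i = 0; i < 6; i++) { q.add("blk_" + i); }
    System.out.println(q.chooseNext(4, false)); // [blk_0..blk_3] - suppose blk_1 could not be scheduled
    System.out.println(q.chooseNext(4, false)); // [blk_4, blk_5] - blk_1 is NOT retried
    System.out.println(q.chooseNext(4, true));  // after a reset: [blk_0..blk_3] - blk_1 gets another chance
  }
}
{code}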