[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227443#comment-16227443 ] Wei Yan commented on HDFS-11384: Gotcha, thanks for the explaination, [~shv] > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 > > Attachments: HDFS-11384-007.patch, HDFS-11384-branch-2.7.011.patch, > HDFS-11384-branch-2.8.011.patch, HDFS-11384.001.patch, HDFS-11384.002.patch, > HDFS-11384.003.patch, HDFS-11384.004.patch, HDFS-11384.005.patch, > HDFS-11384.006.patch, HDFS-11384.008.patch, HDFS-11384.009.patch, > HDFS-11384.010.patch, HDFS-11384.011.patch, balancer.day.png, > balancer.week.png > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224231#comment-16224231 ] Konstantin Shvachko commented on HDFS-11384: Hey [~ywskycn] we want to disperse the initial RPC. Once they are dispersed the rest of them should follow the pattern. Therefore we do not need to delay {{dispatchBlocks(0)}} when we start reusing the threads {{j >= concurrentThreads}}. As [explained in this comment|https://issues.apache.org/jira/browse/HDFS-11384?focusedCommentId=15961620=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15961620]. Hope this makes sense. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 > > Attachments: HDFS-11384-007.patch, HDFS-11384-branch-2.7.011.patch, > HDFS-11384-branch-2.8.011.patch, HDFS-11384.001.patch, HDFS-11384.002.patch, > HDFS-11384.003.patch, HDFS-11384.004.patch, HDFS-11384.005.patch, > HDFS-11384.006.patch, HDFS-11384.008.patch, HDFS-11384.009.patch, > HDFS-11384.010.patch, HDFS-11384.011.patch, balancer.day.png, > balancer.week.png > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219762#comment-16219762 ] Wei Yan commented on HDFS-11384: For the following changes in the diff, why we set dSec to 0 when "j >= concurrentThreads", instead of "j <= concurrentThreads"? Did I miss anything there? [~shv], [~zhaoyunjiong], [~zhz] {code} // Calculate delay in seconds for the next iteration if(j >= concurrentThreads) { dSec = 0; } else if((j + 1) % BALANCER_NUM_RPC_PER_SEC == 0) { dSec++; } {code} > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 > > Attachments: HDFS-11384-007.patch, HDFS-11384-branch-2.7.011.patch, > HDFS-11384-branch-2.8.011.patch, HDFS-11384.001.patch, HDFS-11384.002.patch, > HDFS-11384.003.patch, HDFS-11384.004.patch, HDFS-11384.005.patch, > HDFS-11384.006.patch, HDFS-11384.008.patch, HDFS-11384.009.patch, > HDFS-11384.010.patch, HDFS-11384.011.patch, balancer.day.png, > balancer.week.png > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201578#comment-16201578 ] xiaoli commented on HDFS-11384: --- The patch1 looks good!(/)(/) > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 > > Attachments: HDFS-11384-007.patch, HDFS-11384-branch-2.7.011.patch, > HDFS-11384-branch-2.8.011.patch, HDFS-11384.001.patch, HDFS-11384.002.patch, > HDFS-11384.003.patch, HDFS-11384.004.patch, HDFS-11384.005.patch, > HDFS-11384.006.patch, HDFS-11384.008.patch, HDFS-11384.009.patch, > HDFS-11384.010.patch, HDFS-11384.011.patch, balancer.day.png, > balancer.week.png > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995826#comment-15995826 ] yunjiong zhao commented on HDFS-11384: -- [~shv] Thanks for the fix. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha3, 2.8.2 > > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, > HDFS-11384.004.patch, HDFS-11384.005.patch, HDFS-11384.006.patch, > HDFS-11384-007.patch, HDFS-11384.008.patch, HDFS-11384.009.patch, > HDFS-11384.010.patch, HDFS-11384.011.patch, HDFS-11384-branch-2.7.011.patch, > HDFS-11384-branch-2.8.011.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986048#comment-15986048 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 12m 34s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 12s{color} | {color:green} branch-2.7 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 9s{color} | {color:green} branch-2.7 passed with JDK v1.8.0_131 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 7s{color} | {color:green} branch-2.7 passed with JDK v1.7.0_121 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} branch-2.7 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 2s{color} | {color:green} branch-2.7 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s{color} | {color:green} branch-2.7 passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 3m 0s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in branch-2.7 has 1 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s{color} | {color:green} branch-2.7 passed with JDK v1.8.0_131 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 44s{color} | {color:green} branch-2.7 passed with JDK v1.7.0_121 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} the patch passed with JDK v1.8.0_131 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 0s{color} | {color:green} the patch passed with JDK v1.7.0_121 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 22s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 58 unchanged - 0 fixed = 60 total (was 58) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 4347 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 1m 34s{color} | {color:red} The patch 176 line(s) with tabs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 57s{color} | {color:green} the patch passed with JDK v1.8.0_131 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 42s{color} | {color:green} the patch passed with JDK v1.7.0_121 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 43m 48s{color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_121. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 18s{color} | {color:red} The patch generated 3 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}139m 17s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_131 Failed junit tests | hadoop.hdfs.web.TestWebHdfsTokens | | | hadoop.hdfs.TestBlockStoragePolicy | | | hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots | | JDK v1.7.0_121 Failed junit tests | hadoop.hdfs.server.mover.TestStorageMover | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots | \\ \\ ||
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985898#comment-15985898 ] Hudson commented on HDFS-11384: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11637 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/11637/]) HDFS-11384. Balancer disperses getBlocks calls to avoid NameNode's rpc (shv: rev 28eb2aabebd15c15a357d86e23ca407d3c85211c) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancer.java * (add) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerRPCDelay.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Dispatcher.java > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, > HDFS-11384.004.patch, HDFS-11384.005.patch, HDFS-11384.006.patch, > HDFS-11384-007.patch, HDFS-11384.008.patch, HDFS-11384.009.patch, > HDFS-11384.010.patch, HDFS-11384.011.patch, HDFS-11384-branch-2.7.011.patch, > HDFS-11384-branch-2.8.011.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985885#comment-15985885 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 26s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 54s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 5 new + 280 unchanged - 2 fixed = 285 total (was 282) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 73m 38s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}106m 50s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency | | | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting | | | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ac17dc | | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12865201/HDFS-11384.010.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 331b4cd05e03 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 8b5f2c3 | | Default Java | 1.8.0_121 | | findbugs | v3.1.0-RC1 | | findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/19210/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/19210/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/19210/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/19210/testReport/ | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985763#comment-15985763 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 38s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 35s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 5 new + 280 unchanged - 2 fixed = 285 total (was 282) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 89m 0s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}114m 36s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.web.TestWebHDFS | | Timed out junit tests | org.apache.hadoop.hdfs.TestLeaseRecovery2 | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ac17dc | | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12865201/HDFS-11384.010.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 52bc523accab 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 8b5f2c3 | | Default Java | 1.8.0_121 | | findbugs | v3.1.0-RC1 | | findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/19209/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/19209/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/19209/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/19209/testReport/ | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/19209/console | | Powered by | Apache
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985527#comment-15985527 ] Zhe Zhang commented on HDFS-11384: -- Thanks [~shv]. Latest patch LGTM. +1 pending Jenkins. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, > HDFS-11384.004.patch, HDFS-11384.005.patch, HDFS-11384.006.patch, > HDFS-11384-007.patch, HDFS-11384.008.patch, HDFS-11384.009.patch, > HDFS-11384.010.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985152#comment-15985152 ] Zhe Zhang commented on HDFS-11384: -- {{TestBalancerWithSaslDataTransfer}} fails on my local box too (it passes without the patch). > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, > HDFS-11384.004.patch, HDFS-11384.005.patch, HDFS-11384.006.patch, > HDFS-11384-007.patch, HDFS-11384.008.patch, HDFS-11384.009.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984129#comment-15984129 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s{color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 53s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 36s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 280 unchanged - 2 fixed = 284 total (was 282) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 15s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 98m 5s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting | | | hadoop.hdfs.server.balancer.TestBalancerWithSaslDataTransfer | | | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ac17dc | | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12865051/HDFS-11384.009.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux beb7a0d3cf1c 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 4ea2778 | | Default Java | 1.8.0_121 | | findbugs | v3.1.0-RC1 | | findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/19201/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/19201/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/19201/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/19201/testReport/ | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982410#comment-15982410 ] Zhe Zhang commented on HDFS-11384: -- Thanks for the update [~shv]. Now all other tests in {{TestBalancer}} pass except for {{testBalancerRPCDelay}}: {code} java.util.concurrent.TimeoutException: Timed out waiting for /tmp.txt to reach 40 replicas at org.apache.hadoop.hdfs.DFSTestUtil.waitReplication(DFSTestUtil.java:764) at org.apache.hadoop.hdfs.server.balancer.TestBalancer.createFile(TestBalancer.java:306) at org.apache.hadoop.hdfs.server.balancer.TestBalancer.doTest(TestBalancer.java:847) at org.apache.hadoop.hdfs.server.balancer.TestBalancer.testBalancerRPCDelay(TestBalancer.java:2071) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) {code} > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, > HDFS-11384.004.patch, HDFS-11384.005.patch, HDFS-11384.006.patch, > HDFS-11384-007.patch, HDFS-11384.008.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981551#comment-15981551 ] Zhe Zhang commented on HDFS-11384: -- Thanks [~shv] for the update. The added Javadoc looks good. The unit test {{testBalancerRPCDelay}} fails for me locally with a NullPointerException. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, > HDFS-11384.004.patch, HDFS-11384.005.patch, HDFS-11384.006.patch, > HDFS-11384-007.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980234#comment-15980234 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 37s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 36s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 3 new + 281 unchanged - 1 fixed = 284 total (was 282) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 85m 51s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}111m 33s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.balancer.TestBalancer | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ac17dc | | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12864665/HDFS-11384-007.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 305c4e63baa7 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / fda86ef | | Default Java | 1.8.0_121 | | findbugs | v3.1.0-RC1 | | findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/19185/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/19185/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/19185/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/19185/testReport/ | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/19185/console | | Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org | This
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979926#comment-15979926 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 19s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s{color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 50s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 37s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 8 new + 281 unchanged - 1 fixed = 289 total (was 282) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}113m 21s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 34s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}142m 53s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting | | | hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency | | Timed out junit tests | org.apache.hadoop.hdfs.server.balancer.TestBalancer | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ac17dc | | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12864621/HDFS-11384.006.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 5bf1fdc0f19e 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / fda86ef | | Default Java | 1.8.0_121 | | findbugs | v3.1.0-RC1 | | findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/19180/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/19180/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/19180/artifact/patchprocess/whitespace-eol.txt | | unit |
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974123#comment-15974123 ] Zhe Zhang commented on HDFS-11384: -- Thanks [~shv], the main logic LGTM. I could not reproduce reported unit test failures either. +1 pending a few final comments: # IIUC, {{BALANCER_NUM_RPC_PER_SEC}} is a best-effort throttling target, instead of a guaranteed threshold. E.g. it looks possible for {{Thread.sleep(delay)}} to be interrupted and {{getBlockList}} to be retried in the while loop. Or the entire {{dispatchBlocks}} call in a thread could die before {{delay}} seconds, then another {{future\[j\]}} will be issued without the delay. (Assuming this understanding is correct), I think this is the right way to handle this logic -- it is a good idea not to optimize for these rare cases. But can we update the documentation for {{BALANCER_NUM_RPC_PER_SEC}} to reflect it? # {{private void dispatchBlocks(long delay) {}} doesn't explain {{delay}} in its Javadoc. # What does {{testBalancerRPCDelay}} verify? It is not checking the number of RPC calls. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, > HDFS-11384.004.patch, HDFS-11384.005.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966901#comment-15966901 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 23m 12s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 87m 30s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}140m 29s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.namenode.TestReconstructStripedBlocks | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:612578f | | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12863142/HDFS-11384.005.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 6338c0ca2e43 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / b053fdc | | Default Java | 1.8.0_121 | | findbugs | v3.0.0 | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/19067/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/19067/testReport/ | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/19067/console | | Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components:
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965334#comment-15965334 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 35s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 251 unchanged - 0 fixed = 252 total (was 251) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 24s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 96m 52s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.balancer.TestBalancer | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:612578f | | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12862956/HDFS-11384.004.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux f6f002ce984b 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 3a91376 | | Default Java | 1.8.0_121 | | findbugs | v3.0.0 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/19056/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/19056/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/19056/testReport/ | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/19056/console | | Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike >
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965266#comment-15965266 ] Konstantin Shvachko commented on HDFS-11384: * I am usually very conservative about introducing new configuration parameters. Parameters seem to give you flexibility to adjust them, but in many cases administrators don't know what to do with that flexibility, because there so many of them. I prefer to have a reasonable constant value initially, and add a config variable later if _other_ value are needed in certain cases. In the end adding configs is easy, but you can never remove them. In this particular case the BALANCER_NUM_RPC_PER_SEC is chosen so that big clusters would distribute _initial_ RPC requests over 10 secs, and it does not effect small clusters at all. I think we are good with the constant set to 20 for now, but let me know if you see use cases for different values. * Fixed the typo in 004 patch. Thanks [~zhz]. * This would be a typical misuse of Preconditions, as we do in many cases in the code, and as it was discussed previously on many occasions. It is an assert, because we assume the condition should never happen. If it does, it's a bug, which should be caught during testing, with {{-ea}} option. And in the runtime we want to avoid checking any extra condition for performance reasons. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, > HDFS-11384.004.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963904#comment-15963904 ] Zhe Zhang commented on HDFS-11384: -- Thanks [~shv] for the patch. In addition to [~redvine]'s suggestion, two minor comments: # Typo "Inteval" # I guess Preconditions is the preferred way for cases like below: {code} assert concurrentThreads > 0 : "Number of concurrent threads is 0."; {code} > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963707#comment-15963707 ] Vinitha Reddy Gankidi commented on HDFS-11384: -- [~shv] The delay logic looks good to me. It would be great if we can make BALANCER_NUM_RPC_PER_SEC configurable with a default value of 20.The test does not ensure that there are indeed 20 getBlocks calls per second and it probably is not straightforward to ensure that. So I would like to have the ability to configure BALANCER_NUM_RPC_PER_SEC. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961667#comment-15961667 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 35s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 251 unchanged - 0 fixed = 252 total (was 251) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 4s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 88m 1s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:a9ad5d6 | | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12862555/HDFS-11384.003.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 70b6452eb18d 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 2aa8967 | | Default Java | 1.8.0_121 | | findbugs | v3.0.0 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/19022/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/19022/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/19022/testReport/ | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/19022/console | | Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike >
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961625#comment-15961625 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 13s{color} | {color:red} HDFS-11384 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12862552/HDFS-11384.003.patch | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/19019/console | | Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951414#comment-15951414 ] Konstantin Shvachko commented on HDFS-11384: Yup, I was thinking the same last night. You need all locations of the block to select the right target that satisfies replication policy, cannot get it from a DN. So unfortunately we should discard (3) as an option. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949978#comment-15949978 ] Vinitha Reddy Gankidi commented on HDFS-11384: -- [~shv] I'm leaning towards reading from (4) instead of (3). {{isGoodBlockCandidate}} needs a global view of the block replicas. Also there is some additional logic to deal with erasure coded(EC) blocks and this may be a blocker for reading from DNs. [~zhz] you probably have more context regarding the EC blocks. {code} /** * Decide if the block/blockGroup is a good candidate to be moved from source * to target. A block is a good candidate if * 1. the block is not in the process of being moved/has not been moved; * 2. the block does not have a replica/internalBlock on the target; * 3. doing the move does not reduce the number of racks that the block has */ private boolean isGoodBlockCandidate(StorageGroup source, StorageGroup target, StorageType targetStorageType, DBlock block) { {code} I agree that (2) and (4) are complimentary. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949905#comment-15949905 ] Konstantin Shvachko commented on HDFS-11384: Combining my and Vinitha's proposals in a common list here, for further discussion. # Add a parameter to {{getBlockList(long delay)}}, where {{delay}} is a random time within a reasonable interval, which Balancer should wait for before sending the {{getBlocks()}} RPC to NN. The delay is only applied once, and set to 0 once applied. This looks rather straightforward to me. # Allocate a reasonable throughput of {{getBlocks()}} RPCs to NN, and delay calls if the quota is exceeded. This is similar to [~benoyantony]'s proposal, but allows to precisely specify how much of NN RPC bandwidth is allocated for the Balancer. # Dispatcher can get the blocks directly from the particular DN instead of NN. This makes {{getBlocksList()}} a distributed operation and doesn't impact any specific node. # Dispatcher can fetch the blocks from the Standby NN instead of the active. Balancer should be able to tolerate reasonable degree of staleness. I think (1) is quite easy and can solve immediate problem with bursting RPCs. I really like Vinitha's (3) obtaining block info directly from DNs. I did not look how much harder it is than other approaches. (2) and (4) are complimentary, that is we can have a limit on Balancer RPC bandwidth to NN, but run it against SBN. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949824#comment-15949824 ] Mingliang Liu commented on HDFS-11384: -- Ping [~szetszwo]. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949815#comment-15949815 ] Zhe Zhang commented on HDFS-11384: -- Also pinging [~mingma] and [~andrew.wang] since this is related to "reading from SbNN". > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949757#comment-15949757 ] Vinitha Reddy Gankidi commented on HDFS-11384: -- If we were to offload the calls to DN, dispersing calls wouldn't be a pressing issue. I would like to get some feedback on the various approaches discussed. [~benoyantony] [~daryn] [~liuml07] I would love to hear your opinions. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949731#comment-15949731 ] Zhe Zhang commented on HDFS-11384: -- Great discussions [~shv] and [~redvine]. I suggest we keep this JIRA for making the _pattern_ of {{getBlocks}} calls less bursty, and also create a JIRA to _offload_ {{getBlocks}} calls to DN or SbNN. [~redvine] Thoughts? > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949593#comment-15949593 ] Vinitha Reddy Gankidi commented on HDFS-11384: -- Two other approaches to fix this: 1. In {{getBlockList()}} Dispatcher fetches the blocks belonging to a particular DN from the NN. And then it moves those blocks from the source DN to the target DN. Dispatcher can instead get the blocks directly from the particular DN. This makes {{getBlocksList()}} a distributed operation and doesn't impact any specific node. 2. Dispatcher can fetch the blocks from the Standby NN instead of the active. Balancer should be able to tolerate reasonable degree of staleness. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944307#comment-15944307 ] Konstantin Shvachko commented on HDFS-11384: Took me some time to refresh my memory of the details of balancing. So here is my understanding of what is happening: * {{dispatchBlockMoves()}} spawns a thread for each {{Source}}, which represents all storages of the same type of a DN. Each thread executes {{dispatchBlocks()}} then. * {{dispatchBlocks()}} first tries to schedule block transfers for already selected (source:target) DN pairs, and if there are no more pairs it calls {{getBlockList()}}, which contacts NN to obtain the next portion of blocks from the source DN to be moved out. * The problem of two many RPC calls happens in the beginning of the Balancer iteration, when there are no scheduled pairs yet, so all the threads call {{getBlockList()}} and go to the NameNode simultaneously. So we need to disperse only the initial burst of RPCs at the start of an iteration, as subsequent {{getBlocks()}} are already dispersed fine. I see two ways to fix this: # Add a parameter to {{getBlockList(long delay)}}, where {{delay}} is a random time within a reasonable interval, which Balancer should wait for before sending the {{getBlocks()}} RPC to NN. The delay is only applied once, and set to 0 once applied. This looks rather straightforward to me. # Allocate a reasonable throughput of {{getBlocks()}} RPCs to NN, and delay calls if the quota is exceeded. This is similar to [~benoyantony]'s proposal, but allows to precisely specify how much of NN RPC bandwidth is allocated for the Balancer. [~zhaoyunjiong] I understand you wanted a simple fix without making too much changes, but this looks like a real problem to me, and we should fix it in a more generic manner. I am fine if you wish to implement option #1 here as an initial step. Ultimately we should target solution #2, which could be done in another jira. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933354#comment-15933354 ] yunjiong zhao commented on HDFS-11384: -- Thanks [~shv] for review. Only when you set dfs.balancer.getBlocks.interval.millis to non-zero, Balancer will only allow one thread to issue {code}getBlocks(){code} at any given time. Otherwise this patch doesn't change anything. So only one change actually. If use wait, it will release the lock, so can't make sure there are only one thread will call {code}getBlocks(){code}. By default, this patch doesn't change anything. So if you need run Balancer aggressively, don't set dfs.balancer.getBlocks.interval.millis. {quote} Can we add some heuristics so that the Balancer could adjust by itself instead of adding the configuration parameter {quote} I though this before. The best way I can thought is add new function in IPC that let clients get the CallQueueLength, if CallQueueLength is too high, block getBlocks() until the CallQueueLength become normal again. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933250#comment-15933250 ] Konstantin Shvachko commented on HDFS-11384: Hey [~zhaoyunjiong], [~benoyantony]. This seems like an important issue to fix. Thanks for reporting and working on the fix. We face the same problem on our cluster. If I understand correctly there are two changes in the patch: # Allow only one Balancer thread to issue {{getBlocks()}} at any given time. # Add a possibility to sleep for a thread between {{getBlocks()}} calls. Some thoughts # I agree with [~benoyantony] that the {{sleep()}} solution is not optimal. Do you guys think we can replace it with a {{wait()}}? # I also want to understand what is the effect of this change on the case when there is a lot unbalanced nodes and the Balancer should run aggressively. Can we add some heuristics so that the Balancer could adjust by itself instead of adding the configuration parameter, which would require Balancer restart if it needs to be changed. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923332#comment-15923332 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 2s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 51s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 38s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 451 unchanged - 0 fixed = 453 total (was 451) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 19s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 97m 52s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.datanode.TestFsDatasetCache | | | hadoop.hdfs.web.TestWebHdfsFileSystemContract | | | hadoop.hdfs.TestHDFSFileSystemContract | | | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:a9ad5d6 | | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12858571/HDFS-11384.002.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml | | uname | Linux 5ceb83dbf5c7 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 55796a0 | | Default Java | 1.8.0_121 | | findbugs | v3.0.0 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/18700/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/18700/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/18700/testReport/ | | modules
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903651#comment-15903651 ] Benoy Antony commented on HDFS-11384: - [~zhaoyunjiong], Could you please check to see if sleeping outside the sync block with a potentially longer delay is sufficient to avoid spike ? > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893388#comment-15893388 ] Hadoop QA commented on HDFS-11384: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s{color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 52s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 41s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 449 unchanged - 0 fixed = 451 total (was 449) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 3s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 69m 40s{color} | {color:green} hadoop-hdfs in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 97m 34s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-hdfs-project/hadoop-hdfs | | | org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.getBlockList() calls Thread.sleep() with a lock held At Dispatcher.java:lock held At Dispatcher.java:[line 778] | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:a9ad5d6 | | JIRA Issue | HDFS-11384 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12855719/HDFS-11384.001.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml | | uname | Linux c8f1eb7124c2 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / a97833e | | Default Java | 1.8.0_121 | | findbugs | v3.0.0 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/18524/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/18524/artifact/patchprocess/new-findbugs-hadoop-hdfs-project_hadoop-hdfs.html | | Test Results |
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891587#comment-15891587 ] Benoy Antony commented on HDFS-11384: - Thanks for the explanation, [~zhaoyunjiong]. The patch looks good. I will commit this tomorrow if there are no other comments. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891198#comment-15891198 ] yunjiong zhao commented on HDFS-11384: -- Thank you [~benoyantony] for your time to review this patch. {quote}Sleeping inside the Synchronized block should be avoided as it will prevent other threads from obtaining the lock while the thread is sleeping. {quote} I did it on purpose for sleeping inside the Synchronized block. In balancer there are multiple threads (by default 200) that may call getBlocks at same time, if user need to set dfs.balancer.getBlocks.interval.millis to slow down balancer, without a lock it won't work well due to at worst case there are still 200 getBlocks send to NameNode at same time. {quote}It will be better to keep track of the interval between successive getBlocks and sleep only for the required time. {quote} Since by default, this patch doesn't change anything, only add a option let user slow down balancer send getBlocks to NameNode, so I'd like to keep it as simple as possible. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890957#comment-15890957 ] Benoy Antony commented on HDFS-11384: - Sleeping inside the *Synchronized* block should be avoided as it will lock prevent other threads from obtaining the lock while the thread is sleeping. One tradeoff in sleeping fixed vs variable time is that code gets complicated. Since by default, the delay is not applied, it is okay to sleep for a fixed interval after getBlocks(). > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888766#comment-15888766 ] Benoy Antony commented on HDFS-11384: - [~zhaoyunjiong], If there are blocks to balance, then there will be sufficient delays between successive getBlocks. In such cases, we do not have to sleep. It will be better to keep track of the interval between successive getBlocks and sleep only for the required time. Can you also write a unit test to cover this change ? > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org