There is no such file /system/balancer.id; the /system directory is empty when the balancer is not running. When I run the balancer I see this file being created, but when the balancer exits the file is deleted properly and /system is empty again, as usual.

Each time, the balancer moves a very small amount of data (around 10 GB, see the example below) and then exits after about 2 minutes with the same MetricsException.

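As a side note on the settings the balancer prints in the log that follows, here is a minimal Python sketch that extracts the `key = value (default=...)` lines and lists the ones that differ from their defaults. It is not part of Hadoop, and the names `parse_settings` and `non_default` are made up for illustration. The sample lines are copied from the log. The sketch also converts dfs.balancer.max-size-to-move to GiB, which works out to 10, the same figure as the 10.00 GB moved in one iteration. Whether that setting is what bounds each iteration here is an assumption, not something the log confirms.

```python
import re

# Matches balancer setting lines such as:
#   "... INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728)"
SETTING = re.compile(r"balancer\.Balancer: (\S+) = (\d+) \(default=(\d+)\)")


def parse_settings(log_text):
    """Return {key: (value, default)} for every setting line in the log text."""
    return {
        m.group(1): (int(m.group(2)), int(m.group(3)))
        for m in SETTING.finditer(log_text)
    }


def non_default(settings):
    """Keep only the settings whose value differs from the printed default."""
    return {k: v for k, v in settings.items() if v[0] != v[1]}


# Sample lines copied from the balancer log in this message.
sample = """\
2025-03-10 10:26:01,271 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 50 (default=100)
2025-03-10 10:26:01,271 INFO balancer.Balancer: dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2025-03-10 10:26:01,271 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
2025-03-10 10:26:01,301 ERROR balancer.Balancer: Exiting balancer due an exception
"""

settings = parse_settings(sample)
changed = non_default(settings)

# Convert the per-iteration size setting from bytes to GiB.
max_move_gib = settings["dfs.balancer.max-size-to-move"][0] / 1024**3

print(changed)
print(max_move_gib)
```

On the full log, the same function can be fed the whole file contents instead of `sample`.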
2025-03-10 10:25:52,266 INFO balancer.Dispatcher: Total bytes (blocks) moved in this iteration 10.00 GB (169)
Mar 10, 2025, 10:25:52 AM 0 10.00 GB 25.69 TB 10 GB 169 hdfs://{HIDDEN HOSTNAME HERE}:54310
2025-03-10 10:26:01,270 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps = 20 (default=20)
2025-03-10 10:26:01,270 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000)
2025-03-10 10:26:01,270 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000)
2025-03-10 10:26:01,271 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200)
2025-03-10 10:26:01,271 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648)
2025-03-10 10:26:01,271 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2025-03-10 10:26:01,271 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 50 (default=100)
2025-03-10 10:26:01,271 INFO balancer.Balancer: dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2025-03-10 10:26:01,271 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
2025-03-10 10:26:01,271 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728)
Mar 10, 2025, 10:26:01 AM Balancing took 1.8892166666666668 minutes
2025-03-10 10:26:01,301 ERROR balancer.Balancer: Exiting balancer due an exception
org.apache.hadoop.metrics2.MetricsException: Metrics source Balancer-BP-716662839-{HIDDEN IP HERE}-1737639021855 already exists!
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:868)
    at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)

On Mon, Mar 10, 2025 at 00:32, Zhanghaobo <hfutzhan...@163.com> wrote:

> Try deleting /system/balancer.id and search for error or warn logs in
> the namenode.
>
> ---- Replied Message ----
> From Sébastien Rebecchi <srebec...@kameleoon.com.INVALID>
> Date 3/9/2025 23:08
> To Zhanghaobo <hfutzhan...@163.com>
> Cc hadoop-user-maillist <user@hadoop.apache.org>, hdfs-dev <hdfs-...@hadoop.apache.org>
> Subject Re: Can not run HDFS balancer cause metrics already exists
>
> I got the same error (metrics already exists) after adding -asService to
> the command line; the only difference is that it retries every 5 minutes:
>
> 2025-03-09 15:05:04,542 INFO balancer.Balancer: Finished one round, will wait for 5.0 minutes for next round
>
> That does not seem like a good workaround. My cluster has hundreds of TB to
> rebalance when adding a data node, and I don't remember having such issues
> when I was using Hadoop 2.9.1.
> Is there any issue with the balancer in recent Hadoop versions?
>
> Thanks,
> Sébastien
>
> On Sun, Mar 9, 2025 at 16:02, Sébastien Rebecchi <srebec...@kameleoon.com>
> wrote:
>
>> OK, I can try that then, hoping it will help.
>> By the way, even if it works, it does not explain this metrics exception.
>> Any idea how to solve this? I can't find a way to delete that metrics
>> source in any Hadoop doc.
>>
>> Thanks
>>
>> Sébastien.
>>
>> On Sun, Mar 9, 2025 at 15:39, Zhanghaobo <hfutzhan...@163.com> wrote:
>>
>>> Got it, you can run it as a service and see what happens.
>>>
>>> ---- Replied Message ----
>>> From Sébastien Rebecchi <srebec...@kameleoon.com>
>>> Date 03/09/2025 22:22
>>> To Zhanghaobo <hfutzhan...@163.com>
>>> Cc user@hadoop.apache.org、hdfs-...@hadoop.apache.org
>>> Subject Re: Can not run HDFS balancer cause metrics already exists
>>>
>>> Hi Zhanghaobo,
>>>
>>> Thanks for the message.
>>>
>>> No, I don't run it as a service. As I said, the command line is the
>>> following:
>>>
>>> hdfs balancer -Ddfs.balancer.movedWinWidth=5400000
>>> -Ddfs.balancer.moverThreads=1000 -Ddfs.balancer.dispatcherThreads=200
>>> -Ddfs.datanode.balance.max.concurrent.moves=50
>>> -Ddfs.datanode.balance.bandwidthPerSec=100m
>>> -Ddfs.balancer.max-size-to-move=10737418240 -threshold 1
>>>
>>> Also, no other balancer is running concurrently on any other node.
>>>
>>> Sébastien
>>>
>>> On Sun, Mar 9, 2025 at 13:57, Zhanghaobo <hfutzhan...@163.com> wrote:
>>>
>>>> Hi, @Sébastien Rebecchi
>>>> I don't know the details of how you start the balancer. Did you use
>>>> -asService?
>>>>
>>>> ---- Replied Message ----
>>>> From Sébastien Rebecchi <srebec...@kameleoon.com.INVALID>
>>>> Date 3/9/2025 18:03
>>>> To <user@hadoop.apache.org>, <hdfs-...@hadoop.apache.org>
>>>> Subject Re: Can not run HDFS balancer cause metrics already exists
>>>>
>>>> Hello
>>>>
>>>> Could anyone help with this, please?
>>>> The situation is still the same after several days.
>>>> Some additional details:
>>>> - Hadoop version: 3.4.1
>>>> - balancer command line: hdfs balancer
>>>> -Ddfs.balancer.movedWinWidth=5400000 -Ddfs.balancer.moverThreads=1000
>>>> -Ddfs.balancer.dispatcherThreads=200
>>>> -Ddfs.datanode.balance.max.concurrent.moves=50
>>>> -Ddfs.datanode.balance.bandwidthPerSec=100m
>>>> -Ddfs.balancer.max-size-to-move=10737418240 -threshold 1
>>>>
>>>> Thank you
>>>>
>>>> On Tue, Mar 4, 2025 at 16:59, Sébastien Rebecchi <srebec...@kameleoon.com>
>>>> wrote:
>>>>
>>>>> Hello
>>>>>
>>>>> After adding a new node to my HDFS cluster, I tried running the
>>>>> balancer, but it always fails with the following error, even after
>>>>> retrying multiple times during the day, and even after restarting
>>>>> the name node.
>>>>> What should I do to unblock this?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Sébastien
>>>>>
>>>>> ERROR balancer.Balancer: Exiting balancer due an exception
>>>>> org.apache.hadoop.metrics2.MetricsException: Metrics source
>>>>> Balancer-{HERE REPLACE BY CLUSTER'S BLOCK POOL ID} already exists!
>>>>> at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>>>>> at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>>>>> at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>>>>> at org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
>>>>> at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
>>>>> at org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
>>>>> at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:868)
>>>>> at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>>>>> at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)