It still feels like the region count is simply too high. With only 14 machines, 4,800 regions is enough by itself to put the cluster in an unhealthy state.

Does this table take real-time writes, and can they be paused for a while? For example, pick a night, create a new table pre-split into 1,024 regions, read the old table and write the data into the new one, then disable and drop the old table and rename the new table to the old name. If you are fairly comfortable with HBase, you can also disable the old table, write an MR job that reads a snapshot of it and produces HFiles, and then bulkload those HFiles into the new table; that is usually quite a bit faster.

For the new table I would at least use ConstantSizeRegionSplitPolicy (I think that is the name, I do not remember exactly), or even disable splitting altogether. With just 14 machines, 1,024 regions is already plenty; more than that does not buy you much.
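To make the table-rebuild steps concrete, a rough HBase shell sketch follows. The column family name 'f', the temporary table and snapshot names, and the exact create/SPLIT_POLICY/bulkload syntax are assumptions to double-check against the 2.1.9 shell, not details taken from the cluster described below:

    # new table, pre-split into 1024 regions; ConstantSizeRegionSplitPolicy (or
    # DisabledRegionSplitPolicy, to turn splitting off entirely) keeps the count stable
    create 'sdhz_user_info_realtime_new', {NAME => 'f'}, \
      {NUMREGIONS => 1024, SPLITALGO => 'HexStringSplit'}, \
      {METADATA => {'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'}}

    # if the data is copied via an MR job over a snapshot that writes HFiles,
    # load them with something like:
    #   hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles <hdfs-hfile-dir> sdhz_user_info_realtime_new

    # once the new table is verified, swap the names; HBase has no rename command,
    # so snapshot + clone_snapshot is the usual workaround
    disable 'sdhz_user_info_realtime'
    drop 'sdhz_user_info_realtime'
    disable 'sdhz_user_info_realtime_new'
    snapshot 'sdhz_user_info_realtime_new', 'rename_tmp_snap'
    clone_snapshot 'rename_tmp_snap', 'sdhz_user_info_realtime'
    delete_snapshot 'rename_tmp_snap'
    drop 'sdhz_user_info_realtime_new'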
邢* <xingxuem...@163.com> wrote on Thu, Mar 30, 2023 at 17:59:
>
> Thanks for the reply. Since the snapshot time for this large table only started growing noticeably recently, we still suspect something is wrong in the HBase cluster. I have re-organized the full history of the problem below; could you please take another look? Thanks again.
>
> 1. HBase cluster information:
> 3 virtual machines: HBase master, HDFS namenode, journalnode, failover controller
> 14 physical machines (96 cores, 376 GB RAM, SSD disks): HBase regionserver, HDFS datanode
>
> HBase version: 2.1.9
> Hadoop version: hadoop-3.0.0-cdh-6.2.0
> Table: sdhz_user_info_realtime
> Size: 9.5 TB
> 1 column family
>
> 2. Recent operations:
> January 2023: sdhz_user_info_realtime had 1024 regions.
> February 2023: because sdhz_user_info_realtime contained a lot of dirty data, we scanned it region by region and deleted the useless columns.
> End of February 2023: we found the region count of sdhz_user_info_realtime had exploded to more than 4000. We then merged regions by hand, which created holes in hbase:meta, so we gave up on merging; the table is now at 4800 regions.
> Night of March 16, 2023: the snapshot of sdhz_user_info_realtime timed out (timeout was 5 minutes).
> March 17, 2023: raised the snapshot timeout to 15 minutes.
> March 19, 2023: the snapshot timed out again.
> March 20, 2023: raised the snapshot timeout to 60 minutes; each snapshot now takes roughly half an hour.
>
> 3. Questions we would like answered:
> Why does the snapshot take so long?
> How do we bring the region count back down?
>
> 4. What we found while tracking the problem:
>
> During the snapshot, the time goes into flushing the regions and into verifying the snapshot.
> Before the "Done waiting - online snapshot for" log line appears, the cluster keeps flushing. All regions together take a little over 20 minutes; each region flush takes about 10-20 s and writes a few MB to a few tens of MB of memstore; there are roughly 340 regions per regionserver.
> After the "Done waiting - online snapshot for" log line appears, the snapshot is verified, moved, and so on, which takes another 10 minutes.
>
> DFSClient log found on the regionserver:
> 2023-03-29 16:24:55,395 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /hbase/.hbase-snapshot/.tmp/sdhz_user_info_realtime_1680076952828/region-manifest.603b4d8028af279648af4bfaa3889fd0 retrying...
>
> Logs found on the datanode:
> During incremental block reporting, the same block is reported several times; see blk_1109695462_35954673 in the logs below. We suspect that some IBRs sent from the DN to the NN fail and their blocks go back into pendingIBR for retry, so we want to use arthas to watch the putMissing method on the IBR reporting path.
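A rough sketch of what that arthas watch could look like. The class and method names are the ones from the DataNode log and the plan described above; the watch expression, expansion depth and hit limit are only examples and may need adjusting:

    # attach arthas to the DataNode JVM (replace <dn-pid> with the DataNode's pid)
    java -jar arthas-boot.jar <dn-pid>

    # print the arguments every time blocks are re-queued for a retried IBR;
    # -x 2 expands nested objects, -n 20 stops after 20 hits
    watch org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager putMissing '{params}' -x 2 -n 20

If putMissing fires frequently, that would support the pendingIBR-retry suspicion above.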
> 2023-03-29 16:16:51,367 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695462_35954673, status: RECEIVING_BLOCK, delHint: null, blk_1109695456_35954667, status: RECEIVED_BLOCK, delHint: null], DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695459_35954670, status: RECEIVING_BLOCK, delHint: null]]
> 2023-03-29 16:16:51,367 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695462_35954673, status: RECEIVING_BLOCK, delHint: null, blk_1109695456_35954667, status: RECEIVED_BLOCK, delHint: null], DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695459_35954670, status: RECEIVING_BLOCK, delHint: null]]
> 2023-03-29 16:16:51,370 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695452_35954663, status: RECEIVED_BLOCK, delHint: null]]
> 2023-03-29 16:16:51,372 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695459_35954670, status: RECEIVED_BLOCK, delHint: null]]
> 2023-03-29 16:16:51,380 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695462_35954673, status: RECEIVED_BLOCK, delHint: null], DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695468_35954679, status: RECEIVING_BLOCK, delHint: null]]
> 2023-03-29 16:16:51,552 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695391_35954602, status: RECEIVED_BLOCK, delHint: null]]
> 2023-03-29 16:16:51,574 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695455_35954666, status: RECEIVED_BLOCK, delHint: null]]
> 2023-03-29 16:16:52,094 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695468_35954679, status: RECEIVED_BLOCK, delHint: null]]
> 2023-03-29 16:16:53,164 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695447_35954658, status: RECEIVED_BLOCK, delHint: null]]
> 2023-03-29 16:16:55,574 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695351_35954562, status: RECEIVED_BLOCK, delHint: null]]
> 2023-03-29 16:16:56,894 DEBUG org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call blockReceivedAndDeleted: [DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695391_35954602, status: RECEIVED_BLOCK, delHint: null, blk_1109695462_35954673, status: RECEIVED_BLOCK, delHint: null, blk_1109695455_35954666, status: RECEIVED_BLOCK, delHint: null], DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695351_35954562, status: RECEIVED_BLOCK, delHint: null, blk_1109695459_35954670, status: RECEIVED_BLOCK, delHint: null, blk_1109695468_35954679, status: RECEIVED_BLOCK, delHint: null, blk_1109695447_35954658, status: RECEIVED_BLOCK, delHint: null, blk_1109695452_35954663, status: RECEIVED_BLOCK, delHint: null]]
>
> Logs found on the namenode:
>
> 2023-03-29 16:24:54,995 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1109699571_35958782 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /hbase/.hbase-snapshot/.tmp/sdhz_user_info_realtime_1680076952828/region-manifest.603b4d8028af279648af4bfaa3889fd0
>
> 2023-03-30 12:21:39,963 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{ALL_SSD:12, storageTypes=[SSD], creationFallbacks=[DISK], replicationFallbacks=[DISK]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
>
> 2023-03-30 12:22:53,300 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{ALL_SSD:12, storageTypes=[SSD], creationFallbacks=[DISK], replicationFallbacks=[DISK]}, newBlock=true)
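For reference, the DEBUG placement output quoted next can be obtained by raising the two loggers named in the warning on the running NameNode, without a restart; a sketch, with the NameNode host and HTTP port left as placeholders (set them back to INFO afterwards):

    hadoop daemonlog -setlevel <nn-host>:<nn-http-port> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy DEBUG
    hadoop daemonlog -setlevel <nn-host>:<nn-http-port> org.apache.hadoop.net.NetworkTopology DEBUG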
> 2023-03-30 12:22:53,300 DEBUG org.apache.hadoop.net.NetworkTopology: Choosing random from 10 available nodes on node /, scope=, excludedScope=/TX/T10/10, excludeNodes=[x.x.x.x:9866]. numOfDatanodes=10.
> 2023-03-30 12:22:53,300 DEBUG org.apache.hadoop.net.NetworkTopology: Choosing random from 12 available nodes on node /, scope=, excludedScope=/TX/T01/02, excludeNodes=[x.x.x.x:9866]. numOfDatanodes=12.
> 2023-03-30 12:22:53,300 DEBUG org.apache.hadoop.net.NetworkTopology: nthValidToReturn is 5
> 2023-03-30 12:22:53,300 DEBUG org.apache.hadoop.net.NetworkTopology: nthValidToReturn is 5
> 2023-03-30 12:22:53,300 DEBUG org.apache.hadoop.net.NetworkTopology: Chosen node x.x.x.x:9866 from first random
> 2023-03-30 12:22:53,300 DEBUG org.apache.hadoop.net.NetworkTopology: chooseRandom returning x.x.x.x:9866
> 2023-03-30 12:22:53,300 DEBUG org.apache.hadoop.net.NetworkTopology: Chosen node x.x.x.x:9866 from first random
> 2023-03-30 12:22:53,300 DEBUG org.apache.hadoop.net.NetworkTopology: chooseRandom returning x.x.x.x:9866
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.net.NetworkTopology: Choosing random from 3 available nodes on node /TX/T01/04, scope=/TX/T01/04, excludedScope=null, excludeNodes=[x.x.x.x:9866, x.x.x.x:9866]. numOfDatanodes=4.
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.net.NetworkTopology: Choosing random from 0 available nodes on node /TX/T02/02, scope=/TX/T02/02, excludedScope=null, excludeNodes=[x.x.x.x:9866, x.x.x.x:9866]. numOfDatanodes=1.
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.net.NetworkTopology: nthValidToReturn is 0
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.net.NetworkTopology: Chosen node x.x.x.x:9866 from first random
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.net.NetworkTopology: chooseRandom returning x.x.x.x:9866
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.net.NetworkTopology: chooseRandom returning null
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.net.NetworkTopology: No node to choose.
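Since the placement failures above happen under the ALL_SSD policy and the /TX/... rack topology, a few standard HDFS commands can confirm what the NameNode actually sees; a rough sketch (paths and interpretation are up to you):

    # storage policy effective on the HBase root directory
    hdfs storagepolicies -getStoragePolicy -path /hbase

    # rack mapping of the datanodes as the NameNode sees it
    hdfs dfsadmin -printTopology

    # per-datanode storage types, capacity and remaining space
    hdfs dfsadmin -report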
> 2023-03-30 12:22:53,301 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1109782685_36041896, replicas=x.x.x.x:9866, x.x.x.x:9866, x.x.x.x:9866 for /hbase/WALs/hbase-virgo-rs-18.bigdata.bj.tx.lan,16020,1679431900739/hbase-virgo-rs-18.bigdata.bj.tx.lan%2C16020%2C1679431900739.hbase-virgo-rs-18.bigdata.bj.tx.lan%2C16020%2C1679431900739.regiongroup-0.1680150172590
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to choose from local rack (location = /TX/T02/02), retry with the rack of the next replica (location = /TX/T01/02)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:827)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:715)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:622)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:506)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:416)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:465)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:292)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:143)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:159)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2094)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2673)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:872)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:550)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>     at java.base/java.security.AccessController.doPrivileged(AccessController.java:691)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.net.NetworkTopology: Choosing random from 1 available nodes on node /TX/T01/02, scope=/TX/T01/02, excludedScope=null, excludeNodes=[x.x.x.x:9866, x.x.x.x:9866]. numOfDatanodes=2.
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.net.NetworkTopology: nthValidToReturn is 0
> 2023-03-30 12:22:53,301 DEBUG org.apache.hadoop.net.NetworkTopology: chooseRandom returning x.x.x.x:9866
> 2023-03-30 12:22:53,301 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1109782686_36041897, replicas=x.x.x.x:9866, x.x.x.x:9866, x.x.x.x:9866 for /hbase/WALs/hbase-virgo-rs-12.bigdata.bj.tx.lan,16020,1679299081702/hbase-virgo-rs-12.bigdata.bj.tx.lan%2C16020%2C1679299081702.hbase-virgo-rs-12.bigdata.bj.tx.lan%2C16020%2C1679299081702.regiongroup-0.1680150171428
> 2023-03-30 12:22:53,311 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /hbase/WALs/hbase-virgo-rs-18.bigdata.bj.tx.lan,16020,1679431900739/hbase-virgo-rs-18.bigdata.bj.tx.lan%2C16020%2C1679431900739.hbase-virgo-rs-18.bigdata.bj.tx.lan%2C16020%2C1679431900739.regiongroup-0.1680149941088 is closed by DFSClient_NONMAPREDUCE_-1935786627_1
> 2023-03-30 12:22:53,311 DEBUG org.apache.hadoop.net.NetworkTopology: Choosing random from 13 available nodes on node /, scope=, excludedScope=/TX/T01/03, excludeNodes=[x.x.x.x:9866]. numOfDatanodes=13.
> 2023-03-30 12:22:53,311 DEBUG org.apache.hadoop.net.NetworkTopology: nthValidToReturn is 6
> 2023-03-30 12:22:53,311 DEBUG org.apache.hadoop.net.NetworkTopology: Chosen node x.x.x.x:9866 from first random
> 2023-03-30 12:22:53,311 DEBUG org.apache.hadoop.net.NetworkTopology: chooseRandom returning x.x.x.x:9866
> 2023-03-30 12:22:53,311 DEBUG org.apache.hadoop.net.NetworkTopology: Choosing random from 3 available nodes on node /TX/T01/04, scope=/TX/T01/04, excludedScope=null, excludeNodes=[x.x.x.x:9866, x.x.x.x:9866]. numOfDatanodes=4.
>
>
> On 2023-03-29 22:00:18, "张铎(Duo Zhang)" <palomino...@gmail.com> wrote:
> >It depends on how much data each flush writes out, and on the HDFS write speed; ten-odd seconds is usually acceptable, not especially slow. Do you have monitoring of the overall cluster load and IO while the snapshot is running?
> >
> >邢* <xingxuem...@163.com> wrote on Wed, Mar 29, 2023 at 19:20:
> >
> >> hi all,
> >>
> >> Asking the community for help with an HBase issue. Taking a snapshot of one large table in our HBase cluster has recently become very slow, about half an hour. Going through the logs, each region flush takes roughly ten-odd seconds; why would a flush take that long?
> >>
> >> Current cluster: 14 regionserver machines (96 cores / 384 GB each) plus 3 zookeeper machines (8 cores / 32 GB each); HBase version 2.1.9.
> >>
> >> The log records found so far:
> >>
> >> Regionserver DFSClient log:
> >> 2023-03-29 16:24:55,395 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /hbase/.hbase-snapshot/.tmp/sdhz_user_info_realtime_1680076952828/region-manifest.603b4d8028af279648af4bfaa3889fd0 retrying...
> >>
> >> Namenode log:
> >> 2023-03-29 16:24:54,995 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1109699571_35958782 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /hbase/.hbase-snapshot/.tmp/sdhz_user_info_realtime_1680076952828/region-manifest.603b4d8028af279648af4bfaa3889fd0
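On the "COMMITTED but not COMPLETE" retries against the region-manifest file, it may also help to look at how HDFS sees the snapshot staging area and the minimum-replication setting while a snapshot is in flight; a rough sketch with standard HDFS commands (the path is just the staging directory from the logs above):

    # the "minimum = 1" in the NameNode message comes from this setting
    hdfs getconf -confKey dfs.namenode.replication.min

    # block and replica-location view of the snapshot staging area while a snapshot runs
    hdfs fsck /hbase/.hbase-snapshot/.tmp -files -blocks -locations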