Oh, there is a typo: I meant that the ServerCrashProcedure should not block other procedures if it is in the claim replication queue stage.
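The replication-peer question raised in this thread can be checked non-interactively. A sketch, not a tested runbook: it assumes a live cluster with the `hbase` CLI on the PATH, and the standard `list_peers`, `list_procedures` and `balancer_switch` shell commands of HBase 2.x:

```shell
# If list_peers prints no peers, the SCP has no replication queues to
# claim and should not stall in SERVER_CRASH_CLAIM_REPLICATION_QUEUES.
echo "list_peers" | hbase shell -n

# The pre-shutdown checks discussed in this thread, likewise non-interactive:
echo "list_procedures" | hbase shell -n        # expect no pending procedures
echo "balancer_switch false" | hbase shell -n  # disable the balancer
```

These only make sense against the running cluster; verify each command against your HBase version before relying on it.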
张铎(Duo Zhang) <palomino...@gmail.com> wrote on Tue, Apr 30, 2024 at 10:41: > Sorry to be a pain; the procedure store was a big problem before HBase > 2.3, so we did a big refactoring in HBase 2.3+, which requires a migration > that makes the upgrade a bit complicated. > > And regarding the upgrade: you do not need to mix up HBase and Hadoop, you can > upgrade them separately. Second, a rolling upgrade is also a bit > complicated, so I suggest you try a full down/up upgrade first; once you > have successfully done an upgrade, you can start to try a rolling > upgrade. > > For your scenario, I suggest you first upgrade Hadoop, including the > namenode and datanodes; HBase should be functional after that upgrade. And > then, as discussed above, turn off the balancer, check the master page to > make sure there are no RITs and no procedures, then shut down the master, and > then shut down all the region servers. And then start the master (no need to > wait until the master finishes starting up, as that relies on the meta region being online, > for which we need at least one region server), and then all the region > servers, to see if the cluster can go back to normal. > > On the ServerCrashProcedure: it is blocked in claim replication queue, > which should not block other procedures, as the region assignment should > have already been finished. Does your cluster have replication peers? If > not, it is a bit strange that your procedure is blocked in the claim > replication queue stage… > > Thanks. > > Udo Offermann <udo.offerm...@zfabrik.de> wrote on Mon, Apr 29, 2024 at 21:26: > >> This time we made progress. >> I first upgraded the master, Hadoop- and HBase-wise (after making sure that >> there were no regions in transition and no running procedures), while keeping >> Zookeeper running. The master was started with the new version 2.8.5, reporting that >> there were 6 nodes with an inconsistent version (which was to be expected). 
Now >> the startup process completed with "Starting cluster schema service >> COMPLETE“, >> all regions were assigned and the cluster seemed to be stable. >> >> Again there were no regions in transition and no procedures running, and >> so I started to upgrade the data nodes one by one. >> The problem now is that the new region servers are not assigned any regions >> except for 3: hbase:namespace, hbase:meta and one of our application-level >> tables (which is empty most of the time). >> The more data nodes I migrated, the more regions accumulated on the >> nodes running the old version, until the last old data node was managing all >> regions except for those 3. >> >> >> >> After all regions had been transitioned I migrated the last node, which >> resulted in all regions being in transition, looking like this one: >> >> 2185 2184 WAITING_TIMEOUT seritrack >> TransitRegionStateProcedure table=tt_items, >> region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN Mon Apr 29 14:12:36 >> CEST 2024 Mon Apr 29 14:59:44 CEST 2024 pid=2185, ppid=2184, >> state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, >> locked=true; TransitRegionStateProcedure table=tt_items, >> region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN >> >> They are all waiting on this one: >> >> 2184 WAITING seritrack ServerCrashProcedure >> datanode06ct.gmd9.intern,16020,1714378085579 Mon Apr 29 14:12:36 CEST >> 2024 Mon Apr 29 14:12:36 CEST 2024 pid=2184, >> state=WAITING:SERVER_CRASH_CLAIM_REPLICATION_QUEUES, locked=true; >> ServerCrashProcedure datanode06ct.gmd9.intern,16020,1714378085579, >> splitWal=true, meta=false >> >> Again „ServerCrashProcedure“! Why are they not being processed? >> Why is it so hard to upgrade the cluster? Is it worthwhile to move to the >> next stable version, 2.5.8? >> And - btw - what is the difference between the two distributions „bin“ and >> „hadoop3-bin“? 
>> >> >> Best regards >> Udo >> >> >> >> >> >> > On 28.04.2024 at 03:03, 张铎(Duo Zhang) <palomino...@gmail.com> wrote: >> > >> > Better turn it off, and observe the master page until there are no RITs >> > and no other procedures, then call hbase-daemon.sh stop master, and >> > then hbase-daemon.sh stop regionserver. >> > >> > I'm not 100% sure about the shell commands, you'd better try them >> > yourself. The key here is to stop the master first and make sure there >> > are no procedures, so we can safely remove MasterProcWALs, and then stop >> > all region servers. >> > >> > Thanks. >> > >> > Udo Offermann <udo.offerm...@zfabrik.de> wrote on Fri, Apr 26, 2024 at 23:34: >> >> >> >> I know, but is it necessary or beneficial to turn it off - and if so - >> when? >> >> And what is your recommendation for stopping the region servers? Just >> >> hbase-daemon.sh stop regionserver >> >> or >> >> graceful_stop.sh localhost >> >> ? >> >> >> >>> On 26.04.2024 at 17:22, 张铎(Duo Zhang) <palomino...@gmail.com> wrote: >> >>> >> >>> Turning off the balancer is to make sure that the balancer will not >> >>> schedule any procedures to balance the cluster. >> >>> >> >>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on Fri, Apr 26, 2024 at 23:03: >> >>>> >> >>>> and what about turning off the HBase balancer before stopping the HMaster? >> >>>> >> >>>>> On 26.04.2024 at 17:00, Udo Offermann < >> udo.offerm...@zfabrik.de> wrote: >> >>>>> >> >>>>> So there is no need for >> >>>>> >> >>>>> hbase/bin/graceful_stop.sh localhost >> >>>>> >> >>>>> in order to stop the region servers? >> >>>>> >> >>>>>> On 26.04.2024 at 16:51, 张铎(Duo Zhang) < >> palomino...@gmail.com> wrote: >> >>>>>> >> >>>>>> The key here is to make sure there are no procedures in HBase, so we >> >>>>>> are safe to remove MasterProcWALs. >> >>>>>> >> >>>>>> And procedures can only be scheduled by the master. >> >>>>>> >> >>>>>> So once there are no procedures in HBase, you should stop the master >> >>>>>> first, and then you are free to stop all the region servers. 
And >> then >> >>>>>> you can proceed with the upgrade of hdfs/hadoop, and then restart >> >>>>>> the master and region servers with the new versions. >> >>>>>> >> >>>>>> You can have a try. >> >>>>>> >> >>>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on Fri, Apr 26, 2024 at 22:47: >> >>>>>>> >> >>>>>>> Ah, this sounds interesting! >> >>>>>>> >> >>>>>>> I need to think about how I'm going to manage this together with >> upgrading Hadoop. My strategy was to first upgrade Hadoop on all machines >> and then start HBase with the new version on all machines. But now I have >> to upgrade the master first - Hadoop- and HBase-wise - and then the data >> nodes one by one - again Hadoop- and HBase-wise. Is it also safe to do the >> HBase upgrade „inside“ a rolling Hadoop upgrade? >> >>>>>>> >> >>>>>>> I mean: >> >>>>>>> >> >>>>>>> 1) Upgrade master >> >>>>>>> >> >>>>>>> make sure there are no HBase procedures running >> >>>>>>> >> >>>>>>> hdfs dfsadmin -safemode enter >> >>>>>>> hdfs dfsadmin -rollingUpgrade prepare >> >>>>>>> kill hmaster >> >>>>>>> kill/stop zookeeper ??? >> >>>>>>> hdfs dfs -rm /hbase/MasterProcWALs/* >> >>>>>>> stop secondary and namenode >> >>>>>>> SWITCH-TO-NEW-VERSION >> >>>>>>> hadoop-daemon.sh start namenode -rollingUpgrade started >> >>>>>>> start secondary >> >>>>>>> start zookeeper >> >>>>>>> start hmaster >> >>>>>>>> The cluster should be in an intermediate state, where master >> >>>>>>>> is in new version but region servers remain in old version, but >> it >> >>>>>>>> should be functional. >> >>>>>>> >> >>>>>>> 2) Upgrade data node 1..6 >> >>>>>>> stop / kill region server ??? >> >>>>>>> hdfs dfsadmin -shutdownDatanode localhost:50020 upgrade >> >>>>>>> SWITCH-TO-NEW-VERSION >> >>>>>>> start datanode >> >>>>>>> start region server >> >>>>>>> >> >>>>>>> 3) Finalize upgrade >> >>>>>>> hdfs dfsadmin -rollingUpgrade finalize >> >>>>>>> start yarn processes >> >>>>>>> >> >>>>>>> Hmm, sounds like a plan, what do you think? 
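The plan above can be folded into a single shell sketch. This is NOT a tested runbook: the hostname, the datanode IPC port 50020, and the /hbase path are taken from this thread's setup and are assumptions to verify against your own cluster; the daemon scripts stand in for the "kill"/"stop"/"start" steps listed in the plan.

```shell
#!/bin/sh
# Sketch of the rolling-upgrade sequence discussed in this thread.
# Verify every step against your own cluster before running anything.

# 1) Upgrade the master node (no HBase procedures may be pending!).
echo "balancer_switch false" | hbase shell -n   # balancer must not schedule procedures
hdfs dfsadmin -safemode enter
hdfs dfsadmin -rollingUpgrade prepare
hbase-daemon.sh stop master                     # stop the master first
hdfs dfs -rm /hbase/MasterProcWALs/*            # safe only while no procedures exist
hadoop-daemon.sh stop secondarynamenode
hadoop-daemon.sh stop namenode
# ...switch this node to the new Hadoop/HBase version (Zookeeper can keep running)...
hadoop-daemon.sh start namenode -rollingUpgrade started
hadoop-daemon.sh start secondarynamenode
hbase-daemon.sh start master

# 2) Upgrade each data node in turn (repeat per node).
hbase-daemon.sh stop regionserver
hdfs dfsadmin -shutdownDatanode localhost:50020 upgrade
# ...switch this node to the new version...
hadoop-daemon.sh start datanode
hbase-daemon.sh start regionserver

# 3) Finalize once every node runs the new version.
hdfs dfsadmin -rollingUpgrade finalize
```

Per Duo's advice earlier in the thread, the region servers are deliberately left running during step 1; only the master is stopped before MasterProcWALs is removed.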
>> >>>>>>> >> >>>>>>> >> >>>>>>>> On 26.04.2024 at 16:25, 张铎(Duo Zhang) < >> palomino...@gmail.com> wrote: >> >>>>>>>> >> >>>>>>>> I think the cluster is not in a correct state; none of the SCPs >> has >> >>>>>>>> carryingMeta = true, but meta is not online... >> >>>>>>>> >> >>>>>>>> If you have gracefully shut down all the region servers, you >> should not >> >>>>>>>> delete all the MasterProcWALs, as there are already SCPs in it. >> This >> >>>>>>>> is how we deal with graceful shutdown: the master just does not >> process >> >>>>>>>> the SCPs, but they have already been scheduled... >> >>>>>>>> >> >>>>>>>> What I said above is to make sure that there are no procedures >> in the >> >>>>>>>> system, then kill the master directly, without shutting down all >> the >> >>>>>>>> region servers, remove MasterProcWALs, and then restart the master >> with >> >>>>>>>> the new code. The cluster should then be in an intermediate state, where the >> master >> >>>>>>>> is on the new version but the region servers remain on the old version, but >> it >> >>>>>>>> should be functional. And then you can rolling-upgrade the region >> >>>>>>>> servers one by one. >> >>>>>>>> >> >>>>>>>> You could try it again. >> >>>>>>>> >> >>>>>>>> Thanks. >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on Fri, Apr 26, 2024 at 22:03: >> >>>>>>>>> >> >>>>>>>>>> I think hostnames should be case insensitive? So why is there a >> >>>>>>>>>> 'DATANODE01CT' and then a 'DATANODE01ct'? >> >>>>>>>>> Well observed ;-) I was asked by our customer to disguise the >> server names, and I missed some of them when searching and replacing, but I >> can assure you that all server names are correct and we have never had any >> problems with them. >> >>>>>>>>> >> >>>>>>>>> The cluster consists of 7 servers: one master and 6 data nodes >> running on Alma Linux (version 8 I believe) and Java 8 (updated only some >> weeks ago). 
The master is running the Hadoop name node, secondary name node, YARN >> resource manager and history server, as well as the HBase Zookeeper and Master. >> The data nodes are running the data node, region server and YARN node manager. >> They're all virtual machines of the same size, RAM- (16GB) and CPU-wise (4 >> cores). The basic setup is from 2015 (with HBase 0.9, and we never changed it >> except for upgrading to HBase 1.0 and to HBase 2.2.5 in 2020), thus we have >> been running Hadoop/HBase for almost 10 years now without any major >> problems. >> >>>>>>>>> >> >>>>>>>>> The HBCKServerCrashProcedure comes from my attempt to recover >> the cluster as you advised me the other day: >> >>>>>>>>> >> >>>>>>>>>>>>>>> Then use HBCK2, to schedule a SCP for this region server, >> to see if it >> >>>>>>>>>>>>>>> can fix the problem. >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> https://github.com/apache/hbase-operator-tools/blob/master/hbase-hbck2/README.md >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> This is the document for HBCK2, you should use the >> scheduleRecoveries command. >> >>>>>>>>> >> >>>>>>>>> You can take it as an act of desperation ;-) >> >>>>>>>>> >> >>>>>>>>> I will take care of log4j2, but how can I get the cluster up >> and running? >> >>>>>>>>> >> >>>>>>>>> Best regards >> >>>>>>>>> Udo >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>>> On 26.04.2024 at 15:29, 张铎(Duo Zhang) < >> palomino...@gmail.com> wrote: >> >>>>>>>>>> >> >>>>>>>>>> It is a bit strange; why do you have a >> HBCKServerCrashProcedure? >> >>>>>>>>>> It should only appear when you use HBCK2 to force-schedule a >> SCP. >> >>>>>>>>>> And it is also a bit strange that all the SCPs are marked as >> not >> >>>>>>>>>> carrying meta... How many region servers do you have in your >> cluster? >> >>>>>>>>>> >> >>>>>>>>>> I think hostnames should be case insensitive? So why is there a >> >>>>>>>>>> 'DATANODE01CT' and then a 'DATANODE01ct'? 
>> >>>>>>>>>> >> >>>>>>>>>> And for HBase 2.5.x, we have switched to log4j2, instead >> of log4j. >> https://github.com/apache/hbase/blob/branch-2.5/conf/log4j2.properties >> >>>>>>>>>> >> >>>>>>>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on Fri, Apr 26, 2024 at >> 19:59: >> >>>>>>>>>>> >> >>>>>>>>>>> After resetting the VMs, we started a new upgrade attempt. >> >>>>>>>>>>> The Hadoop part ran smoothly again, but we got stuck again >> with HBase. >> >>>>>>>>>>> >> >>>>>>>>>>> Before upgrading HBase I turned off the balancer and stopped >> all region servers gracefully. I also deleted the MasterProcWALs folder in >> hdfs. >> >>>>>>>>>>> Then I started the master and region servers with version >> 2.5.7. Again the master got stuck at the „Starting assignment manager“ task. >> >>>>>>>>>>> >> >>>>>>>>>>> There are a number of server crash procedures that do not >> appear to be processed: >> >>>>>>>>>>> >> >>>>>>>>>>> HBase Shell >> >>>>>>>>>>> Use "help" to get list of supported commands. >> >>>>>>>>>>> Use "exit" to quit this interactive shell. 
>> >>>>>>>>>>> For Reference, please visit: >> http://hbase.apache.org/2.0/book.html#shell >> >>>>>>>>>>> Version 2.5.7, r6788f98356dd70b4a7ff766ea7a8298e022e7b95, Thu >> Dec 14 15:59:16 PST 2023 >> >>>>>>>>>>> Took 0.0016 seconds >> >>>>>>>>>>> hbase:001:0> list_procedures >> >>>>>>>>>>> PID Name State Submitted Last_Update Parameters >> >>>>>>>>>>> 1 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 2024-04-26 12:22:12 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE01CT", "port"=>16020, >> "startCode"=>"1714126714199"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 2 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 2024-04-26 12:22:18 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE02CT", "port"=>16020, >> "startCode"=>"1714126737220"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 3 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 2024-04-26 12:22:24 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE03CT", "port"=>16020, >> "startCode"=>"1714126742645"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 4 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 2024-04-26 12:22:37 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE05CT", "port"=>16020, >> "startCode"=>"1714126754579"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 5 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 2024-04-26 12:22:44 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE06CT", "port"=>16020, >> "startCode"=>"1714126762089"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 6 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 
2024-04-26 13:13:43 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE01ct", "port"=>16020, >> "startCode"=>"1714127123596"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 7 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 2024-04-26 13:13:53 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE02ct", "port"=>16020, >> "startCode"=>"1714127133136"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 8 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 2024-04-26 13:14:07 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE03ct", "port"=>16020, >> "startCode"=>"1714127138682"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 9 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 2024-04-26 13:14:17 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE05ct", "port"=>16020, >> "startCode"=>"1714127155080"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 10 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 2024-04-26 13:14:30 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE06ct", "port"=>16020, >> "startCode"=>"1714127158551"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 11 >> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure RUNNABLE >> 2024-04-26 13:16:57 +0200 2024-04-26 13:16:57 +0200 [{"state"=>[1, 3]}, >> {"serverName"=>{"hostName"=>"DATANODE04ct", "port"=>16020, >> "startCode"=>"1714126747741"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 12 >> org.apache.hadoop.hbase.master.procedure.HBCKServerCrashProcedure RUNNABLE >> 2024-04-26 13:22:16 +0200 2024-04-26 13:22:16 +0200 [{"state"=>[1, 3]}, >> 
{"serverName"=>{"hostName"=>"DATANODE03CT", "port"=>16020, >> "startCode"=>"1714130315364"}, "carryingMeta"=>false, >> "shouldSplitWal"=>true}] >> >>>>>>>>>>> 12 row(s) >> >>>>>>>>>>> Took 0.6564 seconds >> >>>>>>>>>>> >> >>>>>>>>>>> Strangely enough, the log files are empty: >> >>>>>>>>>>> >> >>>>>>>>>>> cat logs/hbase-seritrack-master-server.out >> >>>>>>>>>>> 13:31:57.280 >> [ActiveMasterInitializationMonitor-1714130217278] ERROR >> org.apache.hadoop.hbase.master.MasterInitializationMonitor - Master failed >> to complete initialization after 900000ms. Please consider submitting a bug >> report including a thread dump of this process. >> >>>>>>>>>>> >> >>>>>>>>>>> cat logs/hbase-seritrack-master-server.log >> >>>>>>>>>>> Fri Apr 26 13:16:47 CEST 2024 Starting master on master-server >> >>>>>>>>>>> core file size (blocks, -c) 0 >> >>>>>>>>>>> data seg size (kbytes, -d) unlimited >> >>>>>>>>>>> scheduling priority (-e) 0 >> >>>>>>>>>>> file size (blocks, -f) unlimited >> >>>>>>>>>>> pending signals (-i) 95119 >> >>>>>>>>>>> max locked memory (kbytes, -l) 64 >> >>>>>>>>>>> max memory size (kbytes, -m) unlimited >> >>>>>>>>>>> open files (-n) 1024 >> >>>>>>>>>>> pipe size (512 bytes, -p) 8 >> >>>>>>>>>>> POSIX message queues (bytes, -q) 819200 >> >>>>>>>>>>> real-time priority (-r) 0 >> >>>>>>>>>>> stack size (kbytes, -s) 8192 >> >>>>>>>>>>> cpu time (seconds, -t) unlimited >> >>>>>>>>>>> max user processes (-u) 95119 >> >>>>>>>>>>> virtual memory (kbytes, -v) unlimited >> >>>>>>>>>>> file locks (-x) unlimited >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> I have checked the settings: >> >>>>>>>>>>> >> >>>>>>>>>>> Submitted Log Name: org.apache.hadoop.hbase >> >>>>>>>>>>> Log Class: org.apache.logging.slf4j.Log4jLogger >> >>>>>>>>>>> Effective level: ERROR >> >>>>>>>>>>> >> >>>>>>>>>>> I then explicitly set the log level again: >> >>>>>>>>>>> >> >>>>>>>>>>> cat hbase/conf/log4j.properties >> >>>>>>>>>>> [...] 
>> >>>>>>>>>>> log4j.logger.org.apache.hadoop.hbase=INFO >> >>>>>>>>>>> >> >>>>>>>>>>> and >> >>>>>>>>>>> export HBASE_ROOT_LOGGER=hbase.root.logger=INFO,console >> >>>>>>>>>>> >> >>>>>>>>>>> And then restarted the HMaster - without success. >> >>>>>>>>>>> >> >>>>>>>>>>> Why does the log level remain at ERROR? >> >>>>>>>>>>> I'm pretty sure that the levels will be set to INFO at some >> point later on, but they remain at level ERROR during the startup phase. >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> Here I post the Zookeeper dump: >> >>>>>>>>>>> >> >>>>>>>>>>> HBase is rooted at /hbase >> >>>>>>>>>>> Active master address: >> >>>>>>>>>>> master-server,16000,1714130208769 >> >>>>>>>>>>> Backup master addresses: >> >>>>>>>>>>> Region server holding hbase:meta: >> >>>>>>>>>>> DATANODE03ct,16020,1714122680513 >> >>>>>>>>>>> Region servers: >> >>>>>>>>>>> DATANODE06ct,16020,1714130693358 >> >>>>>>>>>>> DATANODE03ct,16020,1714130672936 >> >>>>>>>>>>> DATANODE02ct,16020,1714130665456 >> >>>>>>>>>>> DATANODE01ct,16020,1714130653350 >> >>>>>>>>>>> DATANODE04ct,16020,1714130248620 >> >>>>>>>>>>> Quorum Server Statistics: >> >>>>>>>>>>> master-server:2181 >> >>>>>>>>>>> stat is not executed because it is not in the whitelist. >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> What do you have to do to resolve the server crash procedures? >> >>>>>>>>>>> >> >>>>>>>>>>> Best regards >> >>>>>>>>>>> Udo >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>>> On 23.04.2024 at 09:36, 张铎(Duo Zhang) < >> palomino...@gmail.com> wrote: >> >>>>>>>>>>>> >> >>>>>>>>>>>> Strange. I checked the code; it seems we get an NPE on this line: >> >>>>>>>>>>>> >> >>>>>>>>>>>> >> https://github.com/apache/hbase/blob/4d7ce1aac724fbf09e526fc422b5a11e530c32f0/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2872 >> >>>>>>>>>>>> >> >>>>>>>>>>>> Could you please confirm that you are connecting to the correct >> active master, >> the one which is hanging? 
It seems that you are connecting to the backup >> >>>>>>>>>>>> master... >> >>>>>>>>>>>> >> >>>>>>>>>>>> Thanks. >> >>>>>>>>>>>> >> >>>>>>>>>>>> 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Tue, Apr 23, 2024 at 15:31: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Ah, an NPE usually means a code bug; then there is no simple >> way to fix >> >>>>>>>>>>>>> it, we need to take a deep look at the code :( >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Sorry. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Udo Offermann <udo.offerm...@zfabrik.de> wrote on Mon, Apr 22, 2024 at >> 15:32: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Unfortunately not. >> >>>>>>>>>>>>>> I’ve found the node hosting the meta region and was able >> to run the HBCK2 scheduleRecoveries command using hbase-operator-tools-1.2.0. >> >>>>>>>>>>>>>> The tool however stops with an NPE: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> 09:22:00.532 [main] WARN >> org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop >> library for your platform... using builtin-java classes where applicable >> >>>>>>>>>>>>>> 09:22:00.703 [main] INFO >> org.apache.hadoop.conf.Configuration.deprecation - hbase.client.pause.cqtbe >> is deprecated. 
Instead, use hbase.client.pause.server.overloaded >> >>>>>>>>>>>>>> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client >> environment:zookeeper.version=3.8.3-6ad6d364c7c0bcf0de452d54ebefa3058098ab56, >> built on 2023-10-05 10:34 UTC >> >>>>>>>>>>>>>> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:host.name=HBaseMaster.gmd9.intern >> >>>>>>>>>>>>>> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:java.version=1.8.0_402 >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:java.vendor=Red Hat, Inc. >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client >> environment:java.home=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-2.el8.x86_64/jre >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client >> 
environment:java.class.path=hbase-operator-tools-1.2.0/hbase-hbck2/hbase-hbck2-1.2.0.jar:hbase/conf:/opt/seritrack/tt/jdk/lib/tools.jar:/opt/seritrack/tt/nosql/hbase:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-mapreduce-2.5.7.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/audience-annotations-0.13.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/commons-logging-1.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/htrace-core4-4.1.0-incubating.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jcl-over-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jul-to-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-api-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-context-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-semconv-1.15.0-alpha.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/slf4j-api-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-client-2.5.7.jar:/opt/seritrack/tt/nosql/pl_nosql_ext/libs/pl_nosql_ext-3.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-1.2-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-core-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-slf4j-impl-2.17.2.jar:/opt/seritrack/tt/prometheus_exporters/jmx_exporter/jmx_prometheus_javaagent.jar >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client >> environment:java.library.path=/opt/seritrack/tt/nosql/hadoop/lib/native >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client 
environment:java.io.tmpdir=/tmp >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:java.compiler=<NA> >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:os.name=Linux >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:os.arch=amd64 >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:os.version=4.18.0-513.18.1.el8_9.x86_64 >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:user.name=seritrack >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:user.home=/opt/seritrack >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:user.dir=/opt/seritrack/tt/nosql_3.0 >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:os.memory.free=275MB >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:os.memory.max=2966MB >> >>>>>>>>>>>>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Client environment:os.memory.total=361MB >> 
>>>>>>>>>>>>>> 09:22:00.771 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - >> Initiating client connection, connectString=HBaseMaster:2181 >> sessionTimeout=90000 >> watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$45/1091799416@aed32c5 >> >>>>>>>>>>>>>> 09:22:00.774 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.common.X509Util - >> Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable >> client-initiated TLS renegotiation >> >>>>>>>>>>>>>> 09:22:00.777 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxnSocket >> - jute.maxbuffer value is 1048575 Bytes >> >>>>>>>>>>>>>> 09:22:00.785 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxn - >> zookeeper.request.timeout value is 0. feature enabled=false >> >>>>>>>>>>>>>> 09:22:00.793 >> [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f-SendThread(HBaseMaster:2181)] >> INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxn - >> Opening socket connection to server HBaseMaster/10.21.204.230:2181. 
09:22:00.793 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f-SendThread(HBaseMaster:2181)] INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxn - SASL config status: Will not attempt to authenticate using SASL (unknown error)
09:22:00.797 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f-SendThread(HBaseMaster:2181)] INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.21.204.230:41072, server: HBaseMaster/10.21.204.230:2181
09:22:00.801 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f-SendThread(HBaseMaster:2181)] INFO org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ClientCnxn - Session establishment complete on server HBaseMaster/10.21.204.230:2181, session id = 0x10009a4f379001e, negotiated timeout = 90000
-1
Exception in thread "main" java.io.IOException: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): java.io.IOException
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:479)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
	at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
	at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.hbase.master.MasterRpcServices.shouldSubmitSCP(MasterRpcServices.java:2872)
	at org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2600)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
	... 3 more

	at org.apache.hadoop.hbase.client.HBaseHbck.scheduleServerCrashProcedures(HBaseHbck.java:198)
	at org.apache.hadoop.hbase.client.Hbck.scheduleServerCrashProcedure(Hbck.java:128)
	at org.apache.hbase.HBCK2.scheduleRecoveries(HBCK2.java:418)
	at org.apache.hbase.HBCK2.doCommandLine(HBCK2.java:960)
	at org.apache.hbase.HBCK2.run(HBCK2.java:830)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hbase.HBCK2.main(HBCK2.java:1145)
Caused by: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): java.io.IOException
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:479)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
	at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
	at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.hbase.master.MasterRpcServices.shouldSubmitSCP(MasterRpcServices.java:2872)
	at org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2600)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
	... 3 more

	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:340)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:92)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:595)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$BlockingStub.scheduleServerCrashProcedure(MasterProtos.java)
	at org.apache.hadoop.hbase.client.HBaseHbck.scheduleServerCrashProcedures(HBaseHbck.java:190)
	... 7 more
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): java.io.IOException
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:479)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
	at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
	at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.hbase.master.MasterRpcServices.shouldSubmitSCP(MasterRpcServices.java:2872)
	at org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2600)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
	... 3 more

	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:388)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420)
	at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114)
	at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129)
	at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:199)
	at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:220)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
	at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)

On 20.04.2024 at 15:53, 张铎 (Duo Zhang) <palomino...@gmail.com> wrote:

OK, it was waitForMetaOnline.

Maybe the problem is that you did have some correct procedures before upgrading, such as a ServerCrashProcedure, but then you deleted all the procedure WALs, so the ServerCrashProcedure is gone as well and meta can never come online.

Please check the /hbase/meta-region-server znode on ZooKeeper and dump its content. It is protobuf based, but you should still be able to read the encoded server name of the region server hosting the meta region.

Then use HBCK2 to schedule an SCP for that region server, to see if it fixes the problem.

https://github.com/apache/hbase-operator-tools/blob/master/hbase-hbck2/README.md

This is the documentation for HBCK2; you should use the scheduleRecoveries command.

Hope this fixes your problem.
Thread 92 (master/masterserver:16000:becomeActiveMaster):
  State: TIMED_WAITING
  Blocked count: 165
  Waited count: 404
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:125)
    org.apache.hadoop.hbase.master.HMaster.isRegionOnline(HMaster.java:1358)
    org.apache.hadoop.hbase.master.HMaster.waitForMetaOnline(HMaster.java:1328)
    org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1069)
    org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2405)
    org.apache.hadoop.hbase.master.HMaster.lambda$null$0(HMaster.java:565)
    org.apache.hadoop.hbase.master.HMaster$$Lambda$265/1598878738.run(Unknown Source)
    org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
    org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:177)
    org.apache.hadoop.hbase.master.HMaster.lambda$run$1(HMaster.java:562)
    org.apache.hadoop.hbase.master.HMaster$$Lambda$264/1129144214.run(Unknown Source)
    java.lang.Thread.run(Thread.java:750)

Udo Offermann <udo.offerm...@zfabrik.de> wrote on Sat, Apr 20, 2024, 21:13:

Master status for masterserver.gmd9.intern,16000,1713515965162 as of Fri Apr 19 10:55:22 CEST 2024

Version Info:
===========================================================
HBase 2.5.7
Source code repository git://buildbox.localdomain/home/apurtell/tmp/RM/hbase revision=6788f98356dd70b4a7ff766ea7a8298e022e7b95
Compiled by apurtell on Thu Dec 14 15:59:16 PST 2023
From source with checksum 1501d7fdf72398791ee335a229d099fc972cea7c2a952da7622eb087ddf975361f107cbbbee5d0ad6f603466e9afa1f4fd242ffccbd4371eb0b56059bb3b5402
Hadoop 2.10.2
Source code repository Unknown revision=965fd380006fa78b2315668fbc7eb432e1d8200f
Compiled by ubuntu on 2022-05-25T00:12Z

Tasks:
===========================================================
Task: Master startup
Status: RUNNING:Starting assignment manager
Running for 954s

Task: Flushing master:store,,1.1595e783b53d99cd5eef43b6debb2682.
Status: COMPLETE:Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY, failureReason:Nothing to flush,flush seq id14
Completed 49s ago
Ran for 0s

Task: RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000
Status: WAITING:Waiting for a call
Running for 951s

Task: RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000
Status: WAITING:Waiting for a call
Running for 951s

Servers:
===========================================================
servername1ct.gmd9.intern,16020,1713514863737: requestsPerSecond=0.0, numberOfOnlineRegions=0, usedHeapMB=37.0MB, maxHeapMB=2966.0MB, numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
servername2ct.gmd9.intern,16020,1713514925960: requestsPerSecond=0.0, numberOfOnlineRegions=0, usedHeapMB=20.0MB, maxHeapMB=2966.0MB, numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
servername3ct.gmd9.intern,16020,1713514937151: requestsPerSecond=0.0, numberOfOnlineRegions=0, usedHeapMB=67.0MB, maxHeapMB=2966.0MB, numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
servername4ct.gmd9.intern,16020,1713514968019: requestsPerSecond=0.0, numberOfOnlineRegions=0, usedHeapMB=24.0MB, maxHeapMB=2966.0MB, numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
servername5ct.gmd9.intern,16020,1713514979294: requestsPerSecond=0.0, numberOfOnlineRegions=0, usedHeapMB=58.0MB, maxHeapMB=2966.0MB, numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
servername6ct.gmd9.intern,16020,1713514994770: requestsPerSecond=0.0, numberOfOnlineRegions=0, usedHeapMB=31.0MB, maxHeapMB=2966.0MB, numberOfStores=0, numberOfStorefiles=0, storeRefCount=0, maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0, filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]

Regions-in-transition:
===========================================================

Executors:
===========================================================
Status for executor: Executor-4-MASTER_META_SERVER_OPERATIONS-master/masterserver:16000
=======================================
0 events queued, 0 running
Status for executor: Executor-6-MASTER_SNAPSHOT_OPERATIONS-master/masterserver:16000
=======================================
0 events queued, 0 running
Status for executor: Executor-3-MASTER_SERVER_OPERATIONS-master/masterserver:16000
=======================================
0 events queued, 0 running
Status for executor: Executor-5-M_LOG_REPLAY_OPS-master/masterserver:16000
=======================================
0 events queued, 0 running
Status for executor: Executor-2-MASTER_CLOSE_REGION-master/masterserver:16000
=======================================
0 events queued, 0 running
Status for executor: Executor-7-MASTER_MERGE_OPERATIONS-master/masterserver:16000
=======================================
0 events queued, 0 running
Status for executor: Executor-8-MASTER_TABLE_OPERATIONS-master/masterserver:16000
=======================================
0 events queued, 0 running
Status for executor: Executor-1-MASTER_OPEN_REGION-master/masterserver:16000
=======================================
0 events queued, 0 running

Stacks:
===========================================================
Process Thread Dump:
131 active threads
Thread 186 (WAL-Archive-0):
  State: WAITING
  Blocked count: 5
  Waited count: 11
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42f44d41
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750)
Thread 185 (Close-WAL-Writer-0):
  State: TIMED_WAITING
  Blocked count: 2
  Waited count: 6
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
    java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
    java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750)
Thread 152 (Session-Scheduler-3bc4ef12-1):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 1
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750)
Thread 151 (master/masterserver:16000:becomeActiveMaster-HFileCleaner.small.0-1713515973400):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@58626ec5
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.PriorityBlockingQueue.take(PriorityBlockingQueue.java:549)
    org.apache.hadoop.hbase.master.cleaner.HFileCleaner.consumerLoop(HFileCleaner.java:285)
    org.apache.hadoop.hbase.master.cleaner.HFileCleaner$2.run(HFileCleaner.java:269)
Thread 150 (master/masterserver:16000:becomeActiveMaster-HFileCleaner.large.0-1713515973400):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@18916420
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    org.apache.hadoop.hbase.util.StealJobQueue.take(StealJobQueue.java:101)
    org.apache.hadoop.hbase.master.cleaner.HFileCleaner.consumerLoop(HFileCleaner.java:285)
    org.apache.hadoop.hbase.master.cleaner.HFileCleaner$1.run(HFileCleaner.java:254)
Thread 149 (snapshot-hfile-cleaner-cache-refresher):
  State: TIMED_WAITING
  Blocked count: 4
  Waited count: 11
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:552)
    java.util.TimerThread.run(Timer.java:505)
Thread 148 (master/masterserver:16000.Chore.1):
  State: TIMED_WAITING
  Blocked count: 2
  Waited count: 10
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:750)
Thread 147 (OldWALsCleaner-1):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7a6a3b7e
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner.deleteFile(LogCleaner.java:172)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner.lambda$createOldWalsCleaner$1(LogCleaner.java:152)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner$$Lambda$494/556458560.run(Unknown Source)
    java.lang.Thread.run(Thread.java:750)
Thread 146 (OldWALsCleaner-0):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7a6a3b7e
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner.deleteFile(LogCleaner.java:172)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner.lambda$createOldWalsCleaner$1(LogCleaner.java:152)
    org.apache.hadoop.hbase.master.cleaner.LogCleaner$$Lambda$494/556458560.run(Unknown Source)
    java.lang.Thread.run(Thread.java:750)
Thread 139 (PEWorker-16):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 16
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:165)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:147)
    org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2113)
Thread 138 (PEWorker-15):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 16
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:165)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:147)
    org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2113)
Thread 137 (PEWorker-14):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 16
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:165)
    org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:147)
    org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2113)
Thread 136 (PEWorker-13):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 16
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    ...
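[Editor's note] The recovery steps suggested in the thread (dump the /hbase/meta-region-server znode, read the encoded ServerName, then schedule an SCP via HBCK2's scheduleRecoveries command) can be sketched as shell commands. This is a dry-run sketch only: the HBCK2 jar path and the server name are placeholders, not values from this thread, and the commands are echoed rather than executed because both require a live cluster.

```shell
#!/bin/sh
# Step 1: dump the znode holding the meta region location. The content is
# protobuf-framed, but the encoded ServerName is readable in the output.
ZNODE_CMD='hbase zkcli get /hbase/meta-region-server'

# Step 2: schedule a ServerCrashProcedure for that server via HBCK2.
# Placeholder ServerName in the usual host,port,startcode form.
SERVER_NAME='regionserver1.example.com,16020,1713514863737'
HBCK2_CMD="hbase hbck -j hbase-hbck2.jar scheduleRecoveries ${SERVER_NAME}"

# Dry run: print the commands instead of executing them.
echo "${ZNODE_CMD}"
echo "${HBCK2_CMD}"
```

Note that (as the NullPointerException above shows) scheduleRecoveries can fail on the master side; the jar version shipped with hbase-operator-tools should match the cluster's HBase line.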