[ 
https://issues.apache.org/jira/browse/HBASE-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419556#comment-17419556
 ] 

Anoop Sam John commented on HBASE-26287:
----------------------------------------

Why is the SSH for the old RS (the one that hosted the NS region) not kicking 
in? Did you lose the MasterProcWAL? I believe HBCK2 has an option to assign the 
NS region when it is stuck in this state. We should use such tools in this 
case, IMO.
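
As a rough illustration of the HBCK2 route (a sketch, not a tested runbook): 
the encoded region name printed in the isRegionOnline WARN line is the argument 
that HBCK2's assigns command takes. The helper below pulls that name out of a 
log line and builds the command string. The function name and the jar path are 
hypothetical, and the helper only returns the command text; it never runs it.

```python
import re

# Matches the HMaster.isRegionOnline WARN line quoted in this issue.
# "\\?" tolerates the Jira-escaped "\{" that appears in the pasted log.
STUCK_RE = re.compile(
    r"is NOT online; "
    r"state=\\?\{(?P<encoded>[0-9a-f]{32}) state=(?P<state>[A-Z_]+), "
    r"ts=\d+, server=(?P<server>[^}]+)\}; "
    r"ServerCrashProcedures=(?P<scp>true|false)"
)


def suggest_assign(log_line, jar="hbase-hbck2.jar"):
    """Return an HBCK2 'assigns' command for a stuck region, or None.

    The jar path is a placeholder (assumption); use the real HBCK2 jar.
    Nothing is suggested when the line does not match, or when a
    ServerCrashProcedure is already reported for the hosting server.
    """
    m = STUCK_RE.search(log_line)
    if m is None or m.group("scp") == "true":
        return None
    return f"hbase hbck -j {jar} assigns {m.group('encoded')}"
```

Fed the WARN line from the log below (joined back into a single line), this 
should yield an assigns command for c0b2d4af686dc6b1c98dd9c866fe7607. Whether 
a forced assign is safe still depends on confirming that the old RS is really 
gone.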

> The initialization of master could not be completed when the hbase:namespace 
> region is not online
> ------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-26287
>                 URL: https://issues.apache.org/jira/browse/HBASE-26287
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 2.3.5
>            Reporter: bolao
>            Priority: Major
>
> After an HBase cluster shuts down unexpectedly and is restarted, we sometimes 
> find that the master cannot initialize because it is stuck in the 
> isRegionOnline method for hbase:namespace. From the master logs, we found 
> that the master and the meta table think the hbase:namespace region is online, 
> but its regionserver is dead. isRegionOnline prints a log message about this 
> every minute and does nothing else. I think we could remove the record from 
> the AssignmentManager's RegionState and assign hbase:namespace to another 
> regionserver, so that the HBase cluster recovers without human intervention. 
> I came to ask for your advice. What do you think?
> {panel:title=the logs of master}
> 2021-09-02 18:32:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster] 
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229) 
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT 
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN, 
> ts=1630577738198, 
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
>  ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
> 2021-09-02 18:33:01 [master/fx-hd-sc-hbase-master-0:16000.Chore.1] INFO 
> org.apache.hadoop.hbase.ChoreService.scheduleChore(157) -Chore ScheduledChore 
> name=fx-hd-sc-hbase-master-0.fx-hd-sc.fx-ns.svc.cluster.xjht,16000,1630401705440-ClusterStatusChore,
>  period=60000, unit=MILLISECONDS is enabled.
> 2021-09-02 18:33:01 [master/fx-hd-sc-hbase-master-0:16000.Chore.1] INFO 
> org.apache.hadoop.hbase.ScheduledChore.run(172) -Chore: 
> fx-hd-sc-hbase-master-0.fx-hd-sc.fx-ns.svc.cluster.xjht,16000,1630401705440-ClusterStatusChore
>  missed its start time
> 2021-09-02 18:33:41 [ProcExecTimeout] INFO 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
>  -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown 
> servers
> 2021-09-02 18:33:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster] 
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229) 
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT 
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN, 
> ts=1630577738198, 
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
>  ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
> 2021-09-02 18:34:31 [qtp780802740-4192] INFO http.requests.master.write(60) 
> -15.22.70.168 - - [02/Sep/2021:10:34:31 +0000] "GET 
> //15.22.70.168:1601/master-status HTTP/1.1" 200 54124 
> 2021-09-02 18:34:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster] 
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229) 
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT 
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN, 
> ts=1630577738198, 
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
>  ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
> 2021-09-02 18:34:51 [qtp780802740-4202] INFO http.requests.master.write(60) 
> -15.22.70.168 - - [02/Sep/2021:10:34:51 +0000] "GET 
> //15.22.70.168:1601/master-status HTTP/1.1" 200 54122 
> 2021-09-02 18:35:41 [ProcExecTimeout] INFO 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
>  -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown 
> servers
> 2021-09-02 18:35:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster] 
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229) 
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT 
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN, 
> ts=1630577738198, 
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
>  ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
> 2021-09-02 18:36:20 [qtp780802740-4192] INFO http.requests.master.write(60) 
> -15.22.70.168 - - [02/Sep/2021:10:36:20 +0000] "GET 
> //15.22.70.168:1601/master-status HTTP/1.1" 200 54122 
> 2021-09-02 18:36:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster] 
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229) 
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT 
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN, 
> ts=1630577738198, 
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
>  ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
> 2021-09-02 18:36:57 [RSProcedureDispatcher-pool4-t23] WARN 
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.scheduleForRetry(323)
>  -request to 
> fx-hd-sc-hbase-slave-15.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1630405458162 
> failed due to org.apache.hadoop.hbase.ipc.CallTimeoutException: Call to 
> fx-hd-sc-hbase-slave-15.fx-hd-sc.fx-ns.svc.cluster.xjht/172.49.9.38:16020 
> failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: 
> Call[id=6192,methodName=ExecuteProcedures], waitTime=600008, 
> rpcTimeout=600000, try=7, retrying...
> 2021-09-02 18:37:42 [ProcExecTimeout] INFO 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
>  -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown 
> servers
> 2021-09-02 18:37:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster] 
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229) 
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT 
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN, 
> ts=1630577738198, 
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
>  ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
> 2021-09-02 18:38:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster] 
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229) 
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT 
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN, 
> ts=1630577738198, 
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
>  ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
> 2021-09-02 18:38:49 [zk-event-processor-pool1-t1] INFO 
> org.apache.hadoop.hbase.security.token.ZKSecretWatcher.nodeDeleted(94) -Node 
> deleted id=168
> 2021-09-02 18:39:42 [ProcExecTimeout] INFO 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.periodicExecute(1334)
>  -Found 0 OPEN regions on dead servers and 177568 OPEN regions on unknown 
> servers
> 2021-09-02 18:39:46 [master/fx-hd-sc-hbase-master-0:16000:becomeActiveMaster] 
> WARN org.apache.hadoop.hbase.master.HMaster.isRegionOnline(1229) 
> -hbase:namespace,,1477036226969.c0b2d4af686dc6b1c98dd9c866fe7607. is NOT 
> online; state=\{c0b2d4af686dc6b1c98dd9c866fe7607 state=OPEN, 
> ts=1630577738198, 
> server=fx-hd-sc-hbase-slave-10.fx-hd-sc.fx-ns.svc.cluster.xjht,16020,1628591903870};
>  ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
>  
> {panel:title=the code of master}
> https://github.com/apache/hbase/blob/fd3fdc08d1cd43eb3432a1a70d31c3aece6ecabe/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L1214
> {panel}
>  
> {panel}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
