[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421950#comment-16421950 ] Yi Liang commented on HBASE-19287: -- {quote} java.io.IOException: Call to abhishekk3.pne.ven.veritas.com/10.210.62.30:16020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed {quote} The cause is not related to this jira, not sure why this happen. plz make sure your kerberos setting up correctly. And also to run a secure HBase, you also use a secure ZooKeeper. For hbase mailing list, refer to https://hbase.apache.org/mail-lists.html > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang >Priority: Major > Fix For: 2.0.0-beta-1, 2.0.0 > > Attachments: HBASE-19287-master-v3.patch, > HBASE-19287-master-v3.patch, HBASE-19287-master-v4.patch, > hbase-19287-master-v2.patch, master.patch > > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > 
region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting hbase:meta (replicaId=0) location in ZooKeeper as > hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] > assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master > doesn't enable ServerShutdownHandler during initialization, delay expiring > server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:58,875 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave1.hadoop,16020,1510342016106 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > 
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Triggering server recovery; existingServer > hadoop-slave2.hadoop,16020,1510341988652 looks stale, new > server:hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Master doesn't enable ServerShutdownHandler during > initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:27:49,815 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, > msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not >
[jira] [Comment Edited] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421950#comment-16421950 ] Yi Liang edited comment on HBASE-19287 at 4/2/18 5:41 AM: -- {quote}java.io.IOException: Call to abhishekk3.pne.ven.veritas.com/10.210.62.30:16020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed {quote} The cause is not related to this jira, not sure why this happen. plz make sure your kerberos setting up correctly. And also to run a secure HBase, you also need a secure ZooKeeper. For hbase mailing list, refer to [https://hbase.apache.org/mail-lists.html] was (Author: easyliangjob): {quote} java.io.IOException: Call to abhishekk3.pne.ven.veritas.com/10.210.62.30:16020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed {quote} The cause is not related to this jira, not sure why this happen. plz make sure your kerberos setting up correctly. And also to run a secure HBase, you also use a secure ZooKeeper. 
For hbase mailing list, refer to https://hbase.apache.org/mail-lists.html > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang >Priority: Major > Fix For: 2.0.0-beta-1, 2.0.0 > > Attachments: HBASE-19287-master-v3.patch, > HBASE-19287-master-v3.patch, HBASE-19287-master-v4.patch, > hbase-19287-master-v2.patch, master.patch > > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting hbase:meta (replicaId=0) location in ZooKeeper as > hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] > 
assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master > doesn't enable ServerShutdownHandler during initialization, delay expiring > server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:58,875 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave1.hadoop,16020,1510342016106 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Triggering
[jira] [Commented] (HBASE-19218) Master stuck thinking hbase:namespace is assigned after restart preventing intialization
[ https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299460#comment-16299460 ] Yi Liang commented on HBASE-19218: -- patch is good. +1 > Master stuck thinking hbase:namespace is assigned after restart preventing > intialization > > > Key: HBASE-19218 > URL: https://issues.apache.org/jira/browse/HBASE-19218 > Project: HBase > Issue Type: Bug >Reporter: Josh Elser >Assignee: stack >Priority: Critical > Fix For: 2.0.0-beta-1 > > Attachments: HBASE-19218.master.001.patch, > hbase-hbase-master-ctr-e134-1499953498516-282290-01-03.hwx.site.log.zip, > hbase-site.xml > > > Our [~romil.choksi] brought this one to my attention after trying to get some > cluster tests running. > The Master seems to have gotten stuck never initializing after it thinks that > hbase:namespace was already deployed on the cluster when it actually was not. > On a Master restart, it reads the location out of meta and assumes that it's > there (I assume this invalid entry is the issue): > {noformat} > 2017-11-08 00:29:17,556 INFO > [ctr-e134-1499953498516-282290-01-03:2.masterManager] > assignment.RegionStateStore: Load hbase:meta entry region={ENCODED => > f147f204a579b885c351bdc0a7ebbf94, NAME => > 'hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.', STARTKEY > => '', ENDKEY => ''} regionState=OPENING > lastHost=ctr-e134-1499953498516-282290-01-05.hwx.site,16020,1510084579728 > regionLocation=ctr-e134-1499953498516-282290-01-05.hwx.site,16020,1510100695534 > {noformat} > Prior to this, the RS5 went through the ServerCrashProcedure, but it looks > like this bailed out unexpectedly: > {noformat} > 2017-11-08 00:25:25,187 WARN > [ctr-e134-1499953498516-282290-01-03:2.masterManager] > master.ServerManager: Expiration of > ctr-e134-1499953498516-282290-01-05.hwx.site,16020,1510084579728 but > server not online > 2017-11-08 00:25:25,187 INFO [ProcExecWrkr-5] > procedure.ServerCrashProcedure: Start pid=36, > 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure > server=ctr-e134-1499953498516-282290-01-03.hwx.site,16020,1510084580111, > splitWal=t > rue, meta=false > 2017-11-08 00:25:25,188 INFO > [ctr-e134-1499953498516-282290-01-03:2.masterManager] > master.ServerManager: Processing expiration of > ctr-e134-1499953498516-282290-01-05.hwx.site,16020,1510084579728 on > ctr-e134-1499953498516-28 > 2290-01-03.hwx.site,2,1510100690324 > ... > 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: > CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94 > java.lang.NullPointerException > at > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936) > at > org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:171) > at > org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:223) > at > org.apache.hadoop.hbase.master.assignment.AssignProcedure.updateTransition(AssignProcedure.java:252) > at > org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309) > at > org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:82) > at > org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731) > 2017-11-08 00:25:27,239 FATAL [ProcExecWrkr-22] procedure2.ProcedureExecutor: > CODE-BUG: 
Uncaught runtime exception for pid=37, > state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException > via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=hbase:namespace, > region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException; > ServerCrashProcedure > server=ctr-e134-1499953498516-282290-01-05.hwx.site,16020,1510084579728, > splitWal=true, meta=false > java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH > at >
[jira] [Commented] (HBASE-19218) Master stuck thinking hbase:namespace is assigned after restart preventing intialization
[ https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299221#comment-16299221 ] Yi Liang commented on HBASE-19218: -- I have seen this error in my cluster as well {quote} 2017-11-08 00:25:25,260 INFO [ProcExecWrkr-18] procedure.MasterProcedureScheduler: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94 hbase:namespace hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94. 2017-11-08 00:25:25,263 INFO [ProcExecWrkr-18] assignment.AssignProcedure: Start pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OFFLINE, location=node_5,16020,1510084579728; forceNewPlan=false, retain=true . 2017-11-08 00:25:26,040 INFO [ProcExecWrkr-23] procedure.ServerCrashProcedure: pid=42 found RIT pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OPENING, location=node_5,16020,1510100695534 2017-11-08 00:25:26,040 WARN [ProcExecWrkr-23] assignment.RegionTransitionProcedure: Remote call failed pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OPENING, location=node_5,16020,1510100695534; exception=ServerCrashProcedure pid=42, server=node_5,16020,1510100695534 2017-11-08 00:25:26,041 INFO [ProcExecWrkr-23] assignment.AssignProcedure: Retry=1 of max=10; pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OPENING, location=node_5,16020,1510100695534 2017-11-08 00:25:26,193 INFO [ProcExecWrkr-25] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as node_2,16020,1510100696388 2017-11-08 00:25:26,195 INFO [ProcExecWrkr-25] 
assignment.RegionTransitionProcedure: Dispatch pid=44, ppid=43, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, region=1588230740; rit=OPENING, location=node_2,16020,1510100696388 2017-11-08 00:25:26,346 INFO [ProcedureDispatcherTimeoutThread] procedure.RSProcedureDispatcher: Using procedure batch rpc execution for serverName=node_2,16020,1510100696388 version=2097152 2017-11-08 00:25:27,187 INFO [ProcExecWrkr-4] hbase.MetaTableAccessor: Updated table hbase:meta state to ENABLED in META 2017-11-08 00:25:27,187 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as node_2,16020,1510100696388 2017-11-08 00:25:27,209 INFO [ProcExecWrkr-22] assignment.RegionTransitionProcedure: Dispatch pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OFFLINE, location=null 2017-11-08 00:25:27,210 INFO [ProcExecWrkr-21] assignment.RegionTransitionProcedure: Dispatch pid=39, ppid=36, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:acl, region=24aadcb52fdc43e2ebcffe95d39b43ab; rit=OPENING, location=node_2,16020,1510100696388 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94 java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936) {quote} I track the AssignProcedure of hbase:namespace to above log 1. AssignProcedure created for table=hbase:namespace; and state is RUNNABLE:REGION_TRANSITION_QUEUE 2. ServerCrashProcedure happens and will handle AssignProcedure of table=hbase:namespace, and from the log, this AssignProcedure state has already been set to REGION_TRANSITION_DISPATCH when SCP handle it. 
The SCP will set the AssignProcedure back to REGION_TRANSITION_QUEUE and offline the related region.
3. However, the AssignProcedure of hbase:namespace resumes from state REGION_TRANSITION_DISPATCH. Since the SCP in step 2 has already offlined the region and set its location to null, a NullPointerException can occur.

The problem is in step 3 above: the AssignProcedure should resume from state REGION_TRANSITION_QUEUE, but it actually resumes from REGION_TRANSITION_DISPATCH. This can happen because, when the SCP calls remoteCallFailed for the AssignProcedure of hbase:namespace, the ProcedureExecutor is running that same AssignProcedure in state REGION_TRANSITION_DISPATCH at the same time. If the SCP's handleFailure (which sets the region location to null) for hbase:namespace runs before AssignProcedure#addToRemoteDispatcher, the NullPointerException occurs.

And if we can catch the NullPointerException and set the state back to REGION_TRANSITION_QUEUE,
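The failure mode in the analysis above can be sketched in isolation. This is a minimal illustration, not HBase code: the class NullLocationDispatchSketch, its State enum, and the methods addServer, dispatch, and unguardedLookupThrows are all hypothetical names invented here. The only library fact it relies on is that java.util.concurrent.ConcurrentHashMap#get throws NullPointerException for a null key, which matches the top frame of the stack trace quoted in the issue. The guard shown is one possible reading of the "set state back to REGION_TRANSITION_QUEUE" idea, not the committed patch.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: none of these types are HBase classes.
class NullLocationDispatchSketch {

    // Hypothetical stand-ins for the two procedure states named in the comment.
    enum State { REGION_TRANSITION_QUEUE, REGION_TRANSITION_DISPATCH }

    // Hypothetical stand-in for a dispatcher's per-server map of queued operations.
    private final ConcurrentHashMap<String, StringBuilder> nodes = new ConcurrentHashMap<>();

    void addServer(String serverName) {
        nodes.put(serverName, new StringBuilder());
    }

    // Unguarded lookup: ConcurrentHashMap rejects null keys, so a location that
    // was cleared by another thread surfaces here as a NullPointerException.
    static boolean unguardedLookupThrows(String location) {
        try {
            new ConcurrentHashMap<String, String>().get(location);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // Guarded dispatch: returns the state the procedure should continue from.
    // A null or unknown location sends the procedure back to the queue state
    // instead of letting the lookup throw.
    State dispatch(String location, String operation) {
        if (location == null) {
            return State.REGION_TRANSITION_QUEUE;
        }
        StringBuilder node = nodes.get(location);
        if (node == null) {
            return State.REGION_TRANSITION_QUEUE;
        }
        node.append(operation).append(';');
        return State.REGION_TRANSITION_DISPATCH;
    }

    public static void main(String[] args) {
        NullLocationDispatchSketch d = new NullLocationDispatchSketch();
        d.addServer("node_2,16020,1510100696388");
        System.out.println(unguardedLookupThrows(null));
        System.out.println(d.dispatch(null, "open hbase:namespace"));
        System.out.println(d.dispatch("node_2,16020,1510100696388", "open hbase:namespace"));
    }
}
```

Note that the sketch takes the location as a method parameter, so it is read once before being checked. A guard that re-reads a shared field after the null check would still be open to the same race.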
[jira] [Updated] (HBASE-19556) Remove TestAssignmentManager#testGoodSplit, which no longer make sense
[ https://issues.apache.org/jira/browse/HBASE-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Liang updated HBASE-19556:
-----------------------------
    Fix Version/s: 2.0.0-beta-1
    Affects Version/s: 2.0.0
    Status: Patch Available (was: Open)

> Remove TestAssignmentManager#testGoodSplit, which no longer make sense
> ----------------------------------------------------------------------
>
> Key: HBASE-19556
> URL: https://issues.apache.org/jira/browse/HBASE-19556
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Priority: Minor
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-19556-master-v1.patch
>
>
> {quote}
> [ERROR] testGoodSplit(org.apache.hadoop.hbase.master.assignment.TestAssignmentManager) Time elapsed: 0.478 s <<< ERROR!
> java.io.IOException: 5a50732f7cb3a05dd3a297bacbc34943 NOT splittable
> {quote}
> GoodSplitExecutor can only mock some functions in RSProcedureDispatcher.
> However, to test a split, we need to create (or mock) a real table on the
> region server side, and GoodSplitExecutor cannot mock that kind of function.
> A similar test is already covered in TestSplitTableRegionProcedure, so we
> no longer need to test it again in TestAssignmentManager.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19556) Remove TestAssignmentManager#testGoodSplit, which no longer make sense
[ https://issues.apache.org/jira/browse/HBASE-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Liang updated HBASE-19556:
-----------------------------
    Attachment: HBASE-19556-master-v1.patch

> Remove TestAssignmentManager#testGoodSplit, which no longer make sense
> ----------------------------------------------------------------------
>
> Key: HBASE-19556
> URL: https://issues.apache.org/jira/browse/HBASE-19556
> Project: HBase
> Issue Type: Sub-task
> Reporter: Yi Liang
> Assignee: Yi Liang
> Priority: Minor
> Attachments: HBASE-19556-master-v1.patch
>
>
> {quote}
> [ERROR] testGoodSplit(org.apache.hadoop.hbase.master.assignment.TestAssignmentManager) Time elapsed: 0.478 s <<< ERROR!
> java.io.IOException: 5a50732f7cb3a05dd3a297bacbc34943 NOT splittable
> {quote}
> GoodSplitExecutor can only mock some functions in RSProcedureDispatcher.
> However, to test a split, we need to create (or mock) a real table on the
> region server side, and GoodSplitExecutor cannot mock that kind of function.
> A similar test is already covered in TestSplitTableRegionProcedure, so we
> no longer need to test it again in TestAssignmentManager.
[jira] [Updated] (HBASE-19556) Remove TestAssignmentManager#testGoodSplit, which no longer make sense
[ https://issues.apache.org/jira/browse/HBASE-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Liang updated HBASE-19556:
-----------------------------
    Issue Type: Sub-task (was: Bug)
    Parent: HBASE-18110

> Remove TestAssignmentManager#testGoodSplit, which no longer make sense
> ----------------------------------------------------------------------
>
> Key: HBASE-19556
> URL: https://issues.apache.org/jira/browse/HBASE-19556
> Project: HBase
> Issue Type: Sub-task
> Reporter: Yi Liang
> Assignee: Yi Liang
> Priority: Minor
>
> {quote}
> [ERROR] testGoodSplit(org.apache.hadoop.hbase.master.assignment.TestAssignmentManager) Time elapsed: 0.478 s <<< ERROR!
> java.io.IOException: 5a50732f7cb3a05dd3a297bacbc34943 NOT splittable
> {quote}
> GoodSplitExecutor can only mock some functions in RSProcedureDispatcher.
> However, to test a split, we need to create (or mock) a real table on the
> region server side, and GoodSplitExecutor cannot mock that kind of function.
> A similar test is already covered in TestSplitTableRegionProcedure, so we
> no longer need to test it again in TestAssignmentManager.
[jira] [Created] (HBASE-19556) Remove TestAssignmentManager#testGoodSplit, which no longer make sense
Yi Liang created HBASE-19556:
--------------------------------

    Summary: Remove TestAssignmentManager#testGoodSplit, which no longer make sense
    Key: HBASE-19556
    URL: https://issues.apache.org/jira/browse/HBASE-19556
    Project: HBase
    Issue Type: Bug
    Reporter: Yi Liang
    Assignee: Yi Liang
    Priority: Minor

{quote}
[ERROR] testGoodSplit(org.apache.hadoop.hbase.master.assignment.TestAssignmentManager) Time elapsed: 0.478 s <<< ERROR!
java.io.IOException: 5a50732f7cb3a05dd3a297bacbc34943 NOT splittable
{quote}
GoodSplitExecutor can only mock some functions in RSProcedureDispatcher. However, to test a split, we need to create (or mock) a real table on the region server side, and GoodSplitExecutor cannot mock that kind of function.
A similar test is already covered in TestSplitTableRegionProcedure, so we no longer need to test it again in TestAssignmentManager.
[jira] [Comment Edited] (HBASE-18110) [AMv2] Reenable tests temporarily disabled
[ https://issues.apache.org/jira/browse/HBASE-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295832#comment-16295832 ] Yi Liang edited comment on HBASE-18110 at 12/18/17 11:29 PM: - Hi [~stack] I just check some ignored tests and found TestAssignmentManager#testGoodSplit() is not make sense. I think we can remove it. I can follow up a jira if you ok with it. was (Author: easyliangjob): Hi [~stack] I just check some ignored tests and found TestAssignmentManager#testGoodSplit() is not make sense. I think we can remove it. > [AMv2] Reenable tests temporarily disabled > -- > > Key: HBASE-18110 > URL: https://issues.apache.org/jira/browse/HBASE-18110 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Blocker > Fix For: 2.0.0-beta-1 > > > We disabled tests that didn't make sense or relied on behavior not supported > by AMv2. Revisit and reenable after AMv2 gets committed. Here is the set > (from > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.rsj53tx4vlwj) > testAllFavoredNodesDead and testAllFavoredNodesDeadMasterRestarted and > testMisplacedRegions in TestFavoredStochasticLoadBalancer … not sure what > this about. > testRegionNormalizationMergeOnCluster in TestSimpleRegionNormalizerOnCluster > disabled for now till we fix up Merge. > testMergeWithReplicas in TestRegionMergeTransactionOnCluster because don't > know how it is supposed to work. > Admin#close does not update Master. Causes > testHBaseFsckWithFewerMetaReplicaZnodes in TestMetaWithReplicas to fail > (Master gets report about server closing when it didn’t run the close -- gets > freaked out). > Disabled/Ignore TestRSGroupsOfflineMode#testOffline; need to dig in on what > offline is. > Disabled/Ignore TestRSGroups. 
> All tests that have to do w/ fsck:TestHBaseFsckTwoRS, > TestOfflineMetaRebuildBase TestHBaseFsckReplicas, > TestOfflineMetaRebuildOverlap, testChangingReplicaCount in > TestMetaWithReplicas (internally it is doing fscks which are killing RS)... > FSCK test testHBaseFsckWithExcessMetaReplicas in TestMetaWithReplicas. > So is testHBaseFsckWithFewerMetaReplicas in same class. > TestHBaseFsckOneRS is fsck. Disabled. > TestOfflineMetaRebuildHole is about rebuilding hole with fsck. > Master carries meta: > TestRegionRebalancing is disabled because doesn't consider the fact that > Master carries system tables only (fix of average in RegionStates brought out > the issue). > Disabled testMetaAddressChange in TestMetaWithReplicas because presumes can > move meta... you can't > TestAsyncTableGetMultiThreaded wants to move hbase:meta...Balancer does NPEs. > AMv2 won't let you move hbase:meta off Master. > Disabled parts of...testCreateTableWithMultipleReplicas in > TestMasterOperationsForRegionReplicas There is an issue w/ assigning more > replicas if number of replicas is changed on us. See '/* DISABLED! FOR > NOW'. > Disabled TestCorruptedRegionStoreFile. Depends on a half-implemented reopen > of a region when a store file goes missing; TODO. > testRetainAssignmentOnRestart in TestRestartCluster does not work. AMv2 does > retain semantic differently. Fix. TODO. > TestMasterFailover needs to be rewritten for AMv2. It uses tricks not > ordained when up on AMv2. The test is also hobbled by fact that we > religiously enforce that only master can carry meta, something we are lose > about in old AM. > Fix Ignores in TestServerCrashProcedure. Master is different now. > Offlining is done differently now: Because of this disabled testOfflineRegion > in TestAsyncRegionAdminApi > Skipping delete of table after test in TestAccessController3 because of > access issues w/ AMv2. AMv1 seems to crash servers on exit too for same lack > of auth perms but AMv2 gets hung up. TODO. 
See cleanUp method. > TestHCM#testMulti and TestHCM > Fix TestMasterMetrics. Stuff is different now around startup which messes up > this test. Disabled two of three tests. > I tried to fix TestMasterBalanceThrottling but it looks like > SimpleLoadBalancer is borked whether AMv2 or not. > Disabled testPickers in TestFavoredStochasticBalancerPickers. It hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18110) [AMv2] Reenable tests temporarily disabled
[ https://issues.apache.org/jira/browse/HBASE-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295832#comment-16295832 ] Yi Liang commented on HBASE-18110: -- Hi [~stack] I just check some ignored tests and found TestAssignmentManager#testGoodSplit() is not make sense. I think we can remove it. > [AMv2] Reenable tests temporarily disabled > -- > > Key: HBASE-18110 > URL: https://issues.apache.org/jira/browse/HBASE-18110 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: stack >Priority: Blocker > Fix For: 2.0.0-beta-1 > > > We disabled tests that didn't make sense or relied on behavior not supported > by AMv2. Revisit and reenable after AMv2 gets committed. Here is the set > (from > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.rsj53tx4vlwj) > testAllFavoredNodesDead and testAllFavoredNodesDeadMasterRestarted and > testMisplacedRegions in TestFavoredStochasticLoadBalancer … not sure what > this about. > testRegionNormalizationMergeOnCluster in TestSimpleRegionNormalizerOnCluster > disabled for now till we fix up Merge. > testMergeWithReplicas in TestRegionMergeTransactionOnCluster because don't > know how it is supposed to work. > Admin#close does not update Master. Causes > testHBaseFsckWithFewerMetaReplicaZnodes in TestMetaWithReplicas to fail > (Master gets report about server closing when it didn’t run the close -- gets > freaked out). > Disabled/Ignore TestRSGroupsOfflineMode#testOffline; need to dig in on what > offline is. > Disabled/Ignore TestRSGroups. > All tests that have to do w/ fsck:TestHBaseFsckTwoRS, > TestOfflineMetaRebuildBase TestHBaseFsckReplicas, > TestOfflineMetaRebuildOverlap, testChangingReplicaCount in > TestMetaWithReplicas (internally it is doing fscks which are killing RS)... > FSCK test testHBaseFsckWithExcessMetaReplicas in TestMetaWithReplicas. 
> So is testHBaseFsckWithFewerMetaReplicas in same class. > TestHBaseFsckOneRS is fsck. Disabled. > TestOfflineMetaRebuildHole is about rebuilding hole with fsck. > Master carries meta: > TestRegionRebalancing is disabled because doesn't consider the fact that > Master carries system tables only (fix of average in RegionStates brought out > the issue). > Disabled testMetaAddressChange in TestMetaWithReplicas because presumes can > move meta... you can't > TestAsyncTableGetMultiThreaded wants to move hbase:meta...Balancer does NPEs. > AMv2 won't let you move hbase:meta off Master. > Disabled parts of...testCreateTableWithMultipleReplicas in > TestMasterOperationsForRegionReplicas There is an issue w/ assigning more > replicas if number of replicas is changed on us. See '/* DISABLED! FOR > NOW'. > Disabled TestCorruptedRegionStoreFile. Depends on a half-implemented reopen > of a region when a store file goes missing; TODO. > testRetainAssignmentOnRestart in TestRestartCluster does not work. AMv2 does > retain semantic differently. Fix. TODO. > TestMasterFailover needs to be rewritten for AMv2. It uses tricks not > ordained when up on AMv2. The test is also hobbled by fact that we > religiously enforce that only master can carry meta, something we are lose > about in old AM. > Fix Ignores in TestServerCrashProcedure. Master is different now. > Offlining is done differently now: Because of this disabled testOfflineRegion > in TestAsyncRegionAdminApi > Skipping delete of table after test in TestAccessController3 because of > access issues w/ AMv2. AMv1 seems to crash servers on exit too for same lack > of auth perms but AMv2 gets hung up. TODO. See cleanUp method. > TestHCM#testMulti and TestHCM > Fix TestMasterMetrics. Stuff is different now around startup which messes up > this test. Disabled two of three tests. > I tried to fix TestMasterBalanceThrottling but it looks like > SimpleLoadBalancer is borked whether AMv2 or not. 
> Disabled testPickers in TestFavoredStochasticBalancerPickers. It hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292933#comment-16292933 ] Yi Liang commented on HBASE-19287: -- [~stack], thanks for reviewing and adding the javadoc. > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0-beta-1 > > Attachments: HBASE-19287-master-v3.patch, > HBASE-19287-master-v3.patch, HBASE-19287-master-v4.patch, > hbase-19287-master-v2.patch, master.patch > > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting 
hbase:meta (replicaId=0) location in ZooKeeper as > hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] > assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master > doesn't enable ServerShutdownHandler during initialization, delay expiring > server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:58,875 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave1.hadoop,16020,1510342016106 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Triggering server recovery; existingServer > hadoop-slave2.hadoop,16020,1510341988652 looks stale, new > server:hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Master doesn't enable ServerShutdownHandler during > initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:27:49,815 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > 
client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, > msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not > online on hadoop-slave2.hadoop,16020,1510342023184 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544) > at
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19287: - Attachment: HBASE-19287-master-v4.patch > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: HBASE-19287-master-v3.patch, > HBASE-19287-master-v3.patch, HBASE-19287-master-v4.patch, > hbase-19287-master-v2.patch, master.patch
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19287: - Attachment: HBASE-19287-master-v3.patch > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: HBASE-19287-master-v3.patch, > HBASE-19287-master-v3.patch, hbase-19287-master-v2.patch, master.patch > > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting hbase:meta (replicaId=0) location in ZooKeeper as > hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 
19:26:56,230 INFO [ProcExecWrkr-4] > assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master > doesn't enable ServerShutdownHandler during initialization, delay expiring > server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:58,875 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave1.hadoop,16020,1510342016106 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Triggering server recovery; existingServer > hadoop-slave2.hadoop,16020,1510341988652 looks stale, new > server:hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Master doesn't enable ServerShutdownHandler during > initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:27:49,815 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, > 
msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not > online on hadoop-slave2.hadoop,16020,1510342023184 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at >
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19287: - Attachment: HBASE-19287-master-v3.patch > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: HBASE-19287-master-v3.patch, > hbase-19287-master-v2.patch, master.patch
[jira] [Comment Edited] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286762#comment-16286762 ] Yi Liang edited comment on HBASE-19287 at 12/11/17 11:42 PM: - Added a new test case to handle the situation where the server crashes while meta is being assigned. The code can be reviewed at https://reviews.apache.org/r/64512/ was (Author: easyliangjob): Add a new test case to handle situation when server crashed during assign meta. > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: hbase-19287-master-v2.patch, master.patch
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19287: - Attachment: hang.patch Added a new test case to handle the situation where the server crashes while meta is being assigned. > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: hbase-19287-master-v2.patch, master.patch
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19287: - Attachment: hbase-19287-master-v2.patch > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: hbase-19287-master-v2.patch, master.patch
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Liang updated HBASE-19287:
    Attachment: (was: hang.patch)

> master hangs forever if RecoverMeta send assign meta region request to target server fail
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: hbase-19287-master-v2.patch, master.patch
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282599#comment-16282599 ]

Yi Liang commented on HBASE-19287:

OK, I will try to put them into AM. Thanks for reviewing [~stack]

> master hangs forever if RecoverMeta send assign meta region request to target server fail
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: master.patch
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282374#comment-16282374 ]

Yi Liang commented on HBASE-19287:

See the log below:
{code}
2017-12-07 19:01:45,218 INFO [ProcExecWrkr-1] procedure.RecoverMetaProcedure: pid=17, state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure failedMetaServer=null, splitWal=true; Retaining meta assignment to server=hadoop-slave1.hadoop,16020,1512673261766
2017-12-07 19:01:45,227 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766}]
2017-12-07 19:01:45,261 INFO [ProcExecWrkr-3] procedure.MasterProcedureScheduler: pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766 hbase:meta hbase:meta,,1.1588230740
2017-12-07 19:01:45,266 INFO [ProcExecWrkr-3] assignment.AssignProcedure: Start pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; rit=OFFLINE, location=hadoop-slave1.hadoop,16020,1512673261766; forceNewPlan=false, retain=false
2017-12-07 19:01:45,419 INFO [ProcExecWrkr-2] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:45,426 INFO [ProcExecWrkr-2] assignment.RegionTransitionProcedure: Dispatch pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; rit=OPENING, location=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:45,580 INFO [ProcedureDispatcherTimeoutThread] procedure.RSProcedureDispatcher: Using procedure batch rpc execution for serverName=hadoop-slave2.hadoop,16020,1512673268932 version=2097152
2017-12-07 19:01:46,793 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [hadoop-slave2.hadoop,16020,1512673268932]
2017-12-07 19:01:46,793 INFO [main-EventThread] master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay expiring server hadoop-slave2.hadoop,16020,1512673268932
{code}

*Usually the Master hangs at this point, as in the log above, and the assign procedure becomes 'dead'. With the patch, the Master notices this and wakes the meta assign procedure, which becomes active again and runs as shown below:*

{code}
2017-12-07 19:01:46,794 INFO [main-EventThread] master.ServerManager: Meta has been assigned to crashed server: hadoop-slave2.hadoop,16020,1512673268932; will do re-assign
2017-12-07 19:01:46,794 WARN [main-EventThread] assignment.RegionTransitionProcedure: Remote call failed pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; rit=OPENING, location=hadoop-slave2.hadoop,16020,1512673268932; exception=ServerCrashProcedure pid=18, server=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:46,797 INFO [main-EventThread] assignment.AssignProcedure: Retry=1 of max=10; pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; rit=OPENING, location=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:46,798 INFO [ProcExecWrkr-4] assignment.AssignProcedure: Start pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740; rit=OFFLINE, location=null; forceNewPlan=true, retain=false
{code}

> master hangs forever if RecoverMeta send assign meta region request to target server fail
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: master.patch
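The behavior described in the comment above can be illustrated with a small Java sketch. This is only a toy model under stated assumptions, not the code in the attached patch: `PendingAssign`, `ToyServerManager` and `MetaAssignWakeDemo` are hypothetical names invented for illustration, and the `patched` flag stands in for whatever check produces the "Meta has been assigned to crashed server ... will do re-assign" log line. It shows why a dispatched assign never hears about the crash while server expiration is delayed during master initialization, and how failing the remote call sends it back to the queue with a new plan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the meta AssignProcedure; not the real HBase class.
class PendingAssign {
    final long pid;
    String state = "REGION_TRANSITION_DISPATCH";
    String location;              // server the open-region request was sent to
    boolean forceNewPlan = false;
    int retries = 0;

    PendingAssign(long pid, String location) {
        this.pid = pid;
        this.location = location;
    }

    // Mirrors the "Remote call failed ... Retry=1 of max=10" lines in the log:
    // drop the dead location and go back to the queue with a fresh plan.
    void remoteCallFailed(int maxRetries) {
        if (retries >= maxRetries) {
            state = "FAILED";
            return;
        }
        retries++;
        location = null;
        forceNewPlan = true;
        state = "REGION_TRANSITION_QUEUE";
    }
}

// Hypothetical, simplified model of the master-side bookkeeping.
class ToyServerManager {
    static final int MAX_RETRIES = 10;
    private final boolean initializing;   // ServerShutdownHandler not enabled yet
    private final boolean patched;        // assumption: whether the wake-up check exists
    private final Map<String, List<PendingAssign>> dispatched = new HashMap<>();
    final List<String> delayedExpirations = new ArrayList<>();

    ToyServerManager(boolean initializing, boolean patched) {
        this.initializing = initializing;
        this.patched = patched;
    }

    void dispatch(PendingAssign p) {
        dispatched.computeIfAbsent(p.location, k -> new ArrayList<>()).add(p);
    }

    // Called when a region server's ephemeral ZooKeeper node is deleted.
    void expireServer(String server) {
        if (initializing) {
            delayedExpirations.add(server);   // "delay expiring server ..."
            if (!patched) {
                return;                       // nobody tells the procedure, so it hangs
            }
        }
        // Fail every assign that was dispatched to the crashed server.
        List<PendingAssign> stuck = dispatched.remove(server);
        if (stuck != null) {
            for (PendingAssign p : stuck) {
                p.remoteCallFailed(MAX_RETRIES);
            }
        }
    }
}

class MetaAssignWakeDemo {
    public static void main(String[] args) {
        for (boolean patched : new boolean[] {false, true}) {
            ToyServerManager master = new ToyServerManager(true, patched);
            PendingAssign assign = new PendingAssign(18, "hadoop-slave2");
            master.dispatch(assign);
            master.expireServer("hadoop-slave2");
            System.out.println("patched=" + patched + " state=" + assign.state
                + " location=" + assign.location + " forceNewPlan=" + assign.forceNewPlan);
        }
    }
}
```

In this model the unpatched run leaves the procedure in `REGION_TRANSITION_DISPATCH`, still pointing at the dead server, while the patched run returns it to `REGION_TRANSITION_QUEUE` with `location=null` and `forceNewPlan=true`, which is the shape of the last log line in the comment.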
[jira] [Comment Edited] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282374#comment-16282374 ]

Yi Liang edited comment on HBASE-19287 at 12/7/17 7:26 PM:

See the log below:
{code}
2017-12-07 19:01:45,218 INFO [ProcExecWrkr-1] procedure.RecoverMetaProcedure: pid=17, state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure failedMetaServer=null, splitWal=true; Retaining meta assignment to server=hadoop-slave1.hadoop,16020,1512673261766
2017-12-07 19:01:45,227 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766}]
2017-12-07 19:01:45,261 INFO [ProcExecWrkr-3] procedure.MasterProcedureScheduler: pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766 hbase:meta hbase:meta,,1.1588230740
2017-12-07 19:01:45,266 INFO [ProcExecWrkr-3] assignment.AssignProcedure: Start pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; rit=OFFLINE, location=hadoop-slave1.hadoop,16020,1512673261766; forceNewPlan=false, retain=false
2017-12-07 19:01:45,419 INFO [ProcExecWrkr-2] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:45,426 INFO [ProcExecWrkr-2] assignment.RegionTransitionProcedure: Dispatch pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; rit=OPENING, location=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:45,580 INFO [ProcedureDispatcherTimeoutThread] procedure.RSProcedureDispatcher: Using procedure batch rpc execution for serverName=hadoop-slave2.hadoop,16020,1512673268932 version=2097152
2017-12-07 19:01:46,793 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [hadoop-slave2.hadoop,16020,1512673268932]
2017-12-07 19:01:46,793 INFO [main-EventThread] master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay expiring server hadoop-slave2.hadoop,16020,1512673268932
{code}

*Usually the Master hangs at this point, as in the log above, and the assign procedure becomes 'dead'. With the patch, the Master notices this and wakes the meta assign procedure, which becomes active again and runs as shown below:*

{code}
2017-12-07 19:01:46,794 INFO [main-EventThread] master.ServerManager: Meta has been assigned to crashed server: hadoop-slave2.hadoop,16020,1512673268932; will do re-assign
2017-12-07 19:01:46,794 WARN [main-EventThread] assignment.RegionTransitionProcedure: Remote call failed pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; rit=OPENING, location=hadoop-slave2.hadoop,16020,1512673268932; exception=ServerCrashProcedure pid=18, server=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:46,797 INFO [main-EventThread] assignment.AssignProcedure: Retry=1 of max=10; pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; rit=OPENING, location=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:46,798 INFO [ProcExecWrkr-4] assignment.AssignProcedure: Start pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740; rit=OFFLINE, location=null; forceNewPlan=true, retain=false
{code}

was (Author: easyliangjob):

See the log below:
{code}
2017-12-07 19:01:45,218 INFO [ProcExecWrkr-1] procedure.RecoverMetaProcedure: pid=17, state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure failedMetaServer=null, splitWal=true; Retaining meta assignment to server=hadoop-slave1.hadoop,16020,1512673261766
2017-12-07 19:01:45,227 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766}]
2017-12-07 19:01:45,261 INFO [ProcExecWrkr-3] procedure.MasterProcedureScheduler: pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766 hbase:meta hbase:meta,,1.1588230740
2017-12-07 19:01:45,266 INFO [ProcExecWrkr-3] assignment.AssignProcedure: Start pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Liang updated HBASE-19287:
    Attachment: master.patch

> master hangs forever if RecoverMeta send assign meta region request to target server fail
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: master.patch
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Liang updated HBASE-19287:
    Attachment: (was: p1-master.patch)

> master hangs forever if RecoverMeta send assign meta region request to target server fail
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Liang updated HBASE-19287:
    Affects Version/s: 2.0.0

> master hangs forever if RecoverMeta send assign meta region request to target server fail
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: p1-master.patch
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19287: - Component/s: proc-v2
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19287: - Attachment: p1-master.patch
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19287: - Attachment: (was: p1.patch) > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug >Reporter: Yi Liang >Assignee: Yi Liang > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting hbase:meta (replicaId=0) location in ZooKeeper as > hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] > assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, > 
region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master > doesn't enable ServerShutdownHandler during initialization, delay expiring > server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:58,875 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave1.hadoop,16020,1510342016106 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Triggering server recovery; existingServer > hadoop-slave2.hadoop,16020,1510341988652 looks stale, new > server:hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Master doesn't enable ServerShutdownHandler during > initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:27:49,815 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, > msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not > online on hadoop-slave2.hadoop,16020,1510342023184 > at > 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258) > row 'hbase:namespace' on table
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19287: - Status: Patch Available (was: Open) > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug >Reporter: Yi Liang >Assignee: Yi Liang > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting hbase:meta (replicaId=0) location in ZooKeeper as > hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] > assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master > doesn't enable ServerShutdownHandler during initialization, delay expiring > server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:58,875 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave1.hadoop,16020,1510342016106 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Triggering server recovery; existingServer > hadoop-slave2.hadoop,16020,1510341988652 looks stale, new > server:hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Master doesn't enable ServerShutdownHandler during > initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:27:49,815 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, > msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not > online on hadoop-slave2.hadoop,16020,1510342023184 > at > 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258) > row 'hbase:namespace' on
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281118#comment-16281118 ] Yi Liang commented on HBASE-19287: -- After some investigation, I found that adding a full timeout mechanism to the current Procedure framework will take time. I am not sure I can finish it before the HBase 2.0 release, so for now I am providing a fix based on the idea we discussed above: {quote} (2) Or at least, if we get a crash for the server we are currently trying to assign hbase:meta to during startup, we should notice and recalibrate the assign? {quote} This is a draft patch to run the unit tests against; I am still working on a new test case for this problem. > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug >Reporter: Yi Liang >Assignee: Yi Liang > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > 
target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting hbase:meta (replicaId=0) location in ZooKeeper as > hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] > assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master > doesn't enable ServerShutdownHandler during initialization, delay expiring > server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:58,875 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave1.hadoop,16020,1510342016106 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Triggering server recovery; existingServer > hadoop-slave2.hadoop,16020,1510341988652 looks stale, new > server:hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > 
master.ServerManager: Master doesn't enable ServerShutdownHandler during > initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:27:49,815 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, > msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not > online on hadoop-slave2.hadoop,16020,1510342023184 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401) > at >
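The log quoted above shows the condition this draft fix relies on: hbase:meta is being opened on hadoop-slave2.hadoop,16020,1510341988652 while a server with the same host and port but a newer start code registers. A minimal sketch of that check follows, assuming server names in the host,port,startcode form seen in the log. `ServerId` and `MetaAssignCheck` are hypothetical names used only for illustration; they are not HBase classes and are not the contents of the attached patch.

```java
/** Hypothetical value type for a "host,port,startcode" server name; not an HBase API. */
final class ServerId {
    final String host;
    final int port;
    final long startCode;

    ServerId(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    /** Parses a name such as "hadoop-slave2.hadoop,16020,1510341988652". */
    static ServerId parse(String name) {
        String[] parts = name.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + name);
        }
        return new ServerId(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
    }

    /** True when 'other' is a newer incarnation of the same host and port. */
    boolean isSupersededBy(ServerId other) {
        return host.equals(other.host) && port == other.port && other.startCode > startCode;
    }
}

/** Hypothetical check; the real decision logic in the master may differ. */
final class MetaAssignCheck {
    /**
     * Returns true if the server that hbase:meta is being assigned to has been
     * replaced by a newly registered incarnation, so the assign should be re-planned.
     */
    static boolean shouldReassign(String assignLocation, String newlyRegistered) {
        return ServerId.parse(assignLocation).isSupersededBy(ServerId.parse(newlyRegistered));
    }
}
```

With the two hadoop-slave2 names from the log, `shouldReassign` reports that the pending meta assign points at a stale server, which is the moment the assign would need to be recalibrated.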
[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19287: - Attachment: p1.patch > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug >Reporter: Yi Liang >Assignee: Yi Liang > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting hbase:meta (replicaId=0) location in ZooKeeper as > hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] > assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, > 
region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master > doesn't enable ServerShutdownHandler during initialization, delay expiring > server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:58,875 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave1.hadoop,16020,1510342016106 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Triggering server recovery; existingServer > hadoop-slave2.hadoop,16020,1510341988652 looks stale, new > server:hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Master doesn't enable ServerShutdownHandler during > initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:27:49,815 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, > msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not > online on hadoop-slave2.hadoop,16020,1510342023184 > at > 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258) > row 'hbase:namespace' on table 'hbase:meta'
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275129#comment-16275129 ] Yi Liang commented on HBASE-19287: -- [~stack] I spent some time digging into the code. The AssignProcedure workflow is: {quote} 1. The master sends an assign request to the target region server. The active AssignProcedure is then removed from the procedure scheduler (the queue that stores all active procedures) and suspended. 2. Once the target server receives the request and opens the region, it sends a response to the master. 3. Once the master receives the response, it wakes the procedure and puts the AssignProcedure back into the procedure scheduler. A worker thread in the ProcedureExecutor then polls the AssignProcedure and runs the remaining steps. {quote} The problem occurs at step 3. If the master does not receive a response from the target server for any reason, the assign procedure becomes a dead procedure: no other mechanism will wake it (i.e. put it back into the procedure scheduler). (I do not know why we need to remove the procedure from the procedure scheduler in step 1; maybe we could just mark it as suspended and yield it?) The point is that a suspended procedure is woken only by the response from the target server. ServerCrashProcedure may wake it, but if the target server has not crashed and the master simply cannot receive the response for some other reason, such as a network issue, the problem still occurs; and if the master is not up yet, SCP does not work either. So this is a general problem, not only for meta but also for normal regions, and we need a way to wake these suspended procedures. My suggestion is a separate thread that periodically checks all suspended procedures; if one has timed out or its target server has crashed, we reassign. 
(1) A crashed target server only leaves meta's assign suspended, because the master is not up yet; other regions can be woken by ServerCrashProcedure. (2) A timeout mechanism for all suspended procedures: if a procedure has been suspended for too long, we mark it as timed out and redo the remaining steps. We can do (1) first. For (2), since we do not have timeouts for procedures yet, I am not sure how to fix it properly. > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug >Reporter: Yi Liang >Assignee: Yi Liang > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting hbase:meta (replicaId=0) location in ZooKeeper as > 
hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] > assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread]
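The periodic check proposed in the comment above could look roughly like the following. This is only a minimal sketch under stated assumptions: `SuspendedProcedureWatchdog`, `Parked`, `onSuspend`, `onResponse` and `sweep` are hypothetical names invented for illustration, not HBase classes or methods, and the caller is assumed to supply the set of live servers and the current time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these types exist in HBase.
final class SuspendedProcedureWatchdog {

  // What the watchdog remembers about a procedure parked after dispatch.
  private static final class Parked {
    final String targetServer;
    final long suspendedAtMillis;

    Parked(String targetServer, long suspendedAtMillis) {
      this.targetServer = targetServer;
      this.suspendedAtMillis = suspendedAtMillis;
    }
  }

  private final Map<Long, Parked> suspended = new ConcurrentHashMap<>();
  private final long timeoutMillis;

  SuspendedProcedureWatchdog(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  // Step 1 of the workflow: the request was sent and the procedure suspended.
  void onSuspend(long procId, String targetServer, long nowMillis) {
    suspended.put(procId, new Parked(targetServer, nowMillis));
  }

  // Step 3 of the workflow: the response arrived, nothing left to watch.
  void onResponse(long procId) {
    suspended.remove(procId);
  }

  // One pass of the periodic check. Returns the ids of procedures that should
  // be woken and reassigned: target no longer live (1) or suspended too long (2).
  List<Long> sweep(Set<String> liveServers, long nowMillis) {
    List<Long> toWake = new ArrayList<>();
    for (Map.Entry<Long, Parked> e : suspended.entrySet()) {
      Parked parked = e.getValue();
      boolean targetGone = !liveServers.contains(parked.targetServer);
      boolean timedOut = nowMillis - parked.suspendedAtMillis >= timeoutMillis;
      // remove(key, value) succeeds only if this exact entry is still parked,
      // so a procedure already released by onResponse is not woken twice.
      if ((targetGone || timedOut) && suspended.remove(e.getKey(), parked)) {
        toWake.add(e.getKey());
      }
    }
    return toWake;
  }
}
```

A scheduled thread would call `sweep` at a fixed interval and resubmit whatever it returns. Passing the clock in as a parameter keeps the check deterministic, which makes the timeout path easy to exercise without sleeping.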
[jira] [Assigned] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang reassigned HBASE-19287: Assignee: Yi Liang > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug >Reporter: Yi Liang >Assignee: Yi Liang > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting hbase:meta (replicaId=0) location in ZooKeeper as > hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] > assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, > 
region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master > doesn't enable ServerShutdownHandler during initialization, delay expiring > server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:58,875 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave1.hadoop,16020,1510342016106 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Triggering server recovery; existingServer > hadoop-slave2.hadoop,16020,1510341988652 looks stale, new > server:hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Master doesn't enable ServerShutdownHandler during > initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:27:49,815 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, > msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not > online on hadoop-slave2.hadoop,16020,1510342023184 > at > 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278) > at > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258) > row 'hbase:namespace' on table
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269587#comment-16269587 ]

Yi Liang commented on HBASE-19287:
----------------------------------

I will try the second approach; I need to dig into the code to see how to implement it.
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261290#comment-16261290 ]

Yi Liang commented on HBASE-19287:
----------------------------------

{code}
(1) Should the assign of hbase:meta be synchronous so it can timeout/verify the hbase:meta assign, the important needed to get us up off the ground?
(2) Or at least, if we get a crash for the server we are currently trying to assign hbase:meta too during startup, we should notice and recalibrate the assign?
{code}
I think both approaches are good. With the first one, however, it is hard to define the timeout, because it depends on how large the HBase cluster is. The second one can recalibrate the assign immediately after it detects that the target server is down, without waiting for a timeout, so the HMaster can start faster. I prefer to try the second one first. What do you think, [~stack]?
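To make the trade-off between the two approaches concrete, here is a small sketch. It assumes the pending meta assign is modelled as a `CompletableFuture`; `MetaAssignWait`, `Outcome`, `await` and `onServerExpired` are hypothetical names for illustration only and are not part of HBase.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: these names are illustrative, not HBase APIs.
final class MetaAssignWait {

  enum Outcome { OPENED, TARGET_EXPIRED, TIMED_OUT }

  // Approach (1): block on the assign, bounded by a caller-chosen timeout.
  // Approach (2) shows up here as an exceptional completion, which returns
  // immediately instead of waiting for the timeout to run out.
  static Outcome await(CompletableFuture<Void> opened, long timeout, TimeUnit unit) {
    try {
      opened.get(timeout, unit);
      return Outcome.OPENED;
    } catch (ExecutionException e) {
      return Outcome.TARGET_EXPIRED;
    } catch (TimeoutException e) {
      return Outcome.TIMED_OUT;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Outcome.TIMED_OUT; // simplification: treat interruption like a timeout
    }
  }

  // Approach (2): invoked when a region server is seen to expire. If it is the
  // server the assign was sent to, fail the pending wait right away.
  static void onServerExpired(String expiredServer, String assignTarget,
      CompletableFuture<Void> opened) {
    if (expiredServer.equals(assignTarget)) {
      opened.completeExceptionally(
          new IllegalStateException("assign target expired: " + expiredServer));
    }
  }
}
```

With only the timeout, the caller always pays the full wait before retrying, and the right value depends on cluster size. With the expiry hook, the wait ends as soon as the crash is noticed, and the timeout remains as a fallback for cases such as a lost response from a live server.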
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260014#comment-16260014 ]

Yi Liang commented on HBASE-19287:
----------------------------------

The workers are stuck assigning hbase:meta, and there seems to be no timeout mechanism for a procedure. I am still digging into the code.
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260012#comment-16260012 ]

Yi Liang commented on HBASE-19287:
----------------------------------

{code}
2017-11-20 23:05:24,829 INFO [ProcExecWrkr-2] client.AsyncRequestFutureImpl: #1, waiting for 1 actions to finish on table: hbase:meta
2017-11-20 23:05:28,570 WARN [ProcExecTimeout] procedure2.ProcedureExecutor: Worker stuck ProcExecWrkr-2(pid=81) run time 13.8040sec
2017-11-20 23:05:33,571 WARN [ProcExecTimeout] procedure2.ProcedureExecutor: Worker stuck ProcExecWrkr-2(pid=81) run time 18.8050sec
2017-11-20 23:05:34,836 INFO [ProcExecWrkr-2] client.AsyncRequestFutureImpl: #1, waiting for 1 actions to finish on table: hbase:meta
2017-11-20 23:05:38,572 WARN [ProcExecTimeout] procedure2.ProcedureExecutor: Worker stuck ProcExecWrkr-2(pid=81) run time 23.8060sec
{code}
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255793#comment-16255793 ]

Yi Liang commented on HBASE-19287:
----------------------------------

[~uagashe] [~stack] Any ideas about this problem?
[jira] [Comment Edited] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255788#comment-16255788 ]

Yi Liang edited comment on HBASE-19287 at 11/16/17 7:02 PM:
-----------------------------------------------------------

This happens when I restart the cluster; I have seen this error many times. RecoverMetaProcedure has a step that sends an assign-meta-region request to a target server. The problem arises if the request is sent out successfully but the target server then goes down.
{code}
try {
  final ExecuteProceduresResponse response = sendRequest(getServerName(), request.build());
  remoteCallCompleted(env, response);
} catch (IOException e) {
  e = unwrapException(e);
  // TODO: In the future some operation may want to bail out early.
  // TODO: How many times should we retry (use numberOfAttemptsSoFar)
  if (!scheduleForRetry(e)) {
    remoteCallFailed(env, e);
  }
}
{code}
So the code above throws no exception when the assign-region request is sent to the target server. But there seems to be no timeout event that would retry the AssignProcedure or the RecoverMetaProcedure, so it hangs there forever.

The log also shows the messages below; the stale server is the target server of the RPC request above.
{quote}
RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] master.ServerManager: Triggering server recovery; existingServer hadoop-slave2.hadoop,16020,1510341988652 looks stale, new server:hadoop-slave2.hadoop,16020,1510342023184
2017-11-10 19:27:05,832 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
{quote}
[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
[ https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255788#comment-16255788 ]

Yi Liang commented on HBASE-19287:
--
This happens when I restart the cluster; I have seen this error many times. RecoverMetaProcedure has a step that sends an AssignMetaRegion request to a target server. The hang occurs when the request is sent out successfully but the target server then goes down.
{code}
try {
  final ExecuteProceduresResponse response = sendRequest(getServerName(), request.build());
  remoteCallCompleted(env, response);
} catch (IOException e) {
  e = unwrapException(e);
  // TODO: In the future some operation may want to bail out early.
  // TODO: How many times should we retry (use numberOfAttemptsSoFar)
  if (!scheduleForRetry(e)) {
    remoteCallFailed(env, e);
  }
}
{code}
The code above throws no exception when the assign region request is sent to the target server, but there seems to be no timeout event to retry the AssignProcedure or the RecoverMetaProcedure, so it hangs there forever. There are also the errors below; the stale server is the target server of the RPC request above.
{quote} RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] master.ServerManager: Triggering server recovery; existingServer hadoop-slave2.hadoop,16020,1510341988652 looks stale, new server:hadoop-slave2.hadoop,16020,1510342023184 2017-11-10 19:27:05,832 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 {quote} > master hangs forever if RecoverMeta send assign meta region request to target > server fail > - > > Key: HBASE-19287 > URL: https://issues.apache.org/jira/browse/HBASE-19287 > Project: HBase > Issue Type: Bug >Reporter: Yi Liang > > 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] > procedure.RecoverMetaProcedure: pid=138, > state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure > failedMetaServer=null, splitWal=true; Retaining meta assignment to > server=hadoop-slave1.hadoop,16020,1510341981454 > 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] > 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] > procedure.MasterProcedureScheduler: pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta > hbase:meta,,1.1588230740 > 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: > Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740, > target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, > location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, > retain=false > 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: > Setting 
hbase:meta (replicaId=0) location in ZooKeeper as > hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] > assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, > region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; > rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] > procedure.RSProcedureDispatcher: Using procedure batch rpc execution for > serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 > 2017-11-10 19:26:57,542 INFO [main-EventThread] > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, > processing expiration [hadoop-slave2.hadoop,16020,1510341988652] > 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master > doesn't enable ServerShutdownHandler during initialization, delay expiring > server hadoop-slave2.hadoop,16020,1510341988652 > 2017-11-10 19:26:58,875 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave1.hadoop,16020,1510342016106 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Registering > server=hadoop-slave2.hadoop,16020,1510342023184 > 2017-11-10 19:27:05,832 INFO > [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] > master.ServerManager: Triggering server recovery; existingServer > hadoop-slave2.hadoop,16020,1510341988652
[jira] [Created] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail
Yi Liang created HBASE-19287: Summary: master hangs forever if RecoverMeta send assign meta region request to target server fail Key: HBASE-19287 URL: https://issues.apache.org/jira/browse/HBASE-19287 Project: HBase Issue Type: Bug Reporter: Yi Liang 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1] procedure.RecoverMetaProcedure: pid=138, state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure failedMetaServer=null, splitWal=true; Retaining meta assignment to server=hadoop-slave1.hadoop,16020,1510341981454 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}] 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2] procedure.MasterProcedureScheduler: pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta hbase:meta,,1.1588230740 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure: Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, retain=false 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as hadoop-slave2.hadoop,16020,1510341988652 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4] assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread] procedure.RSProcedureDispatcher: Using 
procedure batch rpc execution for serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152 2017-11-10 19:26:57,542 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [hadoop-slave2.hadoop,16020,1510341988652] 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 2017-11-10 19:26:58,875 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] master.ServerManager: Registering server=hadoop-slave1.hadoop,16020,1510342016106 2017-11-10 19:27:05,832 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] master.ServerManager: Registering server=hadoop-slave2.hadoop,16020,1510342023184 2017-11-10 19:27:05,832 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] master.ServerManager: Triggering server recovery; existingServer hadoop-slave2.hadoop,16020,1510341988652 looks stale, new server:hadoop-slave2.hadoop,16020,1510342023184 2017-11-10 19:27:05,832 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652 2017-11-10 19:27:49,815 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on hadoop-slave2.hadoop,16020,1510342023184 at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290) at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370) at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401) at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258) row 'hbase:namespace' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hadoop-slave2.hadoop,16020,1510341988652, seqNum=0 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
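The missing piece described in this report is a timeout for a request that was dispatched successfully but never answered. The following is only a minimal sketch of that idea, not the fix attached to this issue: PendingCallTracker and all of its methods are hypothetical names, and it uses no HBase classes. It records when each request was dispatched and, on a periodic sweep, hands back the procedure ids whose target never reported, so that a caller could reschedule them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: tracks dispatched remote calls and surfaces the ones that never reported back. */
final class PendingCallTracker {
    private final long timeoutMs;
    // procedure id -> time (ms) at which the request was dispatched
    private final Map<Long, Long> dispatchedAt = new HashMap<>();

    PendingCallTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Record that the request for this procedure left the master without an exception. */
    synchronized void dispatched(long procId, long nowMs) {
        dispatchedAt.put(procId, nowMs);
    }

    /** The target server reported a result, so the call is no longer pending. */
    synchronized void completed(long procId) {
        dispatchedAt.remove(procId);
    }

    /** Remove and return the ids of calls that have been pending for at least the timeout. */
    synchronized List<Long> expire(long nowMs) {
        List<Long> expired = new ArrayList<>();
        Iterator<Map.Entry<Long, Long>> it = dispatchedAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> entry = it.next();
            if (nowMs - entry.getValue() >= timeoutMs) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

In the log above, pid=139 would be recorded at dispatch time (19:26:56) and, with no report from the stale server, returned by a later expire() call instead of waiting forever. The clock is passed in as a parameter so the behaviour can be checked without sleeping.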
[jira] [Updated] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails
[ https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19237: - Attachment: HBASE-19237-master-v1.patch Hi Ted, the failed tests passed locally, but I will rerun the unit tests. > TestMaster.testMasterOpsWhileSplitting fails > > > Key: HBASE-19237 > URL: https://issues.apache.org/jira/browse/HBASE-19237 > Project: HBase > Issue Type: Test >Reporter: Ted Yu > Attachments: HBASE-19237-master-v1.patch, HBASE-19237-master-v1.patch > > > This is the top flaky test: > {code} > java.lang.AssertionError: expected:<3> but was:<1> > at > org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121) > {code} > After brief check, the test failure seems to be introduced by HBASE-19127 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails
[ https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19237: - Attachment: HBASE-19237-master-v1.patch > TestMaster.testMasterOpsWhileSplitting fails > > > Key: HBASE-19237 > URL: https://issues.apache.org/jira/browse/HBASE-19237 > Project: HBase > Issue Type: Test >Reporter: Ted Yu > Attachments: HBASE-19237-master-v1.patch > > > This is the top flaky test: > {code} > java.lang.AssertionError: expected:<3> but was:<1> > at > org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121) > {code} > After brief check, the test failure seems to be introduced by HBASE-19127 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails
[ https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19237: - Attachment: (was: HBASE-19237-master-v1.patch) > TestMaster.testMasterOpsWhileSplitting fails > > > Key: HBASE-19237 > URL: https://issues.apache.org/jira/browse/HBASE-19237 > Project: HBase > Issue Type: Test >Reporter: Ted Yu > Attachments: HBASE-19237-master-v1.patch > > > This is the top flaky test: > {code} > java.lang.AssertionError: expected:<3> but was:<1> > at > org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121) > {code} > After brief check, the test failure seems to be introduced by HBASE-19127 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails
[ https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19237: - Attachment: HBASE-19237-master-v1.patch > TestMaster.testMasterOpsWhileSplitting fails > > > Key: HBASE-19237 > URL: https://issues.apache.org/jira/browse/HBASE-19237 > Project: HBase > Issue Type: Test >Reporter: Ted Yu > Attachments: 19237.v1.txt, HBASE-19237-master-v1.patch > > > This is the top flaky test: > {code} > java.lang.AssertionError: expected:<3> but was:<1> > at > org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121) > {code} > After brief check, the test failure seems to be introduced by HBASE-19127 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails
[ https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248117#comment-16248117 ] Yi Liang commented on HBASE-19237: -- The test above fails because we cannot use the number of RegionStateNodes, i.e. regionStates.getRegionsOfTable(TABLENAME).size(), to decide whether the split is complete. We need to visit meta to know whether the split has completed. > TestMaster.testMasterOpsWhileSplitting fails > > > Key: HBASE-19237 > URL: https://issues.apache.org/jira/browse/HBASE-19237 > Project: HBase > Issue Type: Test >Reporter: Ted Yu > Attachments: 19237.v1.txt > > > This is the top flaky test: > {code} > java.lang.AssertionError: expected:<3> but was:<1> > at > org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121) > {code} > After brief check, the test failure seems to be introduced by HBASE-19127 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
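The point of the comment above is that the test should wait on the authoritative source (meta) instead of an in-memory count. A minimal sketch of such a wait loop follows; it is not the attached patch. SplitWait and waitForRegionCount are hypothetical names, and the count is supplied through an IntSupplier that stands in for whatever lookup against meta the test would actually perform.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch: poll an authoritative region count until the split is visible. */
final class SplitWait {
    private SplitWait() {
    }

    /**
     * Polls countFromMeta until it reports at least expected regions.
     * Returns false if the count is still too low after maxAttempts polls,
     * or if the waiting thread is interrupted.
     */
    static boolean waitForRegionCount(IntSupplier countFromMeta, int expected,
                                      int maxAttempts, long sleepMs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (countFromMeta.getAsInt() >= expected) {
                return true;
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag for the caller and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

A test in the style of the failing one could then assert on the return value of waitForRegionCount(supplier, 3, ...) rather than asserting on the size of an in-memory list right after requesting the split.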
[jira] [Commented] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails
[ https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248077#comment-16248077 ] Yi Liang commented on HBASE-19237: -- ok, I will check the code. > TestMaster.testMasterOpsWhileSplitting fails > > > Key: HBASE-19237 > URL: https://issues.apache.org/jira/browse/HBASE-19237 > Project: HBase > Issue Type: Test >Reporter: Ted Yu > Attachments: 19237.v1.txt > > > This is the top flaky test: > {code} > java.lang.AssertionError: expected:<3> but was:<1> > at > org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121) > {code} > After brief check, the test failure seems to be introduced by HBASE-19127 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16246284#comment-16246284 ] Yi Liang commented on HBASE-19127: -- [~stack], it is good to commit. Thanks > Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Sub-task >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0-beta-1 > > Attachments: HBASE-19126-v1-master.patch, region_state.patch, > state.patch > > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19127: - Attachment: (was: HBASE-19126-v1-master.patch) > Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Sub-task >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: HBASE-19126-v1-master.patch, region_state.patch, > state.patch > > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19127: - Attachment: HBASE-19126-v1-master.patch > Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Sub-task >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: HBASE-19126-v1-master.patch, region_state.patch, > state.patch > > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19127: - Attachment: HBASE-19126-v1-master.patch > Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Sub-task >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: HBASE-19126-v1-master.patch, region_state.patch, > state.patch > > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244392#comment-16244392 ] Yi Liang commented on HBASE-19127: -- unit tests passed. will format the patch. > Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Sub-task >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: region_state.patch, state.patch > > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242946#comment-16242946 ] Yi Liang commented on HBASE-19127: -- [~stack] {quote}What does the change in AM do? (Adding state for daughters)?{quote} Yes, this adds state for the daughters. {quote}Don't change numbering in protobufs.{quote} I removed MERGE_TABLE_REGIONS_MOVE_REGION_TO_SAME_RS = 3 from the proto because it has never been used in the current code. Do you think this will cause compatibility issues? If you think it is better to keep it, I will keep it. Otherwise, I think it is safe to remove, since MergeTableRegionsProcedure is new in hbase-2.0. > Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Sub-task >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: state.patch > > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19127: - Attachment: region_state.patch Try unit tests > Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Sub-task >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: region_state.patch, state.patch > > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234479#comment-16234479 ] Yi Liang commented on HBASE-19127: -- All the changes are made in code that is new in HBase 2.0, so it won't be a problem for procedures. > Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Improvement >Reporter: Yi Liang >Assignee: Yi Liang >Priority: Major > Attachments: state.patch > > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19127: - Status: Patch Available (was: Open) > Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Improvement >Reporter: Yi Liang >Assignee: Yi Liang >Priority: Major > Attachments: state.patch > > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19126) [AMv2] RegionStates/RegionStateNode needs cleanup
[ https://issues.apache.org/jira/browse/HBASE-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234414#comment-16234414 ] Yi Liang commented on HBASE-19126: -- Hi [~mdrob], yes, I am working on this one. I have recently been reading the code in this area and found some problems; I will open sub-JIRAs to fix them. If you are interested, you can also work on this one. ;) > [AMv2] RegionStates/RegionStateNode needs cleanup > - > > Key: HBASE-19126 > URL: https://issues.apache.org/jira/browse/HBASE-19126 > Project: HBase > Issue Type: Improvement >Reporter: Yi Liang >Priority: Major > Fix For: 2.0.0-beta-1 > > > // Mutable/Immutable? Changes have to be synchronized or not? > // Data members are volatile which seems to say multi-threaded access is > fine. > // In the below we do check and set but the check state could change before > // we do the set because no synchronization...which seems dodgy. Clear up > // understanding here... how many threads accessing? Do locks make it so one > // thread at a time working on a single Region's RegionStateNode? Lets > presume > // so for now. Odd is that elsewhere in this RegionStates, we synchronize on > // the RegionStateNode instance > Copied from TODO in RegionState.java > Open this jira to track some cleanups for RegionStates/RegionStateNode -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19127: - Attachment: state.patch

Trying the unit tests. [~stack] [~jerryhe], I found some issues with the region states, especially the intermediate states. In the patch, I removed MERGE_TABLE_REGIONS_MOVE_REGION_TO_SAME_RS, which has never been used. I also removed MERGE_TABLE_REGIONS_SET_MERGING_TABLE_STATE: the method under this step is commented out, so nothing is done in this step, and I do not think we need a specific step for it.

After reading RegionState.java, I set the above states according to the rules below:
1. SPLITTING => after checking that the parent region is splittable, set it on the parent region.
2. SPLITTING_NEW => set it after creating the daughter regions and before assigning the daughters as OPEN in their region states.
3. MERGING => after checking that the 2 parent regions are mergeable, set it on both parent regions.
4. MERGING_NEW => set it after creating the merged region and before assigning it as OPEN.

The above states won't affect the real procedure work; we set them because the metrics use them, and because RegionStates needs to keep the latest/correct state.

> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Improvement >Reporter: Yi Liang >Assignee: Yi Liang > Attachments: state.patch > > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
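The four rules in the comment above can be illustrated with a toy model. This is a sketch under stated assumptions and not the contents of state.patch: SplitMergeStateSketch, Node, split, merge and assign are hypothetical names, the State enum only mirrors the state names mentioned in this issue, and the "splittable"/"mergeable" checks are reduced to "the region is currently OPEN".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of where the intermediate split/merge states get set. */
final class SplitMergeStateSketch {
    enum State { OPEN, SPLITTING, SPLITTING_NEW, MERGING, MERGING_NEW }

    /** Stand-in for a per-region state node; not the HBase class. */
    static final class Node {
        final String name;
        State state;

        Node(String name, State state) {
            this.name = name;
            this.state = state;
        }
    }

    private SplitMergeStateSketch() {
    }

    /** Rules 1 and 2: mark the parent SPLITTING, create daughters as SPLITTING_NEW. */
    static List<Node> split(Node parent) {
        if (parent.state != State.OPEN) { // assumed stand-in for the splittable check
            throw new IllegalStateException(parent.name + " is not splittable");
        }
        parent.state = State.SPLITTING;
        List<Node> daughters = new ArrayList<>();
        daughters.add(new Node(parent.name + "-a", State.SPLITTING_NEW));
        daughters.add(new Node(parent.name + "-b", State.SPLITTING_NEW));
        return daughters;
    }

    /** Rules 3 and 4: mark both parents MERGING, create the merged region as MERGING_NEW. */
    static Node merge(Node first, Node second) {
        if (first.state != State.OPEN || second.state != State.OPEN) { // assumed mergeable check
            throw new IllegalStateException("regions are not mergeable");
        }
        first.state = State.MERGING;
        second.state = State.MERGING;
        return new Node(first.name + "+" + second.name, State.MERGING_NEW);
    }

    /** Assignment is what finally moves a new region to OPEN. */
    static void assign(Node node) {
        node.state = State.OPEN;
    }
}
```

With the states set this way, a check such as isInState(State.CLOSING, State.SPLITTING) from the issue description has something to match against, which is the gap the issue describes.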
[jira] [Assigned] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
[ https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang reassigned HBASE-19127: Assignee: Yi Liang > Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in > RegionStatesNode > - > > Key: HBASE-19127 > URL: https://issues.apache.org/jira/browse/HBASE-19127 > Project: HBase > Issue Type: Improvement >Reporter: Yi Liang >Assignee: Yi Liang > > In current code, we did not set above states to a region node at all, but we > still have statements like below to check if node have above states. > {code} > else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { > > } > {code} > We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode
Yi Liang created HBASE-19127: Summary: Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode Key: HBASE-19127 URL: https://issues.apache.org/jira/browse/HBASE-19127 Project: HBase Issue Type: Improvement Reporter: Yi Liang In current code, we did not set above states to a region node at all, but we still have statements like below to check if node have above states. {code} else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) { } {code} We need to set above states in a correct place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19126) [AMv2] RegionStates/RegionStateNode needs cleanup
[ https://issues.apache.org/jira/browse/HBASE-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19126: - Description: // Mutable/Immutable? Changes have to be synchronized or not? // Data members are volatile which seems to say multi-threaded access is fine. // In the below we do check and set but the check state could change before // we do the set because no synchronization...which seems dodgy. Clear up // understanding here... how many threads accessing? Do locks make it so one // thread at a time working on a single Region's RegionStateNode? Lets presume // so for now. Odd is that elsewhere in this RegionStates, we synchronize on // the RegionStateNode instance Copied from TODO in RegionState.java Open this jira to track some cleanups for RegionStates/RegionStateNode was: // Mutable/Immutable? Changes have to be synchronized or not? // Data members are volatile which seems to say multi-threaded access is fine. // In the below we do check and set but the check state could change before // we do the set because no synchronization...which seems dodgy. Clear up // understanding here... how many threads accessing? Do locks make it so one // thread at a time working on a single Region's RegionStateNode? Lets presume // so for now. Odd is that elsewhere in this RegionStates, we synchronize on // the RegionStateNode instance Open this jira to track some cleanups for RegionStates/RegionStateNode > [AMv2] RegionStates/RegionStateNode needs cleanup > - > > Key: HBASE-19126 > URL: https://issues.apache.org/jira/browse/HBASE-19126 > Project: HBase > Issue Type: Improvement >Reporter: Yi Liang > Fix For: 2.0.0-beta-1 > > > // Mutable/Immutable? Changes have to be synchronized or not? > // Data members are volatile which seems to say multi-threaded access is > fine. > // In the below we do check and set but the check state could change before > // we do the set because no synchronization...which seems dodgy. 
Clear up > // understanding here... how many threads accessing? Do locks make it so one > // thread at a time working on a single Region's RegionStateNode? Lets > presume > // so for now. Odd is that elsewhere in this RegionStates, we synchronize on > // the RegionStateNode instance > Copied from TODO in RegionState.java > Open this jira to track some cleanups for RegionStates/RegionStateNode -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-19126) [AMv2] RegionStates/RegionStateNode needs cleanup
[ https://issues.apache.org/jira/browse/HBASE-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-19126: - Summary: [AMv2] RegionStates/RegionStateNode needs cleanup (was: [AMv2] RegionStates/RegionStateNode cleanup) > [AMv2] RegionStates/RegionStateNode needs cleanup > - > > Key: HBASE-19126 > URL: https://issues.apache.org/jira/browse/HBASE-19126 > Project: HBase > Issue Type: Improvement >Reporter: Yi Liang > Fix For: 2.0.0-beta-1 > > > // Mutable/Immutable? Changes have to be synchronized or not? > // Data members are volatile which seems to say multi-threaded access is > fine. > // In the below we do check and set but the check state could change before > // we do the set because no synchronization which seems dodgy. Clear up > // understanding here... how many threads accessing? Do locks make it so one > // thread at a time working on a single Region's RegionStateNode? Let's > presume > // so for now. Odd is that elsewhere in this RegionStates, we synchronize on > // the RegionStateNode instance > Open this jira to track some cleanups for RegionStates/RegionStateNode
[jira] [Created] (HBASE-19126) [AMv2] RegionStates/RegionStateNode cleanup
Yi Liang created HBASE-19126: Summary: [AMv2] RegionStates/RegionStateNode cleanup Key: HBASE-19126 URL: https://issues.apache.org/jira/browse/HBASE-19126 Project: HBase Issue Type: Improvement Reporter: Yi Liang Fix For: 2.0.0-beta-1 // Mutable/Immutable? Changes have to be synchronized or not? // Data members are volatile which seems to say multi-threaded access is fine. // In the below we do check and set but the check state could change before // we do the set because no synchronization which seems dodgy. Clear up // understanding here... how many threads accessing? Do locks make it so one // thread at a time working on a single Region's RegionStateNode? Let's presume // so for now. Odd is that elsewhere in this RegionStates, we synchronize on // the RegionStateNode instance Opening this JIRA to track some cleanups for RegionStates/RegionStateNode.
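The check-then-set concern in the TODO above can be illustrated with a small sketch. This is not the HBase RegionStateNode class: `RegionNodeSketch`, its `State` enum, and both method names are hypothetical and exist only to show why `volatile` alone does not make a check followed by a set atomic, and how synchronizing on the node (as the TODO says is done elsewhere in RegionStates) closes the gap.

```java
import java.util.Arrays;

// Hypothetical stand-in for a region state node; NOT the HBase class.
final class RegionNodeSketch {
  enum State { OFFLINE, OPENING, OPEN, CLOSING, CLOSED }

  // volatile guarantees visibility of writes, not atomicity of check-then-set.
  private volatile State state = State.OFFLINE;

  // Racy: another thread may change 'state' between the check and the write.
  boolean setStateUnsafe(State update, State... expected) {
    if (!Arrays.asList(expected).contains(state)) {
      return false;
    }
    state = update;
    return true;
  }

  // The check and the write happen atomically under the node's monitor.
  synchronized boolean transition(State update, State... expected) {
    if (!Arrays.asList(expected).contains(state)) {
      return false;
    }
    state = update;
    return true;
  }

  State getState() {
    return state;
  }

  public static void main(String[] args) {
    RegionNodeSketch node = new RegionNodeSketch();
    System.out.println(node.transition(State.OPENING, State.OFFLINE));
    System.out.println(node.transition(State.OPEN, State.OFFLINE));
    System.out.println(node.getState());
  }
}
```

Whether the real code needs this depends on the question the TODO asks: if procedure locks already ensure one thread per region, the extra synchronization would be redundant.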
[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-18984: - Resolution: Duplicate Status: Resolved (was: Patch Available) > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_reigons 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219168#comment-16219168 ] Yi Liang commented on HBASE-18984: -- Marking this as a duplicate of HBASE-19017. I will create a new JIRA to discuss writing the intermediate states (OPENING, CLOSING, ...) into meta. > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_regions 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers
[jira] [Comment Edited] (HBASE-18352) Enable Replica tests that were disabled by Proc-V2 AM in HBASE-14614
[ https://issues.apache.org/jira/browse/HBASE-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215502#comment-16215502 ] Yi Liang edited comment on HBASE-18352 at 10/23/17 5:36 PM: I saw another situation that may cause random assignment. When we restart HBase and start the master first and the region servers afterwards, the master begins region assignment as soon as one region server has checked in. The assign plan for a region may therefore be created before the region server that last hosted it is up, in which case AM randomly chooses a region server for that region. If we restart all region servers before the master, we do not see this issue. was (Author: easyliangjob): I saw another situation that may cause random assignment. When we restart HBase and start the master first and the region servers afterwards, the master begins region assignment as soon as one region server has checked in. The assign plan for a region may therefore be created before the region server that last hosted it is up, in which case AM randomly chooses a region server for that region. > Enable Replica tests that were disabled by Proc-V2 AM in HBASE-14614 > > > Key: HBASE-18352 > URL: https://issues.apache.org/jira/browse/HBASE-18352 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 2.0.0-alpha-1 >Reporter: Stephen Yuan Jiang >Assignee: huaxiang sun > > The following replica tests were disabled by Core Proc-V2 AM in HBASE-14614: > - Disabled parts of...testCreateTableWithMultipleReplicas in > TestMasterOperationsForRegionReplicas There is an issue w/ assigning more > replicas if number of replicas is changed on us. See '/* DISABLED! FOR > NOW'. > - Disabled testRegionReplicasOnMidClusterHighReplication in > TestStochasticLoadBalancer2 > - Disabled testFlushAndCompactionsInPrimary in TestRegionReplicas > This JIRA tracks the work to enable them (or modify/remove if not applicable).
[jira] [Commented] (HBASE-18352) Enable Replica tests that were disabled by Proc-V2 AM in HBASE-14614
[ https://issues.apache.org/jira/browse/HBASE-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215502#comment-16215502 ] Yi Liang commented on HBASE-18352: -- I saw another situation that may cause random assignment. When we restart HBase and start the master first and the region servers afterwards, the master begins region assignment as soon as one region server has checked in. The assign plan for a region may therefore be created before the region server that last hosted it is up, in which case AM randomly chooses a region server for that region. > Enable Replica tests that were disabled by Proc-V2 AM in HBASE-14614 > > > Key: HBASE-18352 > URL: https://issues.apache.org/jira/browse/HBASE-18352 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 2.0.0-alpha-1 >Reporter: Stephen Yuan Jiang >Assignee: huaxiang sun > > The following replica tests were disabled by Core Proc-V2 AM in HBASE-14614: > - Disabled parts of...testCreateTableWithMultipleReplicas in > TestMasterOperationsForRegionReplicas There is an issue w/ assigning more > replicas if number of replicas is changed on us. See '/* DISABLED! FOR > NOW'. > - Disabled testRegionReplicasOnMidClusterHighReplication in > TestStochasticLoadBalancer2 > - Disabled testFlushAndCompactionsInPrimary in TestRegionReplicas > This JIRA tracks the work to enable them (or modify/remove if not applicable).
[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213443#comment-16213443 ] Yi Liang commented on HBASE-18984: -- It seems the above two comments do not clearly state why {quote}// 2. UnAssignProcedure can run first, this region will be assigned as OPEN finally.{quote} would happen. For example, suppose the master crashes while it is unassigning region A, and the RS hosting region A crashes as well. When the master restarts, it may reload region A's state as OPEN, and because the RS crashed, the master creates a ServerCrashProcedure for that RS. There will then be both an assign (created by the SCP) and an unassign (the old procedure) for region A, and it is really hard to guarantee which one runs first (not so sure). > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_regions 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers
[jira] [Issue Comment Deleted] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-18984: - Comment: was deleted (was: [~ram_krish] I just fully check the code, I am wrong for the above comments, when do bulk region assign, the region in transition(which mean there exist an old procedure for this region) would not be assigned. So there are no additional AssignProcedure for region has old procedure. And we can safely remove the step that CLOSING AND OPENING writing into meta See loadMeta below, which use to visit meta to create regionstatenode for all regions {code} private void loadMeta() throws IOException { // TODO: use a thread pool regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() { @Override public void visitRegionState(final RegionInfo regionInfo, final State state, final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) { final RegionStateNode regionNode = regionStates.getOrCreateRegionNode(regionInfo); synchronized (regionNode) { if (!regionNode.isInTransition()) { //*here is the condition* regionNode.setState(state); regionNode.setLastHost(lastHost); regionNode.setRegionLocation(regionLocation); regionNode.setOpenSeqNum(openSeqNum); if (state == State.OPEN) { assert regionLocation != null : "found null region location for " + regionNode; regionStates.addRegionToServer(regionLocation, regionNode); } else if (state == State.OFFLINE || regionInfo.isOffline()) { regionStates.addToOfflineRegions(regionNode); } else { // These regions should have a procedure in replay regionStates.addRegionInTransition(regionNode, null); } } } } }); // every assignment is blocked until meta is loaded. 
wakeMetaLoadedEvent(); } {code}) > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_reigons 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213161#comment-16213161 ] Yi Liang edited comment on HBASE-18984 at 10/20/17 8:23 PM: [~ram_krish] I just fully check the code, I am wrong for the above comments, when do bulk region assign, the region in transition(which mean there exist an old procedure for this region) would not be assigned. So there are no additional AssignProcedure for region has old procedure. And we can safely remove the step that CLOSING AND OPENING writing into meta See loadMeta below, which use to visit meta to create regionstatenode for all regions {code} private void loadMeta() throws IOException { // TODO: use a thread pool regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() { @Override public void visitRegionState(final RegionInfo regionInfo, final State state, final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) { final RegionStateNode regionNode = regionStates.getOrCreateRegionNode(regionInfo); synchronized (regionNode) { if (!regionNode.isInTransition()) { //*here is the condition* regionNode.setState(state); regionNode.setLastHost(lastHost); regionNode.setRegionLocation(regionLocation); regionNode.setOpenSeqNum(openSeqNum); if (state == State.OPEN) { assert regionLocation != null : "found null region location for " + regionNode; regionStates.addRegionToServer(regionLocation, regionNode); } else if (state == State.OFFLINE || regionInfo.isOffline()) { regionStates.addToOfflineRegions(regionNode); } else { // These regions should have a procedure in replay regionStates.addRegionInTransition(regionNode, null); } } } } }); // every assignment is blocked until meta is loaded. 
wakeMetaLoadedEvent(); } {code} was (Author: easyliangjob): [~ram_krish] I just fully check the code, I am wrong for the above comments, when do bulk region assign, the region in transition(which mean there exist an old procedure for this region) would not be assigned. So there are no additional AssignProcedure for region has old procedure. And we can safely remove the step that CLOSING AND OPENING writing into meta See loadMeta below, which use to visit meta to create regionstatenode for all regions {code} private void loadMeta() throws IOException { // TODO: use a thread pool regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() { @Override public void visitRegionState(final RegionInfo regionInfo, final State state, final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) { final RegionStateNode regionNode = regionStates.getOrCreateRegionNode(regionInfo); synchronized (regionNode) { if (!regionNode.isInTransition()) { //*{color:red}here is the condition{color}* regionNode.setState(state); regionNode.setLastHost(lastHost); regionNode.setRegionLocation(regionLocation); regionNode.setOpenSeqNum(openSeqNum); if (state == State.OPEN) { assert regionLocation != null : "found null region location for " + regionNode; regionStates.addRegionToServer(regionLocation, regionNode); } else if (state == State.OFFLINE || regionInfo.isOffline()) { regionStates.addToOfflineRegions(regionNode); } else { // These regions should have a procedure in replay regionStates.addRegionInTransition(regionNode, null); } } } } }); // every assignment is blocked until meta is loaded. 
wakeMetaLoadedEvent(); } {code} > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_reigons 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers -- This message was sent
[jira] [Comment Edited] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213161#comment-16213161 ] Yi Liang edited comment on HBASE-18984 at 10/20/17 8:23 PM: [~ram_krish] I just fully check the code, I am wrong for the above comments, when do bulk region assign, the region in transition(which mean there exist an old procedure for this region) would not be assigned. So there are no additional AssignProcedure for region has old procedure. And we can safely remove the step that CLOSING AND OPENING writing into meta See loadMeta below, which use to visit meta to create regionstatenode for all regions {code} private void loadMeta() throws IOException { // TODO: use a thread pool regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() { @Override public void visitRegionState(final RegionInfo regionInfo, final State state, final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) { final RegionStateNode regionNode = regionStates.getOrCreateRegionNode(regionInfo); synchronized (regionNode) { if (!regionNode.isInTransition()) { //*{color:red}here is the condition{color}* regionNode.setState(state); regionNode.setLastHost(lastHost); regionNode.setRegionLocation(regionLocation); regionNode.setOpenSeqNum(openSeqNum); if (state == State.OPEN) { assert regionLocation != null : "found null region location for " + regionNode; regionStates.addRegionToServer(regionLocation, regionNode); } else if (state == State.OFFLINE || regionInfo.isOffline()) { regionStates.addToOfflineRegions(regionNode); } else { // These regions should have a procedure in replay regionStates.addRegionInTransition(regionNode, null); } } } } }); // every assignment is blocked until meta is loaded. wakeMetaLoadedEvent(); } {code} was (Author: easyliangjob): [~ram_krish] I just fully check the code, I am wrong for the above comments, when do bulk region assign, the region is transition would not be assigned. 
So there are no additional AssignProcedure for region has old procedure. And we can safely remove the step that CLOSING AND OPENING writing into meta See loadMeta below, which use to visit meta to create regionstatenode for all regions {code} private void loadMeta() throws IOException { // TODO: use a thread pool regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() { @Override public void visitRegionState(final RegionInfo regionInfo, final State state, final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) { final RegionStateNode regionNode = regionStates.getOrCreateRegionNode(regionInfo); synchronized (regionNode) { if (!regionNode.isInTransition()) { regionNode.setState(state); regionNode.setLastHost(lastHost); regionNode.setRegionLocation(regionLocation); regionNode.setOpenSeqNum(openSeqNum); if (state == State.OPEN) { assert regionLocation != null : "found null region location for " + regionNode; regionStates.addRegionToServer(regionLocation, regionNode); } else if (state == State.OFFLINE || regionInfo.isOffline()) { regionStates.addToOfflineRegions(regionNode); } else { // These regions should have a procedure in replay regionStates.addRegionInTransition(regionNode, null); } } } } }); // every assignment is blocked until meta is loaded. 
wakeMetaLoadedEvent(); } {code} > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_reigons 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213161#comment-16213161 ] Yi Liang commented on HBASE-18984: -- [~ram_krish] I just fully checked the code, and my comments above were wrong: during bulk region assignment, a region that is in transition is not assigned. So no additional AssignProcedure is created for a region that already has an old procedure, and we can safely remove the step that writes CLOSING and OPENING into meta. See loadMeta below, which visits meta to create a RegionStateNode for every region:
{code}
private void loadMeta() throws IOException {
  // TODO: use a thread pool
  regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() {
    @Override
    public void visitRegionState(final RegionInfo regionInfo, final State state,
        final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) {
      final RegionStateNode regionNode = regionStates.getOrCreateRegionNode(regionInfo);
      synchronized (regionNode) {
        if (!regionNode.isInTransition()) {
          regionNode.setState(state);
          regionNode.setLastHost(lastHost);
          regionNode.setRegionLocation(regionLocation);
          regionNode.setOpenSeqNum(openSeqNum);
          if (state == State.OPEN) {
            assert regionLocation != null : "found null region location for " + regionNode;
            regionStates.addRegionToServer(regionLocation, regionNode);
          } else if (state == State.OFFLINE || regionInfo.isOffline()) {
            regionStates.addToOfflineRegions(regionNode);
          } else {
            // These regions should have a procedure in replay
            regionStates.addRegionInTransition(regionNode, null);
          }
        }
      }
    }
  });
  // every assignment is blocked until meta is loaded.
  wakeMetaLoadedEvent();
}
{code}
> [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_regions 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers
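As a rough illustration of the guard in the loadMeta() excerpt above, here is a stripped-down sketch. It does not use any HBase types: `LoadMetaSketch`, `Node`, `visit`, and the string region names are all hypothetical. It only shows the one behaviour the comment relies on, namely that a node already marked as in transition (because a replayed procedure owns it) is not overwritten with the state read from meta.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the isInTransition() guard; NOT HBase code.
final class LoadMetaSketch {
  enum State { OFFLINE, OPENING, OPEN, CLOSING }

  static final class Node {
    State state = State.OFFLINE;
    // Assumed to be true when a replayed procedure already owns the region.
    boolean inTransition;
  }

  private final Map<String, Node> nodes = new HashMap<>();

  Node getOrCreate(String region) {
    return nodes.computeIfAbsent(region, r -> new Node());
  }

  // Apply the state found in meta unless the node is already in transition.
  void visit(String region, State stateInMeta) {
    Node node = getOrCreate(region);
    synchronized (node) {
      if (!node.inTransition) {
        node.state = stateInMeta;
      }
    }
  }

  public static void main(String[] args) {
    LoadMetaSketch sketch = new LoadMetaSketch();
    Node a = sketch.getOrCreate("A");
    a.state = State.CLOSING;
    a.inTransition = true; // an old unassign procedure is being replayed
    sketch.visit("A", State.OPEN); // the meta row still says OPEN
    sketch.visit("B", State.OPEN);
    System.out.println("A=" + a.state + " B=" + sketch.getOrCreate("B").state);
  }
}
```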
[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211714#comment-16211714 ] Yi Liang commented on HBASE-18984: -- {quote} 2. UnAssignProcedure run first, this region will be assigned as OPEN. => wrong {quote} I just checked the code, and the above situation can happen because HMaster#startProcedureExecutor runs before AssignmentManager#joinCluster(). startProcedureExecutor starts the ProcedureExecutor and the ProcedureStore and also begins the actual load of old procedures. joinCluster reads meta and bulk-assigns regions. I think we could defer loading the old procedures until at least meta is recovered, or even until all user regions are loaded (so the above situation would not happen). What do you think? [~stack] [~ram_krish] > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_regions 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers
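The ordering proposed in the comment above (replay old procedures only after meta has been loaded) could be sketched with a latch. This is a toy model under stated assumptions, not how HMaster is wired: `StartupOrderSketch`, the event strings, and the use of `CountDownLatch` are all illustrative choices.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Toy model of "replay procedures only after meta is loaded"; NOT HBase code.
final class StartupOrderSketch {
  static List<String> run() {
    List<String> events = new CopyOnWriteArrayList<>();
    CountDownLatch metaLoaded = new CountDownLatch(1);

    // Stands in for the procedure executor replaying old procedures.
    Thread replay = new Thread(() -> {
      try {
        metaLoaded.await(); // block until meta has been read
        events.add("replay old procedures");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    replay.start();

    // Stands in for the assignment manager reading meta.
    events.add("load meta");
    metaLoaded.countDown();

    try {
      replay.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return events;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

The latch makes the order deterministic even though the replay thread is started first, which is the property the comment asks for.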
[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211625#comment-16211625 ] Yi Liang commented on HBASE-18984: -- {quote}// 2. UnAssignProcedure run first, this region will be assigned as OPEN. => wrong{quote} If we can make sure that loading regions happens before restoring failed procedures when the master restarts, then this situation would not happen. Let me check the code. > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_regions 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers
[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-18984: - Attachment: HBASE-18984-V1-master.patch > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_reigons 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-18984: - Attachment: (was: HBASE-18984-V1-master.patch) > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: Screen Shot 2017-10-10 at 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_reigons 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208493#comment-16208493 ] Yi Liang commented on HBASE-18984: -- Hi [~ram_krish], this patch only contains some cleanup of the region node status updates that may be related to this JIRA. Could you help review it? Thanks. > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_regions 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers
[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-18984: - Attachment: HBASE-18984-V1-master.patch > [AMv2] Retain assignment does not work well in AMv2 > --- > > Key: HBASE-18984 > URL: https://issues.apache.org/jira/browse/HBASE-18984 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at > 2.24.19 PM.png > > > work on 8.17 Retain assignment in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k > To reproduce this error, in hbase shell: > createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> > list_reigons 't1' (maybe you need to try enable/disable multiple times) > See attached images. same region assigned to different region servers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-18984: - Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-19017) EnableTableProcedure is not retaining the assignments
[ https://issues.apache.org/jira/browse/HBASE-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206711#comment-16206711 ] Yi Liang commented on HBASE-19017: -- Reviewed your patch; the fix is correct. For HBASE-18984, I also added some cleanup in AssignProcedure. You can commit this one first, and I will rebase my patch there. The problem I found does not seem to be related to retain assignment; I will try to reproduce it and may open a new jira for it. > EnableTableProcedure is not retaining the assignments > - > > Key: HBASE-19017 > URL: https://issues.apache.org/jira/browse/HBASE-19017 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0-alpha-3 >Reporter: ramkrishna.s.vasudevan >Assignee: ramkrishna.s.vasudevan > Fix For: 2.0.0-beta-1 > > Attachments: HBASE-19017.patch > > > Found this while working on HBASE-18946. In branch-1.4, whenever we enable a > table we try to retain assignment. > But in branch-2 and trunk the EnableTableProcedure tries to get the location > from the existing regionNode. It always returns null because, during the > region CLOSE that happens while disabling a table, the regionNode's 'regionLocation' is > set to null while 'lastHost' still holds the server name where the > region was hosted. When assigning again, we look at the > last regionLocation and not at 'lastHost', so we go ahead with a new > assignment. 
> On region CLOSE while disable table
> {code}
> public void markRegionAsClosed(final RegionStateNode regionNode) throws IOException {
>   final RegionInfo hri = regionNode.getRegionInfo();
>   synchronized (regionNode) {
>     State state = regionNode.transitionState(State.CLOSED, RegionStates.STATES_EXPECTED_ON_CLOSE);
>     regionStates.removeRegionFromServer(regionNode.getRegionLocation(), regionNode);
>     regionNode.setLastHost(regionNode.getRegionLocation());
>     regionNode.setRegionLocation(null);
>     regionStateStore.updateRegionLocation(regionNode.getRegionInfo(), state,
>       regionNode.getRegionLocation()/*null*/, regionNode.getLastHost(),
>       HConstants.NO_SEQNUM, regionNode.getProcedure().getProcId());
>     sendRegionClosedNotification(hri);
>   }
> {code}
> In AssignProcedure
> {code}
> ServerName lastRegionLocation = regionNode.offline();
> {code}
> {code}
> public ServerName setRegionLocation(final ServerName serverName) {
>   ServerName lastRegionLocation = this.regionLocation;
>   if (LOG.isTraceEnabled() && serverName == null) {
>     LOG.trace("Tracking when we are set to null " + this, new Throwable("TRACE"));
>   }
>   this.regionLocation = serverName;
>   this.lastUpdate = EnvironmentEdgeManager.currentTime();
>   return lastRegionLocation;
> }
> {code}
> So further code in AssignProcedure
> {code}
> boolean retain = false;
> if (!forceNewPlan) {
>   if (this.targetServer != null) {
>     retain = targetServer.equals(lastRegionLocation);
>     regionNode.setRegionLocation(targetServer);
>   } else {
>     if (lastRegionLocation != null) {
>       // Try and keep the location we had before we offlined.
>       retain = true;
>       regionNode.setRegionLocation(lastRegionLocation);
>     }
>   }
> }
> {code}
> Tries to do retainAssignment but fails because lastRegionLocation is always null.
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
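The failure mode described in the issue above can be illustrated with a small, self-contained model. This is only a sketch, not HBase code: `RetainSketch`, `Node`, `markClosed` and `pickRetainTarget` are hypothetical names, server names are plain strings, and the fallback to `lastHost` is an assumption about one way the retain decision could be made, since the actual patch is not shown in this thread.

```java
import java.util.Optional;

/** Hypothetical, simplified model of the retain decision; not HBase code. */
final class RetainSketch {

  /** Stand-in for a region state node: only the two fields discussed in the issue. */
  static final class Node {
    String regionLocation; // cleared to null when the region is closed
    String lastHost;       // remembers where the region was last hosted

    /** Mirrors the quoted close path: remember the host, then clear the location. */
    void markClosed() {
      lastHost = regionLocation;
      regionLocation = null;
    }
  }

  /** Returns the server to retain, or empty when a fresh plan is needed. */
  static Optional<String> pickRetainTarget(Node node, String targetServer, boolean forceNewPlan) {
    if (forceNewPlan) {
      return Optional.empty();
    }
    if (targetServer != null) {
      return Optional.of(targetServer);
    }
    // Assumed behaviour: after a close, regionLocation is null, so consult lastHost too.
    String previous = node.regionLocation != null ? node.regionLocation : node.lastHost;
    return Optional.ofNullable(previous);
  }

  public static void main(String[] args) {
    Node node = new Node();
    node.regionLocation = "rs1.example.com,16020,1";
    node.markClosed(); // what disabling the table does to the node
    System.out.println(pickRetainTarget(node, null, false));
  }
}
```

In this model, a decision based only on `regionLocation` would come back empty after `markClosed()`, which corresponds to the "always null" behaviour reported in the issue.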
[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206262#comment-16206262 ] Yi Liang commented on HBASE-18984: -- [~ram_krish], I have the same fix. After I deployed it on a real cluster, I saw some problems when restarting HBase: it gives errors such as "hbase:meta is not online".
[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205208#comment-16205208 ] Yi Liang commented on HBASE-18984: -- The reason retain assignment does not work well after disable/enable is that every time we unassign the region, we set regionLocation to null in the regionStateNode. When we assign the region again, it loads this null as the current region location, and when the region location is null, the AM assigns the region to a random region server. I have already fixed the above issue, but I see other errors when restarting the cluster. Still debugging.
[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199515#comment-16199515 ] Yi Liang commented on HBASE-18984: -- I will do some research on this. Thanks for the information
[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199464#comment-16199464 ] Yi Liang commented on HBASE-18984: -- ping [~stack], do you know about retain assignment? I just want to make sure this is a real problem before digging into it.
[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
[ https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-18984: - Attachment: Screen Shot 2017-10-10 at 2.24.19 PM.png
[jira] [Created] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2
Yi Liang created HBASE-18984: Summary: [AMv2] Retain assignment does not work well in AMv2 Key: HBASE-18984 URL: https://issues.apache.org/jira/browse/HBASE-18984 Project: HBase Issue Type: Bug Components: proc-v2 Affects Versions: 2.0.0 Reporter: Yi Liang Assignee: Yi Liang Fix For: 2.0.0 work on 8.17 Retain assignment in https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k To reproduce this error, in hbase shell: createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> list_regions 't1' (you may need to try enable/disable multiple times) See attached images: the same region is assigned to different region servers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18108) Procedure WALs are archived but not cleaned; fix
[ https://issues.apache.org/jira/browse/HBASE-18108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197760#comment-16197760 ] Yi Liang commented on HBASE-18108: -- Hi Peter, I will take a look today or tomorrow > Procedure WALs are archived but not cleaned; fix > > > Key: HBASE-18108 > URL: https://issues.apache.org/jira/browse/HBASE-18108 > Project: HBase > Issue Type: Sub-task > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: stack >Assignee: Peter Somogyi >Priority: Blocker > Fix For: 2.0.0 > > Attachments: HBASE-18108.master.001.patch, > HBASE-18108.master.002.patch > > > The Procedure WAL files used to be deleted when done. HBASE-14614 keeps them > around in case issue but what is missing is a GC for no-longer-needed WAL > files. This one is pretty important. > From WALProcedureStore Cleaner TODO in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.r2pc835nb7vi -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590
[ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189961#comment-16189961 ] Yi Liang commented on HBASE-16894: -- For branch-1, I cannot access the unit test result page above to see the details, but all those tests pass locally. For branch-2/master, we get an all-green run. I think both patches are good to commit. Any comments, [~apurtell]? Thanks > Create more than 1 split per region, generalize HBASE-12590 > --- > > Key: HBASE-16894 > URL: https://issues.apache.org/jira/browse/HBASE-16894 > Project: HBase > Issue Type: Improvement >Affects Versions: 3.0.0, 2.0.0-alpha-2 >Reporter: Enis Soztutar >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, > HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, > ImplementaionAndSomeQuestion.docx > > > A common request from users is to be able to better control how many map > tasks are created per region. Right now, it is always 1 region = 1 input > split = 1 map task. Same goes for Spark since it uses the TIF. With region > sizes as large as 50 GBs, it is desirable to be able to create more than 1 > split per region. > HBASE-12590 adds a config property for MR jobs to be able to handle skew in > region sizes. The algorithm is roughly: > {code} > If (region size >= average size*ratio) : cut the region into two MR input > splits > If (average size <= region size < average size*ratio) : one region as one MR > input split > If (sum of several continuous regions size < average size * ratio): combine > these regions into one MR input split. > {code} > Although we can set data skew ratio to be 0.5 or something to abuse > HBASE-12590 into creating more than 1 split task per region, it is not ideal. > But there is no way to create more with the patch as it is. For example we > cannot create more than 2 tasks per region. 
> If we want to fix this properly, we should extend the approach in > HBASE-12590, and make it so that the client can specify the desired num of > mappers, or desired split size, and the TIF generates the splits based on the > current region sizes very similar to the algorithm in HBASE-12590, but a more > generic way. This also would eliminate the hand tuning of data skew ratio. > We also can think about the guidepost approach that Phoenix has in the stats > table which is used for exactly this purpose. Right now, the region can be > split into powers of two assuming uniform distribution within the region. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
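The three rules quoted from HBASE-12590 can be sketched as a small planner over a list of region sizes. This is an illustration under stated assumptions, not the TableInputFormat implementation: `SplitPlanSketch` and `planSplits` are hypothetical names, a region that is cut is assumed to be cut into two equal halves, and small regions are merged with their successors for as long as the running total stays below `average * ratio`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the skew-based split rules; not the HBase TIF code. */
final class SplitPlanSketch {

  /** Returns the planned input split sizes, in region order. */
  static List<Long> planSplits(long[] regionSizes, double ratio) {
    List<Long> splits = new ArrayList<>();
    if (regionSizes.length == 0) {
      return splits;
    }
    long total = 0;
    for (long size : regionSizes) {
      total += size;
    }
    double average = (double) total / regionSizes.length;
    double limit = average * ratio;

    int i = 0;
    while (i < regionSizes.length) {
      long size = regionSizes[i];
      if (size >= limit) {
        // Rule 1: a skewed region is cut into two splits (assumed equal halves).
        long half = size / 2;
        splits.add(half);
        splits.add(size - half);
        i++;
      } else if (size >= average) {
        // Rule 2: an average-sized region maps to exactly one split.
        splits.add(size);
        i++;
      } else {
        // Rule 3: merge consecutive regions while the total stays under the limit.
        long combined = size;
        i++;
        while (i < regionSizes.length && combined + regionSizes[i] < limit) {
          combined += regionSizes[i];
          i++;
        }
        splits.add(combined);
      }
    }
    return splits;
  }

  public static void main(String[] args) {
    // Average is 40 and the limit is 60 with ratio 1.5.
    System.out.println(planSplits(new long[] {10, 10, 100, 40, 40}, 1.5));
  }
}
```

Under these rules a region never yields more than two splits, whatever the ratio, which is the limitation this issue proposes to lift by letting the client state a desired number of mappers or a desired split size.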
[jira] [Commented] (HBASE-18105) [AMv2] Split/Merge need cleanup; currently they diverge and do not fully embrace AMv2 world
[ https://issues.apache.org/jira/browse/HBASE-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188894#comment-16188894 ] Yi Liang commented on HBASE-18105: -- [~stack], do you have any other AMv2-related tasks that I can help with? I am quite free this week. :) I found some issues with region states in AMv2, but will start fixing them after HBASE-18490 is done. Also, as we discussed in HBASE-18803, how are we going to deal with the curator jar? A shaded jar? Thanks > [AMv2] Split/Merge need cleanup; currently they diverge and do not fully > embrace AMv2 world > --- > > Key: HBASE-18105 > URL: https://issues.apache.org/jira/browse/HBASE-18105 > Project: HBase > Issue Type: Sub-task > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: Yi Liang > Fix For: 2.0.0-alpha-4 > > Attachments: HBASE-14350-V1-master.patch > > > Region Split and Merge work on the new AMv2 but they work differently. This > issue is about bringing them back together and fully embracing the AMv2 > program. > They both have issues mostly the fact that they carry around baggage no > longer necessary in the new world of assignment. > Here are some of the items: > Split and Merge metadata modifications are done by the Master now but we have > vestige of Split/Merge on RS still; e.g. when we SPLIT, we ask the Master > which asks the RS, which turns around, and asks the Master to run the > operation. Fun. MERGE is all done Master-side. > > Clean this up. Remove asking RS to run SPLIT and remove RegionMergeRequest, > etc. on RS-side. Also remove PONR. We don’t Points-Of-No-Return now we are up > on Pv2. Remove all calls in Interfaces; they are unused. Make RS still able > to detect when split, but have it be a client of Master like anyone else. > Split is Async but does not return procId > Split is async. Doesn’t return the procId though. Merge does. Fix. Only hard > part here I think is the Admin API does not allow procid return. 
> Flags > Currently OFFLINE is determined by looking either at the master instance of > HTD (isOffline) and/or at the RegionState#state. Ditto for SPLIT. For MERGE, > we rely on RegionState#state. Related is a note above on how split works -- > there is a split flag in HTD when there should not be. > > TODO is move to rely on RegionState#state exclusively in Master. > From Split/Merge Procedures need finishing in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.4b60dc1h4m1f -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590
[ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188589#comment-16188589 ] Yi Liang commented on HBASE-16894: -- The patch for the master branch seems OK. Retrying branch-1; the errors above in branch-1 pass locally.
[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590
[ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-16894: - Attachment: (was: HBASE-16894.branch-1.patch)
[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590
[ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-16894: - Attachment: HBASE-16894.branch-1.patch
[jira] [Comment Edited] (HBASE-18105) [AMv2] Split/Merge need cleanup; currently they diverge and do not fully embrace AMv2 world
[ https://issues.apache.org/jira/browse/HBASE-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188426#comment-16188426 ] Yi Liang edited comment on HBASE-18105 at 10/2/17 5:06 PM: --- [~stack], if it is not good to change the proto indices, I can keep the original values and add a comment. I think the other changes and test cases are OK. After this jira is resolved, I think the cleanup for split/merge is almost done. was (Author: easyliangjob): [~stack], if it is not good to change the proto indices, I can keep the original values and add a comment. But the other changes and test cases should be OK.
[jira] [Commented] (HBASE-18105) [AMv2] Split/Merge need cleanup; currently they diverge and do not fully embrace AMv2 world
[ https://issues.apache.org/jira/browse/HBASE-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188426#comment-16188426 ] Yi Liang commented on HBASE-18105: -- [~stack], if it is not good to change the proto indices, I can keep the original values and add a comment. But the other changes and test cases should be OK.
[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590
[ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-16894: - Attachment: HBASE-16894.master.patch > Create more than 1 split per region, generalize HBASE-12590 > --- > > Key: HBASE-16894 > URL: https://issues.apache.org/jira/browse/HBASE-16894 > Project: HBase > Issue Type: Improvement >Affects Versions: 3.0.0, 2.0.0-alpha-2 >Reporter: Enis Soztutar >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, > HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, > ImplementaionAndSomeQuestion.docx > > > A common request from users is to be able to better control how many map > tasks are created per region. Right now, it is always 1 region = 1 input > split = 1 map task. Same goes for Spark since it uses the TIF. With region > sizes as large as 50 GBs, it is desirable to be able to create more than 1 > split per region. > HBASE-12590 adds a config property for MR jobs to be able to handle skew in > region sizes. The algorithm is roughly: > {code} > If (region size >= average size*ratio) : cut the region into two MR input > splits > If (average size <= region size < average size*ratio) : one region as one MR > input split > If (sum of several continuous regions size < average size * ratio): combine > these regions into one MR input split. > {code} > Although we can set data skew ratio to be 0.5 or something to abuse > HBASE-12590 into creating more than 1 split task per region, it is not ideal. > But there is no way to create more with the patch as it is. For example we > cannot create more than 2 tasks per region. 
> If we want to fix this properly, we should extend the approach in > HBASE-12590, and make it so that the client can specify the desired number of > mappers, or the desired split size, and the TIF generates the splits based on the > current region sizes, very similar to the algorithm in HBASE-12590 but in a more > generic way. This would also eliminate the hand tuning of the data skew ratio. > We can also consider the guidepost approach that Phoenix has in the stats > table, which is used for exactly this purpose. Right now, the region can be > split into powers of two, assuming uniform distribution within the region. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
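The rule quoted from HBASE-12590 and the generalization proposed above can be illustrated with a small sketch. This is not the HBase implementation: the function names `skew_aware_splits` and `splits_for_target_size`, the tuple layout, and the choice to halve an oversized region by size are all assumptions made for illustration, based only on the pseudocode in the issue description.

```python
import math
from typing import List, Tuple

Region = Tuple[str, float]       # (region name, size), listed in row-key order
Split = Tuple[List[str], float]  # (regions covered by the split, approximate size)


def skew_aware_splits(regions: List[Region], ratio: float) -> List[Split]:
    """Hypothetical rendering of the ratio-based rule described in the issue."""
    if not regions:
        return []
    average = sum(size for _, size in regions) / len(regions)
    threshold = average * ratio
    splits: List[Split] = []
    i = 0
    while i < len(regions):
        name, size = regions[i]
        if size >= threshold:
            # Oversized region: cut it into two input splits (never more).
            splits.append(([name], size / 2))
            splits.append(([name], size / 2))
            i += 1
        elif size >= average:
            # Ordinary region: one region becomes one input split.
            splits.append(([name], size))
            i += 1
        else:
            # Small region: absorb the following regions while the
            # running total stays below the threshold.
            names, total = [name], size
            i += 1
            while i < len(regions) and total + regions[i][1] < threshold:
                names.append(regions[i][0])
                total += regions[i][1]
                i += 1
            splits.append((names, total))
    return splits


def splits_for_target_size(size: float, target_split_size: float) -> int:
    """Proposed generalization (assumed form): as many splits as the size needs."""
    return max(1, math.ceil(size / target_split_size))
```

Under these assumptions, a 50 GB region with a 10 GB target split size yields `splits_for_target_size(50, 10) == 5`, whereas the ratio-based rule can never produce more than two splits for a single region, whatever ratio is chosen.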
[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590
[ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-16894: - Attachment: (was: HBASE-16894.master.patch)
[jira] [Commented] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590
[ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186590#comment-16186590 ] Yi Liang commented on HBASE-16894: -- Hi [~apurtell] Thanks. I have provided two patches, one for branch-1.0 and one for master/branch-2.0. Let me know if you have any questions. {code} HBASE-16894.branch-1.patch HBASE-16894.master.patch {code}
[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590
[ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-16894: - Attachment: (was: HBASE-12590-v1.patch)
[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590
[ https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liang updated HBASE-16894: - Attachment: HBASE-16894.branch-1.patch HBASE-16894.master.patch > Attachments: HBASE-12590-v1.patch, HBASE-16894.branch-1.patch, > HBASE-16894.master.patch, HBASE-16894-V2-master.patch, > HBASE-16894-V3-master.patch, ImplementaionAndSomeQuestion.docx
[jira] [Commented] (HBASE-18894) null pointer exception in list_regions in shell command
[ https://issues.apache.org/jira/browse/HBASE-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186069#comment-16186069 ] Yi Liang commented on HBASE-18894: -- Yeah, the last two are fixed. Thanks. > null pointer exception in list_regions in shell command > --- > > Key: HBASE-18894 > URL: https://issues.apache.org/jira/browse/HBASE-18894 > Project: HBase > Issue Type: Bug >Affects Versions: 2.0.0-alpha-3 >Reporter: Yi Liang >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-18894-v1-master.patch, > HBASE-18894-v2-master.patch, HBASE-18894-v3-master.patch > > > This error appears when running the list_regions command after disabling 't1', > or after running split on 't1' before the split completes. > It is caused by the region being disabled or still in transition. > {quote} > list_regions 't1' > ERROR: undefined method `getDataLocality' for nil:NilClass > {quote} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-18105) [AMv2] Split/Merge need cleanup; currently they diverge and do not fully embrace AMv2 world
[ https://issues.apache.org/jira/browse/HBASE-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185091#comment-16185091 ] Yi Liang commented on HBASE-18105: -- [~stack], any thoughts about this patch? > [AMv2] Split/Merge need cleanup; currently they diverge and do not fully > embrace AMv2 world > --- > > Key: HBASE-18105 > URL: https://issues.apache.org/jira/browse/HBASE-18105 > Project: HBase > Issue Type: Sub-task > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: stack >Assignee: Yi Liang > Fix For: 2.0.0 > > Attachments: HBASE-14350-V1-master.patch > > > Region Split and Merge work on the new AMv2 but they work differently. This > issue is about bringing them back together and fully embracing the AMv2 > program. > They both have issues, mostly the fact that they carry around baggage no > longer necessary in the new world of assignment. > Here are some of the items: > Split and Merge metadata modifications are done by the Master now but we have > vestiges of Split/Merge on RS still; e.g. when we SPLIT, we ask the Master > which asks the RS, which turns around, and asks the Master to run the > operation. Fun. MERGE is all done Master-side. > > Clean this up. Remove asking RS to run SPLIT and remove RegionMergeRequest, > etc. on RS-side. Also remove PONR. We don’t need Points-Of-No-Return now that we are up > on Pv2. Remove all calls in Interfaces; they are unused. Make RS still able > to detect when to split, but have it be a client of Master like anyone else. > Split is Async but does not return procId > Split is async. Doesn’t return the procId though. Merge does. Fix. Only hard > part here I think is the Admin API does not allow procid return. > Flags > Currently OFFLINE is determined by looking either at the master instance of > HTD (isOffline) and/or at the RegionState#state. Ditto for SPLIT. For MERGE, > we rely on RegionState#state. 
Related is a note above on how split works -- > there is a split flag in HTD when there should not be. > > The TODO is to move to relying on RegionState#state exclusively in Master. > From "Split/Merge Procedures need finishing" in > https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.4b60dc1h4m1f -- This message was sent by Atlassian JIRA (v6.4.14#64029)