[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2018-04-01 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421950#comment-16421950
 ] 

Yi Liang commented on HBASE-19287:
--

{quote}
java.io.IOException: Call to abhishekk3.pne.ven.veritas.com/10.210.62.30:16020 
failed on local exception: java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed
{quote}

The cause is not related to this jira, and I am not sure why it happens. Please 
make sure your Kerberos is set up correctly. Also, to run a secure HBase, you 
should also use a secure ZooKeeper.

For the HBase mailing lists, refer to https://hbase.apache.org/mail-lists.html

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
>Priority: Major
> Fix For: 2.0.0-beta-1, 2.0.0
>
> Attachments: HBASE-19287-master-v3.patch, 
> HBASE-19287-master-v3.patch, HBASE-19287-master-v4.patch, 
> hbase-19287-master-v2.patch, master.patch
>
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering server recovery; existingServer 
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Master doesn't enable ServerShutdownHandler during 
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> client.RpcRetryingCallerImpl: started=38594 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> 

[jira] [Comment Edited] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2018-04-01 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421950#comment-16421950
 ] 

Yi Liang edited comment on HBASE-19287 at 4/2/18 5:41 AM:
--

{quote}java.io.IOException: Call to 
abhishekk3.pne.ven.veritas.com/10.210.62.30:16020 failed on local exception: 
java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed
{quote}
The cause is not related to this jira, and I am not sure why it happens. Please 
make sure your Kerberos is set up correctly. Also, to run a secure HBase, you 
also need a secure ZooKeeper.

For the HBase mailing lists, refer to [https://hbase.apache.org/mail-lists.html]


was (Author: easyliangjob):
{quote}
java.io.IOException: Call to abhishekk3.pne.ven.veritas.com/10.210.62.30:16020 
failed on local exception: java.io.IOException: 
org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS 
initiate failed
{quote}

The cause is not related to this jira, and I am not sure why it happens. Please 
make sure your Kerberos is set up correctly. Also, to run a secure HBase, you 
should also use a secure ZooKeeper.

For the HBase mailing lists, refer to https://hbase.apache.org/mail-lists.html

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
>Priority: Major
> Fix For: 2.0.0-beta-1, 2.0.0
>
> Attachments: HBASE-19287-master-v3.patch, 
> HBASE-19287-master-v3.patch, HBASE-19287-master-v4.patch, 
> hbase-19287-master-v2.patch, master.patch
>
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering 

[jira] [Commented] (HBASE-19218) Master stuck thinking hbase:namespace is assigned after restart preventing intialization

2017-12-20 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299460#comment-16299460
 ] 

Yi Liang commented on HBASE-19218:
--

patch is good. +1

> Master stuck thinking hbase:namespace is assigned after restart preventing 
> intialization
> 
>
> Key: HBASE-19218
> URL: https://issues.apache.org/jira/browse/HBASE-19218
> Project: HBase
>  Issue Type: Bug
>Reporter: Josh Elser
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-19218.master.001.patch, 
> hbase-hbase-master-ctr-e134-1499953498516-282290-01-03.hwx.site.log.zip, 
> hbase-site.xml
>
>
> Our [~romil.choksi] brought this one to my attention after trying to get some 
> cluster tests running.
> The Master seems to have gotten stuck never initializing after it thinks that 
> hbase:namespace was already deployed on the cluster when it actually was not. 
> On a Master restart, it reads the location out of meta and assumes that it's 
> there (I assume this invalid entry is the issue):
> {noformat}
> 2017-11-08 00:29:17,556 INFO  
> [ctr-e134-1499953498516-282290-01-03:2.masterManager] 
> assignment.RegionStateStore: Load hbase:meta entry region={ENCODED => 
> f147f204a579b885c351bdc0a7ebbf94, NAME => 
> 'hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.', STARTKEY 
> => '', ENDKEY => ''} regionState=OPENING 
> lastHost=ctr-e134-1499953498516-282290-01-05.hwx.site,16020,1510084579728 
> regionLocation=ctr-e134-1499953498516-282290-01-05.hwx.site,16020,1510100695534
> {noformat}
> Prior to this, the RS5 went through the ServerCrashProcedure, but it looks 
> like this bailed out unexpectedly:
> {noformat}
> 2017-11-08 00:25:25,187 WARN  
> [ctr-e134-1499953498516-282290-01-03:2.masterManager] 
> master.ServerManager: Expiration of 
> ctr-e134-1499953498516-282290-01-05.hwx.site,16020,1510084579728 but 
> server not online
> 2017-11-08 00:25:25,187 INFO  [ProcExecWrkr-5] 
> procedure.ServerCrashProcedure: Start pid=36, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=ctr-e134-1499953498516-282290-01-03.hwx.site,16020,1510084580111, 
> splitWal=true, meta=false
> 2017-11-08 00:25:25,188 INFO  
> [ctr-e134-1499953498516-282290-01-03:2.masterManager] 
> master.ServerManager: Processing expiration of 
> ctr-e134-1499953498516-282290-01-05.hwx.site,16020,1510084579728 on 
> ctr-e134-1499953498516-282290-01-03.hwx.site,2,1510100690324
> ...
> 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: 
> CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94
> java.lang.NullPointerException
> at 
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
> at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:171)
> at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:223)
> at 
> org.apache.hadoop.hbase.master.assignment.AssignProcedure.updateTransition(AssignProcedure.java:252)
> at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309)
> at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:82)
> at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,239 FATAL [ProcExecWrkr-22] procedure2.ProcedureExecutor: 
> CODE-BUG: Uncaught runtime exception for pid=37, 
> state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException 
> via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=hbase:namespace, 
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException; 
> ServerCrashProcedure 
> server=ctr-e134-1499953498516-282290-01-05.hwx.site,16020,1510084579728, 
> splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
> at 
> 

[jira] [Commented] (HBASE-19218) Master stuck thinking hbase:namespace is assigned after restart preventing intialization

2017-12-20 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299221#comment-16299221
 ] 

Yi Liang commented on HBASE-19218:
--

I have seen this error in my cluster as well
{quote}
2017-11-08 00:25:25,260 INFO  [ProcExecWrkr-18] 
procedure.MasterProcedureScheduler: pid=40, ppid=37, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, 
region=f147f204a579b885c351bdc0a7ebbf94 hbase:namespace 
hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.
2017-11-08 00:25:25,263 INFO  [ProcExecWrkr-18] assignment.AssignProcedure: 
Start pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OFFLINE, 
location=node_5,16020,1510084579728; forceNewPlan=false, retain=true

.


2017-11-08 00:25:26,040 INFO  [ProcExecWrkr-23] procedure.ServerCrashProcedure: 
pid=42 found RIT pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; 
rit=OPENING, location=node_5,16020,1510100695534
2017-11-08 00:25:26,040 WARN  [ProcExecWrkr-23] 
assignment.RegionTransitionProcedure: Remote call failed pid=40, ppid=37, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OPENING, 
location=node_5,16020,1510100695534; exception=ServerCrashProcedure pid=42, 
server=node_5,16020,1510100695534
2017-11-08 00:25:26,041 INFO  [ProcExecWrkr-23] assignment.AssignProcedure: 
Retry=1 of max=10; pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; 
rit=OPENING, location=node_5,16020,1510100695534
2017-11-08 00:25:26,193 INFO  [ProcExecWrkr-25] zookeeper.MetaTableLocator: 
Setting hbase:meta (replicaId=0) location in ZooKeeper as 
node_2,16020,1510100696388
2017-11-08 00:25:26,195 INFO  [ProcExecWrkr-25] 
assignment.RegionTransitionProcedure: Dispatch pid=44, ppid=43, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
region=1588230740; rit=OPENING, location=node_2,16020,1510100696388
2017-11-08 00:25:26,346 INFO  [ProcedureDispatcherTimeoutThread] 
procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
serverName=node_2,16020,1510100696388 version=2097152
2017-11-08 00:25:27,187 INFO  [ProcExecWrkr-4] hbase.MetaTableAccessor: Updated 
table hbase:meta state to ENABLED in META
2017-11-08 00:25:27,187 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
Setting hbase:meta (replicaId=0) location in ZooKeeper as 
node_2,16020,1510100696388
2017-11-08 00:25:27,209 INFO  [ProcExecWrkr-22] 
assignment.RegionTransitionProcedure: Dispatch pid=40, ppid=37, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OFFLINE, 
location=null
2017-11-08 00:25:27,210 INFO  [ProcExecWrkr-21] 
assignment.RegionTransitionProcedure: Dispatch pid=39, ppid=36, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:acl, 
region=24aadcb52fdc43e2ebcffe95d39b43ab; rit=OPENING, 
location=node_2,16020,1510100696388
2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: 
CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, 
region=f147f204a579b885c351bdc0a7ebbf94
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
{quote}


I traced the AssignProcedure of hbase:namespace through the log above:

1. An AssignProcedure is created for table=hbase:namespace, in state 
RUNNABLE:REGION_TRANSITION_QUEUE.
2. A ServerCrashProcedure (SCP) runs and handles the AssignProcedure of 
table=hbase:namespace. 
   According to the log, the AssignProcedure state has already been set to 
REGION_TRANSITION_DISPATCH by the time the SCP handles it. 
   The SCP sets the AssignProcedure back to REGION_TRANSITION_QUEUE and 
offlines the related region.
3. However, the AssignProcedure of hbase:namespace resumes from state 
REGION_TRANSITION_DISPATCH. 
   The SCP in step 2 has already offlined the region and set its location to 
null, so a NullPointerException can happen.

The problem is in step 3: the AssignProcedure should resume from state 
REGION_TRANSITION_QUEUE, but it actually resumes from 
REGION_TRANSITION_DISPATCH. 
This can happen because, when the SCP calls remoteCallFailed for the 
AssignProcedure of hbase:namespace, the ProcedureExecutor is running that same 
AssignProcedure in state REGION_TRANSITION_DISPATCH at the same time. 
If the SCP's handleFailure (which sets the region location to null) for 
hbase:namespace runs before AssignProcedure#addToRemoteDispatcher, the 
NullPointerException occurs.
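
To make the ordering problem concrete, here is a minimal Java sketch of the interleaving described in the three steps. It is not HBase code: RegionNode, QUEUES, handleFailure, dispatchUnguarded and dispatchGuarded are hypothetical stand-ins I made up for illustration. The only fact it relies on is that ConcurrentHashMap.get(null) throws a NullPointerException, which matches the top frame of the stack trace in the log.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical model of the race; none of these names are real HBase classes.
final class AssignRaceSketch {
    enum State { REGION_TRANSITION_QUEUE, REGION_TRANSITION_DISPATCH }

    // Stand-in for the region state the AssignProcedure works on.
    static final class RegionNode {
        volatile String location; // null once the region has been offlined
        volatile State state;

        RegionNode(String location, State state) {
            this.location = location;
            this.state = state;
        }
    }

    // Stand-in for the dispatcher's per-server operation queues.
    static final Map<String, List<String>> QUEUES = new ConcurrentHashMap<>();

    // Step 2: the crash handling offlines the region and resets the state.
    static void handleFailure(RegionNode node) {
        node.location = null;
        node.state = State.REGION_TRANSITION_QUEUE;
    }

    // Step 3 as it fails: a worker still in DISPATCH looks up the target server.
    // With a null location, ConcurrentHashMap.get throws NullPointerException.
    static void dispatchUnguarded(RegionNode node) {
        List<String> queue = QUEUES.get(node.location);
        if (queue != null) {
            queue.add("open region");
        }
    }

    // One possible guard (an assumption, not the committed fix): read the
    // location once and fall back to QUEUE when it has been cleared.
    static boolean dispatchGuarded(RegionNode node) {
        String target = node.location;
        if (target == null) {
            node.state = State.REGION_TRANSITION_QUEUE;
            return false;
        }
        QUEUES.computeIfAbsent(target, k -> new CopyOnWriteArrayList<>())
              .add("open region");
        node.state = State.REGION_TRANSITION_DISPATCH;
        return true;
    }
}
```

In this sketch, calling handleFailure and then dispatchUnguarded on the same node reproduces the exception, while dispatchGuarded returns false and leaves the node in REGION_TRANSITION_QUEUE so that it can be planned again.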

  
And if we can catch the nullpointerexception and set state back to 
REGION_TRANSITION_QUEUE, 

[jira] [Updated] (HBASE-19556) Remove TestAssignmentManager#testGoodSplit, which no longer make sense

2017-12-19 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19556:
-
Fix Version/s: 2.0.0-beta-1
Affects Version/s: 2.0.0
   Status: Patch Available  (was: Open)

> Remove TestAssignmentManager#testGoodSplit, which no longer make sense
> --
>
> Key: HBASE-19556
> URL: https://issues.apache.org/jira/browse/HBASE-19556
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
>Priority: Minor
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-19556-master-v1.patch
>
>
> {quote}
> [ERROR] 
> testGoodSplit(org.apache.hadoop.hbase.master.assignment.TestAssignmentManager)
>   Time elapsed: 0.478 s  <<< ERROR!
> java.io.IOException: 5a50732f7cb3a05dd3a297bacbc34943 NOT splittable
> {quote}
> GoodSplitExecutor can only mock some of the functions in RSProcedureDispatcher. 
> However, to test a split, we need to create (or mock) a real table on the region 
> server side, and GoodSplitExecutor cannot mock that kind of function.
> A similar test is already covered in TestSplitTableRegionProcedure, so 
> we no longer need to test it again in TestAssignmentManager.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HBASE-19556) Remove TestAssignmentManager#testGoodSplit, which no longer make sense

2017-12-19 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19556:
-
Attachment: HBASE-19556-master-v1.patch

> Remove TestAssignmentManager#testGoodSplit, which no longer make sense
> --
>
> Key: HBASE-19556
> URL: https://issues.apache.org/jira/browse/HBASE-19556
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Yi Liang
>Assignee: Yi Liang
>Priority: Minor
> Attachments: HBASE-19556-master-v1.patch
>
>
> {quote}
> [ERROR] 
> testGoodSplit(org.apache.hadoop.hbase.master.assignment.TestAssignmentManager)
>   Time elapsed: 0.478 s  <<< ERROR!
> java.io.IOException: 5a50732f7cb3a05dd3a297bacbc34943 NOT splittable
> {quote}
> GoodSplitExecutor can only mock some of the functions in RSProcedureDispatcher. 
> However, to test a split, we need to create (or mock) a real table on the region 
> server side, and GoodSplitExecutor cannot mock that kind of function.
> A similar test is already covered in TestSplitTableRegionProcedure, so 
> we no longer need to test it again in TestAssignmentManager.





[jira] [Updated] (HBASE-19556) Remove TestAssignmentManager#testGoodSplit, which no longer make sense

2017-12-19 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19556:
-
Issue Type: Sub-task  (was: Bug)
Parent: HBASE-18110

> Remove TestAssignmentManager#testGoodSplit, which no longer make sense
> --
>
> Key: HBASE-19556
> URL: https://issues.apache.org/jira/browse/HBASE-19556
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Yi Liang
>Assignee: Yi Liang
>Priority: Minor
>
> {quote}
> [ERROR] 
> testGoodSplit(org.apache.hadoop.hbase.master.assignment.TestAssignmentManager)
>   Time elapsed: 0.478 s  <<< ERROR!
> java.io.IOException: 5a50732f7cb3a05dd3a297bacbc34943 NOT splittable
> {quote}
> GoodSplitExecutor can only mock some of the functions in RSProcedureDispatcher. 
> However, to test a split, we need to create (or mock) a real table on the region 
> server side, and GoodSplitExecutor cannot mock that kind of function.
> A similar test is already covered in TestSplitTableRegionProcedure, so 
> we no longer need to test it again in TestAssignmentManager.





[jira] [Created] (HBASE-19556) Remove TestAssignmentManager#testGoodSplit, which no longer make sense

2017-12-19 Thread Yi Liang (JIRA)
Yi Liang created HBASE-19556:


 Summary: Remove TestAssignmentManager#testGoodSplit, which no 
longer make sense
 Key: HBASE-19556
 URL: https://issues.apache.org/jira/browse/HBASE-19556
 Project: HBase
  Issue Type: Bug
Reporter: Yi Liang
Assignee: Yi Liang
Priority: Minor


{quote}
[ERROR] 
testGoodSplit(org.apache.hadoop.hbase.master.assignment.TestAssignmentManager)  
Time elapsed: 0.478 s  <<< ERROR!
java.io.IOException: 5a50732f7cb3a05dd3a297bacbc34943 NOT splittable
{quote}

GoodSplitExecutor can only mock some of the functions in RSProcedureDispatcher. 
However, to test a split, we need to create (or mock) a real table on the region 
server side, and GoodSplitExecutor cannot mock that kind of function.

A similar test is already covered in TestSplitTableRegionProcedure, so we 
no longer need to test it again in TestAssignmentManager.
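
As a rough illustration of why an acknowledge-only mock is not enough for a split test, here is a small Java sketch. Everything in it is hypothetical (RegionServerStub, AckOnlyStub and split are made-up names, not the HBase test classes or API). It only assumes that a split first asks the region server side whether the region is splittable, and that a mock with no real region behind it cannot answer yes.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; these are not HBase classes.
final class MockSplitSketch {
    // Simplified view of what the master side needs from a region server.
    interface RegionServerStub {
        void openRegion(String encodedName) throws IOException;
        boolean isSplittable(String encodedName) throws IOException;
    }

    // A mock that acknowledges dispatched requests but has no real table or
    // store files behind it, so it can never report a region as splittable.
    static final class AckOnlyStub implements RegionServerStub {
        final Set<String> acked = new HashSet<>();

        @Override
        public void openRegion(String encodedName) {
            acked.add(encodedName);
        }

        @Override
        public boolean isSplittable(String encodedName) {
            return false;
        }
    }

    // Assumed shape of the precondition check that runs before a split.
    static void split(RegionServerStub rs, String encodedName) throws IOException {
        if (!rs.isSplittable(encodedName)) {
            throw new IOException(encodedName + " NOT splittable");
        }
        // A real split would continue here; omitted in this sketch.
    }
}
```

With AckOnlyStub, openRegion succeeds but split fails with an IOException of the same form as the test error quoted above, which is the situation a test backed by a real table on the region server side avoids.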





[jira] [Comment Edited] (HBASE-18110) [AMv2] Reenable tests temporarily disabled

2017-12-18 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295832#comment-16295832
 ] 

Yi Liang edited comment on HBASE-18110 at 12/18/17 11:29 PM:
-

Hi [~stack]
  I just checked some of the ignored tests and found that 
TestAssignmentManager#testGoodSplit() does not make sense. I think we can remove 
it. I can follow up with a jira if you are OK with it. 


was (Author: easyliangjob):
Hi [~stack]
  I just checked some of the ignored tests and found that 
TestAssignmentManager#testGoodSplit() does not make sense. I think we can remove 
it.

> [AMv2] Reenable tests temporarily disabled
> --
>
> Key: HBASE-18110
> URL: https://issues.apache.org/jira/browse/HBASE-18110
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Blocker
> Fix For: 2.0.0-beta-1
>
>
> We disabled tests that didn't make sense or relied on behavior not supported 
> by AMv2. Revisit and reenable after AMv2 gets committed. Here is the set 
> (from 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.rsj53tx4vlwj)
> testAllFavoredNodesDead and testAllFavoredNodesDeadMasterRestarted and 
> testMisplacedRegions in TestFavoredStochasticLoadBalancer … not sure what 
> this about.
> testRegionNormalizationMergeOnCluster in TestSimpleRegionNormalizerOnCluster 
> disabled for now till we fix up Merge.
> testMergeWithReplicas in TestRegionMergeTransactionOnCluster because don't 
> know how it is supposed to work.
> Admin#close does not update Master. Causes 
> testHBaseFsckWithFewerMetaReplicaZnodes in TestMetaWithReplicas to fail 
> (Master gets report about server closing when it didn’t run the close -- gets 
> freaked out).
> Disabled/Ignore TestRSGroupsOfflineMode#testOffline; need to dig in on what 
> offline is.
> Disabled/Ignore TestRSGroups.
> All tests that have to do w/ fsck:TestHBaseFsckTwoRS, 
> TestOfflineMetaRebuildBase TestHBaseFsckReplicas, 
> TestOfflineMetaRebuildOverlap, testChangingReplicaCount in 
> TestMetaWithReplicas (internally it is doing fscks which are killing RS)...
> FSCK test testHBaseFsckWithExcessMetaReplicas in TestMetaWithReplicas.
> So is testHBaseFsckWithFewerMetaReplicas in same class.
> TestHBaseFsckOneRS is fsck. Disabled.
> TestOfflineMetaRebuildHole is about rebuilding hole with fsck.
> Master carries meta:
> TestRegionRebalancing is disabled because doesn't consider the fact that 
> Master carries system tables only (fix of average in RegionStates brought out 
> the issue).
> Disabled testMetaAddressChange in TestMetaWithReplicas because presumes can 
> move meta... you can't
> TestAsyncTableGetMultiThreaded wants to move hbase:meta...Balancer does NPEs. 
> AMv2 won't let you move hbase:meta off Master.
> Disabled parts of...testCreateTableWithMultipleReplicas in 
> TestMasterOperationsForRegionReplicas There is an issue w/ assigning more 
> replicas if number of replicas is changed on us. See '/* DISABLED! FOR 
> NOW'.
> Disabled TestCorruptedRegionStoreFile. Depends on a half-implemented reopen 
> of a region when a store file goes missing; TODO.
> testRetainAssignmentOnRestart in TestRestartCluster does not work. AMv2 does 
> retain semantic differently. Fix. TODO.
> TestMasterFailover needs to be rewritten for AMv2. It uses tricks not 
> ordained when up on AMv2. The test is also hobbled by fact that we 
> religiously enforce that only master can carry meta, something we are loose 
> about in old AM.
> Fix Ignores in TestServerCrashProcedure. Master is different now.
> Offlining is done differently now: Because of this disabled testOfflineRegion 
> in TestAsyncRegionAdminApi
> Skipping delete of table after test in TestAccessController3 because of 
> access issues w/ AMv2. AMv1 seems to crash servers on exit too for same lack 
> of auth perms but AMv2 gets hung up. TODO. See cleanUp method.
> TestHCM#testMulti and TestHCM
> Fix TestMasterMetrics. Stuff is different now around startup which messes up 
> this test. Disabled two of three tests.
> I tried to fix TestMasterBalanceThrottling but it looks like 
> SimpleLoadBalancer is borked whether AMv2 or not.
> Disabled testPickers in TestFavoredStochasticBalancerPickers. It hangs.





[jira] [Commented] (HBASE-18110) [AMv2] Reenable tests temporarily disabled

2017-12-18 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295832#comment-16295832
 ] 

Yi Liang commented on HBASE-18110:
--

Hi [~stack]
  I just checked some of the ignored tests and found that 
TestAssignmentManager#testGoodSplit() does not make sense. I think we can remove 
it.

> [AMv2] Reenable tests temporarily disabled
> --
>
> Key: HBASE-18110
> URL: https://issues.apache.org/jira/browse/HBASE-18110
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: stack
>Priority: Blocker
> Fix For: 2.0.0-beta-1
>
>
> We disabled tests that didn't make sense or relied on behavior not supported 
> by AMv2. Revisit and reenable after AMv2 gets committed. Here is the set 
> (from 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.rsj53tx4vlwj)
> testAllFavoredNodesDead and testAllFavoredNodesDeadMasterRestarted and 
> testMisplacedRegions in TestFavoredStochasticLoadBalancer … not sure what 
> this about.
> testRegionNormalizationMergeOnCluster in TestSimpleRegionNormalizerOnCluster 
> disabled for now till we fix up Merge.
> testMergeWithReplicas in TestRegionMergeTransactionOnCluster because don't 
> know how it is supposed to work.
> Admin#close does not update Master. Causes 
> testHBaseFsckWithFewerMetaReplicaZnodes in TestMetaWithReplicas to fail 
> (Master gets report about server closing when it didn’t run the close -- gets 
> freaked out).
> Disabled/Ignore TestRSGroupsOfflineMode#testOffline; need to dig in on what 
> offline is.
> Disabled/Ignore TestRSGroups.
> All tests that have to do w/ fsck:TestHBaseFsckTwoRS, 
> TestOfflineMetaRebuildBase TestHBaseFsckReplicas, 
> TestOfflineMetaRebuildOverlap, testChangingReplicaCount in 
> TestMetaWithReplicas (internally it is doing fscks which are killing RS)...
> FSCK test testHBaseFsckWithExcessMetaReplicas in TestMetaWithReplicas.
> So is testHBaseFsckWithFewerMetaReplicas in same class.
> TestHBaseFsckOneRS is fsck. Disabled.
> TestOfflineMetaRebuildHole is about rebuilding hole with fsck.
> Master carries meta:
> TestRegionRebalancing is disabled because doesn't consider the fact that 
> Master carries system tables only (fix of average in RegionStates brought out 
> the issue).
> Disabled testMetaAddressChange in TestMetaWithReplicas because presumes can 
> move meta... you can't
> TestAsyncTableGetMultiThreaded wants to move hbase:meta...Balancer does NPEs. 
> AMv2 won't let you move hbase:meta off Master.
> Disabled parts of...testCreateTableWithMultipleReplicas in 
> TestMasterOperationsForRegionReplicas There is an issue w/ assigning more 
> replicas if number of replicas is changed on us. See '/* DISABLED! FOR 
> NOW'.
> Disabled TestCorruptedRegionStoreFile. Depends on a half-implemented reopen 
> of a region when a store file goes missing; TODO.
> testRetainAssignmentOnRestart in TestRestartCluster does not work. AMv2 does 
> retain semantics differently. Fix. TODO.
> TestMasterFailover needs to be rewritten for AMv2. It uses tricks not 
> ordained when up on AMv2. The test is also hobbled by fact that we 
> religiously enforce that only master can carry meta, something we are loose 
> about in old AM.
> Fix Ignores in TestServerCrashProcedure. Master is different now.
> Offlining is done differently now: Because of this disabled testOfflineRegion 
> in TestAsyncRegionAdminApi
> Skipping delete of table after test in TestAccessController3 because of 
> access issues w/ AMv2. AMv1 seems to crash servers on exit too for same lack 
> of auth perms but AMv2 gets hung up. TODO. See cleanUp method.
> TestHCM#testMulti and TestHCM
> Fix TestMasterMetrics. Stuff is different now around startup which messes up 
> this test. Disabled two of three tests.
> I tried to fix TestMasterBalanceThrottling but it looks like 
> SimpleLoadBalancer is borked whether AMv2 or not.
> Disabled testPickers in TestFavoredStochasticBalancerPickers. It hangs.





[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-15 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292933#comment-16292933
 ] 

Yi Liang commented on HBASE-19287:
--

[~stack], thanks for reviewing and adding the javadoc.  

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-19287-master-v3.patch, 
> HBASE-19287-master-v3.patch, HBASE-19287-master-v4.patch, 
> hbase-19287-master-v2.patch, master.patch
>
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering server recovery; existingServer 
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Master doesn't enable ServerShutdownHandler during 
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> client.RpcRetryingCallerImpl: started=38594 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on hadoop-slave2.hadoop,16020,1510342023184
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
> at 

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-13 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: HBASE-19287-master-v4.patch

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: HBASE-19287-master-v3.patch, 
> HBASE-19287-master-v3.patch, HBASE-19287-master-v4.patch, 
> hbase-19287-master-v2.patch, master.patch
>
>

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-12 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: HBASE-19287-master-v3.patch

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: HBASE-19287-master-v3.patch, 
> HBASE-19287-master-v3.patch, hbase-19287-master-v2.patch, master.patch
>
>

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-12 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: HBASE-19287-master-v3.patch

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: HBASE-19287-master-v3.patch, 
> hbase-19287-master-v2.patch, master.patch
>
>

[jira] [Comment Edited] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-11 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286762#comment-16286762
 ] 

Yi Liang edited comment on HBASE-19287 at 12/11/17 11:42 PM:
-

Added a new test case to handle the situation where a server crashes while meta is being assigned. 
The code is up for review at https://reviews.apache.org/r/64512/


was (Author: easyliangjob):
Add a new test case to handle situation when server crashed during assign meta.

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: hbase-19287-master-v2.patch, master.patch
>
>

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-11 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: hang.patch

Added a new test case to handle the situation where a server crashes while meta is being assigned.

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: hbase-19287-master-v2.patch, master.patch
>
>

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-11 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: hbase-19287-master-v2.patch

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: hbase-19287-master-v2.patch, master.patch

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-11 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: (was: hang.patch)

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: hbase-19287-master-v2.patch, master.patch

[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-07 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282599#comment-16282599
 ] 

Yi Liang commented on HBASE-19287:
--

OK, I will try to put them into the AssignmentManager (AM). Thanks for reviewing, [~stack].

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: master.patch

[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-07 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282374#comment-16282374
 ] 

Yi Liang commented on HBASE-19287:
--

See the log below:
{code}
2017-12-07 19:01:45,218 INFO  [ProcExecWrkr-1] procedure.RecoverMetaProcedure: 
pid=17, state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
failedMetaServer=null, splitWal=true; Retaining meta assignment to 
server=hadoop-slave1.hadoop,16020,1512673261766
2017-12-07 19:01:45,227 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=18, ppid=17, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766}]
2017-12-07 19:01:45,261 INFO  [ProcExecWrkr-3] 
procedure.MasterProcedureScheduler: pid=18, ppid=17, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766 hbase:meta 
hbase:meta,,1.1588230740
2017-12-07 19:01:45,266 INFO  [ProcExecWrkr-3] assignment.AssignProcedure: 
Start pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta, region=1588230740, 
target=hadoop-slave1.hadoop,16020,1512673261766; rit=OFFLINE, 
location=hadoop-slave1.hadoop,16020,1512673261766; forceNewPlan=false, 
retain=false
2017-12-07 19:01:45,419 INFO  [ProcExecWrkr-2] zookeeper.MetaTableLocator: 
Setting hbase:meta (replicaId=0) location in ZooKeeper as 
hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:45,426 INFO  [ProcExecWrkr-2] 
assignment.RegionTransitionProcedure: Dispatch pid=18, ppid=17, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; 
rit=OPENING, location=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:45,580 INFO  [ProcedureDispatcherTimeoutThread] 
procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
serverName=hadoop-slave2.hadoop,16020,1512673268932 version=2097152
2017-12-07 19:01:46,793 INFO  [main-EventThread] zookeeper.RegionServerTracker: 
RegionServer ephemeral node deleted, processing expiration 
[hadoop-slave2.hadoop,16020,1512673268932]
2017-12-07 19:01:46,793 INFO  [main-EventThread] master.ServerManager: Master 
doesn't enable ServerShutdownHandler during initialization, delay expiring 
server hadoop-slave2.hadoop,16020,1512673268932
{code}

*Without the patch, the master usually hangs at this point, as in the log above, 
and the assign procedure becomes 'dead'.
With the patch, the master notices this and wakes the meta assign procedure, 
which becomes active again and runs as shown below:*

{code}
2017-12-07 19:01:46,794 INFO  [main-EventThread] master.ServerManager: Meta has 
been assigned to crashed server: hadoop-slave2.hadoop,16020,1512673268932; will 
do re-assign
2017-12-07 19:01:46,794 WARN  [main-EventThread] 
assignment.RegionTransitionProcedure: Remote call failed pid=18, ppid=17, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; 
rit=OPENING, location=hadoop-slave2.hadoop,16020,1512673268932; 
exception=ServerCrashProcedure pid=18, 
server=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:46,797 INFO  [main-EventThread] assignment.AssignProcedure: 
Retry=1 of max=10; pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
AssignProcedure table=hbase:meta, region=1588230740, 
target=hadoop-slave1.hadoop,16020,1512673261766; rit=OPENING, 
location=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:46,798 INFO  [ProcExecWrkr-4] assignment.AssignProcedure: 
Start pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta, region=1588230740; rit=OFFLINE, location=null; 
forceNewPlan=true, retain=false
{code}
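
The recovery path in the second log can be pictured with a small model. This is only a sketch under stated assumptions, not the code in the attached patch: the class MetaAssignSketch, its fields, and the methods onServerExpired and remoteCallFailed are hypothetical names and are not HBase APIs. It illustrates the idea that when a region server expires while a meta assign request is still outstanding against it, the waiting procedure is failed and woken instead of being left suspended.

```java
/** Hypothetical, simplified model; none of these names are HBase APIs. */
final class MetaAssignSketch {
  enum State { QUEUE, DISPATCH }

  /** Stand-in for the meta AssignProcedure seen in the log. */
  static final class Assign {
    State state = State.DISPATCH;
    String location;          // server the open request was dispatched to
    boolean suspended = true; // waiting for that server to report back

    Assign(String location) {
      this.location = location;
    }

    /** Called when the dispatched request can no longer succeed. */
    void remoteCallFailed() {
      state = State.QUEUE; // start over from the queue state
      location = null;     // forget the crashed server
      suspended = false;   // wake the procedure so a worker can rerun it
    }
  }

  /**
   * Called when a region server's ephemeral node disappears.
   * Returns true if a stuck meta assignment was woken.
   */
  static boolean onServerExpired(String server, Assign metaAssign) {
    if (metaAssign != null && metaAssign.suspended
        && server.equals(metaAssign.location)) {
      metaAssign.remoteCallFailed();
      return true;
    }
    return false; // an unrelated server expired; leave the procedure alone
  }
}
```

In this model, expiry of an unrelated server leaves the procedure suspended; only expiry of the server it was dispatched to wakes it.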

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang
> Attachments: master.patch

[jira] [Comment Edited] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-07 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282374#comment-16282374
 ] 

Yi Liang edited comment on HBASE-19287 at 12/7/17 7:26 PM:
---

See the log below:
{code}
2017-12-07 19:01:45,218 INFO  [ProcExecWrkr-1] procedure.RecoverMetaProcedure: 
pid=17, state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
failedMetaServer=null, splitWal=true; Retaining meta assignment to 
server=hadoop-slave1.hadoop,16020,1512673261766
2017-12-07 19:01:45,227 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=18, ppid=17, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766}]
2017-12-07 19:01:45,261 INFO  [ProcExecWrkr-3] 
procedure.MasterProcedureScheduler: pid=18, ppid=17, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766 hbase:meta 
hbase:meta,,1.1588230740
2017-12-07 19:01:45,266 INFO  [ProcExecWrkr-3] assignment.AssignProcedure: 
Start pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta, region=1588230740, 
target=hadoop-slave1.hadoop,16020,1512673261766; rit=OFFLINE, 
location=hadoop-slave1.hadoop,16020,1512673261766; forceNewPlan=false, 
retain=false
2017-12-07 19:01:45,419 INFO  [ProcExecWrkr-2] zookeeper.MetaTableLocator: 
Setting hbase:meta (replicaId=0) location in ZooKeeper as 
hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:45,426 INFO  [ProcExecWrkr-2] 
assignment.RegionTransitionProcedure: Dispatch pid=18, ppid=17, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; 
rit=OPENING, location=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:45,580 INFO  [ProcedureDispatcherTimeoutThread] 
procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
serverName=hadoop-slave2.hadoop,16020,1512673268932 version=2097152
2017-12-07 19:01:46,793 INFO  [main-EventThread] zookeeper.RegionServerTracker: 
RegionServer ephemeral node deleted, processing expiration 
[hadoop-slave2.hadoop,16020,1512673268932]
2017-12-07 19:01:46,793 INFO  [main-EventThread] master.ServerManager: Master 
doesn't enable ServerShutdownHandler during initialization, delay expiring 
server hadoop-slave2.hadoop,16020,1512673268932
{code}
*Without the patch, the master usually hangs at this point, as in the log above, 
and the assign procedure becomes 'dead'.
With the patch, the master notices this and wakes the meta assign procedure, 
which becomes active again and runs as shown below:*

{code}
2017-12-07 19:01:46,794 INFO  [main-EventThread] master.ServerManager: Meta has 
been assigned to crashed server: hadoop-slave2.hadoop,16020,1512673268932; will 
do re-assign
2017-12-07 19:01:46,794 WARN  [main-EventThread] 
assignment.RegionTransitionProcedure: Remote call failed pid=18, ppid=17, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1512673261766; 
rit=OPENING, location=hadoop-slave2.hadoop,16020,1512673268932; 
exception=ServerCrashProcedure pid=18, 
server=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:46,797 INFO  [main-EventThread] assignment.AssignProcedure: 
Retry=1 of max=10; pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
AssignProcedure table=hbase:meta, region=1588230740, 
target=hadoop-slave1.hadoop,16020,1512673261766; rit=OPENING, 
location=hadoop-slave2.hadoop,16020,1512673268932
2017-12-07 19:01:46,798 INFO  [ProcExecWrkr-4] assignment.AssignProcedure: 
Start pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta, region=1588230740; rit=OFFLINE, location=null; 
forceNewPlan=true, retain=false
{code}
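
The last two entries in the log above (Retry=1 of max=10, followed by a restart with location=null and forceNewPlan=true) suggest a bounded retry that discards the old target. A minimal sketch of that behaviour follows, assuming a hypothetical class AssignRetrySketch that is not part of HBase. The limit of 10 is taken from the log; giving up once the limit is reached is an assumption, since the log only shows the first retry.

```java
/** Hypothetical model of a bounded assign retry; not HBase code. */
final class AssignRetrySketch {
  // "max=10" is taken from the log; where HBase reads it from is not shown.
  static final int MAX_ATTEMPTS = 10;

  int retries = 0;
  String location;              // server of the last dispatch, null when unset
  boolean forceNewPlan = false; // true means: choose a target afresh
  boolean gaveUp = false;

  AssignRetrySketch(String location) {
    this.location = location;
  }

  /**
   * Handles a failed remote call. Returns true if the assign restarts,
   * false once the retry budget is used up (assumed behaviour).
   */
  boolean onRemoteCallFailed() {
    retries++;
    if (retries >= MAX_ATTEMPTS) {
      gaveUp = true;
      return false;
    }
    location = null;     // drop the crashed server
    forceNewPlan = true; // do not reuse the previous plan
    return true;
  }
}
```

Clearing the location and forcing a new plan keeps the retry from being sent back to the server that just crashed.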



[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-07 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: master.patch

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
> Attachments: master.patch

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-07 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: (was: p1-master.patch)

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.0.0
> Reporter: Yi Liang
> Assignee: Yi Liang

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-06 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Affects Version/s: 2.0.0

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
> Attachments: p1-master.patch
>
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering server recovery; existingServer 
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Master doesn't enable ServerShutdownHandler during 
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on hadoop-slave2.hadoop,16020,1510342023184
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278)
> at 
> 

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-06 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Component/s: proc-v2

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
> Attachments: p1-master.patch
>
>

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-06 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: p1-master.patch

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
> Attachments: p1-master.patch
>
>

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-06 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: (was: p1.patch)

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>Reporter: Yi Liang
>Assignee: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering server recovery; existingServer 
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Master doesn't enable ServerShutdownHandler during 
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on hadoop-slave2.hadoop,16020,1510342023184
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258)
>  row 'hbase:namespace' on table 

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-06 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Status: Patch Available  (was: Open)

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>Reporter: Yi Liang
>Assignee: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering server recovery; existingServer 
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Master doesn't enable ServerShutdownHandler during 
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on hadoop-slave2.hadoop,16020,1510342023184
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258)
>  row 'hbase:namespace' on 

[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-06 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281118#comment-16281118
 ] 

Yi Liang commented on HBASE-19287:
--

After some investigation, I found that adding a complete timeout mechanism to 
the current Procedure framework will take time. I am not sure I can finish 
that before the hbase 2.0 release, so for now I am providing a fix based on 
the idea we discussed above:
{quote}
(2) Or at least, if we get a crash for the server we are currently trying to 
assign hbase:meta too during startup, we should notice and recalibrate the 
assign?
{quote}

This is a draft patch to run the unit tests against. I am still working on a 
new test case for this problem.
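
To illustrate the quoted idea, here is a minimal, self-contained sketch of 
noticing that the meta assign target has gone away and recalibrating. It is 
not the attached patch and does not use HBase classes: MetaAssignSketch, 
Server, isRestartOf and recalibrate are hypothetical names made up for this 
example, and "pick the first online server" is only a placeholder policy. The 
server names are taken from the log in the issue description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class MetaAssignSketch {

    /** Hypothetical stand-in for a server identity: host, port and start code. */
    static final class Server {
        final String host;
        final int port;
        final long startCode;

        Server(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        /** Same host and port with a different start code means the process restarted. */
        boolean isRestartOf(Server other) {
            return host.equals(other.host) && port == other.port
                && startCode != other.startCode;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Server)) {
                return false;
            }
            Server s = (Server) o;
            return host.equals(s.host) && port == s.port && startCode == s.startCode;
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port, startCode);
        }

        @Override
        public String toString() {
            return host + "," + port + "," + startCode;
        }
    }

    /**
     * Keeps the current target if it is still online; otherwise falls back to
     * the first online server (placeholder policy, an assumption of this sketch).
     */
    static Server recalibrate(Server target, List<Server> online) {
        if (online.contains(target)) {
            return target;
        }
        if (online.isEmpty()) {
            throw new IllegalStateException("no online servers to assign hbase:meta to");
        }
        return online.get(0);
    }

    public static void main(String[] args) {
        // The instance recorded as the meta location, which later expired.
        Server staleSlave2 = new Server("hadoop-slave2.hadoop", 16020, 1510341988652L);
        // The instances that registered with the master afterwards.
        Server newSlave1 = new Server("hadoop-slave1.hadoop", 16020, 1510342016106L);
        Server newSlave2 = new Server("hadoop-slave2.hadoop", 16020, 1510342023184L);
        List<Server> online = Arrays.asList(newSlave1, newSlave2);

        System.out.println("restarted: " + newSlave2.isRestartOf(staleSlave2));
        System.out.println("assign hbase:meta to: " + recalibrate(staleSlave2, online));
    }
}
```

The point of the sketch is only that the assign compares the full server 
identity, start code included, against the set of live servers before it 
keeps waiting on the old target.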


> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>Reporter: Yi Liang
>Assignee: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering server recovery; existingServer 
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Master doesn't enable ServerShutdownHandler during 
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on hadoop-slave2.hadoop,16020,1510342023184
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
> at 
> 

[jira] [Updated] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-06 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19287:
-
Attachment: p1.patch

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>Reporter: Yi Liang
>Assignee: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering server recovery; existingServer 
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Master doesn't enable ServerShutdownHandler during 
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> client.RpcRetryingCallerImpl: started=38594 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on hadoop-slave2.hadoop,16020,1510342023184
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258)
>  row 'hbase:namespace' on table 'hbase:meta' 

[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-12-01 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275129#comment-16275129
 ] 

Yi Liang commented on HBASE-19287:
--

[~stack] I spent some time digging into the code. The AssignProcedure workflow 
is as follows:
{quote}
1. The master sends an assign request to the target RegionServer. The active 
AssignProcedure is then removed from the Procedure Scheduler (a queue that 
stores all active procedures) and suspended.
2. Once the target server receives the request and opens the region, it sends 
a response to the master.
3. Once the master receives the response, it wakes the procedure and puts the 
AssignProcedure back into the Procedure Scheduler. Worker threads in the 
ProcedureExecutor then poll this AssignProcedure and run the remaining steps.
{quote}
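
To make the three steps above concrete, here is a minimal Java sketch of the suspend/wake cycle. It is a toy model and not HBase code: ToyScheduler and all of its methods are hypothetical names invented for illustration, and the real Procedure Scheduler is considerably more involved. The sketch only shows that a procedure parked in step 1 becomes runnable again solely through the response callback.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical toy model; none of these names are HBase classes.
class ToyScheduler {
    // Run queue that worker threads poll (step 3).
    private final Queue<Long> runQueue = new ArrayDeque<>();
    // Procedures parked while waiting for a region server response (step 1).
    private final Map<Long, String> suspended = new HashMap<>();

    void submit(long pid) {
        runQueue.add(pid);
    }

    // Step 1: the request has been sent, so take the procedure off the
    // run queue and park it until the target server answers.
    void dispatchAndSuspend(long pid, String targetServer) {
        runQueue.remove(pid);
        suspended.put(pid, targetServer);
    }

    // Steps 2-3: the target server reported back, so wake the procedure.
    // Returns false if the pid was not suspended.
    boolean onServerResponse(long pid) {
        if (suspended.remove(pid) == null) {
            return false;
        }
        runQueue.add(pid);
        return true;
    }

    // Workers call this to fetch the next runnable procedure, or null.
    Long poll() {
        return runQueue.poll();
    }

    boolean isSuspended(long pid) {
        return suspended.containsKey(pid);
    }
}
```

In this model, if onServerResponse is never called for a pid, poll() never returns it again, which corresponds to the hang described in this issue.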

The problem happens in step 3: if the master does not receive a response from 
the target server for any reason, the AssignProcedure becomes a dead 
procedure, and no other mechanism will wake it (i.e. put it back into the 
Procedure Scheduler). (I do not know why we need to remove the procedure from 
the Procedure Scheduler in step 1; maybe we could just mark it as suspended 
and yield it?)

The suspended procedure is only woken by the response from the target server. 
ServerCrashProcedure (SCP) may wake it, but if the target server has not 
crashed and the master simply cannot receive the response for another reason, 
such as a network issue, the problem still happens. SCP also does not work if 
the master is not up yet.

So this is a general problem, not only for meta but also for normal regions, 
and we need a way to wake those suspended procedures.

My suggestion is to have a separate thread that periodically checks all 
suspended procedures; if one has timed out or its target server has crashed, 
we reassign the region.

(1) A crashed target server will only leave meta's assign suspended, since the 
master is not up yet; other regions can be woken by ServerCrashProcedure. 
(2) A timeout mechanism for all suspended procedures: if a procedure has been 
suspended for too long, we mark it as timed out and redo the remaining steps. 

We can do (1) first. For (2), since we do not have a timeout for procedures 
yet, I am not sure how to fix it properly. 
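
The periodic check suggested above could look roughly like the following Java sketch. This is only an illustration under assumptions: SuspendedProcedure, SuspendWatchdog, findStale, and the timeout value are hypothetical names and parameters, not existing HBase APIs, and the sketch leaves open how the master would actually track suspended procedures or live servers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical record of a suspended procedure; not an HBase class.
class SuspendedProcedure {
    final long pid;
    final String targetServer;
    final long suspendedAtMillis;

    SuspendedProcedure(long pid, String targetServer, long suspendedAtMillis) {
        this.pid = pid;
        this.targetServer = targetServer;
        this.suspendedAtMillis = suspendedAtMillis;
    }
}

// Hypothetical checker that a separate periodic thread could call.
class SuspendWatchdog {
    private final long timeoutMillis;

    SuspendWatchdog(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Returns the pids that should be woken so the assign can be redone.
    List<Long> findStale(List<SuspendedProcedure> suspended,
                         Set<String> liveServers,
                         long nowMillis) {
        List<Long> stale = new ArrayList<>();
        for (SuspendedProcedure p : suspended) {
            // Case (1): the target server is no longer alive.
            boolean serverGone = !liveServers.contains(p.targetServer);
            // Case (2): the procedure has been suspended for too long.
            boolean timedOut = nowMillis - p.suspendedAtMillis >= timeoutMillis;
            if (serverGone || timedOut) {
                stale.add(p.pid);
            }
        }
        return stale;
    }
}
```

A thread could call findStale at a fixed interval (for example through a ScheduledExecutorService) and hand the returned pids back to the scheduler. Keeping the check a pure function of its inputs makes the timeout and crash rules easy to test separately from the threading.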




[jira] [Assigned] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-11-28 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang reassigned HBASE-19287:


Assignee: Yi Liang


[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-11-28 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269587#comment-16269587
 ] 

Yi Liang commented on HBASE-19287:
--

I will try the second one. I need to dig into the code to see how to implement it. 


[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-11-21 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261290#comment-16261290
 ] 

Yi Liang commented on HBASE-19287:
--

{code}
(1) Should the assign of hbase:meta be synchronous so it can timeout/verify the 
hbase:meta assign, the important needed to get us up off the ground?
(2) Or at least, if we get a crash for the server we are currently trying to 
assign hbase:meta too during startup, we should notice and recalibrate the 
assign?
{code}

I think both approaches are good. With the first one, though, it is hard to 
define the timeout, since it depends on how large the HBase cluster is. The 
second one can recalibrate the assign immediately after it detects that the 
target server is down, without waiting for a timeout, so HMaster can start 
faster. I prefer to try the second one first. What do you think, [~stack]?
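
As a rough illustration of the second approach, the Java sketch below reacts to a server-down event instead of waiting for a timeout. It is a sketch under assumptions only: CrashAwareAssigner, recordDispatch, and onServerExpired are hypothetical names, not HBase APIs, and how the master learns that a server is down during startup is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of the HBase API.
class CrashAwareAssigner {
    // pid -> server the assign request was dispatched to.
    private final Map<Long, String> pendingByPid = new HashMap<>();

    void recordDispatch(long pid, String targetServer) {
        pendingByPid.put(pid, targetServer);
    }

    // Called when a server is detected as down. Removes every pending
    // assign that targeted it and returns those pids so they can be
    // re-planned onto another server right away.
    List<Long> onServerExpired(String deadServer) {
        List<Long> toReassign = new ArrayList<>();
        pendingByPid.entrySet().removeIf(e -> {
            if (e.getValue().equals(deadServer)) {
                toReassign.add(e.getKey());
                return true;
            }
            return false;
        });
        return toReassign;
    }
}
```

Because the reaction is event-driven, no timeout value has to be chosen for the cluster size. The cost is that a lost response from a server that is still alive is not covered.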


[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-11-20 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260014#comment-16260014
 ] 

Yi Liang commented on HBASE-19287:
--

Workers are stuck assigning hbase:meta, and there seems to be no timeout 
mechanism for procedures. Still digging into the code.


[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-11-20 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260012#comment-16260012
 ] 

Yi Liang commented on HBASE-19287:
--

{code} 
2017-11-20 23:05:24,829 INFO  [ProcExecWrkr-2] 
client.AsyncRequestFutureImpl: #1, waiting for 1  actions to finish on table: 
hbase:meta
2017-11-20 23:05:28,570 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor: Worker stuck ProcExecWrkr-2(pid=81) run time 
13.8040sec
2017-11-20 23:05:33,571 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor: Worker stuck ProcExecWrkr-2(pid=81) run time 
18.8050sec
2017-11-20 23:05:34,836 INFO  [ProcExecWrkr-2] 
client.AsyncRequestFutureImpl: #1, waiting for 1  actions to finish on table: 
hbase:meta
2017-11-20 23:05:38,572 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor: Worker stuck ProcExecWrkr-2(pid=81) run time 
23.8060sec
{code}


[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-11-16 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255793#comment-16255793
 ] 

Yi Liang commented on HBASE-19287:
--

[~uagashe][~stack] Any ideas about this problem?

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>Reporter: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering server recovery; existingServer 
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Master doesn't enable ServerShutdownHandler during 
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on hadoop-slave2.hadoop,16020,1510342023184
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258)
>  row 

[jira] [Comment Edited] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-11-16 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255788#comment-16255788
 ] 

Yi Liang edited comment on HBASE-19287 at 11/16/17 7:02 PM:


This happens when I restart the cluster; I have seen this error many times.

RecoverMetaProcedure has a step that sends an AssignMetaRegion request to a 
target server. The problem occurs when the request is sent successfully but 
the target server then goes down. 
{code}
try {
  final ExecuteProceduresResponse response =
      sendRequest(getServerName(), request.build());
  remoteCallCompleted(env, response);
} catch (IOException e) {
  e = unwrapException(e);
  // TODO: In the future some operation may want to bail out early.
  // TODO: How many times should we retry (use numberOfAttemptsSoFar)
  if (!scheduleForRetry(e)) {
    remoteCallFailed(env, e);
  }
}
{code}

So the code above throws no exception when it sends the assign region request 
to the target server. 

However, there seems to be no timeout event that would retry the 
AssignProcedure or RecoverMetaProcedure, so the master hangs forever. 
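
To make the missing piece concrete, here is a minimal Java sketch of the kind of timeout-and-retry guard that seems to be absent. It is not HBase code: DispatchWithTimeout, callWithRetry, and the timeout and attempt values are hypothetical names picked for illustration, and the CompletableFuture only stands in for a remote call whose response never arrives once the target server dies.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch, not HBase code: bound each remote attempt with a
// timeout so that a response that never arrives leads to a retry, not a hang.
final class DispatchWithTimeout {

  // Starts up to maxAttempts remote calls, waiting at most timeoutMs for each.
  static <T> T callWithRetry(Supplier<CompletableFuture<T>> attempt,
                             long timeoutMs, int maxAttempts)
      throws InterruptedException {
    Exception last = null;
    for (int i = 1; i <= maxAttempts; i++) {
      CompletableFuture<T> pending = attempt.get();
      try {
        return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
      } catch (TimeoutException | ExecutionException e) {
        // No answer in time (or the call failed): drop it and try again.
        pending.cancel(true);
        last = e;
      }
    }
    throw new IllegalStateException(
        "no response after " + maxAttempts + " attempts", last);
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger attempts = new AtomicInteger();
    // First attempt: the request is accepted but never answered.
    // Second attempt: a healthy server replies.
    Supplier<CompletableFuture<String>> attempt = () -> {
      if (attempts.incrementAndGet() == 1) {
        return new CompletableFuture<>();
      }
      return CompletableFuture.completedFuture("OPENED");
    };
    System.out.println(callWithRetry(attempt, 100, 3));
  }
}
```

In this sketch the first attempt times out after 100 ms and the second one returns, so the caller makes progress instead of waiting forever. A real fix would presumably hook into the procedure framework's own timeout handling; that part is an assumption and is not shown here.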

The log also contains the messages below; the stale server is the target 
server of the RPC request above.
{quote}
RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
master.ServerManager: Triggering server recovery; existingServer 
hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
server:hadoop-slave2.hadoop,16020,1510342023184
2017-11-10 19:27:05,832 INFO  
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
master.ServerManager: Master doesn't enable ServerShutdownHandler during 
initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
{quote}


was (Author: easyliangjob):
This happens when I restart the cluster; I have seen this error many times.

RecoverMetaProcedure has a step that sends an AssignMetaRegion request to a 
target server. The problem occurs when the request is sent successfully but 
the target server then goes down. 
{code}
try {
  final ExecuteProceduresResponse response =
      sendRequest(getServerName(), request.build());
  remoteCallCompleted(env, response);
} catch (IOException e) {
  e = unwrapException(e);
  // TODO: In the future some operation may want to bail out early.
  // TODO: How many times should we retry (use numberOfAttemptsSoFar)
  if (!scheduleForRetry(e)) {
    remoteCallFailed(env, e);
  }
}
{code}

So the code above throws no exception when it sends the assign region request 
to the target server. 

However, there seems to be no timeout event that would retry the 
AssignProcedure or RecoverMetaProcedure, so the master hangs forever. 

The log also contains the messages below; the stale server is the target 
server of the RPC request above.
{quote}
RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
master.ServerManager: Triggering server recovery; existingServer 
hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
server:hadoop-slave2.hadoop,16020,1510342023184
2017-11-10 19:27:05,832 INFO  
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
master.ServerManager: Master doesn't enable ServerShutdownHandler during 
initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
{quote}

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>Reporter: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 

[jira] [Commented] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-11-16 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255788#comment-16255788
 ] 

Yi Liang commented on HBASE-19287:
--

This happens when I restart the cluster; I have seen this error many times.

RecoverMetaProcedure has a step that sends an AssignMetaRegion request to a 
target server. The problem occurs when the request is sent successfully but 
the target server then goes down. 
{code}
try {
  final ExecuteProceduresResponse response =
      sendRequest(getServerName(), request.build());
  remoteCallCompleted(env, response);
} catch (IOException e) {
  e = unwrapException(e);
  // TODO: In the future some operation may want to bail out early.
  // TODO: How many times should we retry (use numberOfAttemptsSoFar)
  if (!scheduleForRetry(e)) {
    remoteCallFailed(env, e);
  }
}
{code}

So the code above throws no exception when it sends the assign region request 
to the target server. 

However, there seems to be no timeout event that would retry the 
AssignProcedure or RecoverMetaProcedure, so the master hangs forever. 

The log also contains the messages below; the stale server is the target 
server of the RPC request above.
{quote}
RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
master.ServerManager: Triggering server recovery; existingServer 
hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
server:hadoop-slave2.hadoop,16020,1510342023184
2017-11-10 19:27:05,832 INFO  
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
master.ServerManager: Master doesn't enable ServerShutdownHandler during 
initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
{quote}

> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
>  Issue Type: Bug
>Reporter: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering server recovery; existingServer 
> hadoop-slave2.hadoop,16020,1510341988652 

[jira] [Created] (HBASE-19287) master hangs forever if RecoverMeta send assign meta region request to target server fail

2017-11-16 Thread Yi Liang (JIRA)
Yi Liang created HBASE-19287:


 Summary: master hangs forever if RecoverMeta send assign meta 
region request to target server fail
 Key: HBASE-19287
 URL: https://issues.apache.org/jira/browse/HBASE-19287
 Project: HBase
  Issue Type: Bug
Reporter: Yi Liang


2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] procedure.RecoverMetaProcedure: 
pid=138, state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
failedMetaServer=null, splitWal=true; Retaining meta assignment to 
server=hadoop-slave1.hadoop,16020,1510341981454
2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=139, ppid=138, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
procedure.MasterProcedureScheduler: pid=139, ppid=138, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
hbase:meta,,1.1588230740
2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
AssignProcedure table=hbase:meta, region=1588230740, 
target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
retain=false
2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
Setting hbase:meta (replicaId=0) location in ZooKeeper as 
hadoop-slave2.hadoop,16020,1510341988652
2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
2017-11-10 19:26:57,542 INFO  [main-EventThread] zookeeper.RegionServerTracker: 
RegionServer ephemeral node deleted, processing expiration 
[hadoop-slave2.hadoop,16020,1510341988652]
2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
doesn't enable ServerShutdownHandler during initialization, delay expiring 
server hadoop-slave2.hadoop,16020,1510341988652
2017-11-10 19:26:58,875 INFO  
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
master.ServerManager: Registering 
server=hadoop-slave1.hadoop,16020,1510342016106
2017-11-10 19:27:05,832 INFO  
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
master.ServerManager: Registering 
server=hadoop-slave2.hadoop,16020,1510342023184
2017-11-10 19:27:05,832 INFO  
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
master.ServerManager: Triggering server recovery; existingServer 
hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
server:hadoop-slave2.hadoop,16020,1510342023184
2017-11-10 19:27:05,832 INFO  
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
master.ServerManager: Master doesn't enable ServerShutdownHandler during 
initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
2017-11-10 19:27:49,815 INFO  
[RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, 
msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
online on hadoop-slave2.hadoop,16020,1510342023184
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258)
 row 'hbase:namespace' on table 'hbase:meta' at 
region=hbase:meta,,1.1588230740, 
hostname=hadoop-slave2.hadoop,16020,1510341988652, seqNum=0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails

2017-11-13 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19237:
-
Attachment: HBASE-19237-master-v1.patch

Hi Ted, the failed tests pass locally, but I will rerun the unit tests. 

> TestMaster.testMasterOpsWhileSplitting fails
> 
>
> Key: HBASE-19237
> URL: https://issues.apache.org/jira/browse/HBASE-19237
> Project: HBase
>  Issue Type: Test
>Reporter: Ted Yu
> Attachments: HBASE-19237-master-v1.patch, HBASE-19237-master-v1.patch
>
>
> This is the top flaky test:
> {code}
> java.lang.AssertionError: expected:<3> but was:<1>
>   at 
> org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121)
> {code}
> After brief check, the test failure seems to be introduced by HBASE-19127





[jira] [Updated] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails

2017-11-12 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19237:
-
Attachment: HBASE-19237-master-v1.patch

> TestMaster.testMasterOpsWhileSplitting fails
> 
>
> Key: HBASE-19237
> URL: https://issues.apache.org/jira/browse/HBASE-19237
> Project: HBase
>  Issue Type: Test
>Reporter: Ted Yu
> Attachments: HBASE-19237-master-v1.patch
>
>
> This is the top flaky test:
> {code}
> java.lang.AssertionError: expected:<3> but was:<1>
>   at 
> org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121)
> {code}
> After brief check, the test failure seems to be introduced by HBASE-19127





[jira] [Updated] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails

2017-11-12 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19237:
-
Attachment: (was: HBASE-19237-master-v1.patch)

> TestMaster.testMasterOpsWhileSplitting fails
> 
>
> Key: HBASE-19237
> URL: https://issues.apache.org/jira/browse/HBASE-19237
> Project: HBase
>  Issue Type: Test
>Reporter: Ted Yu
> Attachments: HBASE-19237-master-v1.patch
>
>
> This is the top flaky test:
> {code}
> java.lang.AssertionError: expected:<3> but was:<1>
>   at 
> org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121)
> {code}
> After brief check, the test failure seems to be introduced by HBASE-19127





[jira] [Updated] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails

2017-11-10 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19237:
-
Attachment: HBASE-19237-master-v1.patch

> TestMaster.testMasterOpsWhileSplitting fails
> 
>
> Key: HBASE-19237
> URL: https://issues.apache.org/jira/browse/HBASE-19237
> Project: HBase
>  Issue Type: Test
>Reporter: Ted Yu
> Attachments: 19237.v1.txt, HBASE-19237-master-v1.patch
>
>
> This is the top flaky test:
> {code}
> java.lang.AssertionError: expected:<3> but was:<1>
>   at 
> org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121)
> {code}
> After brief check, the test failure seems to be introduced by HBASE-19127





[jira] [Commented] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails

2017-11-10 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248117#comment-16248117
 ] 

Yi Liang commented on HBASE-19237:
--

The test above fails because we cannot use the number of region state nodes, 
i.e. regionStates.getRegionsOfTable(TABLENAME).size(), to decide whether the 
split is complete.

We need to check hbase:meta to know whether the split has completed. 
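
As a rough illustration of that idea, the sketch below polls a region count until it reaches the expected value or a timeout expires. It is plain Java and not the actual TestMaster change: WaitForSplit, waitForRegionCount, and the countFromMeta supplier are hypothetical names, and the supplier is only a stand-in for whatever lookup against hbase:meta the real test would perform.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical sketch, not the actual TestMaster change.
final class WaitForSplit {

  // Polls countFromMeta until it returns expected or timeoutMs elapses.
  static boolean waitForRegionCount(IntSupplier countFromMeta, int expected,
                                    long timeoutMs, long intervalMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      if (countFromMeta.getAsInt() == expected) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Stand-in for a meta lookup: reports 1 region until the third poll.
    AtomicInteger polls = new AtomicInteger();
    IntSupplier fakeMeta = () -> polls.incrementAndGet() < 3 ? 1 : 3;
    System.out.println(waitForRegionCount(fakeMeta, 3, 1000, 10));
  }
}
```

The test would then assert on the returned boolean, so a slow split makes it wait rather than fail on a stale in-memory count.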

> TestMaster.testMasterOpsWhileSplitting fails
> 
>
> Key: HBASE-19237
> URL: https://issues.apache.org/jira/browse/HBASE-19237
> Project: HBase
>  Issue Type: Test
>Reporter: Ted Yu
> Attachments: 19237.v1.txt
>
>
> This is the top flaky test:
> {code}
> java.lang.AssertionError: expected:<3> but was:<1>
>   at 
> org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121)
> {code}
> After brief check, the test failure seems to be introduced by HBASE-19127





[jira] [Commented] (HBASE-19237) TestMaster.testMasterOpsWhileSplitting fails

2017-11-10 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248077#comment-16248077
 ] 

Yi Liang commented on HBASE-19237:
--

ok, I will check the code.

> TestMaster.testMasterOpsWhileSplitting fails
> 
>
> Key: HBASE-19237
> URL: https://issues.apache.org/jira/browse/HBASE-19237
> Project: HBase
>  Issue Type: Test
>Reporter: Ted Yu
> Attachments: 19237.v1.txt
>
>
> This is the top flaky test:
> {code}
> java.lang.AssertionError: expected:<3> but was:<1>
>   at 
> org.apache.hadoop.hbase.master.TestMaster.testMasterOpsWhileSplitting(TestMaster.java:121)
> {code}
> After brief check, the test failure seems to be introduced by HBASE-19127





[jira] [Commented] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-11-09 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16246284#comment-16246284
 ] 

Yi Liang commented on HBASE-19127:
--

[~stack], it is good to commit. Thanks.

> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in 
> RegionStatesNode
> -
>
> Key: HBASE-19127
> URL: https://issues.apache.org/jira/browse/HBASE-19127
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Yi Liang
>Assignee: Yi Liang
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-19126-v1-master.patch, region_state.patch, 
> state.patch
>
>
> In current code, we did not set above states to a region node at all, but we 
> still have statements like below to check if node have above states.
> {code}
> else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {
> 
> }
> {code}
> We need to set above states in a correct place.





[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-11-08 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19127:
-
Attachment: (was: HBASE-19126-v1-master.patch)

> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in 
> RegionStatesNode
> -
>
> Key: HBASE-19127
> URL: https://issues.apache.org/jira/browse/HBASE-19127
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Yi Liang
>Assignee: Yi Liang
> Attachments: HBASE-19126-v1-master.patch, region_state.patch, 
> state.patch
>
>
> In current code, we did not set above states to a region node at all, but we 
> still have statements like below to check if node have above states.
> {code}
> else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {
> 
> }
> {code}
> We need to set above states in a correct place.





[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-11-08 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19127:
-
Attachment: HBASE-19126-v1-master.patch

> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in 
> RegionStatesNode
> -
>
> Key: HBASE-19127
> URL: https://issues.apache.org/jira/browse/HBASE-19127
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Yi Liang
>Assignee: Yi Liang
> Attachments: HBASE-19126-v1-master.patch, region_state.patch, 
> state.patch
>
>
> In current code, we did not set above states to a region node at all, but we 
> still have statements like below to check if node have above states.
> {code}
> else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {
> 
> }
> {code}
> We need to set above states in a correct place.





[jira] [Commented] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-11-08 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244392#comment-16244392
 ] 

Yi Liang commented on HBASE-19127:
--

Unit tests passed. I will format the patch. 

> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in 
> RegionStatesNode
> -
>
> Key: HBASE-19127
> URL: https://issues.apache.org/jira/browse/HBASE-19127
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Yi Liang
>Assignee: Yi Liang
> Attachments: region_state.patch, state.patch
>
>
> In current code, we did not set above states to a region node at all, but we 
> still have statements like below to check if node have above states.
> {code}
> else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {
> 
> }
> {code}
> We need to set above states in a correct place.





[jira] [Commented] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-11-07 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242946#comment-16242946
 ] 

Yi Liang commented on HBASE-19127:
--

[~stack]
{quote}What does the change in AM do? (Adding state for daughters)?{quote}
Yes, it adds state for the daughters.

{quote}Don't change numbering in protobufs.{quote}
I removed MERGE_TABLE_REGIONS_MOVE_REGION_TO_SAME_RS = 3 from the proto 
because it has never been used in the current code. Do you think this will 
cause compatibility issues? If you think it is better to keep it, I will. 
Otherwise, I think it is safe to remove, since MergeTableRegionsProcedure is 
new in hbase-2.0.





> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in 
> RegionStatesNode
> -
>
> Key: HBASE-19127
> URL: https://issues.apache.org/jira/browse/HBASE-19127
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Yi Liang
>Assignee: Yi Liang
> Attachments: state.patch
>
>
> In current code, we did not set above states to a region node at all, but we 
> still have statements like below to check if node have above states.
> {code}
> else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {
> 
> }
> {code}
> We need to set above states in a correct place.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-11-07 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19127:
-
Attachment: region_state.patch

Try unit tests

> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in 
> RegionStatesNode
> -
>
> Key: HBASE-19127
> URL: https://issues.apache.org/jira/browse/HBASE-19127
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Yi Liang
>Assignee: Yi Liang
> Attachments: region_state.patch, state.patch
>
>
> In current code, we did not set above states to a region node at all, but we 
> still have statements like below to check if node have above states.
> {code}
> else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {
> 
> }
> {code}
> We need to set above states in a correct place.





[jira] [Commented] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-11-01 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234479#comment-16234479
 ] 

Yi Liang commented on HBASE-19127:
--

All the changes are in code that is new in hbase-2.0, so it won't be a problem
for procedures.

> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in 
> RegionStatesNode
> -
>
> Key: HBASE-19127
> URL: https://issues.apache.org/jira/browse/HBASE-19127
> Project: HBase
>  Issue Type: Improvement
>Reporter: Yi Liang
>Assignee: Yi Liang
>Priority: Major
> Attachments: state.patch
>
>
> In current code, we did not set above states to a region node at all, but we 
> still have statements like below to check if node have above states.
> {code}
> else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {
> 
> }
> {code}
> We need to set above states in a correct place.





[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-11-01 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19127:
-
Status: Patch Available  (was: Open)

> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in 
> RegionStatesNode
> -
>
> Key: HBASE-19127
> URL: https://issues.apache.org/jira/browse/HBASE-19127
> Project: HBase
>  Issue Type: Improvement
>Reporter: Yi Liang
>Assignee: Yi Liang
>Priority: Major
> Attachments: state.patch
>
>
> In current code, we did not set above states to a region node at all, but we 
> still have statements like below to check if node have above states.
> {code}
> else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {
> 
> }
> {code}
> We need to set above states in a correct place.





[jira] [Commented] (HBASE-19126) [AMv2] RegionStates/RegionStateNode needs cleanup

2017-11-01 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234414#comment-16234414
 ] 

Yi Liang commented on HBASE-19126:
--

Hi [~mdrob]
   Yes, I am working on this one. I have recently been reading the code in this
area and found some problems; I will open sub-jiras to fix them. If you are
interested, you can also work on this one. ;)

> [AMv2] RegionStates/RegionStateNode needs cleanup
> -
>
> Key: HBASE-19126
> URL: https://issues.apache.org/jira/browse/HBASE-19126
> Project: HBase
>  Issue Type: Improvement
>Reporter: Yi Liang
>Priority: Major
> Fix For: 2.0.0-beta-1
>
>
>   // Mutable/Immutable? Changes have to be synchronized or not?
>   // Data members are volatile which seems to say multi-threaded access is 
> fine.
>   // In the below we do check and set but the check state could change before
>   // we do the set because no synchronization... which seems dodgy. Clear up
>   // understanding here... how many threads accessing? Do locks make it so one
>   // thread at a time working on a single Region's RegionStateNode? Lets 
> presume
>   // so for now. Odd is that elsewhere in this RegionStates, we synchronize on
>   // the RegionStateNode instance
> Copied from TODO in RegionState.java
> Open this jira to track some cleanups for RegionStates/RegionStateNode





[jira] [Updated] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-10-30 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19127:
-
Attachment: state.patch

Try unit tests.
[~stack] [~jerryhe], I found some issues with region states, especially with
the intermediate states.

In the patch, I removed MERGE_TABLE_REGIONS_MOVE_REGION_TO_SAME_RS, which has
never been used.

I also removed MERGE_TABLE_REGIONS_SET_MERGING_TABLE_STATE. The method called
under this step is commented out, so nothing is done in this step, and I think
we do not need a specific step for it.

After reading RegionState.java, I set the above states according to the rules
below:

1. SPLITTING => After checking that the parent region is splittable, set it on
the parent region.
2. SPLITTING_NEW => Set it after creating the daughter regions and before
assigning these daughters as OPEN in their region states.
3. MERGING => After checking that the 2 parent regions are mergeable, set it on
both parent regions.
4. MERGING_NEW => Set it after creating the merged region and before assigning
it as OPEN.

The above states won't affect the real procedure work; we set them because the
metrics use them, and also because RegionStates needs to keep the
latest/correct state.
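
The four rules above can be sketched roughly as follows. This is only an
illustration, not the code in the patch: RegionStateRulesSketch, Node, split,
merge and markOpen are hypothetical names, and "splittable"/"mergeable" is
reduced here to the assumption that the parent is currently OPEN.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for RegionStateNode and its State enum. */
class RegionStateRulesSketch {
  enum State { OPEN, SPLITTING, SPLITTING_NEW, MERGING, MERGING_NEW }

  static final class Node {
    final String name;
    private State state;

    Node(String name, State state) {
      this.name = name;
      this.state = state;
    }

    synchronized State getState() { return state; }
    synchronized void setState(State newState) { state = newState; }
  }

  /** Rules 1 and 2: parent becomes SPLITTING, daughters start as SPLITTING_NEW. */
  static List<Node> split(Node parent) {
    // Assumption: "splittable" is modelled as "currently OPEN".
    if (parent.getState() != State.OPEN) {
      throw new IllegalStateException(parent.name + " is not splittable");
    }
    parent.setState(State.SPLITTING);                                  // rule 1
    List<Node> daughters = new ArrayList<>();
    daughters.add(new Node(parent.name + "-a", State.SPLITTING_NEW));  // rule 2
    daughters.add(new Node(parent.name + "-b", State.SPLITTING_NEW));
    return daughters;
  }

  /** Rules 3 and 4: both parents become MERGING, the merged region MERGING_NEW. */
  static Node merge(Node first, Node second) {
    // Assumption: "mergeable" is modelled as "both parents currently OPEN".
    if (first.getState() != State.OPEN || second.getState() != State.OPEN) {
      throw new IllegalStateException("parents are not mergeable");
    }
    first.setState(State.MERGING);                                     // rule 3
    second.setState(State.MERGING);
    return new Node(first.name + "+" + second.name, State.MERGING_NEW); // rule 4
  }

  /** The new region leaves its *_NEW state only when it is assigned as OPEN. */
  static void markOpen(Node node) {
    node.setState(State.OPEN);
  }
}
```

In this sketch the intermediate states are pure bookkeeping, which matches the
point that they do not drive the procedure itself.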

> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in 
> RegionStatesNode
> -
>
> Key: HBASE-19127
> URL: https://issues.apache.org/jira/browse/HBASE-19127
> Project: HBase
>  Issue Type: Improvement
>Reporter: Yi Liang
>Assignee: Yi Liang
> Attachments: state.patch
>
>
> In current code, we did not set above states to a region node at all, but we 
> still have statements like below to check if node have above states.
> {code}
> else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {
> 
> }
> {code}
> We need to set above states in a correct place.





[jira] [Assigned] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-10-30 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang reassigned HBASE-19127:


Assignee: Yi Liang

> Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in 
> RegionStatesNode
> -
>
> Key: HBASE-19127
> URL: https://issues.apache.org/jira/browse/HBASE-19127
> Project: HBase
>  Issue Type: Improvement
>Reporter: Yi Liang
>Assignee: Yi Liang
>
> In current code, we did not set above states to a region node at all, but we 
> still have statements like below to check if node have above states.
> {code}
> else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {
> 
> }
> {code}
> We need to set above states in a correct place.





[jira] [Created] (HBASE-19127) Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW properly in RegionStatesNode

2017-10-30 Thread Yi Liang (JIRA)
Yi Liang created HBASE-19127:


 Summary: Set State.SPLITTING, MERGING, MERGING_NEW, SPLITTING_NEW 
properly in RegionStatesNode
 Key: HBASE-19127
 URL: https://issues.apache.org/jira/browse/HBASE-19127
 Project: HBase
  Issue Type: Improvement
Reporter: Yi Liang


In the current code, we never set the above states on a region node, but we
still have statements like the one below that check whether a node is in those
states.
{code}
else if (!regionNode.isInState(State.CLOSING, State.SPLITTING)) {

}
{code}

We need to set the above states in the correct place.





[jira] [Updated] (HBASE-19126) [AMv2] RegionStates/RegionStateNode needs cleanup

2017-10-30 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19126:
-
Description: 
  // Mutable/Immutable? Changes have to be synchronized or not?
  // Data members are volatile which seems to say multi-threaded access is fine.
  // In the below we do check and set but the check state could change before
  // we do the set because no synchronization... which seems dodgy. Clear up
  // understanding here... how many threads accessing? Do locks make it so one
  // thread at a time working on a single Region's RegionStateNode? Lets presume
  // so for now. Odd is that elsewhere in this RegionStates, we synchronize on
  // the RegionStateNode instance
Copied from TODO in RegionState.java

Open this jira to track some cleanups for RegionStates/RegionStateNode

  was:
  // Mutable/Immutable? Changes have to be synchronized or not?
  // Data members are volatile which seems to say multi-threaded access is fine.
  // In the below we do check and set but the check state could change before
  // we do the set because no synchronization... which seems dodgy. Clear up
  // understanding here... how many threads accessing? Do locks make it so one
  // thread at a time working on a single Region's RegionStateNode? Lets presume
  // so for now. Odd is that elsewhere in this RegionStates, we synchronize on
  // the RegionStateNode instance

Open this jira to track some cleanups for RegionStates/RegionStateNode


> [AMv2] RegionStates/RegionStateNode needs cleanup
> -
>
> Key: HBASE-19126
> URL: https://issues.apache.org/jira/browse/HBASE-19126
> Project: HBase
>  Issue Type: Improvement
>Reporter: Yi Liang
> Fix For: 2.0.0-beta-1
>
>
>   // Mutable/Immutable? Changes have to be synchronized or not?
>   // Data members are volatile which seems to say multi-threaded access is 
> fine.
>   // In the below we do check and set but the check state could change before
>   // we do the set because no synchronization... which seems dodgy. Clear up
>   // understanding here... how many threads accessing? Do locks make it so one
>   // thread at a time working on a single Region's RegionStateNode? Lets 
> presume
>   // so for now. Odd is that elsewhere in this RegionStates, we synchronize on
>   // the RegionStateNode instance
> Copied from TODO in RegionState.java
> Open this jira to track some cleanups for RegionStates/RegionStateNode





[jira] [Updated] (HBASE-19126) [AMv2] RegionStates/RegionStateNode needs cleanup

2017-10-30 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-19126:
-
Summary: [AMv2] RegionStates/RegionStateNode needs cleanup  (was: [AMv2] 
RegionStates/RegionStateNode cleanup)

> [AMv2] RegionStates/RegionStateNode needs cleanup
> -
>
> Key: HBASE-19126
> URL: https://issues.apache.org/jira/browse/HBASE-19126
> Project: HBase
>  Issue Type: Improvement
>Reporter: Yi Liang
> Fix For: 2.0.0-beta-1
>
>
>   // Mutable/Immutable? Changes have to be synchronized or not?
>   // Data members are volatile which seems to say multi-threaded access is 
> fine.
>   // In the below we do check and set but the check state could change before
>   // we do the set because no synchronization... which seems dodgy. Clear up
>   // understanding here... how many threads accessing? Do locks make it so one
>   // thread at a time working on a single Region's RegionStateNode? Lets 
> presume
>   // so for now. Odd is that elsewhere in this RegionStates, we synchronize on
>   // the RegionStateNode instance
> Open this jira to track some cleanups for RegionStates/RegionStateNode





[jira] [Created] (HBASE-19126) [AMv2] RegionStates/RegionStateNode cleanup

2017-10-30 Thread Yi Liang (JIRA)
Yi Liang created HBASE-19126:


 Summary: [AMv2] RegionStates/RegionStateNode cleanup
 Key: HBASE-19126
 URL: https://issues.apache.org/jira/browse/HBASE-19126
 Project: HBase
  Issue Type: Improvement
Reporter: Yi Liang
 Fix For: 2.0.0-beta-1


  // Mutable/Immutable? Changes have to be synchronized or not?
  // Data members are volatile which seems to say multi-threaded access is fine.
  // In the below we do check and set but the check state could change before
  // we do the set because no synchronization... which seems dodgy. Clear up
  // understanding here... how many threads accessing? Do locks make it so one
  // thread at a time working on a single Region's RegionStateNode? Lets presume
  // so for now. Odd is that elsewhere in this RegionStates, we synchronize on
  // the RegionStateNode instance

Open this jira to track some cleanups for RegionStates/RegionStateNode
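
To make the concern in the TODO concrete, here is a minimal sketch of the
check-and-set problem. It is not taken from RegionStates: CheckAndSetSketch,
Node, transitionRacy and transition are hypothetical names. A volatile field
makes writes visible to other threads, but it does not make a separate check
followed by a set atomic.

```java
/** Hypothetical, simplified node used only to illustrate the race. */
class CheckAndSetSketch {
  enum State { OFFLINE, OPENING, OPEN }

  static final class Node {
    private volatile State state = State.OFFLINE;

    /**
     * Racy variant: another thread may change the state between the check
     * and the set, because volatile only guarantees visibility.
     */
    boolean transitionRacy(State expected, State update) {
      if (state != expected) {
        return false;
      }
      state = update;
      return true;
    }

    /** Safe variant: the check and the set happen under the node's monitor. */
    synchronized boolean transition(State expected, State update) {
      if (state != expected) {
        return false;
      }
      state = update;
      return true;
    }

    State getState() {
      return state;
    }
  }
}
```

The safe variant only helps if every writer goes through it (or otherwise
synchronizes on the same node), which is the consistency question this jira is
meant to track.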





[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-25 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-18984:
-
Resolution: Duplicate
Status: Resolved  (was: Patch Available)

> [AMv2] Retain assignment does not work well in AMv2
> ---
>
> Key: HBASE-18984
> URL: https://issues.apache.org/jira/browse/HBASE-18984
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at 
> 2.24.19 PM.png
>
>
> work on 8.17 Retain assignment in 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k
> To reproduce this error, in hbase shell:
> createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> 
> list_regions 't1' (you may need to try enable/disable multiple times)
> See attached images. same region assigned to different region servers





[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-25 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219168#comment-16219168
 ] 

Yi Liang commented on HBASE-18984:
--

Marking this as a duplicate of HBASE-19017; I will create a new jira to discuss
writing the intermediate states (OPENING, CLOSING, ...) into meta.

> [AMv2] Retain assignment does not work well in AMv2
> ---
>
> Key: HBASE-18984
> URL: https://issues.apache.org/jira/browse/HBASE-18984
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at 
> 2.24.19 PM.png
>
>
> work on 8.17 Retain assignment in 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k
> To reproduce this error, in hbase shell:
> createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> 
> list_regions 't1' (you may need to try enable/disable multiple times)
> See attached images. same region assigned to different region servers





[jira] [Comment Edited] (HBASE-18352) Enable Replica tests that were disabled by Proc-V2 AM in HBASE-14614

2017-10-23 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215502#comment-16215502
 ] 

Yi Liang edited comment on HBASE-18352 at 10/23/17 5:36 PM:


I saw another situation that may cause random assignment: when we restart
hbase and start the master first, and then the region servers.
Once one region server checks in, the master begins region assignment.
It is possible that the assign plan for a region is created before that
region's last region server is up, so the AM randomly chooses a region server
for this region.

If we restart all region servers before the master, we do not see the above
issue.
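
A rough sketch of why the timing matters, under the assumption that retain
assignment falls back to a random online server when the last host has not
checked in yet. RetainAssignmentSketch and pickServer are hypothetical names,
not the balancer's real API.

```java
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of retain-with-random-fallback server selection. */
class RetainAssignmentSketch {
  /**
   * Returns the region's last host if it is already online; otherwise picks
   * a random online server (the fallback behaviour assumed in this sketch).
   */
  static String pickServer(String lastHost, List<String> onlineServers, Random random) {
    if (onlineServers.isEmpty()) {
      throw new IllegalStateException("no region server has checked in yet");
    }
    if (lastHost != null && onlineServers.contains(lastHost)) {
      return lastHost; // assignment retained
    }
    // The last host is not up yet, so the plan cannot retain the assignment.
    return onlineServers.get(random.nextInt(onlineServers.size()));
  }
}
```

With the master started first, the online list can still be missing the
region's last host when the plan is created, so the fallback branch is taken.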


was (Author: easyliangjob):
I saw another situation that may cause random assignment. When we restart 
hbase, if we start master first, and then regionservers.
Once one region server count in, master will start to begin region assignment. 
There is a possibility that the assign plan is created for a region before its 
last region server up, so AM will randomly chose one region servers for this 
region. 

> Enable Replica tests that were disabled by Proc-V2 AM in HBASE-14614
> 
>
> Key: HBASE-18352
> URL: https://issues.apache.org/jira/browse/HBASE-18352
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.0.0-alpha-1
>Reporter: Stephen Yuan Jiang
>Assignee: huaxiang sun
>
> The following replica tests were disabled by Core Proc-V2 AM in HBASE-14614:
> - Disabled parts of...testCreateTableWithMultipleReplicas in 
> TestMasterOperationsForRegionReplicas There is an issue w/ assigning more 
> replicas if number of replicas is changed on us. See '/* DISABLED! FOR 
> NOW'.
> - Disabled testRegionReplicasOnMidClusterHighReplication in 
> TestStochasticLoadBalancer2
> - Disabled testFlushAndCompactionsInPrimary in TestRegionReplicas
> This JIRA tracks the work to enable them (or modify/remove if not applicable).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18352) Enable Replica tests that were disabled by Proc-V2 AM in HBASE-14614

2017-10-23 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215502#comment-16215502
 ] 

Yi Liang commented on HBASE-18352:
--

I saw another situation that may cause random assignment: when we restart
hbase and start the master first, and then the region servers.
Once one region server checks in, the master begins region assignment.
It is possible that the assign plan for a region is created before that
region's last region server is up, so the AM randomly chooses a region server
for this region.

> Enable Replica tests that were disabled by Proc-V2 AM in HBASE-14614
> 
>
> Key: HBASE-18352
> URL: https://issues.apache.org/jira/browse/HBASE-18352
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.0.0-alpha-1
>Reporter: Stephen Yuan Jiang
>Assignee: huaxiang sun
>
> The following replica tests were disabled by Core Proc-V2 AM in HBASE-14614:
> - Disabled parts of...testCreateTableWithMultipleReplicas in 
> TestMasterOperationsForRegionReplicas There is an issue w/ assigning more 
> replicas if number of replicas is changed on us. See '/* DISABLED! FOR 
> NOW'.
> - Disabled testRegionReplicasOnMidClusterHighReplication in 
> TestStochasticLoadBalancer2
> - Disabled testFlushAndCompactionsInPrimary in TestRegionReplicas
> This JIRA tracks the work to enable them (or modify/remove if not applicable).





[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-20 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213443#comment-16213443
 ] 

Yi Liang commented on HBASE-18984:
--

It seems the above two comments do not clearly state why {quote}// 2.
UnAssignProcedure can run first, this region will be assigned as OPEN
finally.{quote} would happen.

For example, suppose the master crashes while unassigning region A, and the RS
hosting region A also crashes. When the master restarts, it may reload region
A's state as OPEN, and since the RS crashed, the master will create a
ServerCrashProcedure for that RS. So there will be both an assign (created by
the SCP) and an unassign (the old procedure) for region A, and it is really
hard to guarantee which one runs first (not so sure).



> [AMv2] Retain assignment does not work well in AMv2
> ---
>
> Key: HBASE-18984
> URL: https://issues.apache.org/jira/browse/HBASE-18984
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at 
> 2.24.19 PM.png
>
>
> work on 8.17 Retain assignment in 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k
> To reproduce this error, in hbase shell:
> createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> 
> list_regions 't1' (you may need to try enable/disable multiple times)
> See attached images. same region assigned to different region servers





[jira] [Issue Comment Deleted] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-20 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-18984:
-
Comment: was deleted

(was: [~ram_krish]
I just fully checked the code, and I was wrong in the above comments: during
bulk region assign, a region in transition (which means an old procedure exists
for this region) is not assigned. So no additional AssignProcedure is created
for a region that has an old procedure, and we can safely remove the step that
writes CLOSING and OPENING into meta.
See loadMeta below, which visits meta to create a RegionStateNode for every
region:
{code}
  private void loadMeta() throws IOException {
    // TODO: use a thread pool
    regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() {
      @Override
      public void visitRegionState(final RegionInfo regionInfo, final State state,
          final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) {
        final RegionStateNode regionNode = regionStates.getOrCreateRegionNode(regionInfo);
        synchronized (regionNode) {
          if (!regionNode.isInTransition()) {   //*here is the condition*
            regionNode.setState(state);
            regionNode.setLastHost(lastHost);
            regionNode.setRegionLocation(regionLocation);
            regionNode.setOpenSeqNum(openSeqNum);

            if (state == State.OPEN) {
              assert regionLocation != null : "found null region location for " + regionNode;
              regionStates.addRegionToServer(regionLocation, regionNode);
            } else if (state == State.OFFLINE || regionInfo.isOffline()) {
              regionStates.addToOfflineRegions(regionNode);
            } else {
              // These regions should have a procedure in replay
              regionStates.addRegionInTransition(regionNode, null);
            }
          }
        }
      }
    });

    // every assignment is blocked until meta is loaded.
    wakeMetaLoadedEvent();
  }
{code})

> [AMv2] Retain assignment does not work well in AMv2
> ---
>
> Key: HBASE-18984
> URL: https://issues.apache.org/jira/browse/HBASE-18984
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at 
> 2.24.19 PM.png
>
>
> work on 8.17 Retain assignment in 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k
> To reproduce this error, in hbase shell:
> createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> 
> list_regions 't1' (you may need to try enable/disable multiple times)
> See attached images. same region assigned to different region servers





[jira] [Comment Edited] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-20 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213161#comment-16213161
 ] 

Yi Liang edited comment on HBASE-18984 at 10/20/17 8:23 PM:


[~ram_krish]
I just fully checked the code, and I was wrong in the above comments: during
bulk region assign, a region in transition (which means an old procedure exists
for this region) is not assigned. So no additional AssignProcedure is created
for a region that has an old procedure, and we can safely remove the step that
writes CLOSING and OPENING into meta.
See loadMeta below, which visits meta to create a RegionStateNode for every
region:
{code}
  private void loadMeta() throws IOException {
    // TODO: use a thread pool
    regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() {
      @Override
      public void visitRegionState(final RegionInfo regionInfo, final State state,
          final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) {
        final RegionStateNode regionNode = regionStates.getOrCreateRegionNode(regionInfo);
        synchronized (regionNode) {
          if (!regionNode.isInTransition()) {   //*here is the condition*
            regionNode.setState(state);
            regionNode.setLastHost(lastHost);
            regionNode.setRegionLocation(regionLocation);
            regionNode.setOpenSeqNum(openSeqNum);

            if (state == State.OPEN) {
              assert regionLocation != null : "found null region location for " + regionNode;
              regionStates.addRegionToServer(regionLocation, regionNode);
            } else if (state == State.OFFLINE || regionInfo.isOffline()) {
              regionStates.addToOfflineRegions(regionNode);
            } else {
              // These regions should have a procedure in replay
              regionStates.addRegionInTransition(regionNode, null);
            }
          }
        }
      }
    });

    // every assignment is blocked until meta is loaded.
    wakeMetaLoadedEvent();
  }
{code}


was (Author: easyliangjob):
[~ram_krish]
I just fully check the code, I am wrong for the above comments, when do bulk 
region assign, the region in transition(which mean there exist an old procedure 
for this region) would not be assigned. So there are no additional 
AssignProcedure for region has old procedure.  And we can safely remove the 
step that CLOSING AND OPENING writing into meta
See loadMeta below, which use to visit meta to create regionstatenode for all 
regions
{code}
private void loadMeta() throws IOException {
// TODO: use a thread pool
regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() {
  @Override
  public void visitRegionState(final RegionInfo regionInfo, final State 
state,
  final ServerName regionLocation, final ServerName lastHost, final 
long openSeqNum) {
final RegionStateNode regionNode = 
regionStates.getOrCreateRegionNode(regionInfo);
synchronized (regionNode) {
  if (!regionNode.isInTransition()) {   //*{color:red}here is the 
condition{color}*
regionNode.setState(state);
regionNode.setLastHost(lastHost);
regionNode.setRegionLocation(regionLocation);
regionNode.setOpenSeqNum(openSeqNum);

if (state == State.OPEN) {
  assert regionLocation != null : "found null region location for " 
+ regionNode;
  regionStates.addRegionToServer(regionLocation, regionNode);
} else if (state == State.OFFLINE || regionInfo.isOffline()) {
  regionStates.addToOfflineRegions(regionNode);
} else {
  // These regions should have a procedure in replay
  regionStates.addRegionInTransition(regionNode, null);
}
  }
}
  }
});

// every assignment is blocked until meta is loaded.
wakeMetaLoadedEvent();
  }
{code}

> [AMv2] Retain assignment does not work well in AMv2
> ---
>
> Key: HBASE-18984
> URL: https://issues.apache.org/jira/browse/HBASE-18984
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at 
> 2.24.19 PM.png
>
>
> work on 8.17 Retain assignment in 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k
> To reproduce this error, in hbase shell:
> createTable t1 --> list_regions 't1' --> disable 't1' ---> enable 't1' --> 
> list_regions 't1' (you may need to try enable/disable multiple times)
> See attached images. same region assigned to different region servers




[jira] [Comment Edited] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-20 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213161#comment-16213161
 ] 

Yi Liang edited comment on HBASE-18984 at 10/20/17 8:23 PM:


[~ram_krish]
I just fully checked the code, and I was wrong in the above comments: during
bulk region assign, a region in transition (which means an old procedure exists
for this region) is not assigned. So no additional AssignProcedure is created
for a region that has an old procedure, and we can safely remove the step that
writes CLOSING and OPENING into meta.
See loadMeta below, which visits meta to create a RegionStateNode for every
region:
{code}
  private void loadMeta() throws IOException {
    // TODO: use a thread pool
    regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() {
      @Override
      public void visitRegionState(final RegionInfo regionInfo, final State state,
          final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) {
        final RegionStateNode regionNode = regionStates.getOrCreateRegionNode(regionInfo);
        synchronized (regionNode) {
          if (!regionNode.isInTransition()) {   //*{color:red}here is the condition{color}*
            regionNode.setState(state);
            regionNode.setLastHost(lastHost);
            regionNode.setRegionLocation(regionLocation);
            regionNode.setOpenSeqNum(openSeqNum);

            if (state == State.OPEN) {
              assert regionLocation != null : "found null region location for " + regionNode;
              regionStates.addRegionToServer(regionLocation, regionNode);
            } else if (state == State.OFFLINE || regionInfo.isOffline()) {
              regionStates.addToOfflineRegions(regionNode);
            } else {
              // These regions should have a procedure in replay
              regionStates.addRegionInTransition(regionNode, null);
            }
          }
        }
      }
    });

    // every assignment is blocked until meta is loaded.
    wakeMetaLoadedEvent();
  }
{code}



> [AMv2] Retain assignment does not work well in AMv2
> ---
>
> Key: HBASE-18984
> URL: https://issues.apache.org/jira/browse/HBASE-18984
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: Yi Liang
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-18984-V1-master.patch, Screen Shot 2017-10-10 at 
> 2.24.19 PM.png
>
>
> Work on 8.17 Retain assignment in 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k
> To reproduce this error, in the hbase shell:
> create table t1 --> list_regions 't1' --> disable 't1' --> enable 't1' --> 
> list_regions 't1' (you may need to repeat disable/enable several times)
> See attached images; the same region is assigned to different region servers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-20 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213161#comment-16213161
 ] 

Yi Liang commented on HBASE-18984:
--

[~ram_krish]
I just checked the code fully, and I was wrong in my comments above: during bulk 
region assignment, a region that is in transition is not assigned, so no 
additional AssignProcedure is created for a region that already has an old 
procedure. We can therefore safely remove the step that writes CLOSING and 
OPENING into meta.
See loadMeta below, which visits meta to create a RegionStateNode for every 
region:
{code}
private void loadMeta() throws IOException {
  // TODO: use a thread pool
  regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() {
    @Override
    public void visitRegionState(final RegionInfo regionInfo, final State state,
        final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) {
      final RegionStateNode regionNode = regionStates.getOrCreateRegionNode(regionInfo);
      synchronized (regionNode) {
        if (!regionNode.isInTransition()) {
          regionNode.setState(state);
          regionNode.setLastHost(lastHost);
          regionNode.setRegionLocation(regionLocation);
          regionNode.setOpenSeqNum(openSeqNum);

          if (state == State.OPEN) {
            assert regionLocation != null : "found null region location for " + regionNode;
            regionStates.addRegionToServer(regionLocation, regionNode);
          } else if (state == State.OFFLINE || regionInfo.isOffline()) {
            regionStates.addToOfflineRegions(regionNode);
          } else {
            // These regions should have a procedure in replay
            regionStates.addRegionInTransition(regionNode, null);
          }
        }
      }
    }
  });

  // every assignment is blocked until meta is loaded.
  wakeMetaLoadedEvent();
}
{code}



[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-19 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211714#comment-16211714
 ] 

Yi Liang commented on HBASE-18984:
--

{quote} 2. UnAssignProcedure run first, this region will be assigned as OPEN. 
=> wrong {quote}
I just checked the code, and the situation above can happen, because 
HMaster#startProcedureExecutor runs before AssignmentManager#joinCluster(). 

startProcedureExecutor starts the ProcedureExecutor and the ProcedureStore, 
and also begins the actual load of old procedures.
joinCluster reads meta and does the bulk assignment of regions.  

I think we could delay loading old procedures until at least meta is 
recovered, or even until all user regions are loaded (so that the situation 
above cannot happen). What do you think?
[~stack] [~ram_krish]
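
The ordering proposed above (read meta first, replay old procedures afterwards) can be illustrated with a small, self-contained Java sketch. This is not HBase code: the class StartupOrderSketch, its methods, and the latch are hypothetical stand-ins chosen for illustration, and the only idea carried over from the comment is that procedure replay should wait until meta has been loaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class StartupOrderSketch {
    // Hypothetical gate: opened once meta has been read.
    private final CountDownLatch metaLoaded = new CountDownLatch(1);
    private final List<String> events = new ArrayList<>();

    private synchronized void record(String event) {
        events.add(event);
    }

    synchronized List<String> events() {
        return new ArrayList<>(events);
    }

    /** Stand-in for the step that reads meta and builds region state. */
    void loadMeta() {
        record("meta loaded");
        metaLoaded.countDown();
    }

    /** Stand-in for replaying old procedures; blocks until meta is loaded. */
    void replayOldProcedures() {
        try {
            metaLoaded.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        record("old procedures replayed");
    }

    /** Starts the replay thread first, then loads meta, and returns the observed order. */
    static List<String> run() {
        StartupOrderSketch sketch = new StartupOrderSketch();
        Thread replay = new Thread(sketch::replayOldProcedures);
        replay.start();
        sketch.loadMeta();
        try {
            replay.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sketch.events();
    }

    public static void main(String[] args) {
        System.out.println(run()); // [meta loaded, old procedures replayed]
    }
}
```

Even though the replay thread is started before meta is loaded, the latch forces the replay to be recorded second. Whether a gate like this fits the real master startup path is an open question for the patch, not something this sketch settles.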





[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-19 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211625#comment-16211625
 ] 

Yi Liang commented on HBASE-18984:
--

{quote}// 2. UnAssignProcedure run first, this region will be assigned as OPEN. 
=> wrong{quote}
If we can make sure that regions are loaded before failed procedures are 
restored when the master restarts, this situation cannot happen. 

Let me check the code.



[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-17 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-18984:
-
Attachment: HBASE-18984-V1-master.patch



[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-17 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-18984:
-
Attachment: (was: HBASE-18984-V1-master.patch)



[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-17 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208493#comment-16208493
 ] 

Yi Liang commented on HBASE-18984:
--

Hi [~ram_krish], this patch only contains some cleanup of region node status 
updates that may be related to this jira.
Could you help review it? Thanks.

 



[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-17 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-18984:
-
Attachment: HBASE-18984-V1-master.patch



[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-17 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-18984:
-
Status: Patch Available  (was: Open)



[jira] [Commented] (HBASE-19017) EnableTableProcedure is not retaining the assignments

2017-10-16 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206711#comment-16206711
 ] 

Yi Liang commented on HBASE-19017:
--

I reviewed your patch; the fix is correct. 
For HBASE-18984, I also added some cleanup in AssignProcedure. You can 
commit this one first, and I will rebase my patch there. 

The problem I found does not seem to be related to retain assignment. I will 
try to reproduce it and may open a new jira for it. 

> EnableTableProcedure is not retaining the assignments
> -
>
> Key: HBASE-19017
> URL: https://issues.apache.org/jira/browse/HBASE-19017
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 2.0.0-alpha-3
>Reporter: ramkrishna.s.vasudevan
>Assignee: ramkrishna.s.vasudevan
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-19017.patch
>
>
> Found this while working on HBASE-18946. In branch-1.4, whenever we enable a 
> table we try to retain the assignment. 
> But in branch-2 and trunk, EnableTableProcedure tries to get the location 
> from the existing regionNode. It always gets null because, when a region is 
> CLOSEd while disabling a table, the regionNode's 'regionLocation' is set to 
> null, while 'lastHost' holds the name of the server where the region was 
> hosted. On the next assignment we look at the last regionLocation, not at 
> 'lastHost', and so we go ahead with a new assignment.
> On region CLOSE while disabling a table:
> {code}
> public void markRegionAsClosed(final RegionStateNode regionNode) throws IOException {
>   final RegionInfo hri = regionNode.getRegionInfo();
>   synchronized (regionNode) {
>     State state = regionNode.transitionState(State.CLOSED,
>         RegionStates.STATES_EXPECTED_ON_CLOSE);
>     regionStates.removeRegionFromServer(regionNode.getRegionLocation(), regionNode);
>     regionNode.setLastHost(regionNode.getRegionLocation());
>     regionNode.setRegionLocation(null);
>     regionStateStore.updateRegionLocation(regionNode.getRegionInfo(), state,
>         regionNode.getRegionLocation()/*null*/, regionNode.getLastHost(),
>         HConstants.NO_SEQNUM, regionNode.getProcedure().getProcId());
>     sendRegionClosedNotification(hri);
>   }
> }
> {code}
> In AssignProcedure:
> {code}
> ServerName lastRegionLocation = regionNode.offline();
> {code}
> {code}
> public ServerName setRegionLocation(final ServerName serverName) {
>   ServerName lastRegionLocation = this.regionLocation;
>   if (LOG.isTraceEnabled() && serverName == null) {
>     LOG.trace("Tracking when we are set to null " + this, new Throwable("TRACE"));
>   }
>   this.regionLocation = serverName;
>   this.lastUpdate = EnvironmentEdgeManager.currentTime();
>   return lastRegionLocation;
> }
> {code}
> So the further code in AssignProcedure:
> {code}
> boolean retain = false;
> if (!forceNewPlan) {
>   if (this.targetServer != null) {
>     retain = targetServer.equals(lastRegionLocation);
>     regionNode.setRegionLocation(targetServer);
>   } else {
>     if (lastRegionLocation != null) {
>       // Try and keep the location we had before we offlined.
>       retain = true;
>       regionNode.setRegionLocation(lastRegionLocation);
>     }
>   }
> }
> {code}
> This tries to retain the assignment but fails because lastRegionLocation is 
> always null.





[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-16 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206262#comment-16206262
 ] 

Yi Liang commented on HBASE-18984:
--

[~ram_krish], I have the same fix. After I deployed it on a real cluster, I 
saw some problems when restarting HBase: it reports errors such as hbase:meta 
is not online. 



[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-15 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205208#comment-16205208
 ] 

Yi Liang commented on HBASE-18984:
--

Retain assignment does not work well after disable/enable because every time 
we unassign a region we set regionLocation to null in its regionStateNode. 
When we assign the region again, this null is loaded as the current region 
location, and when the region location is null, AM assigns the region to a 
random region server.  
I have already fixed the issue above, but I see other errors when restarting 
the cluster. Still debugging. 
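
The failure mode described above can be reproduced in a toy model. The sketch below is not the HBase implementation and not the patch attached to this jira: RegionNode and both helper methods are hypothetical, and the fallback to lastHost is only an assumption based on the HBASE-19017 description, which notes that lastHost still holds the server name after a close.

```java
import java.util.Optional;

public class RetainAssignmentSketch {
    /** Toy stand-in for a region state node; not the HBase class. */
    static final class RegionNode {
        String regionLocation; // server currently hosting the region, or null
        String lastHost;       // server that hosted the region before it closed

        RegionNode(String regionLocation) {
            this.regionLocation = regionLocation;
        }

        /** Close path as described in the comment: remember the host, clear the location. */
        void markClosed() {
            lastHost = regionLocation;
            regionLocation = null;
        }
    }

    /** Retain candidate when only regionLocation is consulted (the reported behaviour). */
    static Optional<String> retainFromLocationOnly(RegionNode node) {
        return Optional.ofNullable(node.regionLocation);
    }

    /** Hypothetical alternative: fall back to lastHost when regionLocation is null. */
    static Optional<String> retainWithLastHostFallback(RegionNode node) {
        String candidate = node.regionLocation != null ? node.regionLocation : node.lastHost;
        return Optional.ofNullable(candidate);
    }

    public static void main(String[] args) {
        RegionNode node = new RegionNode("rs1");
        node.markClosed(); // what disabling the table does to each region in this toy model

        System.out.println(retainFromLocationOnly(node).isPresent());         // false
        System.out.println(retainWithLastHostFallback(node).orElse("none"));  // rs1
    }
}
```

With only regionLocation consulted there is nothing to retain after a close, so any server may be picked; consulting lastHost recovers the previous server in this model.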



[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-10 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199515#comment-16199515
 ] 

Yi Liang commented on HBASE-18984:
--

I will do some research on this. Thanks for the information




[jira] [Commented] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-10 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199464#comment-16199464
 ] 

Yi Liang commented on HBASE-18984:
--

ping [~stack], do you know about retain assignment? I just want to make sure 
this is a real problem before digging into it. 



[jira] [Updated] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-10 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-18984:
-
Attachment: Screen Shot 2017-10-10 at 2.24.19 PM.png



[jira] [Created] (HBASE-18984) [AMv2] Retain assignment does not work well in AMv2

2017-10-10 Thread Yi Liang (JIRA)
Yi Liang created HBASE-18984:


 Summary: [AMv2] Retain assignment does not work well in AMv2
 Key: HBASE-18984
 URL: https://issues.apache.org/jira/browse/HBASE-18984
 Project: HBase
  Issue Type: Bug
  Components: proc-v2
Affects Versions: 2.0.0
Reporter: Yi Liang
Assignee: Yi Liang
 Fix For: 2.0.0


Work on 8.17 Retain assignment in 
https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.epjn9nege80k

To reproduce this error, in the hbase shell:
create table t1 --> list_regions 't1' --> disable 't1' --> enable 't1' --> 
list_regions 't1' (you may need to repeat disable/enable several times)

See attached images; the same region is assigned to different region servers.





[jira] [Commented] (HBASE-18108) Procedure WALs are archived but not cleaned; fix

2017-10-09 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197760#comment-16197760
 ] 

Yi Liang commented on HBASE-18108:
--

Hi Peter, I will take a look today or tomorrow


> Procedure WALs are archived but not cleaned; fix
> 
>
> Key: HBASE-18108
> URL: https://issues.apache.org/jira/browse/HBASE-18108
> Project: HBase
>  Issue Type: Sub-task
>  Components: proc-v2
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: Peter Somogyi
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: HBASE-18108.master.001.patch, 
> HBASE-18108.master.002.patch
>
>
> The Procedure WAL files used to be deleted when done. HBASE-14614 keeps them 
> around in case of issues, but what is missing is a GC for no-longer-needed WAL 
> files. This one is pretty important.
> From WALProcedureStore Cleaner TODO in 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.r2pc835nb7vi





[jira] [Commented] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590

2017-10-03 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189961#comment-16189961
 ] 

Yi Liang commented on HBASE-16894:
--

For branch-1, I cannot access the unit test result page above to see the 
details, but all of those tests pass locally. For branch-2/master, we got an 
all-green run. I think both patches are good to commit. Any comments, 
[~apurtell]? Thanks.

> Create more than 1 split per region, generalize HBASE-12590
> ---
>
> Key: HBASE-16894
> URL: https://issues.apache.org/jira/browse/HBASE-16894
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.0.0-alpha-2
>Reporter: Enis Soztutar
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, 
> HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, 
> ImplementaionAndSomeQuestion.docx
>
>
> A common request from users is to be able to better control how many map 
> tasks are created per region. Right now, it is always 1 region = 1 input 
> split = 1 map task. Same goes for Spark since it uses the TIF. With region 
> sizes as large as 50 GBs, it is desirable to be able to create more than 1 
> split per region.
> HBASE-12590 adds a config property for MR jobs to be able to handle skew in 
> region sizes. The algorithm is roughly: 
> {code}
> If (region size >= average size*ratio) : cut the region into two MR input 
> splits
> If (average size <= region size < average size*ratio) : one region as one MR 
> input split
> If (sum of several continuous regions size < average size * ratio): combine 
> these regions into one MR input split.
> {code}
> Although we can set data skew ratio to be 0.5 or something to abuse 
> HBASE-12590 into creating more than 1 split task per region, it is not ideal. 
> But there is no way to create more with the patch as it is. For example we 
> cannot create more than 2 tasks per region. 
> If we want to fix this properly, we should extend the approach in 
> HBASE-12590, and make it so that the client can specify the desired num of 
> mappers, or desired split size, and the TIF generates the splits based on the 
> current region sizes very similar to the algorithm in HBASE-12590, but a more 
> generic way. This also would eliminate the hand tuning of data skew ratio.
> We also can think about the guidepost approach that Phoenix has in the stats 
> table which is used for exactly this purpose. Right now, the region can be 
> split into powers of two assuming uniform distribution within the region. 
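
The three rules quoted above, plus the proposed "desired split size" generalization, can be sketched in a few lines of Java. This is one reading of the rough algorithm in the description, not the TableInputFormat code: the class SkewSplitSketch, both method names, and the string labels are hypothetical, and the way small neighbouring regions are grouped is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class SkewSplitSketch {
    /** Applies the three skew rules to region sizes listed in key order. */
    static List<String> planSplits(long[] regionSizes, double ratio) {
        double total = 0;
        for (long size : regionSizes) {
            total += size;
        }
        double average = total / regionSizes.length;
        double threshold = average * ratio;

        List<String> splits = new ArrayList<>();
        int i = 0;
        while (i < regionSizes.length) {
            long size = regionSizes[i];
            if (size >= threshold) {
                // Rule 1: a large region is cut into two input splits.
                splits.add("region " + i + " (1/2)");
                splits.add("region " + i + " (2/2)");
                i++;
            } else if (size >= average) {
                // Rule 2: an average-sized region is one input split.
                splits.add("region " + i);
                i++;
            } else {
                // Rule 3: combine consecutive small regions while the sum stays under the threshold.
                long sum = size;
                int j = i + 1;
                while (j < regionSizes.length && regionSizes[j] < average
                        && sum + regionSizes[j] < threshold) {
                    sum += regionSizes[j];
                    j++;
                }
                splits.add(j - i == 1 ? "region " + i : "regions " + i + "-" + (j - 1));
                i = j;
            }
        }
        return splits;
    }

    /** Generalization: number of splits for one region given a desired split size (ceiling division). */
    static long splitsForTargetSize(long regionSize, long targetSplitSize) {
        return Math.max(1, (regionSize + targetSplitSize - 1) / targetSplitSize);
    }

    public static void main(String[] args) {
        System.out.println(planSplits(new long[] {10, 1, 1, 4, 4}, 2.0));
        System.out.println(splitsForTargetSize(50, 10));
    }
}
```

For sizes 10, 1, 1, 4, 4 and a ratio of 2, the average is 4 and the threshold is 8, so the plan is: region 0 in two halves, regions 1-2 combined, and regions 3 and 4 as one split each. A 50 GB region with a 10 GB target size yields 5 splits, which the two-way rule cannot express.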





[jira] [Commented] (HBASE-18105) [AMv2] Split/Merge need cleanup; currently they diverge and do not fully embrace AMv2 world

2017-10-02 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188894#comment-16188894
 ] 

Yi Liang commented on HBASE-18105:
--

[~stack], do you have any other AMv2-related tasks I could help with? I 
am quite free this week.  :)
I found some issues with region states in AMv2, but I will start fixing them 
after HBASE-18490 is done. 
Also, as we discussed in HBASE-18803, how are we going to deal with the 
curator jar? A shaded jar? 
Thanks

> [AMv2] Split/Merge need cleanup; currently they diverge and do not fully 
> embrace AMv2 world
> ---
>
> Key: HBASE-18105
> URL: https://issues.apache.org/jira/browse/HBASE-18105
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: Yi Liang
> Fix For: 2.0.0-alpha-4
>
> Attachments: HBASE-14350-V1-master.patch
>
>
> Region Split and Merge work on the new AMv2 but they work differently. This 
> issue is about bringing them back together and fully embracing the AMv2 
> program.
> They both have issues mostly the fact that they carry around baggage no 
> longer necessary in the new world of assignment.
> Here are some of the items:
> Split and Merge metadata modifications are done by the Master now but we have 
> vestige of Split/Merge on RS still; e.g. when we SPLIT, we ask the Master 
> which asks the RS, which turns around, and asks the Master to run the 
> operation. Fun. MERGE is all done Master-side.
>  
> Clean this up. Remove asking RS to run SPLIT and remove RegionMergeRequest, 
> etc. on RS-side. Also remove PONR. We don't need Points-Of-No-Return now 
> that we are up on Pv2. Remove all calls in Interfaces; they are unused. Make RS still able 
> to detect when split, but have it be a client of Master like anyone else.
> Split is Async but does not return procId
> Split is async. Doesn’t return the procId though. Merge does. Fix. Only hard 
> part here I think is the Admin API does not allow procid return.
> Flags
> Currently OFFLINE is determined by looking either at the master instance of 
> HTD (isOffline) and/or at the RegionState#state. Ditto for SPLIT. For MERGE, 
> we rely on RegionState#state. Related is a note above on how split works -- 
> there is a split flag in HTD when there should not be.
>  
> TODO is move to rely on RegionState#state exclusively in Master.
> From Split/Merge Procedures need finishing in 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.4b60dc1h4m1f





[jira] [Commented] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590

2017-10-02 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188589#comment-16188589
 ] 

Yi Liang commented on HBASE-16894:
--

The patch for the master branch seems OK.
I am retrying branch-1; the tests that failed above on branch-1 pass locally. 

> Create more than 1 split per region, generalize HBASE-12590
> ---
>
> Key: HBASE-16894
> URL: https://issues.apache.org/jira/browse/HBASE-16894
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.0.0-alpha-2
>Reporter: Enis Soztutar
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, 
> HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, 
> ImplementaionAndSomeQuestion.docx
>
>
> A common request from users is to be able to better control how many map 
> tasks are created per region. Right now, it is always 1 region = 1 input 
> split = 1 map task. Same goes for Spark since it uses the TIF. With region 
> sizes as large as 50 GBs, it is desirable to be able to create more than 1 
> split per region.
> HBASE-12590 adds a config property for MR jobs to be able to handle skew in 
> region sizes. The algorithm is roughly: 
> {code}
> If (region size >= average size*ratio) : cut the region into two MR input 
> splits
> If (average size <= region size < average size*ratio) : one region as one MR 
> input split
> If (sum of several continuous regions size < average size * ratio): combine 
> these regions into one MR input split.
> {code}
> Although we can set the data skew ratio to 0.5 or so to abuse 
> HBASE-12590 into creating more than 1 split per region, this is not ideal, 
> and the patch as it stands cannot create more than 2 tasks per region. 
> If we want to fix this properly, we should extend the approach in 
> HBASE-12590 so that the client can specify the desired number of 
> mappers, or the desired split size, and the TIF generates the splits based on 
> the current region sizes, much like the algorithm in HBASE-12590 but in a more 
> generic way. This would also eliminate the hand tuning of the data skew ratio.
> We also can think about the guidepost approach that Phoenix has in the stats 
> table which is used for exactly this purpose. Right now, the region can be 
> split into powers of two assuming uniform distribution within the region. 
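
The generalization proposed above, where the client gives a desired split size and each region is cut into as many splits as that requires, could look roughly like the sketch below. It assumes uniform key distribution within the region, as the description does. The function name `splits_for_region`, its parameters, and the use of integer row keys are hypothetical; HBase row keys are byte arrays, so real code would need a byte-wise equivalent.

```python
from typing import List, Tuple


def splits_for_region(start_key: int, end_key: int,
                      region_size: int, target_split_size: int) -> List[Tuple[int, int]]:
    """Cut [start_key, end_key) into evenly spaced sub-ranges.

    Hypothetical sketch: integer keys stand in for byte-array row keys,
    and data is assumed to be spread uniformly over the key range.
    """
    if end_key <= start_key:
        raise ValueError("end_key must be greater than start_key")
    if target_split_size <= 0:
        raise ValueError("target_split_size must be positive")
    # Ceiling division: enough splits that none exceeds the target size.
    wanted = max(1, -(-region_size // target_split_size))
    span = end_key - start_key
    # Never create more splits than there are distinct keys in the range.
    count = min(wanted, span)
    bounds = [start_key + span * k // count for k in range(count + 1)]
    return list(zip(bounds[:-1], bounds[1:]))
```

Unlike the skew-ratio rules, the number of splits here is not capped at two per region and is not restricted to powers of two; a region smaller than the target size simply stays one split.
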





[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590

2017-10-02 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-16894:
-
Attachment: (was: HBASE-16894.branch-1.patch)

> Create more than 1 split per region, generalize HBASE-12590
> ---
>
> Key: HBASE-16894
> URL: https://issues.apache.org/jira/browse/HBASE-16894
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.0.0-alpha-2
>Reporter: Enis Soztutar
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, 
> HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, 
> ImplementaionAndSomeQuestion.docx
>
>


[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590

2017-10-02 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-16894:
-
Attachment: HBASE-16894.branch-1.patch

> Create more than 1 split per region, generalize HBASE-12590
> ---
>
> Key: HBASE-16894
> URL: https://issues.apache.org/jira/browse/HBASE-16894
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.0.0-alpha-2
>Reporter: Enis Soztutar
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, 
> HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, 
> ImplementaionAndSomeQuestion.docx
>
>


[jira] [Comment Edited] (HBASE-18105) [AMv2] Split/Merge need cleanup; currently they diverge and do not fully embrace AMv2 world

2017-10-02 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188426#comment-16188426
 ] 

Yi Liang edited comment on HBASE-18105 at 10/2/17 5:06 PM:
---

[~stack], if it is not good to change the proto indices, I can keep them at their 
original values and add a comment. I think the other changes and test cases are OK. 
Once this jira is resolved, I think the cleanup for split/merge is almost done.


was (Author: easyliangjob):
[~stack], if it is not good to change the proto indices, I can keep them at their 
original values and add a comment. But the other changes and test cases should be OK.

> [AMv2] Split/Merge need cleanup; currently they diverge and do not fully 
> embrace AMv2 world
> ---
>
> Key: HBASE-18105
> URL: https://issues.apache.org/jira/browse/HBASE-18105
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-14350-V1-master.patch
>
>
> Region Split and Merge work on the new AMv2 but they work differently. This 
> issue is about bringing them back together and fully embracing the AMv2 
> program.
> They both have issues, mostly that they carry around baggage no 
> longer necessary in the new world of assignment.
> Here are some of the items:
> Split and Merge metadata modifications are now done by the Master, but we still 
> have vestiges of Split/Merge on the RS; e.g. when we SPLIT, we ask the Master, 
> which asks the RS, which turns around and asks the Master to run the 
> operation. Fun. MERGE is all done Master-side.
>  
> Clean this up. Remove asking the RS to run SPLIT and remove RegionMergeRequest, 
> etc. on the RS side. Also remove PONR. We don’t need Points-Of-No-Return now that 
> we are up on Pv2. Remove all calls in Interfaces; they are unused. Make the RS 
> still able to detect when to split, but have it be a client of the Master like anyone else.
> Split is Async but does not return procId
> Split is async, but it doesn’t return the procId. Merge does. Fix. The only hard 
> part here, I think, is that the Admin API does not allow returning a procId.
> Flags
> Currently OFFLINE is determined by looking either at the master instance of 
> HTD (isOffline) and/or at the RegionState#state. Ditto for SPLIT. For MERGE, 
> we rely on RegionState#state. Related is a note above on how split works -- 
> there is a split flag in HTD when there should not be.
>  
> The TODO is to move to relying on RegionState#state exclusively in the Master.
> From Split/Merge Procedures need finishing in 
> https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.4b60dc1h4m1f





[jira] [Commented] (HBASE-18105) [AMv2] Split/Merge need cleanup; currently they diverge and do not fully embrace AMv2 world

2017-10-02 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188426#comment-16188426
 ] 

Yi Liang commented on HBASE-18105:
--

[~stack], if it is not good to change the proto indices, I can keep them at their 
original values and add a comment. But the other changes and test cases should be OK.

> [AMv2] Split/Merge need cleanup; currently they diverge and do not fully 
> embrace AMv2 world
> ---
>
> Key: HBASE-18105
> URL: https://issues.apache.org/jira/browse/HBASE-18105
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-14350-V1-master.patch
>
>


[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590

2017-10-02 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-16894:
-
Attachment: HBASE-16894.master.patch

> Create more than 1 split per region, generalize HBASE-12590
> ---
>
> Key: HBASE-16894
> URL: https://issues.apache.org/jira/browse/HBASE-16894
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.0.0-alpha-2
>Reporter: Enis Soztutar
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, 
> HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, 
> ImplementaionAndSomeQuestion.docx
>
>


[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590

2017-10-02 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-16894:
-
Attachment: (was: HBASE-16894.master.patch)

> Create more than 1 split per region, generalize HBASE-12590
> ---
>
> Key: HBASE-16894
> URL: https://issues.apache.org/jira/browse/HBASE-16894
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.0.0-alpha-2
>Reporter: Enis Soztutar
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, 
> HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, 
> ImplementaionAndSomeQuestion.docx
>
>


[jira] [Commented] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590

2017-09-29 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186590#comment-16186590
 ] 

Yi Liang commented on HBASE-16894:
--

Hi [~apurtell],
  Thanks. I have provided two patches, one for branch-1.0 and one for 
master/branch-2.0. Let me know if you have any questions.
{code}
HBASE-16894.branch-1.patch
HBASE-16894.master.patch
{code}

> Create more than 1 split per region, generalize HBASE-12590
> ---
>
> Key: HBASE-16894
> URL: https://issues.apache.org/jira/browse/HBASE-16894
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.0.0-alpha-2
>Reporter: Enis Soztutar
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, 
> HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, 
> ImplementaionAndSomeQuestion.docx
>
>


[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590

2017-09-29 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-16894:
-
Attachment: (was: HBASE-12590-v1.patch)

> Create more than 1 split per region, generalize HBASE-12590
> ---
>
> Key: HBASE-16894
> URL: https://issues.apache.org/jira/browse/HBASE-16894
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.0.0-alpha-2
>Reporter: Enis Soztutar
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-16894.branch-1.patch, HBASE-16894.master.patch, 
> HBASE-16894-V2-master.patch, HBASE-16894-V3-master.patch, 
> ImplementaionAndSomeQuestion.docx
>
>


[jira] [Updated] (HBASE-16894) Create more than 1 split per region, generalize HBASE-12590

2017-09-29 Thread Yi Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liang updated HBASE-16894:
-
Attachment: HBASE-16894.branch-1.patch
HBASE-16894.master.patch

> Create more than 1 split per region, generalize HBASE-12590
> ---
>
> Key: HBASE-16894
> URL: https://issues.apache.org/jira/browse/HBASE-16894
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.0.0-alpha-2
>Reporter: Enis Soztutar
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-12590-v1.patch, HBASE-16894.branch-1.patch, 
> HBASE-16894.master.patch, HBASE-16894-V2-master.patch, 
> HBASE-16894-V3-master.patch, ImplementaionAndSomeQuestion.docx
>
>


[jira] [Commented] (HBASE-18894) null pointer exception in list_regions in shell command

2017-09-29 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186069#comment-16186069
 ] 

Yi Liang commented on HBASE-18894:
--

Yeah, the last two are fixed. Thanks.

> null pointer exception in list_regions in shell command
> ---
>
> Key: HBASE-18894
> URL: https://issues.apache.org/jira/browse/HBASE-18894
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha-3
>Reporter: Yi Liang
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-18894-v1-master.patch, 
> HBASE-18894-v2-master.patch, HBASE-18894-v3-master.patch
>
>
> This error appears when running the list_regions command after disabling 't1', 
> or after running split 't1' and before the split completes. 
> It is caused by the region being disabled or still in transition:
> {quote}
> list_regions 't1'
> ERROR: undefined method `getDataLocality' for nil:NilClass
> {quote}





[jira] [Commented] (HBASE-18105) [AMv2] Split/Merge need cleanup; currently they diverge and do not fully embrace AMv2 world

2017-09-28 Thread Yi Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185091#comment-16185091
 ] 

Yi Liang commented on HBASE-18105:
--

[~stack], any thoughts about this patch?



> [AMv2] Split/Merge need cleanup; currently they diverge and do not fully 
> embrace AMv2 world
> ---
>
> Key: HBASE-18105
> URL: https://issues.apache.org/jira/browse/HBASE-18105
> Project: HBase
>  Issue Type: Sub-task
>  Components: Region Assignment
>Affects Versions: 2.0.0
>Reporter: stack
>Assignee: Yi Liang
> Fix For: 2.0.0
>
> Attachments: HBASE-14350-V1-master.patch
>
>

