[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-23 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002622#comment-17002622
 ] 

Lijin Bin commented on HBASE-23597:
---

[~zhangduo]
Use a dedicated thread for executing WorkerMonitor is good, i just do not get 
the lock in HBASE-23613 for DeadServerMetricRegionChore.

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> 

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-23 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002371#comment-17002371
 ] 

Michael Stack commented on HBASE-23597:
---

[~binlijin] Nice find.

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> 

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-23 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002269#comment-17002269
 ] 

Duo Zhang commented on HBASE-23597:
---

So the solution here is that we should have a dedicated thread for executing 
WorkerMonitor?

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-23 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002234#comment-17002234
 ] 

Lijin Bin commented on HBASE-23597:
---

After debuging, i find the root cause is WorkerMonitor do not execute for a 
while because it is blocked by DeadServerMetricRegionChore.
TimeoutExecutorThread execute not only WorkerMonitor, but also 
DeadServerMetricRegionChore RegionInTransitionChore...
{code}
"ProcExecTimeout" #1052 daemon prio=5 os_prio=0 tid=0x7f5c98cc4000 
nid=0x229 waiting on condition [0x7f5c2f857000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0005c312ad80> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at 
org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:313)
at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager$DeadServerMetricRegionChore.periodicExecute(AssignmentManager.java:1186)
at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager$DeadServerMetricRegionChore.periodicExecute(AssignmentManager.java:1163)
at 
org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread.executeInMemoryChore(TimeoutExecutorThread.java:120)
at 
org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread.execDelayedProcedure(TimeoutExecutorThread.java:99)
at 
org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread.run(TimeoutExecutorThread.java:66)

"PEWorker-1" #1053 daemon prio=5 os_prio=0 tid=0x7f5c98cc5800 nid=0x22a in 
Object.wait() [0x7f5c2f756000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:168)
- locked <0x0005839f18b0> (a 
java.util.concurrent.atomic.AtomicBoolean)
at org.apache.hadoop.hbase.client.HTable.put(HTable.java:540)
at 
org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:209)
at 
org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateUserRegionLocation(RegionStateStore.java:203)
at 
org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:141)
at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.persistToMeta(AssignmentManager.java:1742)
at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:298)
at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:58)
at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
{code}

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] 

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-19 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000661#comment-17000661
 ] 

Lijin Bin commented on HBASE-23597:
---

[~stack] Ok,get it。

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at 
> 

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-19 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000657#comment-17000657
 ] 

Michael Stack commented on HBASE-23597:
---

Let me open my own issue, HBASE-23600 [~binlijin]. I was trying to gather data 
before doing so but let me put up what I have so far (As you can tell, I've 
been a little fixated with it of late. Thanks for asking).  

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at 

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-19 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000646#comment-17000646
 ] 

Lijin Bin commented on HBASE-23597:
---

[~stack] It doesn't matter. Also is there any thing that we can do to help you 
for you problem? 

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-19 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000636#comment-17000636
 ] 

Michael Stack commented on HBASE-23597:
---

[~binlijin] Sorry. Yeah, polluting your issue with prioritizing meta request 
concerns. Let me undo my mess. Sorry about that.

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> 

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-19 Thread Lijin Bin (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000633#comment-17000633
 ] 

Lijin Bin commented on HBASE-23597:
---

[~stack] what the problem you encounter? Looks like we encounter two different 
problem, i encounter the problem is  ServerCrashProcedure 
server=100.107.165.61,60020,1576553057082(carry meta) queued for quit a while 
blocked by other procedure, so i want to execute ServerCrashProcedure(carry 
meta) and TransitRegionStateProcedure for meta region in ProcedureExecutor as 
fast as it produced.
Do you want to the patch https://issues.apache.org/jira/browse/HBASE-22280, 
which isolate the read/write to hbase:meta.


> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
> Attachments: priority.rpc.patch
>
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> 

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-19 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000625#comment-17000625
 ] 

Michael Stack commented on HBASE-23597:
---

I also upped the two configs below from theiir default of 10 to 10.  We 
will fail an RPC dispatch (which will throw a CallQueueTooB..Exception) if the 
count of outstanding RPCs is handler count * the configs below.  In my case it 
is about 1000 Which should be enough and less the default queue size of 1G 
but hard to tell. Trying it out.

public static final String IPC_SERVER_MAX_CALLQUEUE_LENGTH = 
"hbase.ipc.server.max.callqueue.length";
public static final String IPC_SERVER_PRIORITY_MAX_CALLQUEUE_LENGTH = 
"hbase.ipc.server.priority.max.callqueue.length";

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
> Attachments: priority.rpc.patch
>
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> 

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-19 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000556#comment-17000556
 ] 

Michael Stack commented on HBASE-23597:
---

Attached what I've been playing with.

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
> Attachments: priority.rpc.patch
>
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42190)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>   

[jira] [Commented] (HBASE-23597) Give high priority for meta assign procedure and ServerCrashProcedure which carry meta.

2019-12-19 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000541#comment-17000541
 ] 

Michael Stack commented on HBASE-23597:
---

We check queued RPC size and send CallQueueTooBigException if in excess even if 
high-priority. I've been playing w/ trying to let high priority calls in 
anyways even when in excess of size limits... But then there are other checks 
deeper in the RPC handling that look at count of RPCs outstanding that cause us 
fail dispatch. These are a bit harder to figure...  Our queuing needs a bit of 
study. The CoDel stuff needs another look. 

> Give high priority for meta assign procedure and ServerCrashProcedure which 
> carry meta.
> ---
>
> Key: HBASE-23597
> URL: https://issues.apache.org/jira/browse/HBASE-23597
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.2.2
>Reporter: Lijin Bin
>Priority: Major
>
> {code}
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> assignment.TransitRegionStateProcedure: Starting pid=23568, ppid=23567, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> openRegion rit=OPEN, location=100.107.165.61,60020,1576553057082; 
> loc=100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:41,698 INFO  [KeepAlivePEWorker-18] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as 100.107.165.61,60020,1576553057082
> 2019-12-18 16:14:43,515 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23569, resume processing parent pid=23568, 
> ppid=23567, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
> locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
> ASSIGN
> 2019-12-18 16:14:43,518 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=23569, ppid=23568, state=SUCCESS; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.5970sec
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished subprocedure pid=23568, resume processing parent pid=23567, 
> state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure 
> server=100.107.165.22,60020,1576553019781, splitWal=true, meta=true
> 2019-12-18 16:14:43,522 INFO  [PEWorker-9] procedure2.ProcedureExecutor: 
> Finished pid=23568, ppid=23567, state=SUCCESS; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN in 6.4630sec
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor: Stored pid=69619, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 DEBUG [RegionServerTracker-0] 
> procedure.MasterProcedureScheduler: Add 
> ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 
> size=1) to run queue because: the exclusive lock is not held by anyone when 
> adding pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true
> 2019-12-18 16:15:07,212 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=69619 for 
> 100.107.165.61,60020,1576553057082 (carryingMeta=true) 
> 100.107.165.61,60020,1576553057082/CRASHED/regionCount=13026/lock=java.util.concurrent.locks.ReentrantReadWriteLock@68f2ee72[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2019-12-18 16:15:21,629 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=959,queue=191,port=6] 
> master.DeadServer: Removed 100.107.165.61,60020,1576553057082, 
> processing=true, numProcessing=0
> 2019-12-18 16:16:14,779 DEBUG [qtp1688526221-1038] 
> client.ConnectionImplementation: locateRegionInMeta parentTable='hbase:meta', 
> attempt=0 of 31 failed; retrying after sleep of 31
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=31, exceptions:
> Wed Dec 18 16:16:14 CST 2019, null, java.net.SocketTimeoutException: 
> callTimeout=6, callDuration=81362: 
> org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on 100.107.165.61,60020,1576656916048
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3349)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3326)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1439)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:2967)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3300)
> at 
>