[jira] [Comment Edited] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure

2018-03-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396233#comment-16396233
 ] 

stack edited comment on HBASE-20152 at 3/13/18 12:16 AM:
-

Ran a test with patches from two subtasks in place -- the deadlock fix 
HBASE-20173 and the fail-fast throw exception when move is called HBASE-20178 
-- using the really nice IntegrationTestDDLMasterFailover... It creates 
hundreds of tables and tens of thousands of regions. Stuff seemed to work doing 
multiple runs w/ monkies and stop/start.


was (Author: stack):
Ran a test with patches from two subtasks in place -- the deadlock fix and the 
fail-fast throw exception when move is called -- using the really nice 
IntegrationTestDDLMasterFailover... It creates hundreds of tables and tens of 
thousands of regions. Stuff seemed to work doing multiple runs w/ monkies and 
stop/start.

> [AMv2] DisableTableProcedure versus ServerCrashProcedure
> 
>
> Key: HBASE-20152
> URL: https://issues.apache.org/jira/browse/HBASE-20152
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Reporter: stack
>Assignee: stack
>Priority: Major
>
> Seeing a small spate of issues where disabled tables/regions are being 
> assigned. Usually they happen when a DisableTableProcedure is running 
> concurrent with a ServerCrashProcedure. See below. See associated 
> HBASE-20131. This is umbrella issue for fixing.
> h3. Deadlock
> From HBASE-20137, 'TestRSGroups is Flakey', 
> https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325
> {code}
>  * SCP is running because a server was aborted in test.
>  * SCP starts AssignProcedure of region X from crashed server.
>  * DisableTable Procedure runs because test has finished and we're doing 
> table delete. Queues 
>  * UnassignProcedure for region X.
>  * Disable Unassign gets Lock on region X first.
>  * SCP AssignProcedure tries to get lock, waits on lock.
>  * DisableTable Procedure UnassignProcedure RPC fails because server is down 
> (Thats why the SCP).
>  * Tries to expire the server it failed the RPC against. Fails (currently 
> being SCP'd).
>  * DisableTable Procedure Unassign is suspended. It is a suspend with lock on 
> region X held
>  * SCP can't run because lock on X is held
>  * Test timesout.
> {code}
> Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 
> is the disable table:
> {code}
> 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] 
> procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, 
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure 
> server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false
> 2018-03-05 11:29:21,300 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo
> 2018-03-05 11:29:21,307 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): 
> Client=jenkins//172.17.0.2 list rsgroup
> 2018-03-05 11:29:21,312 INFO  [Time-limited test] client.HBaseAdmin$15(901): 
> Started disable of Group_ns:testKillRS
> 2018-03-05 11:29:21,313 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS
> 2018-03-05 11:29:21,384 INFO  [PEWorker-9] 
> procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, 
> ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}]
> 2018-03-05 11:29:21,534 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> procedure2.ProcedureExecutor(865): Stored pid=309, 
> state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure 
> table=Group_ns:testKillRS
> 2018-03-05 11:29:21,542 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,644 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,847 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put 
> {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117}
> 2018-03-05 11:29:22,123 INFO  [PEWorker-5] hbase.MetaTableAccessor(1646): 
> Updated table Group_ns:testKillRS 

[jira] [Comment Edited] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure

2018-03-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390901#comment-16390901
 ] 

stack edited comment on HBASE-20152 at 3/9/18 10:26 PM:


New scenario seen on internal cluster (not sure what the test is doing... kinda 
crazy killing of cluster and then restarts..)

h3. Procedure after Cluster Shutdown
{code}
 * Cluster shutdown is set. This means cluster down flag is set, we expire 
servers, but no ServerCrashProcedure gets scheduled.
  2018-03-03 05:51:33,852 INFO org.apache.hadoop.hbase.master.ServerManager: 
Cluster shutdown set; quasar-lxosrm-3.vpc.cloudera.com,22101,1520080141368 
expired; onlineServers=2
  2018-03-03 05:51:33,859 INFO 
org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node 
deleted, processing expiration 
[quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394]
 * Just after, a Move Region event comes in
  2018-03-03 05:51:33,999 INFO 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=158, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210., 
source=quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394, destination= 
hbase:namespace hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210.
* Move region tries to unassign but RPC fails with java.io.IOException: Call to 
quasar-lxosrm-2.vpc.cloudera.com/172.26.11.155:22101 failed on local exception: 
org.apache.hadoop.hbase.ipc.StoppedRpcClientException
* Can't schedule an SCP. Fails because server is going down already not 
present as online. Procedure suspended.
* New Master, Unassign procedure rerun. Does dispatch to non-existent server.
* Fails to schedule SCP because server is long gone.
* Procedure is Suspended. Stuck forever.
{code}



was (Author: stack):
New scenario seen on internal cluster (not sure what the test is doing... kinda 
crazy killing of cluster and then restarts..)

h3. Procedure after Cluster Shutdown
{code}
 * Cluster shutdown is set. This means cluster down flag is set, we expire 
servers, but no ServerCrashProcedure gets scheduled.
  2018-03-03 05:51:33,852 INFO org.apache.hadoop.hbase.master.ServerManager: 
Cluster shutdown set; quasar-lxosrm-3.vpc.cloudera.com,22101,1520080141368 
expired; onlineServers=2
  2018-03-03 05:51:33,859 INFO 
org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node 
deleted, processing expiration 
[quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394]
 * Just after, a Move Region event comes in
  2018-03-03 05:51:33,999 INFO 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=158, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210., 
source=quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394, destination= 
hbase:namespace hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210.
* Move region tries to unassign but RPC fails with java.io.IOException: Call to 
quasar-lxosrm-2.vpc.cloudera.com/172.26.11.155:22101 failed on local exception: 
org.apache.hadoop.hbase.ipc.StoppedRpcClientException
* Can't schedule an SCP. Fails because server is going down already not 
present as online. Procedure suspended.
* New Master, Unassign procedure rerun. Does dispatch to non-existent server.
* Fails to schedule SCP because server is long gone.
* Procedure is Suspended. Stuck forever.
{code}

Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 is 
the disable table:
{code}
2018-03-05 11:29:21,224 DEBUG [PEWorker-7] procedure.ServerCrashProcedure(225): 
Done splitting WALs pid=308, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
ServerCrashProcedure server=1cfd208ff882,40584,1520249102524, splitWal=true, 
meta=false
2018-03-05 11:29:21,300 INFO  
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo
2018-03-05 11:29:21,307 INFO  
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): 
Client=jenkins//172.17.0.2 list rsgroup
2018-03-05 11:29:21,312 INFO  [Time-limited test] client.HBaseAdmin$15(901): 
Started disable of Group_ns:testKillRS
2018-03-05 11:29:21,313 INFO  
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS
2018-03-05 11:29:21,384 INFO  [PEWorker-9] procedure2.ProcedureExecutor(1495): 
Initialized subprocedures=[{pid=310, ppid=308, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}]
2018-03-05 11:29:21,534 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
procedure2.ProcedureExecutor(865): Stored pid=309, 
state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure 

[jira] [Comment Edited] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure

2018-03-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390901#comment-16390901
 ] 

stack edited comment on HBASE-20152 at 3/9/18 10:18 PM:


New scenario seen on internal cluster (not sure what the test is doing... kinda 
crazy killing of cluster and then restarts..)

h3. Procedure after Cluster Shutdown
{code}
 * Cluster shutdown is set. This means cluster down flag is set, we expire 
servers, but no ServerCrashProcedure gets scheduled.
  2018-03-03 05:51:33,852 INFO org.apache.hadoop.hbase.master.ServerManager: 
Cluster shutdown set; quasar-lxosrm-3.vpc.cloudera.com,22101,1520080141368 
expired; onlineServers=2
  2018-03-03 05:51:33,859 INFO 
org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node 
deleted, processing expiration 
[quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394]
 * Just after, a Move Region event comes in
  2018-03-03 05:51:33,999 INFO 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=158, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210., 
source=quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394, destination= 
hbase:namespace hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210.
* Move region tries to unassign but RPC fails with java.io.IOException: Call to 
quasar-lxosrm-2.vpc.cloudera.com/172.26.11.155:22101 failed on local exception: 
org.apache.hadoop.hbase.ipc.StoppedRpcClientException
* Can't schedule an SCP. Fails because server is going down already not 
present as online. Procedure suspended.
* New Master, Unassign procedure rerun. Does dispatch to non-existent server.
* Fails to schedule SCP because server is long gone.
* Procedure is Suspended. Stuck forever.
{code}

Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 is 
the disable table:
{code}
2018-03-05 11:29:21,224 DEBUG [PEWorker-7] procedure.ServerCrashProcedure(225): 
Done splitting WALs pid=308, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
ServerCrashProcedure server=1cfd208ff882,40584,1520249102524, splitWal=true, 
meta=false
2018-03-05 11:29:21,300 INFO  
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo
2018-03-05 11:29:21,307 INFO  
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): 
Client=jenkins//172.17.0.2 list rsgroup
2018-03-05 11:29:21,312 INFO  [Time-limited test] client.HBaseAdmin$15(901): 
Started disable of Group_ns:testKillRS
2018-03-05 11:29:21,313 INFO  
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS
2018-03-05 11:29:21,384 INFO  [PEWorker-9] procedure2.ProcedureExecutor(1495): 
Initialized subprocedures=[{pid=310, ppid=308, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}]
2018-03-05 11:29:21,534 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
procedure2.ProcedureExecutor(865): Stored pid=309, 
state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure 
table=Group_ns:testKillRS
2018-03-05 11:29:21,542 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:21,644 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:21,847 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put 
{"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117}
2018-03-05 11:29:22,123 INFO  [PEWorker-5] hbase.MetaTableAccessor(1646): 
Updated table Group_ns:testKillRS state to DISABLING in META
2018-03-05 11:29:22,148 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:22,345 INFO  [PEWorker-5] procedure2.ProcedureExecutor(1495): 
Initialized subprocedures=[{pid=311, ppid=309, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
server=1cfd208ff882,40584,1520249102524}]
2018-03-05 11:29:22,503 INFO  [PEWorker-13] 
procedure.MasterProcedureScheduler(571): pid=311, ppid=309, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
server=1cfd208ff882,40584,1520249102524,