[jira] [Updated] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure

2018-04-12 Thread Mike Drob (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Drob updated HBASE-20152:
--
Issue Type: Umbrella  (was: Bug)

> [AMv2] DisableTableProcedure versus ServerCrashProcedure
> 
>
> Key: HBASE-20152
> URL: https://issues.apache.org/jira/browse/HBASE-20152
> Project: HBase
>  Issue Type: Umbrella
>  Components: amv2
>Reporter: stack
>Assignee: stack
>Priority: Major
>
> Seeing a small spate of issues where disabled tables/regions are being 
> assigned. Usually they happen when a DisableTableProcedure is running 
> concurrent with a ServerCrashProcedure. See below. See associated 
> HBASE-20131. This is umbrella issue for fixing.
> h3. Deadlock
> From HBASE-20137, 'TestRSGroups is Flakey', 
> https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325
> {code}
>  * SCP is running because a server was aborted in test.
>  * SCP starts AssignProcedure of region X from crashed server.
>  * DisableTable Procedure runs because test has finished and we're doing 
> table delete. Queues 
>  * UnassignProcedure for region X.
>  * Disable Unassign gets Lock on region X first.
>  * SCP AssignProcedure tries to get lock, waits on lock.
>  * DisableTable Procedure UnassignProcedure RPC fails because server is down 
> (Thats why the SCP).
>  * Tries to expire the server it failed the RPC against. Fails (currently 
> being SCP'd).
>  * DisableTable Procedure Unassign is suspended. It is a suspend with lock on 
> region X held
>  * SCP can't run because lock on X is held
>  * Test timesout.
> {code}
> Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 
> is the disable table:
> {code}
> 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] 
> procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, 
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure 
> server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false
> 2018-03-05 11:29:21,300 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo
> 2018-03-05 11:29:21,307 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): 
> Client=jenkins//172.17.0.2 list rsgroup
> 2018-03-05 11:29:21,312 INFO  [Time-limited test] client.HBaseAdmin$15(901): 
> Started disable of Group_ns:testKillRS
> 2018-03-05 11:29:21,313 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS
> 2018-03-05 11:29:21,384 INFO  [PEWorker-9] 
> procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, 
> ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}]
> 2018-03-05 11:29:21,534 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> procedure2.ProcedureExecutor(865): Stored pid=309, 
> state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure 
> table=Group_ns:testKillRS
> 2018-03-05 11:29:21,542 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,644 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:21,847 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put 
> {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117}
> 2018-03-05 11:29:22,123 INFO  [PEWorker-5] hbase.MetaTableAccessor(1646): 
> Updated table Group_ns:testKillRS state to DISABLING in META
> 2018-03-05 11:29:22,148 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
> master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
> 2018-03-05 11:29:22,345 INFO  [PEWorker-5] 
> procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, 
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524}]
> 2018-03-05 11:29:22,503 INFO  [PEWorker-13] 
> procedure.MasterProcedureScheduler(571): pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534

[jira] [Updated] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure

2018-03-09 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20152:
--
Description: 
Seeing a small spate of issues where disabled tables/regions are being 
assigned. Usually they happen when a DisableTableProcedure is running 
concurrent with a ServerCrashProcedure. See below. See associated HBASE-20131. 
This is umbrella issue for fixing.

h3. Deadlock
>From HBASE-20137, 'TestRSGroups is Flakey', 
>https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325

{code}
 * SCP is running because a server was aborted in test.
 * SCP starts AssignProcedure of region X from crashed server.
 * DisableTable Procedure runs because test has finished and we're doing table 
delete. Queues 
 * UnassignProcedure for region X.
 * Disable Unassign gets Lock on region X first.
 * SCP AssignProcedure tries to get lock, waits on lock.
 * DisableTable Procedure UnassignProcedure RPC fails because server is down 
(Thats why the SCP).
 * Tries to expire the server it failed the RPC against. Fails (currently being 
SCP'd).
 * DisableTable Procedure Unassign is suspended. It is a suspend with lock on 
region X held
 * SCP can't run because lock on X is held
 * Test timesout.
{code}


Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 is 
the disable table:
{code}
2018-03-05 11:29:21,224 DEBUG [PEWorker-7] procedure.ServerCrashProcedure(225): 
Done splitting WALs pid=308, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
ServerCrashProcedure server=1cfd208ff882,40584,1520249102524, splitWal=true, 
meta=false
2018-03-05 11:29:21,300 INFO  
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo
2018-03-05 11:29:21,307 INFO  
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): 
Client=jenkins//172.17.0.2 list rsgroup
2018-03-05 11:29:21,312 INFO  [Time-limited test] client.HBaseAdmin$15(901): 
Started disable of Group_ns:testKillRS
2018-03-05 11:29:21,313 INFO  
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS
2018-03-05 11:29:21,384 INFO  [PEWorker-9] procedure2.ProcedureExecutor(1495): 
Initialized subprocedures=[{pid=310, ppid=308, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}]
2018-03-05 11:29:21,534 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
procedure2.ProcedureExecutor(865): Stored pid=309, 
state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure 
table=Group_ns:testKillRS
2018-03-05 11:29:21,542 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:21,644 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:21,847 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put 
{"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117}
2018-03-05 11:29:22,123 INFO  [PEWorker-5] hbase.MetaTableAccessor(1646): 
Updated table Group_ns:testKillRS state to DISABLING in META
2018-03-05 11:29:22,148 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] 
master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:22,345 INFO  [PEWorker-5] procedure2.ProcedureExecutor(1495): 
Initialized subprocedures=[{pid=311, ppid=309, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
server=1cfd208ff882,40584,1520249102524}]
2018-03-05 11:29:22,503 INFO  [PEWorker-13] 
procedure.MasterProcedureScheduler(571): pid=311, ppid=309, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
server=1cfd208ff882,40584,1520249102524, 
Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
2018-03-05 11:29:22,503 INFO  [PEWorker-5] 
procedure.MasterProcedureScheduler(571): pid=310, ppid=308, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
2018-03-05 11:29:22,504 INFO  [PEWorker-13] assignment.RegionStateStore(189): 
pid=311 updating hbase:meta 
row=Group_ns:testKillRS,,152024935

[jira] [Updated] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure

2018-03-07 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-20152:
--
Description: 
Seeing a small spate of issues where disabled tables/regions are being 
assigned. Usually they happen when a DisableTableProcedure is running 
concurrent with a ServerCrashProcedure. See below. See associated HBASE-20131. 
This is umbrella issue for fixing.

h3. Deadlock
>From HBASE-20137, 'TestRSGroups is Flakey', 
>https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325

{code}
 * SCP is running because a server was aborted in test.
 * SCP starts AssignProcedure of region X from crashed server.
 * DisableTable Procedure runs because test has finished and we're doing table 
delete. Queues 
 * UnassignProcedure for region X.
 * Disable Unassign gets Lock on region X first.
 * SCP AssignProcedure tries to get lock, waits on lock.
 * DisableTable Procedure UnassignProcedure RPC fails because server is down 
(Thats why the SCP).
 * Tries to expire the server it failed the RPC against. Fails (currently being 
SCP'd).
 * DisableTable Procedure Unassign is suspended. It is a suspend with lock on 
region X held
 * SCP can't run because lock on X is held
 * Test timesout.
{code}

h3. Delete of online Regions
Saw this in nightly failure #452 for branch-2 in 
TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster

{code}
 * DisableTableProcedure is queued before SCP.
 * DisableTableProcedure Unassign fails because can't RPC to crashed server and 
can't expire.
 * Unassign is Stuck in suspend.
 * SCP runs and cleans up suspended Disable Unassign.
 * SCP completes which includes assign of Disable Unassign region.
 * Disable Unassign completes
 * Disable completes.
 * A scheduled Drop Table Procedure runs (its end of test).
 * Succeeds deleting regions that are actually assigned (see above where SCP 
assigned region).
{code}

  was:
Seeing a small spate of issues where disabled tables/regions are being 
assigned. Usually they happen when a DisableTableProcedure is running 
concurrent with a ServerCrashProcedure. See below. See associated HBASE-20131. 
This is umbrella issue for fixing.

.h2 Deadlock
>From HBASE-20137, 'TestRSGroups is Flakey', 
>https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325

{code}
 * SCP is running because a server was aborted in test.
 * SCP starts AssignProcedure of region X from crashed server.
 * DisableTable Procedure runs because test has finished and we're doing table 
delete. Queues 
 * UnassignProcedure for region X.
 * Disable Unassign gets Lock on region X first.
 * SCP AssignProcedure tries to get lock, waits on lock.
 * DisableTable Procedure UnassignProcedure RPC fails because server is down 
(Thats why the SCP).
 * Tries to expire the server it failed the RPC against. Fails (currently being 
SCP'd).
 * DisableTable Procedure Unassign is suspended. It is a suspend with lock on 
region X held
 * SCP can't run because lock on X is held
 * Test timesout.
{code}

.h2 Delete of online Regions
Saw this in nightly failure #452 for branch-2 in 
TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster

{code}
 * DisableTableProcedure is queued before SCP.
 * DisableTableProcedure Unassign fails because can't RPC to crashed server and 
can't expire.
 * Unassign is Stuck in suspend.
 * SCP runs and cleans up suspended Disable Unassign.
 * SCP completes which includes assign of Disable Unassign region.
 * Disable Unassign completes
 * Disable completes.
 * A scheduled Drop Table Procedure runs (its end of test).
 * Succeeds deleting regions that are actually assigned (see above where SCP 
assigned region).
{code}


> [AMv2] DisableTableProcedure versus ServerCrashProcedure
> 
>
> Key: HBASE-20152
> URL: https://issues.apache.org/jira/browse/HBASE-20152
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Reporter: stack
>Assignee: stack
>Priority: Major
>
> Seeing a small spate of issues where disabled tables/regions are being 
> assigned. Usually they happen when a DisableTableProcedure is running 
> concurrent with a ServerCrashProcedure. See below. See associated 
> HBASE-20131. This is umbrella issue for fixing.
> h3. Deadlock
> From HBASE-20137, 'TestRSGroups is Flakey', 
> https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325
> {code}
>  * SCP is running because a server was aborted in test.
>  * SCP starts AssignProcedure of region X f