[jira] [Comment Edited] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396233#comment-16396233 ] stack edited comment on HBASE-20152 at 3/13/18 12:16 AM: - Ran a test with patches from two subtasks in place -- the deadlock fix HBASE-20173 and the fail-fast throw exception when move is called HBASE-20178 -- using the really nice IntegrationTestDDLMasterFailover... It creates hundreds of tables and tens of thousands of regions. Stuff seemed to work doing multiple runs w/ monkies and stop/start. was (Author: stack): Ran a test with patches from two subtasks in place -- the deadlock fix and the fail-fast throw exception when move is called -- using the really nice IntegrationTestDDLMasterFailover... It creates hundreds of tables and tens of thousands of regions. Stuff seemed to work doing multiple runs w/ monkies and stop/start. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS
[jira] [Comment Edited] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390901#comment-16390901 ] stack edited comment on HBASE-20152 at 3/9/18 10:26 PM: New scenario seen on internal cluster (not sure what the test is doing... kinda crazy killing of cluster and then restarts..) h3. Procedure after Cluster Shutdown {code} * Cluster shutdown is set. This means cluster down flag is set, we expire servers, but no ServerCrashProcedure gets scheduled. 2018-03-03 05:51:33,852 INFO org.apache.hadoop.hbase.master.ServerManager: Cluster shutdown set; quasar-lxosrm-3.vpc.cloudera.com,22101,1520080141368 expired; onlineServers=2 2018-03-03 05:51:33,859 INFO org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394] * Just after, a Move Region event comes in 2018-03-03 05:51:33,999 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=158, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210., source=quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394, destination= hbase:namespace hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210. * Move region tries to unassign but RPC fails with java.io.IOException: Call to quasar-lxosrm-2.vpc.cloudera.com/172.26.11.155:22101 failed on local exception: org.apache.hadoop.hbase.ipc.StoppedRpcClientException * Can't schedule an SCP. Fails because server is going down already not present as online. Procedure suspended. * New Master, Unassign procedure rerun. Does dispatch to non-existent server. * Fails to schedule SCP because server is long gone. * Procedure is Suspended. Stuck forever. {code} was (Author: stack): New scenario seen on internal cluster (not sure what the test is doing... kinda crazy killing of cluster and then restarts..) h3. Procedure after Cluster Shutdown {code} * Cluster shutdown is set. This means cluster down flag is set, we expire servers, but no ServerCrashProcedure gets scheduled. 2018-03-03 05:51:33,852 INFO org.apache.hadoop.hbase.master.ServerManager: Cluster shutdown set; quasar-lxosrm-3.vpc.cloudera.com,22101,1520080141368 expired; onlineServers=2 2018-03-03 05:51:33,859 INFO org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394] * Just after, a Move Region event comes in 2018-03-03 05:51:33,999 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=158, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210., source=quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394, destination= hbase:namespace hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210. * Move region tries to unassign but RPC fails with java.io.IOException: Call to quasar-lxosrm-2.vpc.cloudera.com/172.26.11.155:22101 failed on local exception: org.apache.hadoop.hbase.ipc.StoppedRpcClientException * Can't schedule an SCP. Fails because server is going down already not present as online. Procedure suspended. * New Master, Unassign procedure rerun. Does dispatch to non-existent server. * Fails to schedule SCP because server is long gone. * Procedure is Suspended. Stuck forever. {code} Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 is the disable table: {code} 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false 2018-03-05 11:29:21,300 INFO [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo 2018-03-05 11:29:21,307 INFO [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): Client=jenkins//172.17.0.2 list rsgroup 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): Started disable of Group_ns:testKillRS 2018-03-05 11:29:21,313 INFO [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS 2018-03-05 11:29:21,384 INFO [PEWorker-9] procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] 2018-03-05 11:29:21,534 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] procedure2.ProcedureExecutor(865): Stored pid=309, state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure
[jira] [Comment Edited] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390901#comment-16390901 ] stack edited comment on HBASE-20152 at 3/9/18 10:18 PM: New scenario seen on internal cluster (not sure what the test is doing... kinda crazy killing of cluster and then restarts..) h3. Procedure after Cluster Shutdown {code} * Cluster shutdown is set. This means cluster down flag is set, we expire servers, but no ServerCrashProcedure gets scheduled. 2018-03-03 05:51:33,852 INFO org.apache.hadoop.hbase.master.ServerManager: Cluster shutdown set; quasar-lxosrm-3.vpc.cloudera.com,22101,1520080141368 expired; onlineServers=2 2018-03-03 05:51:33,859 INFO org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394] * Just after, a Move Region event comes in 2018-03-03 05:51:33,999 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=158, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210., source=quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394, destination= hbase:namespace hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210. * Move region tries to unassign but RPC fails with java.io.IOException: Call to quasar-lxosrm-2.vpc.cloudera.com/172.26.11.155:22101 failed on local exception: org.apache.hadoop.hbase.ipc.StoppedRpcClientException * Can't schedule an SCP. Fails because server is going down already not present as online. Procedure suspended. * New Master, Unassign procedure rerun. Does dispatch to non-existent server. * Fails to schedule SCP because server is long gone. * Procedure is Suspended. Stuck forever. {code} Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 is the disable table: {code} 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false 2018-03-05 11:29:21,300 INFO [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo 2018-03-05 11:29:21,307 INFO [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): Client=jenkins//172.17.0.2 list rsgroup 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): Started disable of Group_ns:testKillRS 2018-03-05 11:29:21,313 INFO [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS 2018-03-05 11:29:21,384 INFO [PEWorker-9] procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] 2018-03-05 11:29:21,534 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] procedure2.ProcedureExecutor(865): Stored pid=309, state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure table=Group_ns:testKillRS 2018-03-05 11:29:21,542 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 2018-03-05 11:29:21,644 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 2018-03-05 11:29:21,847 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): Updated table Group_ns:testKillRS state to DISABLING in META 2018-03-05 11:29:22,148 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 2018-03-05 11:29:22,345 INFO [PEWorker-5] procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524}] 2018-03-05 11:29:22,503 INFO [PEWorker-13] procedure.MasterProcedureScheduler(571): pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524,