[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427232#comment-16427232 ] stack commented on HBASE-20152: --- Let me review before closing... thanks. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,345 INFO [PEWorker-5] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, > ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, > server=1cfd208ff882,40584,1520249102524}] > 2018-03-05 11:29:22,503 INFO [PEWorker-13] > procedure.MasterProcedureScheduler(571): pid=311, ppid=309, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure >
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427222#comment-16427222 ] Umesh Agashe commented on HBASE-20152: -- Looks done to me. I haven't seen any instance of this recently. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,345 INFO [PEWorker-5] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, > ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, > server=1cfd208ff882,40584,1520249102524}] > 2018-03-05 11:29:22,503 INFO [PEWorker-13] > procedure.MasterProcedureScheduler(571): pid=311, ppid=309, >
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426388#comment-16426388 ] Mike Drob commented on HBASE-20152: --- [~stack] [~uagashe] [~Apache9] - all subtasks here are resolved. is this issue done or do we have more work to do? > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,345 INFO [PEWorker-5] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, > ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, > server=1cfd208ff882,40584,1520249102524}] > 2018-03-05 11:29:22,503 INFO [PEWorker-13] > procedure.MasterProcedureScheduler(571): pid=311,
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401927#comment-16401927 ] Duo Zhang commented on HBASE-20152: --- Not straight-forward I'd say since we need to run an AssignProcedure... So maybe split the initialization of ProcedureExecutor into two stages, in the first stages we only initialize the thread pool and scheduler, so we can execute the meta recover and also the assign procedures. And in this stages we will not store the procedures to ProcedureStore. And in the second stage we start the ProcedureStore and load the stored procedures. One thing I still need to dig more is how do we deal with double assign when master restart. Especially that if we use the above method to assign meta on start up. We may crash after sending open region call to RS, and after restart we decide to assign it to another RS, and then the finish transition call is coming but from the previous RS. Do not know if our AssignProcedure can deal with this problem. And if we decide to assign to the same RS again, then how RS deal with the two concurrent open region call... Let me read the related code again. Thanks. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5]
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401860#comment-16401860 ] Duo Zhang commented on HBASE-20152: --- I prefer that we just do the meta recovery in the master thread. Since we use a special WAL for writing meta, the WAL file only contains the edits for meta so we do not need to split it when failover... A simple rename is enough I think? The problem maybe that if we have multiple wal files and we fail in the middle then the intermediate is a pain... But anyway, the wal for meta usually is very small usually, we can just read and write it in the master thread directly without going through the distributed log splitting framework. Let me try. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398056#comment-16398056 ] stack commented on HBASE-20152: --- bq. I think once the ProcedureExecutor is started, we will start to schedule procedures if there are any unfinished ones before? Your are right. We'll read the store and schedule any we find that are not complete may read the store later? > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,345 INFO [PEWorker-5] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, > ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, >
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398029#comment-16398029 ] Duo Zhang commented on HBASE-20152: --- {quote} What we have now is sort of how you describe it where we don't do anything until meta is deployed. We use the PE to run our RMP. Procedures know how to assign, etc. {quote} I think once the ProcedureExecutor is started, we will start to schedule procedures if there are any unfinished ones before? So here, before meta is deployed, we have already started to execute procedures... > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,345 INFO [PEWorker-5] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, > ppid=309,
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398022#comment-16398022 ] stack commented on HBASE-20152: --- bq. I think this will be a flaky part... RMP does not implement holdLock which means it is suspended We can try it. During start though, we stop SCP being scheduled... could stop other Procedures too... But yeah, RMP should hold meta lock for its life. Probably an oversight. bq. Ideally, I prefer that, we should make meta online without depending on the procedure framework, just do the log splitting and assign in the master startup thread. Meta is special, the location is on zk so I think this is OK? What we have now is sort of how you describe it where we don't do anything until meta is deployed. We use the PE to run our RMP. Procedures know how to assign, etc. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397969#comment-16397969 ] Duo Zhang commented on HBASE-20152: --- {quote} RMP can deal w/ multiple runs in series. It will skip steps that don't need to be run (meta log splitting or reassign of meta). Only one RMP can run at a time (region lock). {quote} I think this will be a flaky part... RMP does not implement holdLock which means it is suspended, for example, when scheduling a sub procedure, then it will release the lock and another one can grab the lock... Ideally, I prefer that, we should make meta online without depending on the procedure framework, just do the log splitting and assign in the master startup thread. Meta is special, the location is on zk so I think this is OK? > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG >
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397062#comment-16397062 ] stack commented on HBASE-20152: --- bq. I think I still need to think more about the startup processing of HMaster. Yeah. It needs cleanup. There are too many 'Managers'. Master needs to be a true subclass of RegionServer. The services come online in a manner that needs to be ordered particularly. There is dupe of function in RegionServer up in Master, etc. bq. ...when restarting we reschedule a new RecoverMetaProcedure, then what happens... RMP can deal w/ multiple runs in series. It will skip steps that don't need to be run (meta log splitting or reassign of meta). Only one RMP can run at a time (region lock). Would have to do dig to answer your other questions. Good questions. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG >
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397004#comment-16397004 ] Duo Zhang commented on HBASE-20152: --- I think I still need to think more about the startup processing of HMaster. So when restarting, we will first try to assign meta. We will schedule a RecoverMetaProcedure which could also do part of the job for ServerCrashProcedure. We can only continue after this procedure is done. One corner case for me is that, if we crash before finish this procedure, and when restarting we reschedule a new RecoverMetaProcedure, then what happens? I think the RMP we schedule at the last restart will finish normally, but how will the previous RMP go? And if we already have a RCP for that RS which have also scheduled RMP? And seems when we start the RMP, the ProcedureExecutor has not been started so SCP and other procedures have already been started? Is there any concurrency issues between the SCP and our RMP? Really a bit confusing to be... Will dig more tomorrow... Ideally I think we should recover meta at the very beginning and load everything we want before starting procedure executor... Thanks. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5]
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396447#comment-16396447 ] stack commented on HBASE-20152: --- bq. If we can make sure that a restart master an find the crashed RS then I think scheduling or not is not important then. bq. This could make things complicated I think... Sure. This is how its been working for a good while now (since hbase-1.x or thereabouts...). I'd be game for looking at it again now we up on new AMv2 basis. bq. Ideally, I think the processing should be this, constructing the ServerManager first with the old set of RS before restart(by scanning wal fs), then going to zk to find the znode for each rs, then we can find out the crashed ones and try to schedule SCPs for them if there are no SCPs for them yet. I like this better. Faster startup. All tests run faster. Save 10% on total test run? New issue. I'll open it if you don't. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): >
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396316#comment-16396316 ] Duo Zhang commented on HBASE-20152: --- {quote} maybe I should just allow queuing of SCPs even if master is going down... will look at it. {quote} If we can make sure that a restart master an find the crashed RS then I think scheduling or not is not important then. {quote} Yeah, after master restarting, we'll poke the fs and look for WALs that do not belong to an existing server. {quote} OK, to be honest the "waiting region server count xxx" log during master restart is always a bit confusing to me. Ideally, I think the processing should be this, constructing the ServerManager first with the old set of RS before restart(by scanning wal fs), then going to zk to find the znode for each rs, then we can find out the crashed ones and try to schedule SCPs for them if there are no SCPs for them yet. {quote} during master start we put the handling of expired znodes aside until meta is assigned; {quote} This could make things complicated I think... If meta is on a crashed server, then we need to finish the SCP first otherwise meta will never be assigned... Let me also dig here, we need to find a proper order to do this... Thanks. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396233#comment-16396233 ] stack commented on HBASE-20152: --- Ran a test with patches from two subtasks in place -- the deadlock fix and the fail-fast throw exception when move is called -- using the really nice IntegrationTestDDLMasterFailover... It creates hundreds of tables and tens of thousands of regions. Stuff seemed to work doing multiple runs w/ monkies and stop/start. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,345 INFO [PEWorker-5] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, > ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure >
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395573#comment-16395573 ] stack commented on HBASE-20152: --- bq. Just ask, after restarting we will re-schedule a SCP right? bq. So the SCP is suspended? This is a little confusing to me... I think SCP is for a crashed server? Why is it suspended? Thanks for review. Yeah, after master restarting, we'll poke the fs and look for WALs that do not belong to an existing server. If we find any, we schedule a SCP. The SCP processing is 'suspended' at two points during Master's lifetime; 1. during master start we put the handling of expired znodes aside until meta is assigned; and 2. when master has been stopped... we skip creating an SCP depending on Master restart finding any unprocessed crashed servers. My description gave impression that we stop SCP and that there might be a hole where we'd miss server crashes. Your questioning made me go back and check... maybe I should just allow queuing of SCPs even if master is going down... will look at it. bq. I do not think we should abandon it. ... Yeah, took me a while but I came back to here; the attached patch will now find any unassigns and interrupt them letting them complete if makes sense (they were unassigns against the crashed server). > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] >
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394721#comment-16394721 ] Duo Zhang commented on HBASE-20152: --- {quote} Can't schedule an SCP. Fails because server is going down already not present as online. Procedure suspended. {quote} So the SCP is suspended? This is a little confusing to me... I think SCP is for a crashed server? Why is it suspended? > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,345 INFO [PEWorker-5] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, > ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, >
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394719#comment-16394719 ] Duo Zhang commented on HBASE-20152: --- {quote} Trying to think through abandon... Would work in the small but hard part is how to fail compound procedures like Move and Disable Table... Split, Merge.. .. {quote} I do not think we should abandon it. In this case suspend is OK, but the SCP should be finished first, and then we wake up the unassign procedure and it can go on. Fail an unassign procedure may have lots of side effect, as you describe above, how do we process the parent procedures... Thanks. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,345 INFO [PEWorker-5] >
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394716#comment-16394716 ] Duo Zhang commented on HBASE-20152: --- {quote} Cluster shutdown is set. This means cluster down flag is set, we expire servers, but no ServerCrashProcedure gets scheduled. {quote} Just ask, after restarting we will re-schedule a SCP right? > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,847 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put > {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117} > 2018-03-05 11:29:22,123 INFO [PEWorker-5] hbase.MetaTableAccessor(1646): > Updated table Group_ns:testKillRS state to DISABLING in META > 2018-03-05 11:29:22,148 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:22,345 INFO [PEWorker-5] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, > ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, > server=1cfd208ff882,40584,1520249102524}] > 2018-03-05
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393824#comment-16393824 ] stack commented on HBASE-20152: --- For the deadlock scenario, we might be able to get away w/ simple fixes * We were re-assigning a region though there was an unassign outstanding against the crashed server. This was a bug that had us doing an unnecessary assign and upping likelihood of deadlock (unassign and concurrent assign). * Deadlock happens because the unassign happens AFTER SCP cancels outstanding RITs but before our SCP Assign is scheduled. Rerunning the RIT cancel procedure after all assigns have been created and scheduled hopefully takes care of the case where an outstanding unassign has the region lock so the SCP assign can't progress; the re-run of the cancel of outstanding RITs against crashed server will cancel any unassigns that might have been lucky enough to have been scheduled. The above should fix scenario #1. Scenario #2 'Delete of Online Region' is tough. It happened while we had the patch over in HBASE-20137 was in place which allowed the UnassignProcedure progress when it needed to wait on SCP. I won't spend more time on it. For Scenario #3, 'Move after Cluster Shutdown', just need to retrofit Procedures so likes of Move, etc., don't run when cluster shutdown and Master stop has been set (probably don't want to turn off AMv2 Procedures completely since we'll likely want to put shutdown itself into a Procedure...) > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 > is the disable table: > {code} > 2018-03-05 11:29:21,224 DEBUG [PEWorker-7] > procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, > state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure > server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false > 2018-03-05 11:29:21,300 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo > 2018-03-05 11:29:21,307 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): > Client=jenkins//172.17.0.2 list rsgroup > 2018-03-05 11:29:21,312 INFO [Time-limited test] client.HBaseAdmin$15(901): > Started disable of Group_ns:testKillRS > 2018-03-05 11:29:21,313 INFO > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS > 2018-03-05 11:29:21,384 INFO [PEWorker-9] > procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, > ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}] > 2018-03-05 11:29:21,534 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > procedure2.ProcedureExecutor(865): Stored pid=309, > state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure > table=Group_ns:testKillRS > 2018-03-05 11:29:21,542 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] > master.MasterRpcServices(1134): Checking to see if procedure is done pid=309 > 2018-03-05 11:29:21,644 DEBUG > [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498]
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391969#comment-16391969 ] Mike Drob commented on HBASE-20152: --- bq. Move does its Assign... Region is onlined before WAL splitting completes. right, would need to tie this somehow to the completion of SCP. Hmm > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > h3. Delete of online Regions > Saw this in nightly failure #452 for branch-2 in > TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster > {code} > * DisableTableProcedure is queued before SCP. > * DisableTableProcedure Unassign fails because can't RPC to crashed server > and can't expire. > * Unassign is Stuck in suspend. > * SCP runs and cleans up suspended Disable Unassign. > * SCP completes which includes assign of Disable Unassign region. > * Disable Unassign completes > * Disable completes. > * A scheduled Drop Table Procedure runs (its end of test). > * Succeeds deleting regions that are actually assigned (see above where SCP > assigned region). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391634#comment-16391634 ] stack commented on HBASE-20152: --- Thanks for taking a look. bq. should unassign abandon Trying to think through abandon... Would work in the small but hard part is how to fail compound procedures like Move and Disable Table... Split, Merge.. .. bq. or possibly declare itself successful That was the original patch in TestRSGroups. My subsequent worry was dataloss. * Server crashes. Was carrying heavy in-memory load of edits for region X when it went down; i.e. needs WAL replay. * Notice of expiration which schedules an SCP and removes the server from online list. * A Move starts (a move is an unassign and an assign). * Move Unassign fails its RPC and fails expiration of server (it is already ongoing SCP is busy splitting logs). * Move does its Assign... Region is onlined before WAL splitting completes. bq. If there's a move_region scheduled while a server is down or is going down... can we fail the move? Yeah, there are a few simple checks we need to add... server down, table enabled/disabled. Let me add these. Was trying to think through the corner cases. Thanks for help. > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > h3. Delete of online Regions > Saw this in nightly failure #452 for branch-2 in > TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster > {code} > * DisableTableProcedure is queued before SCP. > * DisableTableProcedure Unassign fails because can't RPC to crashed server > and can't expire. > * Unassign is Stuck in suspend. > * SCP runs and cleans up suspended Disable Unassign. > * SCP completes which includes assign of Disable Unassign region. > * Disable Unassign completes > * Disable completes. > * A scheduled Drop Table Procedure runs (its end of test). > * Succeeds deleting regions that are actually assigned (see above where SCP > assigned region). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391524#comment-16391524 ] Mike Drob commented on HBASE-20152: --- should unassign abandon instead of suspend when the target server is down? or possibly declare itself successful, since in essence the region _is_ unassigned from that server? If there's a move_region scheduled while a server is down or is going down... can we fail the move? or we can try to skip directly to the assign portion of the move, and then have to make sure we don't try to assign the region to the original server when it comes back > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > h3. Delete of online Regions > Saw this in nightly failure #452 for branch-2 in > TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster > {code} > * DisableTableProcedure is queued before SCP. > * DisableTableProcedure Unassign fails because can't RPC to crashed server > and can't expire. > * Unassign is Stuck in suspend. > * SCP runs and cleans up suspended Disable Unassign. > * SCP completes which includes assign of Disable Unassign region. > * Disable Unassign completes > * Disable completes. > * A scheduled Drop Table Procedure runs (its end of test). > * Succeeds deleting regions that are actually assigned (see above where SCP > assigned region). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-20152) [AMv2] DisableTableProcedure versus ServerCrashProcedure
[ https://issues.apache.org/jira/browse/HBASE-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390901#comment-16390901 ] stack commented on HBASE-20152: --- New scenario seen on internal cluster (not sure what the test is doing... kinda crazy killing of cluster and then restarts..) h3. Procedure after Cluster Shutdown {code} * Cluster shutdown is set. This means cluster down flag is set, we expire servers, but no ServerCrashProcedure gets scheduled. 2018-03-03 05:51:33,852 INFO org.apache.hadoop.hbase.master.ServerManager: Cluster shutdown set; quasar-lxosrm-3.vpc.cloudera.com,22101,1520080141368 expired; onlineServers=2 2018-03-03 05:51:33,859 INFO org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394] * Just after, a Move Region event comes in 2018-03-03 05:51:33,999 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=158, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210., source=quasar-lxosrm-2.vpc.cloudera.com,22101,1520080154394, destination= hbase:namespace hbase:namespace,,1520062075237.d2854c6a96bb191d04a3f285c0eef210. * Move region tries to unassign but RPC fails with java.io.IOException: Call to quasar-lxosrm-2.vpc.cloudera.com/172.26.11.155:22101 failed on local exception: org.apache.hadoop.hbase.ipc.StoppedRpcClientException * Can't schedule an SCP. Fails because server is going down already not present as online. Procedure suspended. * New Master, Unassign procedure rerun. Does dispatch to non-existent server. * Fails to schedule SCP because server is long gone. * Procedure is Suspended. Stuck forever. {code} > [AMv2] DisableTableProcedure versus ServerCrashProcedure > > > Key: HBASE-20152 > URL: https://issues.apache.org/jira/browse/HBASE-20152 > Project: HBase > Issue Type: Bug > Components: amv2 >Reporter: stack >Assignee: stack >Priority: Major > > Seeing a small spate of issues where disabled tables/regions are being > assigned. Usually they happen when a DisableTableProcedure is running > concurrent with a ServerCrashProcedure. See below. See associated > HBASE-20131. This is umbrella issue for fixing. > h3. Deadlock > From HBASE-20137, 'TestRSGroups is Flakey', > https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325 > {code} > * SCP is running because a server was aborted in test. > * SCP starts AssignProcedure of region X from crashed server. > * DisableTable Procedure runs because test has finished and we're doing > table delete. Queues > * UnassignProcedure for region X. > * Disable Unassign gets Lock on region X first. > * SCP AssignProcedure tries to get lock, waits on lock. > * DisableTable Procedure UnassignProcedure RPC fails because server is down > (Thats why the SCP). > * Tries to expire the server it failed the RPC against. Fails (currently > being SCP'd). > * DisableTable Procedure Unassign is suspended. It is a suspend with lock on > region X held > * SCP can't run because lock on X is held > * Test timesout. > {code} > h3. Delete of online Regions > Saw this in nightly failure #452 for branch-2 in > TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster > {code} > * DisableTableProcedure is queued before SCP. > * DisableTableProcedure Unassign fails because can't RPC to crashed server > and can't expire. > * Unassign is Stuck in suspend. > * SCP runs and cleans up suspended Disable Unassign. > * SCP completes which includes assign of Disable Unassign region. > * Disable Unassign completes > * Disable completes. > * A scheduled Drop Table Procedure runs (its end of test). > * Succeeds deleting regions that are actually assigned (see above where SCP > assigned region). > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)