[ https://issues.apache.org/jira/browse/HBASE-21307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-21307.
---------------------------
    Resolution: Duplicate

Resolving as another example of HBASE-21288. Will keep an eye out to see if the solution to HBASE-21213 causes more harm than good.

> [amv2] Deadlock when we move a Region from a not-online RegionServer
> --------------------------------------------------------------------
>
>                 Key: HBASE-21307
>                 URL: https://issues.apache.org/jira/browse/HBASE-21307
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.1.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.1.1
>
>
> Perhaps this doesn't happen in branch-2, but it is a problem in branch-2.1.
> At a high level: we go to move a region, and its unassign subprocedure fails
> its dispatch because the server is not online, so it queues an SCP
> (ServerCrashProcedure) and waits on it to break the RPC. The SCP can't run,
> though, because the MRP (MoveRegionProcedure) holds the lock on the region.
> I can bypass the MRP, but then the SCP fails because the Region is 'owned'
> by the MRP. See below:
> {code}
> 2018-10-12 16:29:53,423 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Begin bypass
> pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH,
> locked=true; UnassignProcedure
> table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true,
> server=va1002.halxg.cloudera.com,22101,1539368318649 with lockWait=0,
> override=true, recursive=true
> 2018-10-12 16:29:53,424 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982,
> ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true;
> UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true,
> server=va1002.halxg.cloudera.com,22101,1539368318649
> 2018-10-12 16:29:53,712 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411981,
> state=WAITING:MOVE_REGION_ASSIGN, locked=true; MoveRegionProcedure
> hri=f5f9ff1e4b0f2d9555dabfcca71df568,
> source=va1002.halxg.cloudera.com,22101,1539368318649,
> destination=vd1021.halxg.cloudera.com,22101,1539368317897
> 2018-10-12 16:29:53,838 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982,
> ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true,
> bypass=LOG-REDACTED UnassignProcedure
> table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true,
> server=va1002.halxg.cloudera.com,22101,1539368318649 and its ancestors
> successfully, adding to queue
> 2018-10-12 16:29:53,839 INFO org.apache.hadoop.hbase.procedure2.Procedure:
> pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH,
> locked=true, bypass=LOG-REDACTED UnassignProcedure
> table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true,
> server=va1002.halxg.cloudera.com,22101,1539368318649 bypassed, returning null
> to finish it
> 2018-10-12 16:29:53,954 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure
> pid=411982, resume processing parent pid=411981,
> state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true, bypass=LOG-REDACTED
> MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568,
> source=va1002.halxg.cloudera.com,22101,1539368318649,
> destination=vd1021.halxg.cloudera.com,22101,1539368317897
> 2018-10-12 16:29:53,954 INFO org.apache.hadoop.hbase.procedure2.Procedure:
> pid=411981, state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true,
> bypass=LOG-REDACTED MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568,
> source=va1002.halxg.cloudera.com,22101,1539368318649,
> destination=vd1021.halxg.cloudera.com,22101,1539368317897 bypassed, returning
> null to finish it
> 2018-10-12 16:29:53,956 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411982,
> ppid=411981, state=SUCCESS, bypass=LOG-REDACTED UnassignProcedure
> table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true,
> server=va1002.halxg.cloudera.com,22101,1539368318649 in 3hrs, 49mins,
> 12.419sec, unfinishedSiblingCount=0
> 2018-10-12 16:29:54,058 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411981,
> state=SUCCESS, bypass=LOG-REDACTED MoveRegionProcedure
> hri=f5f9ff1e4b0f2d9555dabfcca71df568,
> source=va1002.halxg.cloudera.com,22101,1539368318649,
> destination=vd1021.halxg.cloudera.com,22101,1539368317897 in 3hrs, 49mins,
> 12.878sec
> 2018-10-12 16:29:54,059 INFO
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: xlock for
> pid=412210, ppid=411983, state=RUNNABLE:REGION_TRANSITION_QUEUE;
> AssignProcedure table=IntegrationTestBigLinkedList_20180709093726,
> region=f5f9ff1e4b0f2d9555dabfcca71df568
> 2018-10-12 16:29:54,105 WARN
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure:
> f5f9ff1e4b0f2d9555dabfcca71df568 owned by pid=411982, CANNOT run 'this'
> (pid=412210).
> ....
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
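
For readers following the quoted report, the wait cycle it describes can be sketched as a small wait-for graph. This is a toy model in Python, not HBase code: the node names are just labels taken from the report, and `WAITS_FOR` and `find_cycle` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, List, Optional, Set

# Toy wait-for graph (an assumption for illustration, not an HBase structure).
# An edge A -> B means "A cannot make progress until B does".
WAITS_FOR: Dict[str, List[str]] = {
    # The move procedure waits on its unassign subprocedure.
    "MoveRegionProcedure": ["UnassignProcedure"],
    # The unassign failed its dispatch, queued an SCP, and waits on it.
    "UnassignProcedure": ["ServerCrashProcedure"],
    # The SCP needs the region lock, which the move procedure still holds.
    "ServerCrashProcedure": ["MoveRegionProcedure"],
}


def find_cycle(graph: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Depth-first search; return a wait-for cycle reachable from start, or None."""
    path: List[str] = []
    on_path: Set[str] = set()
    visited: Set[str] = set()

    def visit(node: str) -> Optional[List[str]]:
        if node in on_path:
            # Back edge found: slice out the cycle and close it.
            return path[path.index(node):] + [node]
        if node in visited:
            return None
        visited.add(node)
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(node)
        return None

    return visit(start)


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR, "MoveRegionProcedure"))
```

The model only captures the initial deadlock. It does not capture the follow-on failure in the log, where the region is still reported as owned by pid=411982 after the bypass.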