[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16647283#comment-16647283 ]

Duo Zhang commented on HBASE-21278:
---

OK, I finally understand the implementation. Every time we execute a procedure we push it onto a stack, and when rolling back we pop from the stack to revert the procedures. So maybe we could just skip reverting some procedures in executeRollback...

> TestMergeTableRegionsProcedure is flaky
> ---
>
> Key: HBASE-21278
> URL: https://issues.apache.org/jira/browse/HBASE-21278
> Project: HBase
> Issue Type: Bug
> Reporter: Duo Zhang
> Priority: Major
> Attachments: org.apache.hadoop.hbase.master.assignment.TestMergeTableRegionsProcedure-output.txt
>
>
> https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1235/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.assignment.TestMergeTableRegionsProcedure-output.txt/*view*/
> I think the problem is
> {noformat}
> 2018-10-08 03:44:30,315 INFO [PEWorker-1] procedure.MasterProcedureScheduler(689): pid=43, ppid=42, state=SUCCESS, hasLock=false; TransitRegionStateProcedure table=testRollbackAndDoubleExecution, region=9bac7c539ac0cff6dc5706ed375a3bfb, UNASSIGN checking lock on 9bac7c539ac0cff6dc5706ed375a3bfb
> 2018-10-08 03:44:30,320 ERROR [PEWorker-1] helpers.MarkerIgnoringBase(159): CODE-BUG: Uncaught runtime exception for pid=43, ppid=42, state=SUCCESS, hasLock=true; TransitRegionStateProcedure table=testRollbackAndDoubleExecution, region=9bac7c539ac0cff6dc5706ed375a3bfb, UNASSIGN
> java.lang.UnsupportedOperationException
>     at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.rollbackState(TransitRegionStateProcedure.java:458)
>     at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.rollbackState(TransitRegionStateProcedure.java:97)
>     at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:957)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1605)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1567)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1446)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
> {noformat}
> Typically there is no rollback for TRSP. Need to dig more.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
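The stack-based rollback described in the comment above, plus the suggested skip, can be modelled roughly as follows. This is a minimal toy sketch, not HBase's ProcedureExecutor: the class, field and method names are hypothetical, the skip rule (do not revert a sub procedure that has already finished successfully) is only one possible reading of "skip reverting some procedures", and pid=44 is an invented second TRSP (only pid=42 and pid=43 appear in the log).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical toy model of a rollback stack; NOT HBase's ProcedureExecutor code. */
class RollbackStackSketch {
  static final class ToyProc {
    final long pid;
    final Long parentPid; // null for a root procedure
    boolean succeeded;    // set once the procedure has finished successfully

    ToyProc(long pid, Long parentPid) {
      this.pid = pid;
      this.parentPid = parentPid;
    }
  }

  private final Deque<ToyProc> stack = new ArrayDeque<>();
  final List<Long> reverted = new ArrayList<>();
  final List<Long> skipped = new ArrayList<>();

  /** Every execution of a procedure is recorded by pushing it onto the stack. */
  void recordExecution(ToyProc p) {
    stack.push(p);
  }

  /** Pops entries in reverse execution order; finished sub procedures are skipped. */
  void rollback() {
    while (!stack.isEmpty()) {
      ToyProc p = stack.pop();
      if (p.parentPid != null && p.succeeded) {
        // Assumed skip rule: a sub procedure that already succeeded is left alone,
        // and the parent is expected to compensate for it in its own rollback.
        skipped.add(p.pid);
      } else {
        reverted.add(p.pid); // stand-in for calling the procedure's rollback
      }
    }
  }
}
```

With a root (pid=42) executed, two sub procedures executed and finished, then the root executed again, a rollback reverts only the two root entries and skips both sub procedures instead of hitting the UnsupportedOperationException path.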
[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646035#comment-16646035 ]

Duo Zhang commented on HBASE-21278:
---

I think there are two scenarios in which we want to roll back a procedure:
1. The procedure is aborted.
2. One of its sub procedures has failed.

I think the proper way to roll back a procedure is:
1. If there are still running sub procedures, wait until they are all done.
2. Roll back this procedure.
3. Recursively roll back the parent procedure.

For now the logic is almost the same as above, but we have a complicated way of storing the rollback steps: we record the execution of sub procedures, and we also roll back the sub procedures when rolling back a procedure. Let me review the code carefully to see what is going on...
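The three-step rollback order proposed above can be sketched as a small recursive function. This is a hypothetical toy model (Proc, running and rollback are invented names, not HBase APIs); "wait" is modelled by returning false so the caller can retry, and, as in the proposal, a procedure's finished sub procedures are not themselves reverted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of the proposed rollback order; NOT HBase code. */
class ProposedRollbackSketch {
  static final class Proc {
    final String name;
    final Proc parent; // null for the root procedure
    final List<Proc> children = new ArrayList<>();
    boolean running;   // true while the procedure has not finished yet

    Proc(String name, Proc parent) {
      this.name = name;
      this.parent = parent;
      if (parent != null) {
        parent.children.add(this);
      }
    }
  }

  /**
   * Step 1: if any sub procedure of p is still running, do nothing and return false
   * (the caller retries later). Step 2: roll back p itself, without touching its
   * finished sub procedures. Step 3: recursively roll back the parent.
   */
  static boolean rollback(Proc p, List<String> order) {
    for (Proc child : p.children) {
      if (child.running) {
        return false; // step 1: wait until all sub procedures are done
      }
    }
    order.add(p.name); // step 2: roll back this procedure only
    return p.parent == null || rollback(p.parent, order); // step 3
  }
}
```

Aborting the parent while both children run returns false and records nothing (scenario 1); once the children are done, rolling back a failed child reverts that child and then its parent, never the successful sibling (scenario 2).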
[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645981#comment-16645981 ]

stack commented on HBASE-21278:
---

bq. It does not make sense to rollback a successful procedure right?

Agree.

bq. so we do not need to rollback the sub procedures...

How do we do this: prevent PE from calling rollback on subprocedures, or have TRSP just ignore the calls? Does the 'throw new UnsupportedOperationException(this + " unhandled state=" + state);' need changing too?
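One of the options raised above, having TRSP ignore a rollback request once it has already succeeded instead of throwing, might look roughly like this. It is a hypothetical sketch (ToyTrsp, State and ignoredRollbacks are invented names), not the actual TransitRegionStateProcedure.rollbackState; keeping the UnsupportedOperationException for a TRSP that is still in progress is an assumption.

```java
/** Hypothetical sketch of "ignore the rollback call"; NOT the real TRSP code. */
class IgnoreRollbackSketch {
  enum State { RUNNABLE, SUCCESS }

  static final class ToyTrsp {
    State state = State.RUNNABLE;
    int ignoredRollbacks; // how many rollback requests were ignored

    void rollbackState() {
      if (state == State.SUCCESS) {
        // Already finished successfully: nothing to undo here; the parent is
        // assumed to compensate (e.g. by scheduling new procedures).
        ignoredRollbacks++;
        return;
      }
      // An in-progress transition still cannot be reverted, so keep failing loudly.
      throw new UnsupportedOperationException(this + " unhandled state=" + state);
    }
  }
}
```

The alternative raised in the same comment, preventing PE from calling rollback on sub procedures at all, would move this decision into the executor instead of each procedure.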
[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645978#comment-16645978 ]

Duo Zhang commented on HBASE-21278:
---

For MergeTableRegionsProcedure, we will schedule new TRSPs to bring the two regions online; you can see this in the code. I think this is natural, as the original TRSPs have already finished successfully. It does not make sense to rollback a successful procedure, right? Most developers will not expect that a successful procedure can still be rolled back...
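The compensating approach described above (leave the finished TRSPs alone and schedule new ones to bring the regions online) can be illustrated with a toy model. Everything here is hypothetical: the region names, the RegionState enum and the "ASSIGN" strings only stand in for the real MergeTableRegionsProcedure and TRSP behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model of compensating rollback; NOT the real MergeTableRegionsProcedure. */
class CompensatingRollbackSketch {
  enum RegionState { OPEN, CLOSED }

  final Map<String, RegionState> regions = new LinkedHashMap<>();
  final List<String> scheduled = new ArrayList<>();

  /** Forward step: two (toy) UNASSIGN sub procedures close both regions and succeed. */
  void closeRegionsForMerge(String a, String b) {
    regions.put(a, RegionState.CLOSED);
    regions.put(b, RegionState.CLOSED);
  }

  /**
   * Rollback of the parent: the finished UNASSIGNs are not reverted. Instead a new
   * (toy) ASSIGN is scheduled for every region that is still closed.
   */
  void rollbackMerge(String a, String b) {
    for (String region : List.of(a, b)) {
      if (regions.get(region) == RegionState.CLOSED) {
        scheduled.add("ASSIGN " + region);
        regions.put(region, RegionState.OPEN); // pretend the new assign succeeds
      }
    }
  }
}
```

The design point is that undo is expressed as new forward work, so no successful procedure ever needs a rollback path of its own.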
[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645957#comment-16645957 ]

stack commented on HBASE-21278:
---

bq. In the current design, when rolling back a procedure, we will rollback the sub procedures first. At least for MergeTableRegionsProcedure, this does not make sense.

... because? Is it because "There is no rollback for TRSP"? Has the TRSP completed successfully? Can it 'ignore' the rollback request? Or can we not schedule new TRSPs to do the MergeTableRegionsProcedure rollback?

We should do a writeup on rollback and try to design how it should work?
[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645951#comment-16645951 ]

Duo Zhang commented on HBASE-21278:
---

OK, I found the problem. When rolling back a procedure, we also need to acquire the lock. And for TRSP, we need to wait until meta is loaded, but when the procedure is woken up by the meta loaded event, we add it to the scheduler and try to execute it, not roll it back...

I think the first thing here is to decide what the correct behavior is. In the current design, when rolling back a procedure, we will rollback the sub procedures first. At least for MergeTableRegionsProcedure, this does not make sense. There is no rollback for TRSP, and also, we will schedule new TRSPs to restore the state when rolling back the MergeTableRegionsProcedure, so we do not need to rollback the sub procedures...
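The failure mode described above, where a procedure suspended during rollback is woken by the meta loaded event and sent down the execute path instead of the rollback path, can be reproduced in a toy model. This is a hypothetical sketch, not HBase's MasterProcedureScheduler or ProcedureExecutor: the wait on meta stands in for acquiring the lock, and the "is already finished, skipping execution" message mirrors the DEBUG line quoted in comment 16644611.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy reproduction of the described wake-up path; NOT HBase's scheduler. */
class MetaLoadedWakeupSketch {
  enum State { RUNNABLE, SUCCESS }

  static final class ToyProc {
    final long pid;
    final State state;
    boolean rolledBack;

    ToyProc(long pid, State state) {
      this.pid = pid;
      this.state = state;
    }
  }

  private final List<ToyProc> waitingForMeta = new ArrayList<>();
  final List<String> log = new ArrayList<>();
  boolean metaLoaded;

  /** Rolling back needs the lock, which (for a TRSP) means waiting until meta is loaded. */
  void rollback(ToyProc p) {
    if (!metaLoaded) {
      waitingForMeta.add(p);
      log.add("pid=" + p.pid + " suspended until meta loaded");
      return;
    }
    p.rolledBack = true;
    log.add("pid=" + p.pid + " rolled back");
  }

  void execute(ToyProc p) {
    if (p.state == State.SUCCESS) {
      log.add("pid=" + p.pid + " is already finished, skipping execution");
      return;
    }
    log.add("pid=" + p.pid + " executed");
  }

  /** The wake-up puts procedures back on the execute path, never the rollback path. */
  void onMetaLoaded() {
    metaLoaded = true;
    for (ToyProc p : waitingForMeta) {
      execute(p);
    }
    waitingForMeta.clear();
  }
}
```

In this model the pending rollback of the finished TRSP is silently lost, which is why deciding the intended rollback semantics comes before patching the wake-up path.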
[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644791#comment-16644791 ]

Duo Zhang commented on HBASE-21278:
---

Good news. I changed testRecoveryAndDoubleExecution to only quit when stepNum equals lastStep, and changed lastStep to 8 (the id for MERGE_TABLE_REGIONS_UPDATE_META; we can roll back before this state, actually). The old test did not always fail because we sometimes had not persisted the TRSPs, so when restarting we did not need to roll back the TRSPs. By changing lastStep to 8, we make sure that the two TRSPs have been persisted, so the test will always fail. Let me dig.
[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644611#comment-16644611 ]

Duo Zhang commented on HBASE-21278:
---

OK, for the UnsupportedOperationException, I think the problem is that we record all the sub procedures, and when rolling back, we first need to roll back the sub procedures, which leads to an UnsupportedOperationException.

But the behavior is still a bit strange. After throwing the UnsupportedOperationException, the next time we enter the executeProcedure method we output this:
{noformat}
2018-10-08 03:44:31,101 DEBUG [PEWorker-1] procedure2.ProcedureExecutor(1425): pid=43, ppid=42, state=SUCCESS, hasLock=false; TransitRegionStateProcedure table=testRollbackAndDoubleExecution, region=9bac7c539ac0cff6dc5706ed375a3bfb, UNASSIGN is already finished, skipping execution
{noformat}
Need to dig more...
[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644395#comment-16644395 ]

Duo Zhang commented on HBASE-21278:
---

I think the problem is that we roll back after the MERGE_TABLE_REGIONS_CHECK_CLOSED_REGIONS state, but we do not wait for the completion of the sub procedures, which are two TRSPs. In the code of ProcedureExecutor, we wait until the sub procedures are finished before executing the rollback, and in the uploaded log we can see that the TRSP with pid=43 was finished, but in the end we still wanted to roll it back, which is a bit strange. Need to dig more.
[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642676#comment-16642676 ]

Duo Zhang commented on HBASE-21278:
---

I do not think so. The problem here is that we should not try to rollback a TRSP...
[jira] [Commented] (HBASE-21278) TestMergeTableRegionsProcedure is flaky
[ https://issues.apache.org/jira/browse/HBASE-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642230#comment-16642230 ]

stack commented on HBASE-21278:
---

Would HBASE-21271 help? Stop throwing the unsupported exception?