[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
[ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131120#comment-16131120 ]

Hudson commented on HBASE-18551:

FAILURE: Integrated in Jenkins build HBASE-14070.HLC #233 (See [https://builds.apache.org/job/HBASE-14070.HLC/233/])
HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers (stack: rev 2dd75d10f8818ed31fcc36bd89024e9ad728ae41)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/DisableTableProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
* (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashException.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableNamespaceManager.java
Revert "HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers" (stack: rev e4ba404a5a65d522421e26045b8d37fbfda8)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableNamespaceManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashException.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/DisableTableProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers (stack: rev 6f44b24860192d81dbf88ffd834d4b998a6fe636)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/DisableTableProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashException.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableNamespaceManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
* (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java
HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers; (stack: rev 1070888fff3a89d435018f11bfb2fd5609be8bab)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestAssignmentManager.java

> [AMv2] UnassignProcedure and crashed regionservers
> --
>
> Key: HBASE-18551
> URL: https://issues.apache.org/jira/browse/HBASE-18551
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Reporter: stack
> Assignee: stack
> Fix For: 2.0.0
>
> Attachments:
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
[ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124392#comment-16124392 ]

Hudson commented on HBASE-18551:

FAILURE: Integrated in Jenkins build HBase-2.0 #312 (See [https://builds.apache.org/job/HBase-2.0/312/])
HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers; (stack: rev 7197b40cbfe0599fa792b8152ed94761377e75e3)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestAssignmentManager.java

> [AMv2] UnassignProcedure and crashed regionservers
> --
>
> Key: HBASE-18551
> URL: https://issues.apache.org/jira/browse/HBASE-18551
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Reporter: stack
> Assignee: stack
> Fix For: 2.0.0
>
> Attachments: HBASE-18551.master.001.patch,
> HBASE-18551.master.002.patch, HBASE-18551.master.003.patch
>
>
> This has been [~uagashe] and my obsession over the last few days: what should
> an UnassignProcedure do when it dispatches a CLOSE but the CLOSE fails
> because of ConnectException or SocketTimeout?
> + We used to let UnassignProcedure continue, presuming the Region would be
> closed since the server is dead. BUT, if the unassign was part of a
> MoveProcedure, the unassign would proceed and the Move would then run WITHOUT
> first splitting logs. Bad.
> + So, we made it so UnassignProcedure failed; let the upper layers take care
> of the failure. See HBASE-18491 that enabled this behavior. BUT, we have since
> figured that even if the UP completes as a failure, since it gives up the
> Region lock on completion, another procedure -- say an AssignProcedure --
> could cut in before the ServerCrashProcedure had finished and again there
> could be data loss.
> + Now we are thinking the UP should hold on to the Region lock until we are
> signalled by a ServerCrashProcedure; only then let go of the region. The UP
> has context that is hard to pass to another procedure. Waiting on a SCP has
> the UP living on for what could be a good amount of time. It might be ok if
> we can suspend the procedure.
> There is a good sample scenario that came up doing the no-regions-on-master
> issue, HBASE-18511. When meta is not on master, TestSplitTransactionOnCluster
> is failing. It fails because, though the test completes, the tests commonly
> kill a RegionServer. The teardown for the test runs before we've noticed the
> aborted RS. So, the disable of the table in the teardown, preparatory to our
> deleting the test table as part of clean up, goes to unassign regions but the
> unassign fails against the aborted server.
> Good stuff.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
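The third option in the issue description above (the UP keeps the Region lock until a ServerCrashProcedure signals it) can be illustrated with a toy model. This is a minimal sketch only: it does not use any HBase API, and every name in it (UnassignLockSketch, tryLock, onCloseFailed, onServerCrashHandled, the region and server strings) is a hypothetical stand-in invented for illustration, not the code in the attached patches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "hold the region lock until crash handling is done". Not HBase code. */
final class UnassignLockSketch {
    /** Region name -> name of the (hypothetical) procedure that holds its lock. */
    private final Map<String, String> regionLockOwner = new HashMap<>();
    /** Crashed server -> regions whose unassign is suspended waiting on it. */
    private final Map<String, List<String>> suspendedByServer = new HashMap<>();

    /** Try to take the region lock; returns false if another procedure holds it. */
    boolean tryLock(String region, String procedure) {
        return regionLockOwner.putIfAbsent(region, procedure) == null;
    }

    /** The CLOSE failed (connect error or timeout): keep the lock and suspend. */
    void onCloseFailed(String region, String server) {
        suspendedByServer.computeIfAbsent(server, s -> new ArrayList<>()).add(region);
    }

    /** Crash handling for the server (including log splitting) finished: release waiters. */
    void onServerCrashHandled(String server) {
        List<String> regions = suspendedByServer.remove(server);
        if (regions == null) {
            return; // nothing was waiting on this server
        }
        for (String region : regions) {
            regionLockOwner.remove(region); // only now may another procedure cut in
        }
    }

    public static void main(String[] args) {
        UnassignLockSketch sketch = new UnassignLockSketch();
        sketch.tryLock("region-1", "unassign");
        sketch.onCloseFailed("region-1", "rs-1");
        // An assign cannot cut in while the crashed server is still being processed.
        System.out.println("assign before crash handling: " + sketch.tryLock("region-1", "assign"));
        sketch.onServerCrashHandled("rs-1");
        System.out.println("assign after crash handling: " + sketch.tryLock("region-1", "assign"));
    }
}
```

In this model the competing assign is refused until crash handling for the server has finished, which is the ordering the description asks for. The cost noted in the description also shows up here: the suspended unassign stays registered for as long as crash handling takes.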
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
[ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124374#comment-16124374 ]

Hudson commented on HBASE-18551:

SUCCESS: Integrated in Jenkins build HBase-Trunk_matrix #3515 (See [https://builds.apache.org/job/HBase-Trunk_matrix/3515/])
HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers; (stack: rev 1070888fff3a89d435018f11bfb2fd5609be8bab)
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestAssignmentManager.java
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
[ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124001#comment-16124001 ]

Hudson commented on HBASE-18551:

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #3514 (See [https://builds.apache.org/job/HBase-Trunk_matrix/3514/])
HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers (stack: rev 6f44b24860192d81dbf88ffd834d4b998a6fe636)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableNamespaceManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashException.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/DisableTableProcedure.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
* (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
[ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123927#comment-16123927 ]

Hudson commented on HBASE-18551:

FAILURE: Integrated in Jenkins build HBase-2.0 #311 (See [https://builds.apache.org/job/HBase-2.0/311/])
HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers (stack: rev 5940f4224c0ce0c01e98cdb28f74c6e227c918e3)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/DisableTableProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashException.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableNamespaceManager.java
* (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
[ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122815#comment-16122815 ]

Hadoop QA commented on HBASE-18551:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 18m 12s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 33s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 26s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 38s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 42s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 16s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 34m 14s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha4. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 9s{color} | {color:green} hbase-procedure in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}154m 54s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 1m 22s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}231m 43s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.TestSplitTransactionOnCluster |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.13.1 Server=1.13.1 Image:yetus/hbase:bdc94b1 |
| JIRA Issue | HBASE-18551 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12881361/HBASE-18551.master.003.patch |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 785f0f6d07a1 3.13.0-117-generic #164-Ubuntu SMP Fri Apr 7 11:05:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / efd211d |
| Default Java | 1.8.0_144 |
| findbugs | v3.1.0-RC3 |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/8026/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/8026/testReport/ |
| modules | C: hbase-procedure hbase-server U: . |
| Console
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
[ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122672#comment-16122672 ]

Hudson commented on HBASE-18551:

SUCCESS: Integrated in Jenkins build HBase-Trunk_matrix #3509 (See [https://builds.apache.org/job/HBase-Trunk_matrix/3509/])
HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers (stack: rev 2dd75d10f8818ed31fcc36bd89024e9ad728ae41)
* (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableNamespaceManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/DisableTableProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashException.java
Revert "HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers" (stack: rev e4ba404a5a65d522421e26045b8d37fbfda8)
* (edit) hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/DisableTableProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableNamespaceManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashException.java
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
[ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122679#comment-16122679 ]

Umesh Agashe commented on HBASE-18551:
--

+1
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
    [ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122560#comment-16122560 ]

stack commented on HBASE-18551:
-------------------------------

.003 addresses [~uagashe]'s reviews up on rb.

> [AMv2] UnassignProcedure and crashed regionservers
> --------------------------------------------------
>
>                 Key: HBASE-18551
>                 URL: https://issues.apache.org/jira/browse/HBASE-18551
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0
>
>         Attachments: HBASE-18551.master.001.patch,
> HBASE-18551.master.002.patch, HBASE-18551.master.003.patch
>
>
> This has been [~uagashe]'s and my obsession over the last few days: what should
> an UnassignProcedure do when it dispatches a CLOSE but the CLOSE fails
> because of a ConnectException or SocketTimeout?
> + We used to let the UnassignProcedure continue, presuming the Region would be
> closed since the server is dead. BUT, if the unassign was part of a
> MoveProcedure, the unassign would proceed and the Move would then run WITHOUT
> first splitting logs. Bad.
> + So, we made it so the UnassignProcedure failed; let the upper layers take care
> of the failure. See HBASE-18491, which enabled this behavior. BUT, we have since
> figured that even if the UP completes as a failure, since it gives up the
> Region lock on completion, another procedure -- say an AssignProcedure --
> could cut in before the ServerCrashProcedure had finished, and again there
> could be data loss.
> + Now we are thinking the UP should hold on to the Region lock until we are
> signalled by a ServerCrashProcedure; only then let go of the region. The UP
> has context that is hard to pass to another procedure. Waiting on a SCP has the
> UP living on for what could be a good amount of time. It might be ok if we can
> suspend the procedure.
> There is a good sample scenario that came up doing the no-regions-on-master
> issue, HBASE-18511. When meta is not on master, TestSplitTransactionOnCluster
> is failing. It fails because, though the test completes, the tests commonly
> kill a RegionServer. The teardown for the test runs before we've noticed the
> aborted RS. So, the disable of the table in the teardown, preparatory to our
> deleting the test table as part of clean up, goes to unassign regions but the
> unassign fails against the aborted server.
> Good stuff.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
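As a rough illustration of the third option in the description above (the UP keeps the Region lock until a ServerCrashProcedure signals it), here is a toy, single-threaded Java model. It is a sketch under stated assumptions, not the patch: `RegionLock`, `run()`, and the owner names "unassign" and "assign" are hypothetical and are not HBase classes or APIs. It only shows why a competing assign cannot cut in while the lock is held.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; none of these types are HBase classes. */
class UnassignLockSketch {

    /** Hypothetical exclusive region lock: at most one owner at a time. */
    static final class RegionLock {
        private String owner; // null when the lock is free

        boolean tryAcquire(String who) {
            if (owner != null) {
                return false; // someone else already holds the region
            }
            owner = who;
            return true;
        }

        void release(String who) {
            if (!who.equals(owner)) {
                throw new IllegalStateException(who + " does not hold the lock");
            }
            owner = null;
        }
    }

    /** Plays the scenario once and returns an event log. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        RegionLock lock = new RegionLock();

        // The unassign takes the region lock and dispatches CLOSE; the server is dead.
        lock.tryAcquire("unassign");
        log.add("unassign: CLOSE failed, keeping region lock");

        // A competing assign tries to cut in before crash handling is done.
        boolean cutIn = lock.tryAcquire("assign");
        log.add("assign: acquired=" + cutIn);

        // Crash handling splits logs, then signals the waiting unassign.
        log.add("crash handler: logs split, signalling unassign");
        lock.release("unassign");
        log.add("unassign: released region lock");

        // Only now can the assign take the region.
        boolean afterSignal = lock.tryAcquire("assign");
        log.add("assign: acquired=" + afterSignal);
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this model the assign is refused while the unassign still owns the lock and succeeds only after the crash handler's signal, which is the ordering the third option is after.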
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
    [ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122412#comment-16122412 ]

Hadoop QA commented on HBASE-18551:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s{color} | {color:red} HBASE-18551 does not apply to master. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/0.4.0/precommit-patchnames for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HBASE-18551 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12881337/HBASE-18551.master.001.patch |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/8023/console |
| Powered by | Apache Yetus 0.4.0 http://yetus.apache.org |


This message was automatically generated.



> [AMv2] UnassignProcedure and crashed regionservers
> --------------------------------------------------
>
>                 Key: HBASE-18551
>                 URL: https://issues.apache.org/jira/browse/HBASE-18551
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0
>
>         Attachments: HBASE-18551.master.001.patch
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
    [ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122373#comment-16122373 ]

stack commented on HBASE-18551:
-------------------------------

.001 implements #3. TODO: make it so the likes of an UnassignProcedure is able to return a ServerCrashProcedure as a subprocedure; i.e. have the UP block until the SCP is done.

> [AMv2] UnassignProcedure and crashed regionservers
> --------------------------------------------------
>
>                 Key: HBASE-18551
>                 URL: https://issues.apache.org/jira/browse/HBASE-18551
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0
>
>         Attachments: HBASE-18551.master.001.patch
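The TODO in the comment above (have the UP return a crash-handling procedure as a subprocedure and block until it is done) can be pictured with a toy executor. This is only a sketch under assumptions: `Proc`, `CrashHandler`, `Unassign`, and `execute` are hypothetical names invented for illustration, not the procedure2 API, and the executor here is a simple depth-first loop rather than anything HBase ships.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model only; none of these types are HBase classes. */
class SubprocedureSketch {

    /** Hypothetical step: returns children to wait on, or an empty list when finished. */
    interface Proc {
        String name();
        List<Proc> step();
    }

    /** Stand-in for crash handling (log splitting and so on); finishes in one step. */
    static final class CrashHandler implements Proc {
        public String name() { return "crash-handler"; }
        public List<Proc> step() { return Collections.emptyList(); }
    }

    /** Stand-in for an unassign whose CLOSE failed against a dead server. */
    static final class Unassign implements Proc {
        private boolean waitedOnCrash = false;

        public String name() { return "unassign"; }

        public List<Proc> step() {
            if (!waitedOnCrash) {
                // First pass: hand back the crash handler as a child and wait for it.
                waitedOnCrash = true;
                return Collections.<Proc>singletonList(new CrashHandler());
            }
            // Second pass: the child is done, so the unassign can complete.
            return Collections.emptyList();
        }
    }

    /** Depth-first toy executor: a parent resumes only after all its children finish. */
    static List<String> execute(Proc root) {
        List<String> finished = new ArrayList<>();
        runToCompletion(root, finished);
        return finished;
    }

    private static void runToCompletion(Proc p, List<String> finished) {
        List<Proc> children = p.step();
        while (!children.isEmpty()) {
            for (Proc child : children) {
                runToCompletion(child, finished);
            }
            children = p.step(); // resume the parent once its children are done
        }
        finished.add(p.name());
    }

    public static void main(String[] args) {
        System.out.println(execute(new Unassign()));
    }
}
```

The completion order in this model has the crash handler finishing before the unassign, which is the blocking behaviour the TODO describes.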
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
    [ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121088#comment-16121088 ]

stack commented on HBASE-18551:
-------------------------------

Hmm. #3 seems to work. Let me clean up the patch.

> [AMv2] UnassignProcedure and crashed regionservers
> --------------------------------------------------
>
>                 Key: HBASE-18551
>                 URL: https://issues.apache.org/jira/browse/HBASE-18551
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
[jira] [Commented] (HBASE-18551) [AMv2] UnassignProcedure and crashed regionservers
    [ https://issues.apache.org/jira/browse/HBASE-18551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120844#comment-16120844 ]

Umesh Agashe commented on HBASE-18551:
--------------------------------------

Thanks [~stack], nice description about the set of problems we are working on recently.

> [AMv2] UnassignProcedure and crashed regionservers
> --------------------------------------------------
>
>                 Key: HBASE-18551
>                 URL: https://issues.apache.org/jira/browse/HBASE-18551
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack