[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570971#comment-13570971 ] Hudson commented on HBASE-2611: --- Integrated in HBase-0.94-security-on-Hadoop-23 #11 (See [https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/11/]) HBASE-2611 Handle RS that fails while processing the failure of another one (Himanshu Vashishtha) (Revision 1440054) Result = FAILURE larsh : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java Handle RS that fails while processing the failure of another one Key: HBASE-2611 URL: https://issues.apache.org/jira/browse/HBASE-2611 Project: HBase Issue Type: Sub-task Components: Replication Reporter: Jean-Daniel Cryans Assignee: Himanshu Vashishtha Priority: Critical Fix For: 0.96.0, 0.94.5 Attachments: 2611-0.94.txt, 2611-trunk-v3.patch, 2611-trunk-v4.patch, 2611-v3.patch, HBASE-2611-trunk-v2.patch, HBASE-2611-trunk-v3.patch, HBase-2611-upstream-v1.patch, HBASE-2611-v2.patch HBASE-2223 doesn't manage region servers that fail while doing the transfer of HLogs queues from other region servers that failed. Devise a reliable way to do it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
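The issue description asks for a reliable way to hand over a dead region server's HLog queues even when the server doing the takeover dies part-way through. The review comments in this thread mention a `copyQueuesFromRSUsingMulti` method and an "atomic" move, which suggests an all-or-nothing batch of operations. The following is only a toy in-memory sketch of that idea, not the committed patch and not the ZooKeeper API: the class `QueueMoveSketch`, its `Op` type, and the `/rs/<server>/<log>` path layout are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory stand-in for a znode tree. Hypothetical; NOT the ZooKeeper API. */
final class QueueMoveSketch {

    /** One step of a transfer: create a node for the new owner, or delete one of the dead server's nodes. */
    static final class Op {
        final boolean create;
        final String path;
        final String data;

        private Op(boolean create, String path, String data) {
            this.create = create;
            this.path = path;
            this.data = data;
        }

        static Op create(String path, String data) { return new Op(true, path, data); }
        static Op delete(String path) { return new Op(false, path, null); }
    }

    /** Flat path -> data map (assumed layout: /rs/<server>/<log>). */
    final Map<String, String> nodes = new TreeMap<>();

    /** Applies every op or none: work on a copy and publish it only if all ops are valid. */
    boolean multi(List<Op> ops) {
        Map<String, String> copy = new TreeMap<>(nodes);
        for (Op op : ops) {
            if (op.create) {
                if (copy.containsKey(op.path)) {
                    return false; // node already exists: reject the whole batch
                }
                copy.put(op.path, op.data);
            } else {
                if (!copy.containsKey(op.path)) {
                    return false; // nothing to delete: reject the whole batch
                }
                copy.remove(op.path);
            }
        }
        nodes.clear();
        nodes.putAll(copy);
        return true;
    }

    /** Builds the batch that moves every log entry of deadRs under liveRs. */
    List<Op> moveOps(String deadRs, String liveRs) {
        List<Op> ops = new ArrayList<>();
        String prefix = "/rs/" + deadRs + "/";
        for (Map.Entry<String, String> e : nodes.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                String log = e.getKey().substring(prefix.length());
                ops.add(Op.create("/rs/" + liveRs + "/" + deadRs + "-" + log, e.getValue()));
                ops.add(Op.delete(e.getKey()));
            }
        }
        return ops;
    }
}
```

Because the batch either lands completely or not at all, a takeover that is interrupted leaves the dead server's queue untouched for another server to claim, instead of leaving it half copied. ZooKeeper 3.4 and later offers a `multi` call with this all-or-nothing behaviour; whether and how the patch uses it is not shown in this thread.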
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13566296#comment-13566296 ] Hudson commented on HBASE-2611: --- Integrated in HBase-0.94-security #102 (See [https://builds.apache.org/job/HBase-0.94-security/102/]) HBASE-2611 Handle RS that fails while processing the failure of another one (Himanshu Vashishtha) (Revision 1440054) Result = SUCCESS larsh : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565341#comment-13565341 ] Hudson commented on HBASE-2611: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #382 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/382/]) HBASE-2611 Handle RS that fails while processing the failure of another one (Himanshu) (Revision 1439744) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565728#comment-13565728 ] Hudson commented on HBASE-2611: --- Integrated in HBase-0.94 #800 (See [https://builds.apache.org/job/HBase-0.94/800/]) HBASE-2611 Handle RS that fails while processing the failure of another one (Himanshu Vashishtha) (Revision 1440054) Result = SUCCESS larsh : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564910#comment-13564910 ] Lars Hofhansl commented on HBASE-2611: -- [~ted_yu] Let's commit this. +1 from me.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564936#comment-13564936 ] Jean-Daniel Cryans commented on HBASE-2611: ---
Some comments:
bq. LOG.info("Moving " + rsZnode + "'s hlogs to my queue");
This could be changed to say whether it's going to be done atomically or not.
bq. LOG.debug(" The multi list is: " + listOfOps + ", size: " + listOfOps.size());
This is going to print a lot of object references... not sure how useful this is. Maybe just keep the size?
bq. LOG.info("Atomically moved the dead regionserver logs. ");
With my first comment this becomes redundant, and somewhere else it will say when the move is done anyway.
bq. LOG.warn("Got exception in copyQueuesFromRSUsingMulti: " + e);
Put the e in the second parameter instead of appending it to the string.
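The four review comments above can be sketched as follows. This is a minimal illustration, not the patch itself: it uses `java.util.logging` so that it runs without extra libraries, whereas the `LOG` object in the quoted lines is assumed to come from a logging facade such as Apache Commons Logging, where the equivalent call would be `LOG.warn(message, e)`. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper illustrating the review comments; not code from the patch. */
final class ReviewLoggingSketch {
    private static final Logger LOG = Logger.getLogger(ReviewLoggingSketch.class.getName());

    /** Comments 1 and 3: a single message that says up front whether the move is atomic. */
    static String moveMessage(String rsZnode, boolean atomic) {
        return (atomic ? "Atomically moving " : "Moving ") + rsZnode + "'s hlogs to my queue";
    }

    /** Comment 2: log only the size of the op list, not every object reference in it. */
    static String opsSummary(List<?> listOfOps) {
        return "The multi list size is: " + listOfOps.size();
    }

    /** Comment 4: pass the exception as its own argument so the stack trace is kept. */
    static void warnWithCause(String where, Exception e) {
        LOG.log(Level.WARNING, "Got exception in " + where, e);
    }
}
```

Appending `e` to the string only records `e.toString()`, while passing it as a separate argument lets the logging backend print the full stack trace.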
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565004#comment-13565004 ] Hadoop QA commented on HBASE-2611: --
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566875/HBASE-2611-trunk-v3.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 1 warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4225//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4225//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4225//console
This message is automatically generated.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565005#comment-13565005 ] Ted Yu commented on HBASE-2611: --- @Himanshu: Mind attaching patch for 0.94 ?
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565080#comment-13565080 ] Hadoop QA commented on HBASE-2611: --
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566899/2611-trunk-v4.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4228//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4228//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4228//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4228//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4228//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4228//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4228//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4228//console
This message is automatically generated.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565081#comment-13565081 ] Ted Yu commented on HBASE-2611: --- Patch v4 integrated to trunk. Thanks for the patch, Himanshu. Thanks for the reviews, Lars and J-D.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565089#comment-13565089 ] Hadoop QA commented on HBASE-2611: --
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566908/2611-0.94.txt against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4229//console
This message is automatically generated.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565095#comment-13565095 ] Lars Hofhansl commented on HBASE-2611: -- Going to commit the 0.94 version tomorrow, unless I hear objections.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565131#comment-13565131 ] Hudson commented on HBASE-2611: --- Integrated in HBase-TRUNK #3820 (See [https://builds.apache.org/job/HBase-TRUNK/3820/]) HBASE-2611 Handle RS that fails while processing the failure of another one (Himanshu) (Revision 1439744) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java * /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564016#comment-13564016 ] Lars Hofhansl commented on HBASE-2611: -- [~jdcryans]?
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562753#comment-13562753 ] Hadoop QA commented on HBASE-2611: --
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12566510/2611-trunk-v3.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4181//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4181//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4181//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4181//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4181//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4181//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4181//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4181//console
This message is automatically generated.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562755#comment-13562755 ] Ted Yu commented on HBASE-2611: --- Will integrate patch v3 later today if there are no further review comments.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562847#comment-13562847 ] Ted Yu commented on HBASE-2611: --- [~jdcryans]: It would be nice if you take a look at Himanshu's patch.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562383#comment-13562383 ] Ted Yu commented on HBASE-2611: ---
{code}
p0 2611-upstream-v1.patch
patching file hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
Hunk #1 succeeded at 25 (offset -1 lines).
Hunk #2 FAILED at 41.
Hunk #3 succeeded at 858 (offset 131 lines).
1 out of 3 hunks FAILED -- saving rejects to file hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java.rej
patching file hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
Hunk #1 succeeded at 579 (offset 19 lines).
patching file hbase-server/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java
Reversed (or previously applied) patch detected! Assume -R? [n] ^C
{code}
@Himanshu: Can you update the upstream patch? Thanks
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562460#comment-13562460 ] Lars Hofhansl commented on HBASE-2611: -- Himanshu, you are officially my hero now. We've been discussing this for over a year, and it looks like we're finally fixing it.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562479#comment-13562479 ]
Hadoop QA commented on HBASE-2611:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12566463/HBASE-2611-trunk-v2.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 1 warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4177//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4177//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4177//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4177//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4177//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4177//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4177//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4177//console
This message is automatically generated.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560005#comment-13560005 ]
Himanshu Vashishtha commented on HBASE-2611:
Thanks for the review Lars :), and Ted for updating the patch.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560036#comment-13560036 ]
Ted Yu commented on HBASE-2611:
The trunk patch depends on HBASE-7382.
@Himanshu: Can you run the tests listed @ 28/Jun/12 04:07?
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558938#comment-13558938 ]
Himanshu Vashishtha commented on HBASE-2611:
[~lhofhansl]: Yes, I followed the same approach in the attached patch.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559232#comment-13559232 ]
Lars Hofhansl commented on HBASE-2611:
Hmm... Yes, you did. Sorry, somehow missed it when I looked at it first. Cool then, we came to the same conclusion. Just took me much longer to get to it :)
+1 on patch, it should indeed fix this problem.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558487#comment-13558487 ]
Lars Hofhansl commented on HBASE-2611:
Specifically, check out ReplicationSourceManager.NodeFailoverWorker.run(). First, all surviving RSs race to obtain the lock:
{code}
if (!zkHelper.lockOtherRS(rsZnode)) {
  return;
}
{code}
Only one RS will continue to move the failed RS's queues. I think what we could do is this: if multi is supported, we just have all surviving RSs attempt to move the queues (don't bother with the lock step). If multi is as atomic as advertised, only one of the RSs will succeed in moving the queues atomically, but all will try. It seems like that should work.
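The proposal in the comment above (skip the lock, let every surviving RS attempt the move, and rely on atomicity so that exactly one succeeds) can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: `QueueStore`, `moveAll`, and `FailoverRace` are hypothetical names, the queue naming is made up for the example, and the synchronized method merely stands in for a ZooKeeper multi transaction rather than calling the real API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the replication znodes; not the ZooKeeper API. */
final class QueueStore {
    // owner region server -> names of the queues stored under it
    private final Map<String, List<String>> queuesByOwner = new HashMap<String, List<String>>();

    synchronized void put(String owner, List<String> queues) {
        queuesByOwner.put(owner, new ArrayList<String>(queues));
    }

    synchronized List<String> queuesOf(String owner) {
        List<String> held = queuesByOwner.get(owner);
        return held == null ? new ArrayList<String>() : new ArrayList<String>(held);
    }

    /**
     * All-or-nothing move, standing in for one multi() transaction: either every
     * queue of deadRs ends up under claimant, or nothing changes at all.
     */
    synchronized boolean moveAll(String deadRs, String claimant) {
        List<String> queues = queuesByOwner.remove(deadRs);
        if (queues == null) {
            return false; // some other region server already moved them
        }
        List<String> mine = queuesByOwner.get(claimant);
        if (mine == null) {
            mine = new ArrayList<String>();
            queuesByOwner.put(claimant, mine);
        }
        for (String queue : queues) {
            mine.add(queue + "-" + deadRs); // illustrative naming: keep the origin in the name
        }
        return true;
    }
}

final class FailoverRace {
    /** Every survivor attempts the move concurrently; returns those whose attempt succeeded. */
    static List<String> race(QueueStore store, String deadRs, List<String> survivors) {
        List<String> winners = Collections.synchronizedList(new ArrayList<String>());
        List<Thread> threads = new ArrayList<Thread>();
        for (String rs : survivors) {
            Thread t = new Thread(() -> {
                if (store.moveAll(deadRs, rs)) {
                    winners.add(rs);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                throw new IllegalStateException("interrupted while waiting for survivors", ie);
            }
        }
        return winners;
    }
}
```

In this model the losers simply see `false` and return, which is the "herding" cost discussed in the thread: every survivor makes one attempt, but no separate lock step can be left dangling.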
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555213#comment-13555213 ]
Himanshu Vashishtha commented on HBASE-2611:
bq. But what can happen is that the region server who wins the race to take over the dead region server's queues could die before it even manages to call multi.
I am not following your question. How can a regionserver win a race before calling multi? If regionserver A fails, *all* regionservers will call multi to do the failover, and only one (let's say B) will succeed. Now, if B also dies meanwhile (although it has succeeded in transferring the queue from zk's perspective), the regionserver doing the failover for B will also process A's znodes (as they are with B now). Therefore, I don't see that we really need a retry. Did I miss anything?
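The cascading case described above (A fails, B takes over A's queues atomically, then B fails and a third server inherits everything B held) can be traced with a minimal model. This is a sketch under assumptions, not the patch itself: `CascadingFailover` and its methods are hypothetical, and the map stands in for the znode tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of queue ownership; it does not talk to ZooKeeper. */
final class CascadingFailover {
    // region server -> queue names currently stored under it
    private final Map<String, List<String>> held = new TreeMap<String, List<String>>();

    void addQueues(String rs, String... queues) {
        List<String> list = held.computeIfAbsent(rs, k -> new ArrayList<String>());
        for (String q : queues) {
            list.add(q);
        }
    }

    /** Atomic takeover: moves everything under deadRs, including queues it had inherited. */
    boolean takeOver(String deadRs, String survivor) {
        List<String> queues = held.remove(deadRs);
        if (queues == null) {
            return false; // nothing left to move
        }
        List<String> target = held.computeIfAbsent(survivor, k -> new ArrayList<String>());
        for (String q : queues) {
            target.add(q + "-" + deadRs); // illustrative naming: append the previous owner
        }
        return true;
    }

    List<String> queuesOf(String rs) {
        List<String> q = held.get(rs);
        return q == null ? new ArrayList<String>() : new ArrayList<String>(q);
    }
}
```

With one queue per server, `takeOver("rsA", "rsB")` followed by `takeOver("rsB", "rsC")` leaves A's original queue under C with both previous owners in its name, so nothing needs a separate retry as long as each takeover is all-or-nothing.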
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555263#comment-13555263 ]
Lars Hofhansl commented on HBASE-2611:
But that is not the case (unless I am misunderstanding completely). All RSs race to get the lock to take over the dead RS's queues. Once there is a winner, that RS will move the queues. So if the winning RS dies after it learns that it is the winner but before it moves the queues, those queues are lost.
What you describe is one way to solve the problem: all RSs simply try to move the queues. That would work, but would lead to the herding effect (which I think is acceptable).
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555488#comment-13555488 ]
Himanshu Vashishtha commented on HBASE-2611:
Yes, your description is totally correct. So, you okay with the approach, Lars?
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1322#comment-1322 ]
Lars Hofhansl commented on HBASE-2611:
This change is good (so +1), but it does not fix the whole problem (you're not having all RSs attempt the queue failover). Maybe we do your patch in a subtask and leave this issue open.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1324#comment-1324 ]
Lars Hofhansl commented on HBASE-2611:
Or did you mean whether I'm OK with all RSs attempting to move the queues? I'm happy with that too. I think [~jdcryans] voiced some concerns over the incurred herding effect.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554574#comment-13554574 ]
Lars Hofhansl commented on HBASE-2611:
So that call to multi better not fail, ever. Otherwise we'll still lose track of data to be replicated.
There are two problems currently:
# Transfer of queues is only attempted once
# Queues may be partially transferred
This patch addresses only #2.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554582#comment-13554582 ]
Ted Yu commented on HBASE-2611:
{code}
+ * @param znode
+ * @return
{code}
Please finish the javadoc. The key of the SortedMap is the peer cluster id, right?
{code}
+ LOG.warn("Got exception in copyQueuesFromRSUsingMulti: " + e);
{code}
If you use a comma in place of +, the exception is passed to the logger as a Throwable, so you would get the stack trace with method names.
There is no empty line in copyQueuesFromRSUsingMulti(). Consider adding empty lines to separate sub-steps.
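The review point about comma versus + in the log call can be shown with a short comparison. The patch under review uses the project's own logger; the sketch below substitutes `java.util.logging` purely so that it runs with the JDK alone, and `LogThrowableDemo` and `Capture` are hypothetical names. The idea carries over: concatenation keeps only the exception's `toString()`, while passing the exception as a separate argument keeps the Throwable, and with it the stack trace, attached to the log record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class LogThrowableDemo {
    /** Collects log records in memory so the two styles can be compared. */
    static final class Capture extends Handler {
        final List<LogRecord> records = new ArrayList<LogRecord>();

        @Override
        public void publish(LogRecord record) {
            records.add(record);
        }

        @Override
        public void flush() {
        }

        @Override
        public void close() {
        }
    }

    /** Logs the same exception twice, once per style, and returns what was captured. */
    static Capture logBothWays(Exception e) {
        Logger log = Logger.getAnonymousLogger();
        log.setUseParentHandlers(false); // keep the demo output off the console
        Capture capture = new Capture();
        log.addHandler(capture);

        // Style 1: string concatenation. Only e.toString() survives; the stack trace is lost.
        log.log(Level.WARNING, "Got exception in copyQueuesFromRSUsingMulti: " + e);

        // Style 2: separate argument. The Throwable travels with the record.
        log.log(Level.WARNING, "Got exception in copyQueuesFromRSUsingMulti", e);

        log.removeHandler(capture);
        return capture;
    }
}
```

The first captured record has no Throwable attached, while the second one does, which is what a formatter needs in order to print the stack trace.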
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554584#comment-13554584 ]
Himanshu Vashishtha commented on HBASE-2611:
The call to multi uses RecoverableZooKeeper#multi, which retries in the following cases:
{code}
case CONNECTIONLOSS:
case SESSIONEXPIRED:
case OPERATIONTIMEOUT:
{code}
The number of retries is three by default. I find this approach better than the existing one.
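The bounded-retry behaviour described above can be sketched generically. This is not the RecoverableZooKeeper code: `BoundedRetry`, `TransientFailure`, and `Attempt` are hypothetical names, `TransientFailure` stands in for the retriable error codes listed in the comment, and the sketch retries immediately where real code would normally sleep between attempts.

```java
final class BoundedRetry {
    /** Hypothetical stand-in for a retriable error such as connection loss or a timeout. */
    static final class TransientFailure extends RuntimeException {
        TransientFailure(String message) {
            super(message);
        }
    }

    /** One attempt at the operation; may throw TransientFailure. */
    interface Attempt<T> {
        T run();
    }

    /**
     * Runs the operation once and, on TransientFailure, up to maxRetries more times.
     * Any other exception propagates immediately without a retry.
     */
    static <T> T call(Attempt<T> op, int maxRetries) {
        TransientFailure last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.run();
            } catch (TransientFailure e) {
                last = e; // remember the cause and try again
            }
        }
        throw new IllegalStateException("gave up after " + (maxRetries + 1) + " attempts", last);
    }
}
```

With `maxRetries` set to three, an operation that fails twice and then succeeds returns normally on the third attempt, while one that keeps failing is abandoned after four attempts. That final give-up is the situation the thread worries about: if every surviving region server exhausts its retries, nobody is left to move the queues.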
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554586#comment-13554586 ]
Chris Trezzo commented on HBASE-2611:
But the retries in RecoverableZooKeeper are not atomic... if the region server fails in the middle of RecoverableZooKeeper.multi, the queues will not get transferred.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554588#comment-13554588 ]
Chris Trezzo commented on HBASE-2611:
Also, I don't think your manual test described above hits this corner case. You need at least two region server failures for this to happen. For example, region server A fails, region server B races and wins the failover of A, and then region server B fails before it finishes copying A's queue to its own queue. Then when someone picks up B, A's original queue will not get completely replicated.
Thanks for working on this though! It is a tricky one.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554617#comment-13554617 ]
Chris Trezzo commented on HBASE-2611:
[~hvash...@cs.ualberta.ca] Hmm, I may have misspoken... atomic was not the right word choice.
bq. But the retries in RecoverableZooKeeper are not atomic... if the region server fails in the middle of RecoverableZooKeeper.multi, the queues will not get transferred.
I see that as long as a multi hasn't succeeded, all region servers will continue to try to fail over the queues. So the problem seems to be more along these lines: if all region servers exhaust their multi retries, then the queues would get lost. Is there ever a case in practice where we would run into this and zookeeper is not down?
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554724#comment-13554724 ]
Himanshu Vashishtha commented on HBASE-2611:
Chris: Thanks for taking a look.
bq. Is there ever a case in practice where we would run into this and zookeeper is not down?
Can't think of any. Even if that ever happens (let's say all regionservers can't connect to zk or whatever), then we need something different (possibly beyond the scope of this jira) so that any newly joining regionserver takes a look at existing log znodes, etc.
Re: Testing: Yeah, I know. But, given that it is moved in one transaction, I can't think of how to replicate it in a testing environment. Therefore, I tested to see what happens when two regionservers try to copy the queue, and whether this approach scales well with the number of logs or not.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554747#comment-13554747 ]
Lars Hofhansl commented on HBASE-2611:
This is definitely an improvement. What happens when a region server dies after it copied the queues but before it could finish shipping all the edits?
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554752#comment-13554752 ]
Himanshu Vashishtha commented on HBASE-2611:
Lars: Then a regionserver who does the failover will also process the leftover znodes (just like what happens currently).
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554758#comment-13554758 ]
Lars Hofhansl commented on HBASE-2611:
Cool... So as long as the multi itself does not fail we're good.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554761#comment-13554761 ]
Himanshu Vashishtha commented on HBASE-2611:
Yes. I would ask, though, in what possible circumstances do you foresee a failure of multi()?
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554783#comment-13554783 ]
Himanshu Vashishtha commented on HBASE-2611:
[~lhofhansl]: I asked about possible failure scenarios because it will be great if they can be worked upon beforehand.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554789#comment-13554789 ]
Lars Hofhansl commented on HBASE-2611:
Yeah, I don't know. But what can happen is that the region server that wins the race to take over the dead region server's queues could die before it even manages to call multi. In that case - since the ephemeral znode is only removed once - we won't ever retry to move that region server's queues again. Right?
So another part of the puzzle is to have a way to retry the takeover later. Back in the comments here there are various suggestions about how to do that, mostly centering around having all surviving RSs try to move a dead RS's queues.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553430#comment-13553430 ] Himanshu Vashishtha commented on HBASE-2611: Working on it; will provide a patch soon.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402582#comment-13402582 ] Zhihong Ted Yu commented on HBASE-2611: --- Putting patch on review board helps.
{code}
+   * @param opList: list of Op to be executed as one trx.
{code}
'trx' -> 'transaction'
{code}
+    if(opList == null || opList.size() ==0)
{code}
Space between if and (.
{code}
+    }catch (InterruptedException ie) {
+      LOG.warn("multi call interrupted; process failed!" + ie);
{code}
Restore interrupt status for the thread (same for doMultiAndWatch). Space between } and catch.
{code}
+      LOG.warn("multi call failed! One of the passed ops has failed which result in the rolled back.");
{code}
Line length beyond 100.
{code}
+   * @return
+   */
+  public SortedMap<String, SortedSet<String>> copyDeadRSLogsWithMulti(
+      String deadRSZnode) {
{code}
javadoc for the return value.
{code}
+      LOG.warn("This is us! Skipping the processing as we might be closing down.");
{code}
Add deadRSZnodePath to the log.
{code}
+    RetryCounterFactory retryCounterFactory = new RetryCounterFactory(Integer.MAX_VALUE, 3 * 1000);
{code}
I don't think MAX_VALUE is a good choice.
{code}
+    SortedSet<String> logQueue = new TreeSet<String>();
{code}
Why is logQueue backed by a TreeSet ?
{code}
+        LOG.warn("KeeperException occurred in multi; " +
+            "seems some other regionserver took the logs before us.");
{code}
Add ke to the above message.
{code}
+        Op deleteOpForLog = Op.delete(zNodeForCurrentLog, -1);
+        znodesToWatch.add(logZnode);
+        opsList.add(createOpForLog);
+        opsList.add(deleteOpForLog);
{code}
Please reorder the above calls so that znodesToWatch.add() is after opsList.add() calls. This would make code more readable.
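Two of the review points above (a bounded retry count instead of Integer.MAX_VALUE, and restoring the interrupt status when InterruptedException is caught) can be illustrated with a minimal plain-Java sketch. BoundedRetry and its parameters are hypothetical names chosen for this illustration; they are not taken from the patch or from HBase's RetryCounterFactory.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of HBase: bounded retries that honor interrupts. */
final class BoundedRetry {
  private BoundedRetry() {}

  /**
   * Runs the task up to maxAttempts times, sleeping sleepMillis between failures.
   * Returns the task's result, or null if every attempt failed or the thread
   * was interrupted.
   */
  static <T> T call(Callable<T> task, int maxAttempts, long sleepMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return task.call();
      } catch (InterruptedException ie) {
        // Restore the interrupt status so code further up the stack can see it.
        Thread.currentThread().interrupt();
        return null;
      } catch (Exception e) {
        if (attempt == maxAttempts) {
          return null; // give up after a bounded number of attempts
        }
        try {
          Thread.sleep(sleepMillis);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return null;
        }
      }
    }
    return null; // only reached when maxAttempts is zero or negative
  }
}
```

A caller would pass something like a small maxAttempts and a few seconds of sleep; the right values are a tuning question that this sketch does not answer.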
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402594#comment-13402594 ] Zhihong Ted Yu commented on HBASE-2611: --- If there is a (relatively) large number of Ops in opsList, the chance of collision between active region servers is high. Some experiments should be performed so that we get an idea of how long this procedure takes to succeed.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402775#comment-13402775 ] Himanshu Vashishtha commented on HBASE-2611: Thanks for the review, Ted. I will upload a modified version to the rb. My initial idea in putting it here was to get some feedback on the approach. Yes, it is zk intensive, as all other regionservers are competing to do the transaction. But as soon as one is successful (the first one that creates the list and issues the multi command), other regionservers that haven't yet had a chance to list the children of the dead regionserver znode will not see anything; and for other regionservers that have already created the Ops, the very first Op will fail because the znode has already moved. ZooKeeper#multi is fail-fast: it rolls back the transaction on the first failure without attempting the remaining Ops. I tested it on a 3 RS cluster with an average load of 12-14 logs, and it is usually done within seconds after the regionserver failure is noticed. What sort of experiments are you thinking about? On another note, TestReplication passes.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402801#comment-13402801 ] Zhihong Ted Yu commented on HBASE-2611: ---
bq. average load being 12-14 logs
Can you make the above 10x ? Another consideration is when (which major release) zookeeper 3.4 would be listed as the minimum requirement. There hasn't been consensus so far. Here are all the replication-related tests:
{code}
src/test/java/org/apache/hadoop/hbase/client/replication/TestReplicationAdmin.java
src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSink.java
src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java
src/test/java/org/apache/hadoop/hbase/replication/TestReplicationDeleteTypes.java
src/test/java/org/apache/hadoop/hbase/replication/TestReplicationPeer.java
src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
{code}
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402811#comment-13402811 ] Himanshu Vashishtha commented on HBASE-2611: zookeeper 3.4 is there in 0.92+? What do you mean by minimum requirement? Please explain. I found the related test, queueFailover, in TestReplication. Good to know about the other test classes.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402816#comment-13402816 ] Zhihong Ted Yu commented on HBASE-2611: --- The 3.4 requirement is only for the zookeeper client. Companies (such as StumbleUpon) run 3.3.x in production, which doesn't support multi().
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402824#comment-13402824 ] Jesse Yates commented on HBASE-2611: 3.4 is currently required only for security and, further, is not yet a stable release of ZK. That said, if it does become stable, it's likely to be adopted, given that it's been pretty solid for many people.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402829#comment-13402829 ] Himanshu Vashishtha commented on HBASE-2611:
bq. 3.4 is only for zookeeper client.
I find this a bit confusing. Why is it so? What do we gain by this? @Jesse: TM is using secure hbase in their production (if I am not wrong). So, 3.4 seems a pretty reasonable choice. Has there been any discussion on this? I would like to know more context on this.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402863#comment-13402863 ] Jesse Yates commented on HBASE-2611: @Himanshu here is the thread I started on this on dev@ a little while ago: http://search-hadoop.com/m/u2D7j1yRpi72 It basically comes down to the fact that it would be irresponsible to do a release of HBase that requires an unstable dependency. Yeah, TM has it in production, but that doesn't mean their usage is representative of _everyone's_. If the ZK fellas decide that 3.4 is a stable release, then I'm all for making it the requirement in 0.96, but until the guys who write the software feel like it's stable, I don't think we are qualified to say it is stable. I do think it's weird that we make 3.4 a dependency, but it really would be too weird (and honestly a waste of effort) to support two versions of the protocol, especially considering the trickiness of dealing with ZK clusters that may be in the process of upgrade, etc.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13401779#comment-13401779 ] Himanshu Vashishtha commented on HBASE-2611: I looked at this issue from the perspective of using the Zookeeper#multi operation (present in 3.4). This API guarantees that a list of Ops is executed as a single transaction, rolling back all the Ops if any one of them fails. I tested this functionality as a standalone case (where the transaction was to move a bunch of znodes from one parent to another), and it works well (out of N threads that race to do the transfer, only 1 is successful). And in case of a failure, all the Ops done so far are rolled back. I can attach the sample code if required. Here is the approach I used to utilize multi for this issue:
a) All the active region servers try to move the logs of peers under the dead regionserver znode. This involves creating Op objects for creating new znodes and deleting old ones. As per the multi API guarantee, only one regionserver will be successful in moving the znodes.
b) The regionservers keep trying to move the znodes from the dead regionserver until they are sure that the node is deleted (by the successful regionserver), or there is no log to process. This is to avoid any corner case in which logs for the dead regionserver are missed. The number of trials can be made configurable.
c) In case of cascading failure (when the successful regionserver dies before it gets the notification from zk about the successful move), other regionservers will get this new event and will proceed as normal (they will try to move all the znodes from this newly dead regionserver znode).
It would be good to know what others think about this approach. Are there other rogue conditions that can happen?
Attached is a patch based on this approach. I tested it by manually killing regionservers at random (not totally random, but killing one and then killing the successful one when it has just transferred the logs); it's difficult to kill it while transferring, as the transfer is an atomic operation now. Any ideas/suggestions for more direct testing are welcome.
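The race described in this comment (every survivor builds a create/delete Op list and submits it as one transaction, and only one submission can succeed) can be modeled without a ZooKeeper ensemble. The sketch below is a toy in-memory stand-in: ToyZk, ToyFailover, and the flat path layout are hypothetical and are not the real ZooKeeper client API or the code in the attached patch. It only demonstrates the all-or-nothing property the approach relies on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy stand-in for a ZooKeeper ensemble; NOT the real client API. */
final class ToyZk {
  /** One step of a transaction: create or delete a single path. */
  static final class Op {
    final boolean create;
    final String path;
    Op(boolean create, String path) { this.create = create; this.path = path; }
  }

  private TreeSet<String> paths = new TreeSet<String>();

  synchronized void create(String path) { paths.add(path); }

  /** Returns the names stored under parent (flat model, one level only). */
  synchronized List<String> children(String parent) {
    List<String> names = new ArrayList<String>();
    String prefix = parent + "/";
    for (String p : paths) {
      if (p.startsWith(prefix)) names.add(p.substring(prefix.length()));
    }
    return names;
  }

  /** Applies every op or none: the first failing op aborts and nothing changes. */
  synchronized boolean multi(List<Op> ops) {
    TreeSet<String> scratch = new TreeSet<String>(paths);
    for (Op op : ops) {
      boolean ok = op.create ? scratch.add(op.path) : scratch.remove(op.path);
      if (!ok) return false; // scratch copy is discarded, i.e. rolled back
    }
    paths = scratch;
    return true;
  }
}

final class ToyFailover {
  /** Tries to move every queue entry of deadRs under me in one transaction. */
  static boolean claim(ToyZk zk, String deadRs, String me) {
    List<String> logs = zk.children(deadRs);
    if (logs.isEmpty()) return false; // nothing left to take over
    List<ToyZk.Op> ops = new ArrayList<ToyZk.Op>();
    for (String log : logs) {
      ops.add(new ToyZk.Op(true, me + "/" + log));
      ops.add(new ToyZk.Op(false, deadRs + "/" + log));
    }
    return zk.multi(ops);
  }

  /** Starts the given number of racing threads; returns how many claims succeeded. */
  static int race(int racers, int logCount) {
    final ToyZk zk = new ToyZk();
    for (int i = 0; i < logCount; i++) zk.create("rs-dead/log" + i);
    final CountDownLatch start = new CountDownLatch(1);
    final AtomicInteger winners = new AtomicInteger();
    List<Thread> threads = new ArrayList<Thread>();
    for (int i = 0; i < racers; i++) {
      final String me = "rs-" + i;
      Thread t = new Thread(new Runnable() {
        public void run() {
          try {
            start.await();
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return;
          }
          if (claim(zk, "rs-dead", me)) winners.incrementAndGet();
        }
      });
      threads.add(t);
      t.start();
    }
    start.countDown();
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
    return winners.get();
  }
}
```

In this model a losing racer either sees an empty child list or has its transaction rejected on the first delete of a path that no longer exists, which mirrors the fail-fast behavior described in the thread. It says nothing about timing or load on a real ensemble.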
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13165085#comment-13165085 ] Chris Trezzo commented on HBASE-2611: - @J-D LarsH and I were talking about another approach to region server replication hlog queue failover yesterday, and I wanted to get some feedback on it. Currently when handling a nodeDeleted event, the live region servers only attempt to failover the node corresponding to the event. The nodeDeleted event is only fired once, so to protect ourselves from orphaning the znode state of the failed region server in a cascading failure scenario, we move the state to the znode of the region server that is performing the failover. Since we don't have an atomic way to move this state, it gets a little tricky. Instead of this approach, we could have the region server attempt to failover all failed region servers every time it receives a nodeDeleted event. For example, the nodeDeleted method could go something like this: refresh the region server list, get the list of region servers in the replication znode structure, attempt to lock and failover any region server listed in the replication znode structure that is not currently alive. The same race to lock the region server znode will occur. Only one region server will get the lock and handle the failover. Each NodeFailoverWorker that gets started could simply operate on the original dead region server znode structure. If the region server fails while performing the failover, then both region servers will get picked up by another region server when the nodeDeleted event for the second failure is fired. Locks would have to be ephemeral nodes to prevent permanent locking of a region server when the failover region server dies. Once the replication hlog queues are successfully replicated, the znode for the dead region server can be deleted.
On the cons side, this approach makes the handling of a nodeDeleted event a heavier-weight operation. On the pros side, it makes the failover code much simpler because we no longer have to worry about moving the region server znode state around in zookeeper. Thoughts always appreciated. Thanks, Chris
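The sweep proposed in this comment (on every nodeDeleted event, compare the region servers that still have replication state against the live list, and try to lock each one that is not alive) can be sketched in plain Java. OrphanSweep and its methods are hypothetical names, the sets stand in for znode listings, and ConcurrentMap.putIfAbsent stands in for the race to create an ephemeral lock znode; none of this is HBase or ZooKeeper API.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical model of "fail over every orphan on each nodeDeleted event". */
final class OrphanSweep {
  private OrphanSweep() {}

  /** Region servers that still have replication state but are no longer alive. */
  static Set<String> orphans(Set<String> replicationZnodes, Set<String> liveServers) {
    Set<String> result = new TreeSet<String>(replicationZnodes);
    result.removeAll(liveServers);
    return result;
  }

  /** Models the race to create an ephemeral lock znode: only one caller wins. */
  static boolean tryLock(ConcurrentMap<String, String> locks, String deadRs, String me) {
    return locks.putIfAbsent(deadRs, me) == null;
  }

  /** Models ephemeral locks vanishing when the session of their holder ends. */
  static void releaseAllHeldBy(ConcurrentMap<String, String> locks, String holder) {
    locks.values().removeAll(Collections.singleton(holder));
  }

  /** Returns the dead servers whose queues this caller won the right to replicate. */
  static Set<String> sweep(Set<String> replicationZnodes, Set<String> liveServers,
      ConcurrentMap<String, String> locks, String me) {
    Set<String> claimed = new TreeSet<String>();
    for (String deadRs : orphans(replicationZnodes, liveServers)) {
      if (tryLock(locks, deadRs, me)) {
        claimed.add(deadRs);
      }
    }
    return claimed;
  }
}
```

Because the lock disappears with its holder, a server that dies mid-failover leaves both its own state and the state it was working on visible as orphans to the next sweep, which is the cascading case this issue is about.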
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138637#comment-13138637 ] Chris Trezzo commented on HBASE-2611: - I think adding the ability to atomically move a znode and all its child znodes might be a pretty invasive change. I couldn't seem to find any utility package for this on the net, but there is a patch in Zookeeper ([ZOOKEEPER-965|https://issues.apache.org/jira/browse/ZOOKEEPER-965]) implementing atomic batch operations that is scheduled for 3.4. I thought about the problem a little bit, and after conferring with Lars, I think we might not need the atomic move (although it would definitely make it simpler). Below is some pseudo code for the algorithm I came up with. It is very similar to what you suggested above. Both intentions and locks are tagged with the region server they point to (i.e. locks are tagged with the rs that holds them, and intentions are tagged with the rs they intend to lock). Intentions are at the same level in the znode structure as locks. It is a recursive, depth first algorithm. Questions/comments/suggestions always appreciated. Chris
{code}
//this method is the top-level failover method (i.e. NodeFailoverWorker.run())
failOverRun(FailedNode a) {
  recordIntention(a, this);
  if(getLock(a, this)) {
    //transfer all queues to local node
    moveState(a, this, this);
  } else {
    deleteIntention(a, this);
    return;
  }
  replicateQueues();
}

moveState(NodeToMove a, CurrentNode c, TargetNode t) {
  if(lock exists on a) {
    if(lock on a is owned by c) {
      moveStateHelper(a, c, t);
    } else {
      //someone else has the lock and is handling
      //the failover
      deleteIntention(a, c);
    }
  } else {
    if(queue znodes exist) {
      //we know that this node has queues to transfer
      if(getLock(a, c)) {
        moveStateHelper(a, c, t);
      } else {
        deleteIntention(a, c);
      }
    } else {
      //we know that this node is being deleted
      deleteState(a);
      deleteIntention(a, c);
    }
  }
}

moveStateHelper(NodeToMove a, CurrentNode c, TargetNode t) {
  for(every intention b of a) {
    moveState(b, a, t);
  }
  //we need to safely handle the case where we try to copy
  //queues that have already been copied
  copy all queues in a to t;
  deleteState(a);
  deleteIntention(a, c);
}

deleteState(NodeToDelete d) {
  //there is no need to traverse down the tree at all
  //because at this point everything below us should have
  //been deleted
  //
  //we need to safely handle the case where we attempt to delete
  //nodes that have already been deleted
  delete entire node;
}
{code}
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138739#comment-13138739 ] Ted Yu commented on HBASE-2611: --- In moveState(), if the lock on a is owned by c, should the lock be released after moveStateHelper() returns? I guess the lock release can also be done at the end of moveStateHelper().
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13138762#comment-13138762 ] Chris Trezzo commented on HBASE-2611: - I should have specified that in deleteState(), the line "delete entire node" deletes the entire znode replication hierarchy for that region server. This would include the lock znode, which essentially releases the lock at the end of moveStateHelper(). Thanks! Chris
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135292#comment-13135292 ] Jean-Daniel Cryans commented on HBASE-2611: --- Actually it would be nice if it was in a separate utility package since atomically moving a znode folder recursively would be a very useful function in general. It might even already exist on the net.
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13134549#comment-13134549 ] Chris Trezzo commented on HBASE-2611: - @J-D If you don't mind, I was thinking about taking a crack at this using your 4 types of znode strategy. I'll start working on a sketch patch. At first glance, it seems as though most of the code changes are going to be in ReplicationSourceManager.NodeFailoverWorker.run().
[jira] [Commented] (HBASE-2611) Handle RS that fails while processing the failure of another one
[ https://issues.apache.org/jira/browse/HBASE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13134568#comment-13134568 ] Chris Trezzo commented on HBASE-2611: - ...and of course ReplicationZookeeper.