[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits
[ https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5618: --- Status: Open (was: Patch Available) SplitLogManager - prevent unnecessary attempts to resubmits --- Key: HBASE-5618 URL: https://issues.apache.org/jira/browse/HBASE-5618 Project: HBase Issue Type: Improvement Components: wal, zookeeper Reporter: Prakash Khemani Assignee: Prakash Khemani Attachments: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch Currently, once a watch fires indicating that the task node has been updated (heartbeated) by the worker, the SplitLogManager still takes quite some time before it updates the task's last-heard-from time. This is because the manager schedules another getDataSetWatch() and updates the last-heard-from time only after that call finishes. This leads to a large number of zk-BadVersion warnings when resubmission is repeatedly attempted and fails. Two changes should be made: (1) On a resubmission failure caused by BadVersion, the task's lastUpdate time should be advanced. (2) The task's lastUpdate time should be advanced as soon as the nodeDataChanged() watch fires, without waiting for getDataSetWatch() to complete. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
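The two changes proposed in HBASE-5618 can be illustrated with a minimal sketch. This is not the attached patch: the class HeartbeatSketch, its Task type, and the methods onNodeDataChanged(), onResubmitBadVersion() and shouldResubmit() are hypothetical names chosen for illustration, and time is passed in explicitly so the behaviour is easy to follow. The sketch assumes that a BadVersion failure means the task node was modified after the manager last read it, so it is treated like a heartbeat.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical, simplified stand-in for the manager's per-task bookkeeping.
// None of these names are taken from the actual HBase patch.
class HeartbeatSketch {
    static final class Task {
        volatile long lastUpdate; // time (ms) the worker was last heard from
        Task(long now) { this.lastUpdate = now; }
    }

    private final ConcurrentMap<String, Task> tasks = new ConcurrentHashMap<>();
    private final long timeoutMs;

    HeartbeatSketch(long timeoutMs) { this.timeoutMs = timeoutMs; }

    void addTask(String path, long now) { tasks.put(path, new Task(now)); }

    // Change (2): record the heartbeat as soon as the data-changed watch
    // fires, before any follow-up read of the node completes.
    void onNodeDataChanged(String path, long now) { touch(path, now); }

    // Change (1): a resubmit that fails with BadVersion is taken as evidence
    // that the node changed recently, so record it as a heartbeat as well.
    void onResubmitBadVersion(String path, long now) { touch(path, now); }

    // A task is a resubmission candidate only if it has been silent for
    // longer than the timeout.
    boolean shouldResubmit(String path, long now) {
        Task t = tasks.get(path);
        return t != null && now - t.lastUpdate > timeoutMs;
    }

    private void touch(String path, long now) {
        Task t = tasks.get(path);
        if (t != null) {
            t.lastUpdate = now;
        }
    }
}
```

With both hooks in place, a task whose worker is still heartbeating stops looking stale to the timeout check, which is what would avoid the repeated resubmission attempts described above.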
[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits
[ https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5618: --- Attachment: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch re-attaching the same patch. I had cancelled it by mistake. SplitLogManager - prevent unnecessary attempts to resubmits --- Key: HBASE-5618 URL: https://issues.apache.org/jira/browse/HBASE-5618 Project: HBase Issue Type: Improvement Components: wal, zookeeper Reporter: Prakash Khemani Assignee: Prakash Khemani Attachments: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch
[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits
[ https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5618: --- Attachment: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch SplitLogManager - prevent unnecessary attempts to resubmits --- Key: HBASE-5618 URL: https://issues.apache.org/jira/browse/HBASE-5618 Project: HBase Issue Type: Improvement Components: wal, zookeeper Reporter: Prakash Khemani Assignee: Prakash Khemani Attachments: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch
[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits
[ https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5618: --- Attachment: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch SplitLogManager - prevent unnecessary attempts to resubmits --- Key: HBASE-5618 URL: https://issues.apache.org/jira/browse/HBASE-5618 Project: HBase Issue Type: Improvement Components: wal, zookeeper Reporter: Prakash Khemani Assignee: Prakash Khemani Attachments: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch
[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits
[ https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5618: --- Attachment: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch patch without directory prefixes SplitLogManager - prevent unnecessary attempts to resubmits --- Key: HBASE-5618 URL: https://issues.apache.org/jira/browse/HBASE-5618 Project: HBase Issue Type: Improvement Components: wal, zookeeper Reporter: Prakash Khemani Assignee: Prakash Khemani Attachments: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch
[jira] [Updated] (HBASE-5606) SplitLogManger async delete node hangs log splitting when ZK connection is lost
[ https://issues.apache.org/jira/browse/HBASE-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5606: --- Attachment: 0001-HBASE-5606-SplitLogManger-async-delete-node-hangs-lo.patch Do not do any error processing if the getDataSetWatch() call from SplitLogManager timeoutMonitor fails SplitLogManger async delete node hangs log splitting when ZK connection is lost Key: HBASE-5606 URL: https://issues.apache.org/jira/browse/HBASE-5606 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0 Reporter: Gopinathan A Priority: Critical Fix For: 0.92.2 Attachments: 0001-HBASE-5606-SplitLogManger-async-delete-node-hangs-lo.patch, 5606.txt
1. One RS died; the ServerShutdownHandler detected it and started distributed log splitting.
2. All tasks failed because the ZK connection was lost, so all the tasks were deleted asynchronously.
3. The ServerShutdownHandler retried the log splitting.
4. The asynchronous deletion from step 2 finally happened, against the new task.
5. This left the SplitLogManager in a hanging state.
As a result, the .META. region was not assigned for a long time.
{noformat}
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(55413,79):2012-03-14 19:28:47,932 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(89303,79):2012-03-14 19:34:32,387 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}
{noformat}
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(80417,99):2012-03-14 19:34:31,196 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(89456,99):2012-03-14 19:34:32,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}
[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits
[ https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5618: --- Attachment: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch update heartbeat time as soon as possible and as often as one can. SplitLogManager - prevent unnecessary attempts to resubmits --- Key: HBASE-5618 URL: https://issues.apache.org/jira/browse/HBASE-5618 Project: HBase Issue Type: Improvement Components: wal, zookeeper Reporter: Prakash Khemani Attachments: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch
[jira] [Updated] (HBASE-5519) Incorrect warning in splitlogmanager
[ https://issues.apache.org/jira/browse/HBASE-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5519: --- Attachment: 0001-HBASE-5519-Incorrect-warning-in-splitlogmanager.patch Replace a log.warn() with a comment. Incorrect warning in splitlogmanager Key: HBASE-5519 URL: https://issues.apache.org/jira/browse/HBASE-5519 Project: HBase Issue Type: Improvement Reporter: Prakash Khemani Attachments: 0001-HBASE-5519-Incorrect-warning-in-splitlogmanager.patch Because of recently added behavior, where the SplitLogManager timeout thread gets data from the ZK node just to check that the node is still there, we might have multiple watches firing without the task znode expiring. Remove the poor warning message. (Internally, there was an assert that failed in Mikhail's tests.)
[jira] [Updated] (HBASE-5287) [89-fb] hbck can go into an infinite loop
[ https://issues.apache.org/jira/browse/HBASE-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5287: --- Summary: [89-fb] hbck can go into an infinite loop (was: hbck can go into an infinite loop) Will close this issue. Mikhail will be uploading the patch separately. [89-fb] hbck can go into an infinite loop - Key: HBASE-5287 URL: https://issues.apache.org/jira/browse/HBASE-5287 Project: HBase Issue Type: Bug Reporter: Prakash Khemani HBaseFsckRepair.prompt() should check for a -1 return value from System.in.read(). Only affects the 0.89 release.
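As a rough illustration of the check that HBASE-5287 asks for, the sketch below reads a yes/no answer and stops when the stream reports end of input. It is not the code of HBaseFsckRepair.prompt(): the class PromptSketch and the method promptYesNo() are hypothetical, and treating end of input as "no" is an assumption made here for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not the real HBaseFsckRepair.prompt().
class PromptSketch {
    // Reads bytes until it sees 'y' or 'n' (either case).
    // Returns false when the stream ends, instead of looping forever.
    static boolean promptYesNo(InputStream in) {
        try {
            while (true) {
                int c = in.read();
                if (c == -1) {
                    // End of stream: read() keeps returning -1 from now on,
                    // so a loop that ignores it would never terminate.
                    return false;
                }
                if (c == 'y' || c == 'Y') {
                    return true;
                }
                if (c == 'n' || c == 'N') {
                    return false;
                }
                // Any other byte (for example a newline) is skipped.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In an interactive tool the call would look like PromptSketch.promptYesNo(System.in). When standard input is closed or redirected from an empty source, the -1 branch is the one that ends the loop.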
[jira] [Updated] (HBASE-5287) hbck can go into an infinite loop
[ https://issues.apache.org/jira/browse/HBASE-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5287: --- Summary: hbck can go into an infinite loop (was: fsync can go into an infinite loop) hbck can go into an infinite loop - Key: HBASE-5287 URL: https://issues.apache.org/jira/browse/HBASE-5287 Project: HBase Issue Type: Bug Reporter: Prakash Khemani
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5081: --- Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch a test added on top of Ted's last change. Distributed log splitting deleteNode races against splitLog retry -- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Prakash Khemani Fix For: 0.92.0 Attachments: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 5081-deleteNode-with-while-loop.txt, distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt
Recently, during 0.92 RC testing, we found that distributed log splitting hangs forever. Please see the attached screenshot. I looked into it, and here is what I think happened:
1. One RS died; the ServerShutdownHandler detected it and started distributed log splitting.
2. All three tasks failed, so the three tasks were deleted asynchronously.
3. The ServerShutdownHandler retried the log splitting.
4. During the retry, it created these three tasks again and put them in a hashmap (tasks).
5. The asynchronous deletion from step 2 finally happened for one task; in the callback, it removed one task from the hashmap.
6. The ZooKeeper watcher for one of the newly submitted tasks found that the task was unassigned and not in the hashmap, so it created a new orphan task.
7. All three tasks failed, but the task created in step 6 was an orphan, so the batch.err counter was one short. Log splitting therefore hangs, waiting for a last task to finish that never will.
So I think the problem is step 2. The fix is to make the deletion synchronous instead of asynchronous, so that the retry has a clean start. Async deleteNode interferes with the split log retry. In an extreme situation, if the async deleteNode doesn't happen soon enough, a node created during the retry could be deleted. deleteNode should be sync.
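The race described in steps 2 to 5 can be reproduced in miniature. The sketch below is only a deterministic model, not HBase or ZooKeeper code: DeleteRaceSketch, its tasks map, and the methods submit(), deleteAsync(), deleteSync() and runCallbacks() are hypothetical, and the "asynchronous" callback is simply queued and run later on the same thread so that the ordering is easy to see.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical single-threaded model of the task map and delete callbacks.
class DeleteRaceSketch {
    // Task path -> attempt number that created the entry.
    final Map<String, Integer> tasks = new HashMap<>();
    // Delete callbacks that have been requested but have not run yet.
    final Queue<Runnable> pendingCallbacks = new ArrayDeque<>();

    void submit(String path, int attempt) {
        tasks.put(path, attempt);
    }

    // Async style: the map entry is removed later, when the callback runs.
    void deleteAsync(String path) {
        pendingCallbacks.add(() -> tasks.remove(path));
    }

    // Sync style: the removal is complete before the caller continues.
    void deleteSync(String path) {
        tasks.remove(path);
    }

    // Runs every queued callback, standing in for the late async completion.
    void runCallbacks() {
        Runnable r;
        while ((r = pendingCallbacks.poll()) != null) {
            r.run();
        }
    }
}
```

If a task is submitted, deleted with deleteAsync(), and resubmitted under the same path before runCallbacks() is called, the late callback removes the entry that belongs to the retry. With deleteSync() the old entry is gone before the retry starts, so the retry's entry survives, which is the clean start the description argues for.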
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5081: --- Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch implement feedback Distributed log splitting deleteNode races against splitLog retry -- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Prakash Khemani Fix For: 0.92.0 Attachments: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5081: --- Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch set the interrupt flag on InterruptedException. Remove OrphanLogException handling for distributed log splitting. Distributed log splitting deleteNode races against splitLog retry -- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Prakash Khemani Fix For: 0.92.0 Attachments: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5081: --- Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch Distributed log splitting deleteNode races against splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Prakash Khemani Fix For: 0.92.0 Attachments: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5081: --- Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch I rebased and it rebased cleanly for me. Anyway, uploading the format-patch output again to see if it applies. Distributed log splitting deleteNode races against splitLog retry -- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Prakash Khemani Fix For: 0.92.0 Attachments: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt
[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5081: --- Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch Fixed an (unrelated) test failure in TestSplitLogManager.testRescanCleanup(). Distributed log splitting deleteNode races against splitLog retry -- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Prakash Khemani Fix For: 0.92.0 Attachments: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt
Recently, during 0.92 RC testing, we found that distributed log splitting hangs forever. Please see the attached screenshot. I looked into it, and here is what I think happened:
1. One region server died; the ServerShutdownHandler detected it and started distributed log splitting.
2. All three tasks failed, so the three tasks were deleted, asynchronously.
3. The ServerShutdownHandler retried the log splitting.
4. During the retry, it created these three tasks again and put them in a hashmap (tasks).
5. The asynchronous deletion from step 2 finally happened for one task; in the callback, it removed that task from the hashmap.
6. The ZooKeeper watcher for one of the newly submitted tasks found that the task was unassigned and not in the hashmap, so it created a new orphan task.
7. All three tasks failed, but the task created in step 6 was an orphan, so the batch.err counter was one short. Log splitting therefore hangs, waiting for the last task to finish, which is never going to happen.
So I think the problem is step 2. The fix is to make deletion synchronous instead of asynchronous, so that the retry has a clean start. An asynchronous deleteNode interferes with the split log retry. In an extreme situation, if the asynchronous deleteNode doesn't happen soon enough, a node created during the retry could be deleted. deleteNode should be synchronous.
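The race in steps 2 to 5 can be reproduced in a toy model. The sketch below is not the SplitLogManager code: the class `DeleteRaceSketch`, its methods, and the task names are all hypothetical, and the ZooKeeper callback is modelled as a queued `Runnable` that runs late. In the report only one late callback fired; here all three do, which makes the effect easier to see.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy model of the deleteNode/retry race; every name here is hypothetical. */
class DeleteRaceSketch {
    // Stand-in for the manager's "tasks" hashmap: task path -> attempt number.
    final Map<String, Integer> tasks = new HashMap<>();
    // Stand-in for delete callbacks that were requested but have not run yet.
    final Queue<Runnable> pendingCallbacks = new ArrayDeque<>();

    void submit(List<String> paths, int attempt) {
        for (String p : paths) {
            tasks.put(p, attempt);
        }
    }

    // Asynchronous flavour: the map entry is removed only when the callback runs.
    void deleteAsync(String path) {
        pendingCallbacks.add(() -> tasks.remove(path));
    }

    // Synchronous flavour: the entry is gone before the caller continues.
    void deleteSync(String path) {
        tasks.remove(path);
    }

    void runPendingCallbacks() {
        Runnable r;
        while ((r = pendingCallbacks.poll()) != null) {
            r.run();
        }
    }

    /** Returns how many retry tasks are still tracked once all callbacks have run. */
    static int trackedAfterRetry(boolean syncDelete) {
        DeleteRaceSketch m = new DeleteRaceSketch();
        List<String> logs = List.of("log-1", "log-2", "log-3");
        m.submit(logs, 1);                 // first attempt
        for (String p : logs) {            // all tasks failed, so delete them
            if (syncDelete) {
                m.deleteSync(p);
            } else {
                m.deleteAsync(p);
            }
        }
        m.submit(logs, 2);                 // the retry recreates the same tasks
        m.runPendingCallbacks();           // late callbacks fire only now
        return m.tasks.size();
    }

    public static void main(String[] args) {
        System.out.println("async delete, tracked tasks: " + trackedAfterRetry(false));
        System.out.println("sync delete, tracked tasks: " + trackedAfterRetry(true));
    }
}
```

With the asynchronous flavour the late callbacks remove entries that belong to the retry, which is the condition under which a watcher would later treat a task as an orphan. With the synchronous flavour the retry starts from an empty map and keeps all of its entries.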
[jira] [Updated] (HBASE-5029) TestDistributedLogSplitting fails on occasion
[ https://issues.apache.org/jira/browse/HBASE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-5029: --- Attachment: 0001-HBASE-5029-jira-TestDistributedLogSplitting-fails-on.patch TestDistributedLogSplitting fails on occasion - Key: HBASE-5029 URL: https://issues.apache.org/jira/browse/HBASE-5029 Project: HBase Issue Type: Bug Reporter: stack Assignee: Prakash Khemani Attachments: 0001-HBASE-5029-jira-TestDistributedLogSplitting-fails-on.patch, HBASE-5029.D891.1.patch, HBASE-5029.D891.2.patch This is how it usually fails: https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/lastCompletedBuild/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testWorkerAbort/ Assigning mighty Prakash since he offered to take a looksee.
[jira] [Updated] (HBASE-4721) Retain Delete Markers after Major Compaction
[ https://issues.apache.org/jira/browse/HBASE-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-4721: --- Summary: Retain Delete Markers after Major Compaction (was: Configurable TTL for Delete Markers) I think all we need to do is retain delete markers even after a major compaction. Retaining delete markers beyond the TTL of regular KVs doesn't make sense to me. Say the regular KV TTL is 1 day. If we keep a delete marker alive for 2 days, then the only KVs it can affect are ones that are older than 2 days and have already expired. If we just retain delete markers after every major compaction, then the processing becomes exactly the same as in a minor compaction. Hopefully, this will reduce some code complexity. Retain Delete Markers after Major Compaction Key: HBASE-4721 URL: https://issues.apache.org/jira/browse/HBASE-4721 Project: HBase Issue Type: New Feature Reporter: Prakash Khemani Assignee: Prakash Khemani There is a need to provide long TTLs for delete markers. This is useful when replicating HBase logs from one cluster to another. The receiving cluster shouldn't compact away the delete markers because the affected key-values might still be on the way.
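The TTL argument in the comment above can be checked with a few lines of arithmetic. This is a minimal sketch under simplifying assumptions: a delete marker is taken to mask every KV whose timestamp is not newer than its own, and a KV is taken to expire once its age exceeds the TTL. The class and method names are hypothetical and do not correspond to HBase code.

```java
/** Sketch of the "marker older than the KV TTL is useless" argument; names are hypothetical. */
class DeleteMarkerTtlSketch {
    static final long DAY = 24L * 60 * 60 * 1000;

    /** Assumption: a KV is expired once its age exceeds the TTL. */
    static boolean isExpired(long kvTimestamp, long now, long ttlMillis) {
        return now - kvTimestamp > ttlMillis;
    }

    /** True if the marker masks the KV and the KV has not expired on its own. */
    static boolean hidesLiveKv(long markerTimestamp, long kvTimestamp, long now, long ttlMillis) {
        boolean masked = kvTimestamp <= markerTimestamp; // assumption: marker covers older-or-equal KVs
        return masked && !isExpired(kvTimestamp, now, ttlMillis);
    }

    /**
     * The newest KV a marker can mask carries the marker's own timestamp,
     * so the marker matters only while that KV would still be live.
     */
    static boolean markerStillMatters(long markerTimestamp, long now, long ttlMillis) {
        return hidesLiveKv(markerTimestamp, markerTimestamp, now, ttlMillis);
    }

    public static void main(String[] args) {
        long now = 10 * DAY;
        System.out.println("2-day-old marker, 1-day TTL: " + markerStillMatters(now - 2 * DAY, now, DAY));
        System.out.println("12-hour-old marker, 1-day TTL: " + markerStillMatters(now - DAY / 2, now, DAY));
    }
}
```

Under these assumptions a marker that has outlived the KV TTL can only mask KVs that have already expired, which is the case the comment describes with a 1-day TTL and a 2-day-old marker.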
[jira] [Updated] (HBASE-4696) HRegionThriftServer' might have to indefinitely do redirtects
[ https://issues.apache.org/jira/browse/HBASE-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-4696: --- Environment: HRegionThriftServer.getRowWithColumnsTs() redirects the request to the correct region server if it has landed on the wrong region-server. With this approach the smart-client will never get a NotServingRegionException and it will never be able to invalidate its cache. It will indefinitely send the request to the wrong region server and the wrong region server will always be redirecting it. Either redirects should be turned off and the client should react to NotServingRegionExceptions. Or somehow a flag should be set in the response telling the client to refresh its cache. Summary: HRegionThriftServer' might have to indefinitely do redirtects (was: HRegionThriftServer) HRegionThriftServer' might have to indefinitely do redirtects - Key: HBASE-4696 URL: https://issues.apache.org/jira/browse/HBASE-4696 Project: HBase Issue Type: Bug Environment: HRegionThriftServer.getRowWithColumnsTs() redirects the request to the correct region server if it has landed on the wrong region-server. With this approach the smart-client will never get a NotServingRegionException and it will never be able to invalidate its cache. It will indefinitely send the request to the wrong region server and the wrong region server will always be redirecting it. Either redirects should be turned off and the client should react to NotServingRegionExceptions. Or somehow a flag should be set in the response telling the client to refresh its cache. Reporter: Prakash Khemani
[jira] [Updated] (HBASE-4696) HRegionThriftServer' might have to indefinitely do redirtects
[ https://issues.apache.org/jira/browse/HBASE-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-4696: --- Description: HRegionThriftServer.getRowWithColumnsTs() redirects the request to the correct region server if it has landed on the wrong region-server. With this approach the smart-client will never get a NotServingRegionException and it will never be able to invalidate its cache. It will indefinitely send the request to the wrong region server and the wrong region server will always be redirecting it. Either redirects should be turned off and the client should react to NotServingRegionExceptions. Or somehow a flag should be set in the response telling the client to refresh its cache. Environment: (was: HRegionThriftServer.getRowWithColumnsTs() redirects the request to the correct region server if it has landed on the wrong region-server. With this approach the smart-client will never get a NotServingRegionException and it will never be able to invalidate its cache. It will indefinitely send the request to the wrong region server and the wrong region server will always be redirecting it. Either redirects should be turned off and the client should react to NotServingRegionExceptions. Or somehow a flag should be set in the response telling the client to refresh its cache.) HRegionThriftServer' might have to indefinitely do redirtects - Key: HBASE-4696 URL: https://issues.apache.org/jira/browse/HBASE-4696 Project: HBase Issue Type: Bug Reporter: Prakash Khemani HRegionThriftServer.getRowWithColumnsTs() redirects the request to the correct region server if it has landed on the wrong region-server. With this approach the smart-client will never get a NotServingRegionException and it will never be able to invalidate its cache. It will indefinitely send the request to the wrong region server and the wrong region server will always be redirecting it. 
Either redirects should be turned off and the client should react to NotServingRegionExceptions. Or somehow a flag should be set in the response telling the client to refresh its cache.
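The difference between the two behaviours described above can be sketched with a small simulation. Everything in it is hypothetical: `StaleCacheSketch`, the nested `NotServingRegion` exception (a stand-in for NotServingRegionException), and the server names are illustrative only, and none of it is the Thrift server or client code.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a caching client talking to redirecting servers; all names are hypothetical. */
class StaleCacheSketch {
    /** Stand-in for HBase's NotServingRegionException. */
    static class NotServingRegion extends Exception {}

    final Map<String, String> authoritative = new HashMap<>(); // row -> server that really hosts it
    final Map<String, String> clientCache = new HashMap<>();   // the client's possibly stale view
    int redirects = 0;

    // Server side: either forward the request silently or tell the client it is wrong.
    String serve(String server, String row, boolean redirect) throws NotServingRegion {
        String owner = authoritative.get(row);
        if (owner.equals(server)) {
            return "value-from-" + owner;
        }
        if (!redirect) {
            throw new NotServingRegion();
        }
        redirects++; // extra hop; the client never learns its cache is stale
        return "value-from-" + owner;
    }

    // Client side: refresh the cached location only when the server complains.
    String get(String row, boolean redirect) {
        String server = clientCache.computeIfAbsent(row, authoritative::get);
        try {
            return serve(server, row, redirect);
        } catch (NotServingRegion e) {
            clientCache.put(row, authoritative.get(row)); // invalidate and look up again
            try {
                return serve(clientCache.get(row), row, redirect);
            } catch (NotServingRegion again) {
                throw new IllegalStateException(again);
            }
        }
    }

    /** Counts server-side redirects for a number of requests issued after a region move. */
    static int redirectsAfterMove(boolean redirect, int requests) {
        StaleCacheSketch s = new StaleCacheSketch();
        s.authoritative.put("row1", "rs-A");
        s.get("row1", redirect);              // the client caches rs-A
        s.authoritative.put("row1", "rs-B");  // the region moves
        for (int i = 0; i < requests; i++) {
            s.get("row1", redirect);
        }
        return s.redirects;
    }

    public static void main(String[] args) {
        System.out.println("redirects with silent redirect: " + redirectsAfterMove(true, 5));
        System.out.println("redirects with exception: " + redirectsAfterMove(false, 5));
    }
}
```

In this model the silent redirect costs one extra hop on every request after the move, while the exception path pays once and then goes straight to the new server. A response flag asking the client to refresh its cache would behave like the exception path without failing the request.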
[jira] [Updated] (HBASE-4687) regionserver may miss zk-heartbeats to master when replaying edits at region open
[ https://issues.apache.org/jira/browse/HBASE-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani updated HBASE-4687: --- Attachment: 0001-HBASE-4687-regionserver-may-miss-zk-heartbeats-to-ma.patch patch attached regionserver may miss zk-heartbeats to master when replaying edits at region open - Key: HBASE-4687 URL: https://issues.apache.org/jira/browse/HBASE-4687 Project: HBase Issue Type: Bug Reporter: Prakash Khemani Assignee: Prakash Khemani Attachments: 0001-HBASE-4687-regionserver-may-miss-zk-heartbeats-to-ma.patch replayRecoveredEdits() should do another reporter.progress() before returning.
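The shape of the proposed change can be sketched as follows. This is not the HBase method: `ReplayProgressSketch`, its `Reporter` interface, and the `interval` parameter are assumptions made for illustration. The only point taken from the issue is the final `progress()` call before returning.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a replay loop that reports progress once more before returning; names are hypothetical. */
class ReplayProgressSketch {
    /** Stand-in for the progress callback handed to the replay method. */
    interface Reporter {
        void progress();
    }

    /** Applies the edits, reporting every `interval` edits and once more at the end. */
    static int replay(List<String> edits, int interval, Reporter reporter) {
        int applied = 0;
        for (String edit : edits) {
            applied++;                      // a real implementation would apply the edit here
            if (applied % interval == 0) {
                reporter.progress();        // periodic report (assumed behaviour)
            }
        }
        reporter.progress();                // the extra report suggested in the issue
        return applied;
    }

    /** Helper for demonstration: counts how often progress() is invoked. */
    static int countProgressCalls(int editCount, int interval) {
        List<String> edits = new ArrayList<>();
        for (int i = 0; i < editCount; i++) {
            edits.add("edit-" + i);
        }
        int[] calls = {0};
        replay(edits, interval, () -> calls[0]++);
        return calls[0];
    }

    public static void main(String[] args) {
        System.out.println("progress calls for 5 edits, interval 2: " + countProgressCalls(5, 2));
    }
}
```

Without the final call, the time spent on the trailing edits after the last periodic report would go unreported, which is presumably how the heartbeat could be missed.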